CN108229352B

CN108229352B - Standing detection method based on deep learning

Info

Publication number: CN108229352B
Application number: CN201711397963.XA
Authority: CN
Inventors: 邵奔驰; 姜飞; 申瑞民
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2021-09-07
Anticipated expiration: 2037-12-21
Also published as: CN108229352A

Abstract

The invention relates to a standing detection method based on deep learning, which comprises the following steps: 1) collecting samples, wherein each sample comprises a sample picture and a corresponding annotation file; 2) establishing a standing detection model, wherein the standing detection model is based on a convolutional neural network structure and is trained by an R-FCN target detection algorithm based on the sample, and the standing detection model comprises a high-grade standing detection model and a low-grade standing detection model; 3) and performing standing detection on the video to be detected by using the trained standing detection model. Compared with the prior art, the invention has the advantages of high detection total rate and accuracy, suitability for complex classroom environment and the like.

Description

Standing detection method based on deep learning

Technical Field

The invention relates to an image processing technology, in particular to a standing detection method based on deep learning.

Background

In recording and broadcasting classrooms of traditional teaching and similar indoor monitoring, a system is needed that can automatically detect standing behavior in order to be able to assess the overall atmosphere and the participation level of the participants. However, the characteristics of the standing behavior are closely related to the height of the individual, the height of the individual in the same scene is different, and the height distribution in different scenes is also different, so that the detection of the standing behavior of the person in the traditional classroom environment is still a difficult task.

A student tracking and positioning method based on master-slave cameras uses two slave cameras and a master camera with a cloud platform device. The secondary camera can automatically or manually generate an interested area, detect whether students enter or leave the interested area by using a background difference method, and send a detection result to the primary camera; the main camera judges the number of standing students according to the information transmitted from the secondary camera, and selects a panoramic recording mode or a positioning recording mode according to the number of standing students, wherein the positioning recording mode detects the outlines of all moving objects by using an interframe difference method, and judges that the object with the highest outline center point is the standing student, and the flow chart is shown in fig. 1. Although the method has certain positioning accuracy, the following defects exist:

1. according to the method, the cameras are arranged on two sides of the blackboard, and the installation height is equal to the leveling position of the top of the head of the student after the student sits down, so that the cameras are basically positioned in front of the sight of the student, and psychological pressure is easily generated on the student. Also at this height, the student may touch the camera, either intentionally or unintentionally, causing a deviation in the results.

2. One master and one slave are required to complete the standing detection.

3. This method is less effective on pupils of lower grades because the pupils of lower grades do not differ much in standing height and sitting height.

Disclosure of Invention

The invention provides a standing detection method based on deep learning to overcome the defects in the prior art.

One of the purposes of the invention is to realize standing detection by only one camera.

The second purpose of the invention is to improve the detection effect on the students in the lower grades.

The invention also aims to serially connect the same standing behavior in different frames to avoid repeated counting of the same standing behavior in different frames.

The purpose of the invention can be realized by the following technical scheme:

a method for standing detection based on deep learning, the method comprising:

1) collecting samples, wherein each sample comprises a sample picture and a corresponding annotation file;

2) establishing a standing detection model, wherein the standing detection model is based on a convolutional neural network structure and is trained by an R-FCN target detection algorithm based on the sample, and the standing detection model comprises a high-grade standing detection model and a low-grade standing detection model;

3) and performing standing detection on the video to be detected by using the trained standing detection model.

The information of the markup document comprises the type of the standing person.

The standing person types include senior students, junior students, and teachers.

The establishment of the standing detection model specifically comprises the following steps:

201) training a basic standing model by using all samples;

202) and respectively carrying out further training on the basis of the basic standing model by using samples with the marks of the senior students and samples with the marks of the junior students to obtain a senior standing detection model and a junior standing detection model.

The method further comprises the steps of:

4) and tracking the standing according to the standing detection result of the previous frame and the current standing detection result.

The tracking specifically comprises:

401) acquiring a first image frame and detected coordinates of standing frames, wherein a tracklet array is correspondingly established for each standing frame, and the state is initialized to ALIVE;

402) acquiring a next image frame, judging whether the lens is changed or not, if so, changing the states of all the tracklet arrays into DEAD, reestablishing a new tracklet array, returning to the step 402), and if not, executing the step 403);

403) traversing all standing frames detected by the current image frame, and selecting a tracklet array which is best matched for each standing frame by utilizing a tracking algorithm;

404) and judging whether the state of the tracklet array which is not matched under the current image frame is ALIVE or not, if so, modifying the state to WAIT, if not, modifying the state to DEAD, and returning to the step 402) until all the image frames are processed.

The specific steps for judging whether the lens is changed are as follows:

acquiring two adjacent image frames, judging whether the difference value of the gray values between the two image frames is greater than a first threshold value, and if so, judging that the pixel points of the two image frames are changed;

and judging whether the proportion of the number of the changed pixel points between the two image frames to the total pixel points is larger than a second threshold value, if so, judging that the lens visual angle conversion occurs, and if not, judging that the lens visual angle conversion does not occur.

The selecting of the tracklet array with the best match is specifically as follows:

and selecting a tracklet closest to the standing frame, calculating the sum of the width difference and the length difference of the border of the tracklet and the standing frame, and judging that the standing frame is best matched with the tracklet when the sum is less than one third of the width of the standing frame and the coincidence ratio of the border of the tracklet and the standing frame is more than 0.3.

The method further comprises the steps of:

5) the number of standings obtained by tracking is counted.

Compared with the prior art, the invention has the following beneficial effects:

1) the invention marks the collected samples, subdivides the standing behavior into teachers and students, and the low-grade and high-grade, and effectively improves the reliability of the model.

2) The invention adopts a target detection algorithm based on deep learning to extract a large amount of data (such as twenty thousand samples) from real videos in a classroom, and can obtain a better detection result in a complex classroom environment.

3) The invention trains the high-grade and low-grade samples respectively to obtain a high-grade standing detection model and a low-grade standing detection model, thereby effectively solving the problem of large difference between high-grade standing and low-grade standing. Tests show that the full detection rate can reach 90% in the senior level and the subordinate level, and the accuracy rate is 80%.

4) The invention adopts the tracking algorithm to track the standing action and can serially connect the same standing action of different frames, thereby obtaining the data of the real standing times and providing a basis for further analysis and evaluation.

Drawings

Fig. 1 is a schematic flowchart of a conventional student tracking and positioning method;

FIG. 2 is a schematic flow chart of the present invention;

FIG. 3 is a schematic diagram of low grade stance behavior;

FIG. 4 is a schematic diagram of high grade stance behavior;

fig. 5 is a schematic diagram illustrating a tracking process of standing movement according to the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

As shown in fig. 2, the present invention provides a standing detection method based on deep learning, which includes the following steps:

1) collecting a sample

The samples were prepared in the format of a PASCAL VOC dataset, which provides a whole set of standardized excellent datasets for image identification and classification, using a label tool, LabelImg. The method mainly focuses on the standing of students, and the teacher also judges that the students stand when walking in a classroom, so that the standing behavior in the classroom is subdivided into the standing of the students and the standing of the teacher, and the standing of the students and the standing of the teacher are respectively marked when the samples are made.

In this embodiment, the number of samples used is 2 ten thousand. Each sample comprises a sample picture and a corresponding annotation file, the information of the annotation file comprises the type of a standing person, frames which are intercepted from a video and contain standing are stored in a JPEGImaps folder, and the corresponding annotation file is stored in exceptions. The standing person types include senior students, junior students, and teachers. The grade division of high and low grade is based on 1-3 grade as low grade, and grade division of four grade and above is high grade.

2) Establishing standing detection model

As shown in fig. 3 and 4, which are schematic diagrams of the standing behaviors of the senior students and the senior students, respectively, it can be seen that the standing and sitting behaviors of the senior students are not much different, and the standing and sitting behaviors of the senior students are much different. Therefore, if two situations are combined in one model, if the standing is detected basically in the lower grade, a lot of false detection is added to the video in the higher grade. The present invention divides and handles these two cases separately.

201) training a base standing model by using an R-FCN model based on ResNet-101 and utilizing all samples of high and low grades together;

202) and respectively carrying out further training on the basis of the basic standing model by using samples with the marks of the senior students and samples with the marks of the junior students to obtain a senior standing detection model and a junior standing detection model. In the present embodiment, the fine-tuning is used to obtain a model for senior and a model for junior.

The frame used for training is cafe, the iteration number during training is 30000 times, and the training parameters are as follows:

base_lr:0.001

lr_policy:"step"

gamma:0.1

stepsize:10000

display:20

momentum:0.9

weight_decay:0.0005

and performing standing detection on the video to be detected by using the trained standing detection model.

In certain embodiments, the method further comprises the steps of:

And the judgment of lens change is needed when the standing tracking is carried out. This is because in the surveillance video of the whole classroom, the lens will rotate, zoom in, zoom out, and the like, and the same standing position in different frames will change greatly, so that the matching cannot be performed effectively.

The algorithm for judging whether the change of the lens occurs is concretely as follows: converting the current frame and the previous frame into a gray-scale image, if the difference value of the gray-scale values between the two frames is greater than a first threshold thres0, judging that the pixel point changes, and when the proportion of the changed pixel point to the total pixel point exceeds a second threshold thres1, judging that the lens changes. Otherwise, judging that the shot is not changed. The number of thres0 used in this example is 20, and the number of thres1 is 0.2 (20%).

As shown in fig. 5, the specific process of standing tracking is as follows:

401) and acquiring a first image frame and detected coordinates of the standing frames, wherein a tracklet array is correspondingly established for each standing frame, the state is initialized to ALIVE, and the tracklet is used for recording tracking information.

402) And acquiring the next image frame, judging whether the lens is transformed, if so, changing the states of all the tracklet arrays into DEAD, reestablishing a new tracklet array, returning to the step 402), and if not, executing the step 403).

403) And traversing all the detected standing frames of the current image frame, and selecting a tracklet array with the best matching for each standing frame by using a tracking algorithm.

The specific process of selecting a tracklet array with the best match is as follows:

firstly, finding the nearest tracklet to each standing frame, calculating the sum of the width difference and the length difference between each standing frame and the nearest tracklet frame, and judging that the standing frame is matched with the current tracklet when the sum is less than one third of the width of the standing frame and the coincidence ratio of the tracklet frame to the standing frame is more than 0.3, otherwise, the standing frame has no matched tracklet. If the tracklet has been matched, the most recent tracklet of the stand frame is updated, and the stand frame is marked to determine whether the tracklet matches the most recent tracklet.

404) And judging whether the state of the tracklet array which is not matched under the current image frame is ALIVE or not, if so, modifying the state to WAIT, because one frame of tracking information disappears, the standing behavior of the frame is possibly not detected, otherwise, modifying the state to DEAD, marking the end of the tracking of the standing behavior, and returning to the step 402) until all the image frames are processed.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A standing detection method based on deep learning is characterized by comprising the following steps:

1) collecting samples, wherein each sample comprises a sample picture and a corresponding label file, the information of the label file comprises standing personnel types, the standing personnel types comprise senior students, junior students and teachers, the first to third grades are junior grades, and the grades of four grades and above are divided into senior grades;

3) performing standing detection on the video to be detected by using the trained standing detection model;

201) training a basic standing model by using all samples;

2. The deep learning based standing detection method according to claim 1, further comprising the steps of:

3. The method for detecting standing based on deep learning of claim 2, wherein the tracking is specifically:

4. The method for detecting standing according to claim 3, wherein the determining whether the shot is transformed specifically comprises:

5. The deep learning based standing detection method according to claim 3, wherein the selecting the best matching one tracklet array is specifically:

and selecting a tracklet closest to the standing frame, calculating the sum of the width difference and the length difference between the border of the tracklet and the standing frame, and judging that the standing frame is best matched with the tracklet when the sum is less than one third of the width of the standing frame and the coincidence ratio of the border of the tracklet to the standing frame is more than 0.3.

6. The deep learning based standing detection method according to claim 2, further comprising the steps of:

5) the number of standings obtained by tracking is counted.