CN117315770A - Human behavior recognition method, device and storage medium based on skeleton points - Google Patents
Human behavior recognition method, device and storage medium based on skeleton points
- Publication number
- CN117315770A (Application CN202310978530.2A)
- Authority
- CN
- China
- Prior art keywords
- human body
- network
- behavior recognition
- human
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 63
- 230000006399 behavior Effects 0.000 claims abstract description 86
- 238000001514 detection method Methods 0.000 claims abstract description 64
- 238000012549 training Methods 0.000 claims abstract description 37
- 238000012360 testing method Methods 0.000 claims abstract description 27
- 238000012545 processing Methods 0.000 claims abstract description 21
- 238000012795 verification Methods 0.000 claims description 15
- 238000010586 diagram Methods 0.000 claims description 12
- 230000007246 mechanism Effects 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 9
- 230000004913 activation Effects 0.000 claims description 6
- 210000000988 bone and bone Anatomy 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 6
- 230000006870 function Effects 0.000 claims description 5
- 238000000137 annealing Methods 0.000 claims description 4
- 238000001914 filtration Methods 0.000 claims description 4
- 230000004927 fusion Effects 0.000 claims description 4
- 238000007667 floating Methods 0.000 claims description 3
- 238000012805 post-processing Methods 0.000 claims description 3
- 230000008569 process Effects 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 abstract description 3
- 238000013135 deep learning Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 230000006872 improvement Effects 0.000 description 5
- 238000011176 pooling Methods 0.000 description 4
- 239000013598 vector Substances 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 238000005286 illumination Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 210000003423 ankle Anatomy 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 210000002310 elbow joint Anatomy 0.000 description 2
- 210000003127 knee Anatomy 0.000 description 2
- 210000000707 wrist Anatomy 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000009191 jumping Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000007500 overflow downdraw method Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 238000007670 refining Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Social Psychology (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human behavior recognition method, device and storage medium based on skeleton points, applied to the technical field of behavior recognition. The method comprises the following steps: S1, data acquisition; S2, constructing a human body detection and posture estimation network; S3, training and testing the human body detection and posture estimation network; S4, performing human body detection and posture estimation; S5, human multi-target tracking; S6, repeating steps S4 and S5 until all video frames are processed, finally obtaining a stacked array of human key point coordinates for all video frames; S7, constructing a human behavior recognition network; S8, training and testing the human behavior recognition network; S9, recognizing human behavior. Because recognition is performed on skeleton points, the behavior of a target can be effectively understood from the positional changes of the skeleton points alone, so the accuracy is high; at the same time, the amount of data to be processed is reduced, so the real-time performance is high.
Description
Technical Field
The invention relates to the technical field of behavior recognition, in particular to a human behavior recognition method, device and storage medium based on skeleton points.
Background
With the continuous development of computer technology, many excellent research results have been achieved in the field of computer vision, pushing algorithms such as image processing and target detection to keep advancing toward meeting the demands of real scenes. In recent decades, deep learning technology has made continuous breakthroughs and led a new wave of research in fields such as target detection and behavior recognition, gradually replacing traditional methods and becoming the common processing framework for computer vision tasks.
Behavior recognition is one of the most challenging tasks in computer vision, and is also the basis for subsequent tasks such as intent and trajectory prediction. Behavior recognition is defined as classifying the human actions detected in an input video sequence, so that a computer can read and understand the behavior of a human target, for example judging whether the target is walking, jumping or running, and it has clear application significance and prospects in fields such as automatic driving and intelligent monitoring. At present, behavior recognition tasks are generally handled with either traditional methods or deep learning methods.
Traditional behavior recognition methods rely on a person observing the data and manually designing features that can represent behaviors, obtaining feature vectors, and then feeding those feature vectors into a designed classifier to obtain the behavior classification result. However, traditional methods have slower detection speed and higher complexity; at the same time, because they rely heavily on manual feature extraction, model generalization is limited and they are easily affected by background factors such as illumination, so most traditional methods cannot meet practical requirements.
In recent years, with the development of deep learning technology, behavior recognition algorithms based on deep learning have gradually entered researchers' field of view, thanks to the advantages of being less affected by the background and not requiring manually extracted features. Behavior recognition methods based on deep learning can be broadly divided into skeleton-point-based methods and RGB-video-based methods, according to whether human key points are required as input. Human skeleton points provide a great deal of important information for human action recognition: even without detailed cues in the video such as background, RGB color and human appearance, the behavior of a target can be effectively understood by extracting the positional changes of the skeleton points, and the result is robust to illumination and scene changes. Therefore, compared with RGB-video-based methods, skeleton-point-based behavior recognition methods require less computation, adapt better to different scenes, are more suitable for real-time behavior recognition, and have attracted increasing attention.
However, when behavior recognition methods are applied in real scenes, high requirements are often placed on recognition accuracy and speed, so computing equipment with strong computing capacity and large storage space is required, which brings a heavy cost. Therefore, a lightweight skeleton-point-based human behavior recognition method that maintains recognition accuracy and real-time performance has clear research value.
Therefore, how to provide a skeleton-point-based human behavior recognition method, device and storage medium that overcomes the difficulties in the prior art is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus and storage medium for identifying human behavior based on skeletal points, which are used for solving the technical problems existing in the prior art.
In order to achieve the above object, the present invention provides the following technical solutions:
a human behavior recognition method based on skeleton points comprises the following steps:
s1, data acquisition: shooting video data in a real scene by using a camera;
s2, constructing a human body detection and posture estimation network: based on a lightweight improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head decoupled detection head, wherein the backbone network is ShuffleNetV2_k5;
s3, training and testing a human body detection and posture estimation network: dividing the COCO-Keypoints data set into a training set, a testing set and a verification set, training the human body detection and posture estimation network constructed in S2 by using the training set, verifying by using the verification set, and finally testing the trained and verified human body detection and posture estimation network by using the testing set;
s4, human body detection and posture estimation are carried out: processing the acquired video by using the lightweight improved YOLO-Pose algorithm to obtain human body detection frames, 17 human body key point coordinates and their corresponding confidence coefficients;
s5, human body multi-target tracking: inputting the human body detection frame obtained in the step S4, 17 human body key point coordinates and the corresponding confidence coefficient thereof into a BoT-SORT multi-target tracking algorithm, endowing each detected human body with a unique identity sequence number, matching the same human body targets in the previous frame and the current frame, and sequencing the human body key point coordinates in the current frame according to the identity sequence numbers;
s6, repeating the steps S4 and S5 until the processing of all the video frames is completed, and finally obtaining a human body key point coordinate stacking array of all the video frames;
s7, constructing a human behavior recognition network: based on the PoseC3D algorithm, the original backbone network of the PoseC3D algorithm is replaced by a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, so that a human behavior recognition network is obtained;
s8, training and testing a human behavior recognition network: dividing the NTU-RGB+D60 XSub data set into a training set, a testing set and a verification set, training the human behavior recognition network of S7 by using the training set, verifying by using the verification set, and finally testing the human behavior recognition network after training and verification by using the testing set;
s9, human behavior recognition: processing the human body key point coordinate stacking array obtained in S6 with the lightweight improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence coefficient.
Optionally, S2 is specifically:
s21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules, and the feature maps at four scales output by the backbone network, corresponding to 4×, 8×, 16× and 32× downsampling, then enter the neck network for feature fusion;
s22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, wherein the 1×1 convolutions are replaced by GSConv modules, and the C3 modules are replaced by lightweight C2f modules;
s23, connecting the output characteristic diagram of the neck network with an improved lightweight decoupling head, obtaining a prediction result and performing post-processing.
Optionally, the Focus downsampling module first slices the input into four parts and then splices them together again, converting the high-resolution picture into a low-resolution image, and finally obtains the output through the CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by the downsampling process.
Alternatively, the CBR module consists of a two-dimensional convolution layer, BN layer, and ReLU activation function.
Optionally, when training the human body detection and posture estimation network in S3, a Mosaic and Mixup data enhancement method is adopted, an SGD optimizer is used, an initial learning rate is 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
Optionally, S5 is specifically:
s51, detecting the human body position of the current frame and predicting the position of the human body of the next frame by using a Kalman filtering algorithm;
s52, carrying out data association and matching on the detected target and the predicted result;
s53, updating the successfully matched track;
s54, BoT-SORT performs re-matching for the tracks that failed to match with detection frames; if a track is matched successfully, it is updated, and if a detection frame is not matched to any existing track, a new track is created for it; if a track fails to match, a further round of matching is performed, and if it still fails, the track is deleted directly.
Optionally, S7 specifically is:
s71, constructing a ShuffleNet block module, wherein the ShuffleNet block module is formed by connecting building blocks with stride 1 and stride 2, and the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
s72, a CBAM module is inserted after the ShuffleNet block of each stage, and the fused result is used as the backbone network of PoseC3D; by employing the improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced.
Optionally, S9 is specifically:
s91, processing the human body key point coordinate stacking array obtained in S6, where the number of video frames is F, the maximum identity serial number is N, and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidence coefficients;
s92, obtaining a binary Gaussian distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence coefficient as the maximum value, where the calculation is

$$J_{kij}=c_k\cdot e^{-\frac{(i-x_k)^2+(j-y_k)^2}{2\sigma^2}}$$

where J is the key point heat map, i is the horizontal coordinate variable, x_k is the abscissa of key point k, j is the vertical coordinate variable, y_k is the ordinate of key point k, σ is the variance, c_k is the confidence coefficient of key point k, k is the key point index, and e is the natural constant;
s93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which is taken as the input of subsequent behavior recognition; in order to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments, and then one frame is randomly selected from each segment to complete the key frame extraction;
s94, finally, the result is input into the PoseC3D network for inference, and the names and scores of the top-n behaviors are obtained according to actual requirements, completing the recognition.
A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory, and when the executable code is executed by the processors, the steps of the above skeleton-point-based human behavior recognition method are implemented.
A computer readable storage medium, wherein the readable storage medium stores a program which, when executed by a processor, implements the steps of the above skeleton-point-based human behavior recognition method.
Compared with the prior art, the invention discloses a human behavior recognition method, device and storage medium based on skeleton points, which has the beneficial effects that:
1) The invention combines target detection, posture estimation, multi-target tracking and behavior recognition technologies, and provides a lightweight human behavior recognition method whose recognition precision and speed are competitive;
2) The requirements can be met through the common camera instead of the professional depth camera, and the equipment cost is reduced;
3) The lightweight improved YOLO-Pose single-stage algorithm is used to perform joint human body detection and posture estimation, and the coordinates of the human body detection frame and key points are obtained with a single feature extraction pass; compared with two-stage methods that perform posture estimation on top of target detection, the steps are simplified and the speed is improved. Meanwhile, the lightweight improvement reduces calculation and storage costs while maintaining precision;
4) Human body behavior recognition is carried out based on skeleton points, which is not easily affected by illumination and background; the behavior of a target can be effectively understood by extracting only the positional change information of the skeleton points, achieving high accuracy. Meanwhile, the amount of data to be processed is reduced, and the real-time performance is high;
5) The human body behavior recognition is carried out by using the PoseC3D algorithm which is improved by light weight, a three-dimensional convolutional neural network is adopted as a basic framework, and compared with the conventional graph convolution method in the current behavior recognition algorithm based on skeleton points, the method has stronger robustness, compatibility and expandability, and better recognition effect can be obtained by using fewer data volumes. Meanwhile, the calculation and storage cost is reduced while the precision is kept through light weight improvement.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a human behavior recognition method based on skeleton points provided by the invention;
FIG. 2 is a flow chart of a human behavior recognition method based on skeleton points provided by the invention;
FIG. 3 is a schematic diagram of the improved YOLO-Pose human detection and Pose estimation network provided by the present invention;
FIG. 4 is a schematic diagram of the structure of the backbone network and the neck network in the improved YOLO-Pose human body detection and Pose estimation network provided by the present invention;
FIG. 5 is a schematic diagram of a decoupling head in the improved YOLO-Pose human detection and Pose estimation network provided by the present invention;
FIG. 6 is a performance graph of experimental results of the improved YOLO-Pose human detection and posture estimation network provided by the present invention on the COCO-Keypoints dataset;
FIG. 7 is a schematic diagram of 17 key points of a human body detected by the improved YOLO-Pose human body detection and posture estimation network provided by the invention;
FIG. 8 is a graph of top-1 versus top-5 accuracy parameters of the improved PoseC3D behavior recognition algorithm provided by the present invention on an NTU-RGB+D60 XSub dataset.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 and 2, the invention discloses a human behavior recognition method based on skeleton points, which comprises the following steps:
s1, data acquisition: shooting video data in a real scene by using a camera;
s2, constructing a human body detection and posture estimation network: based on a lightweight improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head decoupled detection head, wherein the backbone network is ShuffleNetV2_k5;
s3, training and testing a human body detection and posture estimation network: dividing the COCO-Keypoints data set into a training set, a testing set and a verification set, training the human body detection and posture estimation network constructed in S2 by using the training set, verifying by using the verification set, and finally testing the trained and verified human body detection and posture estimation network by using the testing set;
s4, human body detection and posture estimation are carried out: processing the acquired video by using the lightweight improved YOLO-Pose algorithm to obtain human body detection frames, 17 human body key point coordinates and their corresponding confidence coefficients;
s5, human body multi-target tracking: inputting the human body detection frame obtained in the step S4, 17 human body key point coordinates and the corresponding confidence coefficient thereof into a BoT-SORT multi-target tracking algorithm, endowing each detected human body with a unique identity sequence number, matching the same human body targets in the previous frame and the current frame, and sequencing the human body key point coordinates in the current frame according to the identity sequence numbers;
s6, repeating the steps S4 and S5 until the processing of all the video frames is completed, and finally obtaining a human body key point coordinate stacking array of all the video frames;
s7, constructing a human behavior recognition network: based on the PoseC3D algorithm, the original backbone network of the PoseC3D algorithm is replaced by a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, so that a human behavior recognition network is obtained;
s8, training and testing a human behavior recognition network: dividing the NTU-RGB+D60 XSub data set into a training set, a testing set and a verification set, training the human behavior recognition network of S7 by using the training set, verifying by using the verification set, and finally testing the human behavior recognition network after training and verification by using the testing set;
s9, human behavior recognition: processing the human body key point coordinate stacking array obtained in S6 with the lightweight improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence coefficient.
Specifically, the specific structure of the human body detection and posture estimation network in S2 is shown in fig. 3.
Specifically, the human body detection frame in S4 is the maximum circumscribed rectangle of the human body and is used to identify the position of the human body. The confidence coefficient is the probability, calculated by the algorithm, that the detection frame contains a human target. The 17 human body key points obtained through posture estimation are the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow joint, right elbow joint, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle; these 17 key points represent the human posture well, as shown in FIG. 7.
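For reference, these key points follow the standard COCO key-point ordering. A minimal Python mapping is sketched below; the index order is the conventional COCO order, assumed here for illustration.

```python
# Index -> name mapping for the 17 COCO-style human key points listed above.
# The order follows the standard COCO-Keypoints convention assumed by YOLO-Pose.
COCO_KEYPOINTS = [
    "nose",            # 0
    "left_eye",        # 1
    "right_eye",       # 2
    "left_ear",        # 3
    "right_ear",       # 4
    "left_shoulder",   # 5
    "right_shoulder",  # 6
    "left_elbow",      # 7
    "right_elbow",     # 8
    "left_wrist",      # 9
    "right_wrist",     # 10
    "left_hip",        # 11
    "right_hip",       # 12
    "left_knee",       # 13
    "right_knee",      # 14
    "left_ankle",      # 15
    "right_ankle",     # 16
]
```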
Further, S2 is specifically:
s21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules, and the feature maps at four scales output by the backbone network, corresponding to 4×, 8×, 16× and 32× downsampling, then enter the neck network for feature fusion;
s22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, wherein the 1×1 convolutions are replaced by GSConv modules, and the C3 modules are replaced by lightweight C2f modules;
s23, connecting the output characteristic diagram of the neck network with an improved lightweight decoupling head, obtaining a prediction result and performing post-processing.
In particular, CBAM is a lightweight attention mechanism module that can be plugged into any convolutional neural network. By combining channel attention and spatial attention, it strengthens the connections among features across channels and spatial positions while adding almost no computation or parameters, which helps extract effective features of the targets and thereby effectively enhances model performance. The channel attention module focuses on what is useful in the image: the input features go through max pooling and average pooling, are fed into a shared network to obtain two feature descriptors, and the output, namely the channel attention feature map, is obtained by element-wise addition followed by an activation layer. The spatial attention module focuses on where valuable information is located: the module's input is first aggregated by max pooling and average pooling along the channel dimension, then passed through a convolution layer and an activation to obtain the spatial attention feature map. Finally, the module's output weights are applied to the input features to obtain new refined features carrying both channel and spatial attention weights.
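To illustrate the channel and spatial attention described above, a minimal PyTorch-style sketch of a CBAM block follows; the reduction ratio and kernel size are illustrative assumptions rather than the exact settings used in the network.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: avg-pooled and max-pooled features pass through a shared MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)  # B x C x 1 x 1 channel attention map

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise max and mean maps fused by a convolution."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        mx, _ = torch.max(x, dim=1, keepdim=True)
        avg = torch.mean(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))  # B x 1 x H x W

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied as multiplicative weights."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```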
BiFPN was proposed by the Google team. The network introduces learnable weights to distinguish features of different scales, and the features repeatedly pass through top-down and bottom-up paths, so that multi-scale features can be fused better and more efficiently. The location and structure of the BiFPN network in the actual algorithm are shown in FIG. 4.
BiFPN uses a fast normalized fusion method when performing the weighted feature fusion operation: the weights are passed through a ReLU activation function so that they are greater than 0 and are then normalized to the range [0,1] before weighting, which is fast and stable. The calculation formula is

$$Output=\sum_{i}\frac{w_i}{\epsilon+\sum_{j}w_j}\cdot Input_i$$

where Output is the output; w is a weight; ε is a small value of 0.0001 used to prevent numerical instability; Input is an input;
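A minimal sketch of this fast normalized fusion for one fusion node is given below; the module name and the way inputs are passed in are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Weighted fusion of same-shape feature maps with ReLU-constrained learnable weights."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        # ReLU keeps every weight non-negative; the weights are then normalized to [0, 1].
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(w[i] * feat for i, feat in enumerate(inputs))
```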
The C2f module is composed of CBR and BottleNeck modules; the C2f module obtains richer gradient information while remaining lightweight.
In particular, for classification and positioning tasks in target detection, different branches are used for calculation, which is beneficial to the improvement of detection precision, and is a basic idea of decoupling the detection head. Improvement is carried out on the basis of the EdgeYolo detection head: the output characteristic diagram of the neck network is input to three parallel branches of positioning, classifying and key points after being compressed by a 1X 1 convolution layer, each branch is provided with a 3X 3 convolution layer, then the branches are connected with the 1X 1 convolution layer, and finally the implicit expression layer is integrated into the convolution layers of the branches by referring to the implicit knowledge idea, so that the decoupling detection head with high parallelism, low cost and low delay as shown in figure 5 is formed.
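A simplified sketch of such a decoupled head is given below; the hidden channel width is an illustrative assumption, and the implicit-knowledge layers mentioned above are omitted for brevity.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """1x1 compression followed by parallel box / class / key-point branches."""
    def __init__(self, in_channels: int, hidden: int = 128, num_keypoints: int = 17):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, hidden, kernel_size=1)

        def branch(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, out_channels, kernel_size=1),
            )

        self.box_branch = branch(4 + 1)              # box coordinates + objectness
        self.cls_branch = branch(1)                  # person class score
        self.kpt_branch = branch(num_keypoints * 3)  # (x, y, conf) per key point

    def forward(self, x):
        x = self.stem(x)
        return self.box_branch(x), self.cls_branch(x), self.kpt_branch(x)
```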
All elements of a single prediction output of YOLO-Pose are

$$\{C_x,\ C_y,\ W,\ H,\ box_{conf},\ class_{conf},\ K^{1}_{x},\ K^{1}_{y},\ K^{1}_{conf},\ \ldots,\ K^{17}_{x},\ K^{17}_{y},\ K^{17}_{conf}\}$$

where C_x is the abscissa of the upper-left corner of the bounding box and C_y is the ordinate of the upper-left corner of the bounding box; W is the bounding box width; H is the bounding box height; box_conf is the bounding box confidence; class_conf is the confidence of the classification result; K_x is the abscissa of a key point and K_y is the ordinate of a key point; K_conf is the key point confidence.
The CIoU loss, which is scale-invariant and takes the bounding box center point and aspect ratio into account, is used as the bounding box loss function in YOLO-Pose, and the calculation formula is

$$L_{box}=\sum_{s}\sum_{i,j}\sum_{k}\left(1-\mathrm{CIoU}\left(Box^{s,i,j,k},\ Box^{s,i,j,k}_{pred}\right)\right)$$

where L_box is the bounding box loss, s is the scale, i and j are positions, and k is the anchor box index; Box is the ground-truth box of the k-th anchor at position (i, j) and scale s; Box_pred is the predicted box of the k-th anchor at position (i, j) and scale s.
For posture estimation, YOLO-Pose follows the IoU idea used for the bounding box loss and uses an OKS loss to supervise the key points. The OKS loss is scale-invariant and uses different weights to distinguish the importance of each key point, calculated as

$$L_{kpts}=1-\frac{\sum_{n=1}^{N_{kpts}}e^{-\frac{d_n^2}{2s^2k_n^2}}\,\delta(v_n>0)}{\sum_{n=1}^{N_{kpts}}\delta(v_n>0)}$$

where L_kpts is the key point loss; d_n is the Euclidean distance between the ground-truth and predicted key point; s is the target scale; k_n is the weight given to each key point; δ(v_n>0) is the visibility flag of the key point.
When a ground truth is matched with a certain preset anchor frame, YOLO-Pose predicts each key point of the human body from the center point of that anchor frame, calculates the OKS for each key point and sums them to obtain the final key point loss.
Meanwhile, YOLO-Pose uses a BCE loss to learn the key point confidence, which is used to determine whether a key point belongs to the person in the bounding box. The calculation formula is

$$L_{kpts\_conf}=\sum_{n=1}^{N_{kpts}}\mathrm{BCE}\left(\delta(v_n>0),\ p^{n}_{kpts}\right)$$

where L_kpts_conf is the key point confidence loss; δ(v_n>0) is the key point visibility flag, treated here as the ground truth; p_kpts is the predicted key point confidence.
Finally, the overall loss of YOLO-Pose is:

$$L_{total}=\lambda_{cls}L_{cls}+\lambda_{box}L_{box}+\lambda_{kpts}L_{kpts}+\lambda_{kpts\_conf}L_{kpts\_conf}$$

where λ_cls is the hyperparameter of the classification loss, assigned 0.5; λ_box is the hyperparameter of the bounding box loss, assigned 0.05; λ_kpts is the hyperparameter of the key point loss, assigned 0.1; λ_kpts_conf is the hyperparameter of the key point confidence loss, assigned 0.5; L_total is the overall loss; L_cls is the classification loss; L_box is the bounding box loss; L_kpts is the key point loss; L_kpts_conf is the key point confidence loss.
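Putting the loss terms together, a simplified sketch of the OKS key-point loss and the weighted total loss under the hyperparameters listed above follows; the tensor shapes and per-key-point weights are illustrative assumptions.

```python
import torch

def oks_keypoint_loss(pred_xy, gt_xy, visibility, scale, kpt_weights):
    """L_kpts = 1 - OKS, summed only over visible key points (see the formula above)."""
    d2 = ((pred_xy - gt_xy) ** 2).sum(dim=-1)                     # squared distances, (N_kpts,)
    oks = torch.exp(-d2 / (2 * scale ** 2 * kpt_weights ** 2 + 1e-9))
    vis = (visibility > 0).float()
    return 1.0 - (oks * vis).sum() / (vis.sum() + 1e-9)

def total_loss(l_cls, l_box, l_kpts, l_kpts_conf,
               lambda_cls=0.5, lambda_box=0.05, lambda_kpts=0.1, lambda_kpts_conf=0.5):
    """Weighted sum of the four YOLO-Pose loss terms with the weights given above."""
    return (lambda_cls * l_cls + lambda_box * l_box
            + lambda_kpts * l_kpts + lambda_kpts_conf * l_kpts_conf)
```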
Further, the Focus downsampling module first slices the input into four parts and then splices them together again, converting the high-resolution picture into a low-resolution picture, and finally obtains the output through the CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by the downsampling process.
Further, the CBR module consists of a two-dimensional convolution layer, a BN layer, and a ReLU activation function.
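A minimal sketch of the Focus slicing combined with the CBR module described above is given below; the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four interleaved parts, concatenate along channels, then apply CBR."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # CBR: two-dimensional convolution + BN + ReLU, as described above.
        self.cbr = nn.Sequential(
            nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2): spatial information moves into channels.
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.cbr(torch.cat(patches, dim=1))
```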
Further, when training the human body detection and posture estimation network in the step S3, a Mosaic and Mixup data enhancement method is adopted, an SGD optimizer is used, the initial learning rate is 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
Specifically, the weight decay is set to 0.0005, the batch size is set to 64, the number of target classes is set to 1, the number of key points is set to 17, 300 epochs are trained with 640×640 as the input image size, and the trained model weights of each epoch are saved.
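For illustration, the SGD optimizer and cosine annealing schedule described above could be configured roughly as follows; the momentum value and the placeholder model are assumptions, and data loading with Mosaic/Mixup augmentation is omitted.

```python
import torch
import torch.nn as nn

# Placeholder for the detection and posture estimation network (assumed for illustration).
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)

epochs = 300
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)  # momentum is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the Mosaic/Mixup-augmented 640x640 training batches goes here ...
    scheduler.step()
    torch.save(model.state_dict(), f"epoch_{epoch}.pt")  # keep the weights of every epoch
```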
The experimental results are shown in FIG. 6, where "precision" is the precision, "recall" is the recall rate, and "mAP" is the mean average precision; "mAP@0.5" is the mAP value at an IoU threshold of 0.5, and "mAP@0.5:0.95" is the mean of the mAP values at IoU thresholds of 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95.
In addition, some current mainstream lightweight target detection algorithms were selected for comparison experiments on the COCO data set, taking 640-resolution images as input, and model complexity and performance were compared; the specific data are shown in Table 1. The comparison shows that the algorithm provided by the invention is competitive in indexes such as average precision and parameter quantity.
Table 1: improved YOLO-else human body detection and gesture estimation network and performance comparison graph of other lightweight human body detection networks
Further, S5 is specifically:
s51, detecting the human body position of the current frame and predicting the position of the human body of the next frame by using a Kalman filtering algorithm;
s52, carrying out data association and matching on the detected target and the predicted result;
s53, updating the successfully matched track;
s54, BoT-SORT performs re-matching for the tracks that failed to match with detection frames; if a track is matched successfully, it is updated, and if a detection frame is not matched to any existing track, a new track is created for it; if a track fails to match, a further round of matching is performed, and if it still fails, the track is deleted directly.
Specifically, the BoT-SORT multi-target tracking algorithm in S5 works as follows: target detection is first performed on each frame of the image to obtain the bounding box coordinates of all targets, and targets in adjacent frames are then matched by computing their similarity.
Improved Kalman filtering, camera motion compensation and pedestrian re-identification techniques are used to address problems such as frequent identity switches, missed detections and discontinuous tracks under occlusion, blur and similar conditions;
BoT-SORT uses the 8-tuple

$$x_k=\left[x_c,\ y_c,\ w,\ h,\ \dot{x}_c,\ \dot{y}_c,\ \dot{w},\ \dot{h}\right]^{T}$$

as the Kalman filter state vector. Predicting the width and height directly, rather than the aspect ratio, yields a more accurate bounding box that fully encloses the human body. Here x_k is the state vector; x_c is the abscissa of the bounding box center point; y_c is the ordinate of the bounding box center point; w is the bounding box width; h is the bounding box height; the dotted quantities are the first derivatives of the corresponding variables.
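A minimal sketch of the corresponding constant-velocity prediction step on this 8-dimensional state is given below; the frame interval and noise settings are illustrative assumptions.

```python
import numpy as np

# State: [x_c, y_c, w, h, vx, vy, vw, vh]; constant-velocity transition over one frame.
dt = 1.0
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)

def predict(state, covariance, process_noise=None):
    """One Kalman prediction step: propagate the mean and covariance through F."""
    Q = process_noise if process_noise is not None else np.eye(8) * 1e-2  # assumed noise
    state = F @ state
    covariance = F @ covariance @ F.T + Q
    return state, covariance
```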
Further, S7 is specifically:
s71, constructing a ShuffleNet block module, wherein the ShuffleNet block module is formed by connecting building blocks with stride 1 and stride 2, and the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
s72, a CBAM module is inserted after the ShuffleNet block of each stage, and the fused result is used as the backbone network of PoseC3D; by employing the improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced.
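A simplified sketch of the stride-1 building block with the 5×5 depthwise separable convolution is given below; the channel split, 1×1 convolutions and channel shuffle follow the usual ShuffleNetV2 pattern, while the stride-2 unit and the CBAM insertion are omitted for brevity.

```python
import torch
import torch.nn as nn

class ShuffleV2BlockK5(nn.Module):
    """Stride-1 ShuffleNetV2 unit with a 5x5 depthwise convolution in place of 3x3."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 5, padding=2, groups=half, bias=False),  # 5x5 depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    @staticmethod
    def channel_shuffle(x, groups: int = 2):
        b, c, h, w = x.shape
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        left, right = x.chunk(2, dim=1)               # channel split
        out = torch.cat([left, self.branch(right)], dim=1)
        return self.channel_shuffle(out)
```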
Specifically, an Adam optimizer is used during training in S8, the initial learning rate is 0.001, the learning rate is adjusted with a cosine annealing strategy, and the weight decay is set to 0.0001. The batch size is set to 64, training runs for 200 epochs, and the training log and model of each epoch are saved. The remaining hyperparameters keep the PoseC3D default values.
As shown in FIG. 8, "top-1 acc" is the accuracy with which, after ranking the predictions from high to low probability, the top-ranked class matches the true class; "top-5 acc" is the accuracy with which the top five ranked classes contain the true class;
in addition, the comparison results of the improved algorithm and the hot behavior recognition algorithm STGCN, 2s-AGCN based on bone points are shown in Table 2. Compared with the two algorithms, the improved model is lighter and has higher accuracy and certain competitiveness.
Table 2: performance comparison table on NTU-RGB+D60 XSub data set with other bone point-based behavior recognition algorithms
| Method | Input modality | Top-1 accuracy | Top-5 accuracy | Model size |
|---|---|---|---|---|
| STGCN | Skeletal points | 88.9% | 98.7% | 11.90MB |
| 2s-AGCN | Skeletal points | 88.6% | 98.5% | 13.40MB |
| Improved PoseC3D | Skeletal points | 90.3% | 98.5% | 7.09MB |
The behavior classes that can be identified by the behavior identification algorithm obtained by training are shown in table 3.
Table 3: identifiable human behavior class table
Further, S9 is specifically:
s91, processing the human body key point coordinate stacking array obtained in S6, where the number of video frames is F, the maximum identity serial number is N, and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidence coefficients;
s92, obtaining a binary Gaussian distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence coefficient as the maximum value, where the calculation is

$$J_{kij}=c_k\cdot e^{-\frac{(i-x_k)^2+(j-y_k)^2}{2\sigma^2}}$$

where J is the key point heat map, i is the horizontal coordinate variable, x_k is the abscissa of key point k, j is the vertical coordinate variable, y_k is the ordinate of key point k, σ is the variance, c_k is the confidence coefficient of key point k, k is the key point index, and e is the natural constant;
s93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which is taken as the input of subsequent behavior recognition; in order to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments, and then one frame is randomly selected from each segment to complete the key frame extraction (a sketch of S92 and S93 is given after these steps);
s94, finally, the result is input into the PoseC3D network for inference, and the names and scores of the top-n behaviors are obtained according to actual requirements, completing the recognition.
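As referenced in S93, a minimal sketch of the heat-map generation of S92 and the uniform key-frame sampling of S93 is given below; the variance value and function names are illustrative assumptions.

```python
import numpy as np

def keypoint_heatmap(x_k, y_k, c_k, height, width, sigma=0.6):
    """Gaussian heat map centered at (x_k, y_k) with peak value c_k, as in S92."""
    j, i = np.mgrid[0:height, 0:width]  # j: row index (ordinate), i: column index (abscissa)
    return c_k * np.exp(-((i - x_k) ** 2 + (j - y_k) ** 2) / (2 * sigma ** 2))

def uniform_sample_frames(num_frames, n):
    """Split the frame indices into n equal segments and pick one random frame per segment."""
    edges = np.linspace(0, num_frames, n + 1).astype(int)
    return [int(np.random.randint(lo, max(lo + 1, hi))) for lo, hi in zip(edges[:-1], edges[1:])]
```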
A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory, and when the executable code is executed by the processors, the steps of the above skeleton-point-based human behavior recognition method are implemented.
Specifically, the computer device is any device or apparatus having data processing capabilities.
A computer readable storage medium, wherein the readable storage medium stores a program which, when executed by a processor, implements the steps of the above skeleton-point-based human behavior recognition method.
Specifically, the computer readable storage medium may be an internal storage unit of any device or apparatus having data processing capability, such as a hard disk or a memory, or may be an external storage device of any device having data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like, which are provided on the device.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. The human behavior recognition method based on the skeleton points is characterized by comprising the following steps of:
s1, data acquisition: shooting video data in a real scene by using a camera;
s2, constructing a human body detection and posture estimation network: based on a lightweight improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head decoupled detection head, wherein the backbone network is ShuffleNetV2_k5;
s3, training and testing a human body detection and posture estimation network: dividing the COCO-Keypoints data set into a training set, a testing set and a verification set, training the human body detection and posture estimation network constructed in S2 by using the training set, verifying by using the verification set, and finally testing the trained and verified human body detection and posture estimation network by using the testing set;
s4, human body detection and posture estimation are carried out: processing the acquired video by using the lightweight improved YOLO-Pose algorithm to obtain human body detection frames, 17 human body key point coordinates and their corresponding confidence coefficients;
s5, human body multi-target tracking: inputting the human body detection frame obtained in the step S4, 17 human body key point coordinates and the corresponding confidence coefficient thereof into a BoT-SORT multi-target tracking algorithm, endowing each detected human body with a unique identity sequence number, matching the same human body targets in the previous frame and the current frame, and sequencing the human body key point coordinates in the current frame according to the identity sequence numbers;
s6, repeating the steps S4 and S5 until the processing of all the video frames is completed, and finally obtaining a human body key point coordinate stacking array of all the video frames;
s7, constructing a human behavior recognition network: based on the PoseC3D algorithm, the original backbone network of the PoseC3D algorithm is replaced by a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, so that a human behavior recognition network is obtained;
s8, training and testing a human behavior recognition network: dividing the NTU-RGB+D60 XSub data set into a training set, a testing set and a verification set, training the human behavior recognition network of S7 by using the training set, verifying by using the verification set, and finally testing the human behavior recognition network after training and verification by using the testing set;
s9, human behavior recognition: processing the human body key point coordinate stacking array obtained in S6 with the lightweight improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence coefficient.
2. The human behavior recognition method based on skeleton points according to claim 1, wherein S2 specifically is:
s21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules, and the feature maps at four scales output by the backbone network, corresponding to 4×, 8×, 16× and 32× downsampling, then enter the neck network for feature fusion;
s22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, wherein the 1×1 convolutions are replaced by GSConv modules, and the C3 modules are replaced by lightweight C2f modules;
s23, connecting the output characteristic diagram of the neck network with an improved lightweight decoupling head, obtaining a prediction result and performing post-processing.
3. The human behavior recognition method based on skeleton points according to claim 2, wherein the Focus downsampling module first slices the input into four parts and then splices them together again, converting the high-resolution picture into a low-resolution image, and finally obtains the output through the CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by the downsampling process.
4. A method of identifying human behavior based on skeletal points according to claim 3, wherein the CBR module consists of a two-dimensional convolution layer, a BN layer, and a ReLU activation function.
5. The human behavior recognition method based on skeleton points according to claim 1, wherein when training the human detection and posture estimation network in S3, a Mosaic and Mixup data enhancement method is adopted, an SGD optimizer is used, an initial learning rate is 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
6. The human behavior recognition method based on skeleton points according to claim 1, wherein S5 specifically comprises:
s51, detecting the human body position of the current frame and predicting the position of the human body of the next frame by using a Kalman filtering algorithm;
s52, carrying out data association and matching on the detected target and the predicted result;
s53, updating the successfully matched track;
s54, BoT-SORT performs re-matching for the tracks that failed to match with detection frames; if a track is matched successfully, it is updated, and if a detection frame is not matched to any existing track, a new track is created for it; if a track fails to match, a further round of matching is performed, and if it still fails, the track is deleted directly.
7. The human behavior recognition method based on skeleton points according to claim 1, wherein S7 specifically comprises:
s71, constructing a ShuffleNet block module, wherein the ShuffleNet block module is formed by connecting building blocks with stride 1 and stride 2, and the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
s72, a CBAM module is inserted after the ShuffleNet block of each stage, and the fused result is used as the backbone network of PoseC3D; by employing the improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced.
8. The human behavior recognition method based on skeleton points according to claim 1, wherein S9 specifically comprises:
s91, processing the human body key point coordinate stacking array obtained in S6, where the number of video frames is F, the maximum identity serial number is N, and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidence coefficients;
s92, obtaining a binary Gaussian distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence coefficient as the maximum value, where the calculation is

$$J_{kij}=c_k\cdot e^{-\frac{(i-x_k)^2+(j-y_k)^2}{2\sigma^2}}$$

where J is the key point heat map, i is the horizontal coordinate variable, x_k is the abscissa of key point k, j is the vertical coordinate variable, y_k is the ordinate of key point k, σ is the variance, c_k is the confidence coefficient of key point k, k is the key point index, and e is the natural constant;
s93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which is taken as the input of subsequent behavior recognition; in order to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments, and then one frame is randomly selected from each segment to complete the key frame extraction;
s94, finally, the result is input into the PoseC3D network for inference, and the names and scores of the top-n behaviors are obtained according to actual requirements, completing the recognition.
9. A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory, and when the executable code is executed by the processors, the steps of the human behavior recognition method based on skeleton points according to any one of claims 1 to 8 are implemented.
10. A computer readable storage medium, wherein the readable storage medium stores a program which, when executed by a processor, implements the steps of the human behavior recognition method based on skeleton points according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310978530.2A CN117315770A (en) | 2023-08-04 | 2023-08-04 | Human behavior recognition method, device and storage medium based on skeleton points |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310978530.2A CN117315770A (en) | 2023-08-04 | 2023-08-04 | Human behavior recognition method, device and storage medium based on skeleton points |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117315770A true CN117315770A (en) | 2023-12-29 |
Family
ID=89272604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310978530.2A Pending CN117315770A (en) | 2023-08-04 | 2023-08-04 | Human behavior recognition method, device and storage medium based on skeleton points |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117315770A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118397700A (en) * | 2024-04-29 | 2024-07-26 | University of Science and Technology Beijing | Human body drowning detection method and detection system based on ST-GCN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Localization guided learning for pedestrian attribute recognition | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
Soomro et al. | Action recognition in realistic sports videos | |
Tu | Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering | |
Perlin et al. | Extracting human attributes using a convolutional neural network approach | |
CN114220124A (en) | Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system | |
Carmona et al. | Human action recognition by means of subtensor projections and dense trajectories | |
CN111178251A (en) | Pedestrian attribute identification method and system, storage medium and terminal | |
CN110163117B (en) | Pedestrian re-identification method based on self-excitation discriminant feature learning | |
Yu et al. | An object-based visual attention model for robotic applications | |
CN110222718B (en) | Image processing method and device | |
CN111814705B (en) | Pedestrian re-identification method based on batch blocking shielding network | |
Zhang et al. | A coarse to fine indoor visual localization method using environmental semantic information | |
CN111104911A (en) | Pedestrian re-identification method and device based on big data training | |
CN112541421B (en) | Pedestrian reloading and reloading recognition method for open space | |
Pang et al. | Analysis of computer vision applied in martial arts | |
CN117315770A (en) | Human behavior recognition method, device and storage medium based on skeleton points | |
Chen et al. | A multi-scale fusion convolutional neural network for face detection | |
Hdioud et al. | Facial expression recognition of masked faces using deep learning | |
Cho et al. | Learning local attention with guidance map for pose robust facial expression recognition | |
CN115223239A (en) | Gesture recognition method and system, computer equipment and readable storage medium | |
Thi et al. | Integrating local action elements for action analysis | |
Rao et al. | Detecting human behavior from a silhouette using convolutional neural networks | |
Yangyang et al. | A Flower Image Classification Algorithm Based on Saliency Map and PCANet | |
CN112487232B (en) | Face retrieval method and related products |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |