CN117315770A - Human behavior recognition method, device and storage medium based on skeleton points


Info

Publication number
CN117315770A
Authority
CN
China
Prior art keywords
human body
network
behavior recognition
human
module
Prior art date
Legal status
Pending
Application number
CN202310978530.2A
Other languages
Chinese (zh)
Inventor
卢波翰
从镕
张婷茹
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202310978530.2A
Publication of CN117315770A


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                            • G06N 3/0464 Convolutional networks [CNN, ConvNet]
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                            • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806 Fusion of extracted features
                        • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
                • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior recognition method, device and storage medium based on skeleton points, applied to the technical field of behavior recognition. The method comprises the following steps: S1, data acquisition; S2, constructing a human body detection and posture estimation network; S3, training and testing the human body detection and posture estimation network; S4, performing human body detection and posture estimation; S5, human multi-target tracking; S6, repeating steps S4 and S5 until all video frames have been processed, finally obtaining a stacked array of human key point coordinates over all video frames; S7, constructing a human behavior recognition network; S8, training and testing the human behavior recognition network; S9, recognizing human behaviors. Human behavior recognition is performed based on skeleton points: by extracting the position-change information of the skeleton points, the behavior of a target can be effectively understood with high accuracy; meanwhile, the amount of data to be processed is reduced, giving strong real-time performance.

Description

Human behavior recognition method, device and storage medium based on skeleton points
Technical Field
The invention relates to the technical field of behavior recognition, in particular to a human behavior recognition method, device and storage medium based on skeleton points.
Background
With the continuous development of computer technology, many excellent research results have been achieved in the field of computer vision, driving algorithms for image processing, target detection and related tasks to keep advancing toward the demands of real scenes. In recent decades, deep learning technology has made continuous breakthroughs, leading a new wave of research in fields such as target detection and behavior recognition; it has gradually replaced traditional methods and become a common processing framework for computer vision tasks.
Behavior recognition is one of the most challenging tasks in computer vision and is also the basis for subsequent tasks such as intent and trajectory prediction. Behavior recognition is defined as classifying the human actions detected in an input video sequence, so that a computer can read and understand the behavior of a human target, for example judging whether the target is walking, jumping or running. It has clear application significance and prospects in fields such as autonomous driving and intelligent surveillance. At present, behavior recognition tasks are generally handled with either traditional methods or deep learning methods.
Traditional behavior recognition methods manually design features that can represent behaviors, based on human visual observation, obtain feature vectors, and then feed the feature vectors into a designed classifier to obtain the behavior classification result. However, traditional methods are slow in detection and high in complexity; because they rely heavily on manual feature extraction, the resulting models generalize poorly and are easily affected by background factors such as illumination, so most traditional methods cannot meet practical requirements.
In recent years, with the development of deep learning technology, behavior recognition algorithms based on deep learning have gradually entered researchers' field of view, owing to their advantages of being less affected by the background and requiring no manual feature extraction. Deep-learning-based behavior recognition methods can be broadly divided into skeleton-point-based methods and RGB-video-based methods, according to whether human key points are required as input. Human skeleton points can provide a great deal of important information for human action recognition: even without detailed features such as the background, RGB color and human appearance in the video, the behavior of a target can be effectively understood by extracting the position-change information of the skeleton points, and this representation is robust to illumination and scene changes. Therefore, compared with RGB-video-based methods, skeleton-point-based behavior recognition requires less computation and adapts better to different scenes, making it more suitable for real-time recognition, and it has attracted growing attention.
However, when behavior recognition methods are applied in real scenes, high requirements are often placed on both recognition accuracy and speed, which demands computing equipment with strong computing power and large storage space and brings huge cost. Therefore, a lightweight skeleton-point-based human behavior recognition method that maintains recognition accuracy and real-time performance has clear research value.
Providing a human behavior recognition method, device and storage medium based on skeleton points that overcomes these difficulties in the prior art is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus and storage medium for identifying human behavior based on skeletal points, which are used for solving the technical problems existing in the prior art.
In order to achieve the above object, the present invention provides the following technical solutions:
A human behavior recognition method based on skeleton points comprises the following steps:
S1, data acquisition: shooting video data in a real scene with a camera;
S2, constructing a human body detection and posture estimation network: based on a lightweight-improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head detection head, wherein the backbone network is ShuffleNetV2_k5;
S3, training and testing the human body detection and posture estimation network: dividing the COCO-Keypoints dataset into a training set, a test set and a validation set, training the network constructed in S2 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S4, performing human body detection and posture estimation: processing the acquired video with the lightweight-improved YOLO-Pose algorithm to obtain the human body detection boxes, the 17 human key point coordinates and their corresponding confidences;
S5, human multi-target tracking: inputting the human body detection boxes, the 17 human key point coordinates and their corresponding confidences obtained in S4 into the BoT-SORT multi-target tracking algorithm, giving each detected human body a unique identity serial number, matching the same human targets between the previous frame and the current frame, and sorting the human key point coordinates in the current frame by identity serial number;
S6, repeating steps S4 and S5 until all video frames have been processed, finally obtaining a stacked array of human key point coordinates over all video frames;
S7, constructing a human behavior recognition network: based on the PoseC3D algorithm, replacing the original backbone network of PoseC3D with a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, thereby obtaining the human behavior recognition network;
S8, training and testing the human behavior recognition network: dividing the NTU-RGB+D 60 XSub dataset into a training set, a test set and a validation set, training the human behavior recognition network of S7 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S9, human behavior recognition: processing the stacked array of human key point coordinates obtained in S6 with the lightweight-improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence.
Optionally, S2 is specifically:
S21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules; the feature maps output by the backbone at four scales (4×, 8×, 16× and 32× downsampling) then enter the neck network for feature fusion;
S22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, in which the 1×1 convolutions are replaced by GSConv modules and the C3 modules are replaced by lightweight C2f modules;
S23, the output feature map of the neck network is connected to the improved lightweight decoupled head, and the prediction results are obtained and post-processed.
Optionally, the Focus downsampling module first slices the input into four parts and then concatenates them again, converting the high-resolution picture into a lower-resolution image, and finally obtains the output through a CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by downsampling.
Alternatively, the CBR module consists of a two-dimensional convolution layer, BN layer, and ReLU activation function.
Optionally, when training the human body detection and posture estimation network in S3, Mosaic and Mixup data enhancement methods are adopted, an SGD optimizer is used with an initial learning rate of 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
Optionally, S5 is specifically:
S51, detecting the human body positions of the current frame and predicting the human body positions of the next frame with a Kalman filtering algorithm;
S52, performing data association and matching between the detected targets and the predicted results;
S53, updating the successfully matched tracks;
S54, BoT-SORT re-matches the tracks and detection boxes that failed the first association: a track matched in this round is updated, while a detection box that still has no matching track creates a new track; a track that fails this round undergoes one further round of matching, and if it still fails it is deleted directly.
Optionally, S7 is specifically:
S71, constructing a ShuffleNet block module, formed by connecting a building block with stride 1 and a building block with stride 2, where the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
S72, a CBAM module is inserted after the ShuffleNet block of each stage, and the two are fused as the backbone network of PoseC3D; by employing this improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced.
Optionally, S9 is specifically:
S91, processing the stacked array of human key point coordinates obtained in S6, where the number of video frames is F, the maximum identity serial number is N and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidences;
S92, obtaining a binary Gaussian-distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence as the maximum value, calculated as

$J_{kij} = e^{-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}} \cdot c_k$

where $J$ is the key point heat map; $i$ is the pixel abscissa and $j$ the pixel ordinate; $x_k$ and $y_k$ are the abscissa and ordinate of the key point; $\sigma$ is the variance; $c_k$ is the confidence of the key point; $k$ is the key point index; $e$ is the natural constant;
S93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which serves as the input for subsequent behavior recognition; to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments and one frame is randomly selected from each segment, completing the key frame extraction;
S94, finally, the result is input into the PoseC3D network for inference, and the top-n behavior classes and their scores are obtained according to actual requirements, completing the recognition.
A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory; when the processors execute the executable code, the steps of the skeleton-point-based human behavior recognition method described above are implemented.
A computer-readable storage medium, storing a program which, when executed by a processor, implements the steps of the skeleton-point-based human behavior recognition method described above.
Compared with the prior art, the human behavior recognition method, device and storage medium based on skeleton points disclosed by the invention have the following beneficial effects:
1) The invention combines target detection, pose estimation, multi-target tracking and behavior recognition technologies, and provides a lightweight human behavior recognition method that is competitive in both recognition accuracy and speed;
2) The requirements can be met with a common camera rather than a professional depth camera, reducing equipment cost;
3) The lightweight-improved single-stage YOLO-Pose algorithm performs joint human body detection and pose estimation, so the human detection boxes and key point coordinates are obtained with a single feature extraction; compared with two-stage methods that perform pose estimation on top of target detection, this simplifies the steps and improves the speed. Meanwhile, the lightweight improvement reduces computation and storage costs while maintaining accuracy;
4) Human behavior recognition based on skeleton points is not easily affected by illumination and background; the behavior of a target can be effectively understood by extracting only the position-change information of the skeleton points, achieving high accuracy. Meanwhile, the amount of data to be processed is reduced, giving strong real-time performance;
5) Human behavior recognition is performed with the lightweight-improved PoseC3D algorithm, which adopts a three-dimensional convolutional neural network as its basic framework; compared with the graph convolution methods common in current skeleton-point-based behavior recognition algorithms, it has stronger robustness, compatibility and expandability, and obtains better recognition results from less data. Meanwhile, the lightweight improvement reduces computation and storage costs while maintaining accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a human behavior recognition method based on skeleton points provided by the invention;
FIG. 2 is a flow chart of a human behavior recognition method based on skeleton points provided by the invention;
FIG. 3 is a schematic diagram of the improved YOLO-Pose human detection and Pose estimation network provided by the present invention;
FIG. 4 is a schematic diagram of the structure of the backbone network and the neck network in the improved YOLO-Pose human body detection and pose estimation network provided by the present invention;
FIG. 5 is a schematic diagram of a decoupling head in the improved YOLO-Pose human detection and Pose estimation network provided by the present invention;
FIG. 6 is a graph of the experimental performance of the improved YOLO-Pose human body detection and pose estimation network provided by the present invention on the COCO-Keypoints dataset;
FIG. 7 is a schematic diagram of 17 key points of a human body detected by the improved YOLO-Pose human body detection and posture estimation network provided by the invention;
FIG. 8 is a graph of the top-1 and top-5 accuracy of the improved PoseC3D behavior recognition algorithm provided by the present invention on the NTU-RGB+D 60 XSub dataset.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 and 2, the invention discloses a human behavior recognition method based on skeleton points, comprising the following steps:
S1, data acquisition: shooting video data in a real scene with a camera;
S2, constructing a human body detection and posture estimation network: based on a lightweight-improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head detection head, wherein the backbone network is ShuffleNetV2_k5;
S3, training and testing the human body detection and posture estimation network: dividing the COCO-Keypoints dataset into a training set, a test set and a validation set, training the network constructed in S2 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S4, performing human body detection and posture estimation: processing the acquired video with the lightweight-improved YOLO-Pose algorithm to obtain the human body detection boxes, the 17 human key point coordinates and their corresponding confidences;
S5, human multi-target tracking: inputting the human body detection boxes, the 17 human key point coordinates and their corresponding confidences obtained in S4 into the BoT-SORT multi-target tracking algorithm, giving each detected human body a unique identity serial number, matching the same human targets between the previous frame and the current frame, and sorting the human key point coordinates in the current frame by identity serial number;
S6, repeating steps S4 and S5 until all video frames have been processed, finally obtaining a stacked array of human key point coordinates over all video frames;
S7, constructing a human behavior recognition network: based on the PoseC3D algorithm, replacing the original backbone network of PoseC3D with a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, thereby obtaining the human behavior recognition network;
S8, training and testing the human behavior recognition network: dividing the NTU-RGB+D 60 XSub dataset into a training set, a test set and a validation set, training the human behavior recognition network of S7 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S9, human behavior recognition: processing the stacked array of human key point coordinates obtained in S6 with the lightweight-improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence.
Specifically, the specific structure of the human body detection and posture estimation network in S2 is shown in fig. 3.
Specifically, the human body detection box in S4 is the maximum circumscribed rectangle of the human body and is used to identify the position of the human body. The confidence refers to the probability, computed by the algorithm, that the detection box contains a human target. The 17 human key points obtained through posture estimation are the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle; these 17 key points represent the human posture well, as shown in fig. 7.
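For reference, the conventional COCO-Keypoints index ordering of these 17 points, as commonly consumed by YOLO-Pose-style estimators and downstream code, can be written as the following Python constant (the ordering is the standard COCO convention, not something additionally defined by this patent):

```python
# Standard COCO-Keypoints ordering (17 points)
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
```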
Further, S2 is specifically:
S21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules; the feature maps output by the backbone at four scales (4×, 8×, 16× and 32× downsampling) then enter the neck network for feature fusion;
S22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, in which the 1×1 convolutions are replaced by GSConv modules and the C3 modules are replaced by lightweight C2f modules;
S23, the output feature map of the neck network is connected to the improved lightweight decoupled head, and the prediction results are obtained and post-processed.
In particular, CBAM is a lightweight attention mechanism module that can be plugged into any convolutional neural network. By combining channel attention and spatial attention, it improves the connections of features across channels and spatial positions while adding almost no computation or parameters, helping to extract effective target features and thus effectively enhancing model performance. The channel attention module focuses on what is useful in the image: the features are max-pooled and average-pooled, the two pooled descriptors are sent through a shared neural network, and the results are summed and passed through an activation layer to produce the channel attention map. The spatial attention module focuses on where the valuable information is located: its input is first aggregated by max pooling and average pooling along the channel axis, then passed through a convolution layer and an activation to obtain the spatial attention map. Finally, the module's input features are weighted by these maps, yielding refined features carrying both channel and spatial attention weights.
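As an illustration of how such a module can be wired together, the following is a minimal PyTorch sketch of CBAM-style channel and spatial attention; the reduction ratio of 16 and the 7×7 spatial kernel are conventional defaults from the CBAM literature, not values specified by this patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the max-pooled and average-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)            # channel attention map

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Pool along the channel axis, then convolve to a single attention map
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # reweight channels first
        return x * self.sa(x)   # then reweight spatial positions
```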
BiFPN was proposed by the Google team. The network introduces learnable weights to distinguish features of different scales, and features repeatedly pass through top-down and bottom-up structures, so multi-scale features can be fused better and more efficiently. The location and structure of the BiFPN network in the actual algorithm are shown in fig. 4.
When performing the weighted feature fusion operation, BiFPN uses a fast normalized fusion method: each learnable weight is kept non-negative by a ReLU activation function and the weights are normalized into the range [0,1], which is fast and stable. The calculation formula is

$\mathrm{Output} = \sum_i \frac{w_i}{\epsilon + \sum_j w_j}\,\mathrm{Input}_i$

where Output is the fused output; $w_i$ is the learnable weight of the $i$-th input, kept non-negative by the ReLU; $\epsilon = 0.0001$ is a small constant preventing numerical instability; $\mathrm{Input}_i$ is the $i$-th input feature.
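A minimal sketch of this fast normalized fusion, assuming all input feature maps have already been resized and projected to a common shape:

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fast normalized fusion from BiFPN; inputs are assumed to share a shape."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # keep every weight non-negative
        w = w / (self.eps + w.sum())          # normalize into [0, 1]
        return sum(wi * xi for wi, xi in zip(w, inputs))
```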
the C2f module is composed of CBR, bottleNeck modules, and the C2f module can obtain more gradient information while being light in weight.
In particular, computing the classification and localization tasks of target detection in different branches benefits detection accuracy; this is the basic idea of the decoupled detection head. The improvement here is made on the basis of the EdgeYolo detection head: the output feature map of the neck network is compressed by a 1×1 convolution layer and then fed into three parallel branches for localization, classification and key points; each branch has a 3×3 convolution layer followed by 1×1 convolution layers, and finally, following the implicit-knowledge idea, the implicit representation layers are merged into the convolution layers of the branches, forming the decoupled detection head with high parallelism, low cost and low latency shown in fig. 5.
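The three-branch layout described above can be sketched as follows; the hidden channel width and the exact output split are illustrative assumptions, and the implicit representation layers are omitted since, as noted, they are merged into the branch convolutions:

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    """Minimal sketch of the three parallel branches (localization,
    classification, key points) fed by a 1x1 compression layer."""

    def __init__(self, c_in: int, hidden: int = 128, num_kpts: int = 17):
        super().__init__()
        self.stem = conv_bn_relu(c_in, hidden, 1)          # 1x1 compression
        self.reg_branch = conv_bn_relu(hidden, hidden, 3)  # localization
        self.cls_branch = conv_bn_relu(hidden, hidden, 3)  # classification
        self.kpt_branch = conv_bn_relu(hidden, hidden, 3)  # key points
        self.reg_out = nn.Conv2d(hidden, 4 + 1, 1)         # box + box confidence
        self.cls_out = nn.Conv2d(hidden, 1, 1)             # single "person" class
        self.kpt_out = nn.Conv2d(hidden, num_kpts * 3, 1)  # (x, y, conf) per point

    def forward(self, x):
        x = self.stem(x)
        return (self.reg_out(self.reg_branch(x)),
                self.cls_out(self.cls_branch(x)),
                self.kpt_out(self.kpt_branch(x)))
```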
All elements of a single YOLO-Pose prediction are

$P = \{C_x, C_y, W, H, \mathrm{box}_{conf}, \mathrm{class}_{conf}, K_x^1, K_y^1, K_{conf}^1, \ldots, K_x^{17}, K_y^{17}, K_{conf}^{17}\}$

where $C_x$ is the abscissa of the top-left corner of the bounding box and $C_y$ its ordinate; $W$ is the bounding box width; $H$ is the bounding box height; $\mathrm{box}_{conf}$ is the bounding box confidence; $\mathrm{class}_{conf}$ is the confidence of the classification result; $K_x$ and $K_y$ are the abscissa and ordinate of a key point; $K_{conf}$ is the key point confidence.
The CIoU loss, which is scale-invariant and takes the bounding box center point and aspect ratio into account, is used as the bounding box loss function in YOLO-Pose. The calculation formula is

$L_{box} = \sum_{s}\sum_{i,j}\sum_{k}\left(1 - \mathrm{CIoU}\left(\mathrm{Box}_{gt}^{s,i,j,k}, \mathrm{Box}_{pred}^{s,i,j,k}\right)\right)$

where $L_{box}$ is the bounding box loss; $s$ is the scale; $(i, j)$ is the position; $k$ is the anchor box index; $\mathrm{Box}_{gt}$ is the ground-truth box of the $k$-th anchor at position $(i, j)$ and scale $s$; $\mathrm{Box}_{pred}$ is the corresponding predicted box.
For pose estimation, YOLO-Pose applies the IoU idea of the bounding box loss by analogy and uses an OKS (Object Keypoint Similarity) loss to supervise the key points. The OKS loss is scale-invariant and uses different weights to distinguish the importance of each key point; it is calculated as

$L_{kpts} = 1 - \frac{\sum_{n=1}^{17} e^{-d_n^2/(2s^2k_n^2)}\,\delta(v_n>0)}{\sum_{n=1}^{17}\delta(v_n>0)}$

where $L_{kpts}$ is the key point loss; $d_n$ is the Euclidean distance between the ground-truth and predicted positions of the $n$-th key point; $s$ is the target scale; $k_n$ is the weight given to the $n$-th key point; $\delta(v_n>0)$ is the visibility flag of the key point.
When a ground truth is matched with a certain preset anchor box, YOLO-Pose predicts each key point of the human body from the center point of the anchor box, computes the OKS for each key point, and sums them to obtain the final key point loss.
Meanwhile, YOLO-Pose uses a BCE loss to learn the key point confidence, which is used to determine whether a key point belongs to the person in the bounding box. The calculation formula is

$L_{kpts\_conf} = \sum_{n=1}^{17} \mathrm{BCE}\left(\delta(v_n>0),\ p_{kpts}^{n}\right)$

where $L_{kpts\_conf}$ is the key point confidence loss; $\delta(v_n>0)$ is the visibility flag of the key point, treated here as the ground-truth label; $p_{kpts}^{n}$ is the predicted confidence of the $n$-th key point.
Finally, the overall loss of YOLO-Pose is

$L_{total} = \sum_{s}\sum_{i,j}\sum_{k}\left(\lambda_{cls}L_{cls} + \lambda_{box}L_{box} + \lambda_{kpts}L_{kpts} + \lambda_{kpts\_conf}L_{kpts\_conf}\right)$

where $\lambda_{cls} = 0.5$ is the hyperparameter of the classification loss; $\lambda_{box} = 0.05$ is the hyperparameter of the bounding box loss; $\lambda_{kpts} = 0.1$ is the hyperparameter of the key point loss; $\lambda_{kpts\_conf} = 0.5$ is the hyperparameter of the key point confidence loss; $L_{total}$ is the overall loss; $L_{cls}$ is the classification loss; $L_{box}$ is the bounding box loss; $L_{kpts}$ is the key point loss; $L_{kpts\_conf}$ is the key point confidence loss.
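Under the formulas above, the OKS key point loss and the weighted overall loss can be sketched as follows; the tensor layout and the per-keypoint weight handling are illustrative assumptions:

```python
import torch

LAMBDA_CLS, LAMBDA_BOX, LAMBDA_KPTS, LAMBDA_KPTS_CONF = 0.5, 0.05, 0.1, 0.5

def oks_loss(kpts_pred, kpts_gt, vis, scale, k):
    # kpts_pred, kpts_gt: (B, 17, 2); vis: (B, 17); scale: (B,); k: (17,)
    d2 = ((kpts_pred - kpts_gt) ** 2).sum(-1)       # squared Euclidean distances
    oks = torch.exp(-d2 / (2 * scale[:, None] ** 2 * k ** 2 + 1e-9))
    mask = (vis > 0).float()                        # visibility flags
    oks_mean = (oks * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return (1.0 - oks_mean).mean()

def yolo_pose_loss(l_cls, l_box, l_kpts, l_kpts_conf):
    # Weighted sum with the hyperparameters stated above
    return (LAMBDA_CLS * l_cls + LAMBDA_BOX * l_box
            + LAMBDA_KPTS * l_kpts + LAMBDA_KPTS_CONF * l_kpts_conf)
```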
Further, the Focus downsampling module first slices the input into four parts and then concatenates them again, converting the high-resolution picture into a lower-resolution image, and finally obtains the output through a CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by downsampling.
Further, the CBR module consists of a two-dimensional convolution layer, a BN layer, and a ReLU activation function.
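A minimal sketch of the CBR and Focus modules as described; the 3×3 kernel of the final CBR is an assumption:

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conv2d + BatchNorm + ReLU, as described above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice the input into four interleaved parts and concatenate them on the
    channel axis (halving the spatial resolution, quadrupling the channels),
    then obtain the output through a CBR module."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cbr = CBR(4 * c_in, c_out, k=3)

    def forward(self, x):
        return self.cbr(torch.cat([
            x[..., ::2, ::2], x[..., 1::2, ::2],
            x[..., ::2, 1::2], x[..., 1::2, 1::2],
        ], dim=1))
```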
Further, when training the human body detection and posture estimation network in S3, Mosaic and Mixup data enhancement methods are adopted, an SGD optimizer is used with an initial learning rate of 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
Specifically, the weight decay is set to 0.0005, the batch size is set to 64, the number of target classes is set to 1, the number of key points is set to 17, the network is trained for 300 epochs with 640×640 input images, and the model weights of each epoch are saved.
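The stated optimization schedule can be sketched as follows; the momentum value and the placeholder network are assumptions, since they are not specified here:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stands in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
# Cosine annealing over the full 300-epoch schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one epoch over Mosaic/Mixup-augmented 640x640 batches of size 64 ...
    scheduler.step()
```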
The experimental results are shown in fig. 6, where "precision" is the precision, "recall" is the recall rate, and "mAP" refers to the mean average precision; "mAP@0.5" is the mAP value at an IoU threshold of 0.5, and "mAP@0.5:0.95" is the mean of the mAP values at IoU thresholds of 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95.
In addition, several current mainstream lightweight target detection algorithms were selected for comparison experiments on the COCO dataset with 640×640 input resolution, comparing model complexity and performance; the specific data are shown in Table 1. The comparison shows that the algorithm provided by the invention is competitive in indexes such as average precision and parameter count.
Table 1: improved YOLO-else human body detection and gesture estimation network and performance comparison graph of other lightweight human body detection networks
Further, S5 is specifically:
S51, detecting the human body positions of the current frame and predicting the human body positions of the next frame with a Kalman filtering algorithm;
S52, performing data association and matching between the detected targets and the predicted results;
S53, updating the successfully matched tracks;
S54, BoT-SORT re-matches the tracks and detection boxes that failed the first association: a track matched in this round is updated, while a detection box that still has no matching track creates a new track; a track that fails this round undergoes one further round of matching, and if it still fails it is deleted directly.
Specifically, the BoT-SORT multi-target tracking algorithm in S5 first performs target detection on each frame of the image to obtain the bounding box coordinates of all targets, and then computes the similarity of targets between adjacent frames for matching.
Improved Kalman filtering, camera motion compensation and pedestrian re-identification techniques are used to alleviate problems such as frequent identity switches, missed detections and discontinuous tracks under occlusion, blur and similar conditions.
BoT-SORT uses the 8-tuple

$x_k = [\,x_c,\ y_c,\ w,\ h,\ \dot{x}_c,\ \dot{y}_c,\ \dot{w},\ \dot{h}\,]^{T}$

as the Kalman filter state vector. Predicting the width and height directly, rather than the aspect ratio, yields a more accurate bounding box that fully encloses the human body. Here $x_k$ is the state vector; $x_c$ is the abscissa of the bounding box center point and $y_c$ its ordinate; $w$ is the bounding box width; $h$ is the bounding box height; the dotted terms are the first derivatives of the corresponding variables.
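One constant-velocity Kalman prediction step over this 8-dimensional state can be sketched as follows; the unit frame interval and the noise covariance handling are illustrative assumptions:

```python
import numpy as np

dt = 1.0                      # one frame between predictions (assumed)
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)    # position += velocity * dt

def predict(x, P, Q):
    """One Kalman prediction step: x is the (8,) state
    [xc, yc, w, h, dxc, dyc, dw, dh], P its covariance,
    Q the process-noise covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```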
Further, S7 is specifically:
S71, constructing a ShuffleNet block module, formed by connecting a building block with stride 1 and a building block with stride 2, where the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
S72, a CBAM module is inserted after the ShuffleNet block of each stage, and the two are fused as the backbone network of PoseC3D; by employing this improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced (a sketch of assembling this backbone is given below).
Specifically, an Adam optimizer is used during training in S8 with an initial learning rate of 0.001; the learning rate is adjusted with a cosine annealing strategy, and the weight decay is set to 0.0001. The batch size is set to 64, training runs for 200 epochs, and the training log and model of each epoch are saved. The remaining hyperparameters are kept consistent with the PoseC3D training defaults.
As shown in FIG. 8, "top-1 acc" refers to the accuracy with which, after ranking the prediction results by probability from high to low, the top-ranked class is consistent with the true class; "top-5 acc" refers to the accuracy with which the top five ranked classes contain the true class.
in addition, the comparison results of the improved algorithm and the hot behavior recognition algorithm STGCN, 2s-AGCN based on bone points are shown in Table 2. Compared with the two algorithms, the improved model is lighter and has higher accuracy and certain competitiveness.
Table 2: performance comparison table on NTU-RGB+D60 XSub data set with other bone point-based behavior recognition algorithms
Method Input modality Top-1 accuracy Top-5 accuracy Model size
STGCN Skeletal points 88.9% 98.7% 11.90MB
2s-AGCN Skeletal points 88.6% 98.5% 13.40MB
Improved PoseC3D Skeletal points 90.3% 98.5% 7.09MB
The behavior classes that can be identified by the behavior identification algorithm obtained by training are shown in table 3.
Table 3: identifiable human behavior class table
Further, S9 is specifically:
S91, processing the stacked array of human key point coordinates obtained in S6, where the number of video frames is F, the maximum identity serial number is N and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidences;
S92, obtaining a binary Gaussian-distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence as the maximum value, calculated as

$J_{kij} = e^{-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}} \cdot c_k$

where $J$ is the key point heat map; $i$ is the pixel abscissa and $j$ the pixel ordinate; $x_k$ and $y_k$ are the abscissa and ordinate of the key point; $\sigma$ is the variance; $c_k$ is the confidence of the key point; $k$ is the key point index; $e$ is the natural constant;
S93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which serves as the input for subsequent behavior recognition; to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments and one frame is randomly selected from each segment, completing the key frame extraction (the heat map construction of S92 and this sampling step are both sketched after the list);
S94, finally, the result is input into the PoseC3D network for inference, and the top-n behavior classes and their scores are obtained according to actual requirements, completing the recognition.
A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory; when the processors execute the executable code, the steps of the skeleton-point-based human behavior recognition method described above are implemented.
Specifically, the computer device is any device or apparatus having data processing capabilities.
A computer-readable storage medium, storing a program which, when executed by a processor, implements the steps of the skeleton-point-based human behavior recognition method described above.
Specifically, the computer readable storage medium may be an internal storage unit of any device or apparatus having data processing capability, such as a hard disk or a memory, or may be an external storage device of any device having data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like, which are provided on the device.
In this specification, the embodiments are described progressively, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed therein, its description is relatively brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A human behavior recognition method based on skeleton points, characterized by comprising the following steps:
S1, data acquisition: shooting video data in a real scene with a camera;
S2, constructing a human body detection and posture estimation network: based on a lightweight-improved YOLO-Pose algorithm, constructing a human body detection and posture estimation network consisting of a backbone network, a BiFPN neck network and a Decoupled-Head detection head, wherein the backbone network is ShuffleNetV2_k5;
S3, training and testing the human body detection and posture estimation network: dividing the COCO-Keypoints dataset into a training set, a test set and a validation set, training the network constructed in S2 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S4, performing human body detection and posture estimation: processing the acquired video with the lightweight-improved YOLO-Pose algorithm to obtain the human body detection boxes, the 17 human key point coordinates and their corresponding confidences;
S5, human multi-target tracking: inputting the human body detection boxes, the 17 human key point coordinates and their corresponding confidences obtained in S4 into the BoT-SORT multi-target tracking algorithm, giving each detected human body a unique identity serial number, matching the same human targets between the previous frame and the current frame, and sorting the human key point coordinates in the current frame by identity serial number;
S6, repeating steps S4 and S5 until all video frames have been processed, finally obtaining a stacked array of human key point coordinates over all video frames;
S7, constructing a human behavior recognition network: based on the PoseC3D algorithm, replacing the original backbone network of PoseC3D with a ShuffleNetV2_k5 network fused with a CBAM attention mechanism, thereby obtaining the human behavior recognition network;
S8, training and testing the human behavior recognition network: dividing the NTU-RGB+D 60 XSub dataset into a training set, a test set and a validation set, training the human behavior recognition network of S7 with the training set, validating with the validation set, and finally testing the trained and validated network with the test set;
S9, human behavior recognition: processing the stacked array of human key point coordinates obtained in S6 with the lightweight-improved PoseC3D algorithm, finally obtaining the human behavior recognition result and its confidence.
2. The human behavior recognition method based on skeleton points according to claim 1, wherein S2 is specifically:
S21, the backbone network is composed of a lightweight Focus downsampling module and ShuffleNetV2_k5 modules; the feature maps output by the backbone at four scales (4×, 8×, 16× and 32× downsampling) then enter the neck network for feature fusion;
S22, the neck network is composed of a BiFPN network fused with CBAM attention mechanism modules, in which the 1×1 convolutions are replaced by GSConv modules and the C3 modules are replaced by lightweight C2f modules;
S23, the output feature map of the neck network is connected to the improved lightweight decoupled head, and the prediction results are obtained and post-processed.
3. The human behavior recognition method based on skeleton points according to claim 2, wherein the Focus downsampling module first slices the input into four parts and then concatenates them again, converting the high-resolution picture into a lower-resolution image, and finally obtains the output through the CBR module; the Focus downsampling module reduces floating-point operations while reducing the information loss caused by downsampling.
4. A method of identifying human behavior based on skeletal points according to claim 3, wherein the CBR module consists of a two-dimensional convolution layer, a BN layer, and a ReLU activation function.
5. The human behavior recognition method based on skeleton points according to claim 1, wherein, when training the human body detection and posture estimation network in S3, Mosaic and Mixup data enhancement methods are adopted, an SGD optimizer is used with an initial learning rate of 0.01, and a cosine annealing strategy is adopted to adjust the learning rate.
6. The human behavior recognition method based on skeleton points according to claim 1, wherein S5 is specifically:
S51, detecting the human body positions of the current frame and predicting the human body positions of the next frame with a Kalman filtering algorithm;
S52, performing data association and matching between the detected targets and the predicted results;
S53, updating the successfully matched tracks;
S54, BoT-SORT re-matches the tracks and detection boxes that failed the first association: a track matched in this round is updated, while a detection box that still has no matching track creates a new track; a track that fails this round undergoes one further round of matching, and if it still fails it is deleted directly.
7. The human behavior recognition method based on skeleton points according to claim 1, wherein S7 is specifically:
S71, constructing a ShuffleNet block module, formed by connecting a building block with stride 1 and a building block with stride 2, where the building blocks replace all 3×3 depthwise separable convolutions in ShuffleNetV2 with 5×5 depthwise separable convolutions;
S72, a CBAM module is inserted after the ShuffleNet block of each stage, and the two are fused as the backbone network of PoseC3D; by employing this improved ShuffleNetV2 network fused with the CBAM attention mechanism as the backbone, the number of parameters and the computational cost are reduced.
8. The human behavior recognition method based on skeleton points according to claim 1, wherein S9 is specifically:
S91, processing the stacked array of human key point coordinates obtained in S6, where the number of video frames is F, the maximum identity serial number is N and the number of skeleton points is K, finally obtaining an N×F×K×2 array storing the skeleton point coordinates and an N×F×K×1 array storing the skeleton point confidences;
S92, obtaining a binary Gaussian-distribution heat map of the 2D key points by taking the coordinates of each skeleton point as the center and its confidence as the maximum value, calculated as

$J_{kij} = e^{-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}} \cdot c_k$

where $J$ is the key point heat map; $i$ is the pixel abscissa and $j$ the pixel ordinate; $x_k$ and $y_k$ are the abscissa and ordinate of the key point; $\sigma$ is the variance; $c_k$ is the confidence of the key point; $k$ is the key point index; $e$ is the natural constant;
S93, stacking the 2D heat maps obtained in S92 to obtain a 3D heat map, which serves as the input for subsequent behavior recognition; to reduce redundancy in the time dimension, a uniform sampling method is adopted to extract key frames: assuming n key frames are needed, the video frames are first evenly divided into n segments and one frame is randomly selected from each segment, completing the key frame extraction;
S94, finally, the result is input into the PoseC3D network for inference, and the top-n behavior classes and their scores are obtained according to actual requirements, completing the recognition.
9. A computer device, comprising a memory and one or more processors, wherein executable code is stored in the memory; when the processors execute the executable code, the steps of the human behavior recognition method based on skeleton points according to any one of claims 1-8 are implemented.
10. A computer-readable storage medium, storing a program which, when executed by a processor, implements the steps of the human behavior recognition method based on skeleton points according to any one of claims 1-8.
CN202310978530.2A 2023-08-04 2023-08-04 Human behavior recognition method, device and storage medium based on skeleton points Pending CN117315770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310978530.2A CN117315770A (en) 2023-08-04 2023-08-04 Human behavior recognition method, device and storage medium based on skeleton points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310978530.2A CN117315770A (en) 2023-08-04 2023-08-04 Human behavior recognition method, device and storage medium based on skeleton points

Publications (1)

Publication Number Publication Date
CN117315770A 2023-12-29

Family

ID=89272604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310978530.2A Pending CN117315770A (en) 2023-08-04 2023-08-04 Human behavior recognition method, device and storage medium based on skeleton points

Country Status (1)

Country Link
CN (1) CN117315770A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118397700A (en) * 2024-04-29 2024-07-26 北京科技大学 Human body drowning detection method and detection system based on ST-GCN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination