CN111931603A - Human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network - Google Patents
Human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network
- Publication number
- CN111931603A (application CN202010710147.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- convolutional
- video
- convolution
- optical flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Psychiatry (AREA)
- Human Computer Interaction (AREA)
- Social Psychology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network. The method comprises the following steps: a video is input into a network with a dual-stream architecture; in the temporal stream, the competitive collaboration network segments the video frames into dynamic and static pixels and outputs optical flow images with the static regions removed, which are fed into a medium-scale convolutional network (CNN-M) for feature extraction; in the spatial stream, the multi-frame video images are fed into a convolutional 3D (C3D) network, which extracts features from each video frame. The extracted features are classified by the softmax layers of the CNN-M and C3D networks respectively, and the classification results are then fused at the score level by a multi-class support vector machine to obtain a correct recognition of the human action, thereby reducing errors caused by external differences such as the environment and improving the accuracy of action recognition.
Description
Technical Field
The invention relates to the field of computer vision, in particular to the analysis and study of human body action recognition, and more particularly to a human body action recognition system and method based on a dual-stream convolutional network built on a competitive collaboration network architecture.
Background
In the information society, images and video account for a large share of all perceived information. Computer vision uses cameras and computers to acquire data and information about the objects being photographed, and can process image and video information automatically and efficiently, so the field is receiving more and more attention. Within this field, human action recognition is an important research direction: intelligent surveillance, video security and virtual reality all depend on human action recognition technology. Among the many human action recognition techniques, the dual-stream convolutional network not only extracts the spatial information of individual images but also captures the temporal information in a video frame sequence, imitating the way humans perceive and understand video.
A typical human action recognition pipeline proceeds as follows: first, moving objects are detected in the video; then features are extracted from the detected moving objects; finally, the extracted features are classified and recognized. Traditional action recognition methods usually work in these two dimensions, but their performance is not ideal. In recent years, the development of deep learning and its application to human action recognition have produced a number of methods that extract features automatically and further improve recognition rates. Three deep learning approaches are commonly used to recognize actions in video: the two-stream method, the C3D (Convolutional 3D network) method, and CNN-LSTM (CNN with Long Short-Term Memory). The two-stream method applies a dual-stream convolutional network to process the spatial and temporal information in the video separately, splitting it into a spatial stream convolutional network and a temporal stream convolutional network. RGB images are used as the input of the spatial stream, optical flow images as the input of the temporal stream, and multi-task training over combined data sets reduces overfitting and improves recognition accuracy.
However, the optical flow field used by the dual-stream convolutional network is strongly affected by environmental differences such as occlusion, multiple viewpoints, illumination and background, which degrades the accuracy of action recognition. Moreover, the optical flow images are computed over the whole video, so static regions in the video introduce noise into the temporal stream convolutional network. The competitive collaboration network addresses exactly this problem: it couples a depth estimation network, a camera motion network, an optical flow network and a motion segmentation network, so that moving regions can be separated from static regions.
Therefore, the invention applies a competitive collaboration network to effectively remove the interference of static pixels on dynamic pixels in video frames, reduce errors caused by external differences such as the environment, and accurately recognize human actions in video.
Disclosure of Invention
The invention provides a human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network. The optical flow network inside the competitive collaboration network outputs an optical flow map containing only dynamic pixels, which serves as the input of the temporal stream convolutional network, removing environmental noise and improving the accuracy of human action recognition. At the same time, the static-region network (composed of a depth network and a camera motion network) and the dynamic-region network (represented by the optical flow network) jointly train the motion segmentation network, making the segmentation of dynamic and static regions more accurate and reducing errors caused by external differences such as the environment.
In order to achieve the above object, the present invention provides a human body action recognition system based on a dual-stream convolutional network with a competitive collaboration network, comprising:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
and further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network; the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
In an embodiment of the present invention, the competitive collaboration network comprises a static-region network, a dynamic-region network and a motion segmentation network; the static-region network comprises a depth estimation network and a camera motion network, and the dynamic-region network is an optical flow network;
in an embodiment of the present invention, the temporal stream convolutional network is a medium-scale convolutional network (CNN-M) comprising 5 convolutional layers, 2 fully connected layers and one softmax layer; the input image size is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1;
in an embodiment of the present invention, the spatial stream convolutional network is a convolutional 3D (C3D) network with 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer; all 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units;
in an embodiment of the present invention, the fusion network is a multi-class support vector machine which adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights. The L2 penalty suppresses large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments.
The invention also provides a human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network, comprising the following steps:
S1, inputting a video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
S3, inputting the optical flow image sequence output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
S6, fusing the feature classification results at the score level with a multi-class support vector machine to obtain a correct recognition of the human action; the multi-class support vector machine adds an L2 regularization penalty to the calculation of its loss function, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments;
S7, outputting the final recognition result.
In an embodiment of the present invention, the specific steps by which the competitive collaboration network performs dynamic/static pixel segmentation on the video frames in S2 are:
S21, the static-region network estimates the optical flow of the static region through the depth estimation network and the camera motion network, thereby predicting the static-region pixels;
S22, the dynamic-region network estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network coordinates the competition between the static-region network and the dynamic-region network and removes the static-region pixels from the dynamic-region network, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network, the dynamic-region network and the motion segmentation network are trained jointly using the loss of the composite optical flow;
S26, the static-region network, the dynamic-region network and the motion segmentation network alternately perform dynamic/static region segmentation within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
wherein the training cycle of S26 comprises a first phase and a second phase:
first phase: the motion segmentation network acts as the coordination network that trains the two competing networks formed by the static-region network and the dynamic-region network, minimizing an energy function;
second phase: the two competing networks collaborate to train the coordination network, minimizing the energy function.
The invention introduces a competitive collaboration network into the dual-stream network to segment dynamic and static pixels, so that the optical flow network outputs an optical flow map with the static-region pixels removed, which is used as the input of the temporal stream convolutional network. This reduces the errors caused by external differences such as the environment in the prior art and improves action recognition accuracy. At the same time, the temporal stream is combined with a spatial stream convolutional network formed by a convolutional 3D network into the dual-stream network, making full use of both the temporal and the spatial information in the video.
Drawings
FIG. 1 is a block diagram of the human body action recognition method of the dual-stream convolutional network based on the competitive collaboration network architecture according to the present invention;
FIG. 2 is the original dual-stream network model;
FIG. 3 is a schematic diagram of the competitive collaboration network;
FIG. 4 is a diagram of the CNN-M network architecture;
FIG. 5 is a schematic diagram of the difference between a 2D convolutional network and a 3D convolutional network;
FIG. 6 is a schematic diagram of the two phases of the competitive collaboration network.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
FIG. 1 is a block diagram of the human body action recognition method of the dual-stream convolutional network based on the competitive collaboration network architecture. As shown in FIG. 1, a video is input into the network with the dual-stream architecture: in the temporal stream, the competitive collaboration network segments the video frames into dynamic and static pixels and outputs optical flow images with the static regions removed, which are fed into the medium-scale convolutional network for feature extraction; in the spatial stream, the multi-frame video images are fed into the convolutional 3D network, which extracts features from each video frame. The extracted features are classified by the softmax layers of the medium-scale convolutional network and the convolutional 3D network respectively, and the classification results are then fused at the score level by a multi-class support vector machine to obtain a correct recognition of the human action.
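For concreteness, the score-fusion step at the end of this pipeline can be sketched as follows: each stream produces a per-class score vector for a video, the two vectors are concatenated, and a linear multi-class SVM with an L2 penalty is trained on them. This is a minimal sketch under assumed shapes; the array names, class count and random placeholder data are illustrative and not taken from the source.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in softmax score vectors from the two streams for N training videos.
N, num_classes = 200, 10
temporal_scores = np.random.rand(N, num_classes)    # CNN-M softmax output (placeholder)
spatial_scores = np.random.rand(N, num_classes)     # C3D softmax output (placeholder)
labels = np.random.randint(0, num_classes, size=N)  # action labels (placeholder)

# Concatenate the per-stream class scores and train a linear multi-class SVM with an
# L2 penalty, which plays the role of the score-fusion network.
fused = np.concatenate([temporal_scores, spatial_scores], axis=1)
svm = LinearSVC(penalty="l2", C=1.0, multi_class="crammer_singer", max_iter=10000)
svm.fit(fused, labels)

# At test time, the fused prediction for a video is the SVM decision on its scores.
predicted_actions = svm.predict(fused)
```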
The following describes the human body action recognition system and method based on the dual-stream convolutional network with the competitive collaboration network through a specific embodiment.
The human body action recognition system of the dual-stream convolutional network based on the competitive collaboration network provided by this embodiment uses the dual-stream network shown in FIG. 2 as the network architecture for feature extraction: each single video frame, i.e. an RGB image, serves as the spatial component and is input into the spatial stream convolutional network, while the dense optical flow between video frames is input into the temporal stream convolutional network; the two convolutional networks perform feature extraction and classification separately, and finally the results are fused. As shown in FIG. 1 and FIG. 2, the human body action recognition system of the dual-stream convolutional network based on the competitive collaboration network comprises:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
and further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network; the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
In the embodiment of the present invention, the competitive collaboration network comprises a static-region network R, a dynamic-region network F and a motion segmentation network M, as shown in FIG. 3, where the static-region network R comprises a depth estimation network D and a camera motion network C, i.e. it can be written R = (D, C), and the dynamic-region network is the optical flow network F;
in an embodiment of the present invention, the temporal stream convolutional network is a medium-scale convolutional network (CNN-M) comprising 5 convolutional layers, 2 fully connected layers and one softmax layer, as shown in FIG. 4. The input image size is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1. By increasing the number of filters and decreasing their size and stride, CNN-M can better discover and retain the detail information of the original input image.
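As an illustration only, the layer specification above can be expressed as the following PyTorch sketch. The kernel sizes, strides and layer counts follow the text; the channel widths, pooling placement, optical-flow input channel count and class count are assumptions and are not given by the source.

```python
import torch
import torch.nn as nn

class CNNM(nn.Module):
    """Sketch of the medium-scale (CNN-M) temporal-stream network described above.

    Kernel sizes and strides follow the text; channel widths, pooling layers,
    input depth and class count are assumptions.
    """
    def __init__(self, in_channels=20, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(       # two 4096-unit FC layers + softmax output
            nn.Flatten(),
            nn.Linear(512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                      # x: (batch, in_channels, 224, 224)
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```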
In the embodiment of the present invention, the spatial stream convolutional network is a convolutional 3D (C3D) network, as shown in FIG. 1: a 3D convolutional network with uniformly arranged 3 × 3 × 3 convolution kernels, having 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer. All 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is set to 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units.
In a 3D convolutional network, convolution and pooling operate over both space and time, whereas in a 2D convolutional network they operate only over space, as shown in FIG. 5, where FIG. 5(a) is a 2D convolutional network with multi-frame input and its output, and FIG. 5(b) is a 3D convolutional network with multi-frame input and its output. The video clip size is defined as c × L × H × W, where c is the number of channels, L is the number of frames, and H and W are the height and width of a frame, respectively. The 3D convolution and pooling kernel size is denoted d × k × k, where d is the temporal depth of the kernel and k is its spatial size. Through 3D convolution and 3D pooling, a 3D convolutional network can therefore model temporal information better than a 2D convolutional network.
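The following sketch shows one way the C3D spatial stream described above could be written in PyTorch. The conv/pool layer counts, kernel and stride sizes and the 4096-unit fully connected layers follow the text; the channel widths, the 3 × 16 × 112 × 112 input clip size and the class count are assumptions borrowed from the common C3D configuration.

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """Sketch of the spatial-stream convolutional 3D (C3D) network described above.

    Layer counts, kernel/pool sizes and 4096-unit FC layers follow the text;
    channel widths, the 3x16x112x112 clip size and class count are assumptions.
    """
    def __init__(self, num_classes=101):
        super().__init__()
        def conv(cin, cout):                   # every 3D conv is 3x3x3 with stride 1
            return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=1, padding=1),
                                 nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 64),                     nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),
            conv(64, 128),                   nn.MaxPool3d(2, stride=2),
            conv(128, 256), conv(256, 256),  nn.MaxPool3d(2, stride=2),
            conv(256, 512), conv(512, 512),  nn.MaxPool3d(2, stride=2),
            conv(512, 512), conv(512, 512),  nn.MaxPool3d(2, stride=2),
        )
        self.classifier = nn.Sequential(       # fc6, fc7 (4096 units) + softmax output
            nn.Flatten(),
            nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, clip):                   # clip: (batch, 3, 16, 112, 112)
        return torch.softmax(self.classifier(self.features(clip)), dim=1)
```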
In an embodiment of the present invention, the fusion network is a multi-class support vector machine (multi-class SVM) that adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights.
The embodiment of the invention also provides a human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network, comprising the following steps:
S1, inputting the video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
As shown in FIG. 3, the specific steps by which the competitive collaboration network performs moving/static pixel segmentation on the video frames in S2 are as follows:
S21, the static-region network R estimates the optical flow of the static region through the depth estimation network D and the camera motion network C, thereby predicting the static-region pixels;
S22, the dynamic-region network F estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network M coordinates the competition between the static-region network R and the dynamic-region network F and removes the static-region pixels from the dynamic-region network F, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network R, the dynamic-region network F and the motion segmentation network M are trained jointly using the loss of the composite optical flow;
S26, the static-region network R, the dynamic-region network F and the motion segmentation network M alternate within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
As shown in FIG. 6, the training cycle of S26 comprises a first phase and a second phase:
The first phase is the competition phase (FIG. 6, left). The motion segmentation network M acts as the coordination network that trains the two competing networks formed by the static-region network R and the dynamic-region network F, minimizing the energy function

E_1 = Σ_{I∈D} m ⊙ L_R(I) + (1 − m) ⊙ L_F(I)

where ⊙ denotes the element-wise product, m is the output of M and defines the partition between the competitors, Ω is the set of spatial pixels, D is the set of unlabeled training data that the dynamic/static segmentation divides into two disjoint subsets, L_R is the loss function of the static-region network and L_F is the loss function of the dynamic-region network; during the competition each competing network tries to minimize its loss on the pixels assigned to it by the partition.
The second phase is the collaboration phase (FIG. 6, right). The two competing networks (R and F) collaborate to train the coordination network M, so that the data can be divided more accurately in the next cycle, by minimizing an energy function that adds to the terms above the loss L_M expressing the consensus between the competitors R and F:

E_2 = Σ_{I∈D} m ⊙ L_R(I) + (1 − m) ⊙ L_F(I) + L_M
In one embodiment of the present invention, D_θ, C_φ, F_ψ and M_χ denote the depth estimation network, the camera motion network, the optical flow network and the motion segmentation network, respectively, where the subscripts are the parameters of each network. Three consecutive frames I_−, I, I_+ are used to compute C_φ and M_χ. The depth estimate of the target image is:

d = D_θ(I)   (3)

The camera motion is estimated from the respective image frames I_−, I, I_+ as:

e_−, e_+ = C_φ(I_−, I, I_+)   (4)

The optical flow of a static scene needs only the camera motion network and the depth estimation network and depends only on the scene structure. The segmentation masks for each pair of target and reference images are:

m_−, m_+ = M_χ(I_−, I, I_+)   (5)

where m_−, m_+ ∈ [0, 1]^Ω are the probabilities of the static region over the set of spatial pixels Ω. Finally, the optical flow network F_ψ takes two frames as input and estimates their optical flow; the backward flow estimate u_− and the forward flow estimate u_+ share the network weights:

u_− = F_ψ(I, I_−), u_+ = F_ψ(I, I_+)   (6)
E = λ_R E_R + λ_F E_F + λ_M E_M + λ_C E_C + λ_S E_S   (7)

where {λ_R, λ_F, λ_M, λ_C, λ_S} are weights, E_R and E_F are the minimization objectives of the static-region network and the moving-region network respectively, E_M is the minimization objective of the segmentation network and determines how the competition data are assigned to the two competing networks (too large a weight λ_M pushes more pixels into the static region), E_C is the consistency loss governing the collaboration part, and E_S is the smoothness term. E_R minimizes the photometric loss of the static scene:

E_R = Σ_{s∈{−,+}} Σ_Ω m_s ⊙ ρ(I, w_c(I_s, e_s, d))

where Ω is the set of spatial pixels, ρ is the robust error function, I_s are the two images adjacent to the target image, i.e. the reference frames, e_s is the camera motion estimate of a reference frame, m_s is the probability that a pixel of the reference frame belongs to the static region, and w_c is the camera warping function that maps the reference frame to the target image I according to the depth estimate d and the camera motion estimate e;
likewise, E_F minimizes the photometric loss of the moving region:

E_F = Σ_{s∈{−,+}} Σ_Ω (1 − m_s) ⊙ ρ(I, w_f(I_s, u_s))

where u_s is the optical flow estimate of a reference frame and w_f is the flow warping function that maps the reference frame to the target image I according to the optical flow estimate u;
the robust loss is computed as:

ρ(x, y) = λ_ρ ‖x − y‖_1 + (1 − λ_ρ) (1 − SSIM(x, y)) / 2

where λ_ρ is a fixed weight of 0.01, x and y are the two images, and the second part is the structural similarity loss, with SSIM(x, y) = (2μ_x μ_y + c_1)(2σ_{xy} + c_2) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)), where μ_x, σ_x are the local mean and standard deviation around pixel x, μ_y, σ_y are the local mean and standard deviation around pixel y, and c_1 = 0.01², c_2 = 0.03²;
The minimization objective E_M of the motion segmentation network reduces the cross entropy H between the mask and the all-ones vector, governed by λ_M:

E_M = Σ_{s∈{−,+}} Σ_Ω H(1, m_s)

A larger λ_M biases the segmentation toward the static-region network R, i.e. toward treating the scene as static.
v(e, d) denotes the optical flow induced by the camera motion estimate e and the depth estimate d. The consistency loss E_C constrains the segmentation mask: moving objects are segmented according to the consistency between the static scene flow generated by v(e, d) and the optical flow estimated by F_ψ. E_C combines two indicator functions, each equal to 1 when its condition holds. The first indicator assigns the mask by comparing the robust loss of the static-region hypothesis, ρ_R = ρ(I, w_c(I_s, e_s, d)), with the robust loss of the dynamic-region hypothesis, ρ_F = ρ(I, w_f(I_s, u_s)), i.e. the photometric losses of the same pixel. In the second indicator, a threshold λ_C tests whether the static scene flow generated by v(e, d) is similar to the optical flow u, in which case the pixel belongs to a static scene. The symbol ∨ denotes the logical OR between the two indicators: if the photometric loss of R is lower than that of F, or the flow of R is similar to that of F, the consistency loss E_C classifies the pixel as a static-region pixel.
Finally, the smoothness term E_S constrains the depth estimate, the segmentation and the optical flow by penalizing their first derivatives in the spatial directions, where λ_e ensures that the smoothness is guided by the image edges.
Depth and camera motion are output directly by their networks, while the final motion segmentation combines the mask estimated by the mask network M_χ with the consistency between the scene flow inferred from R = (D, C) and the optical flow estimated by F_ψ: the first term is the mask probability inferred from M_χ, and the second term corrects the mask using the consistency between the flows inferred from R = (D, C) and F_ψ. Finally, the complete optical flow u between (I, I_+) combines the static-scene flow with the optical flow of the moving objects alone:

u = m_+ ⊙ v(e_+, d) + (1 − m_+) ⊙ u_+
S3, inputting the optical flow images output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
As can be seen from the CNN-M network structure in FIG. 4 and the convolutional 3D network architecture in FIG. 1, each of the two feature-extraction convolutional networks in the dual-stream network is followed by its own softmax layer, which classifies the extracted features.
S6, fusing the feature classification results at the score level with a multi-class support vector machine (multi-class SVM) to obtain a correct recognition of the human action. An L2 regularization penalty is added to the calculation of the loss function of the multi-class SVM to remove the ambiguity of the weights. Let x_i be the image feature contained in the i-th sample and y_i the label of its correct class, and let the linear scoring function f(x_i, W) compute the scores of the different classes, where W is the weight matrix and the score of the j-th class is s_j = f(x_i, W)_j. The score s_j of an incorrect class j is compared with the score s_{y_i} of the correct class y_i, and the loss function accumulates the margins of the incorrect classes. For the i-th sample, the multi-class SVM loss is:

L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + Δ)

where Δ is the margin by which the score of the correct class must exceed the score of an incorrect class, and the max function takes the larger of s_j − s_{y_i} + Δ and 0; L_i = 0 means that the given sample x_i is classified correctly;
if the loss function contained only this error part, the weight W would not be unique. To remove this ambiguity of W, a regularization penalty is added to the error part. The regularization penalty adopted in this embodiment is the L2 norm, which suppresses large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where k and l index the rows and columns of W, respectively; the regularization part depends only on the weights and not on the data.
Substituting the linear scoring function gives the complete loss function of the multi-class SVM:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where N is the number of training samples and λ is a hyperparameter.
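As a worked illustration of this loss only, the following sketch evaluates the multi-class SVM loss with the L2 penalty in NumPy; the function name, margin, λ value and the random placeholder features and labels are assumptions for the example and are not values from the source.

```python
import numpy as np

def multiclass_svm_loss(W, X, y, delta=1.0, lam=0.5):
    """Multi-class SVM loss with an L2 regularization penalty, as in the formula above.

    W: (D, C) weight matrix, X: (N, D) image features, y: (N,) correct class labels.
    delta is the margin and lam the regularization hyperparameter.
    """
    scores = X @ W                                     # f(x_i, W): (N, C) class scores
    correct = scores[np.arange(len(y)), y][:, None]    # score of the correct class y_i
    margins = np.maximum(0, scores - correct + delta)  # hinge margins per class
    margins[np.arange(len(y)), y] = 0                  # do not count j == y_i
    data_loss = margins.sum() / len(y)                 # average over N training samples
    reg_loss = lam * np.sum(W ** 2)                    # L2 penalty: sum_k sum_l W_{k,l}^2
    return data_loss + reg_loss

# Illustrative usage with random placeholder features and labels.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10)) * 0.01   # 20-dim fused features, 10 action classes
X = rng.normal(size=(8, 20))
y = rng.integers(0, 10, size=8)
print(multiclass_svm_loss(W, X, y))
```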
S7, outputting the final recognition result.
In conclusion, the invention introduces the competitive collaboration network into the dual-stream network and exploits its ability to segment the dynamic and static regions of a video, so that the optical flow network outputs an optical flow map containing only moving pixels as the input of the temporal stream convolutional network in the dual-stream network. This greatly alleviates the interference of the static environment with action recognition and improves its accuracy. Combining this with the C3D network to form the dual-stream network makes full use of both the temporal and the spatial information in the video. In addition, the competitive collaboration network used in this embodiment can be trained on unlabeled data sets, and its competition-collaboration property means that the motion segmentation network is jointly trained by the static-region network (composed of the depth estimation network and the camera motion network) and the dynamic-region network (represented by the optical flow network) before the next dynamic/static pixel segmentation, so that each segmentation becomes more accurate, the interference in the optical flow map fed to the temporal stream convolutional network decreases, and the action recognition accuracy increases.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (7)
1. A human body action recognition system based on a dual-stream convolutional network with a competitive collaboration network, comprising:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
characterized by further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network, wherein the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
2. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the competitive collaboration network comprises a static-region network, a dynamic-region network and a motion segmentation network; the static-region network comprises a depth estimation network and a camera motion network, and the dynamic-region network is an optical flow network.
3. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the temporal stream convolutional network is a medium-scale convolutional network comprising 5 convolutional layers, 2 fully connected layers and one softmax layer; the input image size of the medium-scale convolutional network is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1.
4. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the spatial stream convolutional network is a convolutional 3D network with 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer; all 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units.
5. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the fusion network is a multi-class support vector machine which adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments.
6. A human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network of the system of any one of claims 1-5, characterized by comprising the following steps:
S1, inputting a video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
S3, inputting the optical flow image sequence output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
S6, fusing the feature classification results at the score level with a multi-class support vector machine to obtain a correct recognition of the human action; the multi-class support vector machine adds an L2 regularization penalty to the calculation of its loss function, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments;
S7, outputting the final recognition result.
7. The method of claim 6, wherein the specific steps by which the competitive collaboration network performs dynamic/static pixel segmentation on the video frames in S2 are:
S21, the static-region network estimates the optical flow of the static region through the depth estimation network and the camera motion network, thereby predicting the static-region pixels;
S22, the dynamic-region network estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network coordinates the competition between the static-region network and the dynamic-region network and removes the static-region pixels from the dynamic-region network, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network, the dynamic-region network and the motion segmentation network are trained jointly using the loss of the composite optical flow;
S26, the static-region network, the dynamic-region network and the motion segmentation network alternately perform dynamic/static region segmentation within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
wherein the training cycle of S26 comprises a first phase and a second phase:
first phase: the motion segmentation network acts as the coordination network that trains the two competing networks formed by the static-region network and the dynamic-region network, minimizing an energy function;
second phase: the two competing networks collaborate to train the coordination network, minimizing the energy function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710147.5A CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710147.5A CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111931603A true CN111931603A (en) | 2020-11-13 |
CN111931603B CN111931603B (en) | 2024-01-12 |
Family
ID=73315975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010710147.5A Active CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111931603B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507803A (en) * | 2020-11-16 | 2021-03-16 | 北京理工大学 | Gait recognition method based on double-flow network |
CN112597975A (en) * | 2021-02-26 | 2021-04-02 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN113537232A (en) * | 2021-05-31 | 2021-10-22 | 大连民族大学 | Double-channel interactive time convolution network, close-range video motion segmentation method, computer system and medium |
WO2023159898A1 (en) * | 2022-02-25 | 2023-08-31 | 国网智能电网研究院有限公司 | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103035006A (en) * | 2012-12-14 | 2013-04-10 | 南京大学 | High-resolution aerial image partition method based on LEGION and under assisting of LiDAR |
WO2019103188A1 (en) * | 2017-11-23 | 2019-05-31 | 주식회사 아이메디신 | System and method for evaluating traumatic brain damage through brain wave analysis |
CN110889375A (en) * | 2019-11-28 | 2020-03-17 | 长沙理工大学 | Hidden and double-flow cooperative learning network and method for behavior recognition |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103035006A (en) * | 2012-12-14 | 2013-04-10 | 南京大学 | High-resolution aerial image partition method based on LEGION and under assisting of LiDAR |
WO2019103188A1 (en) * | 2017-11-23 | 2019-05-31 | 주식회사 아이메디신 | System and method for evaluating traumatic brain damage through brain wave analysis |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
CN110889375A (en) * | 2019-11-28 | 2020-03-17 | 长沙理工大学 | Hidden and double-flow cooperative learning network and method for behavior recognition |
Non-Patent Citations (1)
Title |
---|
ANURAG RANJAN等: "Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 12240 - 12249 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507803A (en) * | 2020-11-16 | 2021-03-16 | 北京理工大学 | Gait recognition method based on double-flow network |
CN112597975A (en) * | 2021-02-26 | 2021-04-02 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN112597975B (en) * | 2021-02-26 | 2021-06-08 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN113537232A (en) * | 2021-05-31 | 2021-10-22 | 大连民族大学 | Double-channel interactive time convolution network, close-range video motion segmentation method, computer system and medium |
CN113537232B (en) * | 2021-05-31 | 2023-08-22 | 大连民族大学 | Dual-channel interaction time convolution network, close-range video motion segmentation method, computer system and medium |
WO2023159898A1 (en) * | 2022-02-25 | 2023-08-31 | 国网智能电网研究院有限公司 | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111931603B (en) | 2024-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097568B (en) | Video object detection and segmentation method based on space-time dual-branch network | |
Zhu et al. | Hidden two-stream convolutional networks for action recognition | |
Melekhov et al. | Dgc-net: Dense geometric correspondence network | |
CN111311666B (en) | Monocular vision odometer method integrating edge features and deep learning | |
CN111931603B (en) | Human body action recognition system and method of double-flow convolution network based on competitive network | |
Ding et al. | Spatio-temporal recurrent networks for event-based optical flow estimation | |
CN111080675B (en) | Target tracking method based on space-time constraint correlation filtering | |
CN112836640B (en) | Single-camera multi-target pedestrian tracking method | |
Bruce et al. | Multimodal fusion via teacher-student network for indoor action recognition | |
US20170161591A1 (en) | System and method for deep-learning based object tracking | |
CN110889375B (en) | Hidden-double-flow cooperative learning network and method for behavior recognition | |
Huang et al. | Efficient image stitching of continuous image sequence with image and seam selections | |
WO2019197021A1 (en) | Device and method for instance-level segmentation of an image | |
Saribas et al. | TRAT: Tracking by attention using spatio-temporal features | |
CN113312973B (en) | Gesture recognition key point feature extraction method and system | |
US20220207679A1 (en) | Method and apparatus for stitching images | |
Maddalena et al. | Exploiting color and depth for background subtraction | |
CN107146219B (en) | Image significance detection method based on manifold regularization support vector machine | |
CN109255382A (en) | For the nerve network system of picture match positioning, method and device | |
CN111127519A (en) | Target tracking control system and method for dual-model fusion | |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium | |
Saunders et al. | Dyna-dm: Dynamic object-aware self-supervised monocular depth maps | |
CN110866939B (en) | Robot motion state identification method based on camera pose estimation and deep learning | |
Lee et al. | Instance-wise depth and motion learning from monocular videos | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||