CN111931603A - Human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network - Google Patents
Human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network
- Publication number
- CN111931603A (application CN202010710147.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- convolutional
- video
- convolution
- optical flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Psychiatry (AREA)
- Human Computer Interaction (AREA)
- Social Psychology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network. The method comprises the following steps: a video is input into a network with a dual-stream architecture; in the temporal stream, the competitive collaboration network segments the video frames into dynamic and static pixels and outputs optical flow images with the static regions removed, which are fed into a medium-scale convolutional network (CNN-M) for feature extraction; in the spatial stream, the multi-frame video images are fed into a convolutional 3D (C3D) network, which extracts features from each video frame. The extracted features are classified by the softmax layers of the CNN-M and C3D networks respectively, and the classification results are then fused at the score level by a multi-class support vector machine to obtain a correct recognition of the human action, thereby reducing errors caused by external differences such as the environment and improving the accuracy of action recognition.
Description
Technical Field
The invention relates to the field of computer vision, in particular to the analysis and study of human body action recognition, and more particularly to a human body action recognition system and method based on a dual-stream convolutional network built on a competitive collaboration network architecture.
Background
In the information society, images and video account for a large share of all perceived information. Computer vision uses cameras and computers to acquire data and information about the objects being photographed, and can process image and video information automatically and efficiently, so the field is receiving more and more attention. Within this field, human action recognition is an important research direction: intelligent surveillance, video security and virtual reality all depend on human action recognition technology. Among the many human action recognition techniques, the dual-stream convolutional network not only extracts the spatial information of individual images but also captures the temporal information in a video frame sequence, imitating the way humans perceive and understand video.
A typical human action recognition pipeline proceeds as follows: first, moving objects are detected in the video; then features are extracted from the detected moving objects; finally, the extracted features are classified and recognized. Traditional action recognition methods usually work in these two dimensions, but their performance is not ideal. In recent years, the development of deep learning and its application to human action recognition have produced a number of methods that extract features automatically and further improve recognition rates. Three deep learning approaches are commonly used to recognize actions in video: the two-stream method, the C3D (Convolutional 3D network) method, and CNN-LSTM (CNN with Long Short-Term Memory). The two-stream method applies a dual-stream convolutional network to process the spatial and temporal information in the video separately, splitting it into a spatial stream convolutional network and a temporal stream convolutional network. RGB images are used as the input of the spatial stream, optical flow images as the input of the temporal stream, and multi-task training over combined data sets reduces overfitting and improves recognition accuracy.
However, the optical flow field used by the dual-stream convolutional network is strongly affected by environmental differences such as occlusion, multiple viewpoints, illumination and background, which degrades the accuracy of action recognition. Moreover, the optical flow images are computed over the whole video, so static regions in the video introduce noise into the temporal stream convolutional network. The competitive collaboration network addresses exactly this problem: it couples a depth estimation network, a camera motion network, an optical flow network and a motion segmentation network, so that moving regions can be separated from static regions.
Therefore, the invention applies a competitive collaboration network to effectively remove the interference of static pixels on dynamic pixels in video frames, reduce errors caused by external differences such as the environment, and accurately recognize human actions in video.
Disclosure of Invention
The invention provides a human body action recognition system and method based on a dual-stream convolutional network with a competitive collaboration network. The optical flow network inside the competitive collaboration network outputs an optical flow map containing only dynamic pixels, which serves as the input of the temporal stream convolutional network, removing environmental noise and improving the accuracy of human action recognition. At the same time, the static-region network (composed of a depth network and a camera motion network) and the dynamic-region network (represented by the optical flow network) jointly train the motion segmentation network, making the segmentation of dynamic and static regions more accurate and reducing errors caused by external differences such as the environment.
In order to achieve the above object, the present invention provides a human body action recognition system based on a dual-stream convolutional network with a competitive collaboration network, comprising:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
and further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network; the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
In an embodiment of the present invention, the competitive collaboration network comprises a static-region network, a dynamic-region network and a motion segmentation network; the static-region network comprises a depth estimation network and a camera motion network, and the dynamic-region network is an optical flow network;
in an embodiment of the present invention, the temporal stream convolutional network is a medium-scale convolutional network (CNN-M) comprising 5 convolutional layers, 2 fully connected layers and one softmax layer; the input image size is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1;
in an embodiment of the present invention, the spatial stream convolutional network is a convolutional 3D (C3D) network with 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer; all 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units;
in an embodiment of the present invention, the fusion network is a multi-class support vector machine which adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights. The L2 penalty suppresses large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments.
The invention also provides a human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network, comprising the following steps:
S1, inputting a video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
S3, inputting the optical flow image sequence output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
S6, fusing the feature classification results at the score level with a multi-class support vector machine to obtain a correct recognition of the human action; the multi-class support vector machine adds an L2 regularization penalty to the calculation of its loss function, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments;
S7, outputting the final recognition result.
In an embodiment of the present invention, the specific steps by which the competitive collaboration network performs dynamic/static pixel segmentation on the video frames in S2 are:
S21, the static-region network estimates the optical flow of the static region through the depth estimation network and the camera motion network, thereby predicting the static-region pixels;
S22, the dynamic-region network estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network coordinates the competition between the static-region network and the dynamic-region network and removes the static-region pixels from the dynamic-region network, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network, the dynamic-region network and the motion segmentation network are trained jointly using the loss of the composite optical flow;
S26, the static-region network, the dynamic-region network and the motion segmentation network alternately perform dynamic/static region segmentation within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
wherein the training cycle of S26 comprises a first phase and a second phase:
first phase: the motion segmentation network acts as the coordination network that trains the two competing networks formed by the static-region network and the dynamic-region network, minimizing an energy function;
second phase: the two competing networks collaborate to train the coordination network, minimizing the energy function.
The invention introduces a competitive collaboration network into the dual-stream network to segment dynamic and static pixels, so that the optical flow network outputs an optical flow map with the static-region pixels removed, which is used as the input of the temporal stream convolutional network. This reduces the errors caused by external differences such as the environment in the prior art and improves action recognition accuracy. At the same time, the temporal stream is combined with a spatial stream convolutional network formed by a convolutional 3D network into the dual-stream network, making full use of both the temporal and the spatial information in the video.
Drawings
FIG. 1 is a block diagram of the human body action recognition method of the dual-stream convolutional network based on the competitive collaboration network architecture according to the present invention;
FIG. 2 is the original dual-stream network model;
FIG. 3 is a schematic diagram of the competitive collaboration network;
FIG. 4 is a diagram of the CNN-M network architecture;
FIG. 5 is a schematic diagram of the difference between a 2D convolutional network and a 3D convolutional network;
FIG. 6 is a schematic diagram of the two phases of the competitive collaboration network.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
FIG. 1 is a block diagram of the human body action recognition method of the dual-stream convolutional network based on the competitive collaboration network architecture. As shown in FIG. 1, a video is input into the network with the dual-stream architecture: in the temporal stream, the competitive collaboration network segments the video frames into dynamic and static pixels and outputs optical flow images with the static regions removed, which are fed into the medium-scale convolutional network for feature extraction; in the spatial stream, the multi-frame video images are fed into the convolutional 3D network, which extracts features from each video frame. The extracted features are classified by the softmax layers of the medium-scale convolutional network and the convolutional 3D network respectively, and the classification results are then fused at the score level by a multi-class support vector machine to obtain a correct recognition of the human action.
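For concreteness, the score-fusion step at the end of this pipeline can be sketched as follows: each stream produces a per-class score vector for a video, the two vectors are concatenated, and a linear multi-class SVM with an L2 penalty is trained on them. This is a minimal sketch under assumed shapes; the array names, class count and random placeholder data are illustrative and not taken from the source.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in softmax score vectors from the two streams for N training videos.
N, num_classes = 200, 10
temporal_scores = np.random.rand(N, num_classes)    # CNN-M softmax output (placeholder)
spatial_scores = np.random.rand(N, num_classes)     # C3D softmax output (placeholder)
labels = np.random.randint(0, num_classes, size=N)  # action labels (placeholder)

# Concatenate the per-stream class scores and train a linear multi-class SVM with an
# L2 penalty, which plays the role of the score-fusion network.
fused = np.concatenate([temporal_scores, spatial_scores], axis=1)
svm = LinearSVC(penalty="l2", C=1.0, multi_class="crammer_singer", max_iter=10000)
svm.fit(fused, labels)

# At test time, the fused prediction for a video is the SVM decision on its scores.
predicted_actions = svm.predict(fused)
```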
The following describes the human body action recognition system and method based on the dual-stream convolutional network with the competitive collaboration network through a specific embodiment.
The human body action recognition system of the dual-stream convolutional network based on the competitive collaboration network provided by this embodiment uses the dual-stream network shown in FIG. 2 as the network architecture for feature extraction: each single video frame, i.e. an RGB image, serves as the spatial component and is input into the spatial stream convolutional network, while the dense optical flow between video frames is input into the temporal stream convolutional network; the two convolutional networks perform feature extraction and classification separately, and finally the results are fused. As shown in FIG. 1 and FIG. 2, the human body action recognition system of the dual-stream convolutional network based on the competitive collaboration network comprises:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
and further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network; the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
In the embodiment of the present invention, the competitive collaboration network comprises a static-region network R, a dynamic-region network F and a motion segmentation network M, as shown in FIG. 3, where the static-region network R comprises a depth estimation network D and a camera motion network C, i.e. it can be written R = (D, C), and the dynamic-region network is the optical flow network F;
in an embodiment of the present invention, the temporal stream convolutional network is a medium-scale convolutional network (CNN-M) comprising 5 convolutional layers, 2 fully connected layers and one softmax layer, as shown in FIG. 4. The input image size is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1. By increasing the number of filters and decreasing their size and stride, CNN-M can better discover and retain the detail information of the original input image.
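As an illustration only, the layer specification above can be expressed as the following PyTorch sketch. The kernel sizes, strides and layer counts follow the text; the channel widths, pooling placement, optical-flow input channel count and class count are assumptions and are not given by the source.

```python
import torch
import torch.nn as nn

class CNNM(nn.Module):
    """Sketch of the medium-scale (CNN-M) temporal-stream network described above.

    Kernel sizes and strides follow the text; channel widths, pooling layers,
    input depth and class count are assumptions.
    """
    def __init__(self, in_channels=20, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(       # two 4096-unit FC layers + softmax output
            nn.Flatten(),
            nn.Linear(512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                      # x: (batch, in_channels, 224, 224)
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```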
In the embodiment of the present invention, the spatial stream convolutional network is a convolutional 3D (C3D) network, as shown in FIG. 1: a 3D convolutional network with uniformly arranged 3 × 3 × 3 convolution kernels, having 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer. All 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is set to 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units.
In a 3D convolutional network, convolution and pooling operate over both space and time, whereas in a 2D convolutional network they operate only over space, as shown in FIG. 5, where FIG. 5(a) is a 2D convolutional network with multi-frame input and its output, and FIG. 5(b) is a 3D convolutional network with multi-frame input and its output. The video clip size is defined as c × L × H × W, where c is the number of channels, L is the number of frames, and H and W are the height and width of a frame, respectively. The 3D convolution and pooling kernel size is denoted d × k × k, where d is the temporal depth of the kernel and k is its spatial size. Through 3D convolution and 3D pooling, a 3D convolutional network can therefore model temporal information better than a 2D convolutional network.
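The following sketch shows one way the C3D spatial stream described above could be written in PyTorch. The conv/pool layer counts, kernel and stride sizes and the 4096-unit fully connected layers follow the text; the channel widths, the 3 × 16 × 112 × 112 input clip size and the class count are assumptions borrowed from the common C3D configuration.

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """Sketch of the spatial-stream convolutional 3D (C3D) network described above.

    Layer counts, kernel/pool sizes and 4096-unit FC layers follow the text;
    channel widths, the 3x16x112x112 clip size and class count are assumptions.
    """
    def __init__(self, num_classes=101):
        super().__init__()
        def conv(cin, cout):                   # every 3D conv is 3x3x3 with stride 1
            return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=1, padding=1),
                                 nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 64),                     nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),
            conv(64, 128),                   nn.MaxPool3d(2, stride=2),
            conv(128, 256), conv(256, 256),  nn.MaxPool3d(2, stride=2),
            conv(256, 512), conv(512, 512),  nn.MaxPool3d(2, stride=2),
            conv(512, 512), conv(512, 512),  nn.MaxPool3d(2, stride=2),
        )
        self.classifier = nn.Sequential(       # fc6, fc7 (4096 units) + softmax output
            nn.Flatten(),
            nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, clip):                   # clip: (batch, 3, 16, 112, 112)
        return torch.softmax(self.classifier(self.features(clip)), dim=1)
```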
In an embodiment of the present invention, the fusion network is a multi-class support vector machine (multi-class SVM) that adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights.
The embodiment of the invention also provides a human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network, comprising the following steps:
S1, inputting the video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
As shown in FIG. 3, the specific steps by which the competitive collaboration network performs moving/static pixel segmentation on the video frames in S2 are as follows:
S21, the static-region network R estimates the optical flow of the static region through the depth estimation network D and the camera motion network C, thereby predicting the static-region pixels;
S22, the dynamic-region network F estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network M coordinates the competition between the static-region network R and the dynamic-region network F and removes the static-region pixels from the dynamic-region network F, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network R, the dynamic-region network F and the motion segmentation network M are trained jointly using the loss of the composite optical flow;
S26, the static-region network R, the dynamic-region network F and the motion segmentation network M alternate within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
As shown in FIG. 6, the training cycle of S26 comprises a first phase and a second phase:
The first phase is the competition phase (FIG. 6, left). The motion segmentation network M acts as the coordination network that trains the two competing networks formed by the static-region network R and the dynamic-region network F, minimizing the energy function

E_1 = Σ_{I∈D} m ⊙ L_R(I) + (1 − m) ⊙ L_F(I)

where ⊙ denotes the element-wise product, m is the output of M and defines the partition between the competitors, Ω is the set of spatial pixels, D is the set of unlabeled training data that the dynamic/static segmentation divides into two disjoint subsets, L_R is the loss function of the static-region network and L_F is the loss function of the dynamic-region network; during the competition each competing network tries to minimize its loss on the pixels assigned to it by the partition.
The second phase is the collaboration phase (FIG. 6, right). The two competing networks (R and F) collaborate to train the coordination network M, so that the data can be divided more accurately in the next cycle, by minimizing an energy function that adds to the terms above the loss L_M expressing the consensus between the competitors R and F:

E_2 = Σ_{I∈D} m ⊙ L_R(I) + (1 − m) ⊙ L_F(I) + L_M
In one embodiment of the present invention, D_θ, C_φ, F_ψ and M_χ denote the depth estimation network, the camera motion network, the optical flow network and the motion segmentation network, respectively, where the subscripts are the parameters of each network. Three consecutive frames I_−, I, I_+ are used to compute C_φ and M_χ. The depth estimate of the target image is:

d = D_θ(I)   (3)

The camera motion is estimated from the respective image frames I_−, I, I_+ as:

e_−, e_+ = C_φ(I_−, I, I_+)   (4)

The optical flow of a static scene needs only the camera motion network and the depth estimation network and depends only on the scene structure. The segmentation masks for each pair of target and reference images are:

m_−, m_+ = M_χ(I_−, I, I_+)   (5)

where m_−, m_+ ∈ [0, 1]^Ω are the probabilities of the static region over the set of spatial pixels Ω. Finally, the optical flow network F_ψ takes two frames as input and estimates their optical flow; the backward flow estimate u_− and the forward flow estimate u_+ share the network weights:

u_− = F_ψ(I, I_−), u_+ = F_ψ(I, I_+)   (6)
E = λ_R E_R + λ_F E_F + λ_M E_M + λ_C E_C + λ_S E_S   (7)

where {λ_R, λ_F, λ_M, λ_C, λ_S} are weights, E_R and E_F are the minimization objectives of the static-region network and the moving-region network respectively, E_M is the minimization objective of the segmentation network and determines how the competition data are assigned to the two competing networks (too large a weight λ_M pushes more pixels into the static region), E_C is the consistency loss governing the collaboration part, and E_S is the smoothness term. E_R minimizes the photometric loss of the static scene:

E_R = Σ_{s∈{−,+}} Σ_Ω m_s ⊙ ρ(I, w_c(I_s, e_s, d))

where Ω is the set of spatial pixels, ρ is the robust error function, I_s are the two images adjacent to the target image, i.e. the reference frames, e_s is the camera motion estimate of a reference frame, m_s is the probability that a pixel of the reference frame belongs to the static region, and w_c is the camera warping function that maps the reference frame to the target image I according to the depth estimate d and the camera motion estimate e;
likewise, E_F minimizes the photometric loss of the moving region:

E_F = Σ_{s∈{−,+}} Σ_Ω (1 − m_s) ⊙ ρ(I, w_f(I_s, u_s))

where u_s is the optical flow estimate of a reference frame and w_f is the flow warping function that maps the reference frame to the target image I according to the optical flow estimate u;
the robust loss is computed as:

ρ(x, y) = λ_ρ ‖x − y‖_1 + (1 − λ_ρ) (1 − SSIM(x, y)) / 2

where λ_ρ is a fixed weight of 0.01, x and y are the two images, and the second part is the structural similarity loss, with SSIM(x, y) = (2μ_x μ_y + c_1)(2σ_{xy} + c_2) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)), where μ_x, σ_x are the local mean and standard deviation around pixel x, μ_y, σ_y are the local mean and standard deviation around pixel y, and c_1 = 0.01², c_2 = 0.03²;
The minimization objective E_M of the motion segmentation network reduces the cross entropy H between the mask and the all-ones vector, governed by λ_M:

E_M = Σ_{s∈{−,+}} Σ_Ω H(1, m_s)

A larger λ_M biases the segmentation toward the static-region network R, i.e. toward treating the scene as static.
v(e, d) denotes the optical flow induced by the camera motion estimate e and the depth estimate d. The consistency loss E_C constrains the segmentation mask: moving objects are segmented according to the consistency between the static scene flow generated by v(e, d) and the optical flow estimated by F_ψ. E_C combines two indicator functions, each equal to 1 when its condition holds. The first indicator assigns the mask by comparing the robust loss of the static-region hypothesis, ρ_R = ρ(I, w_c(I_s, e_s, d)), with the robust loss of the dynamic-region hypothesis, ρ_F = ρ(I, w_f(I_s, u_s)), i.e. the photometric losses of the same pixel. In the second indicator, a threshold λ_C tests whether the static scene flow generated by v(e, d) is similar to the optical flow u, in which case the pixel belongs to a static scene. The symbol ∨ denotes the logical OR between the two indicators: if the photometric loss of R is lower than that of F, or the flow of R is similar to that of F, the consistency loss E_C classifies the pixel as a static-region pixel.
Finally, the smoothness term E_S constrains the depth estimate, the segmentation and the optical flow by penalizing their first derivatives in the spatial directions, where λ_e ensures that the smoothness is guided by the image edges.
Depth and camera motion are output directly by their networks, while the final motion segmentation combines the mask estimated by the mask network M_χ with the consistency between the scene flow inferred from R = (D, C) and the optical flow estimated by F_ψ: the first term is the mask probability inferred from M_χ, and the second term corrects the mask using the consistency between the flows inferred from R = (D, C) and F_ψ. Finally, the complete optical flow u between (I, I_+) combines the static-scene flow with the optical flow of the moving objects alone:

u = m_+ ⊙ v(e_+, d) + (1 − m_+) ⊙ u_+
S3, inputting the optical flow images output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
As can be seen from the CNN-M network structure in FIG. 4 and the convolutional 3D network architecture in FIG. 1, each of the two feature-extraction convolutional networks in the dual-stream network is followed by its own softmax layer, which classifies the extracted features.
S6, fusing the feature classification results at the score level with a multi-class support vector machine (multi-class SVM) to obtain a correct recognition of the human action. An L2 regularization penalty is added to the calculation of the loss function of the multi-class SVM to remove the ambiguity of the weights. Let x_i be the image feature contained in the i-th sample and y_i the label of its correct class, and let the linear scoring function f(x_i, W) compute the scores of the different classes, where W is the weight matrix and the score of the j-th class is s_j = f(x_i, W)_j. The score s_j of an incorrect class j is compared with the score s_{y_i} of the correct class y_i, and the loss function accumulates the margins of the incorrect classes. For the i-th sample, the multi-class SVM loss is:

L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + Δ)

where Δ is the margin by which the score of the correct class must exceed the score of an incorrect class, and the max function takes the larger of s_j − s_{y_i} + Δ and 0; L_i = 0 means that the given sample x_i is classified correctly;
if the loss function contained only this error part, the weight W would not be unique. To remove this ambiguity of W, a regularization penalty is added to the error part. The regularization penalty adopted in this embodiment is the L2 norm, which suppresses large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where k and l index the rows and columns of W, respectively; the regularization part depends only on the weights and not on the data.
Substituting the linear scoring function gives the complete loss function of the multi-class SVM:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where N is the number of training samples and λ is a hyperparameter.
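As a worked illustration of this loss only, the following sketch evaluates the multi-class SVM loss with the L2 penalty in NumPy; the function name, margin, λ value and the random placeholder features and labels are assumptions for the example and are not values from the source.

```python
import numpy as np

def multiclass_svm_loss(W, X, y, delta=1.0, lam=0.5):
    """Multi-class SVM loss with an L2 regularization penalty, as in the formula above.

    W: (D, C) weight matrix, X: (N, D) image features, y: (N,) correct class labels.
    delta is the margin and lam the regularization hyperparameter.
    """
    scores = X @ W                                     # f(x_i, W): (N, C) class scores
    correct = scores[np.arange(len(y)), y][:, None]    # score of the correct class y_i
    margins = np.maximum(0, scores - correct + delta)  # hinge margins per class
    margins[np.arange(len(y)), y] = 0                  # do not count j == y_i
    data_loss = margins.sum() / len(y)                 # average over N training samples
    reg_loss = lam * np.sum(W ** 2)                    # L2 penalty: sum_k sum_l W_{k,l}^2
    return data_loss + reg_loss

# Illustrative usage with random placeholder features and labels.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10)) * 0.01   # 20-dim fused features, 10 action classes
X = rng.normal(size=(8, 20))
y = rng.integers(0, 10, size=8)
print(multiclass_svm_loss(W, X, y))
```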
S7, outputting the final recognition result.
In conclusion, the invention introduces the competitive collaboration network into the dual-stream network and exploits its ability to segment the dynamic and static regions of a video, so that the optical flow network outputs an optical flow map containing only moving pixels as the input of the temporal stream convolutional network in the dual-stream network. This greatly alleviates the interference of the static environment with action recognition and improves its accuracy. Combining this with the C3D network to form the dual-stream network makes full use of both the temporal and the spatial information in the video. In addition, the competitive collaboration network used in this embodiment can be trained on unlabeled data sets, and its competition-collaboration property means that the motion segmentation network is jointly trained by the static-region network (composed of the depth estimation network and the camera motion network) and the dynamic-region network (represented by the optical flow network) before the next dynamic/static pixel segmentation, so that each segmentation becomes more accurate, the interference in the optical flow map fed to the temporal stream convolutional network decreases, and the action recognition accuracy increases.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (7)
1. A human body action recognition system based on a dual-stream convolutional network with a competitive collaboration network, comprising:
a video input part, comprising the multi-frame image sequence of the video to be recognized and a single-frame plus multi-frame image sequence of that video;
a feature extraction part, connected with the video input part and comprising a spatial stream convolutional network and a temporal stream convolutional network, which respectively extract and classify features from the inter-frame dense optical flow of the multi-frame image sequence, on which dynamic/static pixel segmentation has been performed, and from the single-frame plus multi-frame image sequence;
a result fusion part, connected with the feature extraction part and comprising a fusion network, which fuses the classification results output by the temporal stream convolutional network and the spatial stream convolutional network;
characterized by further comprising:
a competitive collaboration network, included in the feature extraction part and connected with the temporal stream convolutional network, wherein the four networks contained in the competitive collaboration network are trained on the single-frame and multi-frame video image sequences, perform the segmentation of moving and static pixels, and output an optical flow image sequence containing only moving pixels.
2. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the competitive collaboration network comprises a static-region network, a dynamic-region network and a motion segmentation network; the static-region network comprises a depth estimation network and a camera motion network, and the dynamic-region network is an optical flow network.
3. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the temporal stream convolutional network is a medium-scale convolutional network comprising 5 convolutional layers, 2 fully connected layers and one softmax layer; the input image size of the medium-scale convolutional network is 224 × 224, the kernel size of the first convolutional layer is 7 × 7 with stride 2, the kernel size of the second convolutional layer is 5 × 5 with stride 2, and the kernel sizes of the third to fifth convolutional layers are all 3 × 3 with stride 1.
4. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the spatial stream convolutional network is a convolutional 3D network with 8 convolutional layers, 5 pooling layers, two fully connected layers and one softmax output layer; all 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, the kernel size of pooling layer 1 is 1 × 2 × 2 with stride 1 × 2 × 2, all remaining 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2, and each fully connected layer has 4096 output units.
5. The human body action recognition system based on the dual-stream convolutional network with the competitive collaboration network of claim 1, wherein the fusion network is a multi-class support vector machine which adds an L2 regularization penalty to the calculation of its loss function to remove the ambiguity of the weights, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments.
6. A human body action recognition method using the dual-stream convolutional network based on the competitive collaboration network of the system of any one of claims 1-5, characterized by comprising the following steps:
S1, inputting a video into the dual-stream convolutional network and the competitive collaboration network;
S2, the competitive collaboration network performs dynamic/static pixel segmentation on the video frames and outputs an optical flow image sequence from which static-region pixels have been removed and which contains only moving pixels;
S3, inputting the optical flow image sequence output by S2 into the medium-scale convolutional network and extracting features from the optical flow images;
S4, inputting the video of S1 into the convolutional 3D network and extracting features from each frame of the video;
S5, classifying the extracted features at the softmax layers of the medium-scale convolutional network of S3 and the convolutional 3D network of S4, respectively;
S6, fusing the feature classification results at the score level with a multi-class support vector machine to obtain a correct recognition of the human action; the multi-class support vector machine adds an L2 regularization penalty to the calculation of its loss function, the L2 penalty suppressing large weights by an element-wise squared penalty over all parameters:

R(W) = Σ_k Σ_l W_{k,l}²

where W is the weight matrix and k and l index its rows and columns, respectively;

the overall loss function of the multi-class support vector machine is:

L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ Σ_k Σ_l W_{k,l}²

where x_i is the image feature contained in the i-th sample, y_i is the label of its correct class, f(x_i, W) is a linear scoring function that computes the scores of the different classes, f(x_i, W)_j is the score of the j-th class, N is the number of training samples, λ is a hyperparameter, Δ is the margin by which the score of the correct class y_i must exceed the score of any incorrect class j, and the max function takes the larger of its two arguments;
S7, outputting the final recognition result.
7. The method of claim 6, wherein the specific steps by which the competitive collaboration network performs dynamic/static pixel segmentation on the video frames in S2 are:
S21, the static-region network estimates the optical flow of the static region through the depth estimation network and the camera motion network, thereby predicting the static-region pixels;
S22, the dynamic-region network estimates the optical flow from the multi-frame video images, thereby predicting the dynamic-region pixels;
S23, the static-region pixels predicted in S21 and the dynamic-region pixels predicted in S22 compete for the pixels of the training video frames;
S24, the motion segmentation network coordinates the competition between the static-region network and the dynamic-region network and removes the static-region pixels from the dynamic-region network, thereby generating a composite optical flow over the whole multi-frame video image;
S25, the static-region network, the dynamic-region network and the motion segmentation network are trained jointly using the loss of the composite optical flow;
S26, the static-region network, the dynamic-region network and the motion segmentation network alternately perform dynamic/static region segmentation within a training cycle and output optical flow images from which static-region pixels have been removed and which contain only moving pixels;
wherein the training cycle of S26 comprises a first phase and a second phase:
first phase: the motion segmentation network acts as the coordination network that trains the two competing networks formed by the static-region network and the dynamic-region network, minimizing an energy function;
second phase: the two competing networks collaborate to train the coordination network, minimizing the energy function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710147.5A CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710147.5A CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111931603A true CN111931603A (en) | 2020-11-13 |
CN111931603B CN111931603B (en) | 2024-01-12 |
Family
ID=73315975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010710147.5A Active CN111931603B (en) | 2020-07-22 | 2020-07-22 | Human body action recognition system and method of double-flow convolution network based on competitive network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111931603B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507803A (en) * | 2020-11-16 | 2021-03-16 | 北京理工大学 | Gait recognition method based on double-flow network |
CN112597975A (en) * | 2021-02-26 | 2021-04-02 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN113537232A (en) * | 2021-05-31 | 2021-10-22 | 大连民族大学 | Double-channel interactive time convolution network, close-range video motion segmentation method, computer system and medium |
WO2023159898A1 (en) * | 2022-02-25 | 2023-08-31 | 国网智能电网研究院有限公司 | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103035006A (en) * | 2012-12-14 | 2013-04-10 | 南京大学 | High-resolution aerial image partition method based on LEGION and under assisting of LiDAR |
WO2019103188A1 (en) * | 2017-11-23 | 2019-05-31 | 주식회사 아이메디신 | System and method for evaluating traumatic brain damage through brain wave analysis |
CN110889375A (en) * | 2019-11-28 | 2020-03-17 | 长沙理工大学 | Hidden and double-flow cooperative learning network and method for behavior recognition |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103035006A (en) * | 2012-12-14 | 2013-04-10 | 南京大学 | High-resolution aerial image partition method based on LEGION and under assisting of LiDAR |
WO2019103188A1 (en) * | 2017-11-23 | 2019-05-31 | 주식회사 아이메디신 | System and method for evaluating traumatic brain damage through brain wave analysis |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
CN110889375A (en) * | 2019-11-28 | 2020-03-17 | 长沙理工大学 | Hidden and double-flow cooperative learning network and method for behavior recognition |
Non-Patent Citations (1)
Title |
---|
ANURAG RANJAN等: "Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 12240 - 12249 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507803A (en) * | 2020-11-16 | 2021-03-16 | 北京理工大学 | Gait recognition method based on double-flow network |
CN112597975A (en) * | 2021-02-26 | 2021-04-02 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN112597975B (en) * | 2021-02-26 | 2021-06-08 | 上海闪马智能科技有限公司 | Fire smoke and projectile detection method and system based on video |
CN113537232A (en) * | 2021-05-31 | 2021-10-22 | 大连民族大学 | Double-channel interactive time convolution network, close-range video motion segmentation method, computer system and medium |
CN113537232B (en) * | 2021-05-31 | 2023-08-22 | 大连民族大学 | Dual-channel interaction time convolution network, close-range video motion segmentation method, computer system and medium |
WO2023159898A1 (en) * | 2022-02-25 | 2023-08-31 | 国网智能电网研究院有限公司 | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111931603B (en) | 2024-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097568B (en) | Video object detection and segmentation method based on space-time dual-branch network | |
Zhu et al. | Hidden two-stream convolutional networks for action recognition | |
Melekhov et al. | Dgc-net: Dense geometric correspondence network | |
CN111311666B (en) | Monocular vision odometer method integrating edge features and deep learning | |
CN111931603B (en) | Human body action recognition system and method of double-flow convolution network based on competitive network | |
Ding et al. | Spatio-temporal recurrent networks for event-based optical flow estimation | |
CN111080675B (en) | Target tracking method based on space-time constraint correlation filtering | |
CN112836640B (en) | Single-camera multi-target pedestrian tracking method | |
Bruce et al. | Multimodal fusion via teacher-student network for indoor action recognition | |
US20170161591A1 (en) | System and method for deep-learning based object tracking | |
CN110889375B (en) | Hidden-double-flow cooperative learning network and method for behavior recognition | |
Huang et al. | Efficient image stitching of continuous image sequence with image and seam selections | |
WO2019197021A1 (en) | Device and method for instance-level segmentation of an image | |
Saribas et al. | TRAT: Tracking by attention using spatio-temporal features | |
CN113312973B (en) | Gesture recognition key point feature extraction method and system | |
US20220207679A1 (en) | Method and apparatus for stitching images | |
Maddalena et al. | Exploiting color and depth for background subtraction | |
CN107146219B (en) | Image significance detection method based on manifold regularization support vector machine | |
CN109255382A (en) | For the nerve network system of picture match positioning, method and device | |
CN111127519A (en) | Target tracking control system and method for dual-model fusion | |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium | |
Saunders et al. | Dyna-dm: Dynamic object-aware self-supervised monocular depth maps | |
CN110866939B (en) | Robot motion state identification method based on camera pose estimation and deep learning | |
Lee et al. | Instance-wise depth and motion learning from monocular videos | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||