CN109993096B - Optical flow multilayer frame feature propagation and aggregation method for video object detection - Google Patents

Optical flow multilayer frame feature propagation and aggregation method for video object detection

Info

Publication number
CN109993096B
CN109993096B (application CN201910230235.2A)
Authority
CN
China
Prior art keywords
feature
network
layer
frame
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910230235.2A
Other languages
Chinese (zh)
Other versions
CN109993096A (en)
Inventor
张斌
柳波
郭军
刘晨
张娅杰
刘文凤
王馨悦
王嘉怡
李薇
陈文博
侯帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201910230235.2A priority Critical patent/CN109993096B/en
Publication of CN109993096A publication Critical patent/CN109993096A/en
Application granted granted Critical
Publication of CN109993096B publication Critical patent/CN109993096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an optical flow multi-layer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts multi-layer features of adjacent frames through a feature network and extracts the optical flow through an optical flow network; the optical flow is then up-sampled or down-sampled to match layers with different step sizes and used to propagate the multi-layer frame-level features of the frames before and after the current frame to the current frame, yielding multi-layer propagated features. These propagated features are then aggregated layer by layer, and the final multi-layer aggregated frame-level features are used for video object detection. The method allows the output frame-level aggregated features to combine the high resolution of the shallow layers with the high-level semantic features of the deep layers of the network, which improves detection performance; the multi-layer feature aggregation in particular improves the detection of small objects.

Description

Optical flow multilayer frame feature propagation and aggregation method for video object detection
Technical Field
The invention relates to the technical field of computer vision, in particular to a video object detection-oriented optical flow multi-layer frame feature propagation and aggregation method.
Background
Video object detection methods at home and abroad can currently be divided into two main categories: frame-level methods and optical-flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks: motion between video frames is modelled with optical flow, the features of adjacent frames are propagated to the current frame using the inter-frame optical flow, and the features of the current frame are predicted or enhanced. Although optical flow can be used for spatial transformation at the feature level, existing methods such as DFF and FGFA propagate the features extracted by the last residual block (res5) of the residual network, and errors of the optical flow network leave local features misaligned, which causes two problems. First, the features extracted by res5 have low resolution but a high semantic level, so each pixel carries very rich semantic information; if detection is performed directly on propagated features containing errors, or after aggregating them, without any mechanism to correct the erroneous pixels, detection performance suffers directly. Second, each pixel of the res5 features has a large receptive field on the original image: a small object in a video below 64 × 64 resolution corresponds to a feature region of less than 4 × 4 in res5, so the error of a single pixel affects small-object detection far more than it affects the detection of large objects above 150 × 150 resolution. In image object detection, multiple layers of the feature network are commonly used for detection simultaneously to improve accuracy, especially for small objects; this is known as a feature pyramid.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an optical flow multi-layer frame feature propagation and aggregation method for video object detection, so as to propagate and aggregate multi-layer frame-level features by means of optical flow.
In order to solve the above technical problems, the technical solution adopted by the invention is as follows: an optical flow multi-layer frame feature propagation and aggregation method for video object detection comprises two parts, namely an optical-flow-based multi-layer frame-level feature extraction and propagation process and a frame-level feature aggregation process based on multi-layer propagated features;
The optical-flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extracting multi-layer features of adjacent frames of the video;
A residual network, ResNet-101, is used as the feature network for extracting frame-level features; the ResNet-101 network has different step sizes at different layers, the output step size of the last three layers of residual block res5 is modified to 16, a dilated convolution layer is added at the end of the network, and the dimension of the features output by residual block res5 is reduced;
Step S2: extracting the optical flow of the video with a FlowNet optical flow network, and post-processing the optical flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extracting the optical flow of the video using the Simple version of the FlowNet network; two adjacent video frames are concatenated along the channel dimension, and the resulting 6-channel image is input into the FlowNet network to extract the optical flow;
Step S2.2: up-sampling and down-sampling the optical flow in order to match the size of the features;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is:

$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)

where s denotes the feature step size;
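As an illustration of steps S2.2.1 to S2.2.4, a minimal PyTorch-style sketch of resizing a stride-8 optical flow to step sizes 4 and 16 is given below; the function name, tensor layout and the rescaling of the flow values when the grid resolution changes are assumptions of this sketch rather than details fixed by the invention.

```python
import torch
import torch.nn.functional as F

def resize_flow(flow8: torch.Tensor) -> dict:
    """Build flows matched to feature step sizes {4, 8, 16} from a stride-8 flow.

    flow8: tensor of shape (N, 2, H, W), the optical flow predicted at step size 8.
    Returns a dict mapping step size -> flow tensor.
    """
    # Step size 4: nearest-neighbor up-sampling doubles the spatial size (formula (2)).
    flow4 = F.interpolate(flow8, scale_factor=2, mode="nearest")
    # Step size 16: average pooling halves the spatial size (formula (3)).
    flow16 = F.avg_pool2d(flow8, kernel_size=2, stride=2)
    # Assumption: displacements are expressed in units of the grid they live on,
    # so their magnitudes are rescaled together with the resolution.
    return {4: flow4 * 2.0, 8: flow8, 16: flow16 * 0.5}

# Usage sketch: flows = resize_flow(torch.zeros(1, 2, 38, 63))
```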
and step S3: propagating the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by using the optical flow to obtain multi-layer propagation characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$;
Given the multi-step optical flow $M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
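To make the warp mapping of formulas (5) to (8) concrete, the following sketch warps a feature map with an optical flow using bilinear sampling; normalizing the sampling grid for grid_sample and treating the flow as an offset in feature-grid pixels are implementation assumptions, not details prescribed above.

```python
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an adjacent frame's feature map feat (N, C, H, W) to the current
    frame using flow (N, 2, H, W); flow[:, 0] is the x-offset and flow[:, 1]
    the y-offset, measured in pixels of the feature grid (assumed)."""
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    # p + delta p for every position p of the current frame.
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```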
the frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by a first layer of a network of characteristics
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
step C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
A shallow mapping network $\mathcal{E}(\cdot)$ is used to map the features to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
Given the current frame feature $f_{i}$ and the feature $f_{i-t \to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) Directly extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which serve as the frame-level aggregation weights in the aggregation steps above;
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
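A sketch of the scaled cosine similarity weighting of formulas (14) to (17), including the SoftMax normalization over frames, is given below; `embed` and `scale_net` stand for the mapping network and the weight scaling network, and including the current frame in the list of candidate features is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_weights(f_cur, candidates, embed, scale_net):
    """candidates: list of features aligned to the current frame (for example
    [f_{i-t->i}, f_i, f_{i+t->i}]), each of shape (N, C, H, W).
    Returns one weight tensor per candidate, normalized over the candidates
    at every spatial position."""
    e_cur = embed(f_cur)                                   # formula (12)
    raw = []
    for f in candidates:
        e = embed(f)                                       # formula (13)
        # Cosine similarity over the channel dimension -> (N, 1, H, W), formula (14).
        cos = F.cosine_similarity(e_cur, e, dim=1, eps=1e-6).unsqueeze(1)
        # Channel-level scaling factor lambda, formula (15); shape (N, C, 1, 1).
        lam = scale_net(f_cur, f)
        raw.append(lam * cos)                              # formulas (16)/(17)
    stacked = torch.stack(raw, dim=0)                      # (T, N, C, H, W)
    stacked = F.softmax(stacked, dim=0)                    # normalize over frames
    return list(stacked.unbind(dim=0))
```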
the mapping network and the weight scaling network share the first two layers, convolving two successive convolutional layers with 1 × 1 convolution and 3 × 3 convolution after the 1024-dimensional vector output by ResNet-101, and thenTwo branch subnets are connected at the back; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA0002006472780000051
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
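One possible PyTorch layout of the shared trunk and the two branch sub-networks described above is sketched here; the intermediate channel widths and the 256-dimensional embedding are assumptions, while the 1024-dimensional input and the 1024-dimensional scaling vector follow the text.

```python
import torch
import torch.nn as nn

class EmbedAndScale(nn.Module):
    """Shared 1x1 + 3x3 trunk followed by an embedding branch (mapping
    network) and a channel-wise weight-scaling branch."""

    def __init__(self, in_ch: int = 1024, mid_ch: int = 512, embed_ch: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 1: 1x1 convolution producing the embedded features f^e.
        self.embed = nn.Conv2d(mid_ch, embed_ch, kernel_size=1)
        # Branch 2: 1x1 convolution + global average pooling producing a
        # 1024-dimensional scaling vector, one value per ResNet-101 channel.
        self.scale = nn.Sequential(
            nn.Conv2d(mid_ch, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feat: torch.Tensor):
        h = self.trunk(feat)
        return self.embed(h), self.scale(h)  # (N, 256, H, W), (N, 1024, 1, 1)
```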
The beneficial effects of the above technical solution are as follows: in the optical flow multi-layer frame feature propagation and aggregation method for video object detection, features are propagated at the shallow output layers of the feature network (the res3 and res4 layers). On one hand, the shallow layers have high resolution, so the fault tolerance for small objects during feature propagation is high; on the other hand, propagation errors of the shallow layers can be weakened, and even gradually corrected, by the subsequent layers of the network. Features are then propagated simultaneously at the shallow and deep layers of the feature network and the deep and shallow features are aggregated, so that the high-level semantic features of the deep layers are exploited while the high resolution of the shallow features is retained. The output frame-level aggregated features thus combine the high resolution of the shallow layers with the high-dimensional semantic features of the deep layers, which improves detection performance; the multi-layer feature aggregation in particular improves the detection of small objects.
Drawings
FIG. 1 is a flowchart of an optical flow multi-layer frame feature propagation and aggregation method for video object detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the optical flow-based multi-layer feature propagation and aggregation process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the FlowNet network structure (Simple version) according to an embodiment of the present invention;
FIG. 4 is a comparison graph of the detection performance of different network layers according to an embodiment of the present invention;
FIG. 5 is a histogram of the ground-truth box area distribution of the ImageNet VID validation set and its grouping according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
This implementation takes the video data set ImageNet VID as an example and validates the optical flow multi-layer frame feature propagation and aggregation method for video object detection on this video data;
a method for propagating and aggregating optical flow multi-layer frame features for video object detection is disclosed, as shown in FIG. 1 and FIG. 2, which comprises two parts, namely a multi-layer frame-level feature extraction and propagation process based on optical flow and a frame-level feature aggregation process based on multi-layer propagation features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
A residual network, ResNet-101, is used as the feature network for extracting frame-level features; the ResNet-101 network has different step sizes at different layers; following the R-FCN network, the output step size of the last three layers of residual block res5 is modified to 16, a dilated convolution layer is added at the end of the network, and the dimension of the features output by res5 is reduced;
in this embodiment, a modified ResNet-101 network is used as a feature network for extracting features at the frame level, and detailed step size and spatial scale statistics for each layer are shown in Table 1.ResNet-101 has different step sizes on different layers of the network, modifies the output step size of the last three layers res5a _ relu, res5b _ relu to 16, and adds an extended convolution layer feat _ conv _3 × 3_ relu of dilate =6, kernel =3, pad =6, num _filters = 1024.
TABLE 1 ResNet-101 layer step size statistics
Number  Layer of ResNet-101   Step size  Spatial scale
1       res2a_relu            4          1/4
2       res2b_relu            4          1/4
3       res2c_relu            4          1/4
4       res3a_relu            8          1/8
5       res3b1_relu           8          1/8
6       res3b2_relu           8          1/8
7       res3b3_relu           8          1/8
8       res4a_relu            16         1/16
9       res4b1_relu           16         1/16
10      res4b2_relu           16         1/16
…       …                     …          …
30      res4b22_relu          16         1/16
31      res5a_relu            16         1/16
32      res5b_relu            16         1/16
33      feat_conv_3×3_relu    16         1/16
Owing to the structure of the residual network, only the output layers of the residual modules are counted; internal layers are not counted and cannot be used for feature propagation. In Table 1, Number is the index of the corresponding network layer, Layer enumerates all network-layer outputs of ResNet-101 except the first two layers, Step size is the feature step size of the corresponding layer output, and Spatial scale is the ratio of the size of the corresponding layer output to the size of the original picture. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu layers are used for multi-layer feature propagation.
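One way to collect the outputs of the layers chosen above (res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu) for multi-layer propagation is with forward hooks, as sketched below; the `backbone` module and the exact sub-module names are assumed to follow the naming of Table 1.

```python
import torch

def collect_layer_outputs(backbone, images, layer_names):
    """Run `backbone` on `images` and return the outputs of the named
    sub-modules (for example the layers selected in Table 1)."""
    outputs, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            outputs[name] = output
        return hook

    named = dict(backbone.named_modules())
    for name in layer_names:
        handles.append(named[name].register_forward_hook(make_hook(name)))
    with torch.no_grad():
        backbone(images)
    for h in handles:
        h.remove()
    return outputs

# Usage sketch (layer names assumed):
# feats = collect_layer_outputs(resnet101, frames,
#     ["res2b_relu", "res3b3_relu", "res4b22_relu", "feat_conv_3x3_relu"])
```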
Step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting optical flow of the video using a Simple version of the FlowNet network as shown in figure 3; directly connecting two adjacent frames of the video image in series in the channel dimension, and inputting the 6-channel image after the connection in series into a FlowNet network to extract an optical flow;
the FlowNet network extracts the characteristics containing high-dimensional semantic information of two frames of images through downsampling CNN;
firstly, an average pooling layer with a window size of 2 multiplied by 2 and a step length of 2 is used for reducing the size of an original input picture by half, then the abstract level of features is improved through 9 continuous convolution layers, and meanwhile, the feature size is changed into 1/32 of the original feature size;
the output characteristic diagram of the down-sampling CNN has high semantic meaning, but the resolution ratio is low, compared with the original diagram, the characteristic diagram loses detail information among a plurality of images in the process of adopting the characteristic diagram, and the optical flow effect obtained by the characteristic is poor, so that the FlowNet network introduces a refining module after the down-sampling CNN, improves the characteristic resolution ratio and learns the high-quality optical flow among the images;
the refining module is based on the FCN thought, adopts deconvolution operation similar to FCN, improves the resolution of the features, meanwhile supplements lost detail information by combining the output features of the front layer, and finally outputs a dual-channel optical flow; the network structure of the refining module is as follows: firstly, doubling the size of a feature map by deconvolution, then serially connecting the feature map with a corresponding convolution layer output feature map in a down-sampling CNN along the channel dimension to serve as the input of the next layer, wherein the same basically applies to the following process, and the difference is that a flow branch is used for learning an optical flow with a corresponding size each time, and the optical flow is serially connected to the output feature map along the channel dimension to serve as the input of the next layer;
step S2.2: upsampling and downsampling the optical flow in order to match the size of the features;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is:
$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)

where s denotes the feature step size;
and step S3: propagating the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by using the optical flow to obtain multi-layer propagation characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$.
In this embodiment, in order to propagate the multi-layer features, the same optical flow is used for all layers with the same step size; for example, the layers from res4a_relu up to the dilated convolution layer feat_conv_3×3_relu all propagate their features with the optical flow of step size 16.
Given the multi-step optical flow $M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number and corresponds to the Number column in Table 1, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
the frame level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by the first layer of the characteristic network
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
and C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, which greatly enhances the representational power of the current frame features.
The calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
A shallow mapping network $\mathcal{E}(\cdot)$ is used to map the features to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
Given the current frame feature $f_{i}$ and the feature $f_{i-t \to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) Directly extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which serve as the frame-level aggregation weights in the aggregation steps above;
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA0002006472780000105
The second branch is also a 1 × 1 convolution, and then a global average pooling layer is connected as a weight scalingAnd the network generates a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector and used for measuring the importance degree of the features and controlling the scaling of the feature time aggregation weight.
This embodiment tests the outputs of the three standard blocks of ResNet-101, i.e. the output res3c_relu of the res3 block, the output res4b22_relu of the res4 block and the output conv_3×3_feat of the res5 block; in addition, layers are sampled roughly every 5 layers around res3c_relu and every 3 layers within the res4 block, so that 9 layers, numbered (2, 7, 12, 19, 21, 24, 27, 30, 33), are finally sampled for testing. The mean average precision of detection is shown in FIG. 4. As can be seen from FIG. 4, res4b22_relu gives the best accuracy, conv_3×3_feat the second best and res3c_relu the worst. Before the 17th layer the performance of the earlier layers drops quickly, the differences in mean average precision among the later layers are small, and the detection accuracy peaks at the 30th layer. This verifies that propagating features at layers shallower than the deepest layer can perform better, but that the gain saturates as the layers become shallower; the increased resolution even makes optical flow prediction more difficult and lowers the overall detection performance.
This example was tested on the ImageNet VID validation set. FGFA with its feature propagation layer adjusted is used as the baseline for each level; the test results are shown in Table 2.
TABLE 2 comparison of aggregate accuracy of multilayer and monolayer propagation characteristics
[Table 2 appears as an image in the original publication; it reports the mean average precision of feature aggregation with single-layer propagation (the last layer of res5, as in FGFA, and the last layer of res4) and with multi-layer propagation (the last layers of res4 and res5).]
From the experimental results in Table 2 it can be seen that feature aggregation using propagation at the last layer of res4 (res4b22_relu) performs better than using the last layer of res5 (as in FGFA), i.e. propagating features at a shallower layer and passing them through the deeper network performs better. The results also show that propagating and aggregating the features of both res4 and res5 further improves detection performance (72.1 → 73.6, an increase of 1.5), verifying the benefit of multi-layer feature aggregation for detection accuracy.
To further demonstrate the improvement that the multi-layer feature aggregation method brings to small-object detection, the VID validation set is divided into three groups (small, medium and large) according to ground-truth box area, as shown in FIG. 5. The size criterion is: objects with an area in (0, 64²) are small, objects with an area in (64², 150²) are medium, and objects larger than 150² are large. This embodiment counts the proportion of each group in the validation set, as shown in FIG. 5: large objects are the majority (60.0%) of the VID validation set, while small objects are the minority (13.5%). The performance of single deep-layer (last layer of res5) feature propagation, single shallow-layer (last layer of res4) feature propagation and fused multi-layer (last layers of res4 + res5) feature propagation is compared on these three groups of the ImageNet VID validation set; the test results are shown in Table 3.
TABLE 3 detection accuracy of different methods on ImageNet VID validation set for different sized targets
Method             Mean average precision (%), small   Mean average precision (%), medium   Mean average precision (%), large
FGFA (res5)        26.9                                51.4                                 83.0
FGFA (res4)        29.5                                50.8                                 84.1
FGFA (res4+res5)   30.1                                51.9                                 84.5
As can be seen from Table 3, shallow-layer feature aggregation achieves higher detection performance on small objects than deep-layer feature aggregation (26.9% → 29.5%, an increase of 2.6 points), indicating that for small-object detection the errors of shallow feature propagation have less impact than the errors of deep feature propagation. Aggregating the shallow and deep features together achieves the best detection performance on every subset of the validation set, which shows that fusing deep and shallow features improves detection more comprehensively and that the multi-layer feature aggregation algorithm of the present invention combines the respective advantages of the multi-layer features well.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit of the invention, which is defined by the claims.

Claims (4)

1. A method for propagating and aggregating optical flow multi-layer frame features for video object detection, characterized in that the method comprises a multi-layer frame-level feature extraction and propagation process based on optical flow and a frame-level feature aggregation process based on multi-layer propagated features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
using a residual error network ResNet-101 network as a feature network for extracting frame level features, wherein the ResNet-101 network has different step sizes on different layers, modifying the output step size of the last three layers of a residual error block res5 to be 16, adding an expansion convolutional layer at the end of the network, and reducing the dimension of the features output by the residual error block res 5;
step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting optical flow of the video by using a Simple version of the FlowNet network; directly connecting two adjacent frames of the video image in series in the channel dimension, and inputting the 6-channel image after the connection in series into a FlowNet network to extract an optical flow;
step S2.2: in order to match the size of the features, up-sampling and down-sampling are carried out on the optical flow to obtain the optical flow suitable for multi-layer feature propagation;
and step S3: transmitting the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by utilizing the optical flow to obtain multi-layer transmission characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$;
The frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by the first layer of the characteristic network
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
and C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, which greatly enhances the representational power of the current frame features;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
(2) Extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which are used as the frame-level aggregation weights.
2. The method according to claim 1, wherein the specific method of step S2.2 is:
step S2.2.1: current frame image I of given video i And its adjacent frame image I i-t Then, the optical flow output by the FlowNet network is as follows:
$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)
where s represents the feature step size.
3. The method according to claim 1, wherein the specific method of step S3 comprises the following steps:
given multi-step optical flow
$M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
4. the method of claim 1, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for modeling the quality distribution of the optical flow by using the cosine similarity weight in the step C3 comprises the following steps:
using a shallow mapping network
$\mathcal{E}(\cdot)$, the features are mapped to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
the specific method for extracting the scaling factor from the appearance characteristics of the video frame and modeling the quality distribution of the video frame to obtain the scaling cosine similarity weight at the frame level comprises the following steps:
given current frame feature f i And the feature f propagated by adjacent frames i-t→i Then the cosine similarity between them at spatial position p is:
$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train.
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 x 1 convolution and 3 x 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure FDA0002006472770000048
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
CN201910230235.2A 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection Active CN109993096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910230235.2A CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910230235.2A CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Publications (2)

Publication Number Publication Date
CN109993096A CN109993096A (en) 2019-07-09
CN109993096B true CN109993096B (en) 2022-12-20

Family

ID=67131468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910230235.2A Active CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Country Status (1)

Country Link
CN (1) CN109993096B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400305A (en) * 2019-07-26 2019-11-01 哈尔滨理工大学 A kind of object detection method based on deep learning
CN110852199A (en) * 2019-10-28 2020-02-28 中国石化销售股份有限公司华南分公司 Foreground extraction method based on double-frame coding and decoding model
CN110866509B (en) * 2019-11-20 2023-04-28 腾讯科技(深圳)有限公司 Action recognition method, device, computer storage medium and computer equipment
CN111144376B (en) * 2019-12-31 2023-12-05 华南理工大学 Video target detection feature extraction method
CN113673545A (en) * 2020-05-13 2021-11-19 华为技术有限公司 Optical flow estimation method, related device, equipment and computer readable storage medium
CN112307872B (en) * 2020-06-12 2024-09-24 北京京东尚科信息技术有限公司 Method and device for detecting target object
CN111860293B (en) * 2020-07-16 2023-12-22 中南民族大学 Remote sensing scene classification method, device, terminal equipment and storage medium
CN111950612B (en) * 2020-07-30 2021-06-01 中国科学院大学 FPN-based weak and small target detection method for fusion factor
CN112307889B (en) * 2020-09-22 2022-07-26 北京航空航天大学 Face detection algorithm based on small auxiliary network
CN112394356B (en) * 2020-09-30 2024-04-02 桂林电子科技大学 Small target unmanned aerial vehicle detection system and method based on U-Net
CN111968064B (en) * 2020-10-22 2021-01-15 成都睿沿科技有限公司 Image processing method and device, electronic equipment and storage medium
CN112966581B (en) * 2021-02-25 2022-05-27 厦门大学 Video target detection method based on internal and external semantic aggregation
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113570608B (en) * 2021-06-30 2023-07-21 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152627B2 (en) * 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks

Also Published As

Publication number Publication date
CN109993096A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109993096B (en) Optical flow multilayer frame feature propagation and aggregation method for video object detection
CN111583109B (en) Image super-resolution method based on generation of countermeasure network
CN110119780B (en) Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN109087273B (en) Image restoration method, storage medium and system based on enhanced neural network
CN109035142B (en) Satellite image super-resolution method combining countermeasure network with aerial image prior
WO2018161775A1 (en) Neural network model training method, device and storage medium for image processing
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN111179167A (en) Image super-resolution method based on multi-stage attention enhancement network
CN107274347A (en) A kind of video super-resolution method for reconstructing based on depth residual error network
CN106204447A (en) The super resolution ratio reconstruction method with convolutional neural networks is divided based on total variance
CN109993095A (en) A kind of other characteristic aggregation method of frame level towards video object detection
CN110889895A (en) Face video super-resolution reconstruction method fusing single-frame reconstruction network
JP2019067403A (en) Learning method and learning device for image segmentation, and image segmentation method and image segmentation device using the same
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN111931857B (en) MSCFF-based low-illumination target detection method
CN105657402A (en) Depth map recovery method
CN108765282B (en) Real-time super-resolution method and system based on FPGA
CN111986085B (en) Image super-resolution method based on depth feedback attention network system
CN110136067B (en) Real-time image generation method for super-resolution B-mode ultrasound image
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
Sun et al. Learning local quality-aware structures of salient regions for stereoscopic images via deep neural networks
CN114842216A (en) Indoor RGB-D image semantic segmentation method based on wavelet transformation
CN115760814A (en) Remote sensing image fusion method and system based on double-coupling deep neural network
CN114626984A (en) Super-resolution reconstruction method for Chinese text image
Wang et al. Underwater image super-resolution and enhancement via progressive frequency-interleaved network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant