CN109993096B - Optical flow multilayer frame feature propagation and aggregation method for video object detection - Google Patents
Info
- Publication number
- CN109993096B (application number CN201910230235.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- layer
- frame
- optical flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention provides an optical flow multi-layer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts multi-layer features of adjacent frames with a feature network and extracts the optical flow with an optical flow network; it then uses the optical flow to propagate the multi-layer frame-level features of the frames before and after the current frame to the current frame, up-sampling or down-sampling the optical flow for layers with different strides to obtain multi-layer propagated features; the propagated features of each layer are then aggregated layer by layer, finally producing multi-layer aggregated frame-level features used for the final video object detection. The optical flow multi-layer frame feature propagation and aggregation method for video object detection lets the output frame-level aggregated features combine the high resolution of the shallow layers of the network with the high-dimensional semantic features of the deep layers, which improves detection performance, and the multi-layer feature aggregation in particular improves the detection of small objects.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a video object detection-oriented optical flow multi-layer frame feature propagation and aggregation method.
Background
At present, video object detection methods at home and abroad fall mainly into two categories: frame-level methods and optical-flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks, modeling the motion between video frames with optical flow, propagating the features of adjacent frames to the current frame with the inter-frame optical flow, and predicting or enhancing the features of the current frame. Although optical flow can be used for spatial transformation at the feature level, when inter-frame features are propagated with optical flow information (for example, when DFF and FGFA propagate the features extracted by the last residual block res5 of the residual network), errors of the optical flow network leave local features misaligned, which causes two problems. First, the features extracted by res5 have low resolution and a high semantic level, so the semantic information carried by each pixel is very rich; if detection is performed directly on the erroneous propagated features, or after they are aggregated, without any mechanism to correct the erroneous pixels, detection performance suffers directly. Second, each pixel of the res5 features has a large receptive field on the original image; some small objects in video are below 64 × 64 resolution, so the corresponding region of res5 feature values is smaller than 4 × 4, and the influence of an error at a single pixel on detecting such small objects is far greater than on detecting large objects above 150 × 150 resolution. In the field of image object detection, multiple layers of the feature network are commonly used for detection simultaneously to improve detection accuracy, especially for small objects, an approach known as the feature pyramid.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an optical flow multi-layer frame feature propagation and aggregation method for video object detection, so as to realize the propagation and aggregation of multi-layer frame features based on optical flow.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a method for propagating and aggregating optical flow multilayer frame features for detecting video objects comprises two parts, namely a multilayer frame level feature extraction and propagation process based on optical flows and a frame level feature aggregation process based on multilayer propagation features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
using the residual network ResNet-101 as the feature network for extracting frame-level features; the ResNet-101 network has different strides at different layers, the output stride of the last three layers of residual block res5 is modified to 16, a dilated convolutional layer is added at the end of the network, and the dimensionality of the features output by residual block res5 is reduced;
step S2: extracting the optical flow of the video by adopting a FlowNet optical flow network, and carrying out post-processing on the optical flow to carry out size transformation on the features with different sizes of each layer of the feature network;
step S2.1: extracting the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the optical flow;
step S2.2: upsampling and downsampling the optical flow in order to match the size of the features;
step S2.2.1: given the current frame image I_i of the video and its adjacent frame image I_{i-t}, the optical flow output by the FlowNet network is:
M^8_{i→i-t} = FlowNet(I_i, I_{i-t})
where M^8_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 8 indicates a feature stride of 8, and FlowNet(·) denotes the FlowNet optical flow network;
step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature stride of 4, as shown in the following formula:
M^4_{i→i-t} = Upsample(M^8_{i→i-t})
where M^4_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 4 indicates a feature stride of 4, and Upsample(·) denotes the nearest-neighbor upsampling function;
step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature stride of 16, as shown in the following formula:
M^16_{i→i-t} = Downsample(M^8_{i→i-t})
where M^16_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 16 indicates a feature stride of 16, and Downsample(·) denotes average-pooling downsampling;
step S2.2.4: if M^8_{i→i-t} ∈ R^{C×H×W}, where C is the number of channels (2 by default) and H and W are the height and width of the optical flow respectively, then correspondingly M^4_{i→i-t} ∈ R^{C×2H×2W} and M^16_{i→i-t} ∈ R^{C×(H/2)×(W/2)}; the optical flows suitable for propagating the multi-layer features are thus obtained as shown in the following formula:
M^s_{i→i-t}, s ∈ {4, 8, 16}
where s denotes the feature stride;
step S3: the multi-layer frame-level features of frame i-t and frame i+t are propagated to frame i using the optical flow to obtain the multi-layer propagated features f^l_{i-t→i} and f^l_{i+t→i};
given the multi-stride optical flow M^s_{i→i-t}, the propagated-feature layer index l and the frame image I_{i-t}, the final propagated feature is calculated by:
f^l_{i-t→i} = W(N_l(I_{i-t}), M^s_{i→i-t})
where l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network, N_l(·) denotes the output of the l-th layer of the feature network, and W(·) denotes the warp mapping function, which maps the value at position p of the frame feature f_{i-t} to the position p + δp corresponding to the current frame i, with δp denoting the position offset given by the optical flow;
the multi-layer propagated features of frame i+t are then calculated by the following formula:
f^l_{i+t→i} = W(N_l(I_{i+t}), M^s_{i→i+t})
the frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: the propagated features of the first layer of the feature network, f^1_{i-t→i} and f^1_{i+t→i}, are aggregated with the current frame feature f^1_i to obtain the aggregated feature of the first layer of the feature network, as shown in the following formula:
f̄^1_i = w^1_{i-t→i} f^1_{i-t→i} + w^1_i f^1_i + w^1_{i+t→i} f^1_{i+t→i}
where f̄^1_i is the aggregated feature of the first layer of the feature network and the w^1 terms are the scaled cosine similarity weights used to aggregate the first-layer features;
step C2: the aggregated feature f̄^1_i of step C1 is fed into the second layer of the feature network to obtain the current frame feature f^2_i; at the same time the propagated features of the second layer of the adjacent frames, f^2_{i-t→i} and f^2_{i+t→i}, are obtained, and the features are aggregated again to obtain the aggregated feature of the second layer of the feature network, as shown in the following formula:
f̄^2_i = w^2_{i-t→i} f^2_{i-t→i} + w^2_i f^2_i + w^2_{i+t→i} f^2_{i+t→i}
where f̄^2_i is the aggregated feature of the second layer of the feature network and the w^2 terms are the scaled cosine similarity weights used to aggregate the second-layer features;
step C3: the aggregation process is repeated, aggregating the frame-level features of each layer of the feature network one by one and taking the aggregated feature output by the previous layer as the current frame feature of the next layer, until the aggregated feature of the last layer of the feature network is obtained, as shown in the following formula:
f̄^n_i = w^n_{i-t→i} f^n_{i-t→i} + w^n_i f^n_i + w^n_{i+t→i} f^n_{i+t→i}
where f̄^n_i is the aggregated feature of the n-th layer of the feature network, the w^n terms are the scaled cosine similarity weights used to aggregate the n-th-layer features, and n is the total number of layers of the feature network;
the aggregated feature f̄^n_i of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network ε(·), the features are mapped to a dimension dedicated to computing similarity, as shown in the following equation:
f^e_i = ε(f_i),  f^e_{i-t→i} = ε(f_{i-t→i})
where f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and ε(·) is the mapping network;
given the current frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
w_{i-t→i}(p) = (f^e_{i-t→i}(p) · f^e_i(p)) / (|f^e_{i-t→i}(p)| |f^e_i(p)|)      (14)
the products in formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are the width and height of the features respectively; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) Scaling factors are extracted directly from the appearance features of the video frames to model the quality distribution of the video frames, giving frame-level scaled cosine similarity weights that are used as the frame-level aggregation weights in steps C1 to C3;
given the current frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling factor output by the weight-scaling network S(·) is:
λ_{i-t} = S(f_{i-t→i})      (15)
since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a matrix over the 2-dimensional spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:
w′_{i-t→i}(c, p) = λ_{i-t}(c) · w_{i-t→i}(p)      (16)
the scaled cosine similarity weight is thus obtained through formulas (14), (15) and (16);
accordingly, the weight of the propagated feature of frame i+t is:
w′_{i+t→i}(c, p) = λ_{i+t}(c) · w_{i+t→i}(p)
the weights at each position p are normalized across the multiple frames so that they sum to 1; the normalization is performed with a SoftMax function;
the mapping network and the weight-scaling network share their first two layers: two successive convolutional layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied to the 1024-dimensional features output by ResNet-101, followed by two branch subnetworks; the first branch is a 1 × 1 convolution serving as the mapping network that outputs the mapped features f^e; the second branch is also a 1 × 1 convolution followed by a global average pooling layer, serving as the weight-scaling network, which generates a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output features, used to measure the importance of the features and to control the scaling of the feature temporal aggregation weights.
The beneficial effects of the above technical scheme are as follows: in the optical flow multi-layer frame feature propagation and aggregation method for video object detection, features are propagated at the shallow output layers of the feature network (the res3 and res4 layers). On one hand, the shallow layers have high resolution, so the fault tolerance for small objects during feature propagation is high; on the other hand, propagation errors at the shallow layers can be attenuated, and even gradually corrected, by the subsequent layers of the network. Features are then propagated simultaneously at the shallow and deep layers of the feature network and the deep and shallow features are aggregated, thus exploiting the high-level semantic features of the deep layers while retaining the high resolution of the shallow features. The output frame-level aggregated features therefore combine the high resolution of the shallow network with the high-dimensional semantic features of the deep network, which improves detection performance, and the multi-layer feature aggregation method improves the detection of small objects in particular.
Drawings
FIG. 1 is a flowchart of an optical flow multi-layer frame feature propagation and aggregation method for video object detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of optical flow-based multi-layer feature propagation and aggregation process thereof according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a FlowNet network structure (simple version) according to an embodiment of the present invention;
FIG. 4 is a comparison graph of the detection performance of different network layers according to an embodiment of the present invention;
fig. 5 is a real box area distribution histogram of the ImageNet VID validation set and the grouping thereof according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
This embodiment takes the ImageNet VID video dataset as an example and validates the optical flow multi-layer frame feature propagation and aggregation method for video object detection on this video data;
a method for propagating and aggregating optical flow multi-layer frame features for video object detection is disclosed, as shown in FIG. 1 and FIG. 2, which comprises two parts, namely a multi-layer frame-level feature extraction and propagation process based on optical flow and a frame-level feature aggregation process based on multi-layer propagation features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
using the residual network ResNet-101 as the feature network for extracting frame-level features; following the R-FCN network, the output stride of the last three layers of residual block res5 is modified to 16, a dilated convolution layer is added at the end of the network, and the dimensionality of the features output by res5 is reduced;
in this embodiment, a modified ResNet-101 network is used as a feature network for extracting features at the frame level, and detailed step size and spatial scale statistics for each layer are shown in Table 1.ResNet-101 has different step sizes on different layers of the network, modifies the output step size of the last three layers res5a _ relu, res5b _ relu to 16, and adds an extended convolution layer feat _ conv _3 × 3_ relu of dilate =6, kernel =3, pad =6, num _filters = 1024.
TABLE 1 Stride statistics of the ResNet-101 layers
Number | ResNet-101 layer | Stride | Spatial scale
---|---|---|---
1 | res2a_relu | 4 | 1/4 |
2 | res2b_relu | 4 | 1/4 |
3 | res2c_relu | 4 | 1/4 |
4 | res3a_relu | 8 | 1/8 |
5 | res3b1_relu | 8 | 1/8 |
6 | res3b2_relu | 8 | 1/8 |
7 | res3b3_relu | 8 | 1/8 |
8 | res4a_relu | 16 | 1/16 |
9 | res4b1_relu | 16 | 1/16 |
10 | res4b2_relu | 16 | 1/16 |
… | … | … | … |
30 | res4b22_relu | 16 | 1/16 |
31 | res5a_relu | 16 | 1/16 |
32 | res5b_relu | 16 | 1/16 |
33 | feat_conv_3×3_relu | 16 | 1/16 |
Due to the structure of the residual network, only the output layers of the residual blocks are counted; internal layers are not counted and cannot be used for feature propagation. Number denotes the index of the corresponding network layer, the layer column enumerates all network-layer outputs of ResNet-101 except the first two layers, stride denotes the feature stride of the corresponding layer's output, and spatial scale denotes the scale of the corresponding layer's output relative to the original image. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu layers are used for multi-layer feature propagation.
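As an illustrative aid (not part of the patent), the following sketch shows one way to collect multi-layer frame-level feature maps from a standard ResNet-101 backbone using forward hooks; the torchvision layer names layer1 to layer4 are assumptions that only roughly correspond to the res2 to res5 outputs in Table 1, and the dilated stride-16 modification of res5 is not reproduced.

```python
import torch
import torchvision

# Minimal sketch: collect multi-layer frame-level features from ResNet-101.
# torchvision's layer1..layer4 stand in for the res2..res5 outputs of Table 1;
# the dilated stride-16 modification of res5 is not reproduced here.
backbone = torchvision.models.resnet101(weights=None).eval()
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output          # cache this layer's output for later propagation
    return hook

for name in ("layer1", "layer2", "layer3", "layer4"):
    getattr(backbone, name).register_forward_hook(make_hook(name))

frame = torch.randn(1, 3, 512, 512)      # one (dummy) video frame
with torch.no_grad():
    backbone(frame)                      # forward pass fills `features`
for name, feat in features.items():
    print(name, tuple(feat.shape))       # strides 4, 8, 16, 32 for layer1..layer4
```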
Step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting the optical flow of the video with the Simple version of the FlowNet network, as shown in Fig. 3; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the optical flow;
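A minimal sketch of preparing the FlowNet input described in step S2.1: the two adjacent frames are concatenated along the channel dimension into a 6-channel tensor. The flownet_s module below is a crude placeholder standing in for a real FlowNet-Simple network; it is an assumption used only for illustration.

```python
import torch
import torch.nn as nn

# Crude placeholder for a FlowNet-Simple network: any module mapping a
# 6-channel frame pair to a 2-channel flow field at stride 8 could be used.
flownet_s = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.Conv2d(64, 2, kernel_size=3, stride=4, padding=1),   # stand-in, overall stride 8
)

frame_i = torch.randn(1, 3, 384, 384)     # current frame I_i
frame_j = torch.randn(1, 3, 384, 384)     # adjacent frame I_{i-t}

pair = torch.cat([frame_i, frame_j], dim=1)   # 6-channel input, as in step S2.1
flow_s8 = flownet_s(pair)                     # 2-channel optical flow at stride 8
print(flow_s8.shape)                          # torch.Size([1, 2, 48, 48])
```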
the FlowNet network extracts the characteristics containing high-dimensional semantic information of two frames of images through downsampling CNN;
firstly, an average pooling layer with a window size of 2 multiplied by 2 and a step length of 2 is used for reducing the size of an original input picture by half, then the abstract level of features is improved through 9 continuous convolution layers, and meanwhile, the feature size is changed into 1/32 of the original feature size;
the output characteristic diagram of the down-sampling CNN has high semantic meaning, but the resolution ratio is low, compared with the original diagram, the characteristic diagram loses detail information among a plurality of images in the process of adopting the characteristic diagram, and the optical flow effect obtained by the characteristic is poor, so that the FlowNet network introduces a refining module after the down-sampling CNN, improves the characteristic resolution ratio and learns the high-quality optical flow among the images;
the refining module is based on the FCN thought, adopts deconvolution operation similar to FCN, improves the resolution of the features, meanwhile supplements lost detail information by combining the output features of the front layer, and finally outputs a dual-channel optical flow; the network structure of the refining module is as follows: firstly, doubling the size of a feature map by deconvolution, then serially connecting the feature map with a corresponding convolution layer output feature map in a down-sampling CNN along the channel dimension to serve as the input of the next layer, wherein the same basically applies to the following process, and the difference is that a flow branch is used for learning an optical flow with a corresponding size each time, and the optical flow is serially connected to the output feature map along the channel dimension to serve as the input of the next layer;
step S2.2: upsampling and downsampling the optical flow in order to match the size of the features;
step S2.2.1: given the current frame image I_i of the video and its adjacent frame image I_{i-t}, the optical flow output by the FlowNet network is:
M^8_{i→i-t} = FlowNet(I_i, I_{i-t})
where M^8_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 8 indicates a feature stride of 8, and FlowNet(·) denotes the FlowNet optical flow network;
step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature stride of 4, as shown in the following formula:
M^4_{i→i-t} = Upsample(M^8_{i→i-t})
where M^4_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 4 indicates a feature stride of 4, and Upsample(·) denotes the nearest-neighbor upsampling function;
step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature stride of 16, as shown in the following formula:
M^16_{i→i-t} = Downsample(M^8_{i→i-t})
where M^16_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 16 indicates a feature stride of 16, and Downsample(·) denotes average-pooling downsampling;
step S2.2.4: if M^8_{i→i-t} ∈ R^{C×H×W}, where C is the number of channels (2 by default) and H and W are the height and width of the optical flow respectively, then correspondingly M^4_{i→i-t} ∈ R^{C×2H×2W} and M^16_{i→i-t} ∈ R^{C×(H/2)×(W/2)}; the optical flows suitable for multi-layer feature propagation are thus obtained as shown in the following formula:
M^s_{i→i-t}, s ∈ {4, 8, 16}
where s denotes the feature stride;
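The following sketch resizes a stride-8 flow field to match feature strides 4 and 16 as in steps S2.2.2 to S2.2.4, using nearest-neighbor upsampling and average-pooling downsampling. Rescaling the flow magnitudes together with the spatial resize is an implementation assumption that the text does not state explicitly.

```python
import torch
import torch.nn.functional as F

def resize_flow(flow_s8: torch.Tensor) -> dict:
    """Match a stride-8 flow field to feature strides 4, 8 and 16.

    flow_s8: (N, 2, H, W) optical flow predicted at stride 8.
    Assumption: flow values are rescaled with the spatial resize factor so
    that displacements stay consistent at the new resolution.
    """
    flow_s4 = F.interpolate(flow_s8, scale_factor=2.0, mode="nearest") * 2.0   # step S2.2.2
    flow_s16 = F.avg_pool2d(flow_s8, kernel_size=2, stride=2) * 0.5            # step S2.2.3
    return {4: flow_s4, 8: flow_s8, 16: flow_s16}

flows = resize_flow(torch.randn(1, 2, 48, 48))
for stride, flow in flows.items():
    print(stride, tuple(flow.shape))    # (1,2,96,96), (1,2,48,48), (1,2,24,24)
```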
and step S3: propagating the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by using the optical flow to obtain multi-layer propagation characteristics
In this embodiment, in order to propagate the multi-layer features, the same optical flow is used for every layer with the same stride; for example, the layers from res4a_relu up to the dilated convolution layer feat_conv_3×3_relu all propagate their features with the stride-16 optical flow.
Given the multi-stride optical flow M^s_{i→i-t}, the propagated-feature layer index l and the frame image I_{i-t}, the final propagated feature is calculated by:
f^l_{i-t→i} = W(N_l(I_{i-t}), M^s_{i→i-t})
where l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network (corresponding to the Number column in Table 1), N_l(·) denotes the output of the l-th layer of the feature network, and W(·) denotes the warp mapping function, which maps the value at position p of the frame feature f_{i-t} to the position p + δp corresponding to the current frame i, with δp denoting the position offset given by the optical flow;
the multi-layer propagated features of frame i+t are then calculated by the following formula:
f^l_{i+t→i} = W(N_l(I_{i+t}), M^s_{i→i+t})
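One common way to realize the warp mapping function described here, propagating a feature map toward the current frame with an optical flow, is bilinear sampling via grid_sample; the sketch below illustrates that idea and is not necessarily the exact warp used in the patent.

```python
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp feat (N, C, H, W) of frame i-t toward frame i using flow (N, 2, H, W).

    flow[:, 0] and flow[:, 1] are assumed to hold x and y displacements in
    pixels at this feature resolution.
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(feat)   # (1, 2, H, W)
    coords = base + flow                                                # p + delta p
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                 # normalize x to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0                 # normalize y to [-1, 1]
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

propagated = warp_features(torch.randn(1, 1024, 48, 48), torch.randn(1, 2, 48, 48))
print(propagated.shape)    # torch.Size([1, 1024, 48, 48])
```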
the frame level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: the propagated features of the first layer of the feature network, f^1_{i-t→i} and f^1_{i+t→i}, are aggregated with the current frame feature f^1_i to obtain the aggregated feature of the first layer of the feature network, as shown in the following formula:
f̄^1_i = w^1_{i-t→i} f^1_{i-t→i} + w^1_i f^1_i + w^1_{i+t→i} f^1_{i+t→i}
where f̄^1_i is the aggregated feature of the first layer of the feature network and the w^1 terms are the scaled cosine similarity weights used to aggregate the first-layer features;
step C2: the aggregated feature f̄^1_i of step C1 is fed into the second layer of the feature network to obtain the current frame feature f^2_i; at the same time the propagated features of the second layer of the adjacent frames, f^2_{i-t→i} and f^2_{i+t→i}, are obtained, and the features are aggregated again to obtain the aggregated feature of the second layer of the feature network, as shown in the following formula:
f̄^2_i = w^2_{i-t→i} f^2_{i-t→i} + w^2_i f^2_i + w^2_{i+t→i} f^2_{i+t→i}
where f̄^2_i is the aggregated feature of the second layer of the feature network and the w^2 terms are the scaled cosine similarity weights used to aggregate the second-layer features;
step C3: the aggregation process is repeated, aggregating the frame-level features of each layer of the feature network one by one and taking the aggregated feature output by the previous layer as the current frame feature of the next layer, until the aggregated feature of the last layer of the feature network is obtained, as shown in the following formula:
f̄^n_i = w^n_{i-t→i} f^n_{i-t→i} + w^n_i f^n_i + w^n_{i+t→i} f^n_{i+t→i}
where f̄^n_i is the aggregated feature of the n-th layer of the feature network, the w^n terms are the scaled cosine similarity weights used to aggregate the n-th-layer features, and n is the total number of layers of the feature network;
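A schematic loop for the layer-by-layer aggregation of steps C1 to C3 is sketched below, under the simplifying assumption that the per-layer propagated features and normalized weights are already available; layer_blocks stands in for the stages of the feature network and is a name introduced here only for illustration.

```python
import torch
import torch.nn as nn
from typing import List

def aggregate_layers(frame: torch.Tensor,
                     layer_blocks: List[nn.Module],
                     propagated: List[List[torch.Tensor]],
                     weights: List[List[torch.Tensor]]) -> torch.Tensor:
    """Layer-by-layer aggregation of steps C1-C3 (schematic).

    propagated[l] holds the layer-l features warped from the adjacent frames
    (i-t and i+t); weights[l] holds the matching normalized weights, with the
    last entry belonging to the current-frame feature.
    """
    x = frame
    for l, block in enumerate(layer_blocks):
        current = block(x)                                  # current-frame feature of layer l
        feats = propagated[l] + [current]                   # adjacent-frame features + current feature
        x = sum(w * f for w, f in zip(weights[l], feats))   # aggregated feature of layer l
    return x                                                # aggregated feature of the last layer

# Tiny usage example with identity blocks and uniform weights.
blocks = [nn.Identity(), nn.Identity()]
f = torch.randn(1, 8, 4, 4)
props = [[torch.randn_like(f), torch.randn_like(f)] for _ in blocks]
ws = [[torch.full_like(f, 1.0 / 3)] * 3 for _ in blocks]
print(aggregate_layers(f, blocks, props, ws).shape)   # torch.Size([1, 8, 4, 4])
```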
the aggregated feature f̄^n_i of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, greatly enhancing the representational capability of the current frame features.
The calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network ε(·), the features are mapped to a dimension dedicated to computing similarity, as shown in the following equation:
f^e_i = ε(f_i),  f^e_{i-t→i} = ε(f_{i-t→i})
where f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and ε(·) is the mapping network;
given the current frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
w_{i-t→i}(p) = (f^e_{i-t→i}(p) · f^e_i(p)) / (|f^e_{i-t→i}(p)| |f^e_i(p)|)      (14)
the products in formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are the width and height of the features respectively; this reduces the number of weight parameters to be learned and makes the network easier to train;
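A sketch of the position-wise cosine similarity of formula (14): both features are passed through a small embedding network, the per-channel products are summed along the channel dimension and divided by the feature norms, giving a W × H weight map. The embedding layer widths used below are assumptions standing in for the shallow mapping network.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(                    # stand-in for the shallow mapping network
    nn.Conv2d(1024, 512, kernel_size=1), nn.ReLU(),
    nn.Conv2d(512, 2048, kernel_size=1),
)

def cosine_weight(cur: torch.Tensor, prop: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between embedded features at each spatial position p.

    cur, prop: (N, 1024, H, W); returns an (N, H, W) weight map, i.e. the
    channel-summed normalized product of formula (14).
    """
    e_cur, e_prop = embed(cur), embed(prop)
    num = (e_cur * e_prop).sum(dim=1)                     # sum of products along the channel
    den = e_cur.norm(dim=1) * e_prop.norm(dim=1) + 1e-6   # product of channel-wise norms
    return num / den

w = cosine_weight(torch.randn(1, 1024, 48, 48), torch.randn(1, 1024, 48, 48))
print(w.shape)    # torch.Size([1, 48, 48])
```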
(2) Scaling factors are extracted directly from the appearance features of the video frames to model the quality distribution of the video frames, giving frame-level scaled cosine similarity weights that are used as the frame-level aggregation weights in steps C1 to C3;
given the current frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling factor output by the weight-scaling network S(·) is:
λ_{i-t} = S(f_{i-t→i})      (15)
since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a matrix over the 2-dimensional spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:
w′_{i-t→i}(c, p) = λ_{i-t}(c) · w_{i-t→i}(p)      (16)
the scaled cosine similarity weight is thus obtained through formulas (14), (15) and (16);
accordingly, the weight of the propagated feature of frame i+t is:
w′_{i+t→i}(c, p) = λ_{i+t}(c) · w_{i+t→i}(p)
the weights at each position p are normalized across the multiple frames so that they sum to 1; the normalization is performed with a SoftMax function;
the mapping network and the weight-scaling network share their first two layers: two successive convolutional layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied to the 1024-dimensional features output by ResNet-101, followed by two branch subnetworks; the first branch is a 1 × 1 convolution serving as the mapping network that outputs the mapped features f^e; the second branch is also a 1 × 1 convolution followed by a global average pooling layer, serving as the weight-scaling network, which generates a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output features, used to measure the importance of the features and to control the scaling of the feature temporal aggregation weights.
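A sketch of the shared-trunk, two-branch subnetwork described above: a 1 × 1 and a 3 × 3 convolution shared by both branches, a 1 × 1 convolution as the mapping (embedding) branch, and a 1 × 1 convolution followed by global average pooling as the weight-scaling branch. Channel widths other than the 1024-dimensional input and output are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingAndScaling(nn.Module):
    """Shared trunk with a mapping (embedding) branch and a weight-scaling branch."""

    def __init__(self, in_channels: int = 1024, embed_channels: int = 2048):
        super().__init__()
        self.trunk = nn.Sequential(                               # shared first two layers
            nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.embed_branch = nn.Conv2d(512, embed_channels, kernel_size=1)   # mapping network
        self.scale_branch = nn.Sequential(                                  # weight-scaling network
            nn.Conv2d(512, in_channels, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),                              # global average pooling
        )

    def forward(self, feat: torch.Tensor):
        shared = self.trunk(feat)
        embedding = self.embed_branch(shared)                     # features used for cosine similarity
        scale = self.scale_branch(shared).flatten(1)              # per-channel scaling factor
        return embedding, scale

net = EmbeddingAndScaling()
emb, lam = net(torch.randn(1, 1024, 48, 48))
print(emb.shape, lam.shape)    # torch.Size([1, 2048, 48, 48]) torch.Size([1, 1024])
```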
This embodiment tests the outputs of the three standard blocks of ResNet-101, i.e., the output res3c_relu of the res3 block, the output res4b22_relu of the res4 block and the output conv_3×3_feat of the res5 block; in addition, layers are sampled roughly every 5 layers around res3c_relu and every 3 layers within the res4 block, giving 9 sampled layers for testing, numbered (2, 7, 12, 19, 21, 24, 27, 30, 33). The mean average precision of detection is shown in Fig. 4. As can be seen from Fig. 4, res4b22_relu gives the best accuracy, conv_3×3_feat the second best, and res3c_relu the worst. From about layer 17 downward the performance of the earlier layers drops more quickly, the differences in mean average precision among the later layers shrink, and detection accuracy peaks at layer 30. This verifies that propagating features at moderately shallow layers of the network performs better than at the deepest layer, but the gain saturates as the layers become shallower, and the higher resolution even increases the difficulty of optical flow prediction, reducing overall detection performance.
This embodiment was evaluated on the ImageNet VID validation set. FGFA with its feature-propagation layer adjusted is used as the baseline at each level, and the test results are shown in Table 2.
TABLE 2 Comparison of the aggregation accuracy of multi-layer and single-layer propagated features
From the experimental results in Table 2, it can be seen that feature aggregation using propagation at the last layer of res4 (res4b22_relu) is better than using the last layer of res5 (FGFA), i.e., propagating features at a shallower layer performs better than at the deepest layer. The results also show that propagating and aggregating the features of both res4 and res5 further improves detection performance (72.1 → 73.6, an increase of 1.5), verifying the improvement in detection accuracy brought by multi-layer feature aggregation.
To further demonstrate how the multi-layer feature aggregation method improves the detection of small objects, the VID validation set is divided into three groups, small, medium and large, according to the ground-truth box area, as shown in Fig. 5. The size criterion is: objects with an area in (0, 64²) are small, objects with an area in (64², 150²) are medium, and objects with an area greater than 150² are large. This embodiment counts the proportion of each group in the validation set, as shown in Fig. 5. As can be seen from Fig. 5, large objects form the majority of the VID validation set (60.0%) and small objects the minority (13.5%). The performance of propagating only deep features (last layer of res5), propagating only shallow features (last layer of res4), and fusing multi-layer propagated features (last layers of res4 + res5) is compared on these three groups of the ImageNet VID validation set; the test results are shown in Table 3.
TABLE 3 Detection accuracy of different methods on the ImageNet VID validation set for objects of different sizes
Method | Mean average precision (%) (small) | Mean average precision (%) (medium) | Mean average precision (%) (large)
---|---|---|---
FGFA(res5) | 26.9 | 51.4 | 83.0 |
FGFA(res4) | 29.5 | 50.8 | 84.1 |
FGFA(res4+res5) | 30.1 | 51.9 | 84.5 |
As can be seen from Table 3, shallow feature aggregation detects small objects better than deep feature aggregation (26.9% → 29.5%, an increase of 2.6%), indicating that, for small-object detection, errors in shallow feature propagation have less impact than errors in deep feature propagation. Aggregating the shallow and deep features together achieves the best detection performance on every subset of the validation set, showing that fusing deep and shallow features improves detection more comprehensively and that the multi-layer feature aggregation algorithm of the invention fuses the respective advantages of the multi-layer features well.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit of the invention, which is defined by the claims.
Claims (4)
1. A method for propagating and aggregating optical flow multi-layer frame features for video object detection, characterized in that:
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
using the residual network ResNet-101 as the feature network for extracting frame-level features, wherein the ResNet-101 network has different strides at different layers, the output stride of the last three layers of residual block res5 is modified to 16, a dilated convolutional layer is added at the end of the network, and the dimensionality of the features output by residual block res5 is reduced;
step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the optical flow;
step S2.2: in order to match the size of the features, the optical flow is up-sampled and down-sampled to obtain optical flows suitable for multi-layer feature propagation;
step S3: propagating the multi-layer frame-level features of frame i-t and frame i+t to frame i using the optical flow to obtain the multi-layer propagated features f^l_{i-t→i} and f^l_{i+t→i};
The frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: the propagated features of the first layer of the feature network, f^1_{i-t→i} and f^1_{i+t→i}, are aggregated with the current frame feature f^1_i to obtain the aggregated feature of the first layer of the feature network, as shown in the following formula:
f̄^1_i = w^1_{i-t→i} f^1_{i-t→i} + w^1_i f^1_i + w^1_{i+t→i} f^1_{i+t→i}
where f̄^1_i is the aggregated feature of the first layer of the feature network and the w^1 terms are the scaled cosine similarity weights used to aggregate the first-layer features;
step C2: the aggregated feature f̄^1_i of step C1 is fed into the second layer of the feature network to obtain the current frame feature f^2_i; at the same time the propagated features of the second layer of the adjacent frames, f^2_{i-t→i} and f^2_{i+t→i}, are obtained, and the features are aggregated again to obtain the aggregated feature of the second layer of the feature network, as shown in the following formula:
f̄^2_i = w^2_{i-t→i} f^2_{i-t→i} + w^2_i f^2_i + w^2_{i+t→i} f^2_{i+t→i}
where f̄^2_i is the aggregated feature of the second layer of the feature network and the w^2 terms are the scaled cosine similarity weights used to aggregate the second-layer features;
step C3: the aggregation process is repeated, aggregating the frame-level features of each layer of the feature network one by one and taking the aggregated feature output by the previous layer as the current frame feature of the next layer, until the aggregated feature of the last layer of the feature network is obtained, as shown in the following formula:
f̄^n_i = w^n_{i-t→i} f^n_{i-t→i} + w^n_i f^n_i + w^n_{i+t→i} f^n_{i+t→i}
where f̄^n_i is the aggregated feature of the n-th layer of the feature network, the w^n terms are the scaled cosine similarity weights used to aggregate the n-th-layer features, and n is the total number of layers of the feature network;
the aggregated feature f̄^n_i of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, greatly enhancing the representational capability of the current frame features;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
(2) Extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which are used as the frame-level aggregation weights.
2. The method of claim 1, wherein the method comprises: the specific method of step S2.2 is:
step S2.2.1: given the current frame image I_i of the video and its adjacent frame image I_{i-t}, the optical flow output by the FlowNet network is:
M^8_{i→i-t} = FlowNet(I_i, I_{i-t})
where M^8_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 8 indicates a feature stride of 8, and FlowNet(·) denotes the FlowNet optical flow network;
step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature stride of 4, as shown in the following formula:
M^4_{i→i-t} = Upsample(M^8_{i→i-t})
where M^4_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 4 indicates a feature stride of 4, and Upsample(·) denotes the nearest-neighbor upsampling function;
step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature stride of 16, as shown in the following formula:
M^16_{i→i-t} = Downsample(M^8_{i→i-t})
where M^16_{i→i-t} denotes the optical flow between the current frame I_i and its adjacent frame I_{i-t}, the superscript 16 indicates a feature stride of 16, and Downsample(·) denotes average-pooling downsampling;
step S2.2.4: if M^8_{i→i-t} ∈ R^{C×H×W}, where C is the number of channels (2 by default) and H and W are the height and width of the optical flow respectively, then correspondingly M^4_{i→i-t} ∈ R^{C×2H×2W} and M^16_{i→i-t} ∈ R^{C×(H/2)×(W/2)}; the optical flows suitable for multi-layer feature propagation are thus obtained as shown in the following formula:
M^s_{i→i-t}, s ∈ {4, 8, 16}
where s denotes the feature stride.
3. The method of claim 1, wherein the method comprises: the specific method of step S3 is:
given the multi-stride optical flow M^s_{i→i-t}, the propagated-feature layer index l and the frame image I_{i-t}, the final propagated feature is calculated by:
f^l_{i-t→i} = W(N_l(I_{i-t}), M^s_{i→i-t})
where l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network, N_l(·) denotes the output of the l-th layer of the feature network, and W(·) denotes the warp mapping function, which maps the value at position p of the frame feature f_{i-t} to the position p + δp corresponding to the current frame i, with δp denoting the position offset given by the optical flow;
the multi-layer propagated features of frame i+t are then calculated by the following formula:
f^l_{i+t→i} = W(N_l(I_{i+t}), M^s_{i→i+t})
4. The method of claim 1, wherein the specific method for modeling the quality distribution of the optical flow using the cosine similarity weights in step C3 comprises the following steps:
using a shallow mapping network ε(·), the features are mapped to a dimension dedicated to computing similarity, as shown in the following equation:
f^e_i = ε(f_i),  f^e_{i-t→i} = ε(f_{i-t→i})
where f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and ε(·) is the mapping network;
the specific method for extracting the scaling factor from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain the frame-level scaled cosine similarity weight comprises the following steps:
given the current frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
w_{i-t→i}(p) = (f^e_{i-t→i}(p) · f^e_i(p)) / (|f^e_{i-t→i}(p)| |f^e_i(p)|)      (14)
the products in formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are the width and height of the features respectively; this reduces the number of weight parameters to be learned and makes the network easier to train;
given the current frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling factor output by the weight-scaling network S(·) is:
λ_{i-t} = S(f_{i-t→i})      (15)
since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a matrix over the 2-dimensional spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:
w′_{i-t→i}(c, p) = λ_{i-t}(c) · w_{i-t→i}(p)      (16)
the scaled cosine similarity weight is thus obtained through formulas (14), (15) and (16);
accordingly, the weight of the propagated feature of frame i+t is:
w′_{i+t→i}(c, p) = λ_{i+t}(c) · w_{i+t→i}(p)
the weights at each position p are normalized across the multiple frames so that they sum to 1; the normalization is performed with a SoftMax function;
the mapping network and the weight-scaling network share their first two layers: two successive convolutional layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied to the 1024-dimensional features output by ResNet-101, followed by two branch subnetworks; the first branch is a 1 × 1 convolution serving as the mapping network that outputs the mapped features f^e; the second branch is also a 1 × 1 convolution followed by a global average pooling layer, serving as the weight-scaling network, which generates a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output features, used to measure the importance of the features and to control the scaling of the feature temporal aggregation weights.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910230235.2A CN109993096B (en) | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910230235.2A CN109993096B (en) | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109993096A CN109993096A (en) | 2019-07-09 |
CN109993096B true CN109993096B (en) | 2022-12-20 |
Family
ID=67131468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910230235.2A Active CN109993096B (en) | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109993096B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110400305A (en) * | 2019-07-26 | 2019-11-01 | 哈尔滨理工大学 | A kind of object detection method based on deep learning |
CN110852199A (en) * | 2019-10-28 | 2020-02-28 | 中国石化销售股份有限公司华南分公司 | Foreground extraction method based on double-frame coding and decoding model |
CN110866509B (en) * | 2019-11-20 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Action recognition method, device, computer storage medium and computer equipment |
CN111144376B (en) * | 2019-12-31 | 2023-12-05 | 华南理工大学 | Video target detection feature extraction method |
CN113673545A (en) * | 2020-05-13 | 2021-11-19 | 华为技术有限公司 | Optical flow estimation method, related device, equipment and computer readable storage medium |
CN112307872B (en) * | 2020-06-12 | 2024-09-24 | 北京京东尚科信息技术有限公司 | Method and device for detecting target object |
CN111860293B (en) * | 2020-07-16 | 2023-12-22 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium |
CN111950612B (en) * | 2020-07-30 | 2021-06-01 | 中国科学院大学 | FPN-based weak and small target detection method for fusion factor |
CN112307889B (en) * | 2020-09-22 | 2022-07-26 | 北京航空航天大学 | Face detection algorithm based on small auxiliary network |
CN112394356B (en) * | 2020-09-30 | 2024-04-02 | 桂林电子科技大学 | Small target unmanned aerial vehicle detection system and method based on U-Net |
CN111968064B (en) * | 2020-10-22 | 2021-01-15 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN112966581B (en) * | 2021-02-25 | 2022-05-27 | 厦门大学 | Video target detection method based on internal and external semantic aggregation |
CN113223044A (en) * | 2021-04-21 | 2021-08-06 | 西北工业大学 | Infrared video target detection method combining feature aggregation and attention mechanism |
CN113570608B (en) * | 2021-06-30 | 2023-07-21 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108242062A (en) * | 2017-12-27 | 2018-07-03 | 北京纵目安驰智能科技有限公司 | Method for tracking target, system, terminal and medium based on depth characteristic stream |
CN109376611A (en) * | 2018-09-27 | 2019-02-22 | 方玉明 | A kind of saliency detection method based on 3D convolutional neural networks |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10152627B2 (en) * | 2017-03-20 | 2018-12-11 | Microsoft Technology Licensing, Llc | Feature flow for video recognition |
-
2019
- 2019-03-26 CN CN201910230235.2A patent/CN109993096B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108242062A (en) * | 2017-12-27 | 2018-07-03 | 北京纵目安驰智能科技有限公司 | Method for tracking target, system, terminal and medium based on depth characteristic stream |
CN109376611A (en) * | 2018-09-27 | 2019-02-22 | 方玉明 | A kind of saliency detection method based on 3D convolutional neural networks |
Also Published As
Publication number | Publication date |
---|---|
CN109993096A (en) | 2019-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||