CN109993096B - Optical flow multilayer frame feature propagation and aggregation method for video object detection - Google Patents

Optical flow multilayer frame feature propagation and aggregation method for video object detection

Info

Publication number
CN109993096B
CN109993096B (application CN201910230235.2A)
Authority
CN
China
Prior art keywords
feature
network
layer
frame
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910230235.2A
Other languages
Chinese (zh)
Other versions
CN109993096A (en)
Inventor
张斌
柳波
郭军
刘晨
张娅杰
刘文凤
王馨悦
王嘉怡
李薇
陈文博
侯帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201910230235.2A priority Critical patent/CN109993096B/en
Publication of CN109993096A publication Critical patent/CN109993096A/en
Application granted granted Critical
Publication of CN109993096B publication Critical patent/CN109993096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an optical flow multi-layer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts multi-layer features of adjacent frames through a feature network and extracts the optical flow through an optical flow network; the optical flow is then up-sampled or down-sampled to match layers with different step sizes and used to propagate the multi-layer frame-level features of the frames before and after the current frame to the current frame, yielding multi-layer propagated features. These propagated features are then aggregated layer by layer, and the final multi-layer aggregated frame-level features are used for video object detection. The method allows the output frame-level aggregated features to combine the high resolution of the shallow layers with the high-level semantic features of the deep layers of the network, which improves detection performance; the multi-layer feature aggregation in particular improves the detection of small objects.

Description

Optical flow multilayer frame feature propagation and aggregation method for video object detection
Technical Field
The invention relates to the technical field of computer vision, in particular to a video object detection-oriented optical flow multi-layer frame feature propagation and aggregation method.
Background
Video object detection methods at home and abroad can currently be divided into two main categories: frame-level methods and optical-flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks: motion between video frames is modelled with optical flow, the features of adjacent frames are propagated to the current frame using the inter-frame optical flow, and the features of the current frame are predicted or enhanced. Although optical flow can be used for spatial transformation at the feature level, existing methods such as DFF and FGFA propagate the features extracted by the last residual block (res5) of the residual network, and errors of the optical flow network leave local features misaligned, which causes two problems. First, the features extracted by res5 have low resolution but a high semantic level, so each pixel carries very rich semantic information; if detection is performed directly on propagated features containing errors, or after aggregating them, without any mechanism to correct the erroneous pixels, detection performance suffers directly. Second, each pixel of the res5 features has a large receptive field on the original image: a small object in a video below 64 × 64 resolution corresponds to a feature region of less than 4 × 4 in res5, so the error of a single pixel affects small-object detection far more than it affects the detection of large objects above 150 × 150 resolution. In image object detection, multiple layers of the feature network are commonly used for detection simultaneously to improve accuracy, especially for small objects; this is known as a feature pyramid.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an optical flow multi-layer frame feature propagation and aggregation method for video object detection, so as to propagate and aggregate multi-layer frame-level features by means of optical flow.
In order to solve the above technical problems, the technical solution adopted by the invention is as follows: an optical flow multi-layer frame feature propagation and aggregation method for video object detection comprises two parts, namely an optical-flow-based multi-layer frame-level feature extraction and propagation process and a frame-level feature aggregation process based on multi-layer propagated features;
The optical-flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extracting multi-layer features of adjacent frames of the video;
A residual network, ResNet-101, is used as the feature network for extracting frame-level features; the ResNet-101 network has different step sizes at different layers, the output step size of the last three layers of residual block res5 is modified to 16, a dilated convolution layer is added at the end of the network, and the dimension of the features output by residual block res5 is reduced;
Step S2: extracting the optical flow of the video with a FlowNet optical flow network, and post-processing the optical flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extracting the optical flow of the video using the Simple version of the FlowNet network; two adjacent video frames are concatenated along the channel dimension, and the resulting 6-channel image is input into the FlowNet network to extract the optical flow;
Step S2.2: up-sampling and down-sampling the optical flow in order to match the size of the features;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is:

$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)

where s denotes the feature step size;
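As an illustration of steps S2.2.1 to S2.2.4, a minimal PyTorch-style sketch of resizing a stride-8 optical flow to step sizes 4 and 16 is given below; the function name, tensor layout and the rescaling of the flow values when the grid resolution changes are assumptions of this sketch rather than details fixed by the invention.

```python
import torch
import torch.nn.functional as F

def resize_flow(flow8: torch.Tensor) -> dict:
    """Build flows matched to feature step sizes {4, 8, 16} from a stride-8 flow.

    flow8: tensor of shape (N, 2, H, W), the optical flow predicted at step size 8.
    Returns a dict mapping step size -> flow tensor.
    """
    # Step size 4: nearest-neighbor up-sampling doubles the spatial size (formula (2)).
    flow4 = F.interpolate(flow8, scale_factor=2, mode="nearest")
    # Step size 16: average pooling halves the spatial size (formula (3)).
    flow16 = F.avg_pool2d(flow8, kernel_size=2, stride=2)
    # Assumption: displacements are expressed in units of the grid they live on,
    # so their magnitudes are rescaled together with the resolution.
    return {4: flow4 * 2.0, 8: flow8, 16: flow16 * 0.5}

# Usage sketch: flows = resize_flow(torch.zeros(1, 2, 38, 63))
```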
and step S3: propagating the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by using the optical flow to obtain multi-layer propagation characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$;
Given the multi-step optical flow $M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
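To make the warp mapping of formulas (5) to (8) concrete, the following sketch warps a feature map with an optical flow using bilinear sampling; normalizing the sampling grid for grid_sample and treating the flow as an offset in feature-grid pixels are implementation assumptions, not details prescribed above.

```python
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an adjacent frame's feature map feat (N, C, H, W) to the current
    frame using flow (N, 2, H, W); flow[:, 0] is the x-offset and flow[:, 1]
    the y-offset, measured in pixels of the feature grid (assumed)."""
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    # p + delta p for every position p of the current frame.
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```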
the frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by a first layer of a network of characteristics
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
step C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
A shallow mapping network $\mathcal{E}(\cdot)$ is used to map the features to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
Given the current frame feature $f_{i}$ and the feature $f_{i-t \to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) Directly extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which serve as the frame-level aggregation weights in the aggregation steps above;
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
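A sketch of the scaled cosine similarity weighting of formulas (14) to (17), including the SoftMax normalization over frames, is given below; `embed` and `scale_net` stand for the mapping network and the weight scaling network, and including the current frame in the list of candidate features is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_weights(f_cur, candidates, embed, scale_net):
    """candidates: list of features aligned to the current frame (for example
    [f_{i-t->i}, f_i, f_{i+t->i}]), each of shape (N, C, H, W).
    Returns one weight tensor per candidate, normalized over the candidates
    at every spatial position."""
    e_cur = embed(f_cur)                                   # formula (12)
    raw = []
    for f in candidates:
        e = embed(f)                                       # formula (13)
        # Cosine similarity over the channel dimension -> (N, 1, H, W), formula (14).
        cos = F.cosine_similarity(e_cur, e, dim=1, eps=1e-6).unsqueeze(1)
        # Channel-level scaling factor lambda, formula (15); shape (N, C, 1, 1).
        lam = scale_net(f_cur, f)
        raw.append(lam * cos)                              # formulas (16)/(17)
    stacked = torch.stack(raw, dim=0)                      # (T, N, C, H, W)
    stacked = F.softmax(stacked, dim=0)                    # normalize over frames
    return list(stacked.unbind(dim=0))
```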
the mapping network and the weight scaling network share the first two layers, convolving two successive convolutional layers with 1 × 1 convolution and 3 × 3 convolution after the 1024-dimensional vector output by ResNet-101, and thenTwo branch subnets are connected at the back; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA0002006472780000051
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
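One possible PyTorch layout of the shared trunk and the two branch sub-networks described above is sketched here; the intermediate channel widths and the 256-dimensional embedding are assumptions, while the 1024-dimensional input and the 1024-dimensional scaling vector follow the text.

```python
import torch
import torch.nn as nn

class EmbedAndScale(nn.Module):
    """Shared 1x1 + 3x3 trunk followed by an embedding branch (mapping
    network) and a channel-wise weight-scaling branch."""

    def __init__(self, in_ch: int = 1024, mid_ch: int = 512, embed_ch: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 1: 1x1 convolution producing the embedded features f^e.
        self.embed = nn.Conv2d(mid_ch, embed_ch, kernel_size=1)
        # Branch 2: 1x1 convolution + global average pooling producing a
        # 1024-dimensional scaling vector, one value per ResNet-101 channel.
        self.scale = nn.Sequential(
            nn.Conv2d(mid_ch, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feat: torch.Tensor):
        h = self.trunk(feat)
        return self.embed(h), self.scale(h)  # (N, 256, H, W), (N, 1024, 1, 1)
```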
The beneficial effects of the above technical solution are as follows: in the optical flow multi-layer frame feature propagation and aggregation method for video object detection, features are propagated at the shallow output layers of the feature network (the res3 and res4 layers). On one hand, the shallow layers have high resolution, so the fault tolerance for small objects during feature propagation is high; on the other hand, propagation errors of the shallow layers can be weakened, and even gradually corrected, by the subsequent layers of the network. Features are then propagated simultaneously at the shallow and deep layers of the feature network and the deep and shallow features are aggregated, so that the high-level semantic features of the deep layers are exploited while the high resolution of the shallow features is retained. The output frame-level aggregated features thus combine the high resolution of the shallow layers with the high-dimensional semantic features of the deep layers, which improves detection performance; the multi-layer feature aggregation in particular improves the detection of small objects.
Drawings
FIG. 1 is a flowchart of an optical flow multi-layer frame feature propagation and aggregation method for video object detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the optical flow-based multi-layer feature propagation and aggregation process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the FlowNet network structure (Simple version) according to an embodiment of the present invention;
FIG. 4 is a comparison graph of the detection performance of different network layers according to an embodiment of the present invention;
FIG. 5 is a histogram of the ground-truth box area distribution of the ImageNet VID validation set and its grouping according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
This implementation takes the video data set ImageNet VID as an example and validates the optical flow multi-layer frame feature propagation and aggregation method for video object detection on this video data;
a method for propagating and aggregating optical flow multi-layer frame features for video object detection is disclosed, as shown in FIG. 1 and FIG. 2, which comprises two parts, namely a multi-layer frame-level feature extraction and propagation process based on optical flow and a frame-level feature aggregation process based on multi-layer propagation features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
A residual network, ResNet-101, is used as the feature network for extracting frame-level features; the ResNet-101 network has different step sizes at different layers; following the R-FCN network, the output step size of the last three layers of residual block res5 is modified to 16, a dilated convolution layer is added at the end of the network, and the dimension of the features output by res5 is reduced;
in this embodiment, a modified ResNet-101 network is used as a feature network for extracting features at the frame level, and detailed step size and spatial scale statistics for each layer are shown in Table 1.ResNet-101 has different step sizes on different layers of the network, modifies the output step size of the last three layers res5a _ relu, res5b _ relu to 16, and adds an extended convolution layer feat _ conv _3 × 3_ relu of dilate =6, kernel =3, pad =6, num _filters = 1024.
TABLE 1 ResNet-101 layer step size statistics
Number  Layer of ResNet-101   Step size  Spatial scale
1       res2a_relu            4          1/4
2       res2b_relu            4          1/4
3       res2c_relu            4          1/4
4       res3a_relu            8          1/8
5       res3b1_relu           8          1/8
6       res3b2_relu           8          1/8
7       res3b3_relu           8          1/8
8       res4a_relu            16         1/16
9       res4b1_relu           16         1/16
10      res4b2_relu           16         1/16
…       …                     …          …
30      res4b22_relu          16         1/16
31      res5a_relu            16         1/16
32      res5b_relu            16         1/16
33      feat_conv_3×3_relu    16         1/16
Owing to the structure of the residual network, only the output layers of the residual modules are counted; internal layers are not counted and cannot be used for feature propagation. In Table 1, Number is the index of the corresponding network layer, Layer enumerates all network-layer outputs of ResNet-101 except the first two layers, Step size is the feature step size of the corresponding layer output, and Spatial scale is the ratio of the size of the corresponding layer output to the size of the original picture. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu layers are used for multi-layer feature propagation.
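One way to collect the outputs of the layers chosen above (res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu) for multi-layer propagation is with forward hooks, as sketched below; the `backbone` module and the exact sub-module names are assumed to follow the naming of Table 1.

```python
import torch

def collect_layer_outputs(backbone, images, layer_names):
    """Run `backbone` on `images` and return the outputs of the named
    sub-modules (for example the layers selected in Table 1)."""
    outputs, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            outputs[name] = output
        return hook

    named = dict(backbone.named_modules())
    for name in layer_names:
        handles.append(named[name].register_forward_hook(make_hook(name)))
    with torch.no_grad():
        backbone(images)
    for h in handles:
        h.remove()
    return outputs

# Usage sketch (layer names assumed):
# feats = collect_layer_outputs(resnet101, frames,
#     ["res2b_relu", "res3b3_relu", "res4b22_relu", "feat_conv_3x3_relu"])
```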
Step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting optical flow of the video using a Simple version of the FlowNet network as shown in figure 3; directly connecting two adjacent frames of the video image in series in the channel dimension, and inputting the 6-channel image after the connection in series into a FlowNet network to extract an optical flow;
the FlowNet network extracts the characteristics containing high-dimensional semantic information of two frames of images through downsampling CNN;
firstly, an average pooling layer with a window size of 2 multiplied by 2 and a step length of 2 is used for reducing the size of an original input picture by half, then the abstract level of features is improved through 9 continuous convolution layers, and meanwhile, the feature size is changed into 1/32 of the original feature size;
the output characteristic diagram of the down-sampling CNN has high semantic meaning, but the resolution ratio is low, compared with the original diagram, the characteristic diagram loses detail information among a plurality of images in the process of adopting the characteristic diagram, and the optical flow effect obtained by the characteristic is poor, so that the FlowNet network introduces a refining module after the down-sampling CNN, improves the characteristic resolution ratio and learns the high-quality optical flow among the images;
the refining module is based on the FCN thought, adopts deconvolution operation similar to FCN, improves the resolution of the features, meanwhile supplements lost detail information by combining the output features of the front layer, and finally outputs a dual-channel optical flow; the network structure of the refining module is as follows: firstly, doubling the size of a feature map by deconvolution, then serially connecting the feature map with a corresponding convolution layer output feature map in a down-sampling CNN along the channel dimension to serve as the input of the next layer, wherein the same basically applies to the following process, and the difference is that a flow branch is used for learning an optical flow with a corresponding size each time, and the optical flow is serially connected to the output feature map along the channel dimension to serve as the input of the next layer;
step S2.2: upsampling and downsampling the optical flow in order to match the size of the features;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is:
$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)

where s denotes the feature step size;
and step S3: propagating the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by using the optical flow to obtain multi-layer propagation characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$.
In this embodiment, in order to propagate the multi-layer features, the same optical flow is used for all layers with the same step size; for example, the layers from res4a_relu up to the dilated convolution layer feat_conv_3×3_relu all propagate their features with the optical flow of step size 16.
Given the multi-step optical flow $M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number and corresponds to the Number column in Table 1, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
the frame level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by the first layer of the characteristic network
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
and C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, which greatly enhances the representational power of the current frame features.
The calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
A shallow mapping network $\mathcal{E}(\cdot)$ is used to map the features to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
Given the current frame feature $f_{i}$ and the feature $f_{i-t \to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) Directly extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which serve as the frame-level aggregation weights in the aggregation steps above;
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA0002006472780000105
The second branch is also a 1 × 1 convolution, and then a global average pooling layer is connected as a weight scalingAnd the network generates a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector and used for measuring the importance degree of the features and controlling the scaling of the feature time aggregation weight.
This embodiment tests the outputs of the three standard blocks of ResNet-101, i.e. the output res3c_relu of the res3 block, the output res4b22_relu of the res4 block and the output conv_3×3_feat of the res5 block; in addition, layers are sampled roughly every 5 layers around res3c_relu and every 3 layers within the res4 block, so that 9 layers, numbered (2, 7, 12, 19, 21, 24, 27, 30, 33), are finally sampled for testing. The mean average precision of detection is shown in FIG. 4. As can be seen from FIG. 4, res4b22_relu gives the best accuracy, conv_3×3_feat the second best and res3c_relu the worst. Before the 17th layer the performance of the earlier layers drops quickly, the differences in mean average precision among the later layers are small, and the detection accuracy peaks at the 30th layer. This verifies that propagating features at layers shallower than the deepest layer can perform better, but that the gain saturates as the layers become shallower; the increased resolution even makes optical flow prediction more difficult and lowers the overall detection performance.
This example was tested on the ImageNet VID validation set. FGFA with its feature propagation layer adjusted is used as the baseline for each level; the test results are shown in Table 2.
TABLE 2 comparison of aggregate accuracy of multilayer and monolayer propagation characteristics
[Table 2 appears as an image in the original publication; it reports the mean average precision of feature aggregation with single-layer propagation (the last layer of res5, as in FGFA, and the last layer of res4) and with multi-layer propagation (the last layers of res4 and res5).]
From the experimental results in Table 2 it can be seen that feature aggregation using propagation at the last layer of res4 (res4b22_relu) performs better than using the last layer of res5 (as in FGFA), i.e. propagating features at a shallower layer and passing them through the deeper network performs better. The results also show that propagating and aggregating the features of both res4 and res5 further improves detection performance (72.1 → 73.6, an increase of 1.5), verifying the benefit of multi-layer feature aggregation for detection accuracy.
To further demonstrate the improvement that the multi-layer feature aggregation method brings to small-object detection, the VID validation set is divided into three groups (small, medium and large) according to ground-truth box area, as shown in FIG. 5. The size criterion is: objects with an area in (0, 64²) are small, objects with an area in (64², 150²) are medium, and objects larger than 150² are large. This embodiment counts the proportion of each group in the validation set, as shown in FIG. 5: large objects are the majority (60.0%) of the VID validation set, while small objects are the minority (13.5%). The performance of single deep-layer (last layer of res5) feature propagation, single shallow-layer (last layer of res4) feature propagation and fused multi-layer (last layers of res4 + res5) feature propagation is compared on these three groups of the ImageNet VID validation set; the test results are shown in Table 3.
TABLE 3 detection accuracy of different methods on ImageNet VID validation set for different sized targets
Method             Mean average precision (%), small   Mean average precision (%), medium   Mean average precision (%), large
FGFA (res5)        26.9                                51.4                                 83.0
FGFA (res4)        29.5                                50.8                                 84.1
FGFA (res4+res5)   30.1                                51.9                                 84.5
As can be seen from Table 3, shallow-layer feature aggregation achieves higher detection performance on small objects than deep-layer feature aggregation (26.9% → 29.5%, an increase of 2.6 points), indicating that for small-object detection the errors of shallow feature propagation have less impact than the errors of deep feature propagation. Aggregating the shallow and deep features together achieves the best detection performance on every subset of the validation set, which shows that fusing deep and shallow features improves detection more comprehensively and that the multi-layer feature aggregation algorithm of the present invention combines the respective advantages of the multi-layer features well.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit of the invention, which is defined by the claims.

Claims (4)

1. A method for propagating and aggregating optical flow multi-layer frame features for video object detection, characterized in that the method comprises a multi-layer frame-level feature extraction and propagation process based on optical flow and a frame-level feature aggregation process based on multi-layer propagated features;
the optical flow-based multi-layer frame-level feature extraction and propagation process comprises the following steps:
step S1: extracting multilayer characteristics of adjacent frames of the video;
using a residual error network ResNet-101 network as a feature network for extracting frame level features, wherein the ResNet-101 network has different step sizes on different layers, modifying the output step size of the last three layers of a residual error block res5 to be 16, adding an expansion convolutional layer at the end of the network, and reducing the dimension of the features output by the residual error block res 5;
step S2: extracting the optical flow of the video by using a FlowNet optical flow network, and performing post-processing on the optical flow to perform size conversion on the features with different sizes of each layer of the feature network;
step S2.1: extracting optical flow of the video by using a Simple version of the FlowNet network; directly connecting two adjacent frames of the video image in series in the channel dimension, and inputting the 6-channel image after the connection in series into a FlowNet network to extract an optical flow;
step S2.2: in order to match the size of the features, up-sampling and down-sampling are carried out on the optical flow to obtain the optical flow suitable for multi-layer feature propagation;
and step S3: transmitting the multi-layer frame level characteristics of the i-t frame and the i + t frame to the ith frame by utilizing the optical flow to obtain multi-layer transmission characteristics
$f^{l}_{i-t \to i}$ and $f^{l}_{i+t \to i}$;
The frame-level feature aggregation process based on the multilayer propagation features comprises the following steps:
step C1: propagation characteristics by the first layer of the characteristic network
$f^{1}_{i-t \to i}$ and $f^{1}_{i+t \to i}$, together with the current frame feature $f^{1}_{i}$, are weighted and summed; the aggregated feature of the first layer of the feature network is:

$\bar{f}^{1}_{i} = w^{1}_{i-t \to i}\, f^{1}_{i-t \to i} + w^{1}_{i}\, f^{1}_{i} + w^{1}_{i+t \to i}\, f^{1}_{i+t \to i}$  (9)

where $\bar{f}^{1}_{i}$ is the aggregated feature of the first layer of the feature network and $w^{1}$ are the scaled cosine similarity weights used for first-layer feature aggregation;
and step C2: characterization of the polymerization of step C1
$\bar{f}^{1}_{i}$, is input as the current-frame feature into the second layer of the feature network to obtain the feature $f^{2}_{i}$; at the same time, the second-layer propagated features of the adjacent frames, $f^{2}_{i-t \to i}$ and $f^{2}_{i+t \to i}$, are obtained, and the features are aggregated again to give the aggregated feature of the second layer of the feature network:

$\bar{f}^{2}_{i} = w^{2}_{i-t \to i}\, f^{2}_{i-t \to i} + w^{2}_{i}\, f^{2}_{i} + w^{2}_{i+t \to i}\, f^{2}_{i+t \to i}$  (10)

where $\bar{f}^{2}_{i}$ is the aggregated feature of the second layer of the feature network and $w^{2}$ are the scaled cosine similarity weights used for second-layer feature aggregation;
and C3: repeating the aggregation process, aggregating the frame-level features of each layer of the feature network one by one, and taking the aggregation feature output by the previous layer as the current frame feature of the next layer until the aggregation feature of the last layer of the feature network is obtained, wherein the aggregation feature is shown in the following formula:
$\bar{f}^{n}_{i} = w^{n}_{i-t \to i}\, f^{n}_{i-t \to i} + w^{n}_{i}\, f^{n}_{i} + w^{n}_{i+t \to i}\, f^{n}_{i+t \to i}$  (11)

where $\bar{f}^{n}_{i}$ is the aggregated feature of the n-th layer of the feature network, $w^{n}$ are the scaled cosine similarity weights used for n-th-layer feature aggregation, and n is the total number of layers of the feature network;
The aggregated feature $\bar{f}^{n}_{i}$ of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates the temporal information of multiple frames as well as the multi-layer spatial information of the feature network, which greatly enhances the representational power of the current frame features;
the calculation method of the scaled cosine similarity weight of the aggregated nth layer features comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
(2) Extracting scaling factors from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain frame-level scaled cosine similarity weights, which are used as the frame-level aggregation weights.
2. The method according to claim 1, wherein the specific method of step S2.2 is:
step S2.2.1: current frame image I of given video i And its adjacent frame image I i-t Then, the optical flow output by the FlowNet network is as follows:
$M^{8}_{i \to i-t} = \mathcal{F}(I_i, I_{i-t})$  (1)

where $M^{8}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 8 indicates a step size of 8, and $\mathcal{F}(\cdot)$ denotes the optical flow network FlowNet;
Step S2.2.2: the optical flow is up-sampled to obtain an optical flow with a corresponding feature step size of 4:

$M^{4}_{i \to i-t} = \mathrm{Upsample}(M^{8}_{i \to i-t})$  (2)

where $M^{4}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 4 indicates a step size of 4, and $\mathrm{Upsample}(\cdot)$ denotes the nearest-neighbor up-sampling function;
Step S2.2.3: the optical flow is down-sampled to obtain an optical flow with a corresponding feature step size of 16:

$M^{16}_{i \to i-t} = \mathrm{Downsample}(M^{8}_{i \to i-t})$  (3)

where $M^{16}_{i \to i-t}$ denotes the optical flow between the current frame $I_i$ and its adjacent frame $I_{i-t}$, the superscript 16 indicates a step size of 16, and $\mathrm{Downsample}(\cdot)$ denotes average-pooling down-sampling;
Step S2.2.4: if $M^{8}_{i \to i-t} \in \mathbb{R}^{C \times H \times W}$, then correspondingly $M^{s}_{i \to i-t} \in \mathbb{R}^{C \times \frac{8H}{s} \times \frac{8W}{s}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the optical flow; the optical flows suitable for multi-layer feature propagation are thus obtained:

$M^{s}_{i \to i-t},\ s \in \{4, 8, 16\}$  (4)
where s represents the feature step size.
3. The method according to claim 1, wherein the specific method of step S3 comprises the following steps:
given multi-step optical flow
$M^{s}_{i \to i-t}$, the propagation feature layer number l and the frame image $I_{i-t}$, the final propagated features are calculated by:

$f^{l}_{i-t} = N^{l}_{feat}(I_{i-t})$  (5)

$f^{l}_{i-t \to i} = \mathcal{W}\big(f^{l}_{i-t},\ M^{s}_{i \to i-t}\big)$  (6)

where l denotes the layer number, $l \in (1, n)$, n is the total number of layers of the feature network, $N^{l}_{feat}(\cdot)$ denotes the output of the l-th layer of the feature network, and $\mathcal{W}(\cdot)$ denotes the warp mapping function, which maps the value of the frame feature $f_{i-t}$ at position p to the position p + Δp corresponding to the current frame i, where Δp denotes the position offset;
The multi-layer propagated features of frame i+t are then calculated by:

$f^{l}_{i+t} = N^{l}_{feat}(I_{i+t})$  (7)

$f^{l}_{i+t \to i} = \mathcal{W}\big(f^{l}_{i+t},\ M^{s}_{i \to i+t}\big)$  (8)
4. the method of claim 1, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for modeling the quality distribution of the optical flow by using the cosine similarity weight in the step C3 comprises the following steps:
using a shallow mapping network
$\mathcal{E}(\cdot)$, the features are mapped to a dimension dedicated to computing similarity:

$f^{e}_{i} = \mathcal{E}(f_{i})$  (12)

$f^{e}_{i-t \to i} = \mathcal{E}(f_{i-t \to i})$  (13)

where $f^{e}_{i}$ and $f^{e}_{i-t \to i}$ are the features obtained by mapping $f_{i}$ and $f_{i-t \to i}$, and $\mathcal{E}(\cdot)$ is the mapping network;
the specific method for extracting the scaling factor from the appearance characteristics of the video frame and modeling the quality distribution of the video frame to obtain the scaling cosine similarity weight at the frame level comprises the following steps:
given current frame feature f i And the feature f propagated by adjacent frames i-t→i Then the cosine similarity between them at spatial position p is:
$w_{i-t \to i}(p) = \dfrac{f^{e}_{i-t \to i}(p) \cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t \to i}(p)\right|\,\left|f^{e}_{i}(p)\right|}$  (14)

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of size W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train.
Given the current frame feature $f_{i}$ and the propagated feature $f_{i-t \to i}$ of frame i-t, the weight scaling factor output by the weight scaling network $\mathcal{S}(\cdot)$ is:

$\lambda_{i-t} = \mathcal{S}(f_{i},\ f_{i-t \to i})$  (15)

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t \to i}$ is a matrix on a 2-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by:

$w^{c}_{i-t \to i}(p) = \lambda^{c}_{i-t} \otimes w_{i-t \to i}(p)$  (16)

where $\otimes$ denotes channel-level multiplication;
The scaled cosine similarity weights are thus obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated feature of frame i+t is:

$w^{c}_{i+t \to i}(p) = \lambda^{c}_{i+t} \otimes w_{i+t \to i}(p)$  (17)

The weights at position p are normalized over the frames so that $\sum_{j \in \{i-t,\, i,\, i+t\}} w_{j \to i}(p) = 1$; the normalization is performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 x 1 convolution and 3 x 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure FDA0002006472770000048
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
CN201910230235.2A 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection Active CN109993096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910230235.2A CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910230235.2A CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Publications (2)

Publication Number Publication Date
CN109993096A CN109993096A (en) 2019-07-09
CN109993096B true CN109993096B (en) 2022-12-20

Family

ID=67131468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910230235.2A Active CN109993096B (en) 2019-03-26 2019-03-26 Optical flow multilayer frame feature propagation and aggregation method for video object detection

Country Status (1)

Country Link
CN (1) CN109993096B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400305A (en) * 2019-07-26 2019-11-01 哈尔滨理工大学 A kind of object detection method based on deep learning
CN110852199A (en) * 2019-10-28 2020-02-28 中国石化销售股份有限公司华南分公司 Foreground extraction method based on double-frame coding and decoding model
CN110866509B (en) * 2019-11-20 2023-04-28 腾讯科技(深圳)有限公司 Action recognition method, device, computer storage medium and computer equipment
CN111144376B (en) * 2019-12-31 2023-12-05 华南理工大学 Video target detection feature extraction method
CN113673545A (en) * 2020-05-13 2021-11-19 华为技术有限公司 Optical flow estimation method, related device, equipment and computer readable storage medium
CN112307872B (en) * 2020-06-12 2024-09-24 北京京东尚科信息技术有限公司 Method and device for detecting target object
CN111860293B (en) * 2020-07-16 2023-12-22 中南民族大学 Remote sensing scene classification method, device, terminal equipment and storage medium
CN111950612B (en) * 2020-07-30 2021-06-01 中国科学院大学 FPN-based weak and small target detection method for fusion factor
CN112307889B (en) * 2020-09-22 2022-07-26 北京航空航天大学 Face detection algorithm based on small auxiliary network
CN112394356B (en) * 2020-09-30 2024-04-02 桂林电子科技大学 Small target unmanned aerial vehicle detection system and method based on U-Net
CN111968064B (en) * 2020-10-22 2021-01-15 成都睿沿科技有限公司 Image processing method and device, electronic equipment and storage medium
CN112966581B (en) * 2021-02-25 2022-05-27 厦门大学 Video target detection method based on internal and external semantic aggregation
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113570608B (en) * 2021-06-30 2023-07-21 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152627B2 (en) * 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks

Also Published As

Publication number Publication date
CN109993096A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109993096B (en) Optical flow multilayer frame feature propagation and aggregation method for video object detection
CN111583109B (en) Image super-resolution method based on generation of countermeasure network
CN110119780B (en) Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN109087273B (en) Image restoration method, storage medium and system based on enhanced neural network
CN109035142B (en) Satellite image super-resolution method combining countermeasure network with aerial image prior
WO2018161775A1 (en) Neural network model training method, device and storage medium for image processing
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN111179167A (en) Image super-resolution method based on multi-stage attention enhancement network
CN107274347A (en) A kind of video super-resolution method for reconstructing based on depth residual error network
CN106204447A (en) The super resolution ratio reconstruction method with convolutional neural networks is divided based on total variance
CN109993095A (en) A kind of other characteristic aggregation method of frame level towards video object detection
CN110889895A (en) Face video super-resolution reconstruction method fusing single-frame reconstruction network
JP2019067403A (en) Learning method and learning device for image segmentation, and image segmentation method and image segmentation device using the same
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN111931857B (en) MSCFF-based low-illumination target detection method
CN105657402A (en) Depth map recovery method
CN108765282B (en) Real-time super-resolution method and system based on FPGA
CN111986085B (en) Image super-resolution method based on depth feedback attention network system
CN110136067B (en) Real-time image generation method for super-resolution B-mode ultrasound image
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
Sun et al. Learning local quality-aware structures of salient regions for stereoscopic images via deep neural networks
CN114842216A (en) Indoor RGB-D image semantic segmentation method based on wavelet transformation
CN115760814A (en) Remote sensing image fusion method and system based on double-coupling deep neural network
CN114626984A (en) Super-resolution reconstruction method for Chinese text image
Wang et al. Underwater image super-resolution and enhancement via progressive frequency-interleaved network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant