CN111696137A - Target tracking method based on multilayer feature mixing and attention mechanism - Google Patents
Target tracking method based on multilayer feature mixing and attention mechanism
- Publication number
- CN111696137A (application CN202010518472.1A)
- Authority
- CN
- China
- Prior art keywords
- frame
- cls
- reg
- feature
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target tracking method based on multilayer feature mixing and an attention mechanism. An improved FPN structure is used to better retain and exploit the shallow features of the image; this improved structure outputs fused features that combine multiple dimensions and scales, so the method tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistaken tracking is reduced. Meanwhile, through an attention mechanism the network pays more attention, on the spatial scale, to positions where the target is likely to be, reducing target loss or tracking errors caused by partial occlusion, deformation, illumination and the like.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a target tracking method based on multilayer feature mixing and attention mechanism.
Background
Visual target tracking is an important computer vision task that can be applied to visual surveillance, human-computer interaction, video compression and other fields. Despite extensive research on this problem, it remains difficult to handle complex changes in target appearance caused by illumination variation, partial occlusion, shape deformation and camera motion.
At present, target tracking algorithms fall into two main branches: one based on correlation filtering and one based on deep learning. The target tracking method provided by the invention belongs to the deep learning branch.
Deep learning approaches mainly use the following models: convolutional neural networks; recurrent neural networks; generative adversarial networks; and twin (Siamese) neural networks. The convolutional-neural-network-based tracker proposed in "Learning spatial-aware regression for visual tracking, C. Sun, D. Wang, H. Lu, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8962-8970" constructs multiple target models to capture various target appearances, learns different target models, handles partial occlusion and deformation with part-based models, and uses a two-stream network to prevent overfitting and to learn the rotation information of the target. Although this method has made great progress in the accuracy of target estimation, such convolutional-neural-network-based methods still have high computational complexity. The invention patent CN110780290A, "multi-maneuvering-target tracking method based on an LSTM network", is a recurrent-neural-network-based tracking method that uses context information to handle the influence of similar backgrounds on the tracked target. Because visual target tracking depends on the spatial and temporal information of the video frames, recurrent-neural-network-based approaches must also take the motion of the target into account. The number of such methods is limited because their models contain a large number of parameters, which makes training difficult; almost all of them attempt to improve target modeling with additional information and memory. A further motivation for recurrent-neural-network-based approaches is to avoid fine-tuning a pre-trained CNN model, which requires much time and is prone to overfitting. "Visual tracking via adaptive sampling, Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8990-8999" performs target tracking based on a generative adversarial network; it can generate the required samples to alleviate the imbalanced distribution of training samples and, by generating samples, also addresses the shortage of training data. However, generative adversarial networks are often difficult to train and evaluate, and making them work in practice requires considerable skill. The invention patent CN110728697A, "infrared weak and small target detection and tracking method based on a convolutional neural network", tracks the target with a twin network, extracting deep features of the image and matching them to complete the tracking.
To address the problems in existing deep-learning trackers of uneven utilization of target features and of occlusion, partial occlusion, illumination change, deformation and the like of the tracked object, the present method builds on a twin network, combines shallow and deep features through several FPNs, and improves robustness with an attention mechanism.
Disclosure of Invention
The invention belongs to the field of computer vision and deep learning. By improving the feature extraction part and the region proposal network part of a twin network, the whole target tracking network obtains stronger feature extraction capability and robustness. The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, which comprises the following specific steps:
(1) Before training, preprocess the data set: the training data consist of video sequences carrying labels of the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Design two parallel 5-block deep residual networks N_1, N_2 to extract features from the template frame and the search frame, and form a twin network N_S by sharing their weights. The deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, where M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block.
(3) Design the feature pyramid networks FPN, comprising three FPNs: FPN1, FPN2 and FPN3 respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. Feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and the fused feature F_M is finally output, where F_M has the same size as F_3. Finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c);
(4) Design the region proposal networks RPN, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of mixed template-frame and search-frame features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes; these two different outputs are produced by two branches, the upper half of the RPN outputting the classification CLS and the lower half outputting the regression REG. The RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, where c is the number of channels of the current mixed feature (the channel number differs between combinations). Then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k. The output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image; each position of the w_res × h_res grid corresponds to k anchor boxes of preset sizes whose centres are the centre of the current position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth; the final target is then found by non-maximum suppression and similar methods;
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module; through average pooling, maximum pooling, convolution and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG;
(7) The output results of the three RPNs: RPN1, RPN2 and RPN3 are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates; y denotes the label value and ŷ denotes the actual classification value, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
The present invention uses an improved FPN structure. Compared with the conventional FPN, in which the deep features retain the shallow features insufficiently, the improved FPN structure better retains and exploits the shallow features of the image and can output fused features that combine multiple dimensions and scales. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistaken tracking is reduced. Meanwhile, through the attention mechanism the network pays more attention, on the spatial scale, to positions where the target is likely to be, reducing target loss or tracking errors caused by partial occlusion, deformation, illumination and the like.
Drawings
FIG. 1 is a diagram of a template frame and a search frame according to the present invention
FIG. 2 is an overall structure diagram of the target tracking network of the present invention
FIG. 3 is a diagram of the FPN structure of the present invention
FIG. 4 is a diagram of the RPN architecture of the present invention
FIG. 5 is a diagram illustrating the RPN output result of the present invention
FIG. 6 is a flow chart of target tracking network training in accordance with the present invention
Detailed Description
The following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.
The invention provides a target tracking method based on a multilayer feature mixing and attention mechanism, which comprises the following specific steps:
(1) The data set is preprocessed before training. The training data consist of video sequences with labels of the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, as shown in fig. 1 and 2, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features from the template frame and the search frame, and form a twin network N_S by sharing their weights. The deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, where M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block.
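By way of illustration only, the following PyTorch sketch shows one possible construction of such a shared-weight backbone. It uses the torchvision ResNet-50 as the base; the exact layers modified and all identifiers in the sketch are assumptions of this illustration, not a definitive implementation of the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseBackbone(nn.Module):
    """Sketch of the modified 5-block residual backbone N_1/N_2 (weights shared)."""
    def __init__(self):
        super().__init__()
        net = resnet50()
        # Remove the padding of the first 7x7 convolution, as described in step (2).
        net.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0, bias=False)
        # Change the last two stride-2 convolutions (layer3/layer4 downsampling) to stride 1.
        for layer in (net.layer3, net.layer4):
            layer[0].conv2.stride = (1, 1)
            layer[0].downsample[0].stride = (1, 1)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.blocks = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        feats.append(x)          # shallowest level (Conv1 stage)
        for block in self.blocks:
            x = block(x)
            feats.append(x)      # Conv2_3, Conv3_3, Conv4_6, Conv5_3 levels
        return feats             # five depth levels, shallow to deep

# Weight sharing: the same module processes the template frame F_t and the search frame F_c.
backbone = SiameseBackbone()
feats_t = backbone(torch.randn(1, 3, 127, 127))   # template frame
feats_c = backbone(torch.randn(1, 3, 255, 255))   # search frame
```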
(3) Design of the feature pyramid networks (Feature Pyramid Networks, FPN): three FPNs (FPN1, FPN2, FPN3) respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features.
The specific structure of a single FPN used in the present invention is shown in fig. 3. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. Feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and the fused feature F_M is finally output, where F_M has the same size as F_3. Finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c).
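As an illustration of the fusion just described, the sketch below combines the three feature maps of one group by channel alignment (1 × 1 convolutions), resolution alignment (3 × 3 stride-2 convolutions toward the size of F_3) and point-to-point addition. The channel counts, the use of interpolation to absorb off-by-one size differences, and all names are assumptions for illustration only.

```python
import torch.nn as nn

class FPNFusion(nn.Module):
    """Minimal sketch of one FPN block: fuse F1 (shallow/large), F2, F3 (deep/small)."""
    def __init__(self, c1, c2, c3, out_channels=512):
        super().__init__()
        # 1x1 convolutions align the channel counts of the three inputs.
        self.align1 = nn.Conv2d(c1, out_channels, kernel_size=1)
        self.align2 = nn.Conv2d(c2, out_channels, kernel_size=1)
        self.align3 = nn.Conv2d(c3, out_channels, kernel_size=1)
        # 3x3 stride-2 convolutions shrink the larger maps toward the size of F3.
        self.down1 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1),
        )
        self.down2 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, f1, f2, f3):
        f3 = self.align3(f3)
        # Interpolate so the point-to-point addition is well defined for any input sizes.
        f1 = nn.functional.interpolate(self.down1(self.align1(f1)), size=f3.shape[-2:])
        f2 = nn.functional.interpolate(self.down2(self.align2(f2)), size=f3.shape[-2:])
        return f1 + f2 + f3   # fused feature F_M, same spatial size as F3

# Illustrative use (channel counts are made up): F_M = FPNFusion(64, 256, 1024)(f1, f2, f3)
```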
(4) Region proposal networks (RPN): three RPNs (RPN1, RPN2, RPN3) respectively take as input the mixed features of the three pairs of template and search frames F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes, as shown in FIG. 2.
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes; these two different outputs are produced by two branches: the upper half of the RPN in fig. 2 outputs the classification CLS of the proposal boxes and the lower half outputs the regression REG. The RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, where c is the number of channels of the current mixed feature (the channel number differs between combinations). Then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
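The cross-correlation can be read as using the adjusted template feature as a convolution kernel that slides over the search feature. A minimal sketch is given below, assuming an up-channel style correlation in which the template branch carries num_out × C channels; the function name and tensor layouts are assumptions, not quotations of the disclosure.

```python
import torch
import torch.nn.functional as F

def up_channel_xcorr(template_feat, search_feat, num_out):
    """Sketch of the cross-correlation producing CLS_O / REG_O.
    template_feat: (B, num_out*C, 5, 5)  -- template branch after the adjustment conv
    search_feat:   (B, C, 29, 29)        -- search branch after the adjustment conv
    Returns a (B, num_out, 25, 25) response map (sizes follow the embodiment below)."""
    batch, _, th, tw = template_feat.shape
    channels = search_feat.shape[1]
    out = []
    for b in range(batch):  # per-sample correlation; the template acts as the kernel
        kernel = template_feat[b].view(num_out, channels, th, tw)
        out.append(F.conv2d(search_feat[b:b + 1], kernel))
    return torch.cat(out, dim=0)

# e.g. CLS_O = up_channel_xcorr(cls_template, cls_search, num_out=2 * k)
#      REG_O = up_channel_xcorr(reg_template, reg_search, num_out=4 * k)
```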
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k, as shown in FIG. 5. The output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image; each position of the w_res × h_res grid corresponds to k anchor boxes of preset sizes whose centres are the centre of the current position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth. The final target is then found by non-maximum suppression and similar methods.
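The offset formulas themselves appear as images in the original publication and are not reproduced here. Assuming the standard anchor parameterisation consistent with the surrounding definitions (dx = (T_x − A_x)/A_w, dy = (T_y − A_y)/A_h, dw = ln(T_w/A_w), dh = ln(T_h/A_h)), a predicted box can be recovered from REG as sketched below; this parameterisation and all names are assumptions of the illustration.

```python
import torch

def decode_boxes(anchors, deltas):
    """Sketch of recovering predicted boxes from REG outputs under the assumed
    parameterisation. anchors: (N, 4) as (Ax, Ay, Aw, Ah); deltas: (N, 4) as (dx, dy, dw, dh)."""
    ax, ay, aw, ah = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    tx = ax + dx * aw            # predicted centre x
    ty = ay + dy * ah            # predicted centre y
    tw = aw * torch.exp(dw)      # predicted width
    th = ah * torch.exp(dh)      # predicted height
    return torch.stack([tx, ty, tw, th], dim=1)
```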
(6) After CLS_O and REG_O are obtained, they are fed into the spatial attention module, as shown in FIG. 4; through average pooling, maximum pooling, convolution and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG.
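A minimal sketch of such a spatial attention module is given below: channel-wise average and maximum pooling, a convolution, a Sigmoid, an element-wise product with the input and a residual addition. The 7 × 7 kernel size is an assumption; the text only names the operations.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention applied to CLS_O / REG_O (kernel size assumed)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (B, C, H, W), e.g. CLS_O or REG_O
        avg_map = x.mean(dim=1, keepdim=True)    # average pooling over channels
        max_map, _ = x.max(dim=1, keepdim=True)  # maximum pooling over channels
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        # (B, 1, H, W) spatial weights SA; element-wise product plus residual addition.
        return x * weights + x
```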
(7) The output results of the three RPNs (RPN1, RPN2, RPN3) are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
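The weighted combination can be sketched as follows; the default weights 0.2/0.3/0.5 are taken from the embodiment described later, and the function name and argument layout are assumptions for illustration.

```python
def fuse_rpn_outputs(cls_list, reg_list, alphas=(0.2, 0.3, 0.5), betas=(0.2, 0.3, 0.5)):
    """Weighted addition of the CLS / REG maps from RPN1-RPN3 (step (7))."""
    cls = sum(a * c for a, c in zip(alphas, cls_list))   # CLS = sum_i alpha_i * CLS_i
    reg = sum(b * r for b, r in zip(betas, reg_list))    # REG = sum_i beta_i  * REG_i
    return cls, reg
```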
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates. y denotes the label value and ŷ denotes the actual classification value (i.e. P_pos); dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
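A sketch of the training objective under these definitions is given below, assuming the classification loss is the usual binary cross-entropy on P_pos and the regression loss the usual smoothed L1 on the four normalised offsets; the exact forms of formulas (3) and (4) appear as images in the original, so the concrete forms and tensor shapes here are assumptions.

```python
import torch.nn.functional as F

def tracking_loss(p_pos, y, reg_pred, reg_target, lam=1.0):
    """Sketch of the step (8) losses. p_pos, y: (N,) in [0, 1];
    reg_pred / reg_target: (N, 4) as (dx, dy, dw, dh) vs. ground-truth offsets."""
    l_cls = F.binary_cross_entropy(p_pos, y)        # L_cls: cross-entropy (cf. formula (3))
    l_reg = F.smooth_l1_loss(reg_pred, reg_target)  # L_reg: smoothed L1 (cf. formula (4))
    return l_cls + lam * l_reg                      # loss = L_cls + lambda * L_reg  (5)
```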
The key parameters of one embodiment of the present invention are listed in Table 1, and the specific parameters marked in the accompanying drawings follow these implementation parameters:
TABLE 1 example parameters
The specific training process of the target tracking network designed by the invention is shown in fig. 6; the specific training procedure and the related parameters of this embodiment are as follows:
The video sequences in the data set are processed according to the label information and cropped to obtain a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels.
The template frame F_t and the search frame F_c are fed into the feature extraction networks ResNet_N_1 and ResNet_N_2 of FIG. 2, and five features of different depth levels are extracted; the two feature extraction networks share weights.
Three feature pyramid networks, shown in fig. 3, FPN1, FPN2 and FPN3, fuse the features of the template frame F_t and of the search frame F_c extracted at different depth levels: FPN1 fuses the features obtained from the first, second and third blocks (layers), FPN2 fuses the features obtained from the first, second and fourth blocks (layers), and FPN3 fuses the features obtained from the first, second and fifth blocks (layers), as shown in fig. 2. The three pairs of FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c); the mixed features of the template frame are all of size 15 × 15 × 512 and the mixed features of the search frame are all of size 31 × 31 × 512.
The three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) are fed respectively into the three region proposal networks RPN1, RPN2 and RPN3, as shown in FIG. 2. Each region proposal network has the same structure, shown in fig. 4, with 5 anchor boxes, i.e. k = 5. First, the mixed feature of the template frame F_M(F_t) is cropped to remove the elements on its periphery, giving a size of 7 × 7 × 512; four convolution layers then adjust the channel numbers of F_M(F_t) and of the mixed feature of the search frame F_M(F_c), yielding: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512.
Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r yields the classification intermediate result CLS_O and the regression intermediate result REG_O, where CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20.
CLS_O and REG_O are fed into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r, and the results are added to the original CLS_O and REG_O to obtain the final RPN classification result CLS and regression result REG. CLS has the same size as CLS_O, and REG the same size as REG_O. The "spatial attention" block in the flow chart performs these steps.
The classification results and the regression results output by RPN1, RPN2 and RPN3 are weighted and added with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and the regression result of the proposal boxes. The loss is computed and optimised according to formulas (3), (4) and (5). When the preset number of 50 training rounds is reached, training ends and testing is carried out.
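For completeness, one training iteration following fig. 6 could look like the sketch below. The optimiser, learning rate and the names model, loader and tracking_loss (from the loss sketch above) are assumptions of this illustration; the original does not specify them.

```python
import torch

# Assumed: `model` returns per-anchor class probabilities and box offsets for a
# (template, search) pair, and `loader` yields pre-cropped training pairs with labels.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(50):                                  # 50 training rounds, then test
    for template, search, cls_label, reg_target in loader:
        cls_pred, reg_pred = model(template, search)     # weighted outputs of RPN1-RPN3
        loss = tracking_loss(cls_pred, cls_label, reg_pred, reg_target, lam=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```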
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any manner, except for mutually exclusive features and/or steps; and any non-essential addition to or replacement of the technical features of the technical solution of the invention made by a person skilled in the art falls within the protection scope of the invention.
Claims (2)
1. A target tracking method based on a multilayer feature mixing and attention mechanism is characterized by comprising the following steps:
(1) before training, preprocessing the data set: the training data consist of video sequences and carry labels of the position and size of the target object; the target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found; the original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, wherein the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame;
(2) designing two parallel 5-block deep residual networks N_1, N_2 to extract the features of the template frame and the search frame, and forming a twin network N_S by sharing their weights, wherein the deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions; the template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations; ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, wherein M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block;
(3) designing the feature pyramid networks FPN, comprising three FPNs: FPN1, FPN2 and FPN3 respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features; each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep; feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed; the 3 features are fused and the fused feature F_M is finally output, wherein F_M has the same size as F_3; finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c);
(4) designing the region proposal networks RPN, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of mixed template-frame and search-frame features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) the RPN outputs the classification CLS and the regression REG of the proposal boxes, the two different outputs being produced by two branches, the upper half of the RPN outputting the classification CLS and the lower half outputting the regression REG; the RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, wherein c is the number of channels of the current mixed feature, the channel number differing between combinations; then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r; cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k; the output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image, each position of the w_res × h_res grid corresponding to k anchor boxes of preset sizes whose centres are the centre of the current position; the 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box, their relationship to the actual target box being as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth, the final target being found by non-maximum suppression and similar methods;
(6) after CLS_O and REG_O are obtained, they are fed into a spatial attention module, and spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained through average pooling, maximum pooling, convolution and Sigmoid activation; CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG;
(7) the output results of the three RPNs: RPN1, RPN2 and RPN3 are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) when training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates; y denotes the label value and ŷ denotes the actual classification value, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values; the loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
2. The target tracking method based on multilayer feature mixing and an attention mechanism according to claim 1, wherein training the target tracking network in step (8) specifically comprises:
processing the video sequences in the data set and cropping, according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels;
feeding the template frame F_t and the search frame F_c into the feature extraction networks ResNet_N_1 and ResNet_N_2 and extracting five features of different depth levels, wherein the two feature extraction networks share weights;
fusing, with three feature pyramid networks FPN1, FPN2 and FPN3, the features of the template frame F_t and the search frame F_c extracted at different depth levels, wherein FPN1 fuses the features obtained from the first, second and third blocks, i.e. layers one, two and three, FPN2 fuses the features obtained from the first, second and fourth blocks, i.e. layers one, two and four, and FPN3 fuses the features obtained from the first, second and fifth blocks, i.e. layers one, two and five; the three pairs of FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c), the mixed features of the template frame all having size 15 × 15 × 512 and the mixed features of the search frame all having size 31 × 31 × 512;
feeding the three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) respectively into the three region proposal networks RPN1, RPN2 and RPN3, wherein each region proposal network has the same structure and has 5 anchor boxes in total, i.e. k = 5; first cropping the mixed feature of the template frame F_M(F_t) to remove the elements on its periphery, giving a size of 7 × 7 × 512, and adjusting, through four convolution layers, the channel numbers of F_M(F_t) and of the mixed feature of the search frame F_M(F_c) to obtain: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512;
cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r to obtain the classification intermediate result CLS_O and the regression intermediate result REG_O, wherein CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20;
feeding CLS_O and REG_O into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r; multiplying CLS_O and REG_O position-wise by SA_c and SA_r and adding the results to the original CLS_O and REG_O to obtain the final RPN output classification result CLS and regression result REG, wherein CLS has the same size as CLS_O and REG has the same size as REG_O, the above steps being completed by the "spatial attention" block;
weighting and adding the classification results and the regression results output by RPN1, RPN2 and RPN3 with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and the regression result of the proposal boxes, and computing and optimising the loss according to formulas (3), (4) and (5); when the preset number of 50 training rounds is reached, training ends and testing is carried out.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518472.1A CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518472.1A CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111696137A true CN111696137A (en) | 2020-09-22 |
CN111696137B CN111696137B (en) | 2022-08-02 |
Family
ID=72479929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010518472.1A Active CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696137B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258557A (en) * | 2020-10-23 | 2021-01-22 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112288778A (en) * | 2020-10-29 | 2021-01-29 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
CN112308013A (en) * | 2020-11-16 | 2021-02-02 | 电子科技大学 | Football player tracking method based on deep learning |
CN112489088A (en) * | 2020-12-15 | 2021-03-12 | 东北大学 | Twin network visual tracking method based on memory unit |
CN112651954A (en) * | 2020-12-30 | 2021-04-13 | 广东电网有限责任公司电力科学研究院 | Method and device for detecting insulator string dropping area |
CN112669350A (en) * | 2020-12-31 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Adaptive feature fusion intelligent substation human body target tracking method |
CN112785624A (en) * | 2021-01-18 | 2021-05-11 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114120056A (en) * | 2021-10-29 | 2022-03-01 | 中国农业大学 | Small target identification method, small target identification device, electronic equipment, medium and product |
CN114399533A (en) * | 2022-01-17 | 2022-04-26 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114612519A (en) * | 2022-03-16 | 2022-06-10 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114663812A (en) * | 2022-03-24 | 2022-06-24 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030053658A1 (en) * | 2001-06-29 | 2003-03-20 | Honeywell International Inc. | Surveillance system and methods regarding same |
GB201908574D0 (en) * | 2019-06-14 | 2019-07-31 | Vision Semantics Ltd | Optimised machine learning |
CN110349185A (en) * | 2019-07-12 | 2019-10-18 | 安徽大学 | A kind of training method and device of RGBT target following model |
CN110544269A (en) * | 2019-08-06 | 2019-12-06 | 西安电子科技大学 | twin network infrared target tracking method based on characteristic pyramid |
CN110704665A (en) * | 2019-08-30 | 2020-01-17 | 北京大学 | Image feature expression method and system based on visual attention mechanism |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
CN111192292A (en) * | 2019-12-27 | 2020-05-22 | 深圳大学 | Target tracking method based on attention mechanism and twin network and related equipment |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030053658A1 (en) * | 2001-06-29 | 2003-03-20 | Honeywell International Inc. | Surveillance system and methods regarding same |
GB201908574D0 (en) * | 2019-06-14 | 2019-07-31 | Vision Semantics Ltd | Optimised machine learning |
CN110349185A (en) * | 2019-07-12 | 2019-10-18 | 安徽大学 | A kind of training method and device of RGBT target following model |
CN110544269A (en) * | 2019-08-06 | 2019-12-06 | 西安电子科技大学 | twin network infrared target tracking method based on characteristic pyramid |
CN110704665A (en) * | 2019-08-30 | 2020-01-17 | 北京大学 | Image feature expression method and system based on visual attention mechanism |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
CN111192292A (en) * | 2019-12-27 | 2020-05-22 | 深圳大学 | Target tracking method based on attention mechanism and twin network and related equipment |
Non-Patent Citations (3)
Title |
---|
HUANG LH ET AL: "《Bridging the Gap Between Detection and Tracking: A Unified Approach》", 《IEEE》 * |
SHEN Wenxiang et al.: "Indoor crowd detection network based on multi-level features and hybrid attention mechanism", Journal of Computer Applications (计算机应用) *
HU Tao: "Research on optical remote sensing target detection technology based on deep feature enhancement", China Master's Theses Full-text Database, Engineering Science and Technology II *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258557B (en) * | 2020-10-23 | 2022-06-10 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112258557A (en) * | 2020-10-23 | 2021-01-22 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112288778A (en) * | 2020-10-29 | 2021-01-29 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
CN112308013A (en) * | 2020-11-16 | 2021-02-02 | 电子科技大学 | Football player tracking method based on deep learning |
CN112489088A (en) * | 2020-12-15 | 2021-03-12 | 东北大学 | Twin network visual tracking method based on memory unit |
CN112651954A (en) * | 2020-12-30 | 2021-04-13 | 广东电网有限责任公司电力科学研究院 | Method and device for detecting insulator string dropping area |
CN112669350A (en) * | 2020-12-31 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Adaptive feature fusion intelligent substation human body target tracking method |
CN112785624A (en) * | 2021-01-18 | 2021-05-11 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN112785624B (en) * | 2021-01-18 | 2023-07-04 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114120056A (en) * | 2021-10-29 | 2022-03-01 | 中国农业大学 | Small target identification method, small target identification device, electronic equipment, medium and product |
CN114399533A (en) * | 2022-01-17 | 2022-04-26 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114399533B (en) * | 2022-01-17 | 2024-04-16 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114612519A (en) * | 2022-03-16 | 2022-06-10 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114612519B (en) * | 2022-03-16 | 2024-10-18 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114663812A (en) * | 2022-03-24 | 2022-06-24 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
CN114663812B (en) * | 2022-03-24 | 2024-07-26 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN111696137B (en) | 2022-08-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111696137B (en) | Target tracking method based on multilayer feature mixing and attention mechanism | |
CN110188239B (en) | Double-current video classification method and device based on cross-mode attention mechanism | |
CN110929736B (en) | Multi-feature cascading RGB-D significance target detection method | |
CN108537824B (en) | Feature map enhanced network structure optimization method based on alternating deconvolution and convolution | |
CN109443382A (en) | Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN112131908A (en) | Action identification method and device based on double-flow network, storage medium and equipment | |
Le et al. | A comprehensive review of recent deep learning techniques for human activity recognition | |
CN108805151B (en) | Image classification method based on depth similarity network | |
CN111179419A (en) | Three-dimensional key point prediction and deep learning model training method, device and equipment | |
CN111626159A (en) | Human body key point detection method based on attention residual error module and branch fusion | |
CN116343334A (en) | Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN114882234A (en) | Construction method of multi-scale lightweight dense connected target detection network | |
CN113807176A (en) | Small sample video behavior identification method based on multi-knowledge fusion | |
CN113673313A (en) | Gesture posture recognition method based on hierarchical convolutional neural network | |
CN116434010A (en) | Multi-view pedestrian attribute identification method | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN114743273B (en) | Human skeleton behavior recognition method and system based on multi-scale residual error map convolution network | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN115116139A (en) | Multi-granularity human body action classification method based on graph convolution network | |
CN114743162A (en) | Cross-modal pedestrian re-identification method based on generation of countermeasure network | |
CN110415261A (en) | A kind of the expression animation conversion method and system of subregion training | |
CN114240811A (en) | Method for generating new image based on multiple images | |
CN110197226B (en) | Unsupervised image translation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |