CN111696137A - Target tracking method based on multilayer feature mixing and attention mechanism - Google Patents
Target tracking method based on multilayer feature mixing and attention mechanism
- Publication number
- CN111696137A (application CN202010518472.1A)
- Authority
- CN
- China
- Prior art keywords
- frame
- cls
- reg
- feature
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target tracking method based on multilayer feature mixing and an attention mechanism. An improved FPN structure is used to better retain and exploit the shallow features of the image; this improved structure outputs fused features that combine multiple dimensions and scales, so the method tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistaken tracking is reduced. Meanwhile, through an attention mechanism the network pays more attention, on the spatial scale, to positions where the target is likely to be, reducing target loss or tracking errors caused by partial occlusion, deformation, illumination and the like.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a target tracking method based on multilayer feature mixing and attention mechanism.
Background
Visual target tracking is an important computer vision task that can be applied to visual surveillance, human-computer interaction, video compression and other fields. Despite extensive research on this problem, it remains difficult to handle complex changes in target appearance caused by illumination variation, partial occlusion, shape deformation and camera motion.
At present, target tracking algorithms fall into two main branches: one based on correlation filtering and one based on deep learning. The target tracking method provided by the invention belongs to the deep learning branch.
Deep learning approaches mainly use the following models: convolutional neural networks; recurrent neural networks; generative adversarial networks; and twin (Siamese) neural networks. The convolutional-neural-network-based tracker proposed in "Learning spatial-aware regression for visual tracking, C. Sun, D. Wang, H. Lu, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8962-8970" constructs multiple target models to capture various target appearances, learns different target models, handles partial occlusion and deformation with part-based models, and uses a two-stream network to prevent overfitting and to learn the rotation information of the target. Although this method has made great progress in the accuracy of target estimation, such convolutional-neural-network-based methods still have high computational complexity. The invention patent CN110780290A, "multi-maneuvering-target tracking method based on an LSTM network", is a recurrent-neural-network-based tracking method that uses context information to handle the influence of similar backgrounds on the tracked target. Because visual target tracking depends on the spatial and temporal information of the video frames, recurrent-neural-network-based approaches must also take the motion of the target into account. The number of such methods is limited because their models contain a large number of parameters, which makes training difficult; almost all of them attempt to improve target modeling with additional information and memory. A further motivation for recurrent-neural-network-based approaches is to avoid fine-tuning a pre-trained CNN model, which requires much time and is prone to overfitting. "Visual tracking via adaptive sampling, Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8990-8999" performs target tracking based on a generative adversarial network; it can generate the required samples to alleviate the imbalanced distribution of training samples and, by generating samples, also addresses the shortage of training data. However, generative adversarial networks are often difficult to train and evaluate, and making them work in practice requires considerable skill. The invention patent CN110728697A, "infrared weak and small target detection and tracking method based on a convolutional neural network", tracks the target with a twin network, extracting deep features of the image and matching them to complete the tracking.
To address the problems in existing deep-learning trackers of uneven utilization of target features and of occlusion, partial occlusion, illumination change, deformation and the like of the tracked object, the present method builds on a twin network, combines shallow and deep features through several FPNs, and improves robustness with an attention mechanism.
Disclosure of Invention
The invention belongs to the field of computer vision and deep learning. By improving the feature extraction part and the region proposal network part of a twin network, the whole target tracking network obtains stronger feature extraction capability and robustness. The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, which comprises the following specific steps:
(1) Before training, preprocess the data set: the training data consist of video sequences carrying labels of the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Design two parallel 5-block deep residual networks N_1, N_2 to extract features from the template frame and the search frame, and form a twin network N_S by sharing their weights. The deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, where M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block.
(3) Design the feature pyramid networks FPN, comprising three FPNs: FPN1, FPN2 and FPN3 respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. Feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and the fused feature F_M is finally output, where F_M has the same size as F_3. Finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c);
(4) Design the region proposal networks RPN, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of mixed template-frame and search-frame features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes; these two different outputs are produced by two branches, the upper half of the RPN outputting the classification CLS and the lower half outputting the regression REG. The RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, where c is the number of channels of the current mixed feature (the channel number differs between combinations). Then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k. The output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image; each position of the w_res × h_res grid corresponds to k anchor boxes of preset sizes whose centres are the centre of the current position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth; the final target is then found by non-maximum suppression and similar methods;
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module; through average pooling, maximum pooling, convolution and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG;
(7) The output results of the three RPNs: RPN1, RPN2 and RPN3 are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates; y denotes the label value and ŷ denotes the actual classification value, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
The present invention uses an improved FPN structure. Compared with the conventional FPN, in which the deep features retain the shallow features insufficiently, the improved FPN structure better retains and exploits the shallow features of the image and can output fused features that combine multiple dimensions and scales. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistaken tracking is reduced. Meanwhile, through the attention mechanism the network pays more attention, on the spatial scale, to positions where the target is likely to be, reducing target loss or tracking errors caused by partial occlusion, deformation, illumination and the like.
Drawings
FIG. 1 is a diagram of a template frame and a search frame according to the present invention
FIG. 2 is an overall structure diagram of the target tracking network of the present invention
FIG. 3 is a diagram of the FPN structure of the present invention
FIG. 4 is a diagram of the RPN architecture of the present invention
FIG. 5 is a diagram illustrating the RPN output result of the present invention
FIG. 6 is a flow chart of target tracking network training in accordance with the present invention
Detailed Description
The following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.
The invention provides a target tracking method based on a multilayer feature mixing and attention mechanism, which comprises the following specific steps:
(1) The data set is preprocessed before training. The training data consist of video sequences with labels of the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, as shown in fig. 1 and 2, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features from the template frame and the search frame, and form a twin network N_S by sharing their weights. The deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, where M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block.
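By way of illustration only, the following PyTorch sketch shows one possible construction of such a shared-weight backbone. It uses the torchvision ResNet-50 as the base; the exact layers modified and all identifiers in the sketch are assumptions of this illustration, not a definitive implementation of the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseBackbone(nn.Module):
    """Sketch of the modified 5-block residual backbone N_1/N_2 (weights shared)."""
    def __init__(self):
        super().__init__()
        net = resnet50()
        # Remove the padding of the first 7x7 convolution, as described in step (2).
        net.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0, bias=False)
        # Change the last two stride-2 convolutions (layer3/layer4 downsampling) to stride 1.
        for layer in (net.layer3, net.layer4):
            layer[0].conv2.stride = (1, 1)
            layer[0].downsample[0].stride = (1, 1)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.blocks = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        feats.append(x)          # shallowest level (Conv1 stage)
        for block in self.blocks:
            x = block(x)
            feats.append(x)      # Conv2_3, Conv3_3, Conv4_6, Conv5_3 levels
        return feats             # five depth levels, shallow to deep

# Weight sharing: the same module processes the template frame F_t and the search frame F_c.
backbone = SiameseBackbone()
feats_t = backbone(torch.randn(1, 3, 127, 127))   # template frame
feats_c = backbone(torch.randn(1, 3, 255, 255))   # search frame
```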
(3) Design of the feature pyramid networks (Feature Pyramid Networks, FPN): three FPNs (FPN1, FPN2, FPN3) respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features.
The specific structure of a single FPN used in the present invention is shown in fig. 3. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. Feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and the fused feature F_M is finally output, where F_M has the same size as F_3. Finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c).
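As an illustration of the fusion just described, the sketch below combines the three feature maps of one group by channel alignment (1 × 1 convolutions), resolution alignment (3 × 3 stride-2 convolutions toward the size of F_3) and point-to-point addition. The channel counts, the use of interpolation to absorb off-by-one size differences, and all names are assumptions for illustration only.

```python
import torch.nn as nn

class FPNFusion(nn.Module):
    """Minimal sketch of one FPN block: fuse F1 (shallow/large), F2, F3 (deep/small)."""
    def __init__(self, c1, c2, c3, out_channels=512):
        super().__init__()
        # 1x1 convolutions align the channel counts of the three inputs.
        self.align1 = nn.Conv2d(c1, out_channels, kernel_size=1)
        self.align2 = nn.Conv2d(c2, out_channels, kernel_size=1)
        self.align3 = nn.Conv2d(c3, out_channels, kernel_size=1)
        # 3x3 stride-2 convolutions shrink the larger maps toward the size of F3.
        self.down1 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1),
        )
        self.down2 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, f1, f2, f3):
        f3 = self.align3(f3)
        # Interpolate so the point-to-point addition is well defined for any input sizes.
        f1 = nn.functional.interpolate(self.down1(self.align1(f1)), size=f3.shape[-2:])
        f2 = nn.functional.interpolate(self.down2(self.align2(f2)), size=f3.shape[-2:])
        return f1 + f2 + f3   # fused feature F_M, same spatial size as F3

# Illustrative use (channel counts are made up): F_M = FPNFusion(64, 256, 1024)(f1, f2, f3)
```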
(4) Region proposal networks (RPN): three RPNs (RPN1, RPN2, RPN3) respectively take as input the mixed features of the three pairs of template and search frames F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes, as shown in FIG. 2.
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes; these two different outputs are produced by two branches: the upper half of the RPN in fig. 2 outputs the classification CLS of the proposal boxes and the lower half outputs the regression REG. The RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, where c is the number of channels of the current mixed feature (the channel number differs between combinations). Then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
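The cross-correlation can be read as using the adjusted template feature as a convolution kernel that slides over the search feature. A minimal sketch is given below, assuming an up-channel style correlation in which the template branch carries num_out × C channels; the function name and tensor layouts are assumptions, not quotations of the disclosure.

```python
import torch
import torch.nn.functional as F

def up_channel_xcorr(template_feat, search_feat, num_out):
    """Sketch of the cross-correlation producing CLS_O / REG_O.
    template_feat: (B, num_out*C, 5, 5)  -- template branch after the adjustment conv
    search_feat:   (B, C, 29, 29)        -- search branch after the adjustment conv
    Returns a (B, num_out, 25, 25) response map (sizes follow the embodiment below)."""
    batch, _, th, tw = template_feat.shape
    channels = search_feat.shape[1]
    out = []
    for b in range(batch):  # per-sample correlation; the template acts as the kernel
        kernel = template_feat[b].view(num_out, channels, th, tw)
        out.append(F.conv2d(search_feat[b:b + 1], kernel))
    return torch.cat(out, dim=0)

# e.g. CLS_O = up_channel_xcorr(cls_template, cls_search, num_out=2 * k)
#      REG_O = up_channel_xcorr(reg_template, reg_search, num_out=4 * k)
```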
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k, as shown in FIG. 5. The output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image; each position of the w_res × h_res grid corresponds to k anchor boxes of preset sizes whose centres are the centre of the current position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth. The final target is then found by non-maximum suppression and similar methods.
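The offset formulas themselves appear as images in the original publication and are not reproduced here. Assuming the standard anchor parameterisation consistent with the surrounding definitions (dx = (T_x − A_x)/A_w, dy = (T_y − A_y)/A_h, dw = ln(T_w/A_w), dh = ln(T_h/A_h)), a predicted box can be recovered from REG as sketched below; this parameterisation and all names are assumptions of the illustration.

```python
import torch

def decode_boxes(anchors, deltas):
    """Sketch of recovering predicted boxes from REG outputs under the assumed
    parameterisation. anchors: (N, 4) as (Ax, Ay, Aw, Ah); deltas: (N, 4) as (dx, dy, dw, dh)."""
    ax, ay, aw, ah = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    tx = ax + dx * aw            # predicted centre x
    ty = ay + dy * ah            # predicted centre y
    tw = aw * torch.exp(dw)      # predicted width
    th = ah * torch.exp(dh)      # predicted height
    return torch.stack([tx, ty, tw, th], dim=1)
```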
(6) After CLS_O and REG_O are obtained, they are fed into the spatial attention module, as shown in FIG. 4; through average pooling, maximum pooling, convolution and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG.
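A minimal sketch of such a spatial attention module is given below: channel-wise average and maximum pooling, a convolution, a Sigmoid, an element-wise product with the input and a residual addition. The 7 × 7 kernel size is an assumption; the text only names the operations.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention applied to CLS_O / REG_O (kernel size assumed)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (B, C, H, W), e.g. CLS_O or REG_O
        avg_map = x.mean(dim=1, keepdim=True)    # average pooling over channels
        max_map, _ = x.max(dim=1, keepdim=True)  # maximum pooling over channels
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        # (B, 1, H, W) spatial weights SA; element-wise product plus residual addition.
        return x * weights + x
```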
(7) The output results of the three RPNs (RPN1, RPN2, RPN3) are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
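The weighted combination can be sketched as follows; the default weights 0.2/0.3/0.5 are taken from the embodiment described later, and the function name and argument layout are assumptions for illustration.

```python
def fuse_rpn_outputs(cls_list, reg_list, alphas=(0.2, 0.3, 0.5), betas=(0.2, 0.3, 0.5)):
    """Weighted addition of the CLS / REG maps from RPN1-RPN3 (step (7))."""
    cls = sum(a * c for a, c in zip(alphas, cls_list))   # CLS = sum_i alpha_i * CLS_i
    reg = sum(b * r for b, r in zip(betas, reg_list))    # REG = sum_i beta_i  * REG_i
    return cls, reg
```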
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates. y denotes the label value and ŷ denotes the actual classification value (i.e. P_pos); dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
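A sketch of the training objective under these definitions is given below, assuming the classification loss is the usual binary cross-entropy on P_pos and the regression loss the usual smoothed L1 on the four normalised offsets; the exact forms of formulas (3) and (4) appear as images in the original, so the concrete forms and tensor shapes here are assumptions.

```python
import torch.nn.functional as F

def tracking_loss(p_pos, y, reg_pred, reg_target, lam=1.0):
    """Sketch of the step (8) losses. p_pos, y: (N,) in [0, 1];
    reg_pred / reg_target: (N, 4) as (dx, dy, dw, dh) vs. ground-truth offsets."""
    l_cls = F.binary_cross_entropy(p_pos, y)        # L_cls: cross-entropy (cf. formula (3))
    l_reg = F.smooth_l1_loss(reg_pred, reg_target)  # L_reg: smoothed L1 (cf. formula (4))
    return l_cls + lam * l_reg                      # loss = L_cls + lambda * L_reg  (5)
```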
The key parameters of one embodiment of the present invention are listed in Table 1, and the specific parameters marked in the accompanying drawings follow these implementation parameters:
TABLE 1 example parameters
The specific training process of the target tracking network designed by the invention is shown in fig. 6; the specific training procedure and the related parameters of this embodiment are as follows:
The video sequences in the data set are processed according to the label information and cropped to obtain a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels.
The template frame F_t and the search frame F_c are fed into the feature extraction networks ResNet_N_1 and ResNet_N_2 of FIG. 2, and five features of different depth levels are extracted; the two feature extraction networks share weights.
Three feature pyramid networks, shown in fig. 3, FPN1, FPN2 and FPN3, fuse the features of the template frame F_t and of the search frame F_c extracted at different depth levels: FPN1 fuses the features obtained from the first, second and third blocks (layers), FPN2 fuses the features obtained from the first, second and fourth blocks (layers), and FPN3 fuses the features obtained from the first, second and fifth blocks (layers), as shown in fig. 2. The three pairs of FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c); the mixed features of the template frame are all of size 15 × 15 × 512 and the mixed features of the search frame are all of size 31 × 31 × 512.
The three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) are fed respectively into the three region proposal networks RPN1, RPN2 and RPN3, as shown in FIG. 2. Each region proposal network has the same structure, shown in fig. 4, with 5 anchor boxes, i.e. k = 5. First, the mixed feature of the template frame F_M(F_t) is cropped to remove the elements on its periphery, giving a size of 7 × 7 × 512; four convolution layers then adjust the channel numbers of F_M(F_t) and of the mixed feature of the search frame F_M(F_c), yielding: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512.
Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r yields the classification intermediate result CLS_O and the regression intermediate result REG_O, where CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20.
CLS_O and REG_O are fed into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r, and the results are added to the original CLS_O and REG_O to obtain the final RPN classification result CLS and regression result REG. CLS has the same size as CLS_O, and REG the same size as REG_O. The "spatial attention" block in the flow chart performs these steps.
The classification results and the regression results output by RPN1, RPN2 and RPN3 are weighted and added with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and the regression result of the proposal boxes. The loss is computed and optimised according to formulas (3), (4) and (5). When the preset number of 50 training rounds is reached, training ends and testing is carried out.
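For completeness, one training iteration following fig. 6 could look like the sketch below. The optimiser, learning rate and the names model, loader and tracking_loss (from the loss sketch above) are assumptions of this illustration; the original does not specify them.

```python
import torch

# Assumed: `model` returns per-anchor class probabilities and box offsets for a
# (template, search) pair, and `loader` yields pre-cropped training pairs with labels.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(50):                                  # 50 training rounds, then test
    for template, search, cls_label, reg_target in loader:
        cls_pred, reg_pred = model(template, search)     # weighted outputs of RPN1-RPN3
        loss = tracking_loss(cls_pred, cls_label, reg_pred, reg_target, lam=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```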
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any manner, except for mutually exclusive features and/or steps; and any non-essential addition to or replacement of the technical features of the technical solution of the invention made by a person skilled in the art falls within the protection scope of the invention.
Claims (2)
1. A target tracking method based on a multilayer feature mixing and attention mechanism is characterized by comprising the following steps:
(1) before training, preprocessing the data set: the training data consist of video sequences and carry labels of the position and size of the target object; the target tracking network takes as input a template frame corresponding to the tracking target and a search frame in which the target is to be found; the original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, wherein the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame;
(2) designing two parallel 5-block deep residual networks N_1, N_2 to extract the features of the template frame and the search frame, and forming a twin network N_S by sharing their weights, wherein the deep residual network removes the padding from the first 7 × 7 convolution of the existing "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions; the template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations; ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and the search frame F_c at different levels of the network, wherein M indicates the block of the ResNet in which the feature map resides and N indicates the specific position within that block;
(3) designing the feature pyramid networks FPN, comprising three FPNs: FPN1, FPN2 and FPN3 respectively take from networks N_1, N_2 the three groups of output features of different depths (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3) and fuse each group to obtain 3 groups of fused features; each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep; feature fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the channel number of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e. feature fusion, is performed; the 3 features are fused and the fused feature F_M is finally output, wherein F_M has the same size as F_3; finally, the three FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c);
(4) designing the region proposal networks RPN, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of mixed template-frame and search-frame features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) the RPN outputs the classification CLS and the regression REG of the proposal boxes, the two different outputs being produced by two branches, the upper half of the RPN outputting the classification CLS and the lower half outputting the regression REG; the RPN first crops the mixed feature of the template frame F_M(F_t) from the edges, wherein c is the number of channels of the current mixed feature, the channel number differing between combinations; then adjustment convolutions resize F_M(F_t) and F_M(F_c) into [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r; cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k; the output is linearly related in the w_res × h_res dimensions to the spatial positions of the original w_c × h_c image, each position of the w_res × h_res grid corresponding to k anchor boxes of preset sizes whose centres are the centre of the current position; the 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos of containing the target and the probability P_neg of not containing the target; the 4k channels of REG_O represent the length-width differences and position differences, denoted dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box, their relationship to the actual target box being as follows:
wherein A_x, A_y denote the centre point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the ground truth, the final target being found by non-maximum suppression and similar methods;
(6) after CLS_O and REG_O are obtained, they are fed into a spatial attention module, and spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained through average pooling, maximum pooling, convolution and Sigmoid activation; CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the results are added to the original CLS_O and REG_O to obtain the final RPN output results CLS and REG;
(7) the output results of the three RPNs: RPN1, RPN2 and RPN3 are weighted and added to form the final output of the target tracking network:
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) when training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses the smoothed L1 loss with normalised coordinates; y denotes the label value and ŷ denotes the actual classification value, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values; the loss functions are defined as:
wherein:
the final loss function is as follows:
loss = L_cls + λL_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
2. The target tracking method based on multilayer feature mixing and an attention mechanism according to claim 1, wherein training the target tracking network in step (8) specifically comprises:
processing the video sequences in the data set and cropping, according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels;
feeding the template frame F_t and the search frame F_c into the feature extraction networks ResNet_N_1 and ResNet_N_2 and extracting five features of different depth levels, wherein the two feature extraction networks share weights;
fusing, with three feature pyramid networks FPN1, FPN2 and FPN3, the features of the template frame F_t and the search frame F_c extracted at different depth levels, wherein FPN1 fuses the features obtained from the first, second and third blocks, i.e. layers one, two and three, FPN2 fuses the features obtained from the first, second and fourth blocks, i.e. layers one, two and four, and FPN3 fuses the features obtained from the first, second and fifth blocks, i.e. layers one, two and five; the three pairs of FPNs respectively output the mixed features of the template frame F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the mixed features of the search frame F_M_1(F_c), F_M_2(F_c), F_M_3(F_c), the mixed features of the template frame all having size 15 × 15 × 512 and the mixed features of the search frame all having size 31 × 31 × 512;
feeding the three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) respectively into the three region proposal networks RPN1, RPN2 and RPN3, wherein each region proposal network has the same structure and has 5 anchor boxes in total, i.e. k = 5; first cropping the mixed feature of the template frame F_M(F_t) to remove the elements on its periphery, giving a size of 7 × 7 × 512, and adjusting, through four convolution layers, the channel numbers of F_M(F_t) and of the mixed feature of the search frame F_M(F_c) to obtain: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512;
cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r to obtain the classification intermediate result CLS_O and the regression intermediate result REG_O, wherein CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20;
feeding CLS_O and REG_O into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r; multiplying CLS_O and REG_O position-wise by SA_c and SA_r and adding the results to the original CLS_O and REG_O to obtain the final RPN output classification result CLS and regression result REG, wherein CLS has the same size as CLS_O and REG has the same size as REG_O, the above steps being completed by the "spatial attention" block;
weighting and adding the classification results and the regression results output by RPN1, RPN2 and RPN3 with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and the regression result of the proposal boxes, and computing and optimising the loss according to formulas (3), (4) and (5); when the preset number of 50 training rounds is reached, training ends and testing is carried out.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518472.1A CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518472.1A CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111696137A true CN111696137A (en) | 2020-09-22 |
CN111696137B CN111696137B (en) | 2022-08-02 |
Family
ID=72479929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010518472.1A Active CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696137B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258557A (en) * | 2020-10-23 | 2021-01-22 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112288778A (en) * | 2020-10-29 | 2021-01-29 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
CN112308013A (en) * | 2020-11-16 | 2021-02-02 | 电子科技大学 | Football player tracking method based on deep learning |
CN112489088A (en) * | 2020-12-15 | 2021-03-12 | 东北大学 | Twin network visual tracking method based on memory unit |
CN112651954A (en) * | 2020-12-30 | 2021-04-13 | 广东电网有限责任公司电力科学研究院 | Method and device for detecting insulator string dropping area |
CN112669350A (en) * | 2020-12-31 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Adaptive feature fusion intelligent substation human body target tracking method |
CN112785624A (en) * | 2021-01-18 | 2021-05-11 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114120056A (en) * | 2021-10-29 | 2022-03-01 | 中国农业大学 | Small target identification method, small target identification device, electronic equipment, medium and product |
CN114399533A (en) * | 2022-01-17 | 2022-04-26 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114612519A (en) * | 2022-03-16 | 2022-06-10 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114663812A (en) * | 2022-03-24 | 2022-06-24 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030053658A1 (en) * | 2001-06-29 | 2003-03-20 | Honeywell International Inc. | Surveillance system and methods regarding same |
GB201908574D0 (en) * | 2019-06-14 | 2019-07-31 | Vision Semantics Ltd | Optimised machine learning |
CN110349185A (en) * | 2019-07-12 | 2019-10-18 | 安徽大学 | A kind of training method and device of RGBT target following model |
CN110544269A (en) * | 2019-08-06 | 2019-12-06 | 西安电子科技大学 | twin network infrared target tracking method based on characteristic pyramid |
CN110704665A (en) * | 2019-08-30 | 2020-01-17 | 北京大学 | Image feature expression method and system based on visual attention mechanism |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
CN111192292A (en) * | 2019-12-27 | 2020-05-22 | 深圳大学 | Target tracking method based on attention mechanism and twin network and related equipment |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030053658A1 (en) * | 2001-06-29 | 2003-03-20 | Honeywell International Inc. | Surveillance system and methods regarding same |
GB201908574D0 (en) * | 2019-06-14 | 2019-07-31 | Vision Semantics Ltd | Optimised machine learning |
CN110349185A (en) * | 2019-07-12 | 2019-10-18 | 安徽大学 | A kind of training method and device of RGBT target following model |
CN110544269A (en) * | 2019-08-06 | 2019-12-06 | 西安电子科技大学 | twin network infrared target tracking method based on characteristic pyramid |
CN110704665A (en) * | 2019-08-30 | 2020-01-17 | 北京大学 | Image feature expression method and system based on visual attention mechanism |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
CN111192292A (en) * | 2019-12-27 | 2020-05-22 | 深圳大学 | Target tracking method based on attention mechanism and twin network and related equipment |
Non-Patent Citations (3)
Title |
---|
HUANG LH ET AL: "《Bridging the Gap Between Detection and Tracking: A Unified Approach》", 《IEEE》 * |
SHEN Wenxiang et al.: "Indoor crowd detection network based on multi-level features and hybrid attention mechanism", Journal of Computer Applications (计算机应用) *
HU Tao: "Research on optical remote sensing target detection technology based on deep feature enhancement", China Master's Theses Full-text Database, Engineering Science and Technology II *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258557B (en) * | 2020-10-23 | 2022-06-10 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112258557A (en) * | 2020-10-23 | 2021-01-22 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112288778A (en) * | 2020-10-29 | 2021-01-29 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
CN112308013A (en) * | 2020-11-16 | 2021-02-02 | 电子科技大学 | Football player tracking method based on deep learning |
CN112489088A (en) * | 2020-12-15 | 2021-03-12 | 东北大学 | Twin network visual tracking method based on memory unit |
CN112651954A (en) * | 2020-12-30 | 2021-04-13 | 广东电网有限责任公司电力科学研究院 | Method and device for detecting insulator string dropping area |
CN112669350A (en) * | 2020-12-31 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Adaptive feature fusion intelligent substation human body target tracking method |
CN112785624A (en) * | 2021-01-18 | 2021-05-11 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN112785624B (en) * | 2021-01-18 | 2023-07-04 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114120056A (en) * | 2021-10-29 | 2022-03-01 | 中国农业大学 | Small target identification method, small target identification device, electronic equipment, medium and product |
CN114399533A (en) * | 2022-01-17 | 2022-04-26 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114399533B (en) * | 2022-01-17 | 2024-04-16 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
CN114612519A (en) * | 2022-03-16 | 2022-06-10 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114612519B (en) * | 2022-03-16 | 2024-10-18 | 西安理工大学 | Twin network target tracking method based on dual-template feature fusion |
CN114663812A (en) * | 2022-03-24 | 2022-06-24 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
CN114663812B (en) * | 2022-03-24 | 2024-07-26 | 清华大学 | Combined detection and tracking method, device and equipment based on multidimensional attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN111696137B (en) | 2022-08-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111696137B (en) | Target tracking method based on multilayer feature mixing and attention mechanism | |
CN110188239B (en) | Double-current video classification method and device based on cross-mode attention mechanism | |
CN110929736B (en) | Multi-feature cascading RGB-D significance target detection method | |
CN108537824B (en) | Feature map enhanced network structure optimization method based on alternating deconvolution and convolution | |
CN109443382A (en) | Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN112131908A (en) | Action identification method and device based on double-flow network, storage medium and equipment | |
Le et al. | A comprehensive review of recent deep learning techniques for human activity recognition | |
CN108805151B (en) | Image classification method based on depth similarity network | |
CN111179419A (en) | Three-dimensional key point prediction and deep learning model training method, device and equipment | |
CN111626159A (en) | Human body key point detection method based on attention residual error module and branch fusion | |
CN116343334A (en) | Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN114882234A (en) | Construction method of multi-scale lightweight dense connected target detection network | |
CN113807176A (en) | Small sample video behavior identification method based on multi-knowledge fusion | |
CN113673313A (en) | Gesture posture recognition method based on hierarchical convolutional neural network | |
CN116434010A (en) | Multi-view pedestrian attribute identification method | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN114743273B (en) | Human skeleton behavior recognition method and system based on multi-scale residual error map convolution network | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN115116139A (en) | Multi-granularity human body action classification method based on graph convolution network | |
CN114743162A (en) | Cross-modal pedestrian re-identification method based on generation of countermeasure network | |
CN110415261A (en) | A kind of the expression animation conversion method and system of subregion training | |
CN114240811A (en) | Method for generating new image based on multiple images | |
CN110197226B (en) | Unsupervised image translation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |