CN112489088A

CN112489088A - Twin network visual tracking method based on memory unit

Info

Publication number: CN112489088A
Application number: CN202011473954.6A
Authority: CN
Inventors: 于瑞云; 杨骞; 王开开; 李张杰
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-03-12

Abstract

A twin network visual tracking method based on a memory unit belongs to the technical field of target tracking and comprises the following steps: step 1, building a tracking model; step 2, obtaining initial target template characteristics; step 3, acquiring the corresponding position of the tracking target in the current video frame; step 4, cutting out an area where the target is located according to the position where the tracking target is located in the current frame found in the step 3 to be used as a target template, and inputting the target template into a target template branch of the tracking model to obtain new target template characteristics; and (4) reading the next frame of the video as the current frame, and turning to the step (3) to carry out the next iteration until all frames in the video are read and the iteration is finished. The method can effectively solve the problems of shielding, background mixing, violent change of the target form and the like in the visual tracking process, and improves the tracking robustness of the tracking model facing a complex environment.

Description

Twin network visual tracking method based on memory unit

Technical Field

The invention belongs to the technical field of target tracking, and particularly relates to a twin network visual tracking method based on a memory unit.

Background

The target tracking technology plays a very important role in the computer vision technology, is widely applied to the fields of intelligent transportation, security, sports, medical treatment, robot navigation, human-computer interaction and the like, and has great commercial value. The task to be completed by visual tracking is to select an interested area in an image sequence as a tracking target and obtain accurate information such as the position, specific form and motion track of the target in a plurality of next continuous image frames. From the technical development point of view, the research of the visual tracking technology can be divided into three stages, and in the first stage, a classical tracking method represented by Kalman filtering, mean value filtering, particle filtering and an optical flow method is adopted; in the second stage, a TLD model is taken as a representative visual tracking method based on detection, and a CSK algorithm is taken as a representative related filtering visual tracking method; in the third stage, a visual tracking method based on deep learning. However, in the visual tracking task, only the target labeling information of the first image can be used, and therefore, sufficient a priori knowledge is not available in the training process to ensure the accuracy of the tracking model. In addition, the visual tracking problem also faces the challenges of illumination change, serious shielding of a tracked target, background mixing, severe change of the target form, motion blurring and the like.

The twin network based visual tracking model translates the tracking problem into a picture block similarity matching problem. The twin network tracking model takes the target template image and the search image corresponding to the current frame in the video as the input vector of the network, and the area of the search image is usually larger. And obtaining respective features of the two tensors through a backbone network, and performing cross correlation operation by taking the features corresponding to the target template as convolution kernels and the features corresponding to the search pictures to finally obtain a similarity score map, wherein the position of the maximum point in the similarity score map is the position of the target in the current frame of the video.

However, in the conventional twin network tracking model, only the initial frame is used as the target template, and the network parameters and the target template of the model are not updated after the offline training of the model is completed. The fact that the network parameters are not updated means that huge landslide occurs in the tracking performance of the model when the model encounters unseen scenes or targets, and the tracking drift problem is caused when the target template is not updated, so that the targets in the video sequence are subjected to severe appearance change or are seriously shielded, and the like, which can cause the reduction of the robustness and the tracking accuracy of the model.

Disclosure of Invention

Aiming at the problems in the traditional twin network tracking model, the invention starts from the aspect of updating the target template, and utilizes the DWConv-LSTM memory unit to solve the problem of updating the target template from two aspects of time and space, thereby being capable of obtaining better tracking accuracy while improving the robustness of the model and having considerable use value.

The technical scheme of the invention is that a backbone network in a tracking model adopts a twin network based on residual errors, the model has two branches, the upper branch is a target template branch, and the lower branch is a search image branch. In the target template branch, the target template obtains the robust features of the target through a backbone network and a memory unit; in the search image branch, in order to adapt to the target scale change in the tracking process, multi-scale processing is carried out on the search images to obtain three search images with different scales, and the search images are subjected to feature extraction through a backbone network. And performing cross correlation operation on the features corresponding to the three search images with different scales and the features corresponding to the target template respectively to finally obtain three similarity score graphs, finding the optimal feature graph among the three, and determining the predicted position of the target by the average value of the K response values with the maximum value on the feature graph. And acquiring a new target template according to the detected target position, inputting the acquired target template into a target template branch of the tracking model, and learning the change of the target template in time and space by using a DWConv-LSTM memory unit.

The invention discloses a twin network visual tracking method based on a memory unit, which comprises the following specific steps:

step 1: building a tracking model;

the tracking model is divided into two branches: the method comprises the steps that a target template branch and a search image branch are respectively arranged, the target template branch consists of a backbone network and a DWConv-LSTM memory unit, the search image branch consists of a backbone network, and the backbone networks of the two branches are twin networks which are built on the basis of residual modules and share weights;

the DWConv-LSTM memory unit is essentially a long-short term memory network LSTM fused with deep separable convolution operation, the conventional LSTM can well describe time sequence information, but the internal part adopts a full-connection structure, so that the LSTM learns a large amount of information which is useless for describing time sequence and cannot depict the spatial characteristic information of a target like convolution operation; the invention adopts the depth separable convolution to replace the full connection structure in the LSTM, so that the target template characteristics output by the memory unit not only contain the variation information on the time sequence, but also contain the characteristic information on the space;

step 2: acquiring an initial target template characteristic F;

step 2.1: extracting characteristics e of an initial target template image through a target template branched backbone network₀，e₀The initial cell unit c in DWConv-LSTM memory unit is obtained by convolution operation of 3 × 3 and 1 × 1₀，e₀Obtaining an initial hidden layer state h in a DWConv-LSTM memory cell by performing a convolution operation on a branch consisting of another set of convolutions of 3 × 3 and convolutions of 1 × 1₀；

Step 2.2: c is to₀，h₀And e₀Inputting the target template into a DWConv-LSTM memory unit to obtain an initial target template characteristic F;

and step 3: acquiring the corresponding position of a tracking target in a current video frame;

step 3.1: cutting out a search image from a current frame of a video, and acquiring characteristics corresponding to a multi-scale search image;

assuming that the current frame is the tth frame (t is 1,2,3, …, n) of the video, the search image S is cut out from the current frame in the video_tEstablishing with S_tCorresponding multiscale search image collection

Multiscale search images in set S asObtaining a characteristic set corresponding to the multi-scale search image by a batch of search image branches of the tracking model

Step 3.2: acquiring a similarity score map;

the target template feature F at the current moment and the feature corresponding to the multi-scale search image obtained in step 3.1

Performing cross correlation operation according to the formula (1) to obtain a similarity score map set

In the formula (#) represents a convolution operator;

step 3.3: acquiring a corresponding position of a tracking target according to the similarity score map;

upsampling each similarity score map in the set r to obtain an upsampled similarity score map set

Finding the upsampled similarity score map where the maximum value is located in the set R, and marking the upsampled similarity score map as R_tTo R, to R_tComparing all the values to obtain K response value points with the maximum value, averaging the K response value points to obtain a response value point d, and finding a corresponding position of the d point in the current video frame, wherein the corresponding position of the d point is the position of the searched target;

and 4, step 4: cutting out an area where the target is located according to the position where the tracking target is located in the current frame found in the step 3 to be used as a target template, and inputting the target template into a target template branch of the tracking model to obtain a new target template characteristic F; and (4) reading the next frame of the video as the current frame, and turning to the step (3) to carry out the next iteration until all frames in the video are read and the iteration is finished.

The twin network visual tracking method based on the memory unit comprises the following steps:

in step 3, the 1 st frame of the initial video is taken as the current frame, i.e., the first iteration time t is 1, and then when the next iteration is performed from step 4 to step 3, the next frame of the video is taken as the current frame, i.e., t is t + 1.

In step 4, in the process of updating the target template feature F, the DWConv-LSTM memory unit outputs the cell unit c in the DWConv-LSTM memory unit in the previous updating process_t-1And hidden layer state h_t-1And the extracted features e of the target template branch backbone network_tAs input, cell units and hidden layer states are updated and target template features F are obtained.

The benefits of the invention are:

according to the invention, the memory unit based on DWConv-LSTM added in the target template branch of the tracking model can learn the appearance change trend of the target on a time sequence, and meanwhile, the convolution network is utilized to ensure the stability of the target on the space, so that the problems of shielding, background mixing, violent change of the target form and the like in the visual tracking process can be effectively solved, the tracking robustness of the tracking model facing a complex environment is improved, and meanwhile, the memory unit is further accelerated by using the depth separable convolution in the memory unit, so that the tracking model can ensure the real-time property, and the memory unit has important application value.

Drawings

FIG. 1 is a flow chart of a twin network visual tracking method based on memory units according to the present invention.

FIG. 2 tracking model overall architecture diagram of the present invention

FIG. 3 is a diagram of the basic network structure of the DWConv-LSTM memory unit of the present invention.

FIG. 4 is a diagram illustrating an exemplary cropping of a target template image according to the present invention.

FIG. 5 is a schematic diagram of the process of obtaining the initial cell unit and the hidden layer state of the memory cell according to the present invention.

FIG. 6 is an exemplary diagram of the present invention cropping a multi-scale search image in a video frame.

Fig. 7 and 8 are diagrams illustrating effects of the embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings.

In the method of the embodiment, the TensorFlow deep learning framework is adopted for algorithm implementation, and the operating system is Ubuntu 16.04LTS.

As shown in fig. 1, a twin network visual tracking method based on memory units includes the following steps:

step 1: building a tracking model;

as shown in fig. 2, the tracking model is divided into two branches, namely a target template branch and a search image branch, the target template branch is composed of a backbone network and a DWConv-LSTM memory unit, the search image branch is composed of a backbone network, and the backbone networks of the two branches are twin networks which are built based on residual modules and share weights;

FIG. 3 shows a network structure of DWConv-LSTM memory unit, which mainly includes a forgetting gate f_tAnd input gate i_tAnd an output gate o_tThree gating units and one cell unit c_tThe calculation of each gate is performed according to the equations (2), (3) and (4), respectively, where W_i、W_fAnd W_oAs a weight matrix, b_i、b_fAnd b_oIs an offset amount, e_tFeatures extracted for the target template branch backbone network, h_t-1For the hidden state at the last moment, () represents a convolution operator, and sigma represents a sigmoid activation function;

cell unit c_tUpdate of (2), hidden layer state h_tThe updating and the obtaining of the target template characteristics F are calculated according to formulas (5), (6) and (7) respectively; wherein, tanh represents the tanh activation function, c_t-1For the cell unit at the previous moment, (. star.) represents the convolution operator, W_cAnd W_tAs a weight matrix, b_cIs an offset amount, h_t-1The hidden layer state at the previous moment;

h_t＝o_t*tanh(c_t) (6)

F＝W_t*h_t (7)

the convolution operation of each gate is realized by using deep separable convolution, which is beneficial to capturing the spatial relation among the characteristics and reducing the network parameter number so as to accelerate the forward reasoning speed.

Step 2: acquiring an initial target template feature vector F;

step 2.1: and calculating the size Z of the area where the tracking target is located according to a formula (8) according to the width w and the height h of the given tracking target to be detected, wherein p represents the extended length and is calculated according to a formula (9). As shown in fig. 4, a side length is cut out from the original image with a given center position (cx, cy) of the target to be tracked as the center

The square area of (a) is the template image. The height-width dimension of the template image is adjusted to 127 x 127 to obtain the initial target template. As shown in FIG. 5, the initial target template gets the feature e through the backbone network of the target template branch₀，e₀Obtaining cell units c through two branches respectively₀Hiding layer state h₀Each branch consists of a convolution of 3 x 3 and a convolution of 1 x 1.

Z＝(w+2p)*(h+2p) (8)

p＝(w+h)/4 (9)

Step 2.2: c is to₀，h₀And e₀The initial target template characteristics F are obtained by inputting the target template characteristics F into a DWConv-LSTM memory unit.

assuming that the current frame is the tth frame of the video (t is 1,2,3, …, n), the area a of the region to be clipped is calculated as the equation (10) { a ═ a^-1,A⁰,A¹And p' represents the extended length and is calculated according to the formula (11). w and h represent the width and height, respectively, of a given tracking target, k being 1.05, kⁱRepresents the power i of k. Then, the target center detected in the previous frame of the video by the algorithm is used as a central point, and the side lengths of the target center are respectively cut out from the current frame of the video

The three square regions of (a) result in a search image, the result is shown in fig. 6. Then, the image sizes are adjusted to 255 x 255 to obtain a multi-scale search image set

Taking the multi-scale search images in the set S as a batch, and obtaining a feature set corresponding to the multi-scale search images through search image branching

Aⁱ＝kⁱ(w+4p′)*kⁱ(h+4p′),i＝-1,0,1 (10)

p′＝(w+h)/4 (11)

Step 3.2: acquiring a similarity score map;

target template feature F at current moment and multi-ruler obtained in step 3.1Searching for corresponding features of image

and 4, step 4: cutting out an area where the target is located according to the position where the tracking target is located in the current frame found in the step 3 to be used as a target template, inputting the target template into a target template branch of a tracking model to obtain a new target template characteristic F, reading the next frame of the video to be used as the current frame, and turning to the step 3 to perform the next iteration until all frames in the video are read and the iteration is finished.

In order to detect the effectiveness of the invention, Bird2 video sequences in the OTB100 data set are selected and tested, and the result is shown in fig. 7, which shows that the invention can effectively cope with the problems of occlusion, morphological change and the like in the video tracking process, and has good robustness. Here, a real scene is selected to test the method of the present invention, and the same method as the above implementation steps is adopted to track the target, and as a result, as shown in fig. 8, it can be seen from the example that the present invention can track the target well, and has a certain use value.

In conclusion, the twin network visual tracking method based on the memory unit can be used for learning the appearance change trend of the target on the time sequence, ensuring the stability of the target on the space, ensuring the real-time performance of the tracking process and improving the tracking robustness of the tracking model facing a complex environment.

Claims

1. A twin network visual tracking method based on a memory unit is characterized by comprising the following specific steps:

step 1: building a tracking model;

step 2: acquiring an initial target template characteristic F;

2. The twin network vision tracking method based on memory units as claimed in claim 1, wherein in step 1, the DWConv-LSTM memory unit is established by using deep separable convolution instead of full connection structure in long-short term memory network LSTM, so that the target template feature output by the memory unit contains both time-series variation information and spatial feature information.

3. The twin network visual tracking method based on memory units as claimed in claim 1, wherein the step 2 obtains an initial target template feature F, and the specific operation steps are as follows:

Step 2.2: c is to₀，h₀And e₀Inputting the target template features into a DWConv-LSTM memory unit, and acquiring initial target template features F.

4. The twin network visual tracking method based on memory unit as claimed in claim 1, wherein the step 3 obtains the corresponding position of the tracking target in the current frame of the video, and the specific operation steps are as follows:

cutting out search image S from current frame in video_tEstablishing with S_tCorresponding multiscale search image collection

Taking the multi-scale search image in the set S as a batch, obtaining a feature set corresponding to the multi-scale search image through the search image branch of the tracking model

Step 3.2: acquiring a similarity score map;

Performing cross correlation operation according to the following formula to obtain a similarity score atlasCombination of Chinese herbs

In the formula (#) represents a convolution operator;

Finding the upsampled similarity score map where the maximum value is located in the set R, and marking the upsampled similarity score map as R_tTo R, to R_tAnd comparing all the values to obtain K response value points with the maximum value, averaging the K response value points to obtain a response value point d, and finding the corresponding position of the d point in the current video frame, namely the position of the searched target.