CN117635458A - Video prediction method based on deep stream analysis network - Google Patents

Video prediction method based on deep stream analysis network

Info

Publication number
CN117635458A
Authority
CN
China
Prior art keywords
video
network
prediction
constructing
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311659020.5A
Other languages
Chinese (zh)
Inventor
金贝贝 (Jin Beibei)
宋晓辉 (Song Xiaohui)
李金东 (Li Jindong)
张鹏飞 (Zhang Pengfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Physics Henan Academy Of Sciences
Henan Academy of Sciences
Original Assignee
Institute Of Physics Henan Academy Of Sciences
Henan Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Physics Henan Academy Of Sciences, Henan Academy of Sciences filed Critical Institute Of Physics Henan Academy Of Sciences
Priority to CN202311659020.5A priority Critical patent/CN117635458A/en
Publication of CN117635458A publication Critical patent/CN117635458A/en
Pending legal-status Critical Current

Classifications

    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; Image merging
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video prediction method based on a deep stream analysis network, which predicts future scenes by parsing the optical flow into a rigid flow and a residual flow, where the rigid flow represents the scene dynamics caused by the observer's self-motion and the residual flow corresponds to the motion of other objects in the scene. Specifically, the method proposes an end-to-end unsupervised deep neural network that predicts future video frames by decomposing scene motion into self-motion (camera motion) and object-centric motion. The method improves the model's ability to parse scene dynamics and has practical significance and social value.

Description

Video prediction method based on deep stream analysis network
Technical Field
The invention belongs to the technical field of video analysis and prediction, and particularly relates to a video prediction method based on a deep stream analysis network.
Background
The ability to predict future situations from current and historical observations is critical to machine decision making. This task is relatively easy for humans but very challenging for machines. In recent years, computer vision researchers have turned their attention to the video prediction task: predicting future video frames from the frames already observed.
A robust and effective video prediction method must not only make full use of spatial semantic information but also accurately capture temporal motion patterns. Motion dynamics contain rich information about scene evolution, which is critical to understanding the environment, especially for autonomous vehicles. Existing methods almost always estimate the motion of background and foreground objects jointly, through direct optical flow or inter-frame differences; however, background and foreground motion in a scene have different origins: the former arises purely from the self-motion of the observer's camera, while the latter is the superposition of the camera's self-motion and the residual motion of the object itself. Existing methods therefore have limited ability to distinguish static from moving objects in a scene and cannot parse scene dynamics with high fidelity. This problem is further exacerbated in complex urban environments dense with dynamic objects.
Rushton et al. found that the human visual system contains a "flow parsing mechanism": the brain uses its sensitivity to optical flow to resolve retinal motion into components arising from self-motion or from object-centric motion, and depth information also plays an important role in this process. The self-motion component is first estimated from the visual stimulus that the observer's motion produces on the retina, and the "true" object-centric motion is then obtained by "subtracting" the self-motion from the retinal motion. This cognitive ability helps humans systematically solve problems and adapt to new situations. Drawing inspiration from this biological flow parsing mechanism, the present method decouples background change from object-centric residual motion through scene geometry reconstruction, thereby facilitating the inference of future frames in a video sequence.
Existing video prediction algorithms can be divided into deterministic and stochastic methods. Deterministic video prediction methods aim to minimize the reconstruction distance between the real future and the predicted result. Besides ensuring the quality of each predicted frame, they must also extract a temporal representation of the video sequence. Deterministic video prediction is of great significance for autonomous driving, robotic control, and similar applications, where sufficiently accurate predictions enable safer and more reliable decisions. Among deterministic approaches, direct pixel-synthesis models predict future pixel intensities frame by frame and implicitly model the dynamic and static content of a scene during feature extraction. Ranzato et al. use k-means to discretize video frames into a vocabulary of image patches, assuming that non-overlapping patches are distinct in the discretized space. Their method is based on a recurrent neural network and performs short-term prediction at the patch level; because the whole frame is assembled from predicted patches, predictions of large, fast-moving objects are accurate, but there is still room for improvement for small, slowly moving objects. Lotter et al. propose "PredNet", inspired by the neuroscience concept of predictive coding. PredNet consists of a series of repeatedly stacked modules that attempt to locally predict their own inputs; although it shows promising results, the temporal horizon it can predict is limited, so improving long-term prediction performance became an important focus of subsequent work. Jin et al. use generative adversarial networks to improve the realism of predictions. Inspired by the band-decomposition characteristics of the human visual system, Jin et al. further propose a video prediction method that exploits wavelet-based multi-frequency analysis for high fidelity and temporal consistency. Shouno et al. propose a hierarchical deep residual network to handle large motions, in which each layer predicts future states at a different spatial resolution; the predictions of these layers are combined through top-down connections to generate future frames. Another class of deterministic methods generates transformation matrices for video prediction, which amounts to estimating affine transformations between adjacent frames. Vondrick et al. handle future uncertainty and past memory by learning transformations, separating past memory from predictions of the future.
Stochastic video prediction methods treat future prediction as a multi-modal task and generally encode uncertainty as a sequence of latent variables. Such methods are typically based on generative adversarial networks, variational autoencoder structures, and the like. Babaeizadeh et al. first addressed stochastic multi-frame prediction, proposing a stochastic variational video prediction method that predicts a different possible future for each latent-variable sample. Denton et al. propose a stochastic video generation model that combines a deterministic frame predictor with time-varying random latent variables. Lee et al. were the first to produce high-quality predictions by combining a variational lower bound with adversarial training.
Although existing video prediction algorithms achieve reasonable performance, their lack of motion decoupling often results in blurred predicted video sequences and poor temporal consistency, which prevents them from performing well.
Disclosure of Invention
The embodiment of the invention discloses a video prediction method based on a deep stream analysis network, which predicts future scenes by parsing the optical flow into a rigid flow and a residual flow, where the rigid flow represents the scene dynamics caused by the observer's self-motion and the residual flow corresponds to the motion of other objects in the scene. Specifically, the method proposes an end-to-end unsupervised deep neural network that predicts future video frames by decomposing scene motion into self-motion (camera motion) and object-centric motion. The method improves the model's ability to parse scene dynamics and has practical significance and social value.
The technical scheme of the invention is as follows:
A video prediction method based on a deep stream analysis network comprises the following steps:
S1, acquiring a training sample;
S2, preprocessing video data;
S3, constructing a depth and pose prediction network;
based on a convolutional neural network architecture, removing the original fully connected layer and all subsequent layers and retaining only the convolution and pooling layers to construct the depth and pose prediction network;
S4, constructing a geometric rigid flow projection unit and connecting it after the convolution-and-pooling backbone retained in S3;
S5, constructing a residual flow network based on a convolutional neural network to output the residual flow, and adding the residual flow to the rigid flow to obtain the overall optical flow;
S6, constructing an LSTM module that takes the overall optical flow as input and memorizes temporal information;
S7, constructing a decoder module and connecting it to the LSTM constructed in S6 to obtain a video prediction network model M;
S8, training a video prediction model M;
S9, calculating training loss, and updating network parameters by using a back propagation algorithm;
S10, video frame prediction is carried out on the input video sequence by utilizing the trained network.
Further, the step S1 specifically includes:
Video sequence data sets are obtained from a database; the data sets include the KITTI data set, for video prediction in autonomous driving, and the Caltech Pedestrian data set. When training the network, one data set is used as the sole data set: a certain number of video frame sequences are extracted as input and the subsequent video frames are taken as the corresponding reference results; the same operation is then carried out with the other data set as the sole data set.
Further, the step S2 specifically includes:
S21, scaling: scale the video frames to θ times their original size, where θ ranges from 1.0 to 1.5 in this embodiment;
S22, cropping: randomly crop the original training samples into 320 × 320 pixel video sequences;
S23, HSL adjustment: multiply the hue (Hue), saturation (Saturation), and lightness (Lightness) of the cropped samples by a random value δ ∈ [1.0, 1.2] to simulate the illumination variation of natural environments;
S24, dividing the video sequence data set into a training set and a test set.
further, step S8 specifically includes:
A sequence of t consecutive video images X = {x₁, x₂, …, xₜ} is extracted from the input video sequence of S1; the sequence X is fed frame by frame into the video prediction network M constructed in S7 to extract features and predict the next video frame image x̂ₜ₊₁.
Further, step S9 specifically includes:
video frames to be predictedThe video prediction network inputted to S7 gets predicted +.>And so on until a k-frame video sequence to be predicted is obtained +.>The true video sequence s= { x 1 ,x 2 ,…,x t ,x t+1 ,x t+2 ,…,x t+k Video frame sequence of } and prediction +.>In contrast to this, the number of the cells,calculating loss, training a network model M by using a back propagation algorithm, wherein loss functions used in training are respectively as follows:
compared with the prior art, the invention has the beneficial technical effects that:
1) The invention provides a video prediction method based on a deep stream analysis network. In real scenes, the superposition of camera self-motion and object-centric motion produces complex dynamic evolution, and a full understanding of this evolution is necessary for the video prediction task. Previous studies have mostly focused on processing global motion, ignoring the entanglement between camera self-motion and object-centric motion, which leads to an incomplete understanding of overall scene dynamics. Inspired by the flow parsing mechanism of the human visual system, the method separates background change from object-centric residual motion through scene geometry reconstruction so as to facilitate the inference of future frames in a video sequence. Compared with traditional video prediction methods, the method perceives motion in the video better and thereby improves the accuracy and stability of prediction.
2) The present invention emphasizes the importance of disentangling camera self-motion and object-centric motion for future prediction. The optical flow is parsed into a rigid optical flow related to camera motion and a residual optical flow related to object-centric motion. In addition, content information is extracted from the historical frames in parallel by a fully convolutional neural network, and a better prediction result is achieved through a joint understanding of content and motion features.
3) The invention achieves a deep understanding of video motion by introducing a flow parsing mechanism, thereby improving the accuracy and stability of the model, so it has significant application value and broad prospects in the field of video prediction. In practical use, the predicted sequence is obtained simply by feeding the video sequence through the generation network in a single forward pass, and the method performs better than traditional video prediction methods.
Drawings
FIG. 1 is a flow chart of a video prediction method of the present invention;
FIG. 2 is a diagram of an embodiment of the present invention;
fig. 3 is a schematic diagram of a video prediction network structure according to the present invention.
Detailed Description
As shown in fig. 1-3, a video prediction method based on a deep stream parsing network includes the following steps:
s1, obtaining a training sample
Video sequence data sets are obtained from a database; the data sets include the KITTI data set, for video prediction in autonomous driving, and the Caltech Pedestrian data set. When training the network, one data set is used as the sole data set: a certain number of video frame sequences are extracted as input and the subsequent video frames are taken as the corresponding reference results; the same operation is then carried out with the other data set as the sole data set (a minimal sampling sketch under stated assumptions is given below);
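The following Python sketch illustrates one way S1 could assemble (input, reference) sample pairs from an ordered frame directory. The directory layout, the file extension, and the values of t and k are illustrative assumptions, not taken from the disclosure.

```python
# Hypothetical sketch of S1: building (observed, ground-truth) sample pairs from a
# frame-ordered video sequence. Paths, t, k and stride are illustrative only.
from pathlib import Path
from typing import List, Tuple

def build_samples(frame_dir: str, t: int = 10, k: int = 5,
                  stride: int = 1) -> List[Tuple[List[Path], List[Path]]]:
    """Slide a window over the ordered frames of one video sequence.

    Each sample is (t observed frames, k subsequent reference frames)."""
    frames = sorted(Path(frame_dir).glob("*.png"))  # e.g. extracted KITTI frames
    samples = []
    for start in range(0, len(frames) - (t + k) + 1, stride):
        inputs = frames[start:start + t]             # observed frames x1..xt
        targets = frames[start + t:start + t + k]    # reference frames x(t+1)..x(t+k)
        samples.append((inputs, targets))
    return samples

# Usage: one data set is used at a time, as the method prescribes, e.g.
# kitti_samples = build_samples("data/kitti/sequence_00", t=10, k=5)
```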
s2, preprocessing operation of video data
The step S2 specifically comprises the following steps:
S21, scaling: scale the video frames to θ times their original size, where θ ranges from 1.0 to 1.5 in this embodiment;
S22, cropping: randomly crop the original training samples into 320 × 320 pixel video sequences;
S23, HSL adjustment: multiply the hue (Hue), saturation (Saturation), and lightness (Lightness) of the cropped samples by a random value δ ∈ [1.0, 1.2] to simulate the illumination variation of natural environments;
S24, dividing the video sequence data set into a training set and a test set (a minimal preprocessing sketch under stated assumptions follows this list);
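A minimal sketch of S21 to S23, assuming OpenCV and NumPy are available and frames are RGB uint8 arrays. Drawing one random θ, δ and crop offset per clip, and clipping to OpenCV's 8-bit HLS ranges, are implementation assumptions rather than details from the disclosure.

```python
# Sketch of the S2 preprocessing: scaling by theta, random 320x320 crop, HSL jitter.
import random
from typing import List

import cv2
import numpy as np

def preprocess_clip(frames: List[np.ndarray]) -> List[np.ndarray]:
    theta = random.uniform(1.0, 1.5)           # S21: scale factor, 1.0 <= theta <= 1.5
    delta = random.uniform(1.0, 1.2)           # S23: HSL gain, 1.0 <= delta <= 1.2
    h, w = frames[0].shape[:2]
    sh, sw = int(h * theta), int(w * theta)
    y = random.randint(0, sh - 320)            # S22: one crop offset shared by the clip
    x = random.randint(0, sw - 320)
    out = []
    for f in frames:
        f = cv2.resize(f, (sw, sh))                        # S21: scaling
        f = f[y:y + 320, x:x + 320]                        # S22: 320x320 crop
        hls = cv2.cvtColor(f, cv2.COLOR_RGB2HLS).astype(np.float32)
        hls *= delta                                       # S23: jitter H, L and S
        hls[..., 0] = np.clip(hls[..., 0], 0, 179)         # OpenCV 8-bit hue range
        hls[..., 1:] = np.clip(hls[..., 1:], 0, 255)
        out.append(cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2RGB))
    return out
```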
s3, constructing a depth and pose prediction network;
Based on a convolutional neural network architecture, the original fully connected layer and all subsequent layers are removed and only the convolution and pooling layers are retained, so as to construct the depth and pose prediction network;
S4, constructing a geometric rigid flow projection unit and connecting it after the convolution-and-pooling backbone retained in S3; a sketch of the rigid-flow computation under stated assumptions is given below;
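The rigid flow can be sketched with the standard reprojection relation p′ ∼ K(R·D(p)·K⁻¹p + t), where D is the predicted depth, K the camera intrinsics, and (R, t) the relative camera pose from the pose branch of S3. The tensor shapes and this particular formulation are generic assumptions, not necessarily the patent's exact implementation.

```python
# Illustrative geometric rigid-flow projection for S4: back-project pixels with the
# predicted depth, apply the estimated ego-motion, re-project, and take the offset.
import torch

def rigid_flow(depth: torch.Tensor,      # (B, 1, H, W) predicted depth
               K: torch.Tensor,          # (B, 3, 3) camera intrinsics
               R: torch.Tensor,          # (B, 3, 3) relative rotation
               t: torch.Tensor) -> torch.Tensor:  # (B, 3, 1) relative translation
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, -1, -1)
    cam = torch.inverse(K) @ pix * depth.reshape(B, 1, -1)   # back-project to 3D
    cam2 = R @ cam + t                                       # apply camera ego-motion
    proj = K @ cam2                                          # re-project to the image
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)        # perspective divide
    flow = proj - pix[:, :2]                                 # rigid flow = p' - p
    return flow.reshape(B, 2, H, W)
```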
S5, constructing a residual flow network based on a convolutional neural network to output the residual flow, and adding the residual flow to the rigid flow to obtain the overall optical flow;
S6, constructing an LSTM module that takes the overall optical flow as input and memorizes temporal information;
S7, constructing a decoder module and connecting it to the LSTM constructed in S6 to obtain a video prediction network model M (a skeleton of model M under stated assumptions is sketched below);
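The skeleton below shows how the modules of S3 to S7 could be wired together. Only the overall wiring (rigid flow plus residual flow, fed through an LSTM and a decoder) follows the text; the layer sizes, module names, and the use of a convolutional LSTM (so that the temporal memory keeps spatial structure) are illustrative assumptions.

```python
# High-level, assumed skeleton of the prediction model M assembled in S3-S7.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 4 * ch, kernel_size=3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class FlowParsingPredictor(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        # S5: residual-flow CNN driven by two consecutive RGB frames (6 channels)
        self.residual_net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 3, padding=1))
        self.flow_enc = nn.Conv2d(2, ch, 3, padding=1)    # embed the overall flow
        self.lstm = ConvLSTMCell(ch)                      # S6: temporal memory
        self.decoder = nn.Sequential(                     # S7: frame decoder
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, frames, rigid_flows):
        # frames: (B, T, 3, H, W); rigid_flows: (B, T-1, 2, H, W) from S3/S4
        B, T, _, H, W = frames.shape
        h = frames.new_zeros(B, self.lstm.gates.in_channels // 2, H, W)
        c = torch.zeros_like(h)
        pred = None
        for i in range(T - 1):
            pair = torch.cat([frames[:, i], frames[:, i + 1]], dim=1)
            residual = self.residual_net(pair)            # object-centric motion
            overall = rigid_flows[:, i] + residual        # S5: overall optical flow
            h, c = self.lstm(self.flow_enc(overall), h, c)
            pred = self.decoder(h)                        # last output approximates x_{t+1}
        return pred
```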
s8, training a video prediction model M;
the step S8 specifically comprises the following steps:
A sequence of t consecutive video images X = {x₁, x₂, …, xₜ} is extracted from the input video sequence of S1, where xᵢ denotes the i-th frame image; the sequence X is fed frame by frame into the video prediction network M constructed in S7 to extract features and predict the next video frame image x̂ₜ₊₁, i.e. the image frame at time t+1.
S9, calculating training loss, and updating network parameters by using a back propagation algorithm
The step S9 specifically comprises the following steps:
video frames to be predictedThe video prediction network inputted to S7 gets predicted +.>I.e. the image frame at time t +2, and so on until a k-frame video sequence to be predicted is obtained>The true video sequence s= { x 1 ,x 2 ,…,x t ,x t+1 ,x t+2 ,…,x t+k And (3)Predicted video frame sequence->In contrast, the loss is calculated, the network model M is trained by using a back propagation algorithm, and loss functions used in training are respectively as follows:
s10, video frame prediction is carried out on the input video sequence by utilizing the trained network.
The above embodiments only illustrate preferred embodiments of the present invention and are not intended to limit its scope; without departing from the design spirit of the present invention, various modifications and improvements made by those skilled in the art to the technical solution of the present invention shall fall within the protection scope defined by the claims of the present invention.

Claims (1)

1. A video prediction method based on a deep stream analysis network, characterized by comprising the following steps:
s1, acquiring a training sample;
s2, preprocessing video data;
s3, constructing a depth and pose prediction network;
based on a convolutional neural network architecture, removing the original fully connected layer and all subsequent layers and retaining only the convolution and pooling layers to construct the depth and pose prediction network;
S4, constructing a geometric rigid flow projection unit and connecting it after the convolution-and-pooling backbone retained in S3;
S5, constructing a residual flow network based on a convolutional neural network to output the residual flow, and adding the residual flow to the rigid flow to obtain the overall optical flow;
S6, constructing an LSTM module that takes the overall optical flow as input and memorizes temporal information;
S7, constructing a decoder module and connecting it to the LSTM constructed in S6 to obtain a video prediction network model M;
s8, training a video prediction model M;
s9, calculating training loss, and updating network parameters by using a back propagation algorithm;
s10, video frame prediction is carried out on the input video sequence by utilizing the trained network.
CN202311659020.5A 2023-12-05 2023-12-05 Video prediction method based on deep stream analysis network Pending CN117635458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311659020.5A CN117635458A (en) 2023-12-05 2023-12-05 Video prediction method based on deep stream analysis network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311659020.5A CN117635458A (en) 2023-12-05 2023-12-05 Video prediction method based on deep stream analysis network

Publications (1)

Publication Number Publication Date
CN117635458A true CN117635458A (en) 2024-03-01

Family

ID=90030215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311659020.5A Pending CN117635458A (en) 2023-12-05 2023-12-05 Video prediction method based on deep stream analysis network

Country Status (1)

Country Link
CN (1) CN117635458A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108184128A (en) * 2018-01-11 2018-06-19 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on deep neural network
US10814815B1 (en) * 2019-06-11 2020-10-27 Tangerine Innovation Holding Inc. System for determining occurrence of an automobile accident and characterizing the accident
CN113156959A (en) * 2021-04-27 2021-07-23 东莞理工学院 Self-supervision learning and navigation method of autonomous mobile robot in complex scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEOKJU LEE et al.: "Learning Residual Flow as Dynamic Motion from Stereo Videos", arXiv, 16 September 2019 (2019-09-16), pages 1-7 *
JIN Beibei et al.: "Spatio-temporal Wavelet Analysis Video Prediction Algorithm Based on Differential Attention" (基于差分注意力的时空小波分析视频预测算法), Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), 28 February 2022 (2022-02-28), pages 180-183 *

Similar Documents

Publication Publication Date Title
Wang et al. Predrnn: A recurrent neural network for spatiotemporal predictive learning
CN110458844B (en) Semantic segmentation method for low-illumination scene
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
Wang Research on sports training action recognition based on deep learning
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN114550223B (en) Person interaction detection method and device and electronic equipment
Jung et al. Goal-directed behavior under variational predictive coding: Dynamic organization of visual attention and working memory
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114612414B (en) Image processing method, model training method, device, equipment and storage medium
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
CN113255514B (en) Behavior identification method based on local scene perception graph convolutional network
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
Xu et al. AutoSegNet: An automated neural network for image segmentation
Du et al. Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles
CN114842542B (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN113971826B (en) Dynamic emotion recognition method and system for estimating continuous titer and arousal level
CN117635458A (en) Video prediction method based on deep stream analysis network
CN117022305A (en) Human factor intelligent driving behavior prediction method, system, terminal equipment and storage medium
CN117454119A (en) Urban rail passenger flow prediction method based on dynamic multi-graph and multidimensional attention space-time neural network
CN116402874A (en) Spacecraft depth complementing method based on time sequence optical image and laser radar data
Lee et al. Boundary-aware camouflaged object detection via deformable point sampling
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination