FandosA/Normal_Flow_Prediction

Normal Flow Prediction

This project consists of a deep learning model that predicts the normal flow between two consecutive frames, where the normal flow is the projection of the optical flow onto the image gradient directions. The model was trained on the TartanAir dataset and is an encoder-decoder with residual blocks based on EVPropNet.

Normal Flow

The normal flow is the projection of the optical flow onto the gradient directions of the image and serves as a representation of image motion. To compute it, the brightness constancy constraint has to be applied. This constraint is one of the fundamental assumptions in optical flow computation and computer vision. It is based on the idea that the intensity (or a function of the intensity) at a pixel remains constant over two consecutive frames. The mathematical expression of this constraint is:

$I(x,y,t)=I(x + u \delta t, y + v \delta t, t + \delta t)$

where $u$ and $v$ are the components of the optical flow at the pixel $(x, y)$ (i.e., the motion of the image pixel from time $t$ to $t+1$). Writing $\delta x = u \delta t$ and $\delta y = v \delta t$, the equation can be rewritten as:

$I(x,y,t)=I(x + \delta x, y + \delta y, t + \delta t)$

Approximating the right-hand side of the previous equation with a first-order Taylor expansion, we obtain:

$I(x,y,t)=I(x,y,t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t$

Subtracting $I(x,y,t)$ from both sides of the equation and then dividing by $\delta t$ gives:

$0 = \frac{\partial I}{\partial x} \frac{\delta x}{\delta t} + \frac{\partial I}{\partial y} \frac{\delta y}{\delta t} + \frac{\partial I}{\partial t} = I_x \frac{\delta x}{\delta t} + I_y \frac{\delta y}{\delta t} + I_t$

Finally, since the optical flow $(u,v) = (\delta x / \delta t, \delta y / \delta t)$ represents the motion of the image pixel from time $t$ to $t+1$, this last equation can be rewritten as:

$0 = I_x u + I_y v + I_t$

This equation represents the constraint line. That is, for any point $(x,y)$ in the image, its optical flow $(u,v)$ lies on this line. The image below shows an example of this line alongside the gradient vector at a pixel and the corresponding optical flow vector at that pixel. As shown in the image, the optical flow vector (blue arrow) can be decomposed into two components: the normal flow (red arrow), which is perpendicular to the constraint line, and the parallel flow (green arrow), which runs along it.
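The constraint can be checked numerically on a synthetic example. The sketch below is only an illustration and is not taken from the repository: the sinusoidal `pattern`, the flow values, and the use of `np.gradient` and a plain frame difference for the derivatives are all assumptions chosen for the demonstration.

```python
import numpy as np

# Hypothetical smooth intensity pattern, used only for this check.
def pattern(x, y):
    return np.sin(0.1 * x + 0.07 * y)

# Assumed constant optical flow in pixels per frame (small, so that the
# first-order Taylor expansion is a good approximation).
u, v = 0.2, -0.1

# Pixel coordinate grids: y indexes rows, x indexes columns.
y, x = np.mgrid[0:64, 0:64].astype(np.float64)

frame0 = pattern(x, y)          # I(x, y, t)
frame1 = pattern(x - u, y - v)  # I(x, y, t + 1): the pattern shifted by (u, v)

# np.gradient returns derivatives in axis order: rows (y) first, then columns (x).
I_y, I_x = np.gradient(frame0)
I_t = frame1 - frame0           # temporal derivative with delta t = 1

# Residual of the constraint 0 = I_x u + I_y v + I_t.
residual = I_x * u + I_y * v + I_t

# Border pixels use one-sided differences and are less accurate, so skip them.
inner = residual[1:-1, 1:-1]
print("max |residual|:", np.abs(inner).max())
print("max |I_t|:     ", np.abs(I_t).max())
```

The residual should come out far smaller than $I_t$ itself, which is what the constraint line equation predicts for small motions.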

Therefore, to compute the normal flow vector, it is necessary to calculate the unit vector perpendicular to the constraint line and the magnitude of the normal flow, which corresponds to the distance from the origin to the constraint line. The mathematical expressions of these components are:

$\hat{u}_n = \frac{(I_x, I_y)}{\sqrt{I_x^2 + I_y^2}}$

$|u_n| = \frac{|I_t|}{\sqrt{I_x^2 + I_y^2}}$

The sign of $I_t$ determines whether the normal flow points along or against the gradient, so the signed value $-I_t = I_xu + I_yv$, obtained from the constraint line equation, is used when combining the unit vector and the magnitude. The final normal flow vector is:

$u_n = \frac{-I_t}{\sqrt{I_x^2 + I_y^2}} \cdot \frac{(I_x,I_y)}{\sqrt{I_x^2 + I_y^2}} = \frac{I_xu + I_yv}{I_x^2 + I_y^2} (I_x,I_y)$
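As an illustration, the final formula can be applied per pixel to a dense optical flow field. The following is a minimal NumPy sketch, not the repository's code: the function name `normal_flow`, the grayscale input, the use of `np.gradient` for $I_x$ and $I_y$, and the `eps` threshold for flat regions are all assumptions.

```python
import numpy as np

def normal_flow(flow, image, eps=1e-6):
    """Project optical flow onto the image gradient direction.

    flow:  (H, W, 2) array holding the optical flow components (u, v).
    image: (H, W) grayscale image (assumed; an RGB frame would need to be
           converted first).
    Returns an (H, W, 2) array with the normal flow at each pixel.
    """
    # Spatial derivatives; np.gradient returns the row (y) derivative first.
    I_y, I_x = np.gradient(image.astype(np.float64))

    grad_sq = I_x ** 2 + I_y ** 2                  # I_x^2 + I_y^2
    dot = I_x * flow[..., 0] + I_y * flow[..., 1]  # I_x u + I_y v

    # Where the gradient vanishes the normal flow is undefined; this sketch
    # sets it to zero there (an assumption, not the author's stated choice).
    scale = np.where(grad_sq > eps, dot / np.maximum(grad_sq, eps), 0.0)

    return np.stack([scale * I_x, scale * I_y], axis=-1)


# Worked example: an intensity ramp with gradient (I_x, I_y) = (3, 4)
# and a constant optical flow (u, v) = (2, 1).
yy, xx = np.mgrid[0:8, 0:8].astype(np.float64)
ramp = 3.0 * xx + 4.0 * yy
flow = np.zeros((8, 8, 2))
flow[..., 0] = 2.0
flow[..., 1] = 1.0

n = normal_flow(flow, ramp)
print(n[0, 0])  # approximately (1.2, 1.6)
```

In the worked example, $I_xu + I_yv = 10$ and $I_x^2 + I_y^2 = 25$, so the normal flow is $\frac{10}{25}(3, 4) = (1.2, 1.6)$, a vector of length 2 along the gradient direction.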

Autoencoder

The deep learning model chosen to predict the normal flow between two consecutive frames is an autoencoder. This autoencoder is based on EVPropNet, which in turn is based on ResNet. The encoder contains residual blocks with convolutional layers and the decoder contains residual blocks with transposed convolutional layers. The network is trained by backpropagating a mean squared error loss computed between the ground truth and predicted normal flow:

$\arg\min \| n - \hat{n} \|_2^2$

The model takes two concatenated frames as input and outputs a two-channel matrix containing the two components of the normal flow at each pixel. That is, the dimensions of the input and output tensors are $(h,w,6)$ and $(h,w,2)$, respectively.
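To make the input and output shapes concrete, here is a heavily reduced PyTorch sketch of an encoder-decoder with residual blocks. It is not the architecture used in this repository or in EVPropNet: the class names `ResBlock` and `TinyNormalFlowNet`, the layer counts, channel widths, and kernel sizes are all assumptions. PyTorch also uses a channels-first layout, so the tensors are `(batch, 6, h, w)` and `(batch, 2, h, w)` rather than $(h,w,6)$ and $(h,w,2)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    """Residual block that keeps the spatial size and channel count.

    `conv` selects the layer type: nn.Conv2d for the encoder,
    nn.ConvTranspose2d for the decoder (hypothetical design).
    """

    def __init__(self, channels, conv=nn.Conv2d):
        super().__init__()
        self.body = nn.Sequential(
            conv(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            conv(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # skip connection


class TinyNormalFlowNet(nn.Module):
    """Illustrative encoder-decoder: 6 input channels, 2 output channels."""

    def __init__(self, base=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, base, 3, stride=2, padding=1), nn.ReLU(),        # h/2
            ResBlock(base),
            nn.Conv2d(base, 2 * base, 3, stride=2, padding=1), nn.ReLU(),  # h/4
            ResBlock(2 * base),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * base, base, 4, stride=2, padding=1),    # h/2
            nn.ReLU(),
            ResBlock(base, conv=nn.ConvTranspose2d),
            nn.ConvTranspose2d(base, 2, 4, stride=2, padding=1),           # h
        )

    def forward(self, frames):
        return self.decoder(self.encoder(frames))


model = TinyNormalFlowNet()
frames = torch.randn(1, 6, 64, 64)   # two RGB frames stacked along channels
target = torch.zeros(1, 2, 64, 64)   # placeholder ground truth normal flow
pred = model(frames)
loss = F.mse_loss(pred, target)      # mean squared error between n and n_hat
loss.backward()
print(pred.shape, loss.item())
```

With two stride-2 stages, this sketch needs the height and width to be multiples of 4 for the output to match the input resolution exactly.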

Run the implementation

As mentioned before, the dataset used to train the model is TartanAir. This dataset provides many image sequences of different scenarios created in Unreal Engine, together with depth maps, optical flow, camera positions and orientations for each image, and more. You need to visit their website, download the scenarios you want, and organize the images and their optical flow data in the same way as in the dataset/train/ folder of this repository. When the data is correctly organized, run the file

python dataset.py

This will create a JSON file like the one in this repository, containing the paths to all images and optical flow data. Then run

python train.py

and the model will start training. A folder like the one here called autoencoder/ will be created. The training checkpoints, as well as the loss values, will be saved here. At the end of the training, an image showing the loss curves will also be saved. You can check the folder in this repository to see what it looks like and the loss curves I have obtained.

To test the model, organise the dataset in the same way as before but using the dataset/test/ folder instead, and run the test file

python test.py

You only have to enter the name of the checkpoint you want to use. In my case, I run

python test.py --checkpoint=checkpoint_395_best.pth

because that's the name of the checkpoint where the loss value was the lowest in my training.

Results

The following video shows the results obtained on the hospital scenario of the dataset. It contains four panels: the top left is the original image, the top right is the optical flow, the bottom left is the ground truth normal flow and the bottom right is the predicted normal flow.

https://www.youtube.com/watch?v=17LuuflLbSo