# Automated Video Captioning using S2VT

## Introduction

This repository contains my implementation of a video captioning system. The system takes a **video** as input and generates a **caption** describing the event in the video. I took inspiration from [Sequence to Sequence -- Video to Text](https://vsubhashini.github.io/s2vt.html), a video captioning approach proposed by researchers at the University of Texas at Austin.

## Requirements

To run my code and reproduce the results, the following packages need to be installed first. I used Python 2.7 throughout this project.

Packages:
* TensorFlow
* Caffe
* NumPy
* cv2
* imageio
* scikit-image

## S2VT - Architecture and working

Attached below is the **architecture diagram of S2VT** as given in the [paper](http://www.cs.utexas.edu/users/ml/papers/venugopalan.iccv15.pdf).

![Arch_S2VT](images/Arch_S2VT.png)

The **working** of the system while generating a caption for a given video is shown diagrammatically below.

![S2VT_Working](images/S2VT.png)

## Running instructions

1. Install all the packages mentioned in the 'Requirements' section for the smooth running of this project.
2. Using [Vid2Url_Full.txt](text_files/Vid2Url_Full.txt), download the dataset clips from YouTube and store in . (A rough download sketch is given at the end of this README.)
   * Example of using Vid2Url - {'vid1547': 'm1NR0uNNs5Y_104_110'}
   * YouTube video identifier - m1NR0uNNs5Y
   * Start time - 104 seconds, end time - 110 seconds
   * Download the frames between 104 seconds and 110 seconds of https://www.youtube.com/watch?v=m1NR0uNNs5Y
   * The relevant frames for video id 'vid1547' have now been downloaded
3. Pass the downloaded video paths and a batch size (depending on hardware constraints) to extract_feats() in [Extract_Feats.py](Extract_Feats.py) to extract VGG16 features for the downloaded video clips and store in . (A usage sketch is given at the end of this README.)
4. Change the paths in lines 13 to 16 of [utils.py](utils.py) to point to directories in your workspace.
5. Run [training_vidcap.py](training_vidcap.py) with the number of epochs as a command line argument, e.g. `python training_vidcap.py 10`.
6. Pass the checkpoint files saved in Step 5 to [test_videocap.py](test_videocap.py) to run the trained model on the validation set.

## Sample results

Attached below are a few screenshots of caption generation for videos from the **validation set**.

![Result1](images/Res1.png)

![Result2](images/Res2.png)

## Dataset

Even though S2VT was trained on MSVD, M-VAD and MPII-MD, I have trained my system **only on MSVD**, which can be downloaded [here](https://www.microsoft.com/en-us/download/details.aspx?id=52422).

## Demo

A **demo of my system** can be found [here](https://www.youtube.com/watch?v=tmLzgFdI7Xg).

## Acknowledgements

* [Sequence to Sequence -- Video to Text](https://arxiv.org/abs/1505.00487) - Subhashini Venugopalan et al.
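
## Example sketches

The two sketches below illustrate Steps 2 and 3 of the running instructions. They are rough, hypothetical examples rather than part of the original pipeline: the output directory name, the helper variables, and the use of the `youtube-dl` and `ffmpeg` command-line tools are my assumptions, so adapt them to your workspace.

First, a minimal sketch of Step 2, assuming [Vid2Url_Full.txt](text_files/Vid2Url_Full.txt) holds a Python-style dict literal mapping video ids to `<youtube_id>_<start>_<end>` strings (as in the example above), and that `youtube-dl` and `ffmpeg` are installed and on the PATH:

```python
# Hypothetical download sketch -- directory names and tool choices are assumptions.
import ast
import os
import subprocess

VID2URL_PATH = 'text_files/Vid2Url_Full.txt'   # mapping file from this repo
CLIP_DIR = 'YouTubeClips'                      # assumed output directory

with open(VID2URL_PATH) as f:
    vid2url = ast.literal_eval(f.read())       # e.g. {'vid1547': 'm1NR0uNNs5Y_104_110'}

if not os.path.exists(CLIP_DIR):
    os.makedirs(CLIP_DIR)

for vid_id, entry in vid2url.items():
    # YouTube identifiers can themselves contain underscores, so split from the right.
    youtube_id, start, end = entry.rsplit('_', 2)
    url = 'https://www.youtube.com/watch?v=' + youtube_id
    full_video = os.path.join(CLIP_DIR, youtube_id + '_full.mp4')
    clip_path = os.path.join(CLIP_DIR, vid_id + '.mp4')

    # Download the whole video, then cut out the annotated [start, end] segment.
    subprocess.call(['youtube-dl', '-f', 'mp4', '-o', full_video, url])
    subprocess.call(['ffmpeg', '-i', full_video, '-ss', start, '-to', end,
                     '-c', 'copy', clip_path])
```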
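
For Step 3, a sketch of invoking the feature extraction, assuming extract_feats() accepts a list of video paths and a batch size; check [Extract_Feats.py](Extract_Feats.py) for the exact signature and for where the VGG16 features are written:

```python
# Hypothetical usage of extract_feats(); verify the real signature in Extract_Feats.py.
import glob

from Extract_Feats import extract_feats

video_paths = sorted(glob.glob('YouTubeClips/*.mp4'))  # clips downloaded in Step 2
batch_size = 10                                        # tune to your hardware

# Assumed call: (video paths, batch size); expected to save VGG16 features per clip.
extract_feats(video_paths, batch_size)
```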