

πŸ” Shotluck Holmes

Large Language Vision Models For Shot-Level Video Understanding (Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain)

Read the Preprint Here »


Table of Contents
  1. Introduction
  2. 🔧 Requirements and Installation
  3. Data Pre-processing
  4. Finetuning
  5. Model
  6. Results

Introduction

Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence, except that multiple streams of information (such as visual and auditory data) must be processed simultaneously. Comprehending the entire video therefore requires not only understanding the audio-visual content of each shot but also linking the ideas across shots into a larger, all-encompassing story. Despite significant progress in the field, current work often overlooks videos' more granular shot-by-shot semantic information. In this project, we propose Shotluck Holmes, a family of efficient large language vision models (LLVMs) for video summarization and captioning. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from understanding a single picture to understanding a sequence of frames. Specifically, we show that Shotluck Holmes outperforms state-of-the-art results on the Shot2Story video captioning and summarization task with significantly smaller and more computationally efficient models.

🔧 Requirements and Installation

  1. Clone this repository and navigate to the folder
git clone https://github.com/Skyline-9/Shotluck-Holmes.git
cd Shotluck-Holmes
  2. Install packages
conda create -n shotluck python=3.10 -y
conda activate shotluck
cd model
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
cd ..
pip install flash-attn==2.5.8 --no-build-isolation  # upgrade to this version of flash-attn for H100
# pip install flash-attn==1.0.9 --no-build-isolation  # downgrade to flash attention v1 for older GPUs

Alternatively, you can run setup-speedrun.sh from the root directory to execute all of the commands above:

sh scripts/setup-speedrun.sh
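
To sanity-check the environment, you can verify that PyTorch sees your GPU and that flash-attn imports cleanly (a minimal check, not part of the repository's scripts):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"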

Data Pre-processing

Note: all the following commands should be run from the project root directory

Downloading

Raw annotations should already be downloaded with this repository. If the annotations are missing, download them by running

sh data/scripts/download/download_annotations.sh

If running on the Shot2Story dataset, follow bytedance/Shot2Story#5 to download the data and extract the videos into data/raw/videos.

Pre-processing

First, process the videos by running process_videos.py in scripts/data/process, which uses ffmpeg to split each video into separate per-shot files. Then, convert the annotation data and scan for corrupted videos by running convert_shot2story_to_llava.py.

Set --processes to a reasonable number depending on how many CPU cores you have available.

python scripts/data/process/process_videos.py --processes=<YOUR_NUM_PROCESSES>
python scripts/data/process/convert_shot2story_to_llava.py
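
For example, on Linux you can derive the process count from the number of available cores (a convenience sketch, assuming GNU coreutils for nproc; any sensible value works):

python scripts/data/process/process_videos.py --processes=$(($(nproc) - 2))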

If you plan on running eval, make sure to run convert_shot2story_to_llava.py on the test set as well.

Note: ffmpeg is required for process_videos.py. If it is not installed, install ffmpeg for your OS or install a local copy using the download-ffmpeg.sh script.
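
Before processing, you can quickly confirm that ffmpeg is on your PATH (a simple check, not part of the project's scripts):

ffmpeg -version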

Finetuning

Finetuning scripts are in scripts/run/finetune. Run the finetuning script corresponding to the model you want to use.

sh scripts/run/finetune/finetune_1b5.sh  # finetune the 1.5B model
sh scripts/run/finetune/finetune_3b1.sh  # finetune the 3.1B model

Model

Full model metrics, model zoo, and more details coming soon!

Results

Table 1: Performance of best models on single-shot video captioning

| Model                  | BLEU | METEOR | ROUGE | CIDEr |
|------------------------|------|--------|-------|-------|
| Shot2Story (7B+)       | 10.7 | 16.2   | 29.6  | 37.4  |
| Shotluck-Holmes (3.1B) | 8.7  | 25.7   | 36.2  | 63.2  |
| Shotluck-Holmes (1.5B) | 9.3  | 25.3   | 36.3  | 68.9  |

Table 2: Performance of best models on multi-shot video summarization

| Model                  | BLEU | METEOR | ROUGE | CIDEr |
|------------------------|------|--------|-------|-------|
| Shot2Story (7B+)       | 11.7 | 19.7   | 26.8  | 8.6   |
| Shotluck-Holmes (3.1B) | 7.67 | 23.2   | 43    | 152.3 |
| Shotluck-Holmes (1.5B) | 6.48 | 21.3   | 40.2  | 144.3 |
