Commit: Releases VideoGPT+

mmaaz60 committed Jun 13, 2024 (0 parents, commit ea4a040)

Showing 84 changed files with 11,299 additions and 0 deletions.
395 changes: 395 additions & 0 deletions LICENSE


184 changes: 184 additions & 0 deletions README.md
# VideoGPT+ :movie_camera: :speech_balloon:

<p align="center">
<img src="docs/images/videogpt_plus_face.jpeg" alt="videogpt_plus_face" width="200">
</p>

<p align="center">
<img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT">
</p>

### VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

#### [Muhammad Maaz](https://www.muhammadmaaz.com), [Hanoona Rasheed](https://www.hanoonarasheed.com/), [Salman Khan](https://salman-h-khan.github.io/) and [Fahad Khan](https://sites.google.com/view/fahadkhans/home)

#### **Mohamed bin Zayed University of Artificial Intelligence**

---

[![paper](https://img.shields.io/badge/arXiv-Paper-blue.svg)](https://github.com/mbzuai-oryx/VideoGPT-plus)
[![video](https://img.shields.io/badge/Project-HuggingFace-F9D371)](https://www.youtube.com/watch?v=0dZ4dlNIGTY)
[![Dataset](https://img.shields.io/badge/VCGBench-Diverse-green)](https://huggingface.co/datasets/MBZUAI/VCGBench-Diverse)
[![Demo](https://img.shields.io/badge/Annotation-Pipeline-red)](https://huggingface.co/datasets/MBZUAI/video_annotation_pipeline)

## :loudspeaker: Latest Updates
- **Jun-13-24**: VideoGPT+ is released. :fire::fire:
---

## VideoGPT+ Overview :bulb:

VideoGPT+ integrates image and video encoders to leverage detailed spatial understanding and global temporal context, respectively. It processes videos in segments using adaptive pooling on features from both encoders, enhancing performance across various video benchmarks.

<p align="center">
<img src="docs/images/block_diagram.png" alt="VideoGPT+ Architectural Overview">
</p>
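
To make the dual-encoder design concrete, below is a minimal PyTorch sketch of the idea described above: per-segment patch features from an image encoder and a video encoder are token-pooled with adaptive average pooling, projected to the LLM embedding size, and concatenated. All class names, dimensions, and shapes here are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of segment-wise dual encoding with adaptive pooling.
# Names, dimensions and shapes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pool_patch_tokens(feats: torch.Tensor, out_hw: int = 8) -> torch.Tensor:
    """feats: (frames, num_patches, dim) with a square patch grid."""
    f, n, d = feats.shape
    side = int(n ** 0.5)
    grid = feats.transpose(1, 2).reshape(f, d, side, side)
    pooled = F.adaptive_avg_pool2d(grid, out_hw)   # (f, d, out_hw, out_hw)
    return pooled.flatten(2).transpose(1, 2)       # (f, out_hw * out_hw, d)


class DualEncoderProjector(nn.Module):
    """Projects pooled image- and video-encoder tokens into the LLM space."""

    def __init__(self, image_dim=1024, video_dim=1408, llm_dim=4096):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, llm_dim)
        self.video_proj = nn.Linear(video_dim, llm_dim)

    def forward(self, image_feats, video_feats):
        # image_feats: (frames, patches, image_dim) from the image encoder
        # video_feats: (frames, patches, video_dim) from the video encoder
        img_tokens = self.image_proj(pool_patch_tokens(image_feats))
        vid_tokens = self.video_proj(pool_patch_tokens(video_feats))
        # Flatten frame and token dimensions before feeding the LLM
        return torch.cat([img_tokens.flatten(0, 1), vid_tokens.flatten(0, 1)], dim=0)
```

Pooling the patch grid before projection is what keeps the per-segment token count manageable for the LLM while retaining features from both encoders.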

---

## Contributions :trophy:

- **VideoGPT+ Model**: We present VideoGPT+, the first video-conversation model that benefits from a dual-encoding scheme based on both image and video features. These complementary sets of features offer rich spatiotemporal details for improved video understanding.
- **VCG+ 112K Dataset**: Addressing the limitations of the existing VideoInstruct100K dataset, we develop VCG+ 112K with a novel semi-automatic annotation pipeline, offering dense video captions along with spatial understanding and reasoning-based QA pairs, further improving the model performance.
- **VCGBench-Diverse Benchmark**: Recognizing the lack of diverse benchmarks for video-conversation tasks, we propose VCGBench-Diverse, which provides 4,354 human-annotated QA pairs across 18 video categories to extensively evaluate the performance of video-conversation models.

<p align="center">
<img src="docs/images/intro_radar_plot.png" alt="Contributions" width="650">
</p>

---

## Video Annotation Pipeline (VCG+ 112K) :open_file_folder:
Video-ChatGPT introduces the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of this annotation process, we present the VCG+ 112K dataset, developed through an improved annotation pipeline. Our approach improves the accuracy and quality of instruction-tuning pairs through better keyframe extraction, detailed descriptions from SoTA large multimodal models (LMMs), and a refined instruction-generation strategy.

<p align="center">
<img src="docs/images/vcg120k_block_diagram.png" alt="Contributions">
</p>
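
As a rough, hedged illustration of the last stage of such a pipeline (not the released code), the sketch below asks an LLM to turn a dense video caption into instruction-tuning QA pairs; the model name, prompt wording, and JSON schema are assumptions made for illustration.

```python
# Hedged sketch of the instruction-generation stage only; NOT the released
# pipeline code. Model choice, prompt and output schema are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa_pairs(dense_caption: str, num_pairs: int = 3) -> list[dict]:
    """Ask an LLM to produce QA pairs grounded in a dense video caption."""
    prompt = (
        f"Below is a detailed description of a video. Write {num_pairs} "
        "question-answer pairs covering summarization, spatial details and "
        'reasoning. Reply with a JSON list of {"question": ..., "answer": ...} '
        f"objects.\n\nVideo description:\n{dense_caption}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns valid JSON; a real pipeline should validate this.
    return json.loads(response.choices[0].message.content)
```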

---
## VCGBench-Diverse :mag:
Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises a total of 877 videos spanning 18 broad video categories, with 4,354 QA pairs, ensuring a robust evaluation framework.

<p align="center">
<img src="docs/images/vcgbench_block_diag.png" alt="Contributions">
</p>

---

## Installation :wrench:

We recommend setting up a conda environment for the project:
```shell
conda create --name=videogpt_plus python=3.11
conda activate videogpt_plus

git clone https://github.com/mbzuai-oryx/VideoGPT-plus
cd VideoGPT-plus

pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.41.0

pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```
Additionally, install [FlashAttention](https://github.com/HazyResearch/flash-attention) for training:
```shell
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
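
As an optional sanity check (our suggestion, not part of the official setup), the following snippet verifies that the pinned packages import correctly inside the activated environment:

```python
# Optional environment check; flash-attn is only needed for training.
import torch
import torchvision
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (only required for training)")
```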
---

## Quantitative Evaluation 📊
We provide instructions to reproduce VideoGPT+ results on VCGBench, VCGBench-Diverse and MVBench. Please follow the instructions at [eval/README.md](eval/README.md).

### VCGBench Evaluation: Video-based Generative Performance Benchmarking :chart_with_upwards_trend:
<p align="center">
<img src="docs/images/VCGBench_quantitative.png" alt="VCGBench_quantitative" width="1000">
</p>

---
### VCGBench-Diverse Evaluation :bar_chart:
<p align="center">
<img src="docs/images/VCGDiverse_quantitative.png" alt="VCGDiverse_quantitative">
</p>

---
### Zero-Shot Question-Answer Evaluation :question:
<p align="center">
<img src="docs/images/zero_shot_quantitative.png" alt="zero_shot_quantitative">
</p>

---

### MVBench Evaluation :movie_camera:
<p align="center">
<img src="docs/images/MVBench_quantitative.png" alt="MVBench_quantitative">
</p>

---

## Training :train:
We provide scripts for pretraining and finetuning of VideoGPT+. Please follow the instructions at [scripts/README.md](scripts/README.md).

---

## Qualitative Analysis :mag:
A comprehensive evaluation of VideoGPT+ performance across multiple tasks and domains.
<p align="center">
<img src="docs/images/demo_vcg+_main.png" alt="demo_vcg+_main" width="700">
</p>

---

<p align="center">
<img src="docs/images/demo_vcg+_full_part1.jpg" alt="demo_vcg+_full_part1" width="700">
</p>


<p align="center">
<img src="docs/images/demo_vcg+_full_part2.jpg" alt="demo_vcg+_full_part2" width="700">
</p>

---

## Acknowledgements :pray:

+ [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): A pioneering attempt at video-based conversation models.
+ [LLaVA](https://github.com/haotian-liu/LLaVA): Our codebase is built upon LLaVA and Video-ChatGPT.
+ [Chat-UniVi](https://github.com/PKU-YuanGroup/Chat-UniVi): A recent work on image- and video-based conversation models. We borrowed some implementation details from their public codebase.

## Citations 📜

If you're using VideoGPT+ in your research or applications, please cite using this BibTeX:
```bibtex
@inproceedings{Maaz2024VideoGPT+,
    title={VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    journal={coming soon},
    year={2024},
    url={coming soon}
}

@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}
```

## License :scroll:
<a rel="license" href="http:https://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http:https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.


Looking forward to your feedback, contributions, and stars! :star2:
Please raise any issues or questions [here](https://github.com/mbzuai-oryx/VideoGPT-plus/issues).


---
[<img src="docs/images/IVAL_logo.png" width="200" height="100">](https://www.ival-mbzuai.com)
[<img src="docs/images/Oryx_logo.png" width="100" height="100">](https://github.com/mbzuai-oryx)
[<img src="docs/images/MBZUAI_logo.png" width="360" height="85">](https://mbzuai.ac.ae)
126 changes: 126 additions & 0 deletions annotation_pipeline/1_scenedetect_and_keyframes.py
"""
Semi-automatic Video Annotation Pipeline - Step # 1: Detect scenes and extract keyframes
Copyright 2024 MBZUAI ORYX
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http:https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import argparse
from Katna.video import Video
from Katna.writer import KeyFrameDiskWriter
import os
from scenedetect import detect, ContentDetector, split_video_ffmpeg, open_video, SceneManager
import warnings
import json
from tqdm import tqdm
import sys
import contextlib

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)


def parse_args():
    """
    Command-line argument parser.
    """
    parser = argparse.ArgumentParser(description="Detect scenes and extract keyframes.")

    parser.add_argument("--video_dir", required=True, help="Directory containing ActivityNet videos.")

    parser.add_argument("--ann_video_ids_file", required=True,
                        help="Path to the unique video ids JSON file (e.g. path to unique_video_ids.json).")
    parser.add_argument("--gt_caption_file", required=True,
                        help="Path to the ground truth captions file (e.g. path to activitynet_gt_captions_train.json).")

    parser.add_argument("--scene_output_dir", required=False, help="Path to save the scene files.", default="scenes")
    parser.add_argument("--frames_output_dir", required=False, help="Path to save the keyframes.", default="key_frames")
    parser.add_argument("--num_keyframes", type=int, default=1, help="Number of keyframes to extract per scene.")

    return parser.parse_args()


@contextlib.contextmanager
def suppress_output():
    with open(os.devnull, "w") as devnull:
        old_stdout = sys.stdout
        sys.stdout = devnull
        try:
            yield
        finally:
            sys.stdout = old_stdout


def get_keyframes(video_path, num_keyframes, output_dir):
    """
    Extracts keyframes from the video using Katna and writes them to output_dir.
    """
    # Initialize the Katna video module and a disk writer that saves keyframes to output_dir
    vd = Video()
    diskwriter = KeyFrameDiskWriter(location=output_dir)

    # Suppress Katna's print output during keyframe extraction
    with suppress_output():
        vd.extract_video_keyframes(no_of_frames=num_keyframes, file_path=video_path, writer=diskwriter)

    return None


def get_scenes(video_path, output_dir):
    video = open_video(video_path)
    scene_manager = SceneManager()
    scene_manager.add_detector(ContentDetector())
    scene_manager.detect_scenes(video)
    # If `start_in_scene` is True, len(scene_list) will always be >= 1
    scene_list = scene_manager.get_scene_list(start_in_scene=True)
    split_video_ffmpeg(video_path, scene_list, output_dir)

    return scene_list


def main():
    args = parse_args()
    os.makedirs(args.scene_output_dir, exist_ok=True)
    os.makedirs(args.frames_output_dir, exist_ok=True)
    with open(args.ann_video_ids_file, 'r') as file:
        data = json.load(file)
        video_ids_to_annotate = data['v2_videos']

    # Read ground truth captions file
    gt_file = args.gt_caption_file
    with open(gt_file) as file:
        gt_json_data = json.load(file)

    video_ids_to_annotate = [id for id in video_ids_to_annotate if id in gt_json_data]

    files_to_annotate = [file for file in os.listdir(args.video_dir) if file.split('.')[0] in video_ids_to_annotate]

    for file in tqdm(files_to_annotate):
        try:
            video_id = file.split('.')[0]
            video_path = os.path.join(args.video_dir, file)
            curr_scene_dir = f'{args.scene_output_dir}/{video_id}'
            _ = get_scenes(video_path, curr_scene_dir)  # Extract the scenes and save in the curr_scene_dir
            scenes_to_annotate = os.listdir(curr_scene_dir)
            for scene in tqdm(scenes_to_annotate):
                sce_video_path = os.path.join(curr_scene_dir, scene)
                get_keyframes(sce_video_path, num_keyframes=args.num_keyframes, output_dir=args.frames_output_dir)
        except Exception as e:
            print(f"Error processing video {file}: {e}")


if __name__ == '__main__':
    main()
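
For reference, a hypothetical invocation of this first pipeline step could look like the following; all paths are placeholders for a local ActivityNet download and the annotation JSON files named in the argument help strings above.

```python
# Hypothetical driver for annotation_pipeline/1_scenedetect_and_keyframes.py;
# every path below is a placeholder for your own data layout.
import subprocess

subprocess.run([
    "python", "annotation_pipeline/1_scenedetect_and_keyframes.py",
    "--video_dir", "data/activitynet/videos",
    "--ann_video_ids_file", "data/unique_video_ids.json",            # JSON with a 'v2_videos' list of ids
    "--gt_caption_file", "data/activitynet_gt_captions_train.json",
    "--scene_output_dir", "scenes",
    "--frames_output_dir", "key_frames",
    "--num_keyframes", "1",
], check=True)
```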
