To enable Multimodal Large Language Models (MLLMs) to understand videos, a large volume of high-quality caption data is crucial for modality alignment. Existing publicly available video caption datasets, however, are insufficient for this purpose: they suffer from limited size, short captions, and restricted diversity. Recent benchmarks such as Video-MME have demonstrated the remarkable video comprehension capabilities of Gemini. Inspired by this, we leverage the Gemini-1.5-Pro API to annotate captions for existing public video datasets (Webvid, Kinetics-400, InternVid, HD-VILA). Furthermore, we investigate how performance scales with the amount of collected caption data by training Chat-UniVi-7B at different data scales.
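As a rough illustration, the snippet below shows how a single video might be captioned with Gemini-1.5-Pro via the google-generativeai Python SDK. This is a minimal sketch and not our annotation pipeline: the prompt text, file path, and polling logic are illustrative assumptions.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def caption_video(path: str) -> str:
    """Upload one video and ask Gemini-1.5-Pro for a detailed caption."""
    video = genai.upload_file(path=path)
    # Uploaded files are processed asynchronously; poll until ready.
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)
    if video.state.name == "FAILED":
        raise RuntimeError(f"Upload failed for {path}")
    model = genai.GenerativeModel("gemini-1.5-pro")
    # The prompt below is an assumption, not the actual annotation prompt.
    response = model.generate_content(
        [video, "Describe this video in detail, covering objects, actions, and scenes."]
    )
    return response.text

print(caption_video("example.mp4"))  # hypothetical local file
```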
- [2024/07/29] 🔥 ShareGemini-K400 is available here.
- [2024/06/18] 🔥 ShareGemini-Webvid-core100k is available here.

The caption data being prepared for release includes:
- ShareGemini-Webvid
- ShareGemini-InternVid
- ShareGemini-K400
- ShareGemini-HDVILA
We explore data scaling on the annotated 530k-video ShareGemini-Webvid dataset through the following steps:
- Embedding Extraction: InternVideo2 is used to extract embeddings from the 530k videos.
- Clustering & Pruning: The resulting video embeddings are clustered using the ToMe algorithm, generating data subsets of core-400k, core-200k, core-100k, and core-50k (a simplified sketch of this step appears after this list).
- Training: An additional training stage is introduced between Stage-1 and Stage-2 of the baseline Chat-UniVi-7B. Different caption subsets above are utilized respectively in this stage to train both the connector and the LLM.
- Evaluation: Comprehensive evaluations are performed on the Video-MME benchmark.
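To make the Clustering & Pruning step concrete, here is a minimal sketch. It is not our implementation: scikit-learn's KMeans stands in for the ToMe-based clustering, and the file names, cluster count, and centroid-distance selection rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# embeddings.npy: (N, D) InternVideo2 video embeddings (hypothetical file name).
embeddings = np.load("embeddings.npy")

def build_core_subset(emb: np.ndarray, target_size: int, n_clusters: int = 1000) -> np.ndarray:
    """Return indices of a core subset by keeping, per cluster, the videos
    closest to the centroid. KMeans stands in for the ToMe-based clustering."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit(emb)
    # Distance of each video to its own cluster centroid.
    dist = np.linalg.norm(emb - km.cluster_centers_[km.labels_], axis=1)
    per_cluster = max(1, target_size // n_clusters)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Keep the most central (prototypical) videos of this cluster.
        keep.extend(members[np.argsort(dist[members])[:per_cluster]])
    return np.array(keep[:target_size])

for size in (400_000, 200_000, 100_000, 50_000):
    idx = build_core_subset(embeddings, size)
    np.save(f"core-{size // 1000}k_indices.npy", idx)
```

Selecting the videos nearest each centroid keeps one representative core per cluster while discarding near-duplicates, which is the intuition behind pruning a large caption corpus down to compact subsets.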
Below are visualizations of some clustered videos. Videos within the same cluster exhibit significant similarities, while videos from different clusters differ markedly.
(Visualization images for cluster_1 through cluster_4 omitted.)
The models in rows 2-6 differ from the baseline only in the addition of a training stage on ShareGemini-Webvid; all other aspects, including model architecture, hyperparameters, and data, remain identical. "SG-WV" in the tables below abbreviates ShareGemini-Webvid.
Results on Video-MME without subtitles:

# | Model | LLM Params | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
---|---|---|---|---|---|---|
1 | Chat-UniVi-1.5 | 7B | 41.2 | 46.3 | 40.3 | 36.9 |
2 | Chat-UniVi-1.5 +50k SG-WV | 7B | 41.8 (+0.6) | 47.6 (+1.3) | 40.2 (-0.1) | 37.7 (+0.8) |
3 | Chat-UniVi-1.5 +100k SG-WV | 7B | 43.2 (+2.0) | 49.1 (+2.8) | 41.3 (+1.0) | 39.1 (+2.2) |
4 | Chat-UniVi-1.5 +200k SG-WV | 7B | 43.0 (+1.8) | 49.3 (+3.0) | 42.0 (+1.7) | 37.7 (+0.8) |
5 | Chat-UniVi-1.5 +400k SG-WV | 7B | 42.5 (+1.3) | 50.2 (+3.9) | 40.0 (-0.3) | 37.3 (+0.4) |
6 | Chat-UniVi-1.5 +530k SG-WV | 7B | 42.6 (+1.4) | 49.1 (+2.8) | 39.9 (-0.4) | 38.8 (+1.7) |
Results on Video-MME with subtitles:

# | Model | LLM Params | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
---|---|---|---|---|---|---|
1 | Chat-UniVi-1.5 | 7B | 46.3 | 51.4 | 45.2 | 42.3 |
2 | Chat-UniVi-1.5 +50k SG-WV | 7B | 47.6 (+1.3) | 54.2 (+2.8) | 45.7 (+0.5) | 42.8 (+0.5) |
3 | Chat-UniVi-1.5 +100k SG-WV | 7B | 47.9 (+1.6) | 52.8 (+1.4) | 47.3 (+2.1) | 43.4 (+1.1) |
4 | Chat-UniVi-1.5 +200k SG-WV | 7B | 47.4 (+1.1) | 55.0 (+3.6) | 46.2 (+1.0) | 41.1 (-1.2) |
5 | Chat-UniVi-1.5 +400k SG-WV | 7B | 47.6 (+1.3) | 55.0 (+3.6) | 44.8 (-0.4) | 43.0 (+0.7) |
6 | Chat-UniVi-1.5 +530k SG-WV | 7B | 47.3 (+1.0) | 53.0 (+1.6) | 46.1 (+0.9) | 42.9 (+0.6) |
We highlight the key findings from the Video-MME evaluation:
- Effectiveness of Pre-training: The results demonstrate that pre-training Chat-UniVi-1.5 with a large corpus of high-quality captions significantly enhances its video understanding capabilities, aligning with our initial research motivation.
- Data Scaling Robustness: Pre-training with ShareGemini-Webvid improves the model's performance across the Short Video, Medium Video, and Long Video categories of Video-MME. The most robust scaling gains appear on short videos, where accuracy consistently increases with larger data volumes. This aligns with the fact that Webvid predominantly consists of short videos (< 30s). The upcoming ShareGemini-HDVILA (average duration 5 min) is expected to further improve the model's understanding of medium and long videos.
- Data Scaling Saturation: Performance gains saturate once ShareGemini-Webvid is scaled to around 100k instances; further scaling brings no consistent improvement for Chat-UniVi-1.5-7B. This saturation may be attributed to the limited capacity of the 7B model.
If you find this project useful for your research and applications, please cite it using this BibTeX:
@misc{sharegemini,
title={ShareGemini: Scaling Up Video Caption Data for Multimodal Large Language Models},
url={https://github.com/Share14/ShareGemini},
author={Share},
month={June},
year={2024}
}