Open-source code for "Dense Video Object Captioning from Disjoint Supervision" (Dense VOC).

PiperOrigin-RevId: 622956462
xingyizhou authored and Scenic Authors committed Apr 8, 2024
1 parent ba814f7 commit 5525936
Showing 43 changed files with 9,626 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -72,7 +72,8 @@ Projects that were developed in Scenic or used it for their experiments:
* [MatFormer: Nested Transformer for Elastic Inference](https://arxiv.org/abs/2310.07707)
* [Pixel Aligned Language Models](https://arxiv.org/abs/2312.09237)
* [A Generative Approach for Wikipedia-Scale Visual Entity Recognition](https://arxiv.org/abs/2403.02041)
-* [Streaming Dense Video Captioning](https://arxiv.org/abs/xxxx.xxxxx)
+* [Streaming Dense Video Captioning](https://arxiv.org/abs/2404.01297)
+* [Dense Video Object Captioning from Disjoint Supervision](https://arxiv.org/abs/2306.11729)

More information can be found in [projects](https://github.com/google-research/scenic/tree/main/scenic/projects#list-of-projects-hosted-in-scenic).

6 changes: 6 additions & 0 deletions scenic/projects/README.md
@@ -189,6 +189,12 @@
> Streaming DVC is a framework for dense captioning of long videos.
> Details can be found in the [paper](https://arxiv.org/abs/2404.01297).
* [Dense Video Object Captioning](densevoc)

> Dense VOC is an end-to-end model for joint object detection, tracking,
> and captioning in videos.
> Details can be found in the [paper](https://arxiv.org/abs/2306.11729).
<a name="projects"></a>
## Scenic projects
A typical project consists of models, trainers, configs, a runner, and some
101 changes: 101 additions & 0 deletions scenic/projects/densevoc/README.md
@@ -0,0 +1,101 @@
# Dense Video Object Captioning from Disjoint Supervision

> [**Dense Video Object Captioning from Disjoint Supervision**](https://arxiv.org/abs/2306.11729),\
> Xingyi Zhou*, Anurag Arnab*, Chen Sun, Cordelia Schmid.
<p align="center"> <img src="densevoc_teaser.png" height="150" /><img src="densevoc_framework.png" height="150" /> </p>

Dense video object captioning (Dense VOC) is the task of detecting, tracking, and captioning object trajectories in a video.
We propose an end-to-end model for this task and leverage a mixture of disjoint
tasks and datasets, each supervising a different part of our model.

## Getting started

First install Scenic following the instructions
[here](https://github.com/google-research/scenic#quickstart).
Then install additional dependencies with:

```shell
pip install -r scenic/projects/densevoc/requirements.txt
```

Before training and evaluation, we need to set up the [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)
datasets that our data pipeline consumes.
Please follow the instructions in `tools/build_*_tfrecord.py` to set up the datasets
for [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html),
[Spoken Moments-in-Time](https://moments.csail.mit.edu/spoken.html),
[VidSTG](https://github.com/Guaranteer/VidSTG-Dataset),
and [Video Localized Narratives](https://google.github.io/video-localized-narratives/).
The first two are used for pretraining, and the latter two for finetuning and evaluation.
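
As a quick sanity check after building a dataset, you can inspect the feature
keys in a generated file. A minimal sketch, with a hypothetical output path;
the exact record layout and feature keys are defined by the build scripts:

```python
import tensorflow as tf

# Print the feature keys of the first record in a generated TFRecord file.
# The path is a placeholder; video datasets may instead be serialized as
# tf.train.SequenceExample, in which case parse with that proto.
ds = tf.data.TFRecordDataset('/path/to/vidstg_train.tfrecord')
for raw_record in ds.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(sorted(example.features.feature.keys()))
```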

Next, create the COCO-format json annotation files used for evaluation by
following [tools/create_coco_json_from_tfrecord.py](tools/create_coco_json_from_tfrecord.py).
After that, update the paths to the generated files in [configs/common.py](configs/common.py).
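
For reference, a COCO-format annotation file is a single json object with
`images`, `annotations`, and `categories` lists. A minimal hand-written
sketch; the conversion script may emit additional keys (e.g., captions or
track ids) needed by our evaluator:

```python
import json

# Minimal COCO-style annotation file. All values are illustrative.
coco_json = {
    'images': [
        {'id': 1, 'file_name': 'video_0001/frame_000001.jpg',
         'height': 480, 'width': 640},
    ],
    'annotations': [
        {'id': 1, 'image_id': 1, 'category_id': 1,
         'bbox': [120.0, 80.0, 200.0, 150.0],  # [x, y, width, height]
         'area': 30000.0, 'iscrowd': 0},
    ],
    'categories': [{'id': 1, 'name': 'object'}],
}
with open('/path/to/vidstg_val_coco.json', 'w') as f:
  json.dump(coco_json, f)
```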

Our model uses a [CLIP](https://github.com/openai/CLIP) pretrained checkpoint.
Before training, run [tools/densevoc_convert_clip_b16_weights_to_jax.ipynb](tools/densevoc_convert_clip_b16_weights_to_jax.ipynb)
to download and convert the CLIP weights, and update the path in [configs/common.py](configs/common.py) accordingly.
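
The notebook handles the exact PyTorch-to-JAX parameter renaming; the
download-and-dump step it is built around looks roughly like this sketch
(output path hypothetical):

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import numpy as np

# Download CLIP ViT-B/16 and dump its weights as numpy arrays. The
# PyTorch -> JAX parameter renaming (and any transposes) is done in the
# conversion notebook; this only illustrates the download/dump step.
model, _ = clip.load('ViT-B/16', device='cpu')
state = {k: v.detach().cpu().numpy() for k, v in model.state_dict().items()}
np.savez('/path/to/clip_b16_weights.npz', **state)
```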

To train a model with a given config, e.g., `densevoc_vidstg`, run

```shell
python -m scenic.projects.densevoc.main \
--config=scenic/projects/densevoc/configs/densevoc_vidstg.py \
--workdir=./output/densevoc_vidstg/
```

By default, our models are trained on 16 TPU/GPU cores with a total batch size
of 16 videos (a per-device batch size of 1).
It is possible to train with a different number of devices and a different
batch size by following the linear learning-rate scaling rule (see the sketch
below).
Training fits on accelerators with 16GB of memory.
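
A minimal sketch of the scaling rule; the base learning rate below is a
placeholder, and the actual value lives in the training config:

```python
# Linear learning-rate scaling: when the total batch size changes by a
# factor of k, scale the learning rate by the same factor.
BASE_BATCH_SIZE = 16  # default total batch size (videos)
BASE_LR = 1e-4        # placeholder; see the config for the real value


def scaled_lr(total_batch_size: int) -> float:
  return BASE_LR * total_batch_size / BASE_BATCH_SIZE


print(scaled_lr(32))  # 0.0002: doubling the batch doubles the learning rate
```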

Our model consumes all video frames (up to 200) at once and produces
trajectories end-to-end, and therefore requires more TPU/GPU memory during
inference. Evaluation on VidSTG requires 32GB of accelerator memory, so we use
a separate config for evaluation.
To perform evaluation on VidSTG, run

```shell
python -m scenic.projects.densevoc.main \
--config=scenic/projects/densevoc/configs/densevoc_vidstg_videoeval.py \
--workdir=./output/densevoc_vidstg/ \
--config.weights=/path/to/checkpoint/densevoc_vidstg
```

## Model Zoo

**Pretrained models**. We report zero-shot evaluation on VidSTG below:

| | CHOTA | CapA | AssA | DetA | mAP | Checkpoint |
|--------------------------------------------------------------------------|--------|------|------|------|--------|-----------------|
| [grit_vg_384](configs/grit_vg_384.py) | - | - | - | - | 17.1 | Coming soon |
| [densevoc_disjoint_pretraining](configs/densevoc_disjoint_pretraining.py)| 31.1 | 9.8 | 59.6 | 51.4 | 39.5 | Coming soon |
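
Consistent with the numbers in these tables, CHOTA (captioning HOTA) is the
geometric mean of the captioning (CapA), association (AssA), and detection
(DetA) accuracies:

```python
# CHOTA as the geometric mean of CapA, AssA, and DetA, consistent with
# the rows reported in the tables.
def chota(cap_a: float, ass_a: float, det_a: float) -> float:
  return (cap_a * ass_a * det_a) ** (1.0 / 3.0)


print(round(chota(9.8, 59.6, 51.4), 1))  # 31.1, matching the row above
```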

**VidSTG**. These models are finetuned on VidSTG from the pretrained model above.

| | CHOTA | CapA | AssA | DetA | mAP | Checkpoint |
|----------------------------------------------------------------------------------------------|--------|------|------|------|-------|-----------------|
| [densevoc_vidstg_ftgrit_soft_aggregation](configs/densevoc_vidstg_ftgrit_soft_aggregation.py)| 54.6 | 38.4 | 65.9 | 64.4 | 68.7 | Coming soon |
| [densevoc_vidstg_ftgrit_hard_aggregation](configs/densevoc_vidstg_ftgrit_hard_aggregation.py)| 54.9 | 39.1 | 65.9 | 64.2 | 68.7 | Coming soon |
| [densevoc_vidstg](configs/densevoc_vidstg.py) | 56.9 | 39.7 | 70.4 | 65.8 | 71.5 | Coming soon |

**Video Localized Narratives (VLN)**. These models are finetuned on VLN from the
pretrained model above.

| | CHOTA | CapA | AssA | DetA | mAP | Checkpoint |
|----------------------------------------|--------|------|------|------|-------|-----------------|
| [densevoc_vln](configs/densevoc_vln.py)| 41.3 | 17.7 | 89.5 | 44.3 | 48.2 | Coming soon |

## Citation

If you use our Dense VOC project, please cite the following BibTeX entry:

```bibtex
@article{zhou2023dense,
title={Dense Video Object Captioning from Disjoint Supervision},
author={Zhou, Xingyi and Arnab, Anurag and Sun, Chen and Schmid, Cordelia},
journal={arXiv:2306.11729},
year={2023}
}
```
