distilvit

Fine-tune a Vision Encoder Decoder model for image captioning.

The resulting model is available on the Hugging Face model hub at https://huggingface.co/mozilla/distilvit
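As a rough illustration of using the published model, here is a minimal sketch that captions an image through the `transformers` `image-to-text` pipeline. It is not taken from this repository: the helper name `caption_image` is hypothetical, and it assumes `transformers`, a backend such as PyTorch, and Pillow are installed, and that your `transformers` version still provides the `image-to-text` task.

```python
# Model id taken from the Hugging Face URL above.
MODEL_ID = "mozilla/distilvit"


def caption_image(image_path: str) -> str:
    """Return a generated caption for the image at `image_path`.

    Hypothetical helper for illustration; not part of the distilvit package.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    # Downloads the model from the hub on first use.
    captioner = pipeline("image-to-text", model=MODEL_ID)

    # The pipeline returns a list of dicts, one per generated sequence.
    outputs = captioner(image_path)
    return outputs[0]["generated_text"]
```

Calling `caption_image("photo.jpg")` with a local image path (the file name is a placeholder) should return a single caption string.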

The training script is inspired by https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/#references

To install, use your favorite tools, or run:

python -m venv .
bin/pip install -r requirements.txt
bin/pip install -e .

To train against all image and caption pairs (COCO, Flickr30k and TextCaps), make sure you have 2 TB of disk space, then run:

bin/train --dataset all
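Since the full run needs a lot of disk, a pre-flight check can save a failed download. The sketch below is one possible way to do it with the standard library. The names `REQUIRED_BYTES` and `has_enough_space` are hypothetical, the 2 TB figure is read as decimal terabytes, and the path to check is an assumption: point it at wherever the datasets end up on your machine.

```python
import shutil

# The disk requirement quoted above, read as decimal terabytes (an assumption).
REQUIRED_BYTES = 2 * 10**12


def has_enough_space(path: str = ".", required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


if __name__ == "__main__":
    # Checks the current directory by default; pass another path if the
    # datasets are stored elsewhere.
    status = "ok" if has_enough_space() else "insufficient"
    print(f"Free disk space for --dataset all: {status}")
```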

Once trained, you can try it out with the test script:

bin/python distilvit/infere.py
