Generates parallel captions and audio from a single input image using multimodal models that handle image, text and audio.

Image-to-Captioned-Audio Synthesis

Project for CMSC 691 - Computer Vision (Dr. Tejas Gokhale) at UMBC

Installing it

Works on Python 3.9 with system CUDA version 12.3. Tested on an RTX 4060 (8 GB VRAM) with 16 GB system RAM; treat that as the recommended minimum, though it may work on less.

  • Run setup.sh to create folders and fetch external libraries.
  • If setup.sh has a mistake, just run its commands manually.
  • Install the required packages with pip install -r requirements.txt.
  • Download the DeCap weights from DeCap_CoCo.zip.
  • Unzip them and place them inside custom_pipeline/pretrained weights/ (see gen_caption.py, lines 32 and 58, for the expected paths).
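The steps above can be sketched as the following shell commands. This is a rough outline, not a verified script: the download location of DeCap_CoCo.zip is not specified here, and the exact layout inside the zip should be checked against gen_caption.py before running.

```shell
# Create folders and fetch external libraries
# (see setup.sh itself for the actual commands if it fails)
bash setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Place the DeCap weights where gen_caption.py expects them.
# DeCap_CoCo.zip must already be downloaded into the current directory;
# the destination path comes from the instructions above.
unzip DeCap_CoCo.zip -d "custom_pipeline/pretrained weights/"
```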

Running it

Just run main.py.

The existing pipeline runs inference in about 2-3 minutes; the custom pipeline may take up to 30 minutes.

Credits

Shadab Hafiz Choudhury
