🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Examples

Image

Example

What is the girl doing?
The girl is sitting at a table and eating a large hamburger.

What color is the girl's hair?
The girl's hair is white.

What is this?
This is a computer server rack, which is a device used to store and manage multiple computer servers. The rack is filled with various computer servers, each with their own dedicated space and power supply. The servers are connected to the rack via multiple cables, indicating that they are part of a larger system. The rack is placed on a carpeted floor, and there is a couch nearby, suggesting that the setup is in a living or entertainment area.

What is behind the stand?
Behind the stand, there is a brick wall.

Usage

Using transformers (recommended)

pip install transformers einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-08-26"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
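Because encode_image and answer_question are separate calls, one encoded image can be reused for several questions. The following is a minimal sketch, assuming the model and tokenizer were loaded as shown above. The helper name ask_many is hypothetical and is not part of the moondream API.

```python
def ask_many(model, tokenizer, image, questions):
    """Encode the image once, then answer each question against that encoding."""
    # encode_image and answer_question are the calls used in the snippet above.
    enc_image = model.encode_image(image)
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }
```

For example, ask_many(model, tokenizer, image, ["What is this?", "What is behind the stand?"]) would return a dict mapping each question to its answer.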

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
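Flash Attention 2 in transformers depends on the separate flash-attn package and a CUDA GPU, so a script that should also run on other machines may want to request it only when both are available. The sketch below is one way to do that. The helper name flash_attention_kwargs is hypothetical, and the check is an assumption about a reasonable fallback, not something this repository provides.

```python
import importlib.util


def flash_attention_kwargs(cuda_available: bool) -> dict:
    """Return extra from_pretrained kwargs, requesting Flash Attention 2 only when usable."""
    # find_spec returns None when the flash_attn package is not installed.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    # Otherwise pass nothing and let transformers use its default attention.
    return {}
```

The result can be unpacked into the call shown above, for example from_pretrained(model_id, trust_remote_code=True, revision=revision, torch_dtype=torch.float16, **flash_attention_kwargs(torch.cuda.is_available())).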

Batch inference is also supported.

answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
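For a long list of images, it may help to send fixed-size slices to batch_answer so that memory use stays bounded. This is a hedged sketch: the helper name batch_answer_in_chunks and the default chunk_size are hypothetical, and it assumes batch_answer returns one answer per image and prompt pair, in order.

```python
def batch_answer_in_chunks(model, tokenizer, images, prompts, chunk_size=4):
    """Call model.batch_answer on fixed-size slices and concatenate the answers."""
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    answers = []
    for start in range(0, len(images), chunk_size):
        end = start + chunk_size
        # Assumption: batch_answer returns a list aligned with its inputs.
        answers.extend(
            model.batch_answer(
                images=images[start:end],
                prompts=prompts[start:end],
                tokenizer=tokenizer,
            )
        )
    return answers
```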

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a command-line interface for running the model. If the --prompt argument is omitted, the script lets you ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
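To drive the CLI from another Python script, the command above can be assembled as an argument list. This is only a sketch: the helper name build_sample_command is hypothetical, and it uses just the --image and --prompt flags shown above.

```python
import sys


def build_sample_command(image_path, prompt=None):
    """Build the argument list for running sample.py with the current interpreter."""
    cmd = [sys.executable, "sample.py", "--image", image_path]
    # Without --prompt, sample.py starts its interactive question loop.
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd
```

From the repository root, the list can be passed to subprocess.run(build_sample_command("<IMAGE_PATH>", "Describe this image."), check=True).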

Use the gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface that uses your webcam as input and performs inference in real time.

python webcam_gradio_demo.py

