Image Captioning using BLIP

This project demonstrates the use of BLIP (Bootstrapping Language-Image Pre-training) for generating image captions. It includes the process of generating both conditional and unconditional captions for a given image and calculating the BLEU score to evaluate the generated captions against reference captions.


BLIP (Bootstrapping Language-Image Pre-training) is a framework for pre-training vision-language models. This project uses the BlipProcessor and BlipForConditionalGeneration classes from the transformers library to generate captions for images.


  • Python 3.7+
  • PyTorch
  • Transformers library from Hugging Face
  • NLTK
  • Pillow


  1. Install the required Python packages:

    pip install torch transformers nltk pillow
  2. Download the NLTK data:

    import nltk
    nltk.download('punkt')
    # Recent NLTK releases may additionally require nltk.download('punkt_tab').
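Before running the usage steps, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper that is not part of this project, and the import names (for example `PIL` for Pillow) are the usual ones for these distributions.

```python
import importlib.util

# Import names for the requirements listed above (Pillow is imported as PIL).
REQUIRED = ["torch", "transformers", "nltk", "PIL"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any name reported as missing can be installed with the `pip install` command shown in the installation step.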


  1. Load the Image:

    Load an image using Pillow and convert it to RGB format.

    from PIL import Image
    img_path = "data/images/test1.jpg"
    raw_image = Image.open(img_path).convert('RGB')
  2. Convert Image to Tensor:

    Convert the image to a NumPy array and then to a PyTorch tensor with a leading batch dimension (shape 1 × H × W × 3).

    import numpy as np
    import torch
    image_np = np.array(raw_image)
    image_tensor = torch.tensor(image_np, dtype=torch.float32).unsqueeze(0)
  3. Load the Processor and Model:

    Load the BLIP processor and model from the Hugging Face transformers library.

    from transformers import BlipProcessor, BlipForConditionalGeneration
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
  4. Generate Captions:

    Generate conditional and unconditional captions for the image.

    # Conditional Caption
    inputs = processor(images=image_tensor, text=["a photo "], return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    generated_caption_conditional = processor.decode(out[0], skip_special_tokens=True)
    print("Generated Caption (Conditional):", generated_caption_conditional)
    # Unconditional Caption
    inputs = processor(images=image_tensor, return_tensors="pt")
    # Note: temperature is ignored unless do_sample=True is also passed to generate().
    out = model.generate(**inputs, max_new_tokens=20, num_return_sequences=1, temperature=0.7)
    generated_caption_unconditional = processor.decode(out[0], skip_special_tokens=True)
    print("Generated Caption (Unconditional):", generated_caption_unconditional)
  5. Calculate BLEU Score:

    • BLEU: Bilingual Evaluation Understudy Score

    Calculate the BLEU score to evaluate the generated captions.

    import nltk
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def calculate_bleu(reference_captions, generated_caption):
        # Tokenize the lowercased references and the candidate into word lists.
        reference_captions = [nltk.word_tokenize(caption.lower()) for caption in reference_captions]
        generated_caption = nltk.word_tokenize(generated_caption.lower())
        # Smoothing avoids a zero score when a higher-order n-gram has no match.
        smoothie = SmoothingFunction().method4
        score = sentence_bleu(reference_captions, generated_caption, smoothing_function=smoothie)
        return score

    # Read the reference captions, one per line, skipping blank lines.
    ref_txt = "data/captions/test1.txt"
    with open(ref_txt, "r") as f:
        reference_captions = [line.strip() for line in f if line.strip()]
    bleu_score_conditional = calculate_bleu(reference_captions, generated_caption_conditional)
    bleu_score_unconditional = calculate_bleu(reference_captions, generated_caption_unconditional)
    print("BLEU Score (Conditional):", bleu_score_conditional)
    print("BLEU Score (Unconditional):", bleu_score_unconditional)
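To make the BLEU score less opaque, here is a simplified, pure-Python sketch of what sentence-level BLEU computes: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. It is illustrative only. It splits on whitespace instead of using `nltk.word_tokenize`, applies no smoothing (so it returns 0 whenever some n-gram order has no match), and the helper names are hypothetical. Its values will generally differ from the smoothed scores that NLTK's `sentence_bleu` produces.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision of `candidate` against tokenized references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip candidate counts so repeated words are not over-rewarded.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(references, candidate, max_n=4):
    """Unsmoothed sentence BLEU with uniform weights over 1..max_n grams."""
    refs = [r.lower().split() for r in references]
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    precisions = [modified_precision(refs, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A caption identical to a reference scores 1.0, while a short caption that matches only a prefix of the reference is scaled down by the brevity penalty.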

[Images: test1; Screenshot 2024-06-17 4:25:27 PM]

File Structure

├── data
│   ├── captions
│   │   ├── test1.txt
│   │   └── test2.txt
│   └── images
│       ├── test1.jpg
│       └── test2.jpg
└── requirements.txt
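
Given this layout, each image can be matched to its reference file by filename stem (for example `test1.jpg` with `test1.txt`). The sketch below assumes that naming convention holds for every file; `pair_images_with_captions` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def pair_images_with_captions(data_dir):
    """Return sorted (image_path, caption_path) pairs that share a filename stem."""
    data_dir = Path(data_dir)
    pairs = []
    for image_path in sorted((data_dir / "images").glob("*.jpg")):
        caption_path = data_dir / "captions" / f"{image_path.stem}.txt"
        # Skip images that have no matching reference file.
        if caption_path.is_file():
            pairs.append((image_path, caption_path))
    return pairs
```

Calling `pair_images_with_captions("data")` from the project root would then yield one pair per test image, which can be fed to the captioning and BLEU steps in a loop.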

Image Captioning using BLIP2

This project demonstrates the use of BLIP2 (Bootstrapping Language-Image Pre-training) for generating image captions on the COCO validation dataset. It includes the process of generating captions for a given image and calculating the BLEU score to evaluate the generated captions against reference captions.


BLIP2 is a framework for pre-training vision-language models. This project uses the load_model_and_preprocess method from the LAVIS library to generate captions for images in the COCO validation dataset and calculates BLEU scores to evaluate their accuracy.
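
A minimal sketch of how `load_model_and_preprocess` is typically called for captioning follows. The model name `blip2_opt`, the model type `caption_coco_opt2.7b`, and the helper `caption_with_blip2` are assumptions chosen for illustration; the source does not state which BLIP2 variant the project uses. The heavy imports sit inside the function so that defining it does not load any model.

```python
def caption_with_blip2(image_path, name="blip2_opt", model_type="caption_coco_opt2.7b"):
    """Generate captions for one image with a LAVIS BLIP2 model (assumed variant)."""
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Returns the model together with its image and text preprocessors.
    model, vis_processors, _ = load_model_and_preprocess(
        name=name, model_type=model_type, is_eval=True, device=device
    )
    raw_image = Image.open(image_path).convert("RGB")
    # The "eval" transform resizes and normalizes; unsqueeze adds the batch dimension.
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    # generate() returns a list of caption strings.
    return model.generate({"image": image})
```

The returned captions can be scored against COCO reference captions with the same BLEU procedure used in the BLIP section.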


  • Python 3.7+
  • PyTorch
  • LAVIS library
  • NLTK
  • Pillow


Image captioning using BLIP series models





