Image Captioning using BLIP

This project demonstrates the use of BLIP (Bootstrapping Language-Image Pre-training) for generating image captions. It includes the process of generating both conditional and unconditional captions for a given image and calculating the BLEU score to evaluate the generated captions against reference captions.

Overview

BLIP (Bootstrapping Language-Image Pre-training) is a framework for pre-training vision-language models. This project uses the BlipProcessor and BlipForConditionalGeneration classes from the transformers library to generate captions for images.

Prerequisites

  • Python 3.7+
  • PyTorch
  • Transformers library from Hugging Face
  • NLTK
  • Pillow

Installation

  1. Install the required Python packages:

    pip install torch transformers nltk pillow
  2. Download the NLTK data:

    import nltk
    nltk.download('punkt')  # newer NLTK releases may also require nltk.download('punkt_tab')
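Alternatively, the repository ships a requirements.txt (see File Structure below), so the dependencies can likely be installed in one step; this assumes the file lists the same packages as above:

    pip install -r requirements.txt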

Usage

  1. Load the Image:

    Load an image using Pillow and convert it to RGB format.

    from PIL import Image
    img_path = "data/images/test1.jpg"
    raw_image = Image.open(img_path).convert('RGB')
  2. Convert Image to Tensor:

    Convert the image to a numpy array and then to a PyTorch tensor (optional; see the Notes after this list).

    import numpy as np
    import torch
    
    image_np = np.array(raw_image)
    image_tensor = torch.tensor(image_np, dtype=torch.float32).unsqueeze(0)
  3. Load the Processor and Model:

    Load the BLIP processor and model from the Hugging Face transformers library.

    from transformers import BlipProcessor, BlipForConditionalGeneration
    
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
  4. Generate Captions:

    Generate conditional and unconditional captions for the image (see the Notes after this list).

    # Conditional Caption
    inputs = processor(images=image_tensor, text=["a photo "], return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    generated_caption_conditional = processor.decode(out[0], skip_special_tokens=True)
    print("Generated Caption (Conditional):", generated_caption_conditional)
    
    # Unconditional Caption
    inputs = processor(images=image_tensor, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20, num_return_sequences=1, temperature=0.7)
    generated_caption_unconditional = processor.decode(out[0], skip_special_tokens=True)
    print("Generated Caption (Unconditional):", generated_caption_unconditional)
  5. Calculate BLEU Score:

    • BLEU: Bilingual Evaluation Understudy score

    Calculate the BLEU score to evaluate the generated captions.

    import nltk
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    
    def calculate_bleu(reference_captions, generated_caption):
        # Tokenize the lowercased references and the candidate caption
        reference_captions = [nltk.word_tokenize(caption.lower()) for caption in reference_captions]
        generated_caption = nltk.word_tokenize(generated_caption.lower())
        
        # Smoothing avoids zero scores when higher-order n-grams have no overlap
        smoothie = SmoothingFunction().method4
        score = sentence_bleu(reference_captions, generated_caption, smoothing_function=smoothie)
        return score
    
    # One reference caption per line in the text file
    ref_txt = "data/captions/test1.txt"
    with open(ref_txt, "r") as f:
        reference_captions = [line.strip() for line in f]
    
    bleu_score_conditional = calculate_bleu(reference_captions, generated_caption_conditional)
    bleu_score_unconditional = calculate_bleu(reference_captions, generated_caption_unconditional)
    
    print("BLEU Score (Conditional):", bleu_score_conditional)
    print("BLEU Score (Unconditional):", bleu_score_unconditional)

(Screenshot: example output for test1, captured 2024-06-17.)

File Structure

.
├── README.md
├── blip_main.py
├── data
│   ├── captions
│   │   ├── test1.txt
│   │   └── test2.txt
│   └── images
│       ├── test1.jpg
│       └── test2.jpg
└── requirements.txt

Image Captioning using BLIP2

This project demonstrates the use of BLIP2 (Bootstrapping Language-Image Pre-training with frozen image encoders and large language models) for generating image captions on the COCO validation dataset. It covers generating captions for a given image and calculating the BLEU score to evaluate the generated captions against reference captions.

Overview

BLIP2 is a framework that bootstraps vision-language pre-training from frozen image encoders and frozen large language models. This project uses the load_model_and_preprocess function from the LAVIS library to generate captions for images in the COCO validation dataset and calculates BLEU scores to evaluate their accuracy.

Prerequisites

  • Python 3.7+
  • PyTorch
  • LAVIS library
  • NLTK
  • Pillow
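
A minimal sketch of the BLIP2 captioning path described above, assuming the blip2_opt model with the caption_coco_opt2.7b checkpoint from the LAVIS model zoo (the repository may use a different variant), and reusing a local test image as a stand-in for a COCO validation image:

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess
    
    # Use a GPU when available; BLIP2 checkpoints are large
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Illustrative image path; the project targets COCO validation images
    raw_image = Image.open("data/images/test1.jpg").convert("RGB")
    
    # name/model_type are assumptions; LAVIS ships several BLIP2 variants
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
    )
    
    # Preprocess the image and generate a caption
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    captions = model.generate({"image": image})
    print(captions)

The generated captions can then be scored against the COCO reference captions with the same calculate_bleu helper defined in the BLIP section above.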

About

Image captioning using the BLIP series of models.
