
Fine tune BLIP Image Captioning to custom dataset #37

Open
MikeMACintosh opened this issue Mar 22, 2022 · 19 comments

@MikeMACintosh

Hi, thanks for your amazing work, I'm enjoying BLIP and it demonstrates impressive results :)
Now I have a question: how can I fine-tune BLIP for the image captioning task on a custom dataset?

My dataset consists of categories, each of which has pictures for that category (for example: chimneys, pedestrian crossings, and more). I don't have text captions for the pictures, only the category names. Can I still carry out my plan?
I studied your article on arXiv but could not find the answer to my question.

@LiJunnan1992
Contributor

Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py.
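
For a category-only dataset like the one described above, here is a minimal sketch of how such an annotation file could be generated. The folder layout, the script itself, and the template caption are illustrative assumptions; the `image` / `caption` / `image_id` fields are only meant to mirror the entries in coco_karpathy_train.json.

```python
# Hypothetical helper: build a coco_karpathy_train.json-style annotation file
# from a folder structure like  images/<category>/<file>.jpg, turning each
# category name into a template caption ("a picture of a pedestrian crossing").
# The image/caption/image_id field names are assumed to mirror the COCO Karpathy file.
import json
from pathlib import Path

def build_annotations(image_root: str, out_file: str) -> None:
    annotations = []
    for idx, path in enumerate(sorted(Path(image_root).rglob("*.jpg"))):
        category = path.parent.name.replace("_", " ").lower()
        annotations.append({
            "image": str(path.relative_to(image_root)),  # relative path, as in the COCO files
            "caption": f"a picture of a {category}",      # template caption from the category name
            "image_id": str(idx),
        })
    with open(out_file, "w") as f:
        json.dump(annotations, f)

build_annotations("images", "custom_caption_train.json")
```

A custom dataset class following coco_karpathy_dataset.py would then read this file and apply the same image transforms and caption pre-processing as the COCO one.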

@MikeMACintosh
Author

Thanks for your answer, I will try this approach.

@labenz

labenz commented Jun 13, 2022

@LiJunnan1992 – appreciate all your engagement in the discussion! I'm also interested in fine-tuning BLIP on a proprietary dataset in an effort to generate "contextual captions" and have a few questions :)

  1. Would it be possible / make sense to fine-tune BLIP with variable prompt prefix? That is, instead of "a picture of " as a constant prompt prefix, I'd use a different prompt prefix for each image, incorporating contextual information I have about each image. For example, I might do something like "a picture from Vegan Delights website of ", and my hope would be that the text output would begin to reflect the content of the contextual prompt – for example by returning "a plate of vegan food" instead of "a plate of food"

  2. If this does make sense, are there any limits to the prompt prefix length I should be aware of? I've tried to track this down and it seems (from the bert-base-uncased model card on huggingface) that the limit might be 512 tokens? "The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens."

  3. If I took this approach, do you have any idea how many images I'd need to use in fine-tuning? I understand that more is better, but wondering if you have any rough guidance. OpenAI, for example, suggests 1000 examples for fine-tuning GPT3 – a basic rule of thumb like that would be super helpful.

Thanks again!

@LiJunnan1992
Contributor

Hi @labenz, thanks for your question.

  1. Yes, it is possible to use a variable prompt.
  2. The maximum number of tokens that BLIP accepts is the same as BERT (512 tokens). However, BLIP is pretrained mostly on short sentences. To reduce memory cost, we have hard-coded the maximum text length as 40:
     text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(image.device)
     but you can change it to other values (a small tokenizer sketch follows this list).
  3. It is hard for me to say how many samples are enough. Please also note that BLIP's text decoder is much, much smaller than GPT-3, so the "prompt magic" may not work as well in BLIP.
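
To make the variable-prompt idea concrete, a rough standalone sketch of how the per-image prompt and a larger max_length could look. The prompt/caption strings and the value of 60 are made-up examples, and the tokenizer is assumed to be BERT-based as in BLIP; only the tokenizer call mirrors the line quoted above.

```python
# Sketch: prepend a per-image contextual prompt instead of the constant
# "a picture of ", and raise max_length to fit the longer text.
# The prompt, caption, and max_length=60 are illustrative assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

prompt = "a picture from Vegan Delights website of "  # one variable prompt per image
caption = "a plate of vegan food"

text = tokenizer(
    prompt + caption,
    padding="longest",
    truncation=True,
    max_length=60,   # raised from BLIP's hard-coded 40; must stay <= 512 (BERT's limit)
    return_tensors="pt",
)
print(text.input_ids.shape)
```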

@labenz

labenz commented Jun 15, 2022

thanks very much for the feedback – appreciate it!!

@ConorDoyle314

@labenz Did you ever have any success with this approach? Looking to do a similar project.

@labenz

labenz commented Apr 28, 2023 via email

@FJGEODEV

> Hi @labenz, thanks for your question.
>
> 1. Yes it is possible to use variable prompt.
> 2. The maximum number of tokens that BLIP accepts is the same as BERT (512 tokens). However, BLIP is pretrained mostly on short sentences. To reduce memory cost, we have hard-coded the maximum text length as 40 (text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(image.device)), but you can change it to other values.
> 3. It is hard for me to say how many samples are enough. Please also note that BLIP's text decoder is much much much smaller than GPT-3, so the "prompt magic" may not work as well in BLIP.

Thanks for the replies. Regarding your answer 2: if I would like to fine-tune the BLIP model but my text is far longer than 512 tokens, is there any solution that does not require retraining BLIP (i.e., editing the text length of 40)?

Thanks.

@NielsRogge

NielsRogge commented Aug 6, 2023

Hi,

We do have a notebook on that here in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb.

We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers while keeping the model itself frozen.
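
For a taste of what the first notebook does, here is a condensed sketch of a single training step with the Hugging Face BLIP captioning model. The checkpoint name, learning rate, and the single (image, caption) pair are illustrative, not a prescription; the linked notebook adds the full dataset, dataloader, and training loop.

```python
# Condensed sketch of one fine-tuning step with the Hugging Face BLIP captioning model.
# Checkpoint, learning rate, image path, and caption are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

image = Image.open("example.jpg").convert("RGB")   # hypothetical training image
caption = "a chimney on a red brick roof"          # hypothetical caption

inputs = processor(images=image, text=caption, return_tensors="pt").to(device)
outputs = model(input_ids=inputs.input_ids,
                pixel_values=inputs.pixel_values,
                labels=inputs.input_ids)           # caption tokens double as labels
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```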

@pjerryhu

pjerryhu commented Aug 7, 2023

How big does the fine-tuning dataset need to be in order to get good performance from the image captioning model?

@NielsRogge

I would start with a couple of hundred, but as always, the more the better.

@andysingal

> Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py.

Hi @NielsRogge @LiJunnan1992, checking to see if it is possible to do this on custom images. I have seen that BLIP works well on plain English terms, but some terms are geography-specific and are not identified when using BLIP. I would love to see a sample notebook for training on custom images, as mentioned above: "you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py."

@NielsRogge

Hi,

You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning
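
Concretely, that guide boils down to an `imagefolder` layout with a metadata.csv file next to the images. A minimal sketch, where the directory name, file names, and captions below are made-up placeholders:

```python
# Expected layout (file names and captions are placeholders):
#
#   my_dataset/
#   ├── metadata.csv        # columns: file_name,text
#   ├── img_0001.jpg
#   └── img_0002.jpg
#
# metadata.csv:
#   file_name,text
#   img_0001.jpg,a pyramid at Giza under a clear sky
#   img_0002.jpg,a pedestrian crossing on a rainy street
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="my_dataset", split="train")
print(dataset[0]["image"], dataset[0]["text"])   # PIL image + its caption
```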

@andysingal

andysingal commented Aug 25, 2023

> Hi,
>
> You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning

Hey all, checking to see if someone has experience annotating geography-specific images to create a custom BLIP model. The current BLIP model is good for American/British English terms, but if I have a picture of a pyramid in Egypt, for example, it has no way to reflect that in the captions. Looking forward to hearing from you. Here are some images for reference: https://drive.google.com/drive/folders/1J_XGR6aKyxS0fLNgrs61zB30GEnPvgdh?usp=drive_link @LiJunnan1992 @NielsRogge

@andysingal

> Hi,
>
> You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning

Yes, I am trying it out, thanks @NielsRogge

@SuryaPrakash0201

Hello,

I am trying to generate a detailed description of an image using the BLIP model, and it has to be more than 200 words. For example:
[example image attached]

Is it possible to do this with the same model, and if not, what other options could I explore?

@NielsRogge

I'm afraid BLIP is not able to read text from an image in such a detailed way. I'd recommend taking a look at Pix2Struct, which uses a much higher image resolution to be able to read such text.
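
If it helps, a minimal inference sketch with a Pix2Struct captioning checkpoint from the Transformers library; the checkpoint name, input image, and generation length are assumptions, not a recommendation of a specific model.

```python
# Minimal sketch: captioning a text-heavy image with a Pix2Struct checkpoint.
# The checkpoint name, image path, and max_new_tokens are assumptions.
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-textcaps-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-textcaps-base")

image = Image.open("screenshot_with_text.png").convert("RGB")   # hypothetical input
inputs = processor(images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(generated[0], skip_special_tokens=True))
```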

@katekats

Hello everyone,

I tried to use BLIP to generate image captions for images from surgical procedures. However, the generated captions are repetitions of words like "a person", "the", "an", etc. My dataset consists of hundreds of photos that belong to 44 different text descriptions. The learning rate is 0.000001. How can this be solved?

Thank you in advance!

@shams2023

> Hi,
>
> We do have a notebook on that here in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb
>
> We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers while keeping the model itself frozen.

Hello!
How can I use my own image-text dataset to fine-tune the BLIP-2 model? The task I need to perform is image captioning. I have found that using the pretrained BLIP-2 model alone to generate text descriptions for my images does not work well, so I would like to fine-tune on my dataset first before running captioning. May I ask how to implement this, and which pretrained model can be fine-tuned to achieve better results?
Looking forward to your reply, thank you again!
Good luck to you!
