
Fine tune BLIP Image Captioning to custom dataset #37

Open
MikeMACintosh opened this issue Mar 22, 2022 · 19 comments

@MikeMACintosh

Hi, thanks for your amazing work, I'm enjoying BLIP and it demonstrates impressive results :)
Now I have a question: how can I fine-tune BLIP for the image captioning task on a custom dataset?

My dataset consists of categories, each of which has pictures for that category (for example: chimneys, pedestrian crossings, and more). I don't have text captions for the pictures, only the category names. Can I still carry out my plan?
I studied your article on arXiv but could not find the answer to my question.

@LiJunnan1992
Contributor

Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py.
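
For a category-only dataset like the one described above, here is a minimal sketch of how such an annotation file could be generated. The folder layout, the script itself, and the template caption are illustrative assumptions; the `image` / `caption` / `image_id` fields are only meant to mirror the entries in coco_karpathy_train.json.

```python
# Hypothetical helper: build a coco_karpathy_train.json-style annotation file
# from a folder structure like  images/<category>/<file>.jpg, turning each
# category name into a template caption ("a picture of a pedestrian crossing").
# The image/caption/image_id field names are assumed to mirror the COCO Karpathy file.
import json
from pathlib import Path

def build_annotations(image_root: str, out_file: str) -> None:
    annotations = []
    for idx, path in enumerate(sorted(Path(image_root).rglob("*.jpg"))):
        category = path.parent.name.replace("_", " ").lower()
        annotations.append({
            "image": str(path.relative_to(image_root)),  # relative path, as in the COCO files
            "caption": f"a picture of a {category}",      # template caption from the category name
            "image_id": str(idx),
        })
    with open(out_file, "w") as f:
        json.dump(annotations, f)

build_annotations("images", "custom_caption_train.json")
```

A custom dataset class following coco_karpathy_dataset.py would then read this file and apply the same image transforms and caption pre-processing as the COCO one.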

@MikeMACintosh
Author

Thanks for your answer, I will try this approach.

@labenz

labenz commented Jun 13, 2022

@LiJunnan1992 – appreciate all your engagement in the discussion! I'm also interested in fine-tuning BLIP on a proprietary dataset in an effort to generate "contextual captions" and have a few questions :)

  1. Would it be possible / make sense to fine-tune BLIP with variable prompt prefix? That is, instead of "a picture of " as a constant prompt prefix, I'd use a different prompt prefix for each image, incorporating contextual information I have about each image. For example, I might do something like "a picture from Vegan Delights website of ", and my hope would be that the text output would begin to reflect the content of the contextual prompt – for example by returning "a plate of vegan food" instead of "a plate of food"

  2. If this does make sense, are there any limits to the prompt prefix length I should be aware of? I've tried to track this down and it seems (from the bert-base-uncased model card on huggingface) that the limit might be 512 tokens? "The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens."

  3. If I took this approach, do you have any idea how many images I'd need to use in fine-tuning? I understand that more is better, but wondering if you have any rough guidance. OpenAI, for example, suggests 1000 examples for fine-tuning GPT3 – a basic rule of thumb like that would be super helpful.

Thanks again!

@LiJunnan1992
Contributor

Hi @labenz, thanks for your question.

  1. Yes, it is possible to use a variable prompt.
  2. The maximum number of tokens that BLIP accepts is the same as BERT (512 tokens). However, BLIP is pretrained mostly on short sentences. To reduce memory cost, we have hard-coded the maximum text length as 40:
     text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(image.device)
     but you can change it to other values (a small tokenizer sketch follows this list).
  3. It is hard for me to say how many samples are enough. Please also note that BLIP's text decoder is much, much smaller than GPT-3, so the "prompt magic" may not work as well in BLIP.
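
To make the variable-prompt idea concrete, a rough standalone sketch of how the per-image prompt and a larger max_length could look. The prompt/caption strings and the value of 60 are made-up examples, and the tokenizer is assumed to be BERT-based as in BLIP; only the tokenizer call mirrors the line quoted above.

```python
# Sketch: prepend a per-image contextual prompt instead of the constant
# "a picture of ", and raise max_length to fit the longer text.
# The prompt, caption, and max_length=60 are illustrative assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

prompt = "a picture from Vegan Delights website of "  # one variable prompt per image
caption = "a plate of vegan food"

text = tokenizer(
    prompt + caption,
    padding="longest",
    truncation=True,
    max_length=60,   # raised from BLIP's hard-coded 40; must stay <= 512 (BERT's limit)
    return_tensors="pt",
)
print(text.input_ids.shape)
```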

@labenz

labenz commented Jun 15, 2022

thanks very much for the feedback – appreciate it!!

@ConorDoyle314

@labenz Did you ever have any success with this approach? Looking to do a similar project.

@labenz

labenz commented Apr 28, 2023 via email

@FJGEODEV

> Hi @labenz, thanks for your question.
>
> 1. Yes it is possible to use variable prompt.
> 2. The maximum number of tokens that BLIP accepts is the same as BERT (512 tokens). However, BLIP is pretrained mostly on short sentences. To reduce memory cost, we have hard-coded the maximum text length as 40 (text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(image.device)), but you can change it to other values.
> 3. It is hard for me to say how many samples are enough. Please also note that BLIP's text decoder is much much much smaller than GPT-3, so the "prompt magic" may not work as well in BLIP.

Thanks for the replies. Regarding your answer 2: if I would like to fine-tune the BLIP model but my text is far longer than 512 tokens, is there any solution that does not require retraining BLIP (i.e., editing the text length of 40)?

Thanks.

@NielsRogge

NielsRogge commented Aug 6, 2023

Hi,

We do have a notebook on that here in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb.

We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers while keeping the model itself frozen.
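
For a taste of what the first notebook does, here is a condensed sketch of a single training step with the Hugging Face BLIP captioning model. The checkpoint name, learning rate, and the single (image, caption) pair are illustrative, not a prescription; the linked notebook adds the full dataset, dataloader, and training loop.

```python
# Condensed sketch of one fine-tuning step with the Hugging Face BLIP captioning model.
# Checkpoint, learning rate, image path, and caption are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

image = Image.open("example.jpg").convert("RGB")   # hypothetical training image
caption = "a chimney on a red brick roof"          # hypothetical caption

inputs = processor(images=image, text=caption, return_tensors="pt").to(device)
outputs = model(input_ids=inputs.input_ids,
                pixel_values=inputs.pixel_values,
                labels=inputs.input_ids)           # caption tokens double as labels
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```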

@pjerryhu

pjerryhu commented Aug 7, 2023

How big does the fine-tuning dataset need to be in order to get good performance from the image captioning model?

@NielsRogge

I would start with a couple of hundred, but as always, the more the better.

@andysingal

> Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py.

Hi @NielsRogge @LiJunnan1992, checking to see if it is possible to do this on custom images. I have seen that BLIP works well on plain English terms, but some terms are geography-specific and are not identified when using BLIP. I would love to see a sample notebook for training on custom images, as mentioned above: "you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py."

@NielsRogge

Hi,

You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning
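
Concretely, that guide boils down to an `imagefolder` layout with a metadata.csv file next to the images. A minimal sketch, where the directory name, file names, and captions below are made-up placeholders:

```python
# Expected layout (file names and captions are placeholders):
#
#   my_dataset/
#   ├── metadata.csv        # columns: file_name,text
#   ├── img_0001.jpg
#   └── img_0002.jpg
#
# metadata.csv:
#   file_name,text
#   img_0001.jpg,a pyramid at Giza under a clear sky
#   img_0002.jpg,a pedestrian crossing on a rainy street
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="my_dataset", split="train")
print(dataset[0]["image"], dataset[0]["text"])   # PIL image + its caption
```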

@andysingal

andysingal commented Aug 25, 2023

> Hi,
>
> You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning

Hey all, checking to see if someone has experience annotating geography-specific images to create a custom BLIP model. The current BLIP model is good for American/British English terms, but if I have a picture of a pyramid in Egypt, for example, it has no way to reflect that in the captions. Looking forward to hearing from you. Here are some images for reference: https://drive.google.com/drive/folders/1J_XGR6aKyxS0fLNgrs61zB30GEnPvgdh?usp=drive_link @LiJunnan1992 @NielsRogge

@andysingal

> Hi,
>
> You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning

Yes, I am trying it out, thanks @NielsRogge

@SuryaPrakash0201

Hello,

I am trying to generate a detailed description of an image using the BLIP model, and it has to be more than 200 words. For example:
[example image attached]

Is it possible to do this with the same model, and if not, what other options could I explore?

@NielsRogge

I'm afraid BLIP is not able to read text from an image in such a detailed way. I'd recommend taking a look at Pix2Struct, which uses a much higher image resolution to be able to read such text.
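
If it helps, a minimal inference sketch with a Pix2Struct captioning checkpoint from the Transformers library; the checkpoint name, input image, and generation length are assumptions, not a recommendation of a specific model.

```python
# Minimal sketch: captioning a text-heavy image with a Pix2Struct checkpoint.
# The checkpoint name, image path, and max_new_tokens are assumptions.
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-textcaps-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-textcaps-base")

image = Image.open("screenshot_with_text.png").convert("RGB")   # hypothetical input
inputs = processor(images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(generated[0], skip_special_tokens=True))
```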

@katekats

Hello everyone,

I tried to use BLIP to generate image captions for images from surgical procedures. However, the generated captions are repetitions of words like "a person", "the", "an", etc. My dataset consists of hundreds of photos that belong to 44 different text descriptions. The learning rate is 0.000001. How can this be solved?

Thank you in advance!

@shams2023

> Hi,
>
> We do have a notebook on that here in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb
>
> We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers while keeping the model itself frozen.

Hello!
How can I use my own image-text dataset to fine-tune the BLIP-2 model? The task I need to perform is image captioning. I have found that using the pretrained BLIP-2 model alone to generate text descriptions for my images does not work well, so I would like to fine-tune on my dataset first before running captioning. May I ask how to implement this, and which pretrained model can be fine-tuned to achieve better results?
Looking forward to your reply, thank you again!
Good luck to you!
