Fine tune BLIP Image Captioning to custom dataset #37
Comments
Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a format similar to the COCO captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py. |
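For anyone following along, here is a minimal sketch of what such an annotation file could look like, based on the structure of coco_karpathy_train.json (field names inferred from the repo's COCO files; the image paths, captions, and IDs below are made-up placeholders):

```python
import json

# Each entry pairs one image (path relative to your image root) with one caption.
# coco_karpathy_train.json uses "image", "caption", and "image_id" keys;
# repeat the same image with different captions if you have several per image.
annotations = [
    {"image": "train/img_0001.jpg", "caption": "a chimney on a red brick roof", "image_id": "custom_0001"},
    {"image": "train/img_0002.jpg", "caption": "a pedestrian crossing on a wet street", "image_id": "custom_0002"},
]

with open("custom_caption_train.json", "w") as f:
    json.dump(annotations, f)
```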
Thanks for your answer, I will try this way. |
@LiJunnan1992 – appreciate all your engagement in the discussion! I'm also interested in fine-tuning BLIP on a proprietary dataset in an effort to generate "contextual captions" and have a few questions :)
Thanks again! |
Hi @labenz, thanks for your question.
|
Thanks very much for the feedback – appreciate it!! |
@labenz Did you ever have any success with this approach? Looking to do a similar project. |
We ended up using a different approach, which used BLIP image-text matching instead of captioning.
(For context, our problem was "image selection", so we found that generating "ideal captions" and then selecting images by ITM was more effective than selecting by caption, and this seemed likely to be true even if we had fine-tuned, especially because our images are extremely diverse.)
|
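In case it helps anyone reading later, a rough sketch of that ITM-based selection with the Hugging Face BLIP ITM checkpoint might look like the following (model name and output field reflect my reading of the transformers docs, so treat them as assumptions and double-check):

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

def itm_score(image_path: str, caption: str) -> float:
    """Return the probability that the image matches the caption (ITM head)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)  # itm_score holds logits for [no-match, match]
    return torch.softmax(outputs.itm_score, dim=1)[0, 1].item()

# Rank candidate images against one "ideal caption" and keep the best match.
candidates = ["img_a.jpg", "img_b.jpg", "img_c.jpg"]
best = max(candidates, key=lambda p: itm_score(p, "a pyramid in the Egyptian desert"))
```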
Thanks for the replies. Regarding your answer 2: if I would like to fine-tune the BLIP model, but my text file is far more than 512 tokens, is there any solution to this without retraining BLIP (i.e., without editing the text length == 40 setting)? Thanks. |
Hi, we do have a notebook on that in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb. We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers while keeping the model itself frozen. |
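For reference, the core training step in the captioning notebook boils down to something along these lines (a minimal sketch of my understanding, not a drop-in script; `example.jpg` and the caption are placeholders, and the linked notebook is the authoritative version):

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

# One example from your dataset (placeholders).
image = Image.open("example.jpg").convert("RGB")
caption = "a chimney on a red brick roof"

# One training step: the caption serves both as input_ids and as labels,
# so the model learns to reproduce it conditioned on the image.
inputs = processor(images=image, text=caption, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```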
How big does the fine-tuning dataset need to be in order to have good performance for the image caption model? |
I would start with a couple of hundred examples, but as always, the more the better. |
Hi @NielsRogge @LiJunnan1992, checking to see if it is possible to do this on custom images. I have seen that BLIP works well on plain English terms, but some terms are geography-specific and are not identified when using BLIP. I'd love to see if there is a sample notebook for training on custom images, as mentioned above: "you can prepare your annotation file in a format similar to the COCO captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py." |
Hi, You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning |
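Concretely, the format described in that doc is an image folder plus a metadata.jsonl file mapping each file name to its caption (the layout and column names below follow the linked doc; the paths are placeholders):

```python
# Folder layout (placeholder paths):
#   my_captions/
#     metadata.jsonl    # one JSON object per line, e.g. {"file_name": "img_0001.jpg", "text": "a chimney on a roof"}
#     img_0001.jpg
#     img_0002.jpg
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="my_captions", split="train")
print(dataset[0]["image"], dataset[0]["text"])
```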
Hey all, checking to see if someone has experience annotating geography-specific images to create a custom BLIP model. The current BLIP model is good for American/British English terms. For example, if I have a picture of a pyramid in Egypt, it has no way to identify it within the captions. Looking forward to hearing from you. Here are some images for reference: https://drive.google.com/drive/folders/1J_XGR6aKyxS0fLNgrs61zB30GEnPvgdh?usp=drive_link @LiJunnan1992 @NielsRogge |
Yes, I am trying it out, thanks @NielsRogge |
I'm afraid BLIP is not able to read text from images in such a detailed way. I'd recommend taking a look at Pix2Struct, which uses a much higher image resolution to be able to read such text. |
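If it's useful, generating a caption with Pix2Struct in transformers looks roughly like this (the checkpoint name is one of the TextCaps-finetuned variants as far as I know; verify it against the model hub, and `example.jpg` is a placeholder):

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-textcaps-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-textcaps-base")

# For captioning checkpoints only the image is needed; no text prompt is required.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```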
Hello everyone, I tried to use BLIP to generate image captions for images from surgical procedures. However, the generated captions are repetitions of words like "a person", "the", "an", etc. My dataset consists of hundreds of photos that belong to 44 different text descriptions. The learning rate is 1e-6. How can this be solved? Thank you in advance! |
Hello! |
Hi, thanks for your amazing work. I'm enjoying using BLIP, which demonstrates impressive results :)
Now I have a question: how can I fine-tune BLIP for the image captioning task on a custom dataset?
My dataset consists of categories, each of which has pictures for that category (examples of categories: chimneys, pedestrian crossings, and more). I don't have text captions for the pictures, only the category names. Can I implement my plan?
I studied your article on arXiv but could not find the answer to my question.