Segmenting the importance of hints shown to the model into natural-language 'Levels' tokens
This project is a modification of OFA Chinese built for the NICE (New frontiers for zero-shot Image Captioning Evaluation) challenge 2023, where it placed 2nd on Track 2 and 4th overall (CVPR 2023 Workshop). NICE is an image captioning task: generating an appropriate caption for each photo provided by Shutterstock. Based on the intuition that the captions in the NICE dataset have a distinctive tone, the task was approached from the perspective of controlled dialogue generation.
📖English technical report
📖Korean technical report
Preprocessed cosine similarities, trained models, etc. are available for reuse.
You can check the submission-creation procedure, the output caption for each photo, and the input data format by looking through the model inference code below.
- Since this approach connects image-caption features with well-trained image-encoder features, I used OFA, an openly licensed model with proven high performance.
- I wanted to create normalized hint-level tokens and train the model to understand them.
- For converting the model checkpoint from fairseq style to huggingface style, I referred to the code below and credit it.
- Checkpoint transition fairseq style -> hf style
Looking at the ground-truth captions, many of them describe the format of the photo in a prefix or mention a specific location. To identify these trends, manual tagging was performed on the 5000 cases as follows. (6-8 hours) 👷♂️👷♂️
caption_gt | photo-style prefix | location in the caption |
---|---|---|
Close up low angle view of Bicycles leaning against tree in wood | Close up low angle view of | NULL |
View of town and bridge spanning river on sunny day Jarnac and the Charente river West Central France | View of | Jarnac and the Charente river West Central France |
Sun beach and ocean at Gerrans Bay Cornwall United Kingdom | NULL | Gerrans Bay Cornwall United Kingdom |
🚋original validation set
🚆tagged validation set
Hypothesis
- Photos from the same supplier can be identified from information inherent in the image, and their subjects, photographic styles, and caption styles will be similar.
- The public id is Shutterstock's upload number, so photos uploaded consecutively are very likely to come from the same supplier.
=> Train using the similarity between photos and the public ids provided in the validation set, as sketched below.
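Below is a minimal sketch of how the pairwise photo similarities and id differences could be precomputed. It assumes CLIP via sentence-transformers as a stand-in image encoder and uses illustrative file paths; the repository already ships its precomputed cosine similarities in the data folder.

```python
# Minimal sketch: precompute pairwise image cosine similarities and id differences.
# CLIP via sentence-transformers is an assumed stand-in encoder; paths are illustrative.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_ids = ["1813180760", "1578946151"]                 # example public ids
images = [Image.open(f"images/{i}.jpg") for i in image_ids]

encoder = SentenceTransformer("clip-ViT-B-32")
embeddings = encoder.encode(images, convert_to_tensor=True, normalize_embeddings=True)

# cos_sim[i][j]: cosine similarity between photo i and photo j
cos_sim = util.cos_sim(embeddings, embeddings)

# id difference between two photos (smaller -> more likely the same supplier)
id_diff = abs(int(image_ids[0]) - int(image_ids[1]))
```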
I use the NICE validation dataset as training data. The dataset consists of two files: caption data and image data.
The training data consists of the NICE validation data (5000 cases), and the test data consists of the NICE test data (21377 cases).
The caption data stores hints constructed from id similarity and image cosine similarity, together with levels indicating the strength of each hint.
(click!) How to make encoder_prefix (input data format using Levels)
In the encoder input, I provide the captions of several similar photos together with hint levels, expressed as special tokens, that indicate how similar each of those photos is to the query photo. Below are the criteria used to assign the hint 'Levels'.
Hint level (special token) | Degree of hint effect | Criterion |
---|---|---|
[cosHint lv4] | Strong hint; nearly identical photos | cosine similarity > 0.4 |
[cosHint lv3] | Same topic, but expected to have a different caption | cosine similarity > 0.32 |
[cosHint lv2] | Similar photo, but a different caption | cosine similarity > 0.29 |
[cosHint lv1] | Irrelevant photo | cosine similarity ≤ 0.29 |
[diffHint lv3] | Very small public_id difference between the photos | id difference < 100 |
[diffHint lv2] | Small public_id difference between the photos | id difference < 10000 |
[diffHint lv1] | Large public_id difference between the photos | id difference ≥ 10000 |
The cosine-similarity hints were extracted from similar photos found via cosine similarity, while the tagged shot styles and locations were extracted from neighboring photos found via id difference. A sketch of mapping these criteria to hint tokens is shown below.
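A minimal sketch of the mapping; the thresholds follow the table above, and the helper names are illustrative rather than the repository's actual API.

```python
# Minimal sketch: map cosine similarity and public_id difference to hint tokens.
# Thresholds follow the table above; helper names are illustrative.
def cos_hint(cos_sim: float) -> str:
    if cos_sim > 0.4:
        return "[cosHint lv4]"
    if cos_sim > 0.32:
        return "[cosHint lv3]"
    if cos_sim > 0.29:
        return "[cosHint lv2]"
    return "[cosHint lv1]"

def diff_hint(id_difference: int) -> str:
    if id_difference < 100:
        return "[diffHint lv3]"
    if id_difference < 10000:
        return "[diffHint lv2]"
    return "[diffHint lv1]"

def build_encoder_prefix(neighbors) -> str:
    """neighbors: list of (cos_sim, id_difference, caption) tuples for similar photos."""
    return "".join(cos_hint(c) + diff_hint(d) + cap for c, d, cap in neighbors)
```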
Caption data, jsonl format:
{"image_id": "1813180760", "text": ["A vertical shot of sunset on a beach"], "encoder_prefix": "[cosHint lv3][diffHint lv1]A landscape shot of sunset at horizon over ocean[cosHint lv3][diffHint lv1]Sun beach and ocean at Gerrans Bay Cornwall United Kingdom[cosHint lv3][diffHint lv1]Vertical shot of a beautiful sunset over the sea[cosHint lv3][diffHint lv1]Sunrise near Los Islotes Baja California Sur Mexico"}
{"image_id": "1578946151", "text": ["A woman relaxing in a deck chair"], "encoder_prefix": "[cosHint lv3][diffHint lv2]A woman relaxing in a deck chair[cosHint lv3][diffHint lv1]Wide shot of a female in swimwear walking on the beach with an equipment bucket[cosHint lv3][diffHint lv1]A man meditating by a pool[cosHint lv2][diffHint lv1]Vertical shot of a woman in swimwear standing in water at the shore of a sunny beach"}
Image data, tsv format (img_id, '\t', img_content), with the image content base64-encoded:
1813180760 /9j/4AAQSkZJRgABAQAAAQABAAD/2w...
1578946151 /9j/4AAQSkZJRgABAQAAAQABAAD/2w...
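A minimal sketch of loading the two files above; the file paths are illustrative, not the repository's actual layout.

```python
# Minimal sketch: load the caption jsonl and the base64-encoded image tsv.
# File paths are illustrative.
import base64
import io
import json

from PIL import Image

# Caption data: one JSON object per line with image_id, text, and encoder_prefix.
with open("data/captions.jsonl", encoding="utf-8") as f:
    captions = [json.loads(line) for line in f]

# Image data: tab-separated image_id and base64-encoded image content.
images = {}
with open("data/images.tsv", encoding="utf-8") as f:
    for line in f:
        img_id, img_content = line.rstrip("\n").split("\t", 1)
        images[img_id] = Image.open(io.BytesIO(base64.b64decode(img_content)))
```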
Create a tokenizer that adds special tokens representing the strength of the hint as levels.
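A minimal sketch of registering the hint tokens as special tokens; AutoTokenizer and the vocab/ path are placeholders, since the repository ships its own tokenizer class and files under vocab.

```python
# Minimal sketch: register the hint 'Levels' tokens as special tokens.
# AutoTokenizer and the vocab/ path are placeholders for the repo's own tokenizer.
from transformers import AutoTokenizer

hint_tokens = [f"[cosHint lv{i}]" for i in range(1, 5)] + \
              [f"[diffHint lv{i}]" for i in range(1, 4)]

tokenizer = AutoTokenizer.from_pretrained("vocab/")
tokenizer.add_special_tokens({"additional_special_tokens": hint_tokens})
tokenizer.save_pretrained("vocab/")

# The model's token embeddings must then cover the new tokens, e.g.:
# model.resize_token_embeddings(len(tokenizer))
```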
After adjusting 'train_args', feed the picture and the hint-level prefix into the encoder and the ground-truth caption into the decoder, then start training to predict captions.
transformers==4.20.0
CUDA_VISIBLE_DEVICES=0 python train.py --train_args_file train_args/train_ofa.json
Model | Introduction | Link & how to make |
---|---|---|
OFA captioning fit | Optimized checkpoints for image captioning in the OFA-SYS | https://huggingface.co/calisolo/OFA_huge_image_captioning |
Submission3 | 3rd submission | https://huggingface.co/calisolo/OFA_huge_NICE_captioning |
Submission4 | 4th submission | /submission4 |
Ensemble1 | Adjusting hyperparameters to adjust convergence speed | /candidate1_trainLess |
Ensemble2 | Adjusting hyperparameters to adjust convergence speed | /candidate2_short |
Ensemble3 | Adjusting hyperparameters to adjust convergence speed | /candidate3_lastcoin |
The final submission was created by voting across the five checkpoints above.
Each checkpoint produces captions for the 21377 test photos; the candidates are compared, and the final caption for each photo is selected by voting based on the cosine similarity of the natural-language captions. A sketch of this voting step follows below.
You can check the results for every checkpoint.
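A minimal sketch of the voting step for a single photo, assuming a sentence-transformers model as the text embedder; the embedding model actually used for voting may differ.

```python
# Minimal sketch: for one photo, pick the candidate caption that agrees most
# (by cosine similarity) with the captions from the other checkpoints.
# sentence-transformers is an assumed text embedder; the actual model may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def vote(candidates):
    """candidates: one caption per checkpoint for the same photo."""
    emb = embedder.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(emb, emb).sum(dim=1)   # total agreement per candidate
    return candidates[int(scores.argmax())]

print(vote([
    "A vertical shot of sunset on a beach",
    "Vertical shot of a sunset over the ocean",
    "A dog running in a park",
]))
```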
YES IT IS! 😸
- data: data (cosine similarities / input data / ground-truth validation sets)
- images: input images (base64 format)
- component:
  - ofa: OFA model architecture
  - argument.py: training arguments
  - datacollator.py
  - dataset.py
- train_args: training argument configurations
- vocab: tokenizer with the 'Levels' tokens added
- convert_weight.py: checkpoint conversion / couldn't get it to work, so it was not used 😿😿
- generate.py: model generation example / not used
Backbone model
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- OFA github
codebase
- The official OFA-Sys codebase is quite complex because it stays compatible with many experimental configurations. OFA Chinese is a huggingface-style version of the fine-tuning code that keeps only the core logic.