This repo contains code and published data for the AINL2023 paper IMAD: IMage Augmented multi-modal Dialogue.
Our dataset targets the task of interpreting an image in the context of a dialogue. The published code can help with:
- Classifying whether an utterance can be replaced with an image
- Finding the best image for an utterance
- Generating the utterance that was replaced with an image
The IMAD dataset was created from multiple dialogue datasets and Unsplash images. Every sample in the dataset consists of:
- Context of dialogue
- Image
- Replaced utterance
The dataset is available with image ids on HuggingFace. Because of the images' license we are unable to publish the images themselves, but you can still obtain the full dataset in one of two ways:
- Request the full image dataset from Unsplash and match it to our dataset via image_id
- Request the full dataset directly by email (see Contacts)
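For the first option, matching amounts to a lookup keyed on the image id. The following is a minimal sketch, assuming each IMAD sample is a dict with an `image_id` field and the Unsplash metadata is a list of dicts with `id` and `url` fields. These field names and the helper `attach_image_urls` are illustrative assumptions, not the published schema.

```python
from typing import Dict, List


def attach_image_urls(samples: List[dict], unsplash_rows: List[dict]) -> List[dict]:
    """Return copies of the samples with an 'image_url' field looked up by image id.

    Field names ('image_id', 'id', 'url') are assumptions for illustration.
    """
    # Build an id -> url index once so each lookup is a dict access.
    url_by_id: Dict[str, str] = {row["id"]: row["url"] for row in unsplash_rows}
    matched = []
    for sample in samples:
        enriched = dict(sample)  # shallow copy; the input sample is left untouched
        # .get returns None when the id is absent from the Unsplash metadata.
        enriched["image_url"] = url_by_id.get(sample["image_id"])
        matched.append(enriched)
    return matched
```

Samples whose id is not found keep `image_url` set to `None`, which makes missing images easy to filter out afterwards.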
This tool classifies whether an utterance could potentially be replaced with an image. For this purpose, the data should contain the following features:
- Image Score
- Sentence Similarity
- BLEU
- Maximum Entity Score
- Thresholding
Classification is performed with the model from the models directory. All features are generated with models from the features scripts directory. An example of usage is shown in the Text Replacing Tutorial. Note that the scripts rely on Paths, which is essential for running them.
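The classification step can be sketched roughly as follows. This assumes the model in the models directory is a scikit-learn estimator saved with joblib that takes one numeric row per utterance and returns a 0/1 label. The feature keys, their order, and both helper functions are hypothetical; the Text Replacing Tutorial remains the reference for the actual pipeline.

```python
from typing import Dict, List, Sequence

# Assumed column order; the order the trained model expects may differ.
FEATURE_ORDER = ("image_score", "sentence_similarity", "bleu",
                 "max_entity_score", "thresholding")


def to_feature_row(features: Dict[str, float],
                   order: Sequence[str] = FEATURE_ORDER) -> List[float]:
    """Arrange named features into a fixed-order numeric row."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in order]


def predict_replaceable(features: Dict[str, float], model_path: str) -> bool:
    """Load a joblib-saved scikit-learn model and classify one utterance."""
    import joblib  # imported lazily so to_feature_row works without joblib installed
    model = joblib.load(model_path)
    # predict expects a 2D input: one row per utterance.
    return bool(model.predict([to_feature_row(features)])[0])
```

Keeping the column order in one constant avoids silently feeding features to the model in the wrong positions.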
This tool finds a better image for an utterance using BLIP VQA. In short, it retrieves the top-N images (N is configurable) that are closest to the utterance and then scores them with the VQA model. This is performed with models from the features scripts directory. An example of usage is shown in the Image Replacing Tutorial.
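The two-stage idea (retrieve top-N by embedding closeness, then rescore) can be illustrated with a small sketch. It assumes cosine similarity as the closeness measure and takes the second-stage scorer as a plain callable standing in for the BLIP VQA model; `best_image`, `cosine`, and `rescore` are hypothetical names, not the repository's API.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_image(utterance_vec: Sequence[float],
               image_vecs: Dict[str, Sequence[float]],
               rescore: Callable[[str], float],
               top_n: int = 3) -> Tuple[str, float]:
    """Return (image_id, score) of the best candidate after two-stage ranking."""
    if not image_vecs:
        raise ValueError("image_vecs is empty")
    # Stage 1: keep the top_n image ids closest to the utterance embedding.
    candidates = sorted(image_vecs,
                        key=lambda i: cosine(utterance_vec, image_vecs[i]),
                        reverse=True)[:top_n]
    # Stage 2: score only those candidates with the (expensive) second model.
    scored = [(image_id, rescore(image_id)) for image_id in candidates]
    return max(scored, key=lambda pair: pair[1])
```

Restricting the second stage to N candidates keeps the number of expensive model calls fixed regardless of how many images are indexed.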
Paths is a special dataclass that contains all the paths used by the scripts:
- dialog_features_path is the path to the directory where utterance embedding vectors are stored. It can be empty initially; the vectors are generated during the script run. An example is shown in the tutorial. Default value is './feature_vectors/test_vectors/'. Make sure you create a new directory or clean the existing one before running your examples.
- image_vectors_path is the path to the .pt file that contains image embedding vectors. Default value is './images/vectors.pt'.
- output_path is the path to the output .json file. The script saves all output to this path and sometimes also reads from it. Default value is './outputs/test_output.json'.
- temporary_path is the path to the temporary .json file. It is used to store intermediate outputs that are not needed at the end. Default value is './outputs/temporary_path.json'.
- entity_vectors_path is the path to the directory where entity embedding vectors are stored. It can be empty initially; the vectors are generated during the script run. Default value is './feature_vectors/entity_vectors/'.
- images_dataset_path is the path to the dataset containing information about the images. It should contain image ids, url, description and ai_desription. All fields except the id can be left blank. Default value is './images/dataset.json'.
- path2images is the path to the directory that contains the raw images. Images should be named with the id used in images_dataset_path. Default value is './images/full_images/'.
- path2images_features is the path to the directory that contains image embedding vectors, named with the same id as in images_dataset_path. Default value is './images/vectors'.
- path2trained_model is the path to the trained model for Text Replacing. You can use the default value './models/random_forest.joblib'.
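To make the defaults above concrete, here is a minimal sketch of a dataclass with the same fields and default values. The class name `ExamplePaths` and the `str` field types are assumptions; the repository's own Paths class may differ in types and extra members.

```python
from dataclasses import dataclass


@dataclass
class ExamplePaths:
    """Illustrative stand-in for Paths; defaults copied from the list above."""
    dialog_features_path: str = "./feature_vectors/test_vectors/"
    image_vectors_path: str = "./images/vectors.pt"
    output_path: str = "./outputs/test_output.json"
    temporary_path: str = "./outputs/temporary_path.json"
    entity_vectors_path: str = "./feature_vectors/entity_vectors/"
    images_dataset_path: str = "./images/dataset.json"
    path2images: str = "./images/full_images/"
    path2images_features: str = "./images/vectors"
    path2trained_model: str = "./models/random_forest.joblib"


# Override only what differs from the defaults, e.g. a fresh vectors directory.
paths = ExamplePaths(dialog_features_path="./feature_vectors/run_01/")
```

Passing a fresh directory for dialog_features_path on each run is one way to follow the advice about not reusing stale vectors.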
All the code, except the fine-tuned BLIP, is licensed under the Apache 2.0 license. The fine-tuned version of BLIP is licensed under the BSD 3-Clause license. The text version of the IMAD dataset is licensed under Creative Commons NC; the dataset with images is licensed under the Unsplash License.
To cite this article, please use this BibTeX reference:
@misc{viktor2023imad,
title={IMAD: IMage-Augmented multi-modal Dialogue},
author={Moskvoretskii Viktor and Frolov Anton and Kuznetsov Denis},
year={2023},
eprint={2305.10512},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Or use the MLA citation:
Viktor, Moskvoretskii et al. “IMAD: IMage-Augmented multi-modal Dialogue.” (2023).
Feel free to reach out to us at [[email protected]] for inquiries, collaboration suggestions, or data requests related to our work.