Premise-based Multi-modal Reasoning (PMR) is a task that explores the ability of models to reason with both textual clues (from the premise) and visual clues (from images).
Through manual annotation and adversarial generation, we built the PMR dataset with 30,720 samples. The statistics for PMR are shown below, and you can explore the dataset on our website.
| | Ori. Train | Ori. Val | Ori. Test | Adv. Train | Adv. Val | Adv. Test | Total |
|---|---|---|---|---|---|---|---|
| #samples | 12,080 | 1,538 | 1,742 | 12,080 | 1,538 | 1,742 | 30,720 |
| #unique 1-gram | 9,882 | 3,819 | 4,101 | 8,046 | 3,071 | 3,359 | 11,041 |
| #unique 2-gram | 72,048 | 17,678 | 19,292 | 50,526 | 12,236 | 13,453 | 84,365 |
| Avg premise length | 9.48 | 9.47 | 9.54 | 9.48 | 9.47 | 9.54 | 9.49 |
| Avg action text length | 14.38 | 14.41 | 14.45 | 14.20 | 14.42 | 14.31 | 14.31 |
| Avg #objects mentioned | 1.92 | 1.91 | 1.94 | 2.42 | 2.43 | 2.38 | 2.17 |
| #images | 9,536 | 1,213 | 1,370 | 9,536 | 1,213 | 1,370 | 12,119 |
| #movies covered | 1,353 | 209 | 170 | 1,353 | 209 | 170 | 1,732 |
The dataset can be downloaded from Google Drive.
PMR has been selected as one of the evaluation tasks at CCL2022, and we provide the full train and validation sets (both original and adversarial samples) for training models. For evaluation, you can submit your model's predictions on the test set (test-ori-without-label.jsonl) by emailing [email protected], and we will give feedback promptly.
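For reference, below is a minimal sketch of preparing a submission file. The exact file layout expected by the organizers is not specified in this README, so the `predictions.jsonl` name and the `{"total_id", "answer_label"}` layout are assumptions; please confirm the required format by email before submitting.

```python
import json

# Load the unlabeled test set (path assumed to match the released file name).
with open("test-ori-without-label.jsonl", encoding="utf-8") as f:
    test_samples = [json.loads(line) for line in f]

# Replace this placeholder with your model's predicted answer indices (0-3),
# one per test sample, in the same order as the file.
predicted_labels = [0] * len(test_samples)

# Write one JSON object per line pairing each total_id with a predicted label
# (assumed layout, not an official specification).
with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for sample, label in zip(test_samples, predicted_labels):
        f.write(json.dumps({"total_id": sample["total_id"], "answer_label": int(label)}) + "\n")
```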
Here is a brief introduction to the data format.
{
"total_id": 98,
# Name of the movie which the image is from.
"movie": "3051_NANNY_MCPHEE_RETURNS",
# Object tags from Fast RCNN
"objects": ["person", "person", "handbag", "spoon"],
# Path of the image
"img_fn": "lsmdc_3051_NANNY_MCPHEE_RETURNS/[email protected]",
# Id of the image
"img_id": "train-5244",
# Path of the file storing the information of bounding boxes
"metadata_fn": "lsmdc_3051_NANNY_MCPHEE_RETURNS/[email protected]",
# Tokenized premise; the integers in sub-lists are indices into the objects list above.
"premise": [[1], "and", [0], "are", "in", "good", "relationship", "."],
# Category of the premise.
"category": "character"
# Tokenized actions, the intergers in lists indicate the index of objects.
"answer_choices": [
[[1], "with", "a", "handbag", "will", "hug", [0], "tightly", "."],
[[1], "with", "a", "green", "handbag", "will", "shout", "at", [0], "in", "the", "kitchen", "."],
[[1], "with", "a", "handbag", "will", "shout", "at", [0], "in", "the", "kitchen", "."],
[[1], "with", "a", "green", "handbag", "will", "hug", [0], "tightly", "."]
],
# The types of answers in the order corresponding to the answer_choices
"answer_types": ["Action-True", "Distractor2", "Action-False", "Distractor1"],
# The index of the correct answer in answer_choices.
"answer_label": 0
# For original set, the total_id of the sample that has the same image as the current sample if it exists.(-1 is the default)
"pal_id":-1
# For adversarial set, the list of total_id which the four choices are from.
"answer_ori_ids":[14097, 12681, 387, 13170]
}
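For concreteness, here is a minimal sketch (not part of the official codebase) of how a sample in this format can be loaded and rendered as plain text. The file name and the `render` helper are hypothetical, and joining multiple object indices with "and" is a simplifying assumption.

```python
import json

def render(tokens, objects):
    """Turn a tokenized field into a string, replacing object-index lists
    (e.g. [1]) with the corresponding object tags; multiple indices are
    joined with "and" as a simplifying assumption."""
    words = []
    for tok in tokens:
        if isinstance(tok, list):
            words.append(" and ".join(objects[i] for i in tok))
        else:
            words.append(tok)
    return " ".join(words)

# Path assumed; any of the released .jsonl files works the same way.
with open("train-ori.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print("Premise:", render(sample["premise"], sample["objects"]))
for idx, choice in enumerate(sample["answer_choices"]):
    marker = "*" if idx == sample.get("answer_label") else " "
    print(f"{marker} ({idx}) {render(choice, sample['objects'])}")
```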
We provide baseline models here. They are adapted from three pretrained vision-language models that perform well on multimodal understanding tasks.
Please consider citing this paper if you find this repository useful:
@article{PMR2022,
title = {Premise-based Multimodal Reasoning: {A} Human-like Cognitive Process},
author = {Qingxiu Dong and
Ziwei Qin and
Heming Xia and
Tian Feng and
Shoujie Tong and
Haoran Meng and
Lin Xu and
Tianyu Liu and
Zhifang Sui and
Weidong Zhan and
Sujian Li and
Zhongyu Wei},
journal = {CoRR},
volume = {abs/2105.07122},
year = {2021},
}