
PMR

Introduction

Premise-based Multi-modal Reasoning (PMR) is a task that explores the ability of models to reason with both textual clues (from the premise) and visual clues (from images).

Through manual annotation and adversarial generation, we create the PMR dataset with 30,720 samples. The statistics of PMR are shown below, and you can explore the data on our website.

|                          | Ori. Train | Ori. Val | Ori. Test | Adv. Train | Adv. Val | Adv. Test | Total  |
|--------------------------|------------|----------|-----------|------------|----------|-----------|--------|
| #samples                 | 12,080     | 1,538    | 1,742     | 12,080     | 1,538    | 1,742     | 30,720 |
| #unique 1-grams          | 9,882      | 3,819    | 4,101     | 8,046      | 3,071    | 3,359     | 11,041 |
| #unique 2-grams          | 72,048     | 17,678   | 19,292    | 50,526     | 12,236   | 13,453    | 84,365 |
| Avg. premise length      | 9.48       | 9.47     | 9.54      | 9.48       | 9.47     | 9.54      | 9.49   |
| Avg. action text length  | 14.38      | 14.41    | 14.45     | 14.20      | 14.42    | 14.31     | 14.31  |
| Avg. #objects mentioned  | 1.92       | 1.91     | 1.94      | 2.42       | 2.43     | 2.38      | 2.17   |
| #images                  | 9,536      | 1,213    | 1,370     | 9,536      | 1,213    | 1,370     | 12,119 |
| #movies covered          | 1,353      | 209      | 170       | 1,353      | 209      | 170       | 1,732  |

Dataset Access

The dataset can be downloaded from Google Drive.

PMR has been selected as one of the evaluation tasks at CCL2022, and we provide the full training and validation sets (both original and adversarial samples) for training models. For model evaluation, you can submit your model's predictions on the test set (test-ori-without-label.jsonl) by email to [email protected], and we will provide feedback promptly.
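The required submission layout is not spelled out here, so the following Python snippet is only a rough sketch: it assumes a placeholder predict function and writes one JSON object per line containing the sample's total_id and the predicted answer_label. Adapt it to whatever format the organizers confirm.

```python
import json

def predict(sample):
    # Placeholder: replace with your model's choice among the four answer_choices.
    return 0

# Read the unlabeled test split and write one prediction per line.
with open("test-ori-without-label.jsonl", encoding="utf-8") as fin, \
     open("predictions.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        sample = json.loads(line)
        result = {"total_id": sample["total_id"], "answer_label": predict(sample)}
        fout.write(json.dumps(result) + "\n")
```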

Data Format

Here is a brief introduction to the data format.

{
	# Unique id of the sample.
	"total_id": 98,

	# Name of the movie the image is from.
	"movie": "3051_NANNY_MCPHEE_RETURNS",

	# Object tags from Fast RCNN.
	"objects": ["person", "person", "handbag", "spoon"],

	# Path of the image.
	"img_fn": "lsmdc_3051_NANNY_MCPHEE_RETURNS/[email protected]",

	# Id of the image.
	"img_id": "train-5244",

	# Path of the file storing the bounding-box information.
	"metadata_fn": "lsmdc_3051_NANNY_MCPHEE_RETURNS/[email protected]",

	# Tokenized premise; the integers in sub-lists are indices into the "objects" list above.
	"premise": [[1], "and", [0], "are", "in", "good", "relationship", "."],

	# Category of the premise.
	"category": "character",

	# Tokenized actions (answer choices); the integers in sub-lists are indices into "objects".
	"answer_choices": [
		[[1], "with", "a", "handbag", "will", "hug", [0], "tightly", "."],
		[[1], "with", "a", "green", "handbag", "will", "shout", "at", [0], "in", "the", "kitchen", "."],
		[[1], "with", "a", "handbag", "will", "shout", "at", [0], "in", "the", "kitchen", "."],
		[[1], "with", "a", "green", "handbag", "will", "hug", [0], "tightly", "."]
	],

	# Types of the answers, in the same order as answer_choices.
	"answer_types": ["Action-True", "Distractor2", "Action-False", "Distractor1"],

	# Index of the correct answer in answer_choices.
	"answer_label": 0,

	# Original set only: the total_id of the sample that shares the same image as this one, if it exists (-1 by default).
	"pal_id": -1,

	# Adversarial set only: the total_ids of the samples that the four answer choices are taken from.
	"answer_ori_ids": [14097, 12681, 387, 13170]
}
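To make the field layout concrete, here is a minimal Python sketch that reads one of the .jsonl splits and renders a sample's premise and answer choices by substituting the object-index sub-lists with the corresponding tags from "objects". The file name train-ori.jsonl is a placeholder for whichever split you downloaded, and joining multiple indices with "and" is just a display convention of this sketch.

```python
import json

def render_tokens(tokens, objects):
    """Turn a tokenized premise/action into a plain string, replacing
    object-index sub-lists (e.g. [1]) with the detected object tags."""
    pieces = []
    for tok in tokens:
        if isinstance(tok, list):
            pieces.append(" and ".join(objects[i] for i in tok))
        else:
            pieces.append(tok)
    return " ".join(pieces)

# "train-ori.jsonl" is a placeholder name for the split you downloaded.
with open("train-ori.jsonl", encoding="utf-8") as f:
    sample = json.loads(next(f))  # inspect the first sample

print("Premise:", render_tokens(sample["premise"], sample["objects"]))
for idx, choice in enumerate(sample["answer_choices"]):
    marker = "*" if idx == sample["answer_label"] else " "
    print(f"{marker} ({sample['answer_types'][idx]})",
          render_tokens(choice, sample["objects"]))
```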

Baseline Models

We provide baseline models here, adapted from three vision-language pre-trained models that perform strongly on multimodal understanding tasks:

  1. PMR-baseline-VL-BERT (source repo)
  2. UNITER
  3. ERNIE

Citation

Please consider citing this paper if you find this repository useful:

@article{PMR2022,
	title	= {Premise-based Multimodal Reasoning: {A} Human-like Cognitive Process},
	author  = {Qingxiu Dong and
               Ziwei Qin and
               Heming Xia and
               Tian Feng and
               Shoujie Tong and
               Haoran Meng and
               Lin Xu and
               Tianyu Liu and
               Zhifang Sui and
               Weidong Zhan and
               Sujian Li and
               Zhongyu Wei},
	journal = {CoRR},
	volume  = {abs/2105.07122},
	year    = {2021},
}