Data

1. Download the VisDial v1.0 dialog json files and images from here.
2. Download the word counts file for the VisDial v1.0 train split from here. It is used to build the vocabulary.
3. Use Faster-RCNN to extract image features from here.
4. Use Large-Scale-VRD to extract visual relation embeddings from here.
5. Use Densecap to extract local captions from here.
6. Generate ELMo word vectors from here.
7. Download pre-trained GloVe word vectors from here.
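
To illustrate how the word counts file can be turned into a vocabulary, here is a minimal sketch. It assumes the file is a JSON object mapping each word to its count in the train split. The special tokens, the `min_count` threshold of 5, and the function names `load_word_counts`, `build_vocabulary`, and `encode` are all hypothetical choices for this example and may differ from what this repository actually does.

```python
import json
from typing import Dict, List

# Assumed special tokens; the actual token names and indices may differ.
SPECIAL_TOKENS = ["<PAD>", "<S>", "</S>", "<UNK>"]


def load_word_counts(path: str) -> Dict[str, int]:
    """Read a word counts file, assumed to be a JSON object {word: count}."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_vocabulary(word_counts: Dict[str, int], min_count: int = 5) -> Dict[str, int]:
    """Map every sufficiently frequent word to an integer index.

    Words seen fewer than `min_count` times are left out and later
    fall back to the <UNK> index. The threshold is an assumption.
    """
    word2index = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    # Sort the kept words so the mapping is deterministic across runs.
    kept = sorted(
        word
        for word, count in word_counts.items()
        if count >= min_count and word not in word2index
    )
    for word in kept:
        word2index[word] = len(word2index)
    return word2index


def encode(tokens: List[str], word2index: Dict[str, int]) -> List[int]:
    """Convert tokens to indices, using <UNK> for out-of-vocabulary words."""
    unk = word2index["<UNK>"]
    return [word2index.get(token, unk) for token in tokens]
```

In use, one would call `build_vocabulary(load_word_counts(path))` once with the path to the downloaded word counts file, then pass tokenized questions, answers, and captions through `encode`. Filtering by count keeps rare words and typos out of the embedding table, at the cost of mapping them all to a single unknown token.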