
Learnable Pillar-based Re-ranking for Image-Text Retrieval (LeaPRR)

PyTorch code of the SIGIR'23 paper "Learnable Pillar-based Re-ranking for Image-Text Retrieval".

Introduction

Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on pairwise relations (i.e., whether a data sample matches another) but ignores higher-order neighbor relations (i.e., the matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has demonstrated the value of capturing neighbor relations in single-modality retrieval tasks. However, directly extending existing re-ranking algorithms to image-text retrieval is ineffective. In this paper, we analyze the reasons from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space using only similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.
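For intuition, the following is a minimal sketch of the pillar-space mapping described above, assuming precomputed intra-modal (image-image) and inter-modal (image-text) similarity matrices. The function name, shapes, and K are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def pillar_features(intra_sims, inter_sims, k=10):
    """Map each image into a 2K-dim pillar space using only similarities
    (illustrative sketch, not the repository's implementation)."""
    # intra_sims: (n_img, n_img) image-image similarities
    # inter_sims: (n_img, n_txt) image-text similarities
    intra = -np.sort(-intra_sims, axis=1)[:, :k]  # similarities to top-K intra-modal pillars
    inter = -np.sort(-inter_sims, axis=1)[:, :k]  # similarities to top-K inter-modal pillars
    return np.concatenate([intra, inter], axis=1)  # (n_img, 2K) pillar-space representation

# Toy example: 100 images, 500 captions.
rng = np.random.default_rng(0)
feats = pillar_features(rng.random((100, 100)), rng.random((100, 500)))
assert feats.shape == (100, 20)
```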

(Figure: model architecture overview)

Requirements

We recommend the following dependencies.

Download data

We use the Flickr30K and MS-COCO datasets with the splits produced by Andrej Karpathy. The raw images can be downloaded from their original sources here and here.

The base models used in this paper include SCAN, VSRN, CAMERA, VSE_infinity, and DIME. We first evaluate the base models with the checkpoints provided in their repositories to obtain the initial ranking results and similarities. We then preprocess the initial ranking results and similarities in a batch fashion and save the structured data for model training; a sketch of this step is given below. All the structured similarities and initial ranking results needed to reproduce the experiments in the paper can be downloaded here.
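As a rough illustration of the preprocessing step, the sketch below keeps the top-K initial ranking in each direction and saves it alongside the similarities. The file name, K, and the array layout are assumptions for illustration, not the repository's actual data format.

```python
import numpy as np

def preprocess(sims, k=100, out_path="structured_data.npz"):
    """Save a base model's similarities and top-K initial rankings
    (illustrative sketch; the real format may differ)."""
    # sims: (n_images, n_captions) image-to-text similarity matrix
    i2t_rank = np.argsort(-sims, axis=1)[:, :k]    # top-K captions per image
    t2i_rank = np.argsort(-sims.T, axis=1)[:, :k]  # top-K images per caption
    np.savez(out_path, sims=sims, i2t_rank=i2t_rank, t2i_rank=t2i_rank)

# Toy example: random similarities for 1000 images and 5000 captions.
preprocess(np.random.default_rng(0).random((1000, 5000)))
```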

We refer to the path of extracted files as $DATA_PATH.

Train new models

For example, to train the model based on DIME on the Flickr30K dataset:

python main.py --logger_path $LOGGER_PATH --data_path $DATA_PATH --dataset flickr --base_model DIME_ensemble

By changing the --dataset and --base_model arguments, one can train models on other datasets and with other base models, as in the example below.
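For instance, assuming coco and SCAN are the corresponding argument values (check main.py for the accepted choices), training on MS-COCO with SCAN would look like:

python main.py --logger_path $LOGGER_PATH --data_path $DATA_PATH --dataset coco --base_model SCAN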

Evaluate pre-trained models

Download the pretrained models here and put them in $LOGGER_PATH. Run main.py with the --only_test option:

python main.py --logger_path $LOGGER_PATH --data_path $DATA_PATH --dataset flickr --base_model DIME_ensemble --only_test checkpoint_best

Reference

@inproceedings{qu2023learnable,
  title={Learnable Pillar-based Re-ranking for Image-Text Retrieval},
  author={Qu, Leigang and Liu, Meng and Wang, Wenjie and Zheng, Zhedong and Nie, Liqiang and Chua, Tat-Seng},
  booktitle={Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1252--1261},
  year={2023}
}
