This is the official PyTorch implementation of K-LITE:
"K-LITE: Learning Transferable Visual Models with External Knowledge", NeurIPS 2022 (Oral, 1%) by
Sheng Shen*, Chunyuan Li*, Xiaowei Hu*, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach and Jianfeng Gao.
- Jan 17, 2023: REACT extends K-LITE from textual knowledge drawn from a dictionary to multimodal knowledge retrieved from the web.
In this paper, we propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems. In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts. In evaluation, the text is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems in the ELEVATER benchmark, image classification (IC) and object detection (OD), on 20 and 13 existing datasets, respectively. The proposed knowledge-augmented models show a 6.29% average improvement on the 20 IC tasks and a 4.2% average improvement on the 13 OD tasks over existing methods.
We provide two illustrative examples, taken from the Oxford-Flowers and Food-101 IC tasks, of why K-LITE can be helpful.
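To make the idea of enriching class names with dictionary knowledge concrete, here is a minimal sketch. It is not the repository's implementation: the `GLOSSARY` dictionary, the `augment_with_knowledge` helper, the prompt template, and the definition strings are all assumptions written for illustration, standing in for real WordNet or Wiktionary lookups.

```python
# Hypothetical glossary standing in for WordNet/Wiktionary lookups.
# The definitions are placeholder text written for this sketch.
GLOSSARY = {
    "takoyaki": "a ball-shaped snack made of batter with a piece of octopus inside",
    "sashimi": "thinly sliced raw fish served without rice",
}


def augment_with_knowledge(class_name, glossary, template="a photo of a {}."):
    """Return a text prompt for class_name, appending a definition when one exists."""
    prompt = template.format(class_name)
    definition = glossary.get(class_name.lower())
    if definition is None:
        # No external knowledge available: fall back to the plain prompt.
        return prompt
    return f"{prompt.rstrip('.')}, {definition}."


prompts = [augment_with_knowledge(name, GLOSSARY) for name in ["takoyaki", "sashimi", "ramen"]]
for p in prompts:
    print(p)
```

A rare class name such as "takoyaki" gains a description built from more common words, while a class without a glossary entry keeps its plain prompt.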
Model | Training Set | ZS on IN-1K | ZS on 20 datasets | Download |
---|---|---|---|---|
Swin-T | IN-21K | 28.5 | 27.1 | ckpt/config |
Swin-T | IN-21K + GCC-15M | 46.9 | 39.8 | ckpt/config |
Swin-T | IN-21K + GCC-15M + YFCC-14M | 49.3 | 40.5 | ckpt/config |
Swin-B | IN-21K + GCC-15M | 50.0 | 39.4 | ckpt/config |
Swin-B | IN-21K + GCC-15M + YFCC-14M | 52.3 | 42.5 | ckpt/config |
Model | Training Set | ZS on IN-1K | ZS on IN-1K (+5 GPT-3 Knowledge) | ZS on 20 datasets | ZS on 20 datasets (+5 GPT-3 Knowledge) | Download |
---|---|---|---|---|---|---|
Swin-T | IN-21K | 30.5 | 32.0 | 33.5 | 33.8 | ckpt/config |
Swin-T | IN-21K + GCC-15M | 49.9 | 51.6 | 41.1 | 42.3 | ckpt/config |
Swin-T | IN-21K + GCC-15M + YFCC-14M | 49.6 | 51.9 | 40.3 | 41.6 | ckpt/config |
Swin-B | IN-21K + GCC-15M | 52.7 | 55.0 | 42.8 | 43.6 | ckpt/config |
Swin-B | IN-21K + GCC-15M + YFCC-14M | 55.8 | 58.0 | 42.5 | 44.8 | ckpt/config |
NOTE: The "ZS on 20 datasets" setting is the one used in the ICinW benchmark. All the above models are trained without strong data augmentations such as mixup and cutmix.
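The "ZS" columns report zero-shot accuracy. As a rough illustration of how zero-shot classification with an image-text model generally works, the sketch below picks the class whose text embedding is closest to the image embedding under cosine similarity. This is a generic sketch, not the repository's evaluation code; the function names and the toy 3-dimensional embeddings are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the best class and all cosine similarities between image and class texts."""
    img = l2_normalize(image_embedding)
    scores = {}
    for class_name, text_embedding in class_text_embeddings.items():
        txt = l2_normalize(text_embedding)
        scores[class_name] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, invented for this example only.
text_embeddings = {"daisy": [1.0, 0.0, 0.0], "rose": [0.0, 1.0, 0.0]}
label, scores = zero_shot_predict([0.9, 0.1, 0.0], text_embeddings)
print(label, scores)
```

In a knowledge-augmented setting, the text embeddings would come from prompts that include external knowledge about each class rather than from the bare class names.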
To set up the environment, please run:
pip install -r requirements.txt
pip install -e .
Note that to run main.py for training and evaluation, you need to install apex.
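A quick way to check whether apex is visible to the current Python environment before launching anything is sketched below. The `has_module` helper is a hypothetical name introduced here; it only uses the standard library and does not import apex itself.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("apex"):
    print("apex found")
else:
    print("apex not found; install it before running main.py")
```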
Also, see klite/load_wiki for constructing image-text pairs or image-label data (train/validation) augmented by knowledge.
Please follow DATA.md for data preparation.
To evaluate a pre-trained K-LITE on ImageNet val, run:
python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345 main.py --eval \
--cfg <config-file> --resume <checkpoint> --data-path <imagenet-path> --use_knowledge
or
MODE: pretrain method (klite or unicl)
NGPUS: number of gpus
CFG: model config (configs/klite_swin_tiny.yaml or configs/klite_swin_base.yaml)
CKPT_DIR: directory containing the checkpoint
IMAGENETPATH: path to ImageNet
bash scripts/run_in1k_eval.sh $MODE $NGPUS $CFG $CKPT_DIR $IMAGENETPATH
For example, to evaluate the KLITE-Swin-Tiny trained on IN-21K + GCC-15M with a single GPU:
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/klite_swin_tiny.yaml --resume ckpt/klite/in21k_gcc15m/tiny/model_state_dict.pt --data-path <imagenet-path> --use_knowledge
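If you prefer to launch the evaluation from Python, the command above can be assembled as an argument list. This is only a convenience sketch: `build_in1k_eval_command` is a hypothetical helper, it uses only the flags shown in the command above, and `/path/to/imagenet` is a placeholder for your own ImageNet location.

```python
import shlex


def build_in1k_eval_command(cfg, checkpoint, data_path, num_gpus=1, master_port=12345):
    """Assemble the ImageNet evaluation command as a list of arguments."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "--master_port", str(master_port),
        "main.py", "--eval",
        "--cfg", cfg,
        "--resume", checkpoint,
        "--data-path", data_path,
        "--use_knowledge",
    ]


cmd = build_in1k_eval_command(
    cfg="configs/klite_swin_tiny.yaml",
    checkpoint="ckpt/klite/in21k_gcc15m/tiny/model_state_dict.pt",
    data_path="/path/to/imagenet",  # placeholder path
)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.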
To evaluate K-LITE on downstream image classification tasks and compare performance on the same task suite, we include the evaluation toolkit at klite/vision_bechmark/. Please run its setup before evaluating on the 20 ELEVATER Image Classification tasks.
Then, to evaluate a pre-trained K-LITE on 20 ELEVATER Image Classification tasks in a zero-shot way, run:
MODE: pretrain method (klite or unicl)
CFG: model config (clip_swin_tiny or clip_swin_base)
CKPT_PATH: path to the checkpoint
CKPT_ID: the dataset used to pretrain the model (in21k, in21k_gcc15m, in21k_gcc15m_yfcc14m)
bash scripts/run_elevater_eval.sh $MODE $CFG $CKPT_PATH $CKPT_ID
For example, to evaluate the KLITE-Swin-Tiny trained on IN-21K + GCC-15M + YFCC-14M with a single GPU:
CUDA_VISIBLE_DEVICES=0 bash scripts/run_elevater_eval.sh klite clip_swin_tiny ckpt/klite/in21k_gcc15m_yfcc14m/tiny/model_state_dict.pt in21k_gcc15m_yfcc14m
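Because the script takes four positional arguments, it is easy to pass a wrong or missing value. The sketch below validates the arguments against the options listed above before building the command. `build_elevater_eval_command` is a hypothetical helper, and the allowed values are copied from the argument descriptions; the script itself may accept others.

```python
# Allowed values, taken from the argument descriptions above.
VALID_MODES = {"klite", "unicl"}
VALID_CFGS = {"clip_swin_tiny", "clip_swin_base"}
VALID_CKPT_IDS = {"in21k", "in21k_gcc15m", "in21k_gcc15m_yfcc14m"}


def build_elevater_eval_command(mode, cfg, ckpt_path, ckpt_id):
    """Validate the four positional arguments and return the command as a list."""
    checks = [
        ("MODE", mode, VALID_MODES),
        ("CFG", cfg, VALID_CFGS),
        ("CKPT_ID", ckpt_id, VALID_CKPT_IDS),
    ]
    for label, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{label}={value!r} is not one of {sorted(allowed)}")
    return ["bash", "scripts/run_elevater_eval.sh", mode, cfg, ckpt_path, ckpt_id]


elevater_cmd = build_elevater_eval_command(
    "klite",
    "clip_swin_tiny",
    "ckpt/klite/in21k_gcc15m_yfcc14m/tiny/model_state_dict.pt",
    "in21k_gcc15m_yfcc14m",
)
print(" ".join(elevater_cmd))
```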
More details on the ELEVATER benchmark can be found here: [Benchmark] [Toolkit] [Paper]
If you find this repo useful for your project, please consider citing it with the following BibTeX entry:
@inproceedings{shen2022k,
title={K-lite: Learning transferable visual models with external knowledge},
author={Shen, Sheng and Li, Chunyuan and Hu, Xiaowei and Xie, Yujia and Yang, Jianwei and Zhang, Pengchuan and Rohrbach, Anna and Gan, Zhe and Wang, Lijuan and Yuan, Lu and others},
booktitle={NeurIPS},
year={2022}
}
Our codebase is built on UniCL and ELEVATER.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.