Explaining Medical Image Classifiers with Visual Question Answering Models:
a Visual Question Answering (VQA) Model trained on medical data
Deep learning has shown promising potential for medical image classification and diagnosis. However, in addition to the limited availability of annotated training data in the medical domain, explanations for the models' predictions are also desired in this field of application.
Using Flamingo, a Visual Language Model for Few-Shot Learning, we leverage large pre-trained language models and vision encoders to build a new VQA model that can answer questions about X-ray images.
You can find the available pre-trained models under the following link.
- Datasets
- Model Architecture
- Training and Testing
- Getting Started
- Demo and Deploy
- Future Work
- Contributing
- Acknowledgments
For backbone training, the following datasets are used:
- ROCO
- [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/)
For medical VQA:
- ImageCLEF 2019: you can download it from [here](https://github.com/Rodger-Huang/SYSU-HCP-at-ImageCLEF-VQA-Med-2021)
- [VQA-RAD](https://osf.io/89kps/)
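Regardless of the source, each sample used for VQA training boils down to an image plus a question/answer pair. The sketch below only illustrates that structure; the field names and file paths are assumptions, not the datasets' actual schemas or the repository's loaders.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VQASample:
    # Illustrative field names; the real datasets use their own schemas.
    image_path: str
    question: str
    answer: str

def to_samples(records: List[Dict[str, str]]) -> List[VQASample]:
    """Wrap raw records (e.g. parsed from the downloaded files) into VQA samples."""
    return [VQASample(r["image_path"], r["question"], r["answer"]) for r in records]

samples = to_samples([
    {"image_path": "images/xray_0001.png",   # hypothetical path
     "question": "Is there a pneumothorax?",
     "answer": "no"},
])
print(samples[0].question, "->", samples[0].answer)
```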
Using elements of Flamingo's architecture, we built a model that takes an X-ray image and an arbitrary question as inputs and generates an answer to that question.
A simplified overview of our model architectures is given in the following figures:
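As a rough code-level companion to those figures, the sketch below shows the core idea in simplified form: visual features from a frozen vision encoder are injected into a frozen language decoder through a small trainable gated cross-attention block, and the decoder predicts the answer tokens. All module names and sizes here are illustrative stand-ins, not the exact CLIP/GPT-2 components used in the repository.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style block: text tokens attend to visual features, with a
    zero-initialized tanh gate so training starts from the unmodified frozen LM."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> identity at init
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(query=self.norm(text_tokens),
                                key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

class MedicalVQASketch(nn.Module):
    """Toy stand-in for the CLIP + GPT-2 Flamingo-style VQA model;
    in the real model the vision and language backbones are frozen pre-trained networks."""
    def __init__(self, dim: int = 768, vocab_size: int = 50257):
        super().__init__()
        self.vision_encoder = nn.Linear(512, dim)        # placeholder for the frozen vision encoder
        self.text_embedding = nn.Embedding(vocab_size, dim)
        self.cross_attention = GatedCrossAttentionBlock(dim)
        self.decoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)        # placeholder for the frozen LM head

    def forward(self, image_features, question_ids):
        visual_tokens = self.vision_encoder(image_features)   # (B, N_img, dim)
        text_tokens = self.text_embedding(question_ids)       # (B, N_txt, dim)
        fused = self.cross_attention(text_tokens, visual_tokens)
        hidden = self.decoder_layer(fused)
        return self.lm_head(hidden)                            # next-token logits

# Example forward pass with dummy tensors
model = MedicalVQASketch()
logits = model(torch.randn(2, 49, 512), torch.randint(0, 50257, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 50257])
```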
- Hardware: 1 A40 GPU, 200 epochs with early stop on val loss (at around 140 in each experiment)
- Learning Rate (LR): 1e-4
- LR Warmup: 863 Steps
- Loss: Cross Entropy Loss
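A minimal sketch of how such a configuration is wired up in PyTorch is shown below; the optimizer choice (AdamW) and the linear warmup schedule are assumptions for illustration, while the actual setup lives in the run scripts.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Assumed illustration of the hyperparameters listed above.
model = torch.nn.Linear(768, 50257)            # stand-in for the trainable VQA parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
warmup_steps = 863

def warmup_lambda(step: int) -> float:
    # Linearly ramp the LR from 0 to the base LR over the warmup steps, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative optimization step on dummy data
logits = model(torch.randn(4, 768))                       # (batch, vocab)
loss = criterion(logits, torch.randint(0, 50257, (4,)))   # next-token targets
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```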
Check out flamingo_clip_gpt2_vqa_rad_run.py:
- Hardware: 1 A40 GPU, 80 epochs with early stop on val loss (at around 40 in each experiment)
- Duration: ~30 mins
- LR: 1e-5
- LR Warmup: 30 Steps
- Loss: Cross Entropy Loss
- Testing: check out vqaRAD_flamingo_clip_gpt2_infer.ipynb:
  - On identical answers (GT answer: “no”, predicted answer: “no” -> true positive)
  - On embeddings: uses the tokens before the last linear layer for the GT and predicted answer → cosine similarity (sketched below)
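A minimal sketch of this comparison is shown below. The `embed_fn` handle and the 0.9 threshold are placeholders, not the notebook's actual code; the real evaluation is in vqaRAD_flamingo_clip_gpt2_infer.ipynb.

```python
import torch
import torch.nn.functional as F

def answers_match(gt_answer: str, pred_answer: str,
                  embed_fn, threshold: float = 0.9) -> bool:
    """Compare a ground-truth and a predicted answer.

    `embed_fn` is a placeholder for a function returning the hidden states of the
    answer tokens just before the model's last linear layer, shaped (num_tokens, dim).
    The 0.9 threshold is an illustrative choice, not the repository's value.
    """
    # Exact string match ("no" vs. "no") counts as a true positive directly.
    if gt_answer.strip().lower() == pred_answer.strip().lower():
        return True
    # Otherwise mean-pool the token embeddings and compare with cosine similarity.
    gt_vec = embed_fn(gt_answer).mean(dim=0)
    pred_vec = embed_fn(pred_answer).mean(dim=0)
    similarity = F.cosine_similarity(gt_vec, pred_vec, dim=0).item()
    return similarity >= threshold

# Dummy embedder so the sketch runs stand-alone
dummy_embed = lambda text: torch.randn(len(text.split()) + 1, 768)
print(answers_match("no", "no", dummy_embed))   # True (identical answers)
```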
Check out flamingo_clip_gpt2_imageclef_run.py:
- Hardware: 1 A40 GPU, 200 epochs with early stop on val loss (at around 110 in each experiment)
- Duration: ~3 hours
- LR: 1e-4
- LR Warmup: 30 Steps
- Loss: Cross Entropy Loss
- Testing: check out Imageclef_flamingo_clip_gpt2_playground.ipynb:
  - On identical answers (Ground Truth answer: “no”, predicted answer: “no” -> true positive)
  - Classification Accuracy
- Evaluation: Accuracy, BLEU score (see the sketch below)
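A small sketch of how these two metrics can be computed is given below; NLTK's sentence-level BLEU (here BLEU-2 with smoothing) is an assumed stand-in for the exact scorer used in the notebook.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate(gt_answers, pred_answers):
    """Exact-match accuracy plus an average sentence-level BLEU score."""
    smoothing = SmoothingFunction().method1   # avoids zero scores on very short answers
    correct, bleu_total = 0, 0.0
    for gt, pred in zip(gt_answers, pred_answers):
        gt_tokens, pred_tokens = gt.lower().split(), pred.lower().split()
        correct += int(gt_tokens == pred_tokens)
        # BLEU-2 (unigram + bigram) is an illustrative choice for short clinical answers.
        bleu_total += sentence_bleu([gt_tokens], pred_tokens,
                                    weights=(0.5, 0.5), smoothing_function=smoothing)
    return correct / len(gt_answers), bleu_total / len(gt_answers)

accuracy, bleu = evaluate(["no", "left lower lobe"], ["no", "right lower lobe"])
print(f"accuracy={accuracy:.2f}  bleu={bleu:.2f}")
```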
To make it easy for you to get started with our model, here's a list of recommended next steps:
- Clone this repository into a local folder.
cd local/path
git clone https://gitlab.lrz.de/CAMP_IFL/diva/mlmi-vqa
- Set up the Python virtual environment using conda:
conda env create -f environment.yml
conda activate mlmi
- Check the playground notebooks for usage examples
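Conceptually, what the playground and inference notebooks do at answer time is autoregressive decoding: condition on the image and the question, then repeatedly pick the next answer token until an end-of-sequence token (or a length limit) is reached. The snippet below is a self-contained illustration of that loop with a random stand-in model, not the repository's actual inference code.

```python
import torch

EOS_ID, VOCAB = 0, 50257

def greedy_decode(model, image_feats, question_ids, max_new_tokens=20):
    """Greedy autoregressive decoding: append the most likely next token each step."""
    tokens = question_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(image_feats, tokens)            # (1, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == EOS_ID:
            break
    return tokens[0, question_ids.shape[1]:]           # the generated answer ids

# Random stand-in for the trained VQA model
dummy_model = lambda img, txt: torch.randn(1, txt.shape[1], VOCAB)
answer_ids = greedy_decode(dummy_model, torch.randn(1, 49, 512),
                           torch.randint(1, VOCAB, (1, 8)))
print(answer_ids)
```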
You can check out and try our model on our demo page using the QR code. To run the demo, check demo_imageclef.ipynb.
- Domain Specific Language Decoder
- Domain Specific Tokenizer
- Decoder with a similar number of parameters to the Chinchilla language model family
- Optimize current approach
- Qualitative evaluation and comparison with other works
- Visualization of Attention Maps
At the moment we are still closed for contributions.
Authors: Fabian Scherer - Andrei Mancu - Alaeddine Mellouli - Çağhan Köksal
We thank the MLMI team as well as Matthias Keicher and Kamilia Zaripova for their help and support.
Private Repository until further development.