An innovative deep learning framework leveraging the CAT (Convolutions, Attention & Transformers) architecture to seamlessly integrate visual and textual modalities. This model exploits the prowess of CNNs for image feature extraction and Transformers for intricate textual pattern recognition, setting a new paradigm in multimodal learning.

English Version | δΈ­ζ–‡η‰ˆ

😻 CAT: Convolutions, Attention & Transformers

πŸš€ Dive into the world of CAT! Imagine if computers could understand and combine the essence of both pictures and words, just as we humans naturally do. By marrying the strengths of Convolutions (think of them as the magic behind image filters) and Transformers (the genius tech behind language models), our CAT framework stands as a bridge, seamlessly blending the visual and textual realms. So, whether you're marveling at a sunset photo or reading a poetic description, CAT seeks to decode, understand, and bring them together in harmony.

Looking for a swift kick-off? Explore our Jupyter Notebook directly in Google Colab!
Open In Colab

πŸ“–

✨ 1. Introduction

This project proposes a model architecture that leverages Convolutional Neural Networks (CNNs) to extract salient features from images and Transformer-based models to capture intricate patterns in text. Termed Convolutions, Attention & Transformers, or the CAT framework, the architecture integrates attention mechanisms as an intermediate conduit that fuses the visual and textual modalities.


Hmmm...NOT this 'CAT'.


This is my 'CAT'!
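
To make the idea concrete, here is a minimal PyTorch sketch of such a pipeline. It is illustrative only, not the exact implementation in `cat/multimodal.py`: the TinyBERT checkpoint name, the 896-dimensional projection, and the label count are assumptions.

```python
# Illustrative sketch of the CAT idea; not the exact code in cat/multimodal.py.
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoModel

class CATSketch(nn.Module):
    def __init__(self, text_model="huawei-noah/TinyBERT_General_4L_312D",
                 hidden_dim=896, num_labels=18):  # num_labels is a placeholder
        super().__init__()
        # Image branch: DenseNet-121 convolutional features (1024-d after pooling).
        self.image_encoder = models.densenet121(weights="DEFAULT").features
        self.image_proj = nn.Linear(1024, hidden_dim)
        # Text branch: a small BERT-style encoder; we use its [CLS] embedding.
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, hidden_dim)
        # Attention fuses the two modality "tokens"; the classifier sees both fused vectors.
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_labels)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_encoder(pixel_values)                        # (B, 1024, H', W')
        img = torch.flatten(nn.functional.adaptive_avg_pool2d(img, 1), 1)
        img = self.image_proj(img)                                     # (B, hidden_dim)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        txt = self.text_proj(txt)                                      # (B, hidden_dim)
        tokens = torch.stack([img, txt], dim=1)                        # (B, 2, hidden_dim)
        fused, _ = self.attention(tokens, tokens, tokens)              # attention-based fusion
        return self.classifier(fused.flatten(1))                       # (B, num_labels) logits
```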

✨ 2. Hyperparameters of the optimal model

Architecture

| Component   | Modality / Property | Module                                        | Details                |
|-------------|---------------------|-----------------------------------------------|------------------------|
| Extractor   | Image               | DenseNet-121                                  | 2 unfrozen blocks      |
| Extractor   | Text                | TinyBert                                      |                        |
| Parallelism | Fully-connected     | Batch Normalization, ReLU, Dropout, Attention | 896 input dimensions   |
| Classifier  |                     | Linear                                        | 896*2 input dimensions |

Training procedure

| Class        | Detail                   | Value                            |
|--------------|--------------------------|----------------------------------|
| Strategy     | Batch Size               | 16                               |
|              | Number of Epochs         | 50                               |
| Optimization | Loss Function            | Binary Cross Entropy With Logits |
|              | Optimizer                | AdamW                            |
|              | Learning Rate            | 1e-5                             |
|              | Bias Correction          | False                            |
| Auxiliary    | Learning Rate Scheduler  | Linear                           |
|              | Number of Warmup Steps   | 0                                |
|              | Number of Training Steps | Total Number of Batches          |
| Prediction   | Output Threshold         | 0.39                             |
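
Translated into code, the recipe might look roughly like this (a sketch built on standard PyTorch and `transformers` utilities, not the repository's exact `cat/trainer.py`; `steps_per_epoch` is a placeholder, and "Bias Correction = False" corresponds to the `correct_bias=False` option of the legacy `transformers` AdamW):

```python
# Sketch of the training recipe in the table above; not the repository's exact trainer.
import torch
from transformers import get_linear_schedule_with_warmup

model = CATSketch()                                   # architecture sketch from Section 1
criterion = torch.nn.BCEWithLogitsLoss()              # Binary Cross Entropy With Logits
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# "Bias Correction = False" maps to the legacy transformers AdamW(correct_bias=False);
# torch.optim.AdamW always applies bias correction.

batch_size, num_epochs = 16, 50
steps_per_epoch = 100                                 # placeholder: len(train_loader) in practice
num_training_steps = num_epochs * steps_per_epoch     # "Total Number of Batches"
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# At prediction time, sigmoid outputs are thresholded at 0.39:
#   preds = (torch.sigmoid(logits) > 0.39).int()
```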

✨ 3. Data processing

πŸ’‘ How to process multimodal data? That is a good question!

3.1 Image data


Image Data Augmentation


Image Data Normalization
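
The augmentation and normalization illustrated above can be expressed as a standard torchvision pipeline. The exact parameters below are assumptions, not necessarily the repository's settings:

```python
# A representative torchvision pipeline; the repository's exact parameters may differ.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # standard DenseNet/ResNet input size
    transforms.RandomHorizontalFlip(),                # augmentation
    transforms.RandomRotation(15),                    # augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
```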

3.2 Text data


Random swap
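
Random swap is a lightweight text augmentation: pick two positions in the sentence and exchange the words, repeating a few times. A minimal implementation (illustrative, not the repository's exact function):

```python
# Random swap augmentation: exchange two randomly chosen word positions, n times.
import random

def random_swap(text: str, n_swaps: int = 1) -> str:
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(random_swap("a cat sitting on a sunny window sill", n_swaps=2))
```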

3.3 Multimodal data
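
Pairing the two modalities comes down to serving the transformed image and the tokenized caption for the same sample. A hedged sketch, assuming `train.csv` stores an image file name, a caption, and a multi-label target per row (column names here are hypothetical; `cat/datasets.py` defines the real format):

```python
# Hypothetical paired image+text dataset; column names and label encoding are assumptions.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class ImageTextDataset(Dataset):
    def __init__(self, csv_path, image_dir, tokenizer, transform):
        self.df = pd.read_csv(csv_path)
        self.image_dir = image_dir
        self.tokenizer = tokenizer            # e.g. the TinyBERT tokenizer
        self.transform = transform            # e.g. train_transform above

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['ImageID']}").convert("RGB")
        encoded = self.tokenizer(row["Caption"], truncation=True, padding="max_length",
                                 max_length=64, return_tensors="pt")
        encoded = {k: v.squeeze(0) for k, v in encoded.items()}   # drop the extra batch dim
        labels = row["Labels"]   # multi-label target; encoding (e.g. multi-hot) is task-specific
        return self.transform(image), encoded, labels
```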

✨ 4. Evaluation methods

- F1 score
- Model size
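
Both metrics can be computed with a few lines of standard tooling. The F1 averaging scheme below is an assumption, and model size is measured here as the serialized `state_dict` on disk:

```python
# F1 score for multi-label predictions, plus model size as the saved checkpoint in MB.
import os
import torch
from sklearn.metrics import f1_score

def model_size_mb(model: torch.nn.Module, path: str = "tmp.pth") -> float:
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

y_true = [[1, 0, 1], [0, 1, 0]]                  # toy multi-hot labels
y_pred = [[1, 0, 0], [0, 1, 0]]
print(f1_score(y_true, y_pred, average="micro"))
print(model_size_mb(torch.nn.Linear(512, 18)))   # tiny stand-in model
```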

✨ 5. Experimental comparison

5.1 Image vs. Text classifiers

Image classifiers

| Model        | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|--------------|-----------|---------------------|-----------|----------|--------|----------------------------|
| ResNet-18    | 42.74     | Training            | 0.5       | 0.6712   | 20     | 199.75                     |
|              |           | Validation          |           | 0.6612   |        |                            |
| ResNet-34    | 81.36     | Training            | 0.5       | 0.6406   | 20     | 157.99                     |
|              |           | Validation          |           | 0.6304   |        |                            |
| DenseNet-201 | 70.45     | Training            | 0.5       | 0.6728   | 20     | 179.58                     |
|              |           | Validation          |           | 0.6594   |        |                            |
| ResNet-50    | 90.12     | Training            | 0.5       | 0.7090   | 20     | 175.89                     |
|              |           | Validation          |           | 0.7063   |        |                            |
| ResNet-50    | 90.12     | Training            | 0.5       | 0.7283   | 50     | 163.43                     |
|              |           | Validation          |           | 0.7260   |        |                            |

Text classifiers

| Model     | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|-----------|-----------|---------------------|-----------|----------|--------|----------------------------|
| Tiny Bert | 54.79     | Training            | 0.5       | 0.5955   | 50     | 53.76                      |
|           |           | Validation          |           | 0.5975   |        |                            |
| Bert Tiny | 16.76     | Training            | 0.635     | 0.5960   | 50     | 17.30                      |
|           |           | Validation          |           | 0.5989   |        |                            |

5.2 Unimodal vs. Multimodal models

Unimodal models (optimal)

| Model     | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|-----------|-----------|---------------------|-----------|----------|--------|----------------------------|
| ResNet-50 | 90.12     | Training            | 0.5       | 0.7283   | 50     | 163.43                     |
|           |           | Validation          |           | 0.7260   |        |                            |
| Bert Tiny | 16.76     | Training            | 0.635     | 0.5960   | 50     | 17.30                      |
|           |           | Validation          |           | 0.5989   |        |                            |

Multimodal models (baseline)

| Model                                  | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|----------------------------------------|-----------|---------------------|-----------|----------|--------|----------------------------|
| DensityBert (DenseNet-121 + TinyBert)  | 97.71     | Training            | 0.35      | 0.8173   | 50     | 191.48                     |
|                                        |           | Validation          |           | 0.8173   |        |                            |
| Bensity (DenseNet-201 + BertTiny)      | 100.83    | Training            | 0.33      | 0.7980   | 50     | 190.01                     |
|                                        |           | Validation          |           | 0.7980   |        |                            |
| ResT (ResNet-34 + BertTiny)            | 100.92    | Training            | 0.38      | 0.7836   | 50     | 170.06                     |
|                                        |           | Validation          |           | 0.7766   |        |                            |

5.3 Self-attention vs. Cross-attention mechanisms

Self-attention mechanism

| Model       | Query | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|-------------|-------|-----------|---------------------|-----------|----------|--------|----------------------------|
| DensityBert | /     | 97.71     | Training            | 0.35      | 0.8173   | 50     | 191.48                     |
|             |       |           | Validation          |           | 0.8173   |        |                            |
| Bensity     | /     | 100.83    | Training            | 0.33      | 0.7980   | 50     | 190.01                     |
|             |       |           | Validation          |           | 0.7980   |        |                            |

Cross-attention mechanism

| Model     | Query | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|-----------|-------|-----------|---------------------|-----------|----------|--------|----------------------------|
| CDBert    | Text  | 93.81     | Training            | 0.29      | 0.7599   | 50     | 181.51                     |
|           |       |           | Validation          |           | 0.7564   |        |                            |
| IMCDBert  | Image | 91.02     | Training            | 0.46      | 0.8026   | 50     | 147.75                     |
|           |       |           | Validation          |           | 0.7985   |        |                            |
| Censity   | Text  | 90.89     | Training            | 0.33      | 0.7905   | 50     | 183.50                     |
|           |       |           | Validation          |           | 0.7901   |        |                            |
| IMCensity | Image | 81.14     | Training            | 0.38      | 0.7869   | 50     | 174.49                     |
|           |       |           | Validation          |           | 0.7801   |        |                            |
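
The Query column indicates which modality drives the attention: in cross-attention the query comes from one modality while keys and values come from the other, whereas self-attention attends over the joint image+text sequence. An illustrative contrast using one token per modality (shapes only; not the repository's exact `cat/attentions.py`):

```python
# Self-attention vs. cross-attention fusion, illustrated with one token per modality.
import torch
import torch.nn as nn

dim = 896
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
img = torch.randn(8, 1, dim)    # image features as a single "token" per sample
txt = torch.randn(8, 1, dim)    # text features as a single "token" per sample

# Self-attention (DensityBert, Bensity): attend over the concatenated sequence.
joint = torch.cat([img, txt], dim=1)               # (8, 2, dim)
self_fused, _ = attn(joint, joint, joint)

# Cross-attention, text query (CDBert, Censity): text attends to image keys/values.
text_q_fused, _ = attn(txt, img, img)

# Cross-attention, image query (IMCDBert, IMCensity): image attends to text keys/values.
img_q_fused, _ = attn(img, txt, txt)
```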

5.4 Variations of DensityBert

Variations

| Model         | Size (MB) | Training/Validation | Threshold | F1 score | Epochs | Efficiency (seconds/epoch) |
|---------------|-----------|---------------------|-----------|----------|--------|----------------------------|
| DensityBert   | 97.71     | Training            | 0.35      | 0.8173   | 50     | 191.48                     |
|               |           | Validation          |           | 0.8173   |        |                            |
| MoDensityBert | 97.72     | Training            | 0.38      | 0.8622   | 50     | 178.95                     |
|               |           | Validation          |           | 0.8179   |        |                            |
| WarmDBert     | 97.72     | Training            | 0.38      | 0.8505   | 50     | 204.09                     |
|               |           | Validation          |           | 0.8310   |        |                            |
| WarmerDBert   | 97.72     | Training            | 0.39      | 0.8567   | 50     | 258.34                     |
|               |           | Validation          |           | 0.8345   |        |                            |
| WWDBert       | 99.77     | Training            | 0.40      | 0.8700   | 100    | 269.93                     |
|               |           | Validation          |           | 0.8464   |        |                            |

✨ 6. Competition information

- Competition link: Here
- My ranking

✨ 7. Poster

✨ 8. Project structure

β”œβ”€β”€ cat
|   β”œβ”€β”€ attentions.py
|   β”œβ”€β”€ datasets.py
|   β”œβ”€β”€ evaluator.py
|   β”œβ”€β”€ multimodal.py
|   β”œβ”€β”€ predict.py
|   β”œβ”€β”€ trainer.py
|   └── __init__.py
β”œβ”€β”€ data
|   β”œβ”€β”€ data/
|   β”‚    └── *.jpg
|   β”œβ”€β”€ train.csv
|   └── test.csv
β”œβ”€β”€ model_hub
|   └── *.pth
β”œβ”€β”€ outcomes/
β”‚   └── *.jpg/*.png/*.jpeg
β”œβ”€β”€ LICENSE
β”œβ”€β”€ notebook.ipynb
β”œβ”€β”€ README.md
β”œβ”€β”€ README.zh-CN.md
└── requirements.txt
