CA4LTR: Cross-Attention in Long-Tail Recognition

We developed a new network architecture that leverages both image and text modalities to enhance feature learning on long-tailed datasets. Our experiments show that combining pre-trained image and text models via cross-modal attention compensates for the individual limitations of each model and significantly improves long-tail recognition accuracy. Further experiments examined how text quality affects performance and identified key factors that influence the effectiveness of the multimodal model. The code will be released after the paper is accepted.
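As an illustration of the idea, the sketch below shows one way to fuse features from a frozen pre-trained image encoder and a frozen pre-trained text encoder with cross-modal attention in PyTorch. The module name, feature dimensions, and attention configuration are illustrative assumptions, not the exact architecture released with the paper.

```python
# Minimal sketch (not the released implementation) of fusing pre-trained
# image and text features with cross-modal attention. Dimensions and names
# are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256, num_heads=4, num_classes=100):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Image features attend to text features (queries: image, keys/values: text).
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim) region/patch features from a frozen image model
        # txt_feats: (B, N_txt, txt_dim) token features from a frozen text model
        q = self.img_proj(img_feats)
        kv = self.txt_proj(txt_feats)
        attended, _ = self.cross_attn(q, kv, kv)
        fused = self.norm(q + attended)              # residual connection
        logits = self.classifier(fused.mean(dim=1))  # pool over tokens, then classify
        return logits

# Example with random features standing in for frozen encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(8, 49, 512), torch.randn(8, 32, 512))
print(logits.shape)  # torch.Size([8, 100])
```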

Effect of text content quality on the multimodal model

The table shows that image labels distill image content concisely and contribute the most value in the multimodal fusion process. Although the descriptive text contains redundancies, it still performs notably well. Including nonsensical text slightly degrades the multimodal model's performance.

(Table: accuracy of the multimodal model with image labels, descriptive text, and nonsensical text as the text input)
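For reference, the sketch below shows how the three text conditions compared in the table (concise class labels, BLIP-2 style descriptive captions, and nonsensical filler) could be encoded with a pre-trained text model. The CLIP checkpoint and example strings are assumptions for illustration only.

```python
# Minimal sketch of encoding the three text conditions with a pre-trained
# CLIP text encoder; the checkpoint and example strings are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

text_variants = {
    "label":       "a photo of a cat",                       # concise class label
    "descriptive": "a small grey cat sitting on a wooden "
                   "table next to a cup of coffee",          # BLIP-2 style caption
    "nonsense":    "qwerty lorem ipsum zxcv",                 # meaningless filler
}

with torch.no_grad():
    for name, text in text_variants.items():
        inputs = tokenizer(text, padding=True, return_tensors="pt")
        token_feats = text_encoder(**inputs).last_hidden_state  # (1, seq_len, 512)
        print(name, token_feats.shape)
```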

CIFAR-10-LT and CIFAR-100-LT

(Table: accuracy on CIFAR-10-LT and CIFAR-100-LT compared with prior methods)

As the table shows, our method achieves strong results compared with different types of methods. Taking CIFAR-100-LT with imbalance factor (IF) 100 as an example, our method reaches an accuracy of 62.32%, well above the multimodal training approach CLIP2FL (37.56%). Our method also outperforms generative methods, including feature-based LDMLR (51.92%), label-based ProCo (52.80%), and sample-based DiffuLT (50.70%).
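For readers unfamiliar with the benchmark, the sketch below builds a long-tailed CIFAR-100 training split with imbalance factor 100 (the most frequent class has 100 times as many samples as the rarest one) using the common exponential-decay protocol. It is a generic illustration, not the repository's data pipeline.

```python
# Minimal sketch of constructing CIFAR-100-LT with imbalance factor 100.
import numpy as np
from torch.utils.data import Subset
from torchvision.datasets import CIFAR100

def long_tail_indices(targets, num_classes=100, imb_factor=100, max_per_class=500):
    targets = np.asarray(targets)
    indices = []
    for cls in range(num_classes):
        # Exponentially decaying sample count per class.
        n_cls = int(max_per_class * (1.0 / imb_factor) ** (cls / (num_classes - 1)))
        cls_idx = np.where(targets == cls)[0]
        indices.extend(cls_idx[:n_cls].tolist())
    return indices

full_train = CIFAR100(root="./data", train=True, download=True)
lt_train = Subset(full_train, long_tail_indices(full_train.targets))
print(len(lt_train))  # roughly 10.8k images instead of 50k
```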

Tiny-ImageNet-LT

(Figure 2: per-class accuracy on Tiny-ImageNet-LT for the Pure baseline and our method)

The Pure model denotes an image-only ResNet-32 trained from scratch, and the blue bars represent our method. Since the class labels in this dataset are purely numerical, the textual input here consists of descriptive text generated by the BLIP-2 model. As Figure 2 shows, our method improves classification performance on the tail categories while remaining stable on the head categories.
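The sketch below shows one way to generate such descriptive text with BLIP-2 through the Hugging Face transformers interface; the checkpoint name, input file, and generation settings are assumptions and may differ from those used in the paper.

```python
# Minimal sketch of generating descriptive captions with BLIP-2 for classes
# whose labels are purely numerical; checkpoint and settings are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda").eval()

image = Image.open("tiny_imagenet_sample.jpg").convert("RGB")  # hypothetical sample
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(generated_ids[0], skip_special_tokens=True).strip()
print(caption)  # e.g. "a small bird perched on a tree branch"
```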
