On Numerosity Representations in Vision-Language Models

This repository contains the code for the thesis "On Numerosity Representations in Vision-Language Models". The abstract reads as follows:

Large vision-language models, such as CLIP, learn a joint image-text embedding space that can be used in a variety of downstream tasks, such as zero-shot classification or text-to-image generation. However, they are weak in certain compositional aspects, such as numerosity, which results in poor performance on counting tasks. We investigate the numerosity representations in the visual encoder of CLIP, the ViT, and find that numerosity is encoded better in the middle layers of the ViT. We perform probing on different latent layers in the ViT for the number of different concepts related to the input images. We also introduce the binding problem as the task of distinguishing between different objects in an image, which we regard as a necessary first step in counting. We find that this task can be solved for artificially generated images from the MALeViC dataset, even when the objects have the same color and shape, and are further apart. We then construct a counting algorithm that uses the results of the binding probe to derive numerosity information from the latent representations. We find that this counter also performs best when using representations from the middle layers of the ViT.
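
To make the layer-wise probing procedure described in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the thesis code: the "layer activations" are synthetic, the layer names and signal strengths are hypothetical, and a simple nearest-centroid classifier stands in for whatever probe architecture the thesis uses. The middle layer scores best here purely by construction; the sketch only illustrates the procedure of fitting one probe per layer and comparing accuracies.

```python
import random


def make_layer(rng, counts, signal, dim=8, noise=0.5):
    """Synthetic stand-in for layer activations: signal * count plus Gaussian noise."""
    return [[signal * c + rng.gauss(0.0, noise) for _ in range(dim)] for c in counts]


def fit_centroids(features, labels):
    """Compute the mean feature vector of each class (the 'probe' parameters)."""
    sums, sizes = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        sizes[y] = sizes.get(y, 0) + 1
    return {y: [v / sizes[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit the probe on the training split and report accuracy on the test split."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


rng = random.Random(0)
train_y = [rng.randint(1, 4) for _ in range(400)]  # object counts 1..4
test_y = [rng.randint(1, 4) for _ in range(200)]

# Hypothetical signal strengths: how strongly each synthetic layer encodes the count.
layer_signal = {"early": 0.0, "middle": 1.0, "late": 0.3}

accuracy = {}
for name, signal in layer_signal.items():
    train_x = make_layer(rng, train_y, signal)
    test_x = make_layer(rng, test_y, signal)
    accuracy[name] = probe_accuracy(train_x, train_y, test_x, test_y)

best_layer = max(accuracy, key=accuracy.get)
print(accuracy, best_layer)
```

With real data, `make_layer` would be replaced by activations extracted from each ViT layer, and the per-layer accuracies would be compared in the same way.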

The code is distributed across three folders:

finetune_clip. This folder contains code to finetune CLIP to count, as proposed by Paiss et al. (2023), https://teaching-clip-to-count.github.io/.
malevic-master. This folder contains code for the MALeViC dataset from Pezzelle & Fernández (2019), https://github.com/sandropezzelle/malevic. It also contains code to perform amnesic probing, based on Ravfogel et al. (2022), https://github.com/shauli-ravfogel/adv-kernel-removal, and code to train and evaluate the probes needed to reproduce the experiments in the thesis. Finally, it contains some notebooks that visualize counting and bounding boxes on the MALeViC data.
style_package. This folder contains no functional code, only the matplotlib style used for the plots in the thesis.
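
The idea of building a counter on top of a binding probe can be sketched as follows. This is an illustration under assumptions, not the algorithm from the thesis: it assumes that a binding probe has already decided, for pairs of object patches, whether they belong to the same object, and it counts objects as the number of connected groups using union-find. The function name `count_objects` and the example pairs are hypothetical.

```python
def count_objects(num_patches, same_object_pairs):
    """Count groups of patches linked by 'same object' predictions (union-find)."""
    parent = list(range(num_patches))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in same_object_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    return len({find(i) for i in range(num_patches)})


# Six object patches; a hypothetical probe links patches 0-1-2 and patches 3-4.
pairs = [(0, 1), (1, 2), (3, 4)]
n_objects = count_objects(6, pairs)
print(n_objects)
```

The sketch assumes that only patches covering an object are passed in; background patches would otherwise each be counted as a separate object.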

Note that the files are a combination of re-used and original code.

Repository: angelavansprang/numerosity_thesis
