angelavansprang/numerosity_thesis

On Numerosity Representations in Vision-Language Models

This repository contains the code for a thesis on numerosity representations in vision-language models. The abstract reads as follows:

Large vision-language models, such as CLIP, learn a joint image-text embedding space, which can be employed in a variety of downstream tasks, such as zero-shot classification or text-to-image generation. However, they are not strong in certain compositional aspects, such as numerosity, which results in poor performance in tasks in the domain of counting. We investigate the numerosity representations in the visual encoder of CLIP, the ViT, and find that numerosity is encoded better in the middle layers of the ViT. We perform probing on different latent layers in the ViT for the number of different concepts related to the input images. Also, we introduce the binding problem as the task of distinguishing between different objects in an image, which we regard as a necessary first step in counting. We find that this task can be solved for artificially generated images from the MALeViC dataset, even when the objects have the same color and shape, and are further apart. Then, we construct a counting algorithm that uses the results of the binding probe to construct numerosity information from the latent representations. We find that this counter also performs best when using representations from the middle layers of the ViT.

The code is organized into three folders:

  • finetune_clip. This folder contains code to fine-tune CLIP to count, as proposed by Paiss et al. (2023), https://teaching-clip-to-count.github.io/.
  • malevic-master. This folder contains code for the MALeViC dataset from Pezzelle & Fernández (2019), https://github.com/sandropezzelle/malevic. It also contains code to perform amnesic probing, based on Ravfogel et al. (2022), https://github.com/shauli-ravfogel/adv-kernel-removal, as well as code to train and evaluate the probes needed to reproduce the experiments in the thesis. Finally, it contains notebooks that visualize counting and bounding boxes on the MALeViC data.
  • style_package. This folder contains no functional code, only the matplotlib style used for the plots in the report.

Note that the files are a combination of re-used and original code.
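
For readers unfamiliar with layer-wise probing, the general idea can be sketched as follows. This is a minimal illustration, not the code in malevic-master: it assumes a ridge-regularised linear probe fitted with NumPy, and it uses synthetic features in place of real ViT activations. The function names (`fit_linear_probe`, `probe_accuracy`, `make_synthetic_layer`), the layer names, and the signal strengths are all hypothetical and chosen only to show that a probe scores higher when the label is linearly decodable from a representation.

```python
import numpy as np


def fit_linear_probe(X, y, num_classes, l2=1e-3):
    """Fit a ridge-regularised linear map from features to one-hot labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(num_classes)[y]                     # one-hot targets
    gram = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(gram, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float((np.argmax(Xb @ W, axis=1) == y).mean())


def make_synthetic_layer(rng, y, num_classes, dim, signal):
    """Synthetic stand-in for one layer's activations (assumption, not real data).

    Each class gets a random mean vector; `signal` scales how strongly the
    label is present in the features relative to unit Gaussian noise.
    """
    means = rng.normal(size=(num_classes, dim))
    return signal * means[y] + rng.normal(size=(len(y), dim))


rng = np.random.default_rng(0)
num_classes = 4                      # e.g. four possible object counts
y = rng.integers(0, num_classes, size=600)

# Hypothetical "layers": one carries no label information, one carries a lot.
layers = {"no_signal": 0.0, "with_signal": 2.0}

results = {}
for name, signal in layers.items():
    X = make_synthetic_layer(rng, y, num_classes, dim=32, signal=signal)
    W = fit_linear_probe(X[:400], y[:400], num_classes)   # train split
    results[name] = probe_accuracy(W, X[400:], y[400:])   # held-out split

for name, acc in results.items():
    print(f"{name}: probe accuracy = {acc:.2f}")
```

With real data, `X` would be replaced by activations extracted from a given ViT layer, and the probe would be fitted once per layer so that the held-out accuracies can be compared across layers.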
