
Add a linear layer to squeeze all patch embeddings into a single image embedding? #107

Closed
brunosan opened this issue Jan 4, 2024 · 6 comments

@brunosan (Member) commented Jan 4, 2024

The current bottleneck of the Unet architecture is the set of patch embeddings, one for each section of the image. When we create the embedding of the whole image, we use an average of all patch embeddings.

However, this approach is very lossy and yields a smoothed version of the image semantics, especially when trying to reconstruct an image from its embedding (plus location and time). Moreover, each patch embedding does not capture only the semantics of its own patch; rather, it is trained to capture the self-attention-weighted semantics of all other available patches within the image. This also makes the collection of patch embeddings highly redundant.

Can we introduce one more feedforward layer to aggregate all patch embeddings, location, and time data into a single, image-wide embedding? (One layer on the encoder to go down from patch semantics to image semantics, and one on the decoder to expand from image semantics back to patch semantics.)

This embedding would encapsulate the entire semantic context of the image at a specific location and time. It would also allow us to reconstruct the image from the image embedding.
I suspect it would also need to convey where in the image the semantics are, so the decoder can place those differences within the image correctly. This is also highly desirable for downstream tasks that need to locate semantics within the image.
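
A minimal sketch of what such an aggregation head could look like (PyTorch; all names and dimensions are assumptions for illustration, not the actual model code):

```python
import torch
import torch.nn as nn


class ImageEmbeddingHead(nn.Module):
    """Hypothetical sketch of the proposed aggregation layer.

    Squeezes per-patch embeddings plus encoded location/time into one
    image-wide embedding, and expands it back to per-patch vectors.
    Names and dimensions are assumptions, not the actual model code.
    """

    def __init__(self, embed_dim: int = 768, meta_dim: int = 8):
        super().__init__()
        # Encoder side: patch semantics + metadata -> image semantics
        self.squeeze = nn.Linear(embed_dim + meta_dim, embed_dim)
        # Decoder side: image semantics -> patch semantics
        self.expand = nn.Linear(embed_dim, embed_dim)

    def forward(self, patches: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, embed_dim); meta: (batch, meta_dim)
        pooled = patches.mean(dim=1)  # (batch, embed_dim)
        return self.squeeze(torch.cat([pooled, meta], dim=-1))

    def decode(self, image_embedding: torch.Tensor, num_patches: int) -> torch.Tensor:
        # Broadcast the single image embedding back to one vector per patch
        return self.expand(image_embedding).unsqueeze(1).expand(-1, num_patches, -1)


head = ImageEmbeddingHead()
patches = torch.randn(4, 64, 768)  # 64 patch embeddings per image
meta = torch.randn(4, 8)           # encoded lat/lon and time
image_embedding = head(patches, meta)                         # (4, 768)
reconstructed = head.decode(image_embedding, num_patches=64)  # (4, 64, 768)
```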

@yellowcap (Member)

We could consider storing the raw encoder output alongside the averaged embeddings. With the raw encoder output, any kind of re-combination can be performed later, including using a linear layer to learn the "best" combination.

@brunosan (Member, Author)

Tagging here something I learned today: for each self-attention patch, the 13 input layers (bands) are grouped into 6 groups, and we create one embedding per group.

I still think it would make sense to roll all layer groups into a single embedding per self-attention patch, and use that as the output of the encoder.
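
A hedged sketch of what that could look like (the 6 groups come from the comment above; everything else is an assumption for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: collapse the per-group embeddings of each patch into
# one embedding per patch with a learned linear layer. The 6 groups come from
# the comment above; all other dimensions are assumptions.
num_groups, embed_dim = 6, 768
merge_groups = nn.Linear(num_groups * embed_dim, embed_dim)

group_embeddings = torch.randn(2, 64, num_groups, embed_dim)             # (batch, patches, groups, dim)
patch_embeddings = merge_groups(group_embeddings.flatten(start_dim=2))   # (batch, patches, dim)
```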

@brunosan (Member, Author)

I keep coming up with the need to do this.
@yellowcap and I had a great conversation where we decided NOT to do this for now.

The reason is that these band groups currently represent a wider capacity to learn features, grouped by band (e.g. optical features, DEM features, SAR features). This is much richer than a single vector. We don't know whether squeezing them further loses quality, and answering that question seems lower priority right now (versus the cost of the extra storage space for the embeddings).

If we do need to squeeze down to one vector, we can always build a "decoder" that compresses these into a single vector while keeping most of the information (loss function TBD). We could also work from the stored embeddings rather than the encoder itself.

It is true that doing it this way creates the need to average the patch-level embeddings, or else to store patch-level embeddings that are 16x larger, which we might sometimes need, as in #168.

@brunosan (Member, Author)

Bumping this up again.

Especially as we move away from fixed bands and band groups, and focus on e.g. similarity search at the patch level, it seems critical that we do not create embeddings at the neck of the Unet per patch AND per band.

1) How would that even work when the input data can have different bands?
2) When we average across bands to create the patch embedding, we force a brute-force reduction of all independent bands into a single vector. Maybe "houses" is a semantic on some dimensions in one band but on other dimensions in another, so averaging doesn't even make sense semantically.

I propose (again) that we now reduce the neck of the Unet to one embedding per patch.

Cc @yellowcap @srmsoumya

@brunosan (Member, Author)

Talking with @yellowcap, it seems we are already merging all bands into a single patch embedding.

@srmsoumya to confirm and close here.

@yellowcap (Member) commented Jun 5, 2024

Closing as out of date, feel free to re-open if appropriate. We have a class token in v1, which represents a learned way to compress the band embeddings into a single embedding, so that addresses the issue.
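
For reference, a minimal sketch of the class-token idea (illustrative only; dimensions and layers are assumptions, not the v1 implementation):

```python
import torch
import torch.nn as nn

# Illustrative sketch of a class token: a learnable vector is prepended to the
# token sequence, and its output after the transformer block is used as the
# single compressed embedding. Dimensions are assumptions, not the v1 code.
batch, num_tokens, embed_dim = 2, 64, 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)

tokens = torch.randn(batch, num_tokens, embed_dim)               # per patch/band-group tokens
x = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1)  # prepend class token
image_embedding = block(x)[:, 0]                                 # (batch, embed_dim)
```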
