A key use case for Clay is to find similar stuff: give it a few examples of parking lots, and it finds more of those. Very quickly, the challenge becomes that the things we look for are much smaller than the image. E.g. a 512x512 image at Sentinel-2 resolution (10 m/pixel) covers ~5 km x 5 km, and you might want to find dams, or airports, or aquaculture, which might be ~100 m across. This is a dual problem:
You can only select whole images, so the model has trouble knowing which of the many small things in them you actually wanted. The only solution here is to give it a few positive and negative examples to narrow down the intended semantics. We've been using that; it's like playing Guess Who, selecting attributes across samples.
Even if you do find the right stuff in other images, you don't know where it is within the image.
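The positive/negative "Guess Who" filtering above can be sketched in a few lines. This is not Clay's actual retrieval code, just a minimal illustration assuming you already have L2-normalized image embeddings; all names are made up:

```python
import numpy as np

def score_candidates(candidates, positives, negatives):
    """Rank candidate embeddings by similarity to the positive examples
    minus similarity to the negative examples.
    All inputs are (n, dim) arrays with L2-normalized rows, so cosine
    similarity reduces to a plain dot product."""
    pos_centroid = positives.mean(axis=0)
    neg_centroid = negatives.mean(axis=0)
    return candidates @ pos_centroid - candidates @ neg_centroid

def unit(x):
    """L2-normalize rows so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy data standing in for real embeddings (dim=768 is illustrative).
rng = np.random.default_rng(0)
cands = unit(rng.normal(size=(100, 768)))
pos = unit(rng.normal(size=(5, 768)))    # a few hand-picked positives
neg = unit(rng.normal(size=(5, 768)))    # a few hand-picked negatives
ranked = np.argsort(-score_candidates(cands, pos, neg))  # best matches first
```

The negatives matter: with positives alone, the top matches tend to share whatever attribute dominates the embedding (e.g. "urban"), not necessarily the one you meant.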
We've been moderately successful with patch-embedding similarity, but there is one fundamental underlying issue: patch embeddings are literally designed to depend on their context. The whole point of self-attention is to capture not only the semantics of the patch itself, but how it relates to the patches around it. The exact same helipad will have a different patch embedding depending on whether it sits on a ship, a hospital, or an airport.
Transformers force word embeddings to distinguish between senses given the context, and then we try to find the same word and struggle when the embeddings differ, as we forced them to. The word "bank" is our patch: given "world bank", we struggle to find the "similar" case "river bank". In EO it doesn't matter that our token (the patch) is actually an image that might contain a whole, isolated semantic unit (like a car); it is still forced to distinguish the same car by its context.
It is only at the image level, not the patch level, that we get whole semantics.
With v0, the image size was fixed and large, hence we needed the patch level. For v1 we are training on several resolutions and several image sizes. This should let us generate embeddings for images much closer in size to the semantics we are looking for.
My questions:
How do we merge all those patch embeddings into a single image embedding (analogous to how word embeddings are combined into a sentence embedding)? Is the average acceptable?
Do we then need to generate embeddings at several image sizes so we can find stuff at different scales? E.g. embeddings for stuff at 10 m, 100 m, 1 km, ...
This discussion was converted from issue #223 on May 03, 2024 08:46.
@leothomas @MaceGrim @yellowcap @srmsoumya
related #222 #107