Removing the masking out #222
I chatted with @yellowcap and @lukaskondmann and I think I was wrong.
This, combined with the fact that in v1 we can input smaller chip sizes, means that the patch embeddings are less relevant. The underlying issue is that patch embeddings are not designed to be used in isolation. On the contrary, they are designed to contain the context around them, and are therefore not well suited for isolated similarity search. Opening a ticket on that now.
An MAE with a U-Net like ours is actually a dual learning strategy: 1) creating accurate embeddings at patch level to reconstruct the input image, and 2) masking out to learn semantics across patches through interpolation.
The latter, masking out, works really well for semantics that span several patches, but can fail badly for semantics fully contained in one patch with little relation to the surroundings, e.g. a small forest clearing or fire, an aquaculture pond, ... Moreover, in cases where some neighbors are semantically mostly empty (water), the masked self-attention might put more aquaculture semantics on the empty water patch next to it than on the patch itself, which also contains the coast and other features.
The current 75% masking ratio overly emphasizes interpolation, diluting the model's focus on learning discrete, isolated semantic features critical for our applications.
I propose we greatly reduce (10% tops) or eliminate masking to prioritize direct learning from unmasked, full patch data. Maybe even tighten the self-attention weights.
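To make the proposal concrete, here is a back-of-the-envelope comparison of how many unmasked patches the encoder sees per image at the current ratio versus the proposed ones. The 196-patch grid is an assumption for illustration, not a Clay parameter:

```python
def kept_patches(num_patches: int, mask_ratio: float) -> int:
    # Number of unmasked patches the encoder sees per image.
    return int(num_patches * (1 - mask_ratio))

# Current 75% masking vs. the proposed 10% cap vs. no masking at all:
for ratio in (0.75, 0.10, 0.0):
    print(f"mask_ratio={ratio}: encoder sees {kept_patches(196, ratio)} of 196 patches")
```

At 75% the encoder sees 49 of 196 patches, at 10% it sees 176, and with masking disabled it sees all 196, so the reconstruction signal comes overwhelmingly from each patch's own content rather than from interpolation.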
Lowering or removing the masking ratio will allow the model to more effectively learn and retain high-fidelity semantic information from each individual patch, aligning with our priority of achieving precise semantic understanding at the patch level, especially for semantics fully contained within a single patch.
@leothomas @MaceGrim @yellowcap @srmsoumya