Plan another training run for Clay v1 #283

Open
yellowcap opened this issue Jun 27, 2024 · 7 comments

@yellowcap
Member

Ideas to add:

  • Train a larger model version
  • Use SAM as teacher model
  • Add Satellogic data to reduce bias on high res training
@rbavery

rbavery commented Jun 27, 2024

Possibly relevant? I think the authors make a good case for focusing effort on improving vision representations in multimodal LLMs, to make models that are truly flexible in the inputs they accept and the tasks they can address, and on improving evaluation benchmarks like COCO to test more than just mAP.

https://arxiv.org/pdf/2406.16860
https://twitter.com/sainingxie/status/1805862015778341123

@rbavery

rbavery commented Jul 2, 2024

https://x.com/osanseviero/status/1807679660328620099

^ Possible funding source for model distillation work.

@lauracchen
Member

OK @brunosan, could you help determine a priority list here? The other thing I'd love to see is whether we can add MODIS data, as long as the architecture won't need to change.

@brunosan
Member

brunosan commented Jul 30, 2024

Update: MODIS has just been added (#311).
We are not adding Satellogic to the foundational training because of its license (our understanding is that Clay would then need to carry a "cc-by satellogic" attribution on the trained model).
We are now securing a compute block.

@rbavery

rbavery commented Jul 30, 2024

Curious why CC-BY is a blocker? This article indicates that the model can be used, even commercially, with attribution: https://satellogic.com/2024/05/01/satellogic-open-source-release-a-large-dataset-of-high-resolution-imagery-for-ai-model-training/

For any high res dataset sourced from a commercial provider, I expect they will at least want this kind of attribution. Having a model that understands submeter resolutions in addition to coarser resolution public imagery would be very valuable.

@brunosan
Member

> Curious why CC-BY is a blocker?

This is for foundational training. If we train with data that requires attribution, our understanding is that the attribution carries over to the trained model, so all users of Clay would also need to attribute it, which adds friction. E.g., if Planet incorporates Clay into its pipeline, they might need to attribute Satellogic when using Clay.

This of course does not prevent us, or anyone, from making a fine-tuned version of Clay with Satellogic, Maxar, or Planet data. That version would carry the licenses of the data used.

FWIW, Clay is trained with NAIP and LINZ, which are both well under 1 meter in resolution (32% of the 70 million chips).

PS: AFAIK it is not legally settled whether the license of each training dataset carries over to the trained model. With LLMs the practice seems to be that it does not, but we choose to take the safer position and only use fully open data.

@srmsoumya
Collaborator

Update: We are conducting another model run for CLAY with the following updates:

  • Added MODIS to the list of sensors (see "Document MODIS data sampling", #311).
  • Implemented MRL on CLAY embeddings.
  • Introduced SAM as the new teacher model.
  • Using Fused Transformers as the Encoder/Decoder backbone.
  • Switching to Fused Adam and 8-bit Adam as optimizers.
  • Reduced the decoder size for MAE.
  • Randomly dropping latitude, longitude, and time information.
  • Randomly dropping some or all channels.
  • Converting Sentinel-1 data from raw values to dB scale.
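
For the Sentinel-1 item, here is a minimal sketch of the usual linear-to-decibel conversion, 10 · log10(x). It assumes the raw values are linear power backscatter. The function name `to_db` and the `floor` clamp are hypothetical, chosen for illustration, and are not taken from Clay's preprocessing code.

```python
import math


def to_db(values, floor=1e-6):
    """Convert linear backscatter values to decibels: 10 * log10(x).

    `floor` is an assumed lower clamp so that zero or negative pixels
    do not raise a math domain error; the real pipeline may treat
    nodata differently.
    """
    return [10.0 * math.log10(max(v, floor)) for v in values]


# Each factor of 10 in linear units becomes a step of 10 dB.
linear = [1.0, 0.1, 0.01, 0.0]
db = to_db(linear)
```

The log scale compresses the long tail of bright scatterers, which is one common reason for feeding dB values to a model instead of linear ones.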

We are running several experiments with these changes, and based on the results, the successful adjustments will be included in the new model run. Keep track of the changes in the dev branch.
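
The two "randomly dropping" items in the list above could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the drop probabilities, and the choice to represent a dropped input as zeros (a real model might use a learned mask token or skip the channel entirely).

```python
import random


def drop_metadata(latlon, time, p_drop=0.5, rng=random):
    """Independently zero out location and time with probability p_drop."""
    if rng.random() < p_drop:
        latlon = [0.0] * len(latlon)
    if rng.random() < p_drop:
        time = [0.0] * len(time)
    return latlon, time


def drop_channels(chip, p_drop=0.2, rng=random):
    """Zero each channel of `chip` (a list of per-channel pixel lists)
    with probability p_drop. Returns the masked chip and the keep-mask.
    Because channels are dropped independently, some or all may be dropped.
    """
    keep = [rng.random() >= p_drop for _ in chip]
    masked = [ch if k else [0.0] * len(ch) for ch, k in zip(chip, keep)]
    return masked, keep


rng = random.Random(0)  # seeded only to make the example repeatable
chip = [[0.3, 0.4], [0.1, 0.2], [0.9, 0.8]]  # 3 channels, 2 pixels each
masked, keep = drop_channels(chip, p_drop=0.2, rng=rng)
latlon, time = drop_metadata([0.62, -0.21], [0.5, 0.1], p_drop=0.5, rng=rng)
```

Training with inputs dropped this way is meant to keep the model usable when metadata or some bands are unavailable at inference time.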

5 participants