Why stair-like loss curve? #101

Closed
ChenDelong1999 opened this issue May 30, 2022 · 10 comments

@ChenDelong1999

As shown here, as well as in my own implementation, stair-like loss curves are observed. Is there any possible reason for this?

@rwightman
Collaborator

I can't speak to the training runs for that graph since I didn't do them; @mitchellnw would have a better idea... but it looks like it could be a shuffling issue (as in the data not being properly shuffled).

@mitchellnw
Contributor

My guess is also a shuffling issue with webdataset when these runs were done.

@rom1504
Collaborator

rom1504 commented Jun 1, 2022

If the data is not pre-shuffled, you need both shard shuffling and local (in-buffer) shuffling.
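
For reference, a minimal sketch of what that looks like with webdataset (the shard pattern, buffer size, and decode keys below are illustrative assumptions, not the exact open_clip pipeline):

```python
import webdataset as wds

# Illustrative shard pattern -- replace with the real shard URLs.
shards = "cc3m-train-{000000..000331}.tar"

dataset = (
    wds.WebDataset(shards, shardshuffle=True)  # shuffle the order of shards
    .shuffle(1000)                             # local shuffle buffer over samples
    .decode("pil")
    .to_tuple("jpg", "txt")
)
```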

@ChenDelong1999
Author

Perhaps it is not a shuffling issue with webdataset, since I trained the model on CC3M (as a csv dataset) and observed the following curves:

[screenshot: training loss curve on CC3M]

which looks very similar to this curve in open_clip/docs/clip_conceptual_captions.md:
[screenshot: loss curve from clip_conceptual_captions.md]

The loss increases within each epoch, then drops at each epoch boundary...

@rom1504
Collaborator

rom1504 commented Jun 2, 2022

@ChenDelong1999 did you pre-shuffle the dataset (i.e. randomly sort the dataset)?

@rwightman
Collaborator

CsvDataset should be shuffled every epoch, so pre-shuffling isn't really relevant. Might be worth checking that

sampler.set_epoch(epoch)

is definitely being called in the distributed case...
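
For context, the standard PyTorch pattern behind that call looks roughly like this (a minimal sketch; the toy dataset, batch size, and epoch count are placeholders, not open_clip's actual training loop):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the CsvDataset, for illustration only.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 100, (1024,)))

# num_replicas/rank are given explicitly so the sketch runs without dist init.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=128, sampler=sampler)

for epoch in range(3):
    # Without this call every epoch reuses the same shuffled order,
    # which can produce a repeating per-epoch pattern in the loss.
    sampler.set_epoch(epoch)
    for images, texts in loader:
        pass  # forward / backward / optimizer step
```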

@mitchellnw
Contributor

mitchellnw commented Jun 9, 2022

Looking at this again I wonder if it is caused by the scale param, which also exhibits stair-like behaviour.

I would expect that stair-like scale => stair-like loss.

But I have no guesses for why scale has stair-like behaviour.

To test this hypothesis I would use a 10x smaller learning rate on the scale parameter and see if this resolves the issue.
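
One way to try that is a separate optimizer parameter group for the scale, roughly like the sketch below (assuming an open_clip model that exposes the learnable temperature as logit_scale; the model name and base learning rate are placeholders):

```python
import torch
import open_clip  # assumes open_clip is installed; model name below is illustrative

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32")

base_lr = 5e-4  # placeholder base learning rate

# Split the learnable temperature (logit_scale) out from the other parameters.
scale_params = [p for n, p in model.named_parameters() if "logit_scale" in n]
other_params = [p for n, p in model.named_parameters() if "logit_scale" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": scale_params, "lr": base_lr * 0.1},  # 10x smaller LR for the scale param
])
```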

@rwightman
Collaborator

@mitchellnw I've noticed that the scale param has an interesting relationship with the LR/loss; I wonder if it's almost behaving in a slightly oscillatory, control-systems fashion. The scale is strongly impacted by the LR as well: if the LR is high enough, the scale will not converge to 100 until the LR comes down.

@mitchellnw
Contributor

Interesting. I wonder how accuracy/loss would be impacted if this learnable param were replaced by a scheduled param, something like 100 - k*cosine_decay(iteration).
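
As a rough sketch of what such a schedule could look like (the cosine-decay definition and the value of k are assumptions; only the 100 - k*cosine_decay(iteration) form comes from the comment above):

```python
import math

def scheduled_scale(iteration, total_iters, k=50.0):
    # cosine_decay goes from 1 at the start of training to 0 at the end,
    # so the scale ramps from (100 - k) up to 100 over training.
    cosine_decay = 0.5 * (1.0 + math.cos(math.pi * iteration / total_iters))
    return 100.0 - k * cosine_decay
```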

@viyjy

viyjy commented Aug 28, 2022

Hi, thanks for your answer, but the scale param does not exhibit stair-like behaviour during the training process, does it? Also, since scale is learnable, it shouldn't be stair-like, right?

[screenshot: scale param curve during training]

@mlfoundations locked and limited conversation to collaborators Nov 28, 2022
@rom1504 converted this issue into discussion #262 Nov 28, 2022
