Increasing loss #17

Open
fradino opened this issue Feb 3, 2023 · 9 comments

fradino commented Feb 3, 2023

Hello,
I tried to train the VAE following the steps:
[screenshot: training steps followed]
but the loss is increasing:
[screenshot: increasing training loss curve]

ZENGXH (Collaborator) commented Feb 3, 2023

Hi, this is expected, since we 1) increase the KL loss weight from 1e-7 to 0.5 throughout training, i.e., the magnitude of the weighted KL loss keeps growing, and 2) initialize the VAE as an identity mapping, i.e., it has near-perfect reconstruction in the early iterations. As the KL weight increases, you will also see the reconstruction loss getting higher. As a result, the loss curve will keep increasing until the KL weight reaches 0.5.
This is the loss curve for my experiment trained on car with the default hyper-parameters:
[screenshot: reference loss curve on car, default hyper-parameters]
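For anyone else puzzled by the curve, here is a minimal sketch of the effect described above, assuming a linear ramp purely for illustration (`kl_weight`, `training_loss`, and all argument names are made up here; this is not LION's actual schedule code):

```python
# Sketch of KL-weight annealing (hypothetical names; a linear ramp is assumed).
# The training loss = recon + kl_weight * KL can increase even when training is
# healthy, because kl_weight itself is ramped from 1e-7 up to 0.5.

def kl_weight(step: int, total_steps: int,
              w_start: float = 1e-7, w_end: float = 0.5) -> float:
    """Anneal the KL weight from w_start up to w_end over total_steps."""
    t = min(step / total_steps, 1.0)
    return w_start + t * (w_end - w_start)

def training_loss(recon_loss: float, kl_loss: float,
                  step: int, total_steps: int) -> float:
    # Early on: kl_weight ~= 1e-7 and reconstruction is near perfect (the VAE
    # starts as an identity mapping), so the total loss is tiny.
    # Later: kl_weight -> 0.5, and the reconstruction degrades as the posterior
    # is pulled toward the prior, so the total loss keeps rising until the
    # weight saturates at 0.5.
    return recon_loss + kl_weight(step, total_steps) * kl_loss
```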

fradino commented Feb 4, 2023

Thank you! That's helpful. Could you also show me the loss curve of training the diffusion prior?

fradino commented Feb 6, 2023

Also, the loss becomes NaN after step 3332:
[screenshot: training log showing NaN loss]

ZENGXH (Collaborator) commented Feb 20, 2023

This is my epoch loss:
[screenshot: epoch loss curve]

Are you using the default config? Could you share some visualization (target, reconstruction, latent points) of the VAE training and some samples of the prior training?

fradino commented Feb 20, 2023

I'm training the VAE with the default config, and I find that x_0_pred becomes inf during training:
[screenshot: debug output showing x_0_pred becoming inf]

ZENGXH (Collaborator) commented Feb 20, 2023

Thanks for sharing. This looks weird; I haven't seen this before. Could you try reducing the learning rate by half to see if that fixes the issue?
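In case it helps narrow things down, here is a minimal debugging sketch for catching where the blow-up first appears (`check_finite` is a made-up helper, not something in the repo; `x_0_pred` stands for the tensor in your screenshot):

```python
import torch

def check_finite(name: str, tensor: torch.Tensor, step: int) -> None:
    """Stop as soon as a tensor contains inf/NaN so the offending step can be inspected."""
    if not torch.isfinite(tensor).all():
        raise RuntimeError(
            f"{name} is non-finite at step {step}: "
            f"min={tensor.min().item()}, max={tensor.max().item()}"
        )

# Usage inside the training loop, right after the tensor is produced:
# check_finite("x_0_pred", x_0_pred, step)
```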

fradino commented Feb 20, 2023

> Thanks for sharing. This looks weird; I haven't seen this before. Could you try reducing the learning rate by half to see if that fixes the issue?

The only change I made was reducing the batch size from 32 to 16. I will try halving the learning rate. @ZENGXH

yuanzhen2020 commented:

As you mentioned, the weight of the KL loss increases as training progresses, and the reconstruction loss increases as well. I have a question about how to evaluate the performance of the trained VAE model: is there a metric to track throughout training? Another question: do you have any advice on how to tune the training parameters? @ZENGXH

ZENGXH (Collaborator) commented Apr 12, 2023

@yuanzhen2020 I usually look at the reconstructed point cloud and the latent points. A well-trained VAE needs to 1) have smooth latent points, i.e., the points should be close to a Gaussian distribution, and 2) maintain a good reconstruction (checked by both visualization and the reconstruction EMD and CD metrics); we need to achieve a good trade-off between 1) and 2).
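As a rough illustration of those two checks, a sketch (`vae.encode` / `vae.decode` and `check_vae` are placeholder names, not LION's actual API; only CD is computed here, EMD would be tracked the same way):

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

@torch.no_grad()
def check_vae(vae, pointcloud: torch.Tensor) -> None:
    """pointcloud: (N, 3); encode/decode are placeholders for the model's own calls."""
    z = vae.encode(pointcloud)   # latent points
    recon = vae.decode(z)        # reconstructed point cloud

    # 1) latent smoothness: latents should look roughly like a standard Gaussian
    print(f"latent mean={z.mean().item():.3f}, std={z.std().item():.3f}")

    # 2) reconstruction quality: lower Chamfer distance is better
    print(f"chamfer distance={chamfer_distance(pointcloud, recon).item():.5f}")
```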

In general VAE training, another thing that may be helpful is to track the un-weighted KL + reconstruction loss, i.e., the ELBO value. This value should decrease throughout training. I didn't track this, since in LION the KL value is much larger than the reconstruction loss: it would dominate the ELBO too much.
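For example, logging the un-weighted sum next to the weighted loss is enough (a sketch with hypothetical names, not code from the repo):

```python
def log_losses(step: int, recon_loss: float, kl_loss: float, kl_weight: float) -> None:
    """Log the weighted training loss and the un-weighted recon + KL side by side."""
    unweighted = recon_loss + kl_loss              # should trend downward if training is healthy
    weighted = recon_loss + kl_weight * kl_loss    # can rise while kl_weight is annealed
    print(f"step {step}: weighted={weighted:.4f}  unweighted(recon+KL)={unweighted:.4f}")
```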

Eventually, we care about sample quality, so the ultimate way to verify whether a VAE is good enough is to train the prior on it and compare the sample metrics (but this is expensive).

In terms of training parameters, it seems that tuning the dropout ratio and the model size can make some difference in performance.
