why the training loss always none? #17

Open
lucasjinreal opened this issue Feb 15, 2019 · 14 comments

Comments

@lucasjinreal

I got some loss like this:


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:10<00:00,  2.24it/s]
[train] Epoch: 22/100 Loss: nan Acc: 0.010870849580527
Execution time: 250.25667172999238

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 22/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.448329468010343

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.23it/s]
[train] Epoch: 23/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.90277546200377

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.09it/s]
[val] Epoch: 23/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.87914375399123

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.24it/s]
[train] Epoch: 24/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.9237438449927

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 24/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.460865497996565

It's all NaN. What could be the reason?

@lizhongguo

This happens to me, too. My PyTorch version is 0.4.1.
100%|█████████████████████████████████████████████████████████████████████████████████| 423/423 [09:39<00:00, 1.34s/it]
[train] Epoch: 100/100 Loss: nan Acc: 0.010874704491725768
Execution time: 579.1260393778794

100%|█████████████████████████████████████████████████████████████████████████████████| 108/108 [01:02<00:00, 2.30it/s]
[val] Epoch: 100/100 Loss: nan Acc: 0.0111162575266327
Execution time: 62.677289011888206

Save model at /media/ext/lizhongguo/ActionRecognition/pytorch-video-recognition/run/run_1/models/C3D-ucf101_epoch-99.pth.tar

100%|█████████████████████████████████████████████████████████████████████████████████| 136/136 [01:16<00:00, 3.15it/s]
[test] Epoch: 100/100 Loss: nan Acc: 0.010736764161421697
Execution time: 76.43733210070059

@jfzhang95
Owner

Hi, you could try reducing the learning rate.

@KyuminHwang

I also ran into Loss: nan.
I changed the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).

If the loss is NaN, the weights can't be stored, so the model can't improve its accuracy.
Has anybody solved this problem?

@lizhongguo

lizhongguo commented Feb 26, 2019

I checked the code at https://github.com/facebookresearch/VMZ/blob/master/lib/models/c3d_model.py and added a BatchNorm layer between each Conv layer and ReLU layer. Now it seems to be working on the UCF-101 dataset.
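For readers following along, here is a minimal sketch of what placing BatchNorm between a Conv layer and a ReLU layer might look like in PyTorch. The helper name `conv_bn_relu`, the channel counts, the kernel size, and the input shape are assumptions chosen for illustration; this is not the repository's actual C3D definition nor the VMZ code.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_channels, out_channels):
    """Conv3d -> BatchNorm3d -> ReLU block (hypothetical helper, not from the repo)."""
    return nn.Sequential(
        # bias=False is an optional choice here: BatchNorm has its own learnable shift.
        nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        # Normalizes each channel over the batch before the nonlinearity,
        # which keeps activation scales in a bounded range.
        nn.BatchNorm3d(out_channels),
        nn.ReLU(inplace=True),
    )


block = conv_bn_relu(3, 64)
# Small dummy clip: (batch, channels, frames, height, width); sizes are assumed.
clip = torch.randn(2, 3, 8, 32, 32)
out = block(clip)
print(out.shape)
```

With `kernel_size=3` and `padding=1` the block keeps the temporal and spatial dimensions and only changes the channel count.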

@lucasjinreal
Author

@lizhongguo let me have a look

@wave-transmitter

wave-transmitter commented Feb 26, 2019

> I also ran into Loss: nan.
> I changed the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).
>
> If the loss is NaN, the weights can't be stored, so the model can't improve its accuracy.
> Has anybody solved this problem?

Reducing the learning rate means choosing a value lower than 1e-3, such as 1e-5 or 0.5e-3. Personally, I trained the model from scratch on UCF101 with a learning rate of 1e-3 without any NaN issues.
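To illustrate why a step size that is too large can produce non-finite losses, here is a toy sketch in plain Python. It runs gradient descent on the loss `w * w`, which is an assumption made purely for illustration and has nothing to do with the C3D model itself. With a large rate each update overshoots and the weight grows until the loss overflows; with a small rate the loss shrinks. The function name `run_gradient_descent` is hypothetical.

```python
import math


def run_gradient_descent(lr, steps=2000, w=1.0):
    """Minimize the toy loss w * w with plain gradient descent.

    Returns (step, loss) at the first non-finite loss, or
    (None, final_loss) if every step stayed finite.
    """
    for step in range(steps):
        loss = w * w
        if not math.isfinite(loss):  # catches both inf and nan
            return step, loss
        grad = 2.0 * w
        w = w - lr * grad
    return None, w * w


bad_step, bad_loss = run_gradient_descent(lr=1.5)  # overshoots on every update
ok_step, ok_loss = run_gradient_descent(lr=0.1)    # contracts toward the minimum

print("lr=1.5:", bad_step, bad_loss)
print("lr=0.1:", ok_step, ok_loss)
```

The same kind of guard can be applied in a real training loop by checking `math.isfinite(loss.item())` after each batch and stopping early, rather than training for many epochs on NaN.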

@KyuminHwang

@wave-transmitter Thank you for the comment! I solved this problem by changing the learning rate.
I reduced the learning rate to 1e-5, and then it worked correctly!

@ilovekj

ilovekj commented May 2, 2019

However, when I reduce the learning rate, the accuracy is only 0.20. What should I do?

@KyuminHwang

@ilovekj
I recommend searching for a learning rate that suits your setup!
I tried several values before finding one that worked.
How about augmenting your dataset?
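The "try several values" advice can be sketched as a small sweep. The example below is a toy in plain Python, assuming a one-parameter loss `(w - 2)^2` in place of a real model; the candidate list and the helper name `final_loss` are hypothetical. It shows the usual pattern: a rate that is too large blows up, a rate that is too small barely moves in the given number of steps, and something in between does best.

```python
import math


def final_loss(lr, steps=50, w=5.0):
    """Run gradient descent on the toy loss (w - 2)^2 and return the last loss."""
    for _ in range(steps):
        grad = 2.0 * (w - 2.0)
        w -= lr * grad
    return (w - 2.0) * (w - 2.0)


# Candidates spaced roughly on a log scale (assumed values).
candidates = [1.5, 1e-1, 1e-2, 1e-3, 1e-5]

results = {}
for lr in candidates:
    loss = final_loss(lr)
    # Skip runs that diverged to inf or nan.
    if math.isfinite(loss):
        results[lr] = loss

best_lr = min(results, key=results.get)
for lr, loss in results.items():
    print(f"lr={lr:g} final loss={loss:.3e}")
print("best:", best_lr)
```

In a real setup the loop body would be a short training run and the score would be validation loss or accuracy, which is far more expensive than this toy.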

@ilovekj

ilovekj commented May 7, 2019

@makeastir But there is another issue: it seems that the code splits the dataset randomly, which is not allowed. There are three official splits, and when I use this code with them, it performs poorly.

@KyuminHwang

KyuminHwang commented May 8, 2019

@ilovekj I also used this code and got good performance. The code has an augmentation module, which should make the dataset more useful. How about increasing the size of your dataset? In my case, there are 400 Non-True samples and 150 True samples. Or how about reducing the number of features in the dataset?

@ilovekj

ilovekj commented May 8, 2019

@makeastir But you didn't use the official splits.

@ziqi-zhang

@ilovekj Hi. I used the official split with the corresponding dataloader and only got 1% accuracy, but the same code on the random split reaches 98%. Did you figure out the problem?
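One thing worth checking when comparing random and official splits is group leakage. UCF101 file names follow the pattern `v_<Class>_g<group>_c<clip>.avi`, and clips from the same group may share background and viewpoint, so a random split can put near-duplicate clips in both train and test. Whether this explains the numbers reported in this thread is not confirmed. The sketch below is a hypothetical checker (the names `group_key` and `leaked_groups` are made up, and the path lists are illustrative) that reports groups appearing on both sides of a split.

```python
import re

# Matches the UCF101 naming pattern v_<Class>_g<group>_c<clip>.avi
NAME_RE = re.compile(r"v_(?P<label>\w+?)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")


def group_key(path):
    """Return (class name, group id) parsed from a UCF101-style file path."""
    match = NAME_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return match.group("label"), match.group("group")


def leaked_groups(train_paths, test_paths):
    """Return the sorted (class, group) pairs present in both train and test."""
    train_groups = {group_key(p) for p in train_paths}
    test_groups = {group_key(p) for p in test_paths}
    return sorted(train_groups & test_groups)


# Illustrative paths only; a real check would read the full split lists.
train = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi",
    "Archery/v_Archery_g09_c01.avi",
]
test_clean = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"]
test_leaky = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c03.avi"]

print(leaked_groups(train, test_clean))
print(leaked_groups(train, test_leaky))
```

An empty result means no group is shared between the two lists. A non-empty result suggests the test accuracy may be optimistic.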

@ilovekj

ilovekj commented May 9, 2019

Maybe it's because we didn't use a pretrained model, but I'm not sure.
