Why is the training loss always NaN? #17

lucasjinreal · 2019-02-15T06:15:51Z

I'm getting losses like this:

```
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:10<00:00,  2.24it/s]
[train] Epoch: 22/100 Loss: nan Acc: 0.010870849580527
Execution time: 250.25667172999238

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 22/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.448329468010343

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.23it/s]
[train] Epoch: 23/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.90277546200377

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.09it/s]
[val] Epoch: 23/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.87914375399123

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.24it/s]
[train] Epoch: 24/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.9237438449927

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 24/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.460865497996565
```

It's all NaN. What could be the reason?
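A common first step when every epoch logs `Loss: nan` is to stop at the first non-finite loss, so you can see which batch and settings triggered it. Once NaN reaches the weights, every later forward pass is NaN too, which is why the log stays stuck. A minimal sketch, assuming a standard PyTorch training step; `train_step` and its arguments are hypothetical names, not functions from this repository:

```python
import torch


def train_step(model, optimizer, criterion, inputs, labels, step):
    """Run one optimization step, failing fast on a non-finite loss."""
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    # Raise on the first NaN/inf loss instead of silently logging "Loss: nan"
    # for the remaining epochs; backward() is never run on the bad loss.
    if not torch.isfinite(loss):
        raise FloatingPointError(f"non-finite loss ({loss.item()}) at step {step}")
    loss.backward()
    optimizer.step()
    return loss.item()
```

Feeding it a batch that contains a NaN raises immediately. To locate where inside the backward pass a NaN first appears, `torch.autograd.set_detect_anomaly(True)` can also help, at a noticeable speed cost.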


lizhongguo · 2019-02-18T06:21:11Z

This happens to me too. My PyTorch version is 0.4.1.
```
100%|█████████████████████████████████████████████████████████████████████████████████| 423/423 [09:39<00:00, 1.34s/it]
[train] Epoch: 100/100 Loss: nan Acc: 0.010874704491725768
Execution time: 579.1260393778794

100%|█████████████████████████████████████████████████████████████████████████████████| 108/108 [01:02<00:00, 2.30it/s]
[val] Epoch: 100/100 Loss: nan Acc: 0.0111162575266327
Execution time: 62.677289011888206

Save model at /media/ext/lizhongguo/ActionRecognition/pytorch-video-recognition/run/run_1/models/C3D-ucf101_epoch-99.pth.tar

100%|█████████████████████████████████████████████████████████████████████████████████| 136/136 [01:16<00:00, 3.15it/s]
[test] Epoch: 100/100 Loss: nan Acc: 0.010736764161421697
Execution time: 76.43733210070059
```

jfzhang95 · 2019-02-22T04:39:03Z

Hi, you could try reducing the learning rate.

KyuminHwang · 2019-02-26T05:39:27Z

I also ran into Loss: nan.
I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).

If the loss is NaN, the weights can't be stored properly, so the model can't improve its accuracy.
Has anybody solved this problem?

lizhongguo · 2019-02-26T08:31:51Z

I checked the code at https://github.com/facebookresearch/VMZ/blob/master/lib/models/c3d_model.py and added a BatchNorm layer between each Conv layer and its ReLU layer. Now it seems to work on the UCF-101 dataset.
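For reference, the change described above amounts to inserting a batch-normalization layer after each 3D convolution and before its ReLU. A minimal PyTorch sketch of one such block, assuming C3D-style 3×3×3 kernels with padding 1; `conv_bn_relu` is a hypothetical helper, not the VMZ code itself:

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_channels: int, out_channels: int) -> nn.Sequential:
    """Conv3d -> BatchNorm3d -> ReLU, the ordering described in the comment."""
    return nn.Sequential(
        # bias=False because BatchNorm3d's learnable shift makes a conv bias redundant.
        nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        # Normalizes each output channel over (batch, frames, height, width).
        nn.BatchNorm3d(out_channels),
        nn.ReLU(inplace=True),
    )


block = conv_bn_relu(3, 64)
clip = torch.randn(2, 3, 8, 32, 32)  # (batch, channels, frames, height, width)
print(block(clip).shape)  # torch.Size([2, 64, 8, 32, 32])
```

Keeping activations normalized like this generally makes training less sensitive to the learning rate, which fits with the NaNs going away, though it does not rule out other causes.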

lucasjinreal · 2019-02-26T08:34:06Z

@lizhongguo let me have a look

wave-transmitter · 2019-02-26T08:51:18Z

> I also ran into Loss: nan.
> I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).
>
> If the loss is NaN, the weights can't be stored properly, so the model can't improve its accuracy.
> Has anybody solved this problem?

Reducing the learning rate means choosing a value lower than 1e-3, such as 1e-5 or 0.5e-3; 1e-1 is actually a hundred times larger than 1e-3. Personally, I trained the model from scratch on UCF101 with a learning rate of 1e-3 and did not have any NaN issues.
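To see why an overly large rate ends in NaN rather than just a bad loss, here is a toy gradient-descent sketch in plain Python. It minimizes the made-up loss f(w) = 0.5·a·w² rather than anything in C3D; the curvature `a = 100` and the step count are arbitrary assumptions chosen so that 1e-1 is too large:

```python
import math


def run_gd(lr: float, a: float = 100.0, w0: float = 1.0, steps: int = 1000) -> float:
    """Gradient descent on the toy loss 0.5 * a * w**2; returns the final loss."""
    w = w0
    for _ in range(steps):
        grad = a * w        # derivative of 0.5 * a * w**2
        w = w - lr * grad   # each step scales w by roughly (1 - lr * a)
    return 0.5 * a * w * w


for lr in (1e-1, 1e-3, 1e-5):
    loss = run_gd(lr)
    print(f"lr={lr:g}: final loss={loss!r}, finite={math.isfinite(loss)}")
```

Under these assumptions, lr=1e-1 makes |w| grow about 9x per step until the gradient overflows to inf, after which `inf - inf` produces NaN that never goes away; lr=1e-3 converges quickly; lr=1e-5 stays finite but converges slowly.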

KyuminHwang · 2019-02-27T00:17:29Z

@wave-transmitter Thank you for the comment! I solved this problem by adjusting the learning rate.
I reduced the learning rate to 1e-5, and then it worked correctly!

ilovekj · 2019-05-02T12:55:49Z

However, when I reduce the learning rate, the accuracy is only 0.20. What should I do?

KyuminHwang · 2019-05-05T14:58:29Z

@ilovekj
I recommend finding a learning rate that suits your setup!
I tuned it several times before I found a proper rate.
How about augmenting your dataset?
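One way to do the tuning described above systematically is a coarse sweep over a log-spaced grid, discarding runs whose loss is not finite. A minimal sketch, assuming a hypothetical `train_and_evaluate(lr)` callable that returns a validation loss; here it is stood in for by a toy quadratic loss so the example runs on its own:

```python
import math
from typing import Callable, Iterable, Optional


def sweep_learning_rates(
    train_and_evaluate: Callable[[float], float],
    candidates: Iterable[float],
) -> Optional[float]:
    """Return the candidate lr with the lowest finite loss, or None if all diverge."""
    best_lr, best_loss = None, math.inf
    for lr in candidates:
        loss = train_and_evaluate(lr)
        if not math.isfinite(loss):  # diverged (inf or NaN): skip this rate
            continue
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr


def toy_train_and_evaluate(lr: float, a: float = 100.0, steps: int = 200) -> float:
    """Hypothetical stand-in for a real run: gradient descent on 0.5 * a * w**2."""
    w = 1.0
    for _ in range(steps):
        w -= lr * a * w
    return 0.5 * a * w * w


grid = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
print(sweep_learning_rates(toy_train_and_evaluate, grid))
```

With a real model each candidate is a (shortened) training run, so in practice one sweeps a few epochs per rate and then refines around the best value.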

ilovekj · 2019-05-07T04:53:34Z

@makeastir But there is another issue: it seems the dataset is split randomly, which is not allowed, since there are three official splits. And when I use this code, it performs poorly.
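The concern about random splits is that UCF101 clips are organized into groups (the `gXX` part of names such as `v_ApplyEyeMakeup_g08_c01.avi`), and clips from the same group may share background and viewpoint, so a random clip-level split can put near-duplicates in both train and test. The official splits keep groups disjoint and are defined by the `trainlist0X.txt`/`testlist0X.txt` files shipped with the dataset, which should be used directly. The sketch below only illustrates the group-disjoint idea; `split_by_group` is a hypothetical helper and the set of test groups is an assumed example:

```python
import re
from typing import Iterable, List, Set, Tuple

# UCF101 clip names look like v_<Action>_g<group>_c<clip>.avi
_GROUP_RE = re.compile(r"_g(\d+)_c(\d+)")


def clip_group(filename: str) -> int:
    """Extract the group number from a UCF101-style clip name."""
    match = _GROUP_RE.search(filename)
    if match is None:
        raise ValueError(f"not a UCF101-style clip name: {filename}")
    return int(match.group(1))


def split_by_group(filenames: Iterable[str], test_groups: Set[int]) -> Tuple[List[str], List[str]]:
    """Send every clip of a test group to the test set, so no group spans both sets."""
    train, test = [], []
    for name in filenames:
        (test if clip_group(name) in test_groups else train).append(name)
    return train, test


clips = [
    "v_ApplyEyeMakeup_g01_c01.avi",
    "v_ApplyEyeMakeup_g01_c02.avi",
    "v_ApplyEyeMakeup_g08_c01.avi",
    "v_Basketball_g03_c04.avi",
    "v_Basketball_g20_c01.avi",
]
# Assumed example: hold out groups 1-7 for testing.
train_clips, test_clips = split_by_group(clips, test_groups=set(range(1, 8)))
print(train_clips)
print(test_clips)
```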

KyuminHwang · 2019-05-08T04:42:57Z

@ilovekj I also used this code and got good performance. The code has an augmentation module, which should make the dataset more useful. How about increasing the amount of data in your dataset? In my case, I have 400 Non-True and 150 True samples. Or how about reducing the number of features in your dataset?
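For context, clip-level augmentation for this kind of model usually means a random temporal and spatial crop plus a random horizontal flip. A minimal NumPy sketch, assuming clips stored as `(frames, height, width, channels)` arrays and an assumed 128×171 frame size; `augment_clip` and its parameters are hypothetical, not the repository's own functions:

```python
import numpy as np


def augment_clip(clip: np.ndarray, clip_len: int, crop_size: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Random temporal crop, random spatial crop, then a 50% horizontal flip."""
    frames, height, width, _ = clip.shape
    t = rng.integers(0, frames - clip_len + 1)   # start frame
    y = rng.integers(0, height - crop_size + 1)  # top edge of the crop
    x = rng.integers(0, width - crop_size + 1)   # left edge of the crop
    out = clip[t:t + clip_len, y:y + crop_size, x:x + crop_size, :]
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # flip along the width axis
    return np.ascontiguousarray(out)


rng = np.random.default_rng(0)
video = rng.random((32, 128, 171, 3), dtype=np.float32)
print(augment_clip(video, clip_len=16, crop_size=112, rng=rng).shape)  # (16, 112, 112, 3)
```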

ilovekj · 2019-05-08T04:48:01Z

@makeastir But you didn't use the official splits.

ziqi-zhang · 2019-05-09T02:28:49Z

@ilovekj Hi. I used the official split with a corresponding dataloader and got only 1% accuracy, but the same code on the random split reaches 98%. Did you figure out the problem?

ilovekj · 2019-05-09T06:33:05Z

Maybe it's because we didn't use a pretrained model, but I am not sure.

leftthomas mentioned this issue on Apr 8, 2019 in "VideoDataset has a bug about data normalization" #23 (Open)

hongbo-miao mentioned this issue on May 8, 2021 in "Loss is nan" stanford-action-recognition/ar#8 (Closed)