CTCLoss gets NaN loss when training with a custom Chinese dataset. #66
Comments
I met the same problem, but I still finished training the model. This situation only occurs when using CTC loss.
I also use CTC loss. Does your trained model work well?
The final result is OK, and train_loss decreases normally during training. But I don't know why; maybe something is wrong when using CTCLoss().
@AnddyWang @MengLcool did you use your own dataset to train?
Thanks for your code. Can I use my own dataset to train? If yes, what do I need to pay attention to?
I will also train the model. Can you help us solve the NaN loss? @ku21fan
Hello. So, I have 2 questions.
I tried the latest code but still got NaN.
Yes, I prepared my own dataset just following the README ^_^
I use the latest code but the loss is NaN.
Thanks.
Using the released datasets, there is no NaN, whether TPS is used or not.
I have the same problem when I use CTC loss.
@AnddyWang @MengLcool @13438960761 In general, CTCLoss has some limitations, and one of them is "input length >= target length". Thus, set `batch_max_length = 63`, and the data whose label length is longer than 63 will be filtered out by the dataset code (see the sketch below). Best
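As an illustration of that filtering, a minimal sketch (not the repository's exact code; `keep_sample` and the sample list are made up for the example):

```python
# Minimal sketch of label-length filtering for CTC training. CTCLoss needs
# input length >= target length, so samples whose label is longer than
# batch_max_length are dropped before training.
batch_max_length = 63  # labels longer than this cannot be aligned by CTC here

def keep_sample(label, max_length=batch_max_length):
    """Return True if the label is short enough for a valid CTC alignment."""
    return len(label) <= max_length

# Example: filter (image_path, label) pairs before building the dataset.
samples = [("img_0001.jpg", "你好世界"), ("img_0002.jpg", "一" * 100)]
filtered = [(path, label) for path, label in samples if keep_sample(label)]
print(f"kept {len(filtered)} of {len(samples)} samples")
```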
Thanks for your reply. |
I wonder how long it took you to train a model?
I think it's a bug in PyTorch's CTCLoss @AnddyWang; you can try PyTorch 1.2+.
@WenmuZhou did you try to run the CTC model with PyTorch 1.2+? Is its loss NaN?
PyTorch 1.3 works fine.
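For reference, here is a minimal, self-contained check of PyTorch's CTCLoss (illustrative shapes only, not this repository's training loop); the `zero_infinity` flag, available since PyTorch 1.1, is one way to keep an impossible alignment from producing NaN gradients:

```python
import torch

# Illustrative shapes: T = input sequence length, N = batch size, C = classes.
T, N, C = 63, 2, 6885
log_probs = torch.randn(T, N, C).log_softmax(2)            # (T, N, C) log-probs
targets = torch.randint(1, C, (N, 30), dtype=torch.long)   # padded targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([30, 30], dtype=torch.long)  # must be <= T

# zero_infinity=True zeroes out infinite losses (e.g. target longer than input)
# instead of letting them propagate NaN gradients.
criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```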
Can it be used for double-line text recognition?
------------ Options -------------
experiment_name: TPS-VGG-BiLSTM-CTC-Seed2222
manualSeed: 2222
workers: 16
batch_size: 192
num_iter: 300000
valInterval: 300000
continue_model:
adam: False
lr: 0.1
lr_decay_steps: 100000
lr_decay_rate: 0.8
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['train']
batch_ratio: ['1']
total_data_usage_ratio: 1.0
batch_max_length: 64
imgH: 32
imgW: 256
rgb: True
sensitive: True
PAD: True
data_filtering_off: False
Transformation: TPS
FeatureExtraction: VGG
SequenceModeling: BiLSTM
Prediction: CTC
num_fiducial: 20
input_channel: 3
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 6885
Loss
[38/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.22594 train_loss: nan
[39/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.23453 train_loss: nan
[40/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25824 train_loss: nan
[41/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25702 train_loss: nan
[42/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.28295 train_loss: nan
[43/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.28247 train_loss: nan
[44/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.27586 train_loss: nan
[45/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25553 train_loss: 8.42399
[46/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.22859 train_loss: nan
[47/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25175 train_loss: 8.32840
[48/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.24148 train_loss: nan
[49/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.22841 train_loss: nan
[50/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.27223 train_loss: nan
[51/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.23665 train_loss: 8.47187
[52/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.22846 train_loss: nan
[53/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.24092 train_loss: nan
[54/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25575 train_loss: 8.26231
[55/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.26092 train_loss: 8.02194
[56/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25898 train_loss: nan
[57/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.22861 train_loss: nan
[58/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.27106 train_loss: nan
[59/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.24483 train_loss: nan
[60/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.25403 train_loss: nan
[61/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.24929 train_loss: nan
[62/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.27895 train_loss: nan
[63/300000] lr: 0.1 0.1 single_train_elapsed_time: 0.24706 train_loss: nan