This repository has been archived by the owner on May 24, 2018. It is now read-only.

ImageNet Example : train error does not decrease! #235

Open
fangli1992 opened this issue Aug 31, 2015 · 23 comments

Comments

@fangli1992

Hi,

I am training ImageNet using the default configuration file, ImageNet.conf.
I am using the latest version of cxxnet, downloaded from https://github.com/dmlc/cxxnet , and I get results like this:

[1] train-error:0.999173    train-rec@1:0.00181927  train-rec@5:0.00576143  test-error:0.999    test-rec@1:0.00106  test-rec@5:0.0051
[2] train-error:0.998985    train-rec@1:0.000984172 train-rec@5:0.00498642  test-error:0.999    test-rec@1:0.0013   test-rec@5:0.00538
[3] train-error:0.998985    train-rec@1:0.00102632  train-rec@5:0.00492242  test-error:0.999    test-rec@1:0.00102  test-rec@5:0.00448
[4] train-error:0.998985    train-rec@1:0.000982611 train-rec@5:0.00496066  test-error:0.999    test-rec@1:0.00098  test-rec@5:0.00444
[5] train-error:0.998985    train-rec@1:0.00105441  train-rec@5:0.00507695  test-error:0.999    test-rec@1:0.00112  test-rec@5:0.00566
[6] train-error:0.998985    train-rec@1:0.000970124 train-rec@5:0.00502935  test-error:0.999    test-rec@1:0.00098  test-rec@5:0.0046
[7] train-error:0.998985    train-rec@1:0.00096466  train-rec@5:0.0049271   test-error:0.999    test-rec@1:0.00078  test-rec@5:0.005
[8] train-error:0.998985    train-rec@1:0.00104271  train-rec@5:0.00509178  test-error:0.999    test-rec@1:0.001    test-rec@5:0.00484

I found a similar issue, #84, but it has no working answer.
Here is some further information about my cxxnet setup and training (maybe it helps):

# ImageNet.conf
data = train
iter = imgrec
#  image_list = "../../NameList.train"
  image_rec  = "./data/train.bin"
#  image_root = "../../data/resize256/"
  image_mean = "models/image_net_mean.bin"
  rand_crop=1
  rand_mirror=1
  shuffle = 1
iter = threadbuffer
iter = end

eval = test
iter = imgrec
#  image_list = "../../NameList.test"
  image_rec = "./data/val.bin"
#  image_root = "../../data/resize256/"
  image_mean = "models/image_net_mean.bin"
# no random crop and mirror in test
iter = end
...
...
  • I trained LeNet on MNIST with a conf file I converted from Caffe, and it works well! (The default MNIST.conf works well too.)
  • I did not use cuDNN (USE_CUDNN = 0).
  • I created the image list file in a format like this:
# for train.bin (of course, this line is not in image_list_file)
1   0   n01440764/n01440764_10026.JPEG
2   0   n01440764/n01440764_10027.JPEG
3   0   n01440764/n01440764_10029.JPEG
4   0   n01440764/n01440764_10040.JPEG
5   0   n01440764/n01440764_10042.JPEG
...
63341   48  n01695060/n01695060_6356.JPEG
63342   48  n01695060/n01695060_6360.JPEG
63343   48  n01695060/n01695060_6371.JPEG
63344   48  n01695060/n01695060_6389.JPEG
63345   48  n01695060/n01695060_64.JPEG
63346   48  n01695060/n01695060_6400.JPEG
63347   48  n01695060/n01695060_6403.JPEG
...

# for test.bin(In my conf, it is val.bin. Of course, this line is not in image_list_file. )
1   65  ILSVRC2012_val_00000001.JPEG
2   970     ILSVRC2012_val_00000002.JPEG
3   230     ILSVRC2012_val_00000003.JPEG
4   809     ILSVRC2012_val_00000004.JPEG
5   516     ILSVRC2012_val_00000005.JPEG
6   57  ILSVRC2012_val_00000006.JPEG
@antinucleon
Contributor

Please use another configuration file. The AlexNet conf file is out of date.

@fangli1992
Author

Thanks. @antinucleon
Is there something wrong with ImageNet.conf? I have checked the net configuration but did not find anything wrong.

@antinucleon
Contributor

The reason seems to be the random initialization. You can try the xavier
initialization method with a different seed.

@fangli1992
Author

Thanks, @antinucleon.
I still have no idea how to configure the seed for xavier ==! Should I read and modify the source code?
Or change it to gaussian like Caffe and retry?
I am confused about why LeNet works well.

@leoxiaobin

I also used the master version to train on the ImageNet data, with kaiming.conf and Inception-BN.conf. Neither the train error rate nor the val error rate decreased.

My system is Windows Server.

@fangli1992
Author

I see the problem you (@ommiissyu) submitted in April, which @winstywang closed in June. Do you have any ideas on this issue? My system is Ubuntu 14.04 LTS.
I have tried gaussian initialization but got the same result.

@winstywang
Contributor

@fangli1992 Have you tried xavier on kaiming.conf or googlenet? If it does not work well, try adding clip_gradient = 10 at the end of the config.

@fangli1992
Author

Thanks for your advice, @winstywang. I tried clip_gradient = 10 and got output like this:

[1]     train-error:0.99599     train-rec@1:0.00401005  train-rec@5:0.00447365  test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[2]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[3]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[4]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[5]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[6]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[7]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005

I have not tried kaiming.conf or googlenet.conf, but @ommiissyu did; see #236.

@winstywang
Contributor

As I replied above, try xavier initialization. I am not sure whether the
issue is caused by the Windows version.


@leoxiaobin

@fangli1992, in April I was using Linux, and we solved the problem there. But these days I am using cxxnet on Windows, and I get the non-decreasing error rate too.

@antinucleon
Contributor

Thanks for the info. Maybe it is related to rand_r on Windows. We are busy
developing MXNet, our next-generation data flow tool; in MXNet we will use
C++11 to avoid inconsistent random number behavior.

@fangli1992
Author

@winstywang OK, I am trying kaiming.conf with xavier and will post the result as soon as possible.
By the way, @antinucleon, I am using Ubuntu 14.04 LTS and still hit this problem.
@ommiissyu could you please give me some further suggestions? In April, were you using cxxnet-v1?
Thanks a lot!

@antinucleon
Contributor

@fangli1992 Sorry, I won't have time to solve this in cxxnet, because my own
network always works well. Once MXNet is finished, it will replace cxxnet
entirely.

@leoxiaobin

@fangli1992, on Linux it works well. On Windows I get the same problem as you.

@leoxiaobin

my result:

round        0:[   10010] 11767 sec elapsed[1]  train-rec@1:0.00114183  train-rec@5:0.00508788  val-rec@1:0.001     val-rec@5:0.005
round        1:[   10010] 23840 sec elapsed[2]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        2:[   10010] 35863 sec elapsed[3]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        3:[   10010] 47897 sec elapsed[4]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        4:[   10010] 59918 sec elapsed[5]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        5:[   10010] 71943 sec elapsed[6]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        6:[   10010] 83983 sec elapsed[7]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005

@fangli1992
Author

Thanks @antinucleon. You mean that cxxnet will be abandoned soon after MXNet arrives? Could you please introduce MXNet to me? When will it be published?

@antinucleon
Contributor

@fangli1992 https://github.com/dmlc/mxnet https://mxnet.readthedocs.org/en/latest/ It is still in progress and there is no exact timeline, but we are confident we will finish it soon.

@fangli1992
Author

@antinucleon @winstywang These days I ran a test, and I think the result will help in finding the bug.
I tried raw images with iter=img; that is, I did not use a rec or bin format file. Although training was extremely slow, the result was fine.

round        0:[    5000] 21859 sec elapsed[1]  train-error:0.988323    train-rec@1:0.0116774   train-rec@5:0.0427986   test-error:0.94268  test-rec@1:0.05732  test-rec@5:0.17228
round        1:[    5000] 44233 sec elapsed[2]  train-error:0.909059    train-rec@1:0.0909411   train-rec@5:0.23927 test-error:0.85578  test-rec@1:0.14422  test-rec@5:0.33428
round        2:[    5000] 66558 sec elapsed[3]  train-error:0.839233    train-rec@1:0.160767    train-rec@5:0.361801    test-error:0.79018  test-rec@1:0.20982  test-rec@5:0.43284
round        3:[    5000] 88912 sec elapsed[4]  train-error:0.787381    train-rec@1:0.212619    train-rec@5:0.438686    test-error:0.74626  test-rec@1:0.25374  test-rec@5:0.48924
round        4:[    5000] 111285 sec elapsed[5] train-error:0.747077    train-rec@1:0.252923    train-rec@5:0.494042    test-error:0.71226  test-rec@1:0.28774  test-rec@5:0.53446
round        5:[    5000] 133646 sec elapsed[6] train-error:0.713261    train-rec@1:0.286739    train-rec@5:0.536074    test-error:0.68064  test-rec@1:0.31936  test-rec@5:0.56912
round        6:[    5000] 155997 sec elapsed[7] train-error:0.688767    train-rec@1:0.311233    train-rec@5:0.564116    test-error:0.67778  test-rec@1:0.32222  test-rec@5:0.573
round        7:[    5000] 178354 sec elapsed[8] train-error:0.691887    train-rec@1:0.308113    train-rec@5:0.55988 test-error:0.66232  test-rec@1:0.33768  test-rec@5:0.58872
round        8:[    5000] 200695 sec elapsed[9] train-error:0.665245    train-rec@1:0.334755    train-rec@5:0.589703    test-error:0.6387   test-rec@1:0.3613   test-rec@5:0.6173
round        9:[    5000] 223037 sec elapsed[10]        train-error:0.640461    train-rec@1:0.359539    train-rec@5:0.616859    test-error:0.62932  test-rec@1:0.37068  test-rec@5:0.62178
round       10:[    5000] 245377 sec elapsed[11]        train-error:0.621268    train-rec@1:0.378732    train-rec@5:0.637484    test-error:0.60522  test-rec@1:0.39478  test-rec@5:0.64822
round       11:[    5000] 267710 sec elapsed[12]        train-error:0.603612    train-rec@1:0.396388    train-rec@5:0.655111    test-error:0.59288  test-rec@1:0.40712  test-rec@5:0.66024
round       12:[    5000] 290041 sec elapsed[13]        train-error:0.588543    train-rec@1:0.411457    train-rec@5:0.670691    test-error:0.58648  test-rec@1:0.41352  test-rec@5:0.66906

I guess there must be something wrong with im2rec or the imgrec iterator.

@superzrx
Contributor

superzrx commented Sep 7, 2015

@fangli1992
It seems that the list used to generate the rec file is not shuffled.
Shuffling it before running im2rec may help.
The shuffle option in the iterator does not help here, because it only shuffles within a page (about 3000 pictures).
So if the label stays the same for many consecutive examples, the network falls into overfitting.
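
A minimal sketch of the suggestion above, assuming the list file has one image per line with index, label and path columns separated by whitespace, as in the listing at the top of this thread. The function names are hypothetical and are not part of cxxnet. Whether im2rec needs the index column renumbered after shuffling is not confirmed in this thread, so each line is left exactly as it was and only the order changes.

```python
import random


def shuffle_list_lines(lines, seed=0):
    """Return the non-blank list lines in a random order (newlines stripped)."""
    entries = [line.rstrip("\n") for line in lines if line.strip()]
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    rng.shuffle(entries)
    return entries


def longest_label_run(lines):
    """Length of the longest run of consecutive lines that share one label.

    Assumes the label is the second whitespace-separated column.
    """
    longest = 0
    current = 0
    previous = None
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        label = fields[1]
        current = current + 1 if label == previous else 1
        previous = label
        longest = max(longest, current)
    return longest


def shuffle_list_file(src_path, dst_path, seed=0):
    """Write a shuffled copy of the list at src_path to dst_path."""
    with open(src_path) as src:
        shuffled = shuffle_list_lines(src.readlines(), seed)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(shuffled) + "\n")
```

Calling shuffle_list_file on the training list and then building the rec file from the shuffled copy is the intended use. longest_label_run gives a rough check: on a class-sorted list it returns the size of the largest class, and after shuffling it should be far below the page size of about 3000 mentioned above.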

..

@leoxiaobin

I shuffled the list before generating the rec file. It works well for me now. Thank you @fangli1992 .

@fangli1992
Author

Thank you all @superzrx @antinucleon @ommiissyu @winstywang , I followed @superzrx's advice and now my AlexNet works well.

@fhowen

fhowen commented Sep 21, 2015

@fangli1992 @ommiissyu we have met the same problem as you. Did you mean that you shuffled the list and got a good result? How should we shuffle the list? Does it have any special requirements?

@fhowen

fhowen commented Sep 21, 2015

@fangli1992 by the way, have you compiled cxxnet with ps-lite? We use it with kaiming.conf from the example folder, changing only the input file paths and the GPU number, and get the same problem you mentioned above.
