VADA or DIRT-T failed #1

Closed
MattiasM80 opened this issue Jun 17, 2018 · 9 comments

Comments

@MattiasM80

Hi,
I got a vanilla copy of the master branch and downloaded the data using the 2 scripts in the repo.
Then I ran VADA, but got an error while executing DIRT-T following that.

The log says the following. What does it tell me?

Model name: model=dirtt_src=mnist_trg=svhn_nn=small_trim=5_dw=1e-02_bw=1e-02_sw=1e+00_tw=1e-02_dirt=05000_run=0000
Traceback (most recent call last):
  File "run_dirtt.py", line 77, in <module>
    saver.restore(M.sess, path)
  File "/dccstor/mattiasm_dl/Anaconda3_x86_64/envs/mattiasm-dev-p2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1796, in restore
    raise ValueError("Can't load save_path when it is None.")

The folder 'checkpoints' contains only one folder, 'model=dirtt_src=mnist_trg=svhn_nn=small_trim=5_dw=1e-02_bw=0e+00_sw=1e+00_tw=1e-02_dirt=00000_run=0000', and it's empty. My guess is that VADA failed to complete. Am I right?
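
For reference, the ValueError in the traceback is what `saver.restore` raises when it is handed a path of `None`, which is what one would expect if the path was looked up in a folder that holds no checkpoint. Below is a minimal sketch of a pre-flight check using only the standard library. It assumes that a completed save leaves an index file named `checkpoint` in the model folder (the usual TensorFlow 1.x Saver behavior). The helper names are hypothetical and not taken from the repo.

```python
import os


def has_saved_checkpoint(model_dir):
    """Return True if model_dir contains a TensorFlow-style 'checkpoint' index file.

    Assumption: the training script saves with tf.train.Saver, which writes
    a small text file named 'checkpoint' next to the checkpoint data files.
    """
    return os.path.isfile(os.path.join(model_dir, "checkpoint"))


def list_restorable_models(checkpoint_root):
    """Map each model folder under checkpoint_root to whether it looks restorable."""
    status = {}
    for name in sorted(os.listdir(checkpoint_root)):
        model_dir = os.path.join(checkpoint_root, name)
        if os.path.isdir(model_dir):
            status[name] = has_saved_checkpoint(model_dir)
    return status
```

Calling `list_restorable_models("checkpoints")` before starting DIRT-T would report `False` for an empty VADA folder like the one described here.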

@RuiShu
Owner

RuiShu commented Jun 17, 2018

If the folder is empty, then it sounds like either VADA failed to run or failed to save. Check the tensorboard log to see if the model ran at all. To debug VADA, run

python run_dirtt.py --datadir data --run 999 --src mnist --trg svhn --dirt 0

this will print a progress bar.

@MattiasM80
Author

MattiasM80 commented Jun 17, 2018

Thanks. I did try with Tensorboard, but when I call it the Tensorboard server reports:
W0617 03:47:00.925464 Thread-7 program.py:292] 'EPIPE caused by 9.123.345.567:59010 in HTTP serving
Any clue what could have caused that?

Running with the argument '--run 999' does indeed show a progress bar
[screenshot: progress bar]

@MattiasM80
Author

MattiasM80 commented Jun 18, 2018

I now see this displayed:
[screenshot: training progress]
Yet nothing has been written to the filesystem, except for an empty folder in the checkpoints dir and a single file, events.out.tfevents.1529260382.dccxc205, in the logs dir.

My user has write permissions to all subfolders of "dirt-t".

@RuiShu
Owner

RuiShu commented Jun 18, 2018

It sounds like it's running just fine? It takes a few hours for VADA to run to completion on a 1080Ti.

@MattiasM80
Author

How often is the code generating a checkpoint?
At what frequency is the log written?
Where in the code are the two above configured?

I'm using a K80 (which is weaker than the 1080Ti), and indeed the program is still running, seemingly fine. Yet after running the whole night, I'd have expected some results to be stored to disk. Was that an incorrect assumption, perhaps?

@RuiShu
Owner

RuiShu commented Jun 18, 2018

Those configs are available in train.py. Based on your screenshot, it seems like you only went through 2054 mini-batch updates after a whole night; that sounds pretty slow. Training completes after 80k updates.
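
To put those numbers in perspective, here is a back-of-the-envelope estimate of the total run time at the observed pace. The 2054 and 80k figures come from the comment above. The 8 hours standing in for "a whole night" is an assumption, and the helper name is hypothetical.

```python
def estimate_total_hours(updates_done, hours_elapsed, total_updates):
    """Extrapolate total training time linearly from the pace observed so far."""
    if updates_done <= 0:
        raise ValueError("need at least one completed update to extrapolate")
    hours_per_update = hours_elapsed / float(updates_done)
    return hours_per_update * total_updates


# 2054 updates observed; 80000 needed; 8.0 hours is an assumed "whole night".
total = estimate_total_hours(updates_done=2054, hours_elapsed=8.0, total_updates=80000)
print("Estimated total: %.0f hours (about %.0f days)" % (total, total / 24.0))
```

Under that assumption the full run would take on the order of two weeks rather than a few hours, which suggests a setup problem more than a merely slower card.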

@MattiasM80
Author

It appears that the GPU card was not utilized. I updated drivers to CUDA 9.
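
One way to confirm whether the GPU is actually being used is to poll `nvidia-smi` while training runs. The sketch below is an illustration and not part of the repo: it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` CSV query, and the function names are hypothetical.

```python
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse 'utilization, memory' CSV rows (no header, no units) into tuples.

    Each non-empty line is expected to look like '97, 10432':
    GPU utilization in percent, then memory used in MiB.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, mem = [field.strip() for field in line.split(",")]
        rows.append((int(util), int(mem)))
    return rows


def query_gpu_utilization():
    """Ask nvidia-smi for per-GPU utilization and memory use (assumes it is installed)."""
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_utilization(output.decode("utf-8"))
```

A utilization that stays near 0% while training is running would be consistent with TensorFlow having fallen back to the CPU.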
Will update on progress.

@RuiShu
Owner

RuiShu commented Jun 19, 2018

I'm a bit puzzled by what you described. Have you used tensorboard before?

@MattiasM80
Author

Hi again,
Code runs and reproduces the results of the paper.
TensorBoard also works as expected. I was hitting some server limitations that gave me only partial access. Once I access TensorBoard from the same machine that runs the TensorBoard server, the visualizations in the browser work excellently.
