This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

update multistep_optimizer for tensorflow gpu #1773

Merged
merged 8 commits into tensorflow:master on Jun 16, 2020

Conversation

AgoloCuongHoang
Contributor

@AgoloCuongHoang commented Dec 17, 2019

I found that, using the original MultistepAdamOptimizer class, I get an InvalidArgumentError: Cannot assign a device for operation (see details below). The problem only occurs with tensorflow-gpu; on CPU-only tensorflow it works fine. I did a lot of investigation on this, e.g. trying the commonly recommended soft-placement trick with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)), without success.
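For reference, a minimal sketch of that soft-placement attempt (TF 1.x graph mode; the session usage around the config is illustrative only):

    import tensorflow as tf

    # allow_soft_placement lets TF fall back to another device when an op has no
    # kernel for the requested one; log_device_placement logs where each op lands.
    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        # ... run the training loop as usual; the colocation error still occurred.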

After further investigation I found two changes are needed to fix this issue: 1. convert the int values to float, and 2. make the class inherit directly from optimizer.Optimizer instead of AdamOptimizer. Together these two changes solved the issue; note that neither one works on its own.
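A hypothetical sketch of where the two changes land (not the exact patch in this pull request; only the constructor and slot creation are shown, and the argument and slot names are illustrative):

    from tensorflow.python.training import optimizer

    # Change 2: subclass optimizer.Optimizer directly instead of tf.train.AdamOptimizer.
    class MultistepAdamOptimizerSketch(optimizer.Optimizer):

        def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                     epsilon=1e-8, n=1, use_locking=False, name="Adam"):
            super(MultistepAdamOptimizerSketch, self).__init__(use_locking, name)
            self._lr = learning_rate
            self._beta1 = beta1
            self._beta2 = beta2
            self._epsilon = epsilon
            self._n = n  # number of gradient-accumulation steps

        def _create_slots(self, var_list):
            first_var = min(var_list, key=lambda x: x.name)
            # Change 1: initialize the step counter with floats (0.0 / 1.0) rather
            # than ints, so the non-slot variable gets a GPU-friendly dtype.
            self._create_non_slot_variable(
                initial_value=0.0 if self._n == 1 else 1.0,
                name="iter", colocate_with=first_var)
            for v in var_list:
                self._zeros_slot(v, "grad_acc", self._name)

    # The _apply_dense / _resource_apply_dense / _finish methods are omitted here;
    # the point is only to show where the dtype change and the new base class go.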

Let me know if you have any questions regarding this pull request, or if you need the code to reproduce the InvalidArgumentError. Feel free to suggest a better solution if you think you can find one.

Finally, I tagged @fstahlberg as well, since he wrote the original MultistepAdamOptimizer class, so that he is aware of this issue.

Thank you!


An excerpt of the error:


Traceback (most recent call last):
  File "/home/cuong.hoang/anaconda2/envs/py36_env_tensor_gpu_pip_local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/cuong.hoang/anaconda2/envs/py36_env_tensor_gpu_pip_local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/home/cuong.hoang/anaconda2/envs/py36_env_tensor_gpu_pip_local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation training/beta1_power/IsInitialized/VarIsInitializedOp: Could not satisfy explicit device specification '' because the node {{colocation_node training/beta1_power/IsInitialized/VarIsInitializedOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. 

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot added the cla: no (PR author has not signed CLA) label Dec 17, 2019
@AgoloCuongHoang
Contributor Author

@googlebot I signed it!

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot added the cla: yes (PR author has signed CLA) label and removed the cla: no (PR author has not signed CLA) label Dec 17, 2019
@afrozenator
Contributor

Hi @AgoloCuongHoang -- Thanks for the nice investigation, but Travis reports that the tensor2tensor/utils/multistep_optimizer_test.py test failed. The logs are here; can you investigate?

https://travis-ci.org/tensorflow/tensor2tensor/jobs/626315428?utm_medium=notification&utm_source=github_status

@afrozenator
Contributor

To trigger Travis, I will close and reopen the pull request.

@lukaszkaiser - Do you have thoughts on this? Changing MultistepAdamOptimizer to not inherit from AdamOptimizer adds a lot of code, and whatever the root cause is, I feel it shouldn't in theory be the inheritance itself. But clearly, after much work, the root cause wasn't found and a fix was made instead.

@afrozenator
Contributor

It seems like Travis did run and fail - https://travis-ci.org/tensorflow/tensor2tensor/jobs/627551248?utm_medium=notification&utm_source=github_status

So there is no need to do the close and reopen dance.

@AgoloCuongHoang
Contributor Author

@afrozenator: Sorry for my late response; I have had some important work to deal with recently.

I just fixed the file and I think it passes the test (multistep_optimizer_test.py). However, it does not pass certain other checks, and I have no idea why (I don't believe I touched them).
What should I do next?

@AgoloCuongHoang
Contributor Author

Please see my latest commit, which fixes the issue and passes multistep_optimizer_test.py.

@AgoloCuongHoang
Contributor Author

@afrozenator: Any update on this? To be clear, I am OK if the pull request is rejected; I am just curious what is going on. Thanks.

@lukaszkaiser
Contributor

@AgoloCuongHoang: could you put this new optimizer in a separate file and a separate class, so we also keep the old one for compatibility with old code?
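A sketch of the layout being asked for (the new file and class names below are hypothetical placeholders, not necessarily what was merged):

    # Hypothetical layout:
    #   tensor2tensor/utils/multistep_optimizer.py      -> original MultistepAdamOptimizer,
    #                                                      left unchanged for existing code
    #   tensor2tensor/utils/multistep_optimizer_gpu.py  -> hypothetical new file holding the
    #                                                      GPU-friendly class under a new name
    #
    # Existing imports of the original class keep working:
    from tensor2tensor.utils.multistep_optimizer import MultistepAdamOptimizer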

@AgoloCuongHoang
Contributor Author

@lukaszkaiser: Done. Let me know if you need me to do anything further.

@afrozenator
Contributor

Thanks a lot @AgoloCuongHoang, merging this now!

@afrozenator merged commit 94a3c0e into tensorflow:master Jun 16, 2020
tensorflow-copybara pushed a commit that referenced this pull request Jun 16, 2020
PiperOrigin-RevId: 316746422
@jchwenger mentioned this pull request Oct 29, 2020