Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix build with latest TF nightly #3864

Merged
merged 6 commits into from
Apr 4, 2023

Conversation

ashahba
Copy link
Contributor

@ashahba ashahba commented Mar 15, 2023

Signed-off-by: Abolfazl Shahbazi [email protected]

This PR fixes horovod install issue for nightly TensorFlow as described in #3861
Here are the steps to get a passing install:

git clone https://github.com/horovod/horovod.git
cd horovod

export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_WITHOUT_MXNET=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_VERSION=b1d0ce8f7fe66c690916ca4a318dd5ded77005a2

git reset --hard ${HOROVOD_VERSION}
git submodule update --init --recursive

Then apply the patch provided by this PR and run the installation as below:

python3 -m pip install --no-cache-dir -v -e .

I've only tested this on the following configuration:

  • Ubuntu 20.04.5 LTS
  • gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  • cmake version 3.16.3
  • Python 3.9.16
  • mpirun (Open MPI) 4.0.3

Signed-off-by: Abolfazl Shahbazi <[email protected]>
@github-actions
Copy link

github-actions bot commented Mar 15, 2023

Unit Test Results

  1 193 files  +   222    1 193 suites  +222   13h 38m 53s ⏱️ + 2h 0m 0s
     886 tests ±       0       831 ✔️ +     61       54 💤  -      62  1 +1 
26 628 runs  +5 017  18 852 ✔️ +3 792  7 775 💤 +1 224  1 +1 

For more details on these failures, see this check.

Results for commit e9c76ec. ± Comparison against base commit 9b42eda.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Mar 15, 2023

Unit Test Results (with flaky tests)

  1 313 files  +   149    1 313 suites  +149   15h 9m 49s ⏱️ + 2h 24m 25s
     886 tests ±       0       830 ✔️ +     61       54 💤  -   62  2 +1 
29 599 runs  +3 202  20 628 ✔️ +2 793  8 968 💤 +407  3 +2 

For more details on these failures, see this check.

Results for commit e9c76ec. ± Comparison against base commit 9b42eda.

♻️ This comment has been updated with latest results.

@maxhgerlach maxhgerlach linked an issue Mar 16, 2023 that may be closed by this pull request
@maxhgerlach maxhgerlach changed the title Fix for issue #3861 Fix build with latest TF nightly Mar 16, 2023
@maxhgerlach
Copy link
Collaborator

maxhgerlach commented Mar 16, 2023

Thanks for the fix, @ashahba! I went ahead and added a commit to the PR branch so compatibility with TF <= 2.12 will be retained.

@ashahba
Copy link
Contributor Author

ashahba commented Mar 16, 2023

Thanks @maxhgerlach for timely respond.

@maxhgerlach
Copy link
Collaborator

After yesterday's PyTorch release we pick up torch==2.0.0 now, rather than a nightly/dev version, which blocks the CI:

2023-03-16T17:52:42.3497842Z #44 [39/51] RUN /horovod/assert-package-versions.sh
2023-03-16T17:52:42.3498192Z #44 sha256:94c0c4138de63c79ca985f9905edc3b9f598aa742087af4998fd87a12805d32f
2023-03-16T17:52:43.1000727Z #44 0.739 tf-nightly==2.13.0.dev20230316 �[32mfound�[0m (matches tf-nightly / tf-nightly==.*\.dev20\d{6})
2023-03-16T17:52:44.0024362Z #44 1.680 torch==2.0.0 �[31mfound BUT�[0m torch-nightly-cu116 / torch==.*\.dev20\d{6}.* �[31mexpected�[0m
2023-03-16T17:52:44.4534918Z #44 2.151 pytorch-lightning==1.5.9 �[32mfound�[0m
2023-03-16T17:52:45.3372508Z #44 3.095 torchvision==0.15.1 �[31mfound BUT�[0m torchvision / torchvision==.*\.dev20\d{6}.* �[31mexpected�[0m
2023-03-16T17:52:45.9391226Z #44 3.572 mxnet-cu112==2.0.0b20230316 �[32mfound�[0m (matches mxnet-nightly-cu112 / mxnet-cu112==.*20\d{6})
2023-03-16T17:52:46.3132460Z #44 4.041 pyspark==3.3.1 �[32mfound�[0m
2023-03-16T17:52:46.3133493Z #44 ERROR: executor failed running [/bin/bash -euo pipefail -c /horovod/assert-package-versions.sh]: exit code: 1

I can only find earlier torch-2.0.0.dev* packages on: https://download.pytorch.org/whl/nightly/cu116/torch_nightly.html

Probably they stopped building for CUDA 11.6. There are newer packages torch-2.1.0.dev* on https://download.pytorch.org/whl/nightly/cu118/torch_nightly.html, so upgrading to CUDA 11.8 would be one way to go forward here.

On tf-nightly this allows to use an overloaded
bool operator==(const ::tensorflow::error::Code& c1, const absl::StatusCode& c2)
for backward compatibility.

Signed-off-by: Max H. Gerlach <[email protected]>
@maxhgerlach
Copy link
Collaborator

I've added another fix for tf-head (related to a refactoring towards absl::StatusCode) and added some return status code handling to fix compiler warnings.

@ashahba
Copy link
Contributor Author

ashahba commented Mar 21, 2023

Thanks @maxhgerlach but I'm seeing the python3.7 not found issue here:
https://readthedocs.org/projects/horovod/builds/19856118/

I'm thinking that we should probably switch to Python 3.8 there anyway so I posted a PR here: #3874 for that.

@maxhgerlach
Copy link
Collaborator

maxhgerlach commented Mar 22, 2023

I think the readthedocs deployment was broken for a while, several similar issues: readthedocs/readthedocs.org#10173

Edit: I don't see an easy way to trigger a new doc build from GitHub. We should just ignore the (transient) problem for this PR that doesn't change any docs.

@maxhgerlach
Copy link
Collaborator

maxhgerlach commented Mar 22, 2023

The build works fine now and tests pass. Thanks for kicking this off, @ashahba!

Here are the (nightly) package version used by the pipelines:

2023-03-21T21:02:02.3107453Z #44 [39/51] RUN /horovod/assert-package-versions.sh
2023-03-21T21:02:02.3107781Z #44 sha256:7860f3d27327a14caeda7b3009946906d524ed825add575840a80006b54b3dd6
2023-03-21T21:02:03.0564602Z #44 0.721 tf-nightly==2.13.0.dev20230320 �[32mfound�[0m (matches tf-nightly / tf-nightly==.*\.dev20\d{6})
2023-03-21T21:02:03.3344592Z #44 1.147 torch==2.1.0.dev20230321+cu118 �[32mfound�[0m (matches torch-nightly-cu118 / torch==.*\.dev20\d{6}.*)
2023-03-21T21:02:03.7732923Z #44 1.586 pytorch-lightning==1.5.9 �[32mfound�[0m
2023-03-21T21:02:04.1939200Z #44 2.006 torchvision==0.16.0.dev20230321+cu118 �[32mfound�[0m (matches torchvision / torchvision==.*\.dev20\d{6}.*)
2023-03-21T21:02:04.6137069Z #44 2.426 mxnet-cu112==2.0.0b20230321 �[32mfound�[0m (matches mxnet-nightly-cu112 / mxnet-cu112==.*20\d{6})
2023-03-21T21:02:05.0307524Z #44 2.843 pyspark==3.3.1 �[32mfound�[0m

Could you look over the C++ changes made in the PR, @romerojosh?

@maxhgerlach maxhgerlach added this to the v0.28.0 milestone Mar 22, 2023
@ashahba
Copy link
Contributor Author

ashahba commented Mar 22, 2023

Thanks @maxhgerlach
Hopefully this one passes now. It'll be a while for all tests to finish.

@maxhgerlach
Copy link
Collaborator

All right, but I think it was really OK as it was. All tests passed. The docs build failure was just due to transient flakiness over at Readthedocs.

@nvcastet nvcastet merged commit f72503e into horovod:master Apr 4, 2023
@maxhgerlach
Copy link
Collaborator

Thanks for the approval and merge, @nvcastet! 👍

@ashahba ashahba deleted the ashahba/issue-3861-fix branch April 6, 2023 05:48
@ashahba
Copy link
Contributor Author

ashahba commented Apr 6, 2023

Thanks you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Build error with latest Tensorflow head commit
3 participants