Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Distributed.Worker threadsafe #37905

Merged
merged 4 commits into from
Oct 17, 2020
Merged

Conversation

jonas-schulze
Copy link
Contributor

@jonas-schulze jonas-schulze commented Oct 6, 2020

I don't know how to test this more specifically, but the included tests failed reliably (on my machine) prior to the fix. If you have suggestions, I am happy to adjust/replace the tests.

The included commit by Kristoffer is an artifact I needed for development. I forgot to replace the calls to notify, which caused the build to hung up on creating some precompile statements (without printing any errors). Loading the changes to the stdlib using Revise didn't work on my machine, I don't know why. I can rebase and drop that commit if you like.

Fixes JuliaLang/Distributed.jl#73

@jonas-schulze
Copy link
Contributor Author

Whoops, I lost a commit during rebase. While fixing that, I dropped the commit mentioned above. Sorry if I started two CI runs!

@JeffBezanson JeffBezanson added the domain:multithreading Base.Threads and related functionality label Oct 6, 2020
Copy link
Sponsor Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failure looks relevant:

  From worker 2:	  ArgumentError: Active workers with lazy=false. Cannot set lazy=true

@jonas-schulze
Copy link
Contributor Author

Thanks for looking into the logs, @vchuravy! Should be fixed now. As the tests remove all the workers they added, I just moved them above the topology tests.

@jonas-schulze
Copy link
Contributor Author

I have no idea why the new tests are that flaky.

@jonas-schulze
Copy link
Contributor Author

jonas-schulze commented Oct 7, 2020

The problem might be the shared in and outstreams:

Edit: I suspected an error in locking the socket during the handshake, but I can't tell for sure. On Linux it fails in libuv IOError: connect: address not available (EADDRNOTAVAIL), on BSD due to could not load symbol "jl_task_get_next": Shared object "libgmp" not found, required by "julia".

  • tester_linux32, tester_win64 and tester_linuxaarch64: Distributed test suite passes
  • tester_freebsd64 and tester_linux64 each have one test set fail in the following pattern:
Test Summary:                                            | Pass  Error  Total
RemoteChannel allows put!/take! from thread other than 1 |   56      2     58
...
    from thread *.1 to *.1                               |    4             4
    from thread *.2 to *.1                               |           1      1
    from thread *.1 to *.2                               |           1      1
    from thread *.2 to *.2                               |    4             4
...

@JeffBezanson JeffBezanson reopened this Oct 8, 2020
@jonas-schulze
Copy link
Contributor Author

I have no idea how to fix this. Even though the failure is clearly caused by the new tests, the reason seems unrelated to the changes proposed in this PR. Should I document the failed builds in a separate issue and move the tests to be a comment?

@vchuravy
Copy link
Sponsor Member

Can you rebase onto master one more time?

so that it is guaranteed that lazy workers can be added. Also, the tests
make sure to remove all workers that had been added, so that subsequent
tests can determine the laziness of the cluster themselves.
@jonas-schulze
Copy link
Contributor Author

@vchuravy some tests are still failing, but it looks like for different reasons. This time they failed to connect to the workers (within the Distributed test suite).

@vchuravy vchuravy merged commit 444aa87 into JuliaLang:master Oct 17, 2020
@vchuravy
Copy link
Sponsor Member

Thank you Jonas!

@jonas-schulze jonas-schulze deleted the js/fix37706 branch October 18, 2020 10:45
vtjnash added a commit that referenced this pull request Oct 20, 2020
vtjnash added a commit that referenced this pull request Oct 21, 2020
vchuravy pushed a commit that referenced this pull request Oct 21, 2020
Keno pushed a commit that referenced this pull request Jun 5, 2024
Keno pushed a commit that referenced this pull request Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrency violation on interplay between Distributed and Base.Threads
3 participants