add some missing timeouts in Distributed #34502

Merged 1 commit on Jan 28, 2020


JeffBezanson

This will hopefully fix some of the intermittent hangs in mac CI. @Keno believes we are running out of file descriptors there, which I agree is a likely cause. That was causing an exception at an unexpected point, leaving some processes waiting forever. This should turn it into a hard fail. Then we just need to stop leaking descriptors, or get more :)

fixes #34486 (by converting the hang into an exception)
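
As an illustration of turning a hang into an exception (a minimal sketch, not the code merged in this PR): the helper below polls a readiness check with `timedwait` and throws if the deadline passes. The name `wait_or_throw`, the timeout values, and the `Channel` standing in for a worker's setup signal are all hypothetical.

```julia
# Hypothetical helper, not part of Distributed.
# Polls `isready_fn()` until it returns true; throws instead of waiting forever.
function wait_or_throw(isready_fn::Function, timeout::Float64; what::String="worker setup")
    status = timedwait(isready_fn, timeout; pollint=0.05)
    status === :ok || error("timed out after $(timeout)s waiting for $what")
    return nothing
end

# Case 1: the setup signal arrives in time, so the wait returns normally.
ready = Channel{Bool}(1)
@async begin
    sleep(0.1)
    put!(ready, true)
end
wait_or_throw(() -> isready(ready), 2.0)

# Case 2: the signal never arrives (for example, the worker died mid-setup).
# Without a timeout this would block forever; here it becomes an exception.
never = Channel{Bool}(1)
try
    wait_or_throw(() -> isready(never), 0.3)
catch err
    println("setup failed: ", sprint(showerror, err))
end
```

The point of the design is that a caller sees a hard failure it can report, rather than a process that silently waits on a peer that will never respond.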

JeffBezanson added the domain:parallelism (Parallel or distributed computation) and domain:ci (Continuous integration) labels on Jan 24, 2020
@StefanKarpinski

When we run out of file descriptors, I wonder if it would be too clever to try doing a full gc and then trying again...

@JeffBezanson

That sounds pretty good to me actually; potentially many places might need to be modified to do the error check and retry though. We could start by trying that in the specific places in Distributed that are failing.
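
A rough sketch of the "full gc, then retry" idea discussed above, under stated assumptions: `with_gc_retry` and `is_fd_exhaustion` are hypothetical names, the error check assumes descriptor exhaustion surfaces as a `Base.IOError` with a libuv `EMFILE`/`ENFILE` code or as a `SystemError` with the matching errno, and the failure in the demo is simulated rather than produced by a real descriptor shortage.

```julia
# Hypothetical predicate: does this exception look like "too many open files"?
is_fd_exhaustion(err) =
    (err isa Base.IOError && err.code in (Base.UV_EMFILE, Base.UV_ENFILE)) ||
    (err isa SystemError && err.errnum in (Base.Libc.EMFILE, Base.Libc.ENFILE))

# Hypothetical wrapper: run `f`; on descriptor exhaustion, collect garbage so
# finalizers get a chance to close unreachable handles, then try again.
# Any other error, or exhaustion on the last attempt, is rethrown unchanged.
function with_gc_retry(f::Function; attempts::Int=2)
    for attempt in 1:attempts
        try
            return f()
        catch err
            (is_fd_exhaustion(err) && attempt < attempts) || rethrow()
            GC.gc()  # full collection
        end
    end
end

# Demo with a simulated failure on the first call only.
calls = Ref(0)
result = with_gc_retry() do
    calls[] += 1
    calls[] == 1 && throw(Base.IOError("simulated: too many open files", Base.UV_EMFILE))
    "connected"
end
```

As noted in the comment above, each call site that can hit the limit would need this kind of check, so wrapping only the specific failing operations in Distributed would be the natural starting point.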

JeffBezanson merged commit 4e2a6e7 into master on Jan 28, 2020
JeffBezanson deleted the jb/distributedtimeouts branch on January 28, 2020 at 21:00
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 10, 2020
fix typo in code that deals in timing out worker setup (introduced in JuliaLang#34502)
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
tanmaykm added a commit to tanmaykm/julia that referenced this pull request May 13, 2020
also the additional async task for timeout introduced in JuliaLang#34502 will not be required, because this PR handles that already and also differentiates between timeout and error.
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Apr 14, 2021
also the additional async task for timeout introduced in JuliaLang#34502 will not be required, because this PR handles that already and also differentiates between timeout and error.
Keno pushed a commit that referenced this pull request Jun 5, 2024
Issues this pull request may close:

ClusterManager hangs when worker dies after connection is made but before setup is done