# Broadcast-like operations are poorly scheduled (widely-shared dependencies) #6570
> Do we see this graph structure when using any high level collection like dataframes, arrays or bags? Does this also happen with work-stealing disabled?
See my above example using dask array. I'm sure I could make a similar one with dataframes. As I said, this is basically going to affect any sort of broadcast operation. I don't think work-stealing has an effect here; if anything, it might help. I need to double-check, though. The basic problem is that the scheduler prioritizes "where can I start this task soonest" over "how can I minimize data transfer".
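The original array example isn't reproduced in this thread, but a minimal sketch of the kind of broadcast workload being discussed might look like the following (the shapes, chunk sizes, and final `sum` reduction are illustrative assumptions, not the issue's exact reproducer):

```python
import dask.array as da

# A large chunked array: its chunks end up spread across all workers.
x = da.random.random((1000, 1000), chunks=(100, 1000))

# A small single-chunk array, initially held by just one worker.
a = da.random.random(1000, chunks=1000)

# Every chunk of `x * a` depends on the same `a`: a broadcast-like graph
# with one widely-shared dependency.
y = (x * a).sum()
y.compute()
```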
Graphs like this are not currently scheduled well:

*(task-graph image from the original issue: many `.` tasks, each depending on one `*` chunk plus the widely-shared `a` and `b`)*
The `.` tasks should definitely take into account the location of the `*` data when scheduling. But if we have 5 workers, every worker will have `*` data on it, but only 2 workers will have an `a` or `b`. In scheduling the first few `.`s, there's a tug-of-war between the `a` and the `*`: which do we want to schedule near? We want a way to disregard the `a`.

Say `(*, 0)` completes first, and `a` is already complete, on a different worker. Each `*` is the same size as (or smaller than) `a`. We now schedule `(., 0)`. If we choose to go to `a`, we might have a short-term gain, but we've taken a spot that could have gone to better use in the near future. Say the worker holding `a` is already running `(*, 6)`. Now, `(., 6)` may get scheduled on yet another worker, because `(., 0)` is already running where it should have gone, and the scheduler prioritizes "where can I start this task soonest" over "how can I minimize data transfer".

This can cascade through all the `.`s, until we've transferred most root tasks to different workers (on top of `a`, which we have to transfer everywhere no matter what). What could have been a nearly-zero-transfer operation is instead likely to transfer every piece of input data to a different worker, greatly increasing memory usage.
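To make the tug-of-war concrete, here is a toy model of the "where can I start this task soonest" objective (all names here are hypothetical; this is not `distributed`'s actual `decide_worker` code). The widely-shared `a` adds the same transfer cost on every worker except its holder, so nothing in the objective ever disregards it:

```python
def estimated_start(worker, deps, who_has, nbytes, occupancy, bandwidth=100e6):
    # Seconds to fetch the dependencies this worker doesn't already hold.
    transfer = sum(nbytes[d] for d in deps if worker not in who_has[d]) / bandwidth
    # "Where can I start this task soonest": queued work plus transfer time.
    return occupancy[worker] + transfer

who_has = {"a": {"w0"}, "(*, 0)": {"w1"}}  # `a` and the `*` chunk sit on different workers
nbytes = {"a": 8e6, "(*, 0)": 8e6}         # each `*` is the same size as `a`
occupancy = {"w0": 0.0, "w1": 0.0}

# Scheduling `(., 0)`: w0 (holding `a`) and w1 (holding `(*, 0)`) score the
# same, so small occupancy differences decide, and `a`'s worker keeps
# attracting `.` tasks it shouldn't.
for w in ("w0", "w1"):
    print(w, estimated_start(w, ["a", "(*, 0)"], who_has, nbytes, occupancy))
```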
This pattern will occur anytime you broadcast one thing against another in a binary operation (which can occur in arrays, dataframes, bags, etc.).
In the above case, the `mul` tasks will tend to "dogpile" onto the one worker that holds the middle `random_sample` task (`x`).

@crusaderky has also observed cases where this "dogpile" effect can cause what should be an embarrassingly-parallel operation to all get scheduled on one worker, overwhelming it.
#5325 was a heuristic attempt to fix this, but there are probably better ways to approach it.
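One direction such a heuristic can take (a sketch under assumptions, not necessarily what #5325 did) is to drop very widely-shared dependencies from the placement decision entirely, since they have to be replicated everywhere no matter what:

```python
def placement_deps(deps, n_dependents, n_workers, threshold=0.5):
    # Ignore dependencies needed by many sibling tasks: they'll end up
    # replicated across the cluster anyway, so they shouldn't pull every
    # task toward the single worker that currently holds them.
    return [d for d in deps if n_dependents[d] <= threshold * n_workers]

# `a` is needed by all ten `.` tasks on a 5-worker cluster; `(*, 0)` by one.
n_dependents = {"a": 10, "(*, 0)": 1}
print(placement_deps(["a", "(*, 0)"], n_dependents, n_workers=5))
# ['(*, 0)']: only the `*` chunk's location should influence where `(., 0)` runs
```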