
Broadcast-like operations are poorly scheduled (widely-shared dependencies) #6570

Open
gjoseph92 opened this issue Jun 13, 2022 · 2 comments
Labels: memory, performance, scheduling, stability


@gjoseph92 (Collaborator)

Graphs like this are not currently scheduled well:

. . . . . . . .   . . . . . . . .
|\|\|\|\|/|/|/|   |\|\|\|\|/|/|/|
| | | | a | | |   | | | | b | | |
* * * * * * * *   * * * * * * * *

The . tasks should definitely take into account the location of the * data when scheduling. But if we have 5 workers, every worker will hold some * data, while only 2 workers will hold an a or b. When scheduling the first few .s, there's a tug-of-war between the a and the *: which do we want to schedule near? We want a way to disregard the a (and b) when making that choice.
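For concreteness, here is a minimal sketch of this graph shape written as a raw dask graph (key names and payloads are hypothetical; only the dependency structure matters):

# Minimal sketch of the graph shape above: many independent roots (*),
# one widely-shared task "a", and one output (.) per root that also
# needs "a". Executed with the local synchronous scheduler just to show
# the graph is valid; the scheduling problem only appears on a cluster.
import operator
import dask

graph = {"a": (float, 1)}
for i in range(8):
    graph[("root", i)] = (float, i)                       # the * tasks
    graph[("out", i)] = (operator.mul, ("root", i), "a")  # the . tasks

dask.get(graph, [("out", i) for i in range(8)])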

Say (*, 0) completes first, and a is already complete, on a different worker. Each * is the same size as (or smaller than) a. We now schedule (., 0). If we choose to go to a, we might get a short-term gain, but we've taken a spot that could have gone to better use in the near future. Say the worker holding a is already running (*, 6). Now (., 6) may get scheduled on yet another worker, because (., 0) is already running where it should have gone, and the scheduler prioritizes "where can I start this task soonest" over "how can I minimize data transfer".

This can cascade through all the .s, until we've transferred most root tasks to different workers (on top of a, which we have to transfer everywhere no matter what).

What could have been a nearly-zero-transfer operation is instead likely to transfer every piece of input data to a different worker, greatly increasing memory usage.

This pattern will occur any time you broadcast one thing against another in a binary operation (which can happen with arrays, dataframes, bags, etc.).

import dask.array as da

a = da.random.random(100, chunks=10)
x = da.random.random(1)  # single tiny chunk, broadcast against every chunk of `a`
r = a[1:] * x  # `[1:]` slicing prevents blockwise fusion
r.visualize(optimize_graph=True, collapse_outputs=True)

[task graph image from r.visualize(): each mul chunk depends on its own random_sample chunk of a plus the single random_sample chunk for x]

In the above case, the mul tasks will tend to "dogpile" onto the one worker that holds the single random_sample chunk for x.
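One way to observe this placement directly (a rough sketch; the cluster size and the use of futures_of/who_has are my own illustration, not from the issue) is to persist the result on a small local cluster and ask the scheduler where each output key ended up:

# Rough sketch: inspect where the mul pieces land. Cluster size and
# chunking are illustrative only.
from distributed import Client, futures_of, wait
import dask.array as da

client = Client(n_workers=5, threads_per_worker=1)

a = da.random.random(100, chunks=10)
x = da.random.random(1)
r = (a[1:] * x).persist()
wait(futures_of(r))

# Map each output key to the workers holding it. With the dogpile
# behaviour, the mul keys tend to pile up on the worker that computed x.
for key, workers in client.who_has(futures_of(r)).items():
    print(key, workers)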

@crusaderky has also observed cases where this "dogpile" effect can cause what should be an embarrassingly parallel operation to be scheduled entirely on one worker, overwhelming it.

#5325 was a heuristic attempt to fix this, but there are probably better ways to approach it.

@fjetter (Member) commented Jun 14, 2022

Do we see this graph structure when using any high level collection like dataframes, arrays or bags?

Does this also happen with work-stealing disabled?

@gjoseph92 (Collaborator, Author)

> Do we see this graph structure when using any high level collection like dataframes, arrays or bags?

See my above example using dask array. I'm sure I could make a similar one with dataframes. As I said, this is basically going to affect any sort of broadcast operation.

I don't think work-stealing has an effect here; if anything it might help. I need to double-check though.

The basic problem is that the decide_worker objective function only considers the sizes of the inputs it would need to transfer, not the bigger picture of whether some inputs are more worth moving (for parallelism) even if they're larger. See b4ebbee:

> It's more meant to discourage transferring keys that could have just stayed in one place. The goal is that if A and B are on different workers, and we're the only task that will ever need A, but plenty of other tasks will need B, we should schedule alongside A even if B is a bit larger to move.
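A rough sketch of the kind of adjustment this points toward (illustrative only, not distributed's actual decide_worker code): discount the transfer cost of dependencies that many other tasks also need, since they will end up replicated across workers regardless:

# Illustrative only, not distributed's decide_worker. Estimate the transfer
# cost of running task `ts` on worker `ws`, discounting widely-shared
# dependencies (like `a` or `x` above) by how many tasks depend on them.
def estimated_transfer_cost(ts, ws, bandwidth):
    nbytes = 0
    for dep in ts.dependencies:
        if ws in dep.who_has:
            continue  # already local, nothing to transfer
        # A dependency needed by many tasks will be replicated anyway,
        # so its location should barely influence any single decision.
        nbytes += dep.get_nbytes() / max(len(dep.dependents), 1)
    return nbytes / bandwidth

With a discount like this, the single a/x chunk barely influences placement, and each . task is pulled toward its own * chunk instead.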
