fix backfill stalling if first tick does not successfully submit runs #21454
Conversation
@clairelin135 @sryza before I go through and update/add unit tests, I just want to get a quick check that this direction makes sense, since I'm still getting familiar with everything going on in the backfill daemon.
```python
if not all_roots_requested:
    backfill_roots = asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
    # if the previous tick of the backfill did not successfully submit runs (ex code serer unavailable)
```
Suggested change:
```diff
- # if the previous tick of the backfill did not successfully submit runs (ex code serer unavailable)
+ # if the previous tick of the backfill did not successfully submit runs (ex code server unavailable)
```
```python
]
if not_yet_requested:
    initial_candidates.update(not_yet_requested)
    all_roots_requested = False
```
Can remove setting `all_roots_requested` here, since it has to be `False` already to hit this code path.
```python
    yield None

    updated_materialized_subset = AssetGraphSubset()
    failed_and_downstream_subset = AssetGraphSubset()
    next_latest_storage_id = _get_next_latest_storage_id(instance_queryer)
else:
    if all_roots_requested:
```
Maybe add a comment on why this checks `all_roots_requested` instead of using an `else`, since `all_roots_requested` can change to `True` in the condition above.
Bumping @clairelin135 and @sryza for review!
@clairelin135 are you up for taking this one? You're the original author / expert on the code that makes the backfill daemon resilient to code server unavailability, right?
The implementation in this PR looks like it'll work, but it will take an extra iteration to set `all_roots_requested` to `True` in which no runs are actually kicked off.

We already have logic that catches a retryable error upon submitting run requests and adjusts the requested subset accordingly: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/execution/asset_backfill.py#L708-L744

Can we adjust this existing logic to change `requested_runs_for_target_roots` to be `previous_asset_backfill_data.requested_runs_for_target_roots` if a retryable error was raised? This will save us from having to do that extra iteration and keep all the "retry state" logic in the same place.
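The suggestion above can be sketched in isolation. This is a hypothetical, simplified model, not the actual Dagster internals: `BackfillState`, `submit_root_runs`, and `execute_tick` are illustrative stand-ins showing how carrying the previous tick's `requested_runs_for_target_roots` forward on a retryable error avoids both the stall and the extra iteration.

```python
from dataclasses import dataclass


class RetryableError(Exception):
    """Stand-in for a transient failure, e.g. the code server being unavailable."""


@dataclass(frozen=True)
class BackfillState:
    requested_runs_for_target_roots: bool


def submit_root_runs(code_server_up: bool) -> None:
    # Pretend to submit runs for the target root partitions.
    if not code_server_up:
        raise RetryableError("code server unavailable")


def execute_tick(previous: BackfillState, code_server_up: bool) -> BackfillState:
    try:
        submit_root_runs(code_server_up)
        return BackfillState(requested_runs_for_target_roots=True)
    except RetryableError:
        # On a retryable error, carry the previous tick's value forward instead
        # of optimistically marking the roots as requested, so the next tick
        # retries the submission rather than stalling.
        return BackfillState(
            requested_runs_for_target_roots=previous.requested_runs_for_target_roots
        )


initial = BackfillState(requested_runs_for_target_roots=False)
after_failure = execute_tick(initial, code_server_up=False)
after_retry = execute_tick(after_failure, code_server_up=True)
print(after_failure.requested_runs_for_target_roots)  # False: next tick retries
print(after_retry.requested_runs_for_target_roots)    # True: roots submitted
```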
It would also be nice to add a test, similar to this one: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster_tests/daemon_tests/test_backfill.py#L1220
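The shape of such a test could look roughly like the following. All names here are illustrative stand-ins (not the real test harness): a fake code server that is down on the first tick, and a toy daemon whose "all roots requested" state is derived from what was actually submitted.

```python
class FakeCodeServer:
    def __init__(self) -> None:
        self.available = False


class FakeBackfillDaemon:
    """Toy daemon: requests root runs each tick if the code server is up."""

    def __init__(self, server: FakeCodeServer, roots: list[str]) -> None:
        self.server = server
        self.target_roots = set(roots)
        self.requested_subset: set[str] = set()

    def tick(self) -> bool:
        if self.server.available:
            self.requested_subset |= self.target_roots
        # The "all roots requested" state is derived from what was actually
        # submitted, never set optimistically before submission succeeds.
        return self.target_roots <= self.requested_subset


server = FakeCodeServer()
daemon = FakeBackfillDaemon(server, ["root_a", "root_b"])
print(daemon.tick())  # False: server down on the first tick, nothing requested
server.available = True
print(daemon.tick())  # True: second tick submits the roots and recovers
```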
Force-pushed from `ec655db` to `276df9f`.
Nice, looks great to me. Added one small nit.
```python
    asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
)
target_roots = asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
# check if any roots have already been requested; this makes the BFS search more efficient
```
Nit: can we mention in the comment why some roots may have been requested? I.e.: "Because the code server may have failed while requesting roots, some roots may have already been requested. Checking here will reduce the amount of BFS work later in the iteration."
Force-pushed from `276df9f` to `3c846a3`.

Force-pushed from `3c846a3` to `0324f91`.
Summary & Motivation
If the first tick of the backfill daemon is unable to request runs (for example, the user code server is unavailable), the backfill will stall: the daemon has marked the roots as requested, but when evaluating whether any children can be materialized, it repeatedly sees that no parents have completed materializations and so requests no additional runs.
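The core of the fix can be sketched as a derived check rather than an optimistically set flag. This is a minimal, hypothetical illustration (the names and string partition keys are made up, not Dagster's actual data model): roots only count as requested if they are actually present in the requested subset.

```python
def all_roots_requested(target_roots: set[str], requested_subset: set[str]) -> bool:
    # Roots missing from the requested subset mean an earlier tick failed to
    # submit their runs (e.g. the code server was unavailable), so they should
    # be resubmitted instead of being treated as done.
    return all(root in requested_subset for root in target_roots)


roots = {"daily_root|2024-01-01", "daily_root|2024-01-02"}
# A first tick that failed after requesting only one root:
print(all_roots_requested(roots, {"daily_root|2024-01-01"}))  # False
# After a retry submits the remaining root:
print(all_roots_requested(roots, roots))  # True
```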
This PR resolves this by making the following changes:

- Instead of marking `AssetBackfillData.requested_runs_for_target_roots` as `True` as soon as runs for the roots are requested, we check whether all roots are in `AssetBackfillData.requested_subset`. If some roots are not, this means the run request was not successfully submitted and we will try to submit the missing roots again. If all roots are in `requested_subset`, then we set `AssetBackfillData.requested_runs_for_target_roots` to `True`.

This change means there is an iteration over the root assets to see if they are in the requested runs. I looked through how we determine if an `AssetKeyPartitionKey` is in an `AssetGraphSubset` and didn't see anything that would be computationally expensive, but I definitely want some second eyes on this because I'm not as familiar with this part of the code base and all its nuances yet.

How I Tested These Changes