Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix backfill stalling if first tick does not successfully submit runs #21454

Merged
merged 3 commits into from
May 3, 2024

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Apr 26, 2024

Summary & Motivation

If the first tick of the backfill daemon is unable to request runs (for example, the user code server is unavailable) the backfill will stall since it has marked the roots as requested, but when evaluating if any children can be materialized, it repeatedly sees that no parents have completed materializations so requests no additional runs.

This PR resolves this by making the following changes:

  • on the first tick, do not update AssetBackfillData.requested_runs_for_target_roots to True
  • on the next tick, do an additional check to see that all of the backfill roots are included in AssetBackfillData.requested_subset. if some roots are not, this means the run request was not successfully submitted and we will try to submit the missing roots again. if all roots are in requested_subset then we set AssetBackfillData.requested_runs_for_target_roots to True.
  • the above step will repeat until all roots have been successfully requested. then the following ticks will proceed as usual.

This change means there is an iteration over the root assets to see if they are in the requested runs. i looked through how we determine if an AssetKeyPartitionKey is in an AssetGraphSubset and didn't see anything that would be computationally expensive, but i definitely want some second eyes on this because i'm not as familiar with this part of the code base and all its nuances yet.

How I Tested These Changes

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @jamiedemaria and the rest of your teammates on Graphite Graphite

@jamiedemaria jamiedemaria marked this pull request as ready for review April 26, 2024 19:40
@jamiedemaria
Copy link
Contributor Author

@clairelin135 @sryza before i go through and update/add unit tests, I just want to get a quick check that this direction makes sense since i'm still getting familiar with everything going on in the backfill daemon


if not all_roots_requested:
backfill_roots = asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
# if the previous tick of the backfill did not successfully submit runs (ex code serer unavailable)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if the previous tick of the backfill did not successfully submit runs (ex code serer unavailable)
# if the previous tick of the backfill did not successfully submit runs (ex code server unavailable)

]
if not_yet_requested:
initial_candidates.update(not_yet_requested)
all_roots_requested = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove setting all_roots_requested here since it has to be False already to hit this code path


yield None

updated_materialized_subset = AssetGraphSubset()
failed_and_downstream_subset = AssetGraphSubset()
next_latest_storage_id = _get_next_latest_storage_id(instance_queryer)
else:
if all_roots_requested:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a comment on looking at all_roots_requested instead of an else since all_roots_requested can change to True in the above condition

@jamiedemaria
Copy link
Contributor Author

bumping @clairelin135 and @sryza for review!

@sryza
Copy link
Contributor

sryza commented Apr 30, 2024

@clairelin135 are you up for taking this one? You're the original author / expert on the code that makes the backfill daemon resilient to code server unavailability, right?

Copy link
Contributor

@clairelin135 clairelin135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation in this PR looks like it'll work, but looks like it will have an extra iteration to set all_roots_requested to True where no runs are actually kicked off.

We already have logic that catches a retryable error upon submitting run requests and adjusts the requested subset accordingly: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/execution/asset_backfill.py#L708-L744

Can we adjust this existing logic to change requested_runs_for_target_roots to be previous_asset_backfill_data.requested_runs_for_target_roots if a retryable error was raised? This will save us from having to do that extra iteration and keep all the "retry state" logic in the same place.

It would also be nice to add a test, similar to this one: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster_tests/daemon_tests/test_backfill.py#L1220

Copy link
Contributor

@clairelin135 clairelin135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks great to me. Added one small nit

asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
)
target_roots = asset_backfill_data.get_target_root_asset_partitions(instance_queryer)
# check if any roots have already been requested, this make the bfs search more efficent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we mention in the comment why some roots may have been requested? i.e.

Because the code server may have failed while requesting roots, some roots may have already been requested. Checking here will reduce the amount of BFS work later in the iteration.

@jamiedemaria jamiedemaria merged commit 957cc2e into master May 3, 2024
1 check passed
@jamiedemaria jamiedemaria deleted the jamie/backfill_first_tick branch May 3, 2024 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants