Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[db io managers] support self dependent assets in db io managers #12700

Merged
merged 13 commits into from
Mar 10, 2023

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Mar 3, 2023

Summary & Motivation

This PR adds support for self-dependent assets. For a self-dependent asset, the "base case" partition doesn't depend on previous partitions, so the InputContext.asset_partition_keys will be an empty list. We check to see if this is the case and if so we return an empty DataFrame. Otherwise we perform the SELECT query as normal


Original comment:

Starting to think about how we'll manage self-dependent assets (see #11845). This is one option where we attempt to select from the table, catch the exception if the table doesn't exist, and then confirm that the asset depends on itself. We then make the assumption that that means it's the "base case" of the self dependent asset and return an empty dataframe. I'm not 100% sure what the expectations are around self-dependent asset behavior, so would like some feedback!

I implemented this for the duckdb io manager and pandas type handler as an example. if this approach makes sense i'll implement for the remaining io managers/type handlers

One issue with this approach: imagine we have a self-dependent asset with daily partition starting on 2023-01-01
where each partition is dependent on the one before it. partition 2023-01-01 is the "base case". if we materialize partition 2023-01-15 first, the io manager will return an empty dataframe (since 2023-01-14 hasn't been materialized yet), which i think (?) would be unexpected behavior.

If in the case above, returning an empty dataframe is expected, then this approach would work. However, if we want to only return the empty dataframe for the first partition in the partition definition i'll need to do some more work to make that determination in the io manager.

I think comparing the partition key of the asset to the start of the partitions definition could work in all cases except for static partitions. For static partitions i'm not sure how we know what partition is considered the "first"

How I Tested These Changes

@vercel
Copy link

vercel bot commented Mar 3, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

2 Ignored Deployments
Name Status Preview Comments Updated
dagit-storybook ⬜️ Ignored (Inspect) Mar 9, 2023 at 3:48PM (UTC)
dagster ⬜️ Ignored (Inspect) Mar 9, 2023 at 3:48PM (UTC)

@jamiedemaria
Copy link
Contributor Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@sryza
Copy link
Contributor

sryza commented Mar 6, 2023

In the base case, the InputContext.asset_partition_keys should return an empty collection of partitions - can we just avoid doing queries when there's an empty collection of partitions?

@jamiedemaria
Copy link
Contributor Author

oh awesome, i didn't know that! that should make this a lot easier - ty!

@jamiedemaria jamiedemaria marked this pull request as ready for review March 6, 2023 21:51
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some chance we'd want to return None in this case instead of an empty dataframe? I think it's probably worse, but curious about the pros/cons.

@jamiedemaria
Copy link
Contributor Author

jamiedemaria commented Mar 8, 2023

If we return None it'll get messy with how we use typing to determine the TypeHandler to use. If we allow the load_input method to return None, we then have to change the asset function signature to look like this

def self_dependent_asset(context, self_dependent_asset: Optional[pd.DataFrame]) -> pd.DataFrame:

And when the DB io manager tries to load the input, it can't pick the right type handler because the type of the input is Optional[pd.DataFrame] instead of pd.DataFrame. We could modify the DB IO manager so it can handle this case, but i don't see a benefit to returning None instead of an empty DataFrame that justifies it. If you think of a reason returning None makes more sense let me know though!

@jamiedemaria jamiedemaria changed the title [rfc] support self dependent assets in db io managers support self dependent assets in db io managers Mar 8, 2023
@jamiedemaria jamiedemaria changed the title support self dependent assets in db io managers [db io managers] support self dependent assets in db io managers Mar 8, 2023
@jamiedemaria jamiedemaria requested a review from sryza March 9, 2023 15:48
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@jamiedemaria jamiedemaria merged commit cf997c0 into master Mar 10, 2023
@jamiedemaria jamiedemaria deleted the jamie/db-io/self-partitions branch March 10, 2023 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants