-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[db io managers] support self dependent assets in db io managers #12700
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ |
Current dependencies on/for this PR: This comment was auto-generated by Graphite. |
..._modules/libraries/dagster-duckdb-pandas/dagster_duckdb_pandas/duckdb_pandas_type_handler.py
Outdated
Show resolved
Hide resolved
In the base case, the |
oh awesome, i didn't know that! that should make this a lot easier - ty! |
04556ad
to
04b2930
Compare
04b2930
to
04637aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some chance we'd want to return None
in this case instead of an empty dataframe? I think it's probably worse, but curious about the pros/cons.
If we return None it'll get messy with how we use typing to determine the TypeHandler to use. If we allow the
And when the DB io manager tries to load the input, it can't pick the right type handler because the type of the input is |
e91394e
to
f6ff4eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
Summary & Motivation
This PR adds support for self-dependent assets. For a self-dependent asset, the "base case" partition doesn't depend on previous partitions, so the
InputContext.asset_partition_keys
will be an empty list. We check to see if this is the case and if so we return an empty DataFrame. Otherwise we perform the SELECT query as normalOriginal comment:
Starting to think about how we'll manage self-dependent assets (see #11845). This is one option where we attempt to select from the table, catch the exception if the table doesn't exist, and then confirm that the asset depends on itself. We then make the assumption that that means it's the "base case" of the self dependent asset and return an empty dataframe. I'm not 100% sure what the expectations are around self-dependent asset behavior, so would like some feedback!
I implemented this for the duckdb io manager and pandas type handler as an example. if this approach makes sense i'll implement for the remaining io managers/type handlers
One issue with this approach: imagine we have a self-dependent asset with daily partition starting on
2023-01-01
where each partition is dependent on the one before it. partition
2023-01-01
is the "base case". if we materialize partition2023-01-15
first, the io manager will return an empty dataframe (since2023-01-14
hasn't been materialized yet), which i think (?) would be unexpected behavior.If in the case above, returning an empty dataframe is expected, then this approach would work. However, if we want to only return the empty dataframe for the first partition in the partition definition i'll need to do some more work to make that determination in the io manager.
I think comparing the partition key of the asset to the start of the partitions definition could work in all cases except for static partitions. For static partitions i'm not sure how we know what partition is considered the "first"
How I Tested These Changes