[db io managers] support self dependent assets in db io managers #12700

jamiedemaria · 2023-03-03T20:47:07Z

Summary & Motivation

This PR adds support for self-dependent assets. For a self-dependent asset, the "base case" partition doesn't depend on any previous partition, so InputContext.asset_partition_keys will be an empty list. We check for this case and, if it applies, return an empty DataFrame. Otherwise we perform the SELECT query as normal.
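A minimal sketch of the check described above, not the actual db_io_manager.py code. FakeInputContext, load_input, and the my_asset table are hypothetical names made up for illustration; sqlite3 stands in for the real database, and a plain list of rows stands in for the empty DataFrame.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeInputContext:
    """Hypothetical stand-in for InputContext; only the field used here."""

    asset_partition_keys: List[str] = field(default_factory=list)


def load_input(context: FakeInputContext, conn: sqlite3.Connection) -> List[Tuple]:
    # Base case of a self-dependent asset: there are no upstream partitions,
    # so skip the query entirely and return an empty result.
    if not context.asset_partition_keys:
        return []
    # Otherwise run the SELECT as normal, restricted to the requested partitions.
    placeholders = ", ".join("?" for _ in context.asset_partition_keys)
    query = (
        "SELECT partition_key, value FROM my_asset "
        f"WHERE partition_key IN ({placeholders}) ORDER BY partition_key"
    )
    return conn.execute(query, context.asset_partition_keys).fetchall()


conn = sqlite3.connect(":memory:")

# The table does not exist yet; the base case never touches it.
base_case_rows = load_input(FakeInputContext(), conn)

conn.execute("CREATE TABLE my_asset (partition_key TEXT, value INTEGER)")
conn.execute("INSERT INTO my_asset VALUES ('2023-01-01', 1)")
later_rows = load_input(FakeInputContext(["2023-01-01"]), conn)
```

Because the empty-keys branch returns before any query is issued, the base case works even when the table has not been created yet.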

Original comment:

Starting to think about how we'll manage self-dependent assets (see #11845). This is one option: we attempt to select from the table, catch the exception if the table doesn't exist, and then confirm that the asset depends on itself. We then assume that this is the "base case" of the self-dependent asset and return an empty dataframe. I'm not 100% sure what the expectations are around self-dependent asset behavior, so I would like some feedback!

I implemented this for the duckdb io manager and pandas type handler as an example. If this approach makes sense, I'll implement it for the remaining io managers/type handlers.

One issue with this approach: imagine we have a self-dependent asset with a daily partition starting on 2023-01-01, where each partition is dependent on the one before it. Partition 2023-01-01 is the "base case". If we materialize partition 2023-01-15 first, the io manager will return an empty dataframe (since 2023-01-14 hasn't been materialized yet), which I think (?) would be unexpected behavior.

If, in the case above, returning an empty dataframe is expected, then this approach would work. However, if we only want to return the empty dataframe for the first partition in the partitions definition, I'll need to do some more work to make that determination in the io manager.

I think comparing the partition key of the asset to the start of the partitions definition could work in all cases except for static partitions. For static partitions, I'm not sure how we know which partition is considered the "first".
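A rough sketch of the comparison suggested above, for the daily-partition example only. The helper name is_first_daily_partition is hypothetical, and partition keys are assumed to be ISO dates (YYYY-MM-DD); this is not code from the PR.

```python
from datetime import date


def is_first_daily_partition(partition_key: str, partitions_start: date) -> bool:
    """Return True only when the key is the first partition of the definition.

    Assumes daily partitions whose keys are ISO-formatted dates.
    """
    return date.fromisoformat(partition_key) == partitions_start


start = date(2023, 1, 1)

# 2023-01-01 is the base case, so an empty dataframe would be expected here.
first = is_first_daily_partition("2023-01-01", start)

# 2023-01-15 is not the base case; an empty upstream would be surprising.
later = is_first_daily_partition("2023-01-15", start)
```

This only helps for time-based definitions that have a start date. A static partitions definition has no inherent start, so the same comparison does not apply there.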

How I Tested These Changes

python_modules/dagster/dagster/_core/storage/db_io_manager.py

..._modules/libraries/dagster-duckdb-pandas/dagster_duckdb_pandas/duckdb_pandas_type_handler.py

sryza · 2023-03-06T16:53:01Z

In the base case, the InputContext.asset_partition_keys should return an empty collection of partitions - can we just avoid doing queries when there's an empty collection of partitions?

jamiedemaria · 2023-03-06T17:01:44Z

oh awesome, i didn't know that! that should make this a lot easier - ty!

sryza

Is there some chance we'd want to return None in this case instead of an empty dataframe? I think it's probably worse, but curious about the pros/cons.

python_modules/dagster/dagster/_core/storage/db_io_manager.py

jamiedemaria · 2023-03-08T16:08:28Z

If we return None, it'll get messy with how we use typing to determine which TypeHandler to use. If we allow the load_input method to return None, we then have to change the asset function signature to look like this:

def self_dependent_asset(context, self_dependent_asset: Optional[pd.DataFrame]) -> pd.DataFrame:

And when the DB IO manager tries to load the input, it can't pick the right type handler because the type of the input is Optional[pd.DataFrame] instead of pd.DataFrame. We could modify the DB IO manager so it can handle this case, but I don't see a benefit to returning None instead of an empty DataFrame that justifies it. If you think of a reason returning None makes more sense, let me know though!
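To illustrate the lookup problem described above, here is a small sketch that assumes handlers are found by an exact match on the input's type annotation. The DataFrame and PandasTypeHandler classes, handlers_by_type, find_handler, and unwrap_optional are all hypothetical stand-ins, not the DB IO manager's real code.

```python
from typing import Optional, Union, get_args, get_origin


class DataFrame:
    """Stand-in for pd.DataFrame so the sketch needs no third-party imports."""


class PandasTypeHandler:
    """Stand-in for a type handler registered for DataFrame inputs."""


# Assumed lookup table: one handler per exact Python type.
handlers_by_type = {DataFrame: PandasTypeHandler()}


def find_handler(annotation):
    # An exact-match lookup: Optional[DataFrame] is a different object
    # from DataFrame, so it finds nothing.
    return handlers_by_type.get(annotation)


def unwrap_optional(annotation):
    # One possible workaround: reduce Optional[X] to X before the lookup.
    if get_origin(annotation) is Union:
        non_none = [arg for arg in get_args(annotation) if arg is not type(None)]
        if len(non_none) == 1:
            return non_none[0]
    return annotation


missing = find_handler(Optional[DataFrame])
found = find_handler(unwrap_optional(Optional[DataFrame]))
```

Unwrapping would make the lookup succeed, but every asset function would still have to handle a None input, which is the extra complexity the comment argues against.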

sryza

Nice

jamiedemaria requested review from sryza, benpankow and clairelin135 March 6, 2023 15:53

jamiedemaria commented Mar 6, 2023 on python_modules/dagster/dagster/_core/storage/db_io_manager.py (outdated, resolved)

jamiedemaria commented Mar 6, 2023 on ..._modules/libraries/dagster-duckdb-pandas/dagster_duckdb_pandas/duckdb_pandas_type_handler.py (outdated, resolved)

jamiedemaria force-pushed the jamie/db-io/self-partitions branch from 04556ad to 04b2930 Compare March 6, 2023 21:51

jamiedemaria marked this pull request as ready for review March 6, 2023 21:51

jamiedemaria force-pushed the jamie/db-io/self-partitions branch from 04b2930 to 04637aa Compare March 7, 2023 20:28

sryza reviewed Mar 8, 2023: python_modules/dagster/dagster/_core/storage/db_io_manager.py (outdated, resolved)

jamiedemaria changed the title ~~[rfc] support self dependent assets in db io managers~~ support self dependent assets in db io managers Mar 8, 2023

jamiedemaria changed the title ~~support self dependent assets in db io managers~~ [db io managers] support self dependent assets in db io managers Mar 8, 2023

JamieDeMaria added 12 commits March 8, 2023 17:45

wip support self dependent assets

f1d6f0c

refactor to db io manager level

52fc0fc

restack

579927e

update duckdb

e60c7f5

update tests

881dd37

correct schema

3ebb9ab

fixes

8e0afdb

more fixes

063711e

fixes

8354071

fix config values

c89aec5

restore db io manager file formatting

bbc144c

updates from main

f6ff4eb

jamiedemaria force-pushed the jamie/db-io/self-partitions branch from e91394e to f6ff4eb Compare March 8, 2023 22:47

fix ordering

4a792b7

jamiedemaria requested a review from sryza March 9, 2023 15:48

sryza approved these changes Mar 10, 2023

jamiedemaria merged commit cf997c0 into master Mar 10, 2023

jamiedemaria deleted the jamie/db-io/self-partitions branch March 10, 2023 14:56
