Standardize existing row count metadata to use `dagster/row_count` key #21524

benpankow · 2024-04-30T16:18:50Z

Summary

Moves the existing row count metadata to be a subset of the `TableMetadataSet`, so it is prefixed as `dagster/row_count`. This identifies it as a special metadata type, lets us specially handle it in the UI, and also ensures it remains the same across integrations.

Introduces `dagster/partition_row_count` for partitioned asset materializations, as well.
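To illustrate the namespacing idea described above, here is a minimal, self-contained sketch. It does not import Dagster, and `TableMetadataSketch` and `to_metadata` are hypothetical names made up for this illustration, not the `TableMetadataSet` implementation. It only shows how typed fields could be emitted under `dagster/`-prefixed keys so that they stay distinct from free-form user metadata.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

# Prefix taken from the key names in this PR (dagster/row_count).
NAMESPACE = "dagster"


@dataclass(frozen=True)
class TableMetadataSketch:
    """Hypothetical stand-in for a namespaced metadata set; not Dagster's class."""

    row_count: Optional[int] = None
    partition_row_count: Optional[int] = None

    def to_metadata(self) -> Dict[str, Any]:
        # Emit only the fields that were set, each under a "dagster/<field>" key.
        return {
            f"{NAMESPACE}/{f.name}": getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


# Free-form user metadata and the namespaced entries can share one dict
# without colliding, because the reserved keys carry the prefix.
metadata = {"owner": "analytics", **TableMetadataSketch(row_count=1200).to_metadata()}
print(metadata)
```

Because the reserved keys are derived from field names, every integration that goes through the same set would emit the same key, which is the consistency the summary is after.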

Test Plan

Brief unit test of `TableMetadataSet`, update other unit tests.

benpankow · 2024-04-30T16:19:06Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

sryza

The main question for me here is: how do we name this in a way that users can't confuse it with "the number of rows added during this particular materialization"? Maybe `full_row_count`? `total_row_count`? `table_row_count`?

Going to start a quick naming Slack thread.

examples/project_fully_featured/project_fully_featured/resources/parquet_io_manager.py

sryza · 2024-04-30T21:28:05Z

...on_modules/dagster/dagster_tests/definitions_tests/metadata_tests/test_table_metadata_set.py

@@ -11,7 +11,7 @@ def error_on_warning():
 raise_exception_on_warnings()


-def test_table_metadata_set():
+def test_table_metadata_set() -> None:


why this type annotation out of curiosity?

rexledesma · 2024-05-01

Without a type annotation on the function signature, pyright will skip analysis of the underlying function. This is because we have `analyzeUnannotatedFunctions` set to `false`:

dagster/pyproject.toml

Lines 44 to 47 in db84034

# Set to false to help us during the transition from mypy to pyright. Mypy does
# not analyze unannotated functions by default, and so as of 2023-02 the codebase contains a large
# number of type errors in unannotated functions. Eventually we can turn off this setting.
analyzeUnannotatedFunctions = false

So everything will appear as `Any` in your local editor unless you annotate the function.
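As a small illustration of the point above, the sketch below contrasts an unannotated function with an annotated one. The function names are hypothetical. Both run identically; the difference only shows up in static analysis when `analyzeUnannotatedFunctions = false`, as in the quoted config.

```python
def double_untyped(value):
    # No annotations on the signature: per the comment above, pyright skips
    # analysis of this body when analyzeUnannotatedFunctions is false.
    return value * 2


def double_typed(value: int) -> int:
    # Annotated signature: pyright analyzes this body and checks callers.
    return value * 2


def test_doubling() -> None:
    # The "-> None" return annotation is enough to opt this test into analysis.
    assert double_typed(2) == 4
    # This still passes at runtime; nothing here is enforced when the code runs.
    assert double_untyped("ab") == "abab"


test_doubling()
```

This is why adding `-> None` to a test function, as in the diff, is a cheap way to get editor feedback inside the test body.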

Wow, did not know this

benpankow · 2024-05-03T21:30:27Z

~~Updated to total_row_count but open to a subsequent update if we arrive at consensus~~

Settled on `row_count` after talking with @sryza offline. We felt that `total_row_count` can be misleading (e.g. does "total" refer to the total in the table, or the total of "good" and "bad" rows, etc.?) and that non-total values would be prefixed (e.g. `sync_row_count`, `updated_row_count`, etc.). Just `row_count` is simplest and avoids any extra baggage.

sryza · 2024-05-07T21:14:35Z

Wow, did not know this

python_modules/dagster/dagster/_core/definitions/metadata/metadata_set.py

…t` execution (#21542)

## Summary

Adds a new `fetch_table_metadata` experimental flag to `DbtCliResource.cli` which allows us to fetch `dagster/total_row_count` (introduced in #21524) to dbt-built tables:

```python
@dbt_assets(manifest=dbt_manifest)
def jaffle_shop_dbt_assets(
    context: AssetExecutionContext,
    dbt: DbtCliResource,
):
    yield from dbt.cli(
        ["build"],
        context=context,
        fetch_table_metadata=True,
    ).stream()
```

<img width="534" alt="Screenshot 2024-05-03 at 11 03 19 AM" src="https://github.com/dagster-io/dagster/assets/10215173/c3e64633-5fc3-44e4-99e3-601f0c7a0856">

Under the hood, this PR uses dbt's `dbt.adapters.base.impl.BaseAdapter` abstraction to let Dagster connect to the user's warehouse using the dbt-provided credentials. Right now, we just run a simple `select count(*)` on the tables specified in each `AssetMaterialization` and `Output`, but this lays some groundwork we could use for fetching other data as well.

There are a few caveats:

- When using duckdb, we wait for the dbt run to conclude, since duckdb does not allow simultaneous connections when a write connection is open (e.g. when dbt is running)
- We don't query row counts on views, since they may include non-trivial sql which could be expensive to query

## Test Plan

Tested locally w/ duckdb, bigquery, and snowflake. Introduced basic pytest test to test against duckdb.

benpankow changed the title ~~standardize row count meta~~ Standardize existing row count metadata to use dagster/row_count key Apr 30, 2024

benpankow requested review from sryza and rexledesma April 30, 2024 16:43

benpankow force-pushed the benpankow/row-count-meta branch 2 times, most recently from c1ed43c to 563fccc Compare May 1, 2024 20:09

benpankow mentioned this pull request May 1, 2024

Experimental flag to attach row count metadata as part of dagster-dbt execution #21542

Merged

sryza reviewed May 1, 2024

benpankow force-pushed the benpankow/row-count-meta branch 3 times, most recently from 375cf8b to c3ae946 Compare May 3, 2024 18:37

benpankow marked this pull request as ready for review May 3, 2024 18:40

benpankow requested a review from sryza May 3, 2024 21:25

benpankow changed the title ~~Standardize existing row count metadata to use dagster/row_count key~~ Standardize existing row count metadata to use dagster/total_row_count key May 3, 2024

benpankow force-pushed the benpankow/row-count-meta branch from d70aa25 to 9d957ab Compare May 6, 2024 18:54

benpankow changed the title ~~Standardize existing row count metadata to use dagster/total_row_count key~~ Standardize existing row count metadata to use dagster/row_count key May 6, 2024

This was referenced May 6, 2024

add small util to modify Output metadata #21666

Merged

unify metadata replacement api across event types #21667

Merged

benpankow force-pushed the benpankow/row-count-meta branch from 9d957ab to 72635fe Compare May 7, 2024 20:29

sryza approved these changes May 7, 2024

sryza reviewed May 7, 2024

python_modules/dagster/dagster/_core/definitions/metadata/metadata_set.py (outdated)

benpankow and others added 8 commits May 7, 2024 15:10

standardize row count meta

0673bb0

update tests

d0d80f1

update examples

c4f1501

total_row_count

3ab5438

total_row_count to row_count

584b990

rm dbt test

e2ab83c

add partition_row_count meta

94234ba

Update python_modules/dagster/dagster/_core/definitions/metadata/meta…

eab52cd

…data_set.py Co-authored-by: Sandy Ryza <[email protected]>

more tests

bb7f6cb

benpankow force-pushed the benpankow/row-count-meta branch from c13c4cf to bb7f6cb Compare May 7, 2024 22:11

benpankow merged commit 5637b05 into master May 8, 2024
1 check passed

benpankow deleted the benpankow/row-count-meta branch May 8, 2024 18:52
