Standardize existing row count metadata to use `dagster/row_count` key #21524
Conversation
`dagster/row_count` key
Force-pushed from c1ed43c to 563fccc (Compare).
The main question for me here is: how do we name this in a way that users can't confuse it with "the number of rows added during this particular materialization"? Maybe `full_row_count`? `total_row_count`? `table_row_count`?
Going to start a quick naming Slack thread.
examples/project_fully_featured/project_fully_featured/resources/parquet_io_manager.py (outdated; comment resolved)
```diff
@@ -11,7 +11,7 @@ def error_on_warning():
     raise_exception_on_warnings()


-def test_table_metadata_set():
+def test_table_metadata_set() -> None:
```
Why this type annotation, out of curiosity?
Without a type annotation on the function signature, pyright will skip analysis of the underlying function. This is because we have `analyzeUnannotatedFunctions` set to `false` (lines 44 to 47 in db84034):

```toml
# Set to false to help us during the transition from mypy to pyright. Mypy does
# not analyze unannotated functions by default, and so as of 2023-02 the codebase contains a large
# number of type errors in unannotated functions. Eventually we can turn off this setting.
analyzeUnannotatedFunctions = false
```

So everything will appear as `Any` in your local editor unless you type the function.
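To make the effect concrete, here is a small sketch (the function names are hypothetical, not from the codebase) of what pyright does and does not report under `analyzeUnannotatedFunctions = false`:

```python
# Hypothetical illustration: with analyzeUnannotatedFunctions = false,
# pyright skips the body of any function whose signature has no annotations.

def unchecked():        # no annotations: pyright skips this body entirely
    x: int = "oops"     # this type error goes unreported
    return x

def checked() -> None:  # the `-> None` annotation opts the body into analysis
    y: int = 0          # pyright now checks statements like this one
    y += 1
```

Note that Python itself does not enforce annotations at runtime, so both functions execute; only the static analysis differs.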
Wow, did not know this
Force-pushed from 375cf8b to c3ae946 (Compare).
Changed the title: `dagster/row_count` key → `dagster/total_row_count` key
Settled on `dagster/total_row_count`.
Force-pushed from d70aa25 to 9d957ab (Compare).
Changed the title: `dagster/total_row_count` key → `dagster/row_count` key
Force-pushed from 9d957ab to 72635fe (Compare).
python_modules/dagster/dagster/_core/definitions/metadata/metadata_set.py (outdated; comment resolved)
…data_set.py Co-authored-by: Sandy Ryza <[email protected]>
Force-pushed from c13c4cf to bb7f6cb (Compare).
…t` execution (#21542)

## Summary
Adds a new `fetch_table_metadata` experimental flag to `DbtCliResource.cli`, which allows us to attach `dagster/total_row_count` (introduced in #21524) to dbt-built tables:

```python
@dbt_assets(manifest=dbt_manifest)
def jaffle_shop_dbt_assets(
    context: AssetExecutionContext,
    dbt: DbtCliResource,
):
    yield from dbt.cli(
        ["build"],
        context=context,
        fetch_table_metadata=True,
    ).stream()
```

<img width="534" alt="Screenshot 2024-05-03 at 11 03 19 AM" src="https://github.com/dagster-io/dagster/assets/10215173/c3e64633-5fc3-44e4-99e3-601f0c7a0856">

Under the hood, this PR uses dbt's `dbt.adapters.base.impl.BaseAdapter` abstraction to let Dagster connect to the user's warehouse using the dbt-provided credentials. Right now, we just run a simple `select count(*)` on the tables specified in each `AssetMaterialization` and `Output`, but this lays some groundwork we could use for fetching other data as well.

There are a few caveats:
- When using duckdb, we wait for the dbt run to conclude, since duckdb does not allow simultaneous connections when a write connection is open (e.g. when dbt is running)
- We don't query row counts on views, since they may include non-trivial SQL which could be expensive to query

## Test Plan
Tested locally with duckdb, bigquery, and snowflake. Introduced a basic pytest test against duckdb.
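The per-table `select count(*)` step described above can be sketched as follows. This is illustrative only and uses the stdlib `sqlite3` in place of a dbt `BaseAdapter` connection; the function name `fetch_row_count` is a hypothetical stand-in, not the dagster-dbt API:

```python
# Sketch of the row-count fetch: for each materialized table, issue a
# simple `SELECT COUNT(*)`. sqlite3 stands in for the warehouse connection
# that dagster-dbt would open via dbt's BaseAdapter and credentials.
import sqlite3

def fetch_row_count(conn: sqlite3.Connection, table_name: str) -> int:
    # One scalar query per table; views would be skipped in practice,
    # since they may hide non-trivial, expensive SQL.
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
assert fetch_row_count(conn, "orders") == 3
```

The duckdb caveat above maps onto this sketch naturally: the count queries must wait until the writer (the dbt run) releases its connection.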
…use `dagster/row_count` key (dagster-io#21524)

## Summary
Moves the existing row count metadata to be a subset of the `TableMetadataSet`, so it is prefixed as `dagster/row_count`. This identifies it as a special metadata type, lets us specially handle it in the UI, and also ensures it remains the same across integrations. Introduces `dagster/partition_row_count` for partitioned asset materializations as well.

## Test Plan
Brief unit test of `TableMetadataSet`; update other unit tests.
## Summary
Moves the existing row count metadata to be a subset of the `TableMetadataSet`, so it is prefixed as `dagster/row_count`. This identifies it as a special metadata type, lets us specially handle it in the UI, and also ensures it remains the same across integrations.

Introduces `dagster/partition_row_count` for partitioned asset materializations as well.

## Test Plan
Brief unit test of `TableMetadataSet`; update other unit tests.
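The namespacing idea behind the `dagster/row_count` key can be sketched in plain Python. This is a minimal illustration of the concept, not dagster's actual `TableMetadataSet` implementation; the class and method names here are hypothetical:

```python
# Minimal sketch (not the real dagster implementation) of a metadata set
# whose fields are emitted under a "dagster/" namespace prefix, so a field
# named `row_count` becomes the key "dagster/row_count".
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class TableMetadataSetSketch:
    NAMESPACE = "dagster"  # unannotated class attr: not a dataclass field

    row_count: Optional[int] = None
    partition_row_count: Optional[int] = None

    def to_metadata(self) -> dict:
        # Emit only populated fields, each prefixed with the namespace,
        # keeping the key stable across integrations.
        return {
            f"{self.NAMESPACE}/{f.name}": getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
```

For example, `TableMetadataSetSketch(row_count=100).to_metadata()` yields `{"dagster/row_count": 100}`, which mirrors how the standardized key appears on materializations regardless of which integration produced it.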