Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize existing row count metadata to use dagster/row_count key #21524

Merged
merged 9 commits into from
May 8, 2024

Conversation

benpankow
Copy link
Member

@benpankow benpankow commented Apr 30, 2024

Summary

Moves the existing row count metadata to be a subset of the TableMetadataSet, so it is prefixed as dagster/row_count. This identifies it as a special metadata type, lets us specially handle it in the UI, and also ensures it remains the same across integrations.

Introduces dagster/partition_row_count for partitioned asset materializations, as well.

Test Plan

Brief unit test of TableMetadataSet, update other unit tests.

@benpankow benpankow changed the title standardize row count meta Standardize existing row count metadata to use dagster/row_count key Apr 30, 2024
@benpankow benpankow force-pushed the benpankow/row-count-meta branch 2 times, most recently from c1ed43c to 563fccc Compare May 1, 2024 20:09
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main question for me here is: how do we name this in a way that users can't confuse it with "the number of rows added during this particular materialization". Maybe "full_row_count"? "total_row_count"? "table_row_count"?

Going to start a quick naming slack thread

@@ -11,7 +11,7 @@ def error_on_warning():
raise_exception_on_warnings()


def test_table_metadata_set():
def test_table_metadata_set() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this type annotation out of curiosity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a type annotation on the function signature, pyright will skip analysis of the underlying function. This is because we have analyzeUnannotatedFunctions set to False:

dagster/pyproject.toml

Lines 44 to 47 in db84034

# Set to false to help us during the transition from mypy to pyright. Mypy does
# not analyze unannotated functions by default, and so as of 2023-02 the codebase contains a large
# number of type errors in unannotated functions. Eventually we can turn off this setting.
analyzeUnannotatedFunctions = false

So everything will appear as Any in your local editor unless you type the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, did not know this

@benpankow benpankow force-pushed the benpankow/row-count-meta branch 3 times, most recently from 375cf8b to c3ae946 Compare May 3, 2024 18:37
@benpankow benpankow marked this pull request as ready for review May 3, 2024 18:40
@benpankow benpankow requested a review from sryza May 3, 2024 21:25
@benpankow benpankow changed the title Standardize existing row count metadata to use dagster/row_count key Standardize existing row count metadata to use dagster/total_row_count key May 3, 2024
@benpankow
Copy link
Member Author

benpankow commented May 3, 2024

Updated to total_row_count but open to a subsequent update if we arrive at consensus

Settled on row_count after talking with @sryza offline. We felt that total_row_count can be misleading (e.g. does total refer to the total in the table, or total of "good" and "bad" rows, etc?) and that non-total values would be prefixed (e.g. sync_row_count, updated_row_count etc). Just row_count is simplest and avoids any extra baggage.

@benpankow benpankow changed the title Standardize existing row count metadata to use dagster/total_row_count key Standardize existing row count metadata to use dagster/row_count key May 6, 2024
@@ -11,7 +11,7 @@ def error_on_warning():
raise_exception_on_warnings()


def test_table_metadata_set():
def test_table_metadata_set() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, did not know this

@benpankow benpankow merged commit 5637b05 into master May 8, 2024
1 check passed
@benpankow benpankow deleted the benpankow/row-count-meta branch May 8, 2024 18:52
benpankow added a commit that referenced this pull request May 9, 2024
…t` execution (#21542)

## Summary

Adds a new `fetch_table_metadata` experimental flag to
`DbtCliResource.cli` which allows us to fetch `dagster/total_row_count`
(introduced in #21524) to dbt-built tables:

```python
@dbt_assets(manifest=dbt_manifest)
def jaffle_shop_dbt_assets(
    context: AssetExecutionContext,
    dbt: DbtCliResource,
):
    yield from dbt.cli(
        ["build"],
        context=context,
        fetch_table_metadata=True,
    ).stream()
```

<img width="534" alt="Screenshot 2024-05-03 at 11 03 19 AM"
src="https://github.com/dagster-io/dagster/assets/10215173/c3e64633-5fc3-44e4-99e3-601f0c7a0856">

Under the hood, this PR uses dbt's `dbt.adapters.base.impl.BaseAdapter`
abstraction to let Dagster connect to the user's warehouse using the
dbt-provided credentials. Right now, we just run a simple `select
count(*)` on the tables specified in each `AssetMaterialization` and
`Output`, but this lays some groundwork we could use for fetching other
data as well.

There are a few caveats:
- When using duckdb, we wait for the dbt run to conclude, since duckdb
does not allow simultaneous connections when a write connection is open
(e.g. when dbt is running)
- We don't query row counts on views, since they may include non-trivial
sql which could be expensive to query

## Test Plan

Tested locally w/ duckdb, bigquery, and snowflake. Introduced basic
pytest test to test against duckdb.
danielgafni pushed a commit to danielgafni/dagster that referenced this pull request Jun 18, 2024
dagster-io#21524)

## Summary

Moves the existing row count metadata to be a subset of the `TableMetadataSet`, so it is prefixed as `dagster/row_count`. This identifies it as a special metadata type, lets us specially handle it in the UI, and also ensures it remains the same across integrations.

Introduces `dagster/partition_row_count` for partitioned asset materializations, as well.

## Test Plan

Brief unit test of `TableMetadataSet`, update other unit tests.
danielgafni pushed a commit to danielgafni/dagster that referenced this pull request Jun 18, 2024
…t` execution (dagster-io#21542)

## Summary

Adds a new `fetch_table_metadata` experimental flag to
`DbtCliResource.cli` which allows us to fetch `dagster/total_row_count`
(introduced in dagster-io#21524) to dbt-built tables:

```python
@dbt_assets(manifest=dbt_manifest)
def jaffle_shop_dbt_assets(
    context: AssetExecutionContext,
    dbt: DbtCliResource,
):
    yield from dbt.cli(
        ["build"],
        context=context,
        fetch_table_metadata=True,
    ).stream()
```

<img width="534" alt="Screenshot 2024-05-03 at 11 03 19 AM"
src="https://github.com/dagster-io/dagster/assets/10215173/c3e64633-5fc3-44e4-99e3-601f0c7a0856">

Under the hood, this PR uses dbt's `dbt.adapters.base.impl.BaseAdapter`
abstraction to let Dagster connect to the user's warehouse using the
dbt-provided credentials. Right now, we just run a simple `select
count(*)` on the tables specified in each `AssetMaterialization` and
`Output`, but this lays some groundwork we could use for fetching other
data as well.

There are a few caveats:
- When using duckdb, we wait for the dbt run to conclude, since duckdb
does not allow simultaneous connections when a write connection is open
(e.g. when dbt is running)
- We don't query row counts on views, since they may include non-trivial
sql which could be expensive to query

## Test Plan

Tested locally w/ duckdb, bigquery, and snowflake. Introduced basic
pytest test to test against duckdb.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants