feat(ingest/bigquery): support column-level lineage #8382

Merged
hsheth2 merged 16 commits into datahub-project:master from bq-col-lineage
Jul 11, 2023

Conversation

hsheth2
Collaborator

@hsheth2 hsheth2 commented Jul 7, 2023

Changes stacked on #8334.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 7, 2023
@hsheth2 hsheth2 marked this pull request as ready for review July 7, 2023 23:24
Collaborator

@asikowitz asikowitz left a comment

Overall looks good to me! If we're looking to ship this soon for a client, I don't remember any changes that /need/ to happen, so I think we can push and address follow-ups afterwards. I got stuck on the recursive temp-table-removal process for a bit, so I have some naming suggestions there, but I'm not sure of the best way to make that clearer.

auditStamp=datetime.now(timezone.utc),
type=DatasetLineageTypeClass.VIEW,
for view, view_definition in self.view_definitions[project_id].items():
raw_view_lineage = sqlglot_lineage(
Collaborator

Do we think this SQL parser is strictly better than the old one? A fallback to the old one might be safer, but honestly I would like to err on the side of code velocity over safety (in cases like these), so I am fine with this as is.

EDIT: That being said, a function like below might be nice, in case we support multiple parsers or adjust the call signature of sqlglot_lineage:

def parse_lineage(self, query: str, project_id: str):
    return sqlglot_lineage(query, self.platform, self.sql_parser_schema_resolver, project_id)

Collaborator Author

I initially wanted to do that.

However, the sql_parser_schema_resolver is only available in the main BigquerySource, and not in the BigqueryLineage class.

Ideally, we'd create a BigqueryContext object that holds the schema resolver, urn generation utils, etc., and pass that around everywhere.
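
A minimal sketch of what such a context object could look like, for illustration only. Nothing here is taken from the PR: `BigqueryContext`, `StubSchemaResolver`, `gen_dataset_urn`, and `LineageExtractor` are hypothetical names, and the stub resolver only stands in for the real schema resolver.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class StubSchemaResolver:
    """Hypothetical stand-in for the real schema resolver; stores columns by urn."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[str]] = {}

    def add_schema(self, urn: str, columns: List[str]) -> None:
        self._schemas[urn] = columns

    def resolve(self, urn: str) -> List[str]:
        return self._schemas.get(urn, [])


@dataclass
class BigqueryContext:
    """Hypothetical bundle of shared helpers handed to every extractor."""

    platform: str = "bigquery"
    env: str = "PROD"
    schema_resolver: StubSchemaResolver = field(default_factory=StubSchemaResolver)

    def gen_dataset_urn(self, table_name: str) -> str:
        # Single place where dataset urns are built (format shown for illustration).
        return f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},{table_name},{self.env})"


class LineageExtractor:
    """Receives the context instead of reaching back into the source class."""

    def __init__(self, ctx: BigqueryContext) -> None:
        self.ctx = ctx

    def columns_for(self, table_name: str) -> List[str]:
        urn = self.ctx.gen_dataset_urn(table_name)
        return self.ctx.schema_resolver.resolve(urn)
```

With something like this, the lineage class would get the schema resolver through the context, so a wrapper such as the suggested parse_lineage could live there as well.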

Comment on lines 690 to 696
lineage[view] = set(
make_lineage_edges_from_parsing_result(
raw_view_lineage,
audit_stamp=ts,
lineage_type=DatasetLineageTypeClass.VIEW,
)
for table in upstream_tables
}
)
Collaborator

I see we always cast the result of make_lineage_edges_from_parsing_result to a set. Could we just do that inside make_lineage_edges_from_parsing_result instead?
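
A rough sketch of that suggestion, using simplified stand-ins rather than the real signatures: `Edge` and `make_edges` are hypothetical, and the actual helper takes a parsing result plus audit stamp and lineage type. The only point is that the helper returns a set itself, so call sites no longer wrap it in set(...).

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Edge:
    """Simplified, hashable stand-in for a lineage edge."""

    upstream: str
    lineage_type: str


def make_edges(upstreams: Iterable[str], lineage_type: str) -> Set[Edge]:
    """Build edges and deduplicate here, so callers need no set(...) wrapper."""
    return {Edge(upstream=u, lineage_type=lineage_type) for u in upstreams}


# Call site: the result is assigned directly, with no cast.
lineage = {"proj.ds.view": make_edges(["proj.ds.a", "proj.ds.b", "proj.ds.a"], "VIEW")}
```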

@@ -629,23 +785,26 @@ def calculate_lineage_for_project(
def get_lineage_for_table(
self,
bq_table: BigQueryTableRef,
bq_table_urn: str,
Collaborator

Doesn't have to be done in this PR, but I'd like to pass dataset_urn_builder: Callable[[BigQueryTableRef], str] into the lineage extractor, like I do with the BQ usage extractor, so that we always use the same method to generate dataset urns. Then I think this parameter won't be necessary (and we can remove the mce_builder call below in favor of it too).
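
To illustrate the idea, here is a small sketch with hypothetical stand-ins: `TableRef` replaces BigQueryTableRef, the two extractor classes are simplified, and `build_urn` is an assumed example builder, not the project's actual one. Both extractors receive the same callable, so they cannot drift apart in how they build urns.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TableRef:
    """Simplified stand-in for BigQueryTableRef."""

    project: str
    dataset: str
    table: str


UrnBuilder = Callable[[TableRef], str]


class LineageExtractor:
    def __init__(self, dataset_urn_builder: UrnBuilder) -> None:
        self.dataset_urn_builder = dataset_urn_builder

    def urn_for(self, ref: TableRef) -> str:
        # No separate bq_table_urn argument is needed; derive it from the ref.
        return self.dataset_urn_builder(ref)


class UsageExtractor:
    def __init__(self, dataset_urn_builder: UrnBuilder) -> None:
        self.dataset_urn_builder = dataset_urn_builder

    def urn_for(self, ref: TableRef) -> str:
        return self.dataset_urn_builder(ref)


def build_urn(ref: TableRef) -> str:
    """Assumed example builder; the real source would supply its own."""
    name = f"{ref.project}.{ref.dataset}.{ref.table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},PROD)"
```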

Comment on lines +270 to +272
self.sql_parser_schema_resolver = SchemaResolver(
platform=self.platform, env=self.config.env
)
Collaborator

Ideally, I think we'd implement SchemaResolver.close and call it when this source is closed and when the datahub graph is closed, rather than relying on garbage collection, which I think is the case right now
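
A sketch of an explicit close path, under the assumption that the resolver owns some resource worth releasing. The in-memory SQLite connection is only a stand-in for that resource, and `ClosableSchemaResolver` and `Source` are hypothetical classes, not the real SchemaResolver or BigquerySource.

```python
import sqlite3
from typing import Optional


class ClosableSchemaResolver:
    """Hypothetical resolver that releases its backing store explicitly."""

    def __init__(self) -> None:
        # Stand-in resource; the real resolver's storage may differ.
        self._conn: Optional[sqlite3.Connection] = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE schemas (urn TEXT PRIMARY KEY, cols TEXT)")

    @property
    def closed(self) -> bool:
        return self._conn is None

    def close(self) -> None:
        """Release the resource; safe to call more than once."""
        if self._conn is not None:
            self._conn.close()
            self._conn = None


class Source:
    """Simplified source that owns the resolver and closes it on shutdown."""

    def __init__(self) -> None:
        self.sql_parser_schema_resolver = ClosableSchemaResolver()

    def close(self) -> None:
        # Deterministic cleanup instead of waiting for garbage collection.
        self.sql_parser_schema_resolver.close()
```

Making close idempotent matters if both the source and the graph end up calling it on the same resolver.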

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Jul 10, 2023
@hsheth2
Collaborator Author

hsheth2 commented Jul 11, 2023

The CI issues will be fixed by #8393, so merging through.

@hsheth2 hsheth2 merged commit d4135d5 into datahub-project:master Jul 11, 2023
39 of 43 checks passed
@hsheth2 hsheth2 deleted the bq-col-lineage branch July 11, 2023 18:12
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
2 participants