feat(ingest/bigquery): support column-level lineage #8382

hsheth2 · 2023-07-07T23:01:11Z

Changes stacked on #8334.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

asikowitz

Overall looks good to me! If we're looking to ship this soon for a client, I don't remember any changes that /need/ to happen, so I think we can push and address follow-ups after. I got stuck on the recursive temp-table-removal process for a bit, so I have some naming suggestions there, but I'm not sure of the best way to make that clearer.
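For readers who hit the same wall, here is a rough sketch of what recursive temp-table removal generally looks like. It is not the code from this PR: the function name, the plain-string table names, and the dict-of-sets lineage shape are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set


def resolve_upstreams(
    table: str,
    lineage: Dict[str, Set[str]],
    temp_tables: Set[str],
    _visiting: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Return the upstreams of `table`, replacing every temp table by its own upstreams.

    Hypothetical helper: `lineage` maps a table name to its direct upstream names.
    """
    visiting = _visiting | {table}
    resolved: Set[str] = set()
    for upstream in lineage.get(table, set()):
        if upstream not in temp_tables:
            # A permanent table: keep it as a real upstream.
            resolved.add(upstream)
        elif upstream not in visiting:
            # A temp table: skip it and pull in whatever it was built from.
            resolved |= resolve_upstreams(upstream, lineage, temp_tables, visiting)
        # A temp table already on the current path is a cycle, so it is dropped.
    return resolved


# Illustrative data only.
example_lineage = {
    "final": {"tmp_a", "src1"},
    "tmp_a": {"tmp_b", "src2"},
    "tmp_b": {"src3"},
}
example_result = resolve_upstreams("final", example_lineage, {"tmp_a", "tmp_b"})
```

Naming the accumulator `resolved` and the path guard `visiting` is one way to make the two roles in the recursion easier to tell apart.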

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py

asikowitz · 2023-07-10T19:04:03Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py

- auditStamp=datetime.now(timezone.utc),
- type=DatasetLineageTypeClass.VIEW,
+ for view, view_definition in self.view_definitions[project_id].items():
+ raw_view_lineage = sqlglot_lineage(


Do we think this SQL parser is strictly better than the old one? A fallback to the old one might be safer, but honestly I would like to err on the side of code velocity over safety in cases like these, so I am fine with this as is.

EDIT: That being said, a function like below might be nice, in case we support multiple parsers or adjust the call signature of sqlglot_lineage:

def parse_lineage(self, query: str, project_id: str):
    return sqlglot_lineage(query, self.platform, self.sql_parser_schema_resolver, project_id)

I initially wanted to do that. However, the sql_parser_schema_resolver is only available in the main BigquerySource, and not in the BigqueryLineage class.

Ideally, we'd create a BigqueryContext object with the schema resolver, urn generation utils, etc., and pass that around everywhere.
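A minimal sketch of that idea, under stated assumptions: `BigqueryContext` does not exist in the codebase as of this PR, the field names are invented, and the parser is injected as a plain callable whose argument order mirrors the `sqlglot_lineage(...)` call in the suggestion above. The extractor class is a stand-in for the lineage class, not its real interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class BigqueryContext:
    """Hypothetical bundle of helpers shared by the source and its extractors."""

    platform: str
    schema_resolver: Any  # stand-in for the SchemaResolver instance
    # Stand-in for the SQL parser: (query, platform, schema_resolver, project_id) -> result
    sql_parser: Callable[[str, str, Any, str], Any]

    def parse_lineage(self, query: str, project_id: str) -> Any:
        # Single place to swap parsers or adapt to a changed call signature.
        return self.sql_parser(query, self.platform, self.schema_resolver, project_id)


class ViewLineageExtractor:
    """Stand-in for the lineage class; it receives the context instead of the source."""

    def __init__(self, ctx: BigqueryContext) -> None:
        self.ctx = ctx

    def lineage_for_view(self, view_definition: str, project_id: str) -> Any:
        return self.ctx.parse_lineage(view_definition, project_id)


def fake_parser(query: str, platform: str, schema_resolver: Any, project_id: str) -> Any:
    # Test double that just echoes its inputs.
    return (platform, project_id, query)


ctx = BigqueryContext(platform="bigquery", schema_resolver=None, sql_parser=fake_parser)
extractor = ViewLineageExtractor(ctx)
```

With this shape, the lineage class no longer needs a reference back to the main source to reach the schema resolver.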

asikowitz · 2023-07-10T19:05:08Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py

+ lineage[view] = set(
+ make_lineage_edges_from_parsing_result(
+ raw_view_lineage,
+ audit_stamp=ts,
+ lineage_type=DatasetLineageTypeClass.VIEW,
 )
- for table in upstream_tables
- }
+ )


I see we always cast the result of make_lineage_edges_from_parsing_result to a set. Could we just do that inside make_lineage_edges_from_parsing_result instead?
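Roughly what I have in mind, as a sketch only: the edge type and the function signature below are simplified stand-ins, not the real `make_lineage_edges_from_parsing_result` or its edge class. The only assumption that matters is that the edge type is hashable.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class LineageEdgeSketch:
    """Hypothetical edge type; frozen so instances are hashable and can live in a set."""

    table: str
    lineage_type: str


def make_lineage_edges(
    upstream_tables: Iterable[str], lineage_type: str
) -> Set[LineageEdgeSketch]:
    # Returning a set directly means callers never need to wrap the result in set(...),
    # and duplicate upstreams collapse in one place.
    return {
        LineageEdgeSketch(table=table, lineage_type=lineage_type)
        for table in upstream_tables
    }


edges = make_lineage_edges(["a", "b", "a"], "VIEW")
```

The call site would then read `lineage[view] = make_lineage_edges(...)` with no cast.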

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py

metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py

asikowitz · 2023-07-10T21:00:01Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py

@@ -629,23 +785,26 @@ def calculate_lineage_for_project(
 def get_lineage_for_table(
 self,
 bq_table: BigQueryTableRef,
+ bq_table_urn: str,


Doesn't have to be done in this PR, but I'd like to pass dataset_urn_builder: Callable[[BigQueryTableRef], str] into the lineage extractor, like I do with the BQ usage extractor, so that we always use the same method to generate dataset urns. Then I think this parameter won't be necessary, and we can replace the mce_builder call below with it too.
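A sketch of the injection, with stand-ins throughout: `TableRef` substitutes for `BigQueryTableRef`, the extractor class is simplified, and `build_urn` is a hypothetical builder whose output format is illustrative only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TableRef:
    """Stand-in for BigQueryTableRef."""

    project: str
    dataset: str
    table: str


class TableLineageExtractor:
    """Simplified stand-in for the lineage extractor."""

    def __init__(self, dataset_urn_builder: Callable[[TableRef], str]) -> None:
        self.dataset_urn_builder = dataset_urn_builder

    def get_lineage_for_table(self, bq_table: TableRef) -> str:
        # The urn is derived here, so callers no longer pass a separate bq_table_urn.
        return self.dataset_urn_builder(bq_table)


def build_urn(ref: TableRef) -> str:
    # Hypothetical builder; the source would pass in its own method instead.
    name = f"{ref.project}.{ref.dataset}.{ref.table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},PROD)"


lineage_extractor = TableLineageExtractor(dataset_urn_builder=build_urn)
urn = lineage_extractor.get_lineage_for_table(TableRef("proj", "ds", "tbl"))
```

Since the source owns the builder and hands the same callable to every extractor, urn generation cannot drift between the usage and lineage paths.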

asikowitz · 2023-07-10T21:01:52Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py

+ self.sql_parser_schema_resolver = SchemaResolver(
+ platform=self.platform, env=self.config.env
+ )


Ideally, I think we'd implement SchemaResolver.close and call it when this source is closed and when the DataHub graph is closed, rather than relying on garbage collection, which I think is what happens right now.
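A sketch of the explicit-close pattern, assuming the resolver owns some on-disk resource. Both classes are stand-ins, and the SQLite temp file is only an example resource; I'm not claiming this is how SchemaResolver stores its data.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


class ResolverSketch:
    """Stand-in for a resolver that owns an on-disk cache (an assumption)."""

    def __init__(self) -> None:
        fd, self.path = tempfile.mkstemp(suffix=".db")
        os.close(fd)
        self._conn = sqlite3.connect(self.path)
        self.closed = False

    def close(self) -> None:
        if self.closed:
            return  # idempotent, so a double close from source and graph is harmless
        self._conn.close()
        os.remove(self.path)
        self.closed = True


class SourceSketch:
    """Stand-in for the ingestion source, which owns the resolver."""

    def __init__(self) -> None:
        self.resolver = ResolverSketch()

    def close(self) -> None:
        # Release the resolver deterministically instead of waiting for GC.
        self.resolver.close()


with closing(SourceSketch()) as source:
    cache_path = source.resolver.path
    existed_while_open = os.path.exists(cache_path)
```

Making `close` idempotent matters here because both the source and the graph might end up calling it.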

hsheth2 · 2023-07-11T18:12:46Z

The CI issues will be fixed by #8393, so merging through.

github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) Jul 7, 2023

hsheth2 mentioned this pull request Jul 7, 2023

feat(ingest): schema-aware SQL parsing for column-level lineage #8334

Merged

5 tasks

hsheth2 marked this pull request as ready for review July 7, 2023 23:24

hsheth2 added 8 commits July 7, 2023 16:25

remove unused files

4de6350

simplify bq gen_lineage

2581742

column level view lineage working

07e27cb

correct col confidence

d80842d

parsing for audit logs

ee0441a

merge column lineage through temp tables

6768f1a

add a extract_column_lineage flag

38970e7

refactor _allow_table_name_reuse

8f50043

hsheth2 force-pushed the bq-col-lineage branch from 6629681 to 8f50043 July 7, 2023 23:25

fix some stuff from review

7378e17

vercel bot deployed to Preview July 8, 2023 00:16

hsheth2 requested a review from asikowitz July 8, 2023 01:03

hsheth2 added 2 commits July 10, 2023 12:34

make view storage more efficient

cd2015e

Merge branch 'master' into bq-col-lineage

352d8ed

vercel bot deployed to Preview July 10, 2023 20:34

asikowitz approved these changes Jul 10, 2023

hsheth2 added 4 commits July 10, 2023 16:05

initial review

a467472

clarify naming

c0e6124

use temp audit log

786ce33

remove parse_view_lineage

87eeb9d

hsheth2 added the "merge-pending-ci" label (A PR that has passed review and should be merged once CI is green.) Jul 10, 2023

vercel bot deployed to Preview July 10, 2023 23:56

fix lint

0a76a9b

vercel bot deployed to Preview July 11, 2023 00:44

hsheth2 merged commit d4135d5 into datahub-project:master Jul 11, 2023
39 of 43 checks passed

hsheth2 deleted the bq-col-lineage branch July 11, 2023 18:12

asikowitz mentioned this pull request Aug 29, 2023

fix(ingest/bigquery): Handle null view_definition; remove view definition hash ids #8747

Merged

5 tasks
