feat(ingest): schema-aware SQL parsing for column-level lineage #8334
Conversation
Haven't been able to go through this whole thing, but it looks good to me so far. I have a couple of questions, but nothing serious. I feel like there should be a cleaner way to do the -> urn conversion, but I haven't really thought about it; the "duplicate" classes _ColumnRef and ColumnRef are just rubbing me the wrong way. Going to approve in case you need to merge; will finish the review much later tonight.
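For readers without the diff handy, the shape of the duplication being flagged is roughly the following. This is a hedged reconstruction, not the PR's exact code: the field types are assumptions. The idea is an internal ref that still carries the parser's table reference, alongside a public ref where the table has been converted to a DataHub dataset urn.

```python
from dataclasses import dataclass

from pydantic import BaseModel


@dataclass(frozen=True)
class _ColumnRef:
    # Internal form: `table` is still the raw table name from the parser.
    table: str
    column: str


class ColumnRef(BaseModel):
    # Public form: `table` has been converted to a DataHub dataset urn.
    table: str
    column: str
```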
```diff
@@ -285,6 +295,7 @@ def get_long_description():
     | bigquery_common
     | {
         *sqllineage_lib,
+        *sqlglot_lib,
```
Would be nice to get rid of bigquery beta if we can
```python
    column_error: Optional[Exception]


class SqlParsingResult(BaseModel):
```
What's the difference between using pydantic here vs dataclasses? Mostly curious
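As context for this question, here is a minimal sketch of the practical difference, assuming pydantic v1 semantics (the class and field names are illustrative, not from this PR): pydantic validates and coerces field values at construction time, while a dataclass simply stores whatever it is given.

```python
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel


@dataclass
class ResultDC:
    query_type: str
    error: Optional[str] = None


class ResultPD(BaseModel):
    query_type: str
    error: Optional[str] = None


ResultDC(query_type=123)  # accepted silently: dataclass type hints are not enforced
ResultPD(query_type=123)  # pydantic v1 coerces 123 to "123"; invalid input raises ValidationError
```

Pydantic models also come with .dict()/.json() serialization for free, which can be convenient for a result object that gets passed across module boundaries.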
```python
    default_schema: Optional[str] = None,
) -> SqlParsingResult:
    # TODO: convert datahub platform names to sqlglot dialect
    dialect = platform
```
Mostly unrelated to this PR, but we should make a platform enum
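As a rough sketch of what that could look like (the names here are hypothetical, not part of this PR), such an enum could carry both the DataHub platform name and the corresponding sqlglot dialect, replacing the `dialect = platform` passthrough above:

```python
from enum import Enum
from typing import Optional


class Platform(Enum):
    # value = (DataHub platform name, sqlglot dialect name)
    BIGQUERY = ("bigquery", "bigquery")
    POSTGRES = ("postgres", "postgres")
    MSSQL = ("mssql", "tsql")

    def __init__(self, datahub_name: str, sqlglot_dialect: str) -> None:
        self.datahub_name = datahub_name
        self.sqlglot_dialect = sqlglot_dialect

    @classmethod
    def from_datahub_name(cls, name: str) -> Optional["Platform"]:
        return next((p for p in cls if p.datahub_name == name), None)


# e.g. instead of `dialect = platform`:
dialect = Platform.from_datahub_name("mssql").sqlglot_dialect  # -> "tsql"
```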
@asikowitz going to merge here and fix the _ColumnRef / ColumnRef duplication in a follow-up: #8382
```python
self._schema_cache: FileBackedDict[Optional[SchemaInfo]] = FileBackedDict(
    shared_connection=shared_conn,
)
```
Doesn't look like we ever close this; are we just relying on garbage collection for that? I could see that taking a while, because references may be kept in memory by the lru_cache.
Fixed in the follow-up by explicitly calling self._make_schema_resolver.cache_clear().
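For anyone following along, a simplified sketch of the pattern being discussed (the class and method names are stand-ins, not DataHub's actual API): an lru_cache holds strong references to its cached return values, so the file-backed resolvers stay alive until the cache is explicitly cleared.

```python
import functools


class SchemaResolver:
    """Stand-in for the real resolver, which owns file-backed storage."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self._schema_cache: dict = {}  # stands in for FileBackedDict

    def close(self) -> None:
        self._schema_cache.clear()


class Pipeline:
    @functools.lru_cache(maxsize=8)
    def _make_schema_resolver(self, platform: str) -> SchemaResolver:
        # The lru_cache keeps a strong reference to every resolver it
        # returns, so none of them are garbage-collected while cached.
        return SchemaResolver(platform)

    def close(self) -> None:
        # Explicitly drop the cached resolvers; without this, cleanup
        # waits on garbage collection of the cache itself.
        self._make_schema_resolver.cache_clear()
```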