OL facets - PR2 - read facets from views based on lineage_events table #2355

pawel-big-lebowski · 2023-01-11T09:31:18Z

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Migration of facets from lineage_events to separate facets tables will be asynchronous for some users as it will require manual step after database schema is migrated to newer version.

We want to have an option to switch Marquez code to read facets from lineage_events before the manual migration and switch to using job_facets, dataset_facets and run_facets tables after the migration. We cannot split it into separate releases, as migration is manual and asynchronous and we never know when the user runs it (please note users may run upgrade multiple versions in one run).

Solution

job_facets_vew, dataset_facets_view and run_facets_view will be created on the top of existing lineage_events table.
Marquez codebase will read facets from those tables.
In the next PR, data will be migrated from lineage_events table to facets' tables and views will be replaced to point at job_facets, dataset_facets and run_facets tables.
Changelog file will be modified in Ol facets PR3.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

pawel-big-lebowski · 2023-01-11T11:08:53Z

api/src/main/java/marquez/db/migrations/V56_1__FacetViews.java

+@Slf4j
+public class V56_1__FacetViews implements JavaMigration {
+
+ public static final String DATASET_FACETS_DEFINITION =


Migration is written as Java code because facet definitions will be reused in a migration that writes data to facets table.

codecov · 2023-01-13T13:38:20Z

Codecov Report

Merging #2355 (b835721) into main (d7298a9) will increase coverage by 0.30%.
The diff coverage is 87.25%.

@@             Coverage Diff              @@
##               main    #2355      +/-   ##
============================================
+ Coverage     76.81%   77.11%   +0.30%     
- Complexity     1195     1234      +39     
============================================
  Files           226      228       +2     
  Lines          5473     5572      +99     
  Branches        443      447       +4     
============================================
+ Hits           4204     4297      +93     
- Misses          772      775       +3     
- Partials        497      500       +3

Impacted Files	Coverage Δ
api/src/main/java/marquez/MarquezApp.java	`63.75% <0.00%> (-2.49%)`	⬇️
api/src/main/java/marquez/db/DatasetDao.java	`98.64% <ø> (ø)`
api/src/main/java/marquez/db/DatasetFacetsDao.java	`82.22% <0.00%> (-2.23%)`	⬇️
...pi/src/main/java/marquez/db/DatasetVersionDao.java	`95.83% <ø> (ø)`
api/src/main/java/marquez/db/FacetUtils.java	`66.66% <ø> (ø)`
api/src/main/java/marquez/db/JobDao.java	`100.00% <ø> (ø)`
api/src/main/java/marquez/db/JobFacetsDao.java	`64.28% <ø> (-2.39%)`	⬇️
api/src/main/java/marquez/db/JobVersionDao.java	`91.04% <ø> (ø)`
api/src/main/java/marquez/db/RunDao.java	`92.50% <ø> (ø)`
api/src/main/java/marquez/db/RunFacetsDao.java	`80.95% <ø> (+3.67%)`	⬆️
... and 5 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pawel-big-lebowski · 2023-01-13T13:44:14Z

docker/init-db.sh

@@ -8,7 +8,7 @@
 set -eu

 psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" > /dev/null <<-EOSQL
- CREATE USER ${MARQUEZ_USER};
+ CREATE USER ${MARQUEZ_USER} SUPERUSER;


@wslulciuc @collado-mike @mobuchowski This line is worth looking at and perhaps a discussion.

As long we're using PSQL 12, super user is required to install uuid-ossp extension. We can remove it once upgrading PostgreSQL version to 13.

Additionally, the migration code should work in a way such that it is allowed to run externally sql:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp"

and run migration with no extra privileges.

The CHANGELOG info about this will be included in the PR3.

It'd be great to upgrade our postgres version, but we'll want to decouple the upgrade from the migration. Meaning, there's no way arounds having to require users to run the migration as superuser. Since requiring superuser is only for the migration, providing migration steps in the changelog would be enough.

Hmm, requiring the Postgres upgrade and/or running the migration process as a superuser seems like a serious hurdle. What's the principal gain the user gets from this? I.e., why do the facet tables require a UUID primary key to begin with? AFAIK, we only join with those tables based on the foreign keys (run_uuid, job_uuid, dataset_uuid). If we get rid of the primary keys on those facet records, we don't have to worry about new extensions or postgres upgrades and I don't know if there's any downside to eliminating them.

Uuid as a primary key has been proposed in proposal #2076 some time ago before I started working on that.
For me it looked like some general pattern in Marquez, that we create primary keys for tables so that database schema can be easily extended in future (easy joining of tables). For these tables, primary key cannot be derived from other columns. But I don't see any blocker in removing those columns.

Unfortunately, yes the migration would be a hurdle for users running Marquez with postgres 12 or below. As @pawel-big-lebowski mentioned, we do have a general pattern for defining a uuid (=PK) column for most tables (mapping tables excluded ex: dataset_fields_tag_mapping). It would be great to keep this convention as we don't know how the facet tables will be used or extended.

For the facet tables, the uuid ensure uniqueness. For example, the schema for dataset_facets has dataset_uuid and run_uuid but a run may have more than one facet defined with the same name, therefore uuid would ensure we capture both values of the facet. To simply our merging logic for facets, the facet tables are, for now, append only.

We can still append a new facet for the same run even without the UUID. The primary key is principally useful for fetching a specific record or referencing a specific record from another table. Neither of those is the case here.

Arguing that we don't know how the facet tables will be used doesn't make sense either, as we can always add the UUID column after the fact. When there's a reason to introduce the primary key, we can argue that the extra work is worth the effort. For now, the current migration requires a very substantial change in the system with no benefit. Following the pattern simply for the sake of the pattern isn't a good reason to require these changes to the system or the deployment process that users are following.

wslulciuc

We'll have to update the copyright (2022 -> 2023):


/*
 * Copyright 2018-2023 contributors to the Marquez project
 * SPDX-License-Identifier: Apache-2.0
 */

wslulciuc

We can remove metadata.json as it's not used.

marquez.dev.yml

pawel-big-lebowski · 2023-01-18T08:41:26Z

We can remove metadata.json as it's not used.

Not sure why anything related to metadata.json was in this PR. After merging PR1 and rebasing this one, it's no longer present.

collado-mike · 2023-01-25T21:58:35Z

api/src/main/java/marquez/db/DatasetVersionDao.java

+ ), selected_dataset_version_facets AS (
+ SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet
 FROM selected_dataset_version_runs dv
- LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid
+ LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid


The original query returned facets that were attached to the specific dataset version that is being queried. This query returns all facets that are attached to the dataset. This will include things like data quality facets on previous or newer versions of the dataset, which may not reflect the state of the selected dataset version. We need to be able to attach and query facets to specific dataset versions.

Great remark, I've added missing join on run_uuid.

collado-mike · 2023-01-26T20:13:02Z

api/src/main/java/marquez/db/DatasetVersionDao.java

+ ), selected_dataset_version_facets AS (
+ SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet
 FROM selected_dataset_version_runs dv
- LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid
+ LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid AND df.run_uuid = dv.run_uuid


The selected_dataset_version_facets CTE is expensive to create because we need to join on the runs_input_mapping table to find all of the runs that read from or wrote to the selected dataset version. Rather than going through this extra hop, can we add the dataset version UUID to the facet table so we can simply query all facets that apply to a specific dataset version?

Yeah ;-) I had similar thoughts on this but the counterargument in my head was to avoid doing endlessly more & more in a single PR.

I think you're right and there won't be better moment to fix this.
I've included it in a second commit within this PR.

…able Signed-off-by: Pawel Leszczynski <[email protected]>

Signed-off-by: Pawel Leszczynski <[email protected]>

collado-mike

LGTM. There's a comment on the Dataset facet retrieval, but I think it's an optimization and can be done now or later.

collado-mike · 2023-01-30T19:30:14Z

api/src/main/java/marquez/db/DatasetDao.java

 ), dataset_runs AS (
- SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, event_time, event
+ SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, lineage_event_time, facet
 FROM selected_datasets d
- INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid
+ INNER JOIN dataset_versions AS dv ON dv.uuid = d.current_version_uuid
 LEFT JOIN LATERAL (
- SELECT run_uuid, event_time, event FROM lineage_events
- WHERE run_uuid = dv.run_uuid
- ) e ON e.run_uuid = dv.run_uuid
+ SELECT run_uuid, lineage_event_time, facet FROM dataset_facets_view
+ WHERE dataset_uuid = dv.dataset_uuid
+ ) df ON df.run_uuid = dv.run_uuid
 UNION
- SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, event_time, event
+ SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, lineage_event_time, facet
 FROM selected_datasets d
 INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid
 LEFT JOIN runs_input_mapping rim ON dv.uuid = rim.dataset_version_uuid
 LEFT JOIN LATERAL (
- SELECT run_uuid, event_time, event FROM lineage_events
- WHERE run_uuid = rim.run_uuid
-  ) e ON e.run_uuid = rim.run_uuid
+ SELECT dataset_uuid, run_uuid, lineage_event_time, facet FROM dataset_facets_view
+ WHERE dataset_uuid = dv.dataset_uuid AND run_uuid = rim.run_uuid
+ ) df ON df.run_uuid = rim.run_uuid


The only purpose of this CTE is to determine which runs have input or output facets for the current dataset version. Given that the dataset_facets_view now has dataset_version_uuid, I think we can drop this whole subquery and join directly on df.dataset_version_uuid=d.current_version_uuid` below

Signed-off-by: Pawel Leszczynski <[email protected]>

wslulciuc

LGTM 💯 🚀 🥇

boring-cyborg bot added the api API layer changes label Jan 11, 2023

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 2 times, most recently from c45ac3a to bad157a Compare January 11, 2023 11:04

pawel-big-lebowski commented Jan 11, 2023

View reviewed changes

pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from 3b63ae1 to c89504d Compare January 12, 2023 08:49

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 3 times, most recently from 117f18c to 1624f92 Compare January 13, 2023 11:06

boring-cyborg bot added the docker label Jan 13, 2023

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 1624f92 to 0d322ef Compare January 13, 2023 13:24

pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch from c89504d to be0f179 Compare January 13, 2023 13:28

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 0d322ef to bde93bd Compare January 13, 2023 13:29

pawel-big-lebowski commented Jan 13, 2023

View reviewed changes

pawel-big-lebowski marked this pull request as ready for review January 13, 2023 13:44

pawel-big-lebowski requested review from mobuchowski, collado-mike and wslulciuc and removed request for mobuchowski January 13, 2023 13:44

pawel-big-lebowski mentioned this pull request Jan 17, 2023

Add OL facet tables #2152

Closed

7 tasks

pawel-big-lebowski self-assigned this Jan 17, 2023

Base automatically changed from ol-facets/PR1-feed-new-tables to main January 17, 2023 10:01

wslulciuc requested changes Jan 17, 2023

View reviewed changes

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 4 times, most recently from 973832e to 2df36cf Compare January 18, 2023 08:01

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 2df36cf to 80116ab Compare January 18, 2023 08:31

pawel-big-lebowski commented Jan 18, 2023

View reviewed changes

marquez.dev.yml Outdated Show resolved Hide resolved

wslulciuc mentioned this pull request Jan 19, 2023

Define marquez.yml for API load test #2368

Merged

7 tasks

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 80116ab to 1d1fec4 Compare January 19, 2023 08:50

pawel-big-lebowski mentioned this pull request Jan 20, 2023

Remove job context. #2373

Merged

7 tasks

pawel-big-lebowski requested a review from wslulciuc January 23, 2023 07:20

collado-mike reviewed Jan 25, 2023

View reviewed changes

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 1d1fec4 to 5967b0c Compare January 26, 2023 08:28

collado-mike reviewed Jan 26, 2023

View reviewed changes

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 5967b0c to 298b4e1 Compare January 27, 2023 07:00

OL facets - PR2 - read facets from views pointing to lineage_events t…

b4b1903

…able Signed-off-by: Pawel Leszczynski <[email protected]>

pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch from 298b4e1 to b4b1903 Compare January 27, 2023 07:07

add dataset_version_uuid to dataset_versions

45ecf15

Signed-off-by: Pawel Leszczynski <[email protected]>

collado-mike approved these changes Jan 30, 2023

View reviewed changes

OL facets - PR3 - migrate data to facet tables (#2359)

88b8c7d

Signed-off-by: Pawel Leszczynski <[email protected]>

boring-cyborg bot added the docs label Jan 31, 2023

Merge branch 'main' into ol-facets/PR2-read-data-from-views

b835721

wslulciuc approved these changes Jan 31, 2023

View reviewed changes

wslulciuc merged commit d17668a into main Jan 31, 2023

wslulciuc deleted the ol-facets/PR2-read-data-from-views branch January 31, 2023 09:10

pawel-big-lebowski mentioned this pull request Feb 6, 2023

Optimise fetching facets in DatasetDao #2406

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OL facets - PR2 - read facets from views based on lineage_events table #2355

OL facets - PR2 - read facets from views based on lineage_events table #2355

pawel-big-lebowski commented Jan 11, 2023 •

edited

Loading

pawel-big-lebowski Jan 11, 2023

codecov bot commented Jan 13, 2023 •

edited

Loading

pawel-big-lebowski Jan 13, 2023

wslulciuc Jan 17, 2023

collado-mike Jan 25, 2023

pawel-big-lebowski Jan 26, 2023

wslulciuc Jan 26, 2023 •

edited

Loading

collado-mike Jan 26, 2023

wslulciuc left a comment

wslulciuc left a comment

pawel-big-lebowski commented Jan 18, 2023

collado-mike Jan 25, 2023

pawel-big-lebowski Jan 26, 2023

collado-mike Jan 26, 2023

pawel-big-lebowski Jan 27, 2023

collado-mike left a comment

collado-mike Jan 30, 2023

pawel-big-lebowski Feb 6, 2023

wslulciuc left a comment

OL facets - PR2 - read facets from views based on lineage_events table #2355

OL facets - PR2 - read facets from views based on lineage_events table #2355

Conversation

pawel-big-lebowski commented Jan 11, 2023 • edited Loading

Problem

Solution

Checklist

Choose a reason for hiding this comment

codecov bot commented Jan 13, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wslulciuc Jan 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wslulciuc left a comment

Choose a reason for hiding this comment

wslulciuc left a comment

Choose a reason for hiding this comment

pawel-big-lebowski commented Jan 18, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

collado-mike left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wslulciuc left a comment

Choose a reason for hiding this comment

pawel-big-lebowski commented Jan 11, 2023 •

edited

Loading

codecov bot commented Jan 13, 2023 •

edited

Loading

wslulciuc Jan 26, 2023 •

edited

Loading