Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OL facets - PR2 - read facets from views based on lineage_events table #2355

Merged
merged 4 commits into from
Jan 31, 2023

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Jan 11, 2023

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Migration of facets from lineage_events to separate facets tables will be asynchronous for some users as it will require manual step after database schema is migrated to newer version.

We want to have an option to switch Marquez code to read facets from lineage_events before the manual migration and switch to using job_facets, dataset_facets and run_facets tables after the migration. We cannot split it into separate releases, as migration is manual and asynchronous and we never know when the user runs it (please note users may run upgrade multiple versions in one run).

Solution

  • job_facets_vew, dataset_facets_view and run_facets_view will be created on the top of existing lineage_events table.
  • Marquez codebase will read facets from those tables.
  • In the next PR, data will be migrated from lineage_events table to facets' tables and views will be replaced to point at job_facets, dataset_facets and run_facets tables.
  • Changelog file will be modified in Ol facets PR3.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Jan 11, 2023
@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 2 times, most recently from c45ac3a to bad157a Compare January 11, 2023 11:04
@Slf4j
public class V56_1__FacetViews implements JavaMigration {

public static final String DATASET_FACETS_DEFINITION =
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration is written as Java code because facet definitions will be reused in a migration that writes data to facets table.

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from 3b63ae1 to c89504d Compare January 12, 2023 08:49
@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 3 times, most recently from 117f18c to 1624f92 Compare January 13, 2023 11:06
@boring-cyborg boring-cyborg bot added the docker label Jan 13, 2023
@codecov
Copy link

codecov bot commented Jan 13, 2023

Codecov Report

Merging #2355 (b835721) into main (d7298a9) will increase coverage by 0.30%.
The diff coverage is 87.25%.

@@             Coverage Diff              @@
##               main    #2355      +/-   ##
============================================
+ Coverage     76.81%   77.11%   +0.30%     
- Complexity     1195     1234      +39     
============================================
  Files           226      228       +2     
  Lines          5473     5572      +99     
  Branches        443      447       +4     
============================================
+ Hits           4204     4297      +93     
- Misses          772      775       +3     
- Partials        497      500       +3     
Impacted Files Coverage Δ
api/src/main/java/marquez/MarquezApp.java 63.75% <0.00%> (-2.49%) ⬇️
api/src/main/java/marquez/db/DatasetDao.java 98.64% <ø> (ø)
api/src/main/java/marquez/db/DatasetFacetsDao.java 82.22% <0.00%> (-2.23%) ⬇️
...pi/src/main/java/marquez/db/DatasetVersionDao.java 95.83% <ø> (ø)
api/src/main/java/marquez/db/FacetUtils.java 66.66% <ø> (ø)
api/src/main/java/marquez/db/JobDao.java 100.00% <ø> (ø)
api/src/main/java/marquez/db/JobFacetsDao.java 64.28% <ø> (-2.39%) ⬇️
api/src/main/java/marquez/db/JobVersionDao.java 91.04% <ø> (ø)
api/src/main/java/marquez/db/RunDao.java 92.50% <ø> (ø)
api/src/main/java/marquez/db/RunFacetsDao.java 80.95% <ø> (+3.67%) ⬆️
... and 5 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -8,7 +8,7 @@
set -eu

psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" > /dev/null <<-EOSQL
CREATE USER ${MARQUEZ_USER};
CREATE USER ${MARQUEZ_USER} SUPERUSER;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wslulciuc @collado-mike @mobuchowski This line is worth looking at and perhaps a discussion.

As long we're using PSQL 12, super user is required to install uuid-ossp extension. We can remove it once upgrading PostgreSQL version to 13.

Additionally, the migration code should work in a way such that it is allowed to run externally sql:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp"

and run migration with no extra privileges.

The CHANGELOG info about this will be included in the PR3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to upgrade our postgres version, but we'll want to decouple the upgrade from the migration. Meaning, there's no way arounds having to require users to run the migration as superuser. Since requiring superuser is only for the migration, providing migration steps in the changelog would be enough.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, requiring the Postgres upgrade and/or running the migration process as a superuser seems like a serious hurdle. What's the principal gain the user gets from this? I.e., why do the facet tables require a UUID primary key to begin with? AFAIK, we only join with those tables based on the foreign keys (run_uuid, job_uuid, dataset_uuid). If we get rid of the primary keys on those facet records, we don't have to worry about new extensions or postgres upgrades and I don't know if there's any downside to eliminating them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uuid as a primary key has been proposed in proposal #2076 some time ago before I started working on that.
For me it looked like some general pattern in Marquez, that we create primary keys for tables so that database schema can be easily extended in future (easy joining of tables). For these tables, primary key cannot be derived from other columns. But I don't see any blocker in removing those columns.

Copy link
Member

@wslulciuc wslulciuc Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes the migration would be a hurdle for users running Marquez with postgres 12 or below. As @pawel-big-lebowski mentioned, we do have a general pattern for defining a uuid (=PK) column for most tables (mapping tables excluded ex: dataset_fields_tag_mapping). It would be great to keep this convention as we don't know how the facet tables will be used or extended.

For the facet tables, the uuid ensure uniqueness. For example, the schema for dataset_facets has dataset_uuid and run_uuid but a run may have more than one facet defined with the same name, therefore uuid would ensure we capture both values of the facet. To simply our merging logic for facets, the facet tables are, for now, append only.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still append a new facet for the same run even without the UUID. The primary key is principally useful for fetching a specific record or referencing a specific record from another table. Neither of those is the case here.

Arguing that we don't know how the facet tables will be used doesn't make sense either, as we can always add the UUID column after the fact. When there's a reason to introduce the primary key, we can argue that the extra work is worth the effort. For now, the current migration requires a very substantial change in the system with no benefit. Following the pattern simply for the sake of the pattern isn't a good reason to require these changes to the system or the deployment process that users are following.

@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 13, 2023 13:44
@pawel-big-lebowski pawel-big-lebowski mentioned this pull request Jan 17, 2023
7 tasks
@pawel-big-lebowski pawel-big-lebowski self-assigned this Jan 17, 2023
Base automatically changed from ol-facets/PR1-feed-new-tables to main January 17, 2023 10:01
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to update the copyright (2022 -> 2023):


/*
 * Copyright 2018-2023 contributors to the Marquez project
 * SPDX-License-Identifier: Apache-2.0
 */

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove metadata.json as it's not used.

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR2-read-data-from-views branch 4 times, most recently from 973832e to 2df36cf Compare January 18, 2023 08:01
marquez.dev.yml Outdated Show resolved Hide resolved
@pawel-big-lebowski
Copy link
Collaborator Author

We can remove metadata.json as it's not used.

Not sure why anything related to metadata.json was in this PR. After merging PR1 and rebasing this one, it's no longer present.

Comment on lines 171 to 174
), selected_dataset_version_facets AS (
SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet
FROM selected_dataset_version_runs dv
LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid
LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original query returned facets that were attached to the specific dataset version that is being queried. This query returns all facets that are attached to the dataset. This will include things like data quality facets on previous or newer versions of the dataset, which may not reflect the state of the selected dataset version. We need to be able to attach and query facets to specific dataset versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great remark, I've added missing join on run_uuid.

Comment on lines 210 to 213
), selected_dataset_version_facets AS (
SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet
FROM selected_dataset_version_runs dv
LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid
LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid AND df.run_uuid = dv.run_uuid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selected_dataset_version_facets CTE is expensive to create because we need to join on the runs_input_mapping table to find all of the runs that read from or wrote to the selected dataset version. Rather than going through this extra hop, can we add the dataset version UUID to the facet table so we can simply query all facets that apply to a specific dataset version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ;-) I had similar thoughts on this but the counterargument in my head was to avoid doing endlessly more & more in a single PR.

I think you're right and there won't be better moment to fix this.
I've included it in a second commit within this PR.

Copy link
Collaborator

@collado-mike collado-mike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. There's a comment on the Dataset facet retrieval, but I think it's an optimization and can be done now or later.

Comment on lines 78 to +94
), dataset_runs AS (
SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, event_time, event
SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, lineage_event_time, facet
FROM selected_datasets d
INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid
INNER JOIN dataset_versions AS dv ON dv.uuid = d.current_version_uuid
LEFT JOIN LATERAL (
SELECT run_uuid, event_time, event FROM lineage_events
WHERE run_uuid = dv.run_uuid
) e ON e.run_uuid = dv.run_uuid
SELECT run_uuid, lineage_event_time, facet FROM dataset_facets_view
WHERE dataset_uuid = dv.dataset_uuid
) df ON df.run_uuid = dv.run_uuid
UNION
SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, event_time, event
SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, lineage_event_time, facet
FROM selected_datasets d
INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid
LEFT JOIN runs_input_mapping rim ON dv.uuid = rim.dataset_version_uuid
LEFT JOIN LATERAL (
SELECT run_uuid, event_time, event FROM lineage_events
WHERE run_uuid = rim.run_uuid
) e ON e.run_uuid = rim.run_uuid
SELECT dataset_uuid, run_uuid, lineage_event_time, facet FROM dataset_facets_view
WHERE dataset_uuid = dv.dataset_uuid AND run_uuid = rim.run_uuid
) df ON df.run_uuid = rim.run_uuid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only purpose of this CTE is to determine which runs have input or output facets for the current dataset version. Given that the dataset_facets_view now has dataset_version_uuid, I think we can drop this whole subquery and join directly on df.dataset_version_uuid=d.current_version_uuid` below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #2407

@boring-cyborg boring-cyborg bot added the docs label Jan 31, 2023
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💯 🚀 🥇

@wslulciuc wslulciuc merged commit d17668a into main Jan 31, 2023
@wslulciuc wslulciuc deleted the ol-facets/PR2-read-data-from-views branch January 31, 2023 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docker docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants