Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index on jobs_fqn namespace and fqn to optimize read queries #2357

Merged
merged 3 commits into from
Feb 9, 2023

Conversation

collado-mike
Copy link
Collaborator

Problem

An index of the jobs table's name and namespace_name has existed for a long time, but there was no index added to the newer jobs_fqn table's namespace_name and job_fqn columns.

Solution

Adds an index on the jobs_fqn table using the namespace_name and job_fqn columns.

EXPLAIN plans on a Marquez installation with ~100,000 jobs for the query

SELECT uuid FROM jobs_view j WHERE j.namespace_name=:namespace AND j.name=:name

Original EXPLAIN plan:

Nested Loop  (cost=1000.42..18554.76 rows=1 width=16)
  ->  Gather  (cost=1000.00..18546.32 rows=1 width=16)
        Workers Planned: 2
        ->  Parallel Seq Scan on jobs_fqn f  (cost=0.00..17546.22 rows=1 width=16)
              Filter: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'example_job'::text))
  ->  Index Only Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=16)
        Index Cond: (uuid = f.uuid)

New EXPLAIN plan:

Nested Loop  (cost=0.97..17.01 rows=1 width=16)
  ->  Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=16)
        Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'example_job'::text))
  ->  Index Only Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=16)
        Index Cond: (uuid = f.uuid)

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Jan 12, 2023
@codecov
Copy link

codecov bot commented Jan 12, 2023

Codecov Report

Merging #2357 (67930ce) into main (996e834) will decrease coverage by 0.40%.
The diff coverage is n/a.

❗ Current head 67930ce differs from pull request most recent head f772621. Consider uploading reports for the commit f772621 to get more accurate results

@@             Coverage Diff              @@
##               main    #2357      +/-   ##
============================================
- Coverage     77.11%   76.72%   -0.40%     
+ Complexity     1234     1177      -57     
============================================
  Files           228      222       -6     
  Lines          5572     5354     -218     
  Branches        447      429      -18     
============================================
- Hits           4297     4108     -189     
+ Misses          775      768       -7     
+ Partials        500      478      -22     
Impacted Files Coverage Δ
...main/java/marquez/service/models/LineageEvent.java 86.30% <0.00%> (-8.22%) ⬇️
api/src/main/java/marquez/db/OpenLineageDao.java 96.09% <0.00%> (-0.20%) ⬇️
api/src/main/java/marquez/db/DatasetDao.java 98.64% <0.00%> (ø)
api/src/main/java/marquez/service/models/Run.java 82.50% <0.00%> (ø)
api/src/main/java/marquez/service/models/Node.java 61.53% <0.00%> (ø)
...pi/src/main/java/marquez/db/DatasetVersionDao.java 95.83% <0.00%> (ø)
.../java/marquez/db/models/ColumnLineageNodeData.java 100.00% <0.00%> (ø)
api/src/main/java/marquez/db/JobFacetsDao.java
api/src/main/java/marquez/db/RunFacetsDao.java
.../java/marquez/db/migrations/V56_1__FacetViews.java
... and 9 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski
Copy link
Collaborator

I was confused a couple of times by a pattern in Marquez when a table has both columns namespace_uuid and namespace_name. Is there any reason for that? I found it confusing as it is hard to say which one to use and which one is indexed. The same applies for jobs_fqn. Are we able to drop one of those fields?

I think I would be in favour of:

  • dropping namespace_name,
  • creating index on namespace_uuid,
  • joining this table with namespaces table.

Just in case someone would like the ability to rename namespaces in future ;-)

@collado-mike
Copy link
Collaborator Author

collado-mike commented Jan 14, 2023

I was confused a couple of times by a pattern in Marquez when a table has both columns namespace_uuid and namespace_name. Is there any reason for that? I found it confusing as it is hard to say which one to use and which one is indexed. The same applies for jobs_fqn. Are we able to drop one of those fields?

TBH, I don't know when that pattern started. I assume it was to easily query the jobs/runs tables with namespaces without having to incur the cost of a join. Given the tiny size of the namespaces table and the fact that we always query by a single namespace, it was probably an unnecessary optimization.

Here are two EXPLAIN plans for a jobs_fqn table with ~500,000 rows.

SELECT * FROM jobs_fqn
WHERE namespace_name='abcdefg'
AND job_fqn='a_job_name';
Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn  (cost=0.55..8.57 rows=1 width=170)
  Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'a_job_name'::text))
SELECT j.* FROM jobs_fqn j
INNER JOIN namespaces n ON j.namespace_uuid=n.uuid
WHERE n.name='abcdefg'
AND job_fqn='a_job_name';
Gather  (cost=1008.30..19286.50 rows=1 width=170)
  Workers Planned: 2
  ->  Hash Join  (cost=8.30..18286.40 rows=1 width=170)
        Hash Cond: (j.namespace_uuid = n.uuid)
        ->  Parallel Seq Scan on jobs_fqn j  (cost=0.00..18278.08 rows=5 width=170)
              Filter: ((job_fqn)::text = 'a_job_name'::text)
        ->  Hash  (cost=8.29..8.29 rows=1 width=16)
              ->  Index Scan using namespaces_name_key on namespaces n  (cost=0.28..8.29 rows=1 width=16)
                    Index Cond: ((name)::text = 'abcdefg'::text)

Technically, the cost of the join is super high. In reality, the two queries execute in practically the same amount of time.

@pawel-big-lebowski
Copy link
Collaborator

@collado-mike That's interesting. Seems like postgres prefers filtering first job_fqn='a_job_name' which is done as sequential scan as there is no index on job_fqn. Isn't that a root cause?

@wslulciuc
Copy link
Member

@pawel-big-lebowski, @collado-mike: I can provide some background on the denormalization efforts on some of the tables. In order to reduce (possibly expensive) joins, some tables have redundant data. One example being the jobs table since it often referenced the namespaces table. In #935, we see the denormalization of namespace and job names. Now, a lot has changed since and I think it's worth revisiting this pattern and favor normalization assuming query performance isn't affect.

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛳️ it!

@collado-mike
Copy link
Collaborator Author

Seems like postgres prefers filtering first job_fqn='a_job_name' which is done as sequential scan as there is no index on job_fqn. Isn't that a root cause?

I did test out an index on namespace_uuid instead of namespace_name. There is still a slight performance gain by avoiding the join.

Here's the plan if I do a join on the two tables along with an index on namespace_uuid and job_fqn.

Nested Loop  (cost=1.26..25.32 rows=1 width=364)
  ->  Nested Loop  (cost=0.83..16.88 rows=1 width=166)
        ->  Index Scan using namespaces_name_key on namespaces n  (cost=0.28..8.30 rows=1 width=16)
              Index Cond: ((name)::text = 'abcdefg'::text)
        ->  Index Scan using jobs_fqn_namespace_uuid_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=166)
              Index Cond: ((namespace_uuid = n.uuid) AND ((job_fqn)::text = 'a_job_name'::text))
  ->  Index Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=200)
        Index Cond: (uuid = f.uuid)

Here's the plan with the proposed index.

Nested Loop  (cost=0.98..17.01 rows=1 width=364)
  ->  Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=166)
        Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'a_job_name'::text))
  ->  Index Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=200)
        Index Cond: (uuid = f.uuid)

Again, it is a slight gain, though TBH, I think it's probably an unnecessary optimization. But I don't propose we start undoing this pattern in this PR. I think that's a job that's better left to a dedicated task with research into all of the impacted queries.

@collado-mike collado-mike merged commit f260382 into main Feb 9, 2023
@collado-mike collado-mike deleted the add_jobs_fqn_index branch February 9, 2023 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants