Add index on jobs_fqn namespace and fqn to optimize read queries #2357

collado-mike · 2023-01-12T20:53:24Z

Problem

An index of the jobs table's name and namespace_name has existed for a long time, but there was no index added to the newer jobs_fqn table's namespace_name and job_fqn columns.

Solution

Adds an index on the jobs_fqn table using the namespace_name and job_fqn columns.

EXPLAIN plans on a Marquez installation with ~100,000 jobs for the query

SELECT uuid FROM jobs_view j WHERE j.namespace_name=:namespace AND j.name=:name

Original EXPLAIN plan:

Nested Loop  (cost=1000.42..18554.76 rows=1 width=16)
  ->  Gather  (cost=1000.00..18546.32 rows=1 width=16)
        Workers Planned: 2
        ->  Parallel Seq Scan on jobs_fqn f  (cost=0.00..17546.22 rows=1 width=16)
              Filter: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'example_job'::text))
  ->  Index Only Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=16)
        Index Cond: (uuid = f.uuid)

New EXPLAIN plan:

Nested Loop  (cost=0.97..17.01 rows=1 width=16)
  ->  Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=16)
        Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'example_job'::text))
  ->  Index Only Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=16)
        Index Cond: (uuid = f.uuid)

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

codecov · 2023-01-12T21:01:00Z

Codecov Report

Merging #2357 (67930ce) into main (996e834) will decrease coverage by 0.40%.
The diff coverage is n/a.

❗ Current head 67930ce differs from pull request most recent head f772621. Consider uploading reports for the commit f772621 to get more accurate results

@@             Coverage Diff              @@
##               main    #2357      +/-   ##
============================================
- Coverage     77.11%   76.72%   -0.40%     
+ Complexity     1234     1177      -57     
============================================
  Files           228      222       -6     
  Lines          5572     5354     -218     
  Branches        447      429      -18     
============================================
- Hits           4297     4108     -189     
+ Misses          775      768       -7     
+ Partials        500      478      -22

Impacted Files	Coverage Δ
...main/java/marquez/service/models/LineageEvent.java	`86.30% <0.00%> (-8.22%)`	⬇️
api/src/main/java/marquez/db/OpenLineageDao.java	`96.09% <0.00%> (-0.20%)`	⬇️
api/src/main/java/marquez/db/DatasetDao.java	`98.64% <0.00%> (ø)`
api/src/main/java/marquez/service/models/Run.java	`82.50% <0.00%> (ø)`
api/src/main/java/marquez/service/models/Node.java	`61.53% <0.00%> (ø)`
...pi/src/main/java/marquez/db/DatasetVersionDao.java	`95.83% <0.00%> (ø)`
.../java/marquez/db/models/ColumnLineageNodeData.java	`100.00% <0.00%> (ø)`
api/src/main/java/marquez/db/JobFacetsDao.java
api/src/main/java/marquez/db/RunFacetsDao.java
.../java/marquez/db/migrations/V56_1__FacetViews.java
... and 9 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pawel-big-lebowski · 2023-01-13T07:45:49Z

I was confused a couple of times by a pattern in Marquez when a table has both columns namespace_uuid and namespace_name. Is there any reason for that? I found it confusing as it is hard to say which one to use and which one is indexed. The same applies for jobs_fqn. Are we able to drop one of those fields?

I think I would be in favour of:

dropping namespace_name,
creating index on namespace_uuid,
joining this table with namespaces table.

Just in case someone would like the ability to rename namespaces in future ;-)

collado-mike · 2023-01-14T00:35:18Z

I was confused a couple of times by a pattern in Marquez when a table has both columns namespace_uuid and namespace_name. Is there any reason for that? I found it confusing as it is hard to say which one to use and which one is indexed. The same applies for jobs_fqn. Are we able to drop one of those fields?

TBH, I don't know when that pattern started. I assume it was to easily query the jobs/runs tables with namespaces without having to incur the cost of a join. Given the tiny size of the namespaces table and the fact that we always query by a single namespace, it was probably an unnecessary optimization.

Here are two EXPLAIN plans for a jobs_fqn table with ~500,000 rows.

SELECT * FROM jobs_fqn
WHERE namespace_name='abcdefg'
AND job_fqn='a_job_name';

Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn  (cost=0.55..8.57 rows=1 width=170)
  Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'a_job_name'::text))

SELECT j.* FROM jobs_fqn j
INNER JOIN namespaces n ON j.namespace_uuid=n.uuid
WHERE n.name='abcdefg'
AND job_fqn='a_job_name';

Gather  (cost=1008.30..19286.50 rows=1 width=170)
  Workers Planned: 2
  ->  Hash Join  (cost=8.30..18286.40 rows=1 width=170)
        Hash Cond: (j.namespace_uuid = n.uuid)
        ->  Parallel Seq Scan on jobs_fqn j  (cost=0.00..18278.08 rows=5 width=170)
              Filter: ((job_fqn)::text = 'a_job_name'::text)
        ->  Hash  (cost=8.29..8.29 rows=1 width=16)
              ->  Index Scan using namespaces_name_key on namespaces n  (cost=0.28..8.29 rows=1 width=16)
                    Index Cond: ((name)::text = 'abcdefg'::text)

Technically, the cost of the join is super high. In reality, the two queries execute in practically the same amount of time.

pawel-big-lebowski · 2023-01-16T07:39:22Z

@collado-mike That's interesting. Seems like postgres prefers filtering first job_fqn='a_job_name' which is done as sequential scan as there is no index on job_fqn. Isn't that a root cause?

wslulciuc · 2023-02-01T22:49:11Z

@pawel-big-lebowski, @collado-mike: I can provide some background on the denormalization efforts on some of the tables. In order to reduce (possibly expensive) joins, some tables have redundant data. One example being the jobs table since it often referenced the namespaces table. In #935, we see the denormalization of namespace and job names. Now, a lot has changed since and I think it's worth revisiting this pattern and favor normalization assuming query performance isn't affect.

wslulciuc

🛳️ it!

collado-mike · 2023-02-01T22:56:05Z

Seems like postgres prefers filtering first job_fqn='a_job_name' which is done as sequential scan as there is no index on job_fqn. Isn't that a root cause?

I did test out an index on namespace_uuid instead of namespace_name. There is still a slight performance gain by avoiding the join.

Here's the plan if I do a join on the two tables along with an index on namespace_uuid and job_fqn.

Nested Loop  (cost=1.26..25.32 rows=1 width=364)
  ->  Nested Loop  (cost=0.83..16.88 rows=1 width=166)
        ->  Index Scan using namespaces_name_key on namespaces n  (cost=0.28..8.30 rows=1 width=16)
              Index Cond: ((name)::text = 'abcdefg'::text)
        ->  Index Scan using jobs_fqn_namespace_uuid_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=166)
              Index Cond: ((namespace_uuid = n.uuid) AND ((job_fqn)::text = 'a_job_name'::text))
  ->  Index Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=200)
        Index Cond: (uuid = f.uuid)

Here's the plan with the proposed index.

Nested Loop  (cost=0.98..17.01 rows=1 width=364)
  ->  Index Scan using jobs_fqn_namespace_name_job_fqn_idx on jobs_fqn f  (cost=0.55..8.57 rows=1 width=166)
        Index Cond: (((namespace_name)::text = 'abcdefg'::text) AND ((job_fqn)::text = 'a_job_name'::text))
  ->  Index Scan using jobs_pkey on jobs j  (cost=0.42..8.44 rows=1 width=200)
        Index Cond: (uuid = f.uuid)

Again, it is a slight gain, though TBH, I think it's probably an unnecessary optimization. But I don't propose we start undoing this pattern in this PR. I think that's a job that's better left to a dedicated task with research into all of the impacted queries.

Signed-off-by: Michael Collado <[email protected]>

collado-mike requested a review from wslulciuc January 12, 2023 20:53

collado-mike assigned wslulciuc Jan 12, 2023

boring-cyborg bot added the api API layer changes label Jan 12, 2023

wslulciuc approved these changes Feb 1, 2023

View reviewed changes

Add index on jobs_fqn namespace and fqn to optimize read queries

f772621

Signed-off-by: Michael Collado <[email protected]>

collado-mike force-pushed the add_jobs_fqn_index branch from 67930ce to f772621 Compare February 1, 2023 23:08

collado-mike enabled auto-merge (squash) February 1, 2023 23:08

Merge branch 'main' into add_jobs_fqn_index

d73e096

pawel-big-lebowski self-requested a review February 9, 2023 07:33

pawel-big-lebowski approved these changes Feb 9, 2023

View reviewed changes

Merge branch 'main' into add_jobs_fqn_index

4d24716

collado-mike merged commit f260382 into main Feb 9, 2023

collado-mike deleted the add_jobs_fqn_index branch February 9, 2023 15:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add index on jobs_fqn namespace and fqn to optimize read queries #2357

Add index on jobs_fqn namespace and fqn to optimize read queries #2357

collado-mike commented Jan 12, 2023

codecov bot commented Jan 12, 2023 •

edited

Loading

pawel-big-lebowski commented Jan 13, 2023

collado-mike commented Jan 14, 2023 •

edited

Loading

pawel-big-lebowski commented Jan 16, 2023

wslulciuc commented Feb 1, 2023

wslulciuc left a comment

collado-mike commented Feb 1, 2023

Add index on jobs_fqn namespace and fqn to optimize read queries #2357

Add index on jobs_fqn namespace and fqn to optimize read queries #2357

Conversation

collado-mike commented Jan 12, 2023

Problem

Solution

Checklist

codecov bot commented Jan 12, 2023 • edited Loading

Codecov Report

pawel-big-lebowski commented Jan 13, 2023

collado-mike commented Jan 14, 2023 • edited Loading

pawel-big-lebowski commented Jan 16, 2023

wslulciuc commented Feb 1, 2023

wslulciuc left a comment

Choose a reason for hiding this comment

collado-mike commented Feb 1, 2023

codecov bot commented Jan 12, 2023 •

edited

Loading

collado-mike commented Jan 14, 2023 •

edited

Loading