
Querying Run Tags Table Causing Slow Down #18269

Open
ChristopherFyfe opened this issue Nov 27, 2023 · 2 comments
Labels
area: deployment (Related to deploying Dagster) · type: bug (Something isn't working)

Comments

@ChristopherFyfe

Dagster version

1.4.0

What's the issue?

The query that updates the run tags table is slow and is causing unusually high CPU usage.

The following query:
`UPDATE run_tags SET value=? WHERE run_tags.run_id = ? AND run_tags.key = ?`

Many executions of this query are writing a value for the "dagster/image" tag.

The query performs a full table scan of our run tags table, which is approximately 21 GB.
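
A minimal sketch of how such a scan can be confirmed from psql, assuming placeholder values (the run id and tag value below are made up, not taken from the affected deployment; BEGIN/ROLLBACK keeps the EXPLAIN ANALYZE from persisting the write):

```sql
-- Inspect the plan for the slow UPDATE without keeping its effect.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE run_tags
SET value = 'my-registry/my-image:latest'
WHERE run_tags.run_id = '00000000-0000-0000-0000-000000000000'
  AND run_tags.key = 'dagster/image';
ROLLBACK;
-- A "Seq Scan on run_tags" node in the plan means no index covers
-- (run_id, key), so every row in the table is visited.
```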

What did you expect to happen?

Slow queries against the run tags table are maxing out our CPU usage, which in turn slows down all other queries against the database. Queries against this table should not cause such a slowdown.

How to reproduce?

No response

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@ChristopherFyfe added the "type: bug" label Nov 27, 2023
@ChristopherFyfe changed the title from "Run Tags Table Updating Too Slow to Update" to "Querying Run Tags Table Causing Slow Down" Nov 27, 2023
@smackesey (Collaborator)

cc @prha

@smackesey added the "area: deployment" label Nov 27, 2023
@stefkauff

BTW, for now we added an index manually on these fields, which helps to speed up the query. What we don't understand yet is why "dagster/image" needs to be updated at all, as it effectively never changes and could be known at insert time.
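
A plausible shape for such a manually added index, assuming a composite btree on the two filtered columns (the commenter's exact definition is not given in the thread):

```sql
-- Assumed form of the manual index: covers the columns the slow
-- UPDATE filters on. CONCURRENTLY avoids blocking writes during the
-- build, at the cost of a slower build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS run_tags_run_id_key_idx
    ON run_tags (run_id, key);
```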

prha pushed a commit that referenced this issue Jul 24, 2024
… for better performance. (#22833)

## Summary & Motivation
While running jobs on frequent schedules, we have noticed that as the number of runs grows, some UI operations become very slow. Looking at AWS monitoring, we see that one query in particular seems to be very slow:

![Screenshot_20240703_151316](https://github.com/dagster-io/dagster/assets/4254771/23040062-030b-4161-a43a-8667cbef4d56)

By analyzing the query, we noticed that it is related to run tag filtering. Here is an example of such a query with the parameters filled in:
```sql
EXPLAIN (ANALYSE, BUFFERS) SELECT runs.id,
       runs.run_body,
       runs.status,
       runs.create_timestamp,
       runs.update_timestamp,
       runs.start_time,
       runs.end_time
FROM runs
WHERE runs.run_id IN (SELECT run_tags.run_id
                      FROM run_tags
                      WHERE
                          run_tags.key = 'dagster/schedule_name' AND run_tags.value = 'quick_partitioned_job_schedule'
                         OR run_tags.key = '.dagster/repository' AND
                            run_tags.value = '__repository__@example-code'
                      GROUP BY run_tags.run_id
                      HAVING count(DISTINCT run_tags.key) = 2)
ORDER BY runs.id DESC
LIMIT 1;
```

I believe there are multiple issues discussing this:
* #18269
* #19003

Looking at the query plan:
https://explain.dalibo.com/plan/2c1bga585e8ca45f

![image](https://github.com/dagster-io/dagster/assets/4254771/0d8fc125-ac67-4a1b-8f37-7ff69cbfa81f)

We notice that the subquery scans a lot of rows (which is correct, as we have a lot of runs with the same tags), but afterwards the filter on runs is very slow and discards a lot of rows. A lot of work is done to retrieve only the single row with the highest matching run id, which feels like it could be much more efficient.

To improve the performance of these types of queries, I would like to propose two changes:

1. Replace the subquery with multiple joins. I would expect this to produce a much flatter execution plan and thus the potential for earlier filtering.

The example query would result in something like this:
```sql
EXPLAIN (ANALYSE, BUFFERS)
SELECT runs.id,
       runs.run_body,
       runs.status,
       runs.create_timestamp,
       runs.update_timestamp,
       runs.start_time,
       runs.end_time
FROM runs
JOIN public.run_tags r on runs.run_id = r.run_id AND r.key = 'dagster/schedule_name' AND r.value = 'quick_partitioned_job_schedule'
JOIN public.run_tags r2 on runs.run_id = r2.run_id AND r2.key = '.dagster/repository' AND r2.value = '__repository__@example-code'
ORDER BY runs.id DESC
LIMIT 1;
```

2. As mentioned in one of the referenced threads, add an index on run_id for run tags. This would make the joins in (1) much faster.

```sql
CREATE UNIQUE INDEX run_tags_run_idx ON public.run_tags USING btree (run_id, id);
```

This PR implements both changes in Dagster.

## How I Tested These Changes

I tested these changes by first running the test suite to make sure they don't break Dagster. Then I set up a local benchmark: I populated the Dagster instance with 5.4 million runs and 10.7 million related run tags.

Afterwards I applied the proposed changes and measured their performance. Each query was run five times and the timing of the fifth run was used, to make sure all the data was in shared buffers and the comparison was fair.
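
For reference, a self-contained sketch of how a benchmark of this shape could be populated with toy tables and generate_series (this is not the author's actual setup script; table and column names are stand-ins for the real schema):

```sql
-- Toy stand-ins for runs / run_tags at roughly the stated scale:
-- 5.4M runs and 10.7M tags (about two tags per run).
CREATE TABLE bench_runs (
    id     BIGSERIAL PRIMARY KEY,
    run_id TEXT NOT NULL UNIQUE
);
CREATE TABLE bench_run_tags (
    id     BIGSERIAL PRIMARY KEY,
    run_id TEXT NOT NULL,
    key    TEXT NOT NULL,
    value  TEXT NOT NULL
);

INSERT INTO bench_runs (run_id)
SELECT 'run-' || g FROM generate_series(1, 5400000) AS g;

-- Two passes over the runs: the first 5.4M tag rows carry the schedule
-- tag, the remaining rows carry the repository tag.
INSERT INTO bench_run_tags (run_id, key, value)
SELECT 'run-' || ((g - 1) % 5400000 + 1),
       CASE WHEN g <= 5400000 THEN 'dagster/schedule_name'
            ELSE '.dagster/repository' END,
       CASE WHEN g <= 5400000 THEN 'quick_partitioned_job_schedule'
            ELSE '__repository__@example-code' END
FROM generate_series(1, 10700000) AS g;

ANALYZE bench_runs;
ANALYZE bench_run_tags;
```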

Results

| Experiment | Query Plan Analysis | Runtime |
| --- | --- | --- |
| Baseline | [10.7M run tags, unoptimized - explain.dalibo.com](https://explain.dalibo.com/plan/2c1bga585e8ca45f) | 15.579 s |
| Query Optimization | [10.7M run tags, optimized query, no index - explain.dalibo.com](https://explain.dalibo.com/plan/agf3fabc77b5ce58) | 7.560 s |
| Query Optimization + Custom Index | [10.7M run tags, optimized query, with index - explain.dalibo.com](https://explain.dalibo.com/plan/dgf924f51g585ab5) | 0.076 s |

Interestingly, "Query Optimization" alone results in a more complex but faster query plan. "Query Optimization + Custom Index" results in the desired, much simpler query plan that performs far fewer reads.

Overall, the changes improve performance by almost 200x. The added index shouldn't introduce much overhead.

I also found that if run_tags used `runs.id` as the foreign key instead of `runs.run_id`, the query would be even faster (0.018s). But that change would be too large and could break things.
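
For illustration only, that alternative might look like the following; the `runs_id` column and its backfill are hypothetical and not part of Dagster's schema:

```sql
-- Hypothetical schema change (not implemented in the PR): reference
-- the integer primary key instead of the run_id text column.
ALTER TABLE run_tags ADD COLUMN runs_id BIGINT REFERENCES runs (id);
UPDATE run_tags t SET runs_id = r.id FROM runs r WHERE r.run_id = t.run_id;

-- The tag-filter join could then match on the integer key directly.
SELECT runs.id
FROM runs
JOIN run_tags rt
    ON runs.id = rt.runs_id
   AND rt.key = 'dagster/schedule_name'
   AND rt.value = 'quick_partitioned_job_schedule'
ORDER BY runs.id DESC
LIMIT 1;
```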

---------

Signed-off-by: Egor Dmitriev <[email protected]>
prha pushed a commit that referenced this issue Jul 24, 2024
A followup to #22833 adding an optimized query for tag filtering.
