Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Refacto run dao sql query #2685

Conversation

sophiely
Copy link
Contributor

@sophiely sophiely commented Nov 16, 2023

Problem

Closes: 2686

The SQL query run for listing all runs is quite slow and often result in a time out.

Solution

The logic is quite the same as before, the heavy operation is all the 4 JSON_AGG so we just need to filter first and apply this operation only on the relevent rows

run_uuid IN (SELECT uuid FROM runs_view WHERE job_uuid IN (SELECT uuid FROM filtered_jobs)).

For a given namespace and job name and a db.t4g.medium (vCPU: 2, RAM: 4 GB) machine:

  • The old query takes: 1 minute 14 seconds
  • The new query takes: 1.919 seconds
WITH filtered_jobs AS (
            SELECT
                jv.uuid,
                jv.namespace_name,
                jv.name
            FROM jobs_view jv
            WHERE jv.namespace_name=:namespace AND (jv.name=:jobName OR :jobName = ANY(jv.aliases))
          ),
          run_facets_agg AS (
            SELECT
                run_uuid,
                JSON_AGG(facet ORDER BY lineage_event_time ASC) AS facets
            FROM run_facets_view
            WHERE
                run_uuid IN (SELECT uuid FROM runs_view WHERE job_uuid IN (SELECT uuid FROM filtered_jobs))
            GROUP BY run_uuid
          ),
          input_versions_agg AS (
               SELECT
                   im.run_uuid,
                   JSON_AGG(json_build_object('namespace', dv.namespace_name,
                        'name', dv.dataset_name,
                        'version', dv.version,
                        'dataset_version_uuid', dv.uuid
                   )) AS input_versions
               FROM runs_input_mapping im
               INNER JOIN dataset_versions dv ON im.dataset_version_uuid = dv.uuid
               WHERE
                   im.run_uuid IN (SELECT uuid FROM runs_view WHERE job_uuid IN (SELECT uuid FROM filtered_jobs))
               GROUP BY im.run_uuid
          ),
          output_versions_agg AS (
              SELECT
                  dv.run_uuid,
              JSON_AGG(json_build_object('namespace', namespace_name,
                                       'name', dataset_name,
                                       'version', version,
                                       'dataset_version_uuid', uuid
                                       )) AS output_versions
              FROM dataset_versions dv
              WHERE dv.run_uuid IN (SELECT uuid FROM runs_view WHERE job_uuid IN (SELECT uuid FROM filtered_jobs))
              GROUP BY dv.run_uuid
          ),
          dataset_facets_agg AS (
              SELECT
                  run_uuid,
                  JSON_AGG(json_build_object(
                      'dataset_version_uuid', dataset_version_uuid,
                      'name', name,
                      'type', type,
                      'facet', facet
                  ) ORDER BY created_at ASC) as dataset_facets
              FROM dataset_facets_view
              WHERE run_uuid IN (SELECT uuid FROM runs_view WHERE job_uuid IN (SELECT uuid FROM filtered_jobs))
              AND (type ILIKE 'output' OR type ILIKE 'input')
              GROUP BY run_uuid
          )
          SELECT
              r.*,
              ra.args,
              f.facets,
              jv.version AS job_version,
              ri.input_versions,
              ro.output_versions,
              df.dataset_facets
          FROM runs_view r
          INNER JOIN filtered_jobs fj ON r.job_uuid = fj.uuid
          LEFT JOIN run_facets_agg f ON r.uuid = f.run_uuid
          LEFT JOIN run_args ra ON ra.uuid = r.run_args_uuid
          LEFT JOIN job_versions jv ON jv.uuid = r.job_version_uuid
          LEFT JOIN input_versions_agg ri ON r.uuid = ri.run_uuid
          LEFT JOIN output_versions_agg ro ON r.uuid = ro.run_uuid
          LEFT JOIN dataset_facets_agg df ON r.uuid = df.run_uuid
          ORDER BY r.started_at DESC NULLS LAST
          LIMIT :limit OFFSET :offset

One-line summary:

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Nov 16, 2023
Copy link

netlify bot commented Nov 16, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit dc74376
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/65578daa94a5b60008c50e6a

Copy link

codecov bot commented Nov 16, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dddac31) 84.05% compared to head (dc74376) 84.05%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2685   +/-   ##
=========================================
  Coverage     84.05%   84.05%           
  Complexity     1379     1379           
=========================================
  Files           248      248           
  Lines          6297     6297           
  Branches        286      286           
=========================================
  Hits           5293     5293           
  Misses          851      851           
  Partials        153      153           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

)) AS input_versions
FROM runs_input_mapping im
INNER JOIN dataset_versions dv ON im.dataset_version_uuid = dv.uuid
WHERE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this where .. in (SELECT ...) necessary to gain performance boost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion yes because with this condition we only JSON_AGG (which is the bottleneck here) on the uuid that matters and not on the whole table.

For a given namespace and job name and a db.t4g.medium (vCPU: 2, RAM: 4 GB) machine:

The old query takes: 1 minute 14 seconds
The new query takes: 1.919 seconds
The new query without filter: More than 10 minutes (probably more, i stop it at ten minutes)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Could you then add some comment describing that it's done for this purpose.

It may be tempting for anyone in future to refactor this code and get rid of this. In such case, our tests will not notice the performance degradation...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ! ;)

@sophiely sophiely force-pushed the feat/refacto-list-run-dao-query branch from e440afe to ab4f42d Compare November 16, 2023 13:00
Signed-off-by: sophiely <[email protected]>
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@pawel-big-lebowski pawel-big-lebowski merged commit 1019fb9 into MarquezProject:main Nov 20, 2023
16 checks passed
@wslulciuc wslulciuc added this to the 0.43.0 milestone Dec 13, 2023
@sophiely sophiely deleted the feat/refacto-list-run-dao-query branch July 25, 2024 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[PERF] List run query Timeout
4 participants