refactor: Combined query to fetch dimension values #1487

bprusinowski · 2024-05-02T09:30:40Z

This PR explores the feasibility of fetching values for multiple dimensions in a single SPARQL query.

Constraints

We need to be able to filter each dimension individually because of the cascading filters behavior. This could be achieved with FILTER(IF(?dimensionIri = <A>, ?dimensionIri = <a> && ?dimensionIri = <a>, ?dimensionIri))
We need to be able to unversion dimension values per individual dimension. We can't unversion values outside of the individual SELECT queries, because schema:sameAs is not always used to indicate unversioned values; that is only the case when a dimension is versioned.

We need to combine the individual queries into a big one (at least that's my current assumption).
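A minimal sketch of how a per-dimension FILTER along these lines could be generated. It is not the PR's code: the helper name, the `?value` variable and the use of `IN` with a `true` fallback are assumptions made for illustration, and the expression differs from the one quoted above.

```typescript
// Hypothetical helper: builds one FILTER that restricts ?value only for rows
// whose ?dimensionIri has an entry in `filters`. All names are illustrative.
function buildPerDimensionFilter(filters: Record<string, string[]>): string {
  // Ignore dimensions that have no allowed values configured.
  const dimensionIris = Object.keys(filters).filter(
    (iri) => filters[iri].length > 0
  );
  if (dimensionIris.length === 0) {
    return "";
  }
  // Nest IF expressions; rows of an unlisted dimension fall through to `true`.
  const expression = dimensionIris.reduceRight((otherwise, iri) => {
    const allowed = filters[iri].map((v) => `<${v}>`).join(", ");
    return `IF(?dimensionIri = <${iri}>, ?value IN (${allowed}), ${otherwise})`;
  }, "true");
  return `FILTER(${expression})`;
}
```

Each additional dimension adds one more nested IF, so dimensions without an entry stay unfiltered.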

vercel · 2024-05-02T09:30:46Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
visualization-tool	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 14, 2024 6:35am

Rdataflow · 2024-05-02T09:38:56Z

@bprusinowski can you share an example of a poorly performing query? Thank you

bprusinowski · 2024-05-02T10:08:15Z

@Rdataflow sure, this is a query to fetch dimension values for every dimension in the Bathing water quality cube (takes ~12s). Compare this to the queries fired on PROD: https://visualize.admin.ch/en/create/lmbY5klvYJAm?dataSource=Prod&flag__debug=true&flag__server-side-cache.disable=true (takes 1.4s for 5 queries, one per dimension, fired in parallel).

The above example is for a non-filtered query (fetches all dimension values), but we also need to be able to filter and unversion each dimension separately (see PR description). This is why I think we need to combine individual queries like this – let me know if it's enough information or if you see some other direction to try.

Rdataflow · 2024-05-02T11:37:33Z

@bprusinowski it seems those OPTIONALs for schema:version have a negative impact. You may try to further parallelize using a query pattern of

{ 
    # versioned case
    ...
    ?dimension schema:version ?version 
    ... 
} UNION { 
    # nonversioned case
    ... 
    FILTER NOT EXISTS { ?dimension schema:version ?version } 
    ... 
}

i.e. like https://s.zazuko.com/HenZ6K
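The pattern above could be generated with a small template helper like the sketch below. The function names are hypothetical, and the two body arguments stand in for the parts elided as `...` in the pattern; nothing here is taken from the PR's code.

```typescript
// Indents every line of a SPARQL fragment by two spaces.
function indentLines(body: string): string {
  return body
    .split("\n")
    .map((line) => "  " + line)
    .join("\n");
}

// Hypothetical helper: wraps two caller-supplied bodies into the
// versioned / non-versioned UNION, replacing OPTIONAL { ... schema:version ... }.
function buildVersionUnion(
  versionedBody: string,
  nonVersionedBody: string
): string {
  return [
    "{",
    "  # versioned case",
    "  ?dimension schema:version ?version .",
    indentLines(versionedBody),
    "} UNION {",
    "  # nonversioned case",
    "  FILTER NOT EXISTS { ?dimension schema:version ?version }",
    indentLines(nonVersionedBody),
    "}",
  ].join("\n");
}
```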

HTH for self guiding the next steps 👍

bprusinowski · 2024-05-06T11:27:39Z

Hey @Rdataflow, I optimized the query based on your suggestion and fixed some issues, it seems to fetch the correct values now.

However, it looks like it performs worse than the current approach of parallel queries, for both smaller and bigger cubes. You can see some examples for the Photovoltaikanlagen (TEST, PR) and NFI: Change (TEST, PR) cubes (both using the INT data source so as not to rely on the cached endpoint, and with the server-side cache disabled).

It looks like the new approach is ~2x slower than what we currently have on TEST. I will take a deeper look to see if there's something obvious to optimize, but it would be great if you could also take a look in case you have a bit of time.

Let me know what you think about the whole direction of combining the queries 👀

Rdataflow · 2024-05-06T11:41:43Z

@bprusinowski unfortunately those /create/ links won't endure... do you have some permalinks maybe?

Rdataflow · 2024-05-06T11:42:31Z

i.e. those /create/new?from=<published> would work IIUC

bprusinowski · 2024-05-06T11:53:18Z

@Rdataflow of course – I updated the links, should be ok now (I blame it on Monday morning 🤦 😅)

Rdataflow · 2024-05-06T13:34:00Z

@bprusinowski what happens if you drop #pragma evaluate on everywhere? with the query in the current form this might help. curious to see...

bprusinowski · 2024-05-06T14:59:39Z

@Rdataflow I did some tests to fire the query "for the first time", to avoid some apparent caching on the LINDAS side (I did this by modifying the query, e.g. removing the retrieval of color, so that the query looks new); it still looks like #pragma improves the situation a bit (1.8s with pragma vs 2.2s without) for the Traffic noise pollution cube. The timings change if I "comment out" other properties, but the delta always seems to be in favor of #pragma...

Rdataflow · 2024-05-06T18:55:16Z

@bprusinowski the query pattern would still benefit from some minor optimization steps...
The 3rd and 4th UNION blocks of each dimension may be optimized using an inner SELECT, which prevents ?observation ?dimensionIri ?versionedValue from being evaluated when FILTER NOT EXISTS { ... } leaves the inner result set empty.

This might look like, e.g.:

# 3rd UNION block on a dimension
  {
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          ?dimension schema:version ?version .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
      ?observation ?dimensionIri ?versionedValue .
      ?versionedValue schema:sameAs ?unversionedValue .
    }
  }
  UNION
# 4th UNION block on a dimension
  {
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          FILTER NOT EXISTS {
            ?dimension schema:version ?version .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      # repeated as in the 3rd block: the inner SELECT projects only ?observation
      VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
      ?observation ?dimensionIri ?versionedValue .
      BIND(?versionedValue AS ?unversionedValue)
    }
  }

Then, as the innermost SELECT ?observation becomes empty (true for many dimensions), the UNION blocks are faster now 😄

nb: or, in case you prefer the #pragmas, move those pragmas to the inner SELECT ?observation
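To produce such a block for each dimension, the 3rd UNION block could be templated roughly as follows. This is a sketch only: the function name and parameters are hypothetical, prefixes are assumed to be declared elsewhere in the query, and the body mirrors the SPARQL shown above.

```typescript
// Hypothetical helper mirroring the 3rd UNION block (versioned dimension
// without sh:in). Not taken from the PR's code.
function buildVersionedValuesBlock(cubeIri: string, dimensionIri: string): string {
  // Pin ?dimensionIri both inside and outside the inner SELECT, because the
  // inner SELECT projects only ?observation.
  const values = `VALUES ?dimensionIri { <${dimensionIri}> }`;
  return `{
  SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
  WHERE {
    {
      # Inner SELECT: returns no rows unless the dimension is versioned
      # and has no sh:in list.
      SELECT ?observation
      WHERE {
        ${values}
        <${cubeIri}> cube:observationConstraint/sh:property ?dimension .
        ?dimension sh:path ?dimensionIri .
        ?dimension schema:version ?version .
        FILTER NOT EXISTS { ?dimension sh:in ?in . }
        <${cubeIri}> cube:observationSet/cube:observation ?observation .
      }
    }
    ${values}
    ?observation ?dimensionIri ?versionedValue .
    ?versionedValue schema:sameAs ?unversionedValue .
  }
}`;
}
```

Calling it with the cube and dimension IRIs from the example above should reproduce the 3rd block, up to whitespace and the added comment.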

bprusinowski · 2024-05-07T13:23:33Z

Thanks again @Rdataflow for optimizing the query 💯

Unfortunately, it looks like it still performs significantly worse than the queries we have on TEST / INT / PROD. See this combined query that takes 12-13s – the same cube on TEST takes ~7-8s to load values for every dimension when the queries are fired separately.

I think we might reach out to Zazuko, seeing that the approach we're currently trying doesn't seem to improve things – does that sound good? Maybe I'm missing some additional context, but knowing that we'll use a cached endpoint that will already take a lot of computing load off Stardog, I am not sure it's worth sacrificing 50% of performance (assuming it scales linearly 😅 – but even if not, an overhead of 4s for NFI cubes is noticeable) just to send a smaller number of queries.
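As a rough check of the numbers above, the overhead and slowdown can be computed from the two timing ranges. The helper below is purely illustrative (its name and types are not from the PR) and only restates the quoted 12-13s and ~7-8s figures.

```typescript
interface SecondsRange {
  min: number;
  max: number;
}

// Hypothetical helper: best-case and worst-case gap between two timing ranges.
function compareTimings(combined: SecondsRange, separate: SecondsRange) {
  return {
    // Smallest gap: fastest combined run vs slowest separate run, and vice versa.
    overheadSeconds: {
      min: combined.min - separate.max,
      max: combined.max - separate.min,
    },
    slowdownFactor: {
      min: combined.min / separate.max,
      max: combined.max / separate.min,
    },
  };
}
```

For `compareTimings({ min: 12, max: 13 }, { min: 7, max: 8 })` this gives an overhead of 4 to 6 seconds and a slowdown factor between 1.5 and roughly 1.86, which matches the "50%" and "4s" figures as the lower bound.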

Let me know what you think @Rdataflow :)

cc @sosiology @adintegra

Rdataflow · 2024-05-26T06:22:12Z

@bprusinowski the proposed query fails to constrain the dimensionIri to the relevant dimension only, so its performance is heavily degraded

see comments inline
https://s.zazuko.com/23hyB45

nb: regarding perf on TEST see VSHN SBAR-1122 and comment inline
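One way to catch this kind of omission when assembling the combined query is a guard like the sketch below. It is an assumption-laden illustration: `buildBlock` stands in for whatever generates one dimension's query part, and the check is a plain, whitespace-sensitive string match rather than real SPARQL parsing.

```typescript
// Hypothetical guard: joins per-dimension blocks with UNION, but refuses to
// do so if a block does not pin ?dimensionIri to its own dimension.
function combineDimensionBlocks(
  dimensionIris: string[],
  buildBlock: (dimensionIri: string) => string
): string {
  const blocks = dimensionIris.map((dimensionIri) => {
    const block = buildBlock(dimensionIri);
    const constraint = `VALUES ?dimensionIri { <${dimensionIri}> }`;
    if (block.indexOf(constraint) === -1) {
      throw new Error(`Block for ${dimensionIri} does not constrain ?dimensionIri`);
    }
    return block;
  });
  return blocks.join("\nUNION\n");
}
```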

cc @sosiology @adintegra

bprusinowski added 2 commits May 2, 2024 10:29

refactor: Create wrapper for dimension values loading

7443d0d

refactor: Fetch values for every dimension at the same time

0a89525

vercel bot deployed to Preview May 2, 2024 09:35 View deployment

perf: Remove OPTIONAL

e8ba4a9

vercel bot deployed to Preview May 3, 2024 10:11 View deployment

fix: Need to unpack metadata from versioned dimension values

2b3440e

vercel bot deployed to Preview May 6, 2024 11:11 View deployment

perf: Optimize dimensions values query

b59063c

vercel bot deployed to Preview May 7, 2024 07:52 View deployment

perf: Constraint relevant dimension values query parts to one dimension

d099617

vercel bot deployed to Preview June 14, 2024 06:35 View deployment

bprusinowski marked this pull request as ready for review June 14, 2024 07:05

bprusinowski requested a review from ptbrowne as a code owner June 14, 2024 07:05

bprusinowski merged commit a7d9a19 into main Jun 21, 2024
5 of 7 checks passed

bprusinowski deleted the refactor/combined-components-values-query branch June 21, 2024 07:24

bprusinowski mentioned this pull request Jun 21, 2024

Rework Components query to query multiple dimensions at once #1470

Closed

2 tasks

Rdataflow mentioned this pull request Jul 9, 2024

possible filter and components query needlessly fires twice - and takes a very long time #1658

Open
