Skip to content
This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Improve block querying with a binary join and sync iterator #807

Merged

Conversation

simonswine
Copy link
Collaborator

@simonswine simonswine commented Jun 30, 2023

This should avoid over reading at then end of the join (with Tempo's sync iterator) and also make sure it seeks before looking into the second iterator of the join. (Binary Join iterator).

@simonswine simonswine changed the title 20230630 binary join and sync iterator Improve block querying with a binary join and sync iterator Jun 30, 2023
@simonswine simonswine force-pushed the 20230630_binary-join-and-sync-iterator branch from 5fd4c58 to 9e40582 Compare July 6, 2023 08:29
for i := definitionLevel + 1; i < len(t); i++ {
t[i] = -1
// the following is nextSlow() unrolled
switch repetitionLevel {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what's going on here ? why 5 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be the highest definition level the otel schema has?

Probably @joe-elliott might be able to anwser that better.

In the grand scheme of things, the gain is probably not worth the unrolling for us, happy to put that back to the original (grafana/tempo#2412).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonswine is correct. it's just an arbitrary depth that is enough for our schema.

i don't recall the numbers but profiling heavy "needle in the haystack" queries showed a large %age of the time in this function (which is why we unrolled it).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine if we don't overflow our current definition/repetition level for the 3 column we use this iterator with.

Copy link
Collaborator Author

@simonswine simonswine Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is inline with limit imposed by RowNumber [6]int64

@simonswine simonswine force-pushed the 20230630_binary-join-and-sync-iterator branch 3 times, most recently from 6f92b4a to 38d0c1a Compare July 7, 2023 17:59
},
{
name: "time before results",
expectedResultCount: 0,
timePredicate: NewIntBetweenPredicate(-10, -1),
seriesPageReads: 1,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyriltovena We could prevent this single read if we would push down TimeNanos predicates into the SeriesID iterator.

I don't think we should do this as that PR though 😆

@simonswine simonswine force-pushed the 20230630_binary-join-and-sync-iterator branch from 38d0c1a to d34f635 Compare July 10, 2023 10:47
@simonswine
Copy link
Collaborator Author

The difference in the ColumnIterator vs SyncIterator is minuscule:

❯ benchstat before.txt after.txt
name                    old time/op    new time/op    delta
ColumnIterator/sync-12    17.9ms ± 5%    17.4ms ± 4%  -3.14%  (p=0.007 n=10+10)

name                    old alloc/op   new alloc/op   delta
ColumnIterator/sync-12    68.6MB ± 0%    65.7MB ± 0%  -4.25%  (p=0.000 n=10+10)

name                    old allocs/op  new allocs/op  delta
ColumnIterator/sync-12      200k ± 0%      200k ± 0%  -0.09%  (p=0.000 n=10+9)

But getting rid of the async aspect is beneficial stopping overreading.

A bigger impact is on the JoinIterator vs BinaryJoinIterator and the amount of pages read from a block:

  • It will now seekOrNext as soon as possible rather than reading the first page of every column and the move on.
  • Also it is quite beneficial doing a next instead of sync if the it is the adjacent row (that might be something we would rather have in the SyncIterator maybe)

Left --- / Right +++

image

The impact to the overall query time is no longer so significant, this was probably due a bug that missed out on some matches.

@simonswine simonswine marked this pull request as ready for review July 10, 2023 15:10
}
} else {
// the right value can't be smaller than the left one because we seeked beyond it
panic("not expected to happen")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the panic message more specific ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 48e524a

Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@simonswine simonswine merged commit 89d35b1 into grafana:main Jul 11, 2023
17 checks passed
simonswine added a commit to simonswine/pyroscope that referenced this pull request Jul 18, 2023
…phlare#807)

* Pulll tempo sync

* BinaryJoinIterator

* Remove unused code

* Make use of the BinaryJoinIterator and remove JoinIterator

* Correctly log the ReadPages on span

* Use load on the atomic page stats

* Get tests from tempo

* Move map predicate and base it off the min/max

This predicate still wouldn't work all that well with a fully SeriesID sorted
block (so accress row groups). But let's fix that separately

* Fix bug when we iterate left iterator based on a higher right result

* Test BinaryJoinIterator thoroughly

* Optimize when the Seek destination is only a single row away

* Fix tests comments

* Validate that page reads are as expected

* Fix closing and cancellation

A call to Next/Seek after a close should also yield a context cancelled.

* Do cancel it in the right place

* Improve panic messaging

* Add nextSlow and uncomment test
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants