[processor/interval] Persistence backed interval processor with support for multiple aggregation intervals #33949

Open · lahsivjar opened this issue Jul 8, 2024 · 10 comments
Labels: enhancement (New feature or request), needs triage (New item requiring triage), processor/interval

@lahsivjar (Contributor) commented Jul 8, 2024

Component(s)

processor/interval

Is your feature request related to a problem? Please describe.

The current version of the interval processor performs in-memory aggregation for a single configured interval (more than one interval can be supported only by adding extra processor instances, with the corresponding overhead). The in-memory aggregations can be lost on a crash and can also cause high memory usage for high-cardinality aggregations.

Describe the solution you'd like

I propose a database-backed aggregation that supports more than one interval. I have a working PoC for this here; it is rough around the edges and will require some work before it can be contributed, but it demonstrates the proposed idea. The PoC uses Pebble, a Log-Structured Merge-tree (LSM) based key-value database, and provides the option to aggregate over multiple intervals.

The linked PoC uses only the interval as the database key, which requires loading the full aggregated interval into memory. However, we could use more granular keys, or even simple hash-based partitioning on resource IDs, to reduce memory requirements even further.
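
As a rough illustration of the idea (not the actual PoC code; `mergeMetrics` and the encoding are hypothetical stand-ins for the real serialization), a Pebble merge operator that combines partial aggregates could look like this:

```go
package intervalpebble

import (
	"io"

	"github.com/cockroachdb/pebble"
)

// mergeMetrics is a hypothetical helper that decodes two serialized partial
// aggregates, merges their data points, and re-encodes the result.
func mergeMetrics(older, newer []byte) []byte {
	// ... decode both payloads, merge, re-encode ...
	return newer
}

// valueMerger implements pebble.ValueMerger; Pebble invokes it during
// compaction (and on reads that see pending merge operands).
type valueMerger struct{ buf []byte }

func (m *valueMerger) MergeNewer(value []byte) error {
	m.buf = mergeMetrics(m.buf, value)
	return nil
}

func (m *valueMerger) MergeOlder(value []byte) error {
	m.buf = mergeMetrics(value, m.buf)
	return nil
}

func (m *valueMerger) Finish(includesBase bool) ([]byte, io.Closer, error) {
	return m.buf, nil, nil
}

// openDB opens a Pebble DB whose merge operator aggregates metric payloads.
func openDB(dir string) (*pebble.DB, error) {
	return pebble.Open(dir, &pebble.Options{
		Merger: &pebble.Merger{
			Name: "interval_metrics_merger",
			Merge: func(key, value []byte) (pebble.ValueMerger, error) {
				return &valueMerger{buf: append([]byte(nil), value...)}, nil
			},
		},
	})
}
```

With a merge operator like this, the write path only appends merge operands, and Pebble folds them together lazily during compaction or on read.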

Describe alternatives you've considered

No response

Additional context

I would love to hear what the maintainers of the interval processor think about this proposal. If the maintainers are on board with the idea, I can clean up the PoC, bring it to feature parity with the current interval processor, and create a PR.

lahsivjar added the enhancement and needs triage labels on Jul 8, 2024
lahsivjar changed the title from "Persistence backed interval processor with support for multiple aggregation intervals" to "[processor/interval] Persistence backed interval processor with support for multiple aggregation intervals" on Jul 8, 2024

github-actions bot commented Jul 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski (Member) commented:

We have an established notion of a storage extension which allows components to persist data via an interface. Users can then choose the implementation of the interface (i.e. the type of storage extension) that meets their needs.

The use case you're describing seems reasonable but can we please first consider whether it could work with a storage extension? There are several examples of components using the extension - this might be a relatively straightforward one.
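
For reference, the storage extension's client interface is roughly the following (paraphrased from the collector's experimental storage package; see the actual package for the authoritative definition):

```go
// Paraphrased sketch of the storage extension's Client interface; not a
// verbatim copy of the collector source.
type Client interface {
	// Get returns the stored value for key, or nil if the key is absent.
	Get(ctx context.Context, key string) ([]byte, error)
	// Set stores value under key, overwriting any previous value.
	Set(ctx context.Context, key string, value []byte) error
	// Delete removes key.
	Delete(ctx context.Context, key string) error
	// Batch executes a group of get/set/delete operations together.
	Batch(ctx context.Context, ops ...Operation) error
	// Close releases the client.
	Close(ctx context.Context) error
}
```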

@lahsivjar (Contributor, Author) commented Jul 8, 2024

@djaglowski Thanks for looking over the issue.

> We have an established notion of a storage extension which allows components to persist data via an interface

The current storage extensions would not be efficient for aggregations. The best we could do with the current storage extension would be to store each resource as a key-value pair and implement update-on-write manually: when a new value arrives for an existing key, we read and decode the stored data, perform the merge, and write the result back to the DB. Pebble, on the other hand, does periodic compactions and can perform the merges (aggregations) during compaction.

I think if we implement Pebble as a new storage extension, we could use it for aggregations efficiently. WDYT, does that make sense?
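
To make that concrete, the manual update-on-write path against a generic key-value storage client would look roughly like this sketch (`mergePayloads` is a hypothetical helper, and the import path reflects the experimental storage package at the time of writing):

```go
package intervalstore

import (
	"context"

	"go.opentelemetry.io/collector/extension/experimental/storage"
)

// mergePayloads is a hypothetical stand-in for decoding two serialized
// aggregates, merging their data points, and re-encoding the result.
func mergePayloads(existing, incoming []byte) ([]byte, error) {
	// ... decode, merge, re-encode ...
	return incoming, nil
}

// updateAggregate shows the read-modify-write cycle a generic key-value
// storage client forces on every merge: read, decode and merge, write back.
func updateAggregate(ctx context.Context, client storage.Client, key string, incoming []byte) error {
	existing, err := client.Get(ctx, key)
	if err != nil {
		return err
	}
	if existing == nil {
		return client.Set(ctx, key, incoming)
	}
	merged, err := mergePayloads(existing, incoming)
	if err != nil {
		return err
	}
	return client.Set(ctx, key, merged)
}
```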

@djaglowski (Member) commented:

I'm not opposed to a new pebble extension (though I don't have capacity to help make it happen). What I'm most curious about is whether the storage extension really needs to be used so aggressively. In other words, can we maintain aggregations in memory and periodically write the overall state to the storage extension (e.g. once per second)?

@lahsivjar (Contributor, Author) commented Jul 9, 2024

> I'm not opposed to a new pebble extension (though I don't have capacity to help make it happen).

Got it. I can do the heavy lifting here once we have a general direction, if that would work for you.

> What I'm most curious about is whether the storage extension really needs to be used so aggressively. In other words, can we maintain aggregations in memory and periodically write the overall state to the storage extension (e.g. once per second)?

No, we don't need to use it so aggressively that we hit the database for every value we receive. I did a PoC using Pebble for aggregations and used a size-based batch buffer of 16MB (ref) to keep the data in memory before flushing it to the actual DB. In addition, before harvesting an interval once it matures, we would flush all in-memory data to the DB.
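
A simplified sketch of that buffering scheme (the threshold constant and helper names are illustrative, not the exact PoC code):

```go
package intervalpebble

import "github.com/cockroachdb/pebble"

const flushThreshold = 16 << 20 // ~16MB, mirroring the PoC's batch buffer size

// bufferedWriter accumulates merge operations in an in-memory Pebble batch
// and commits the batch once it grows past the size threshold.
type bufferedWriter struct {
	db    *pebble.DB
	batch *pebble.Batch
}

func newBufferedWriter(db *pebble.DB) *bufferedWriter {
	return &bufferedWriter{db: db, batch: db.NewBatch()}
}

func (w *bufferedWriter) add(key, value []byte) error {
	if err := w.batch.Merge(key, value, nil); err != nil {
		return err
	}
	if w.batch.Len() >= flushThreshold {
		return w.flush()
	}
	return nil
}

// flush commits the buffered operations; it would also be called right
// before an interval is harvested so no in-memory data is lost.
func (w *bufferedWriter) flush() error {
	if err := w.batch.Commit(pebble.Sync); err != nil {
		return err
	}
	_ = w.batch.Close()
	w.batch = w.db.NewBatch()
	return nil
}
```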

One advantage of using something like Pebble over a generic key-value DB is that writes remain fairly straightforward, without much computational overhead for merging on the write path.

@RichieSams (Contributor) commented:

> What I'm most curious about is whether the storage extension really needs to be used so aggressively. In other words, can we maintain aggregations in memory and periodically write the overall state to the storage extension (e.g. once per second)?

I had the same thought. Would it be fine to only store to the DB every X seconds or so, so we don't have to suffer the read, deserialize, write roundtrip for every operation?

@djaglowski (Member) commented:

> I'm not opposed to a new pebble extension (though I don't have capacity to help make it happen).
>
> Got it. I can do the heavy lifting here once we have a general direction, if that would work for you.

To clarify, I don't have capacity to sponsor (review and ultimately be responsible for) another component. It's still worth proposing formally though since you may find another sponsor. Otherwise you can always host it outside of contrib and people can pull it into their builds.

@lahsivjar (Contributor, Author) commented:

> I had the same thought. Would it be fine to only store to the DB every X seconds or so, so we don't have to suffer the read, deserialize, write roundtrip for every operation?

Hmm, I understand the point now. In the PoC I posted, I am still serializing the data and writing it to an in-memory batch; however, we could avoid much of the serialization/deserialization by performing some merges in memory and then serializing only the partial aggregate. This makes sense to me and we could definitely do it.
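
For example, a partial in-memory merge ahead of the batch could look roughly like this (deliberately simplified; `metricAggregate` and `encode` stand in for the real pmetric-based aggregate and its serialization):

```go
package intervalpebble

import "github.com/cockroachdb/pebble"

// metricAggregate is a deliberately simplified stand-in for a real partial
// aggregate; the actual processor would merge pmetric data points.
type metricAggregate struct {
	sum   float64
	count uint64
}

// encode is a placeholder for whatever serialization the merge operator expects.
func encode(a *metricAggregate) []byte {
	// ... real code would marshal the aggregate ...
	return nil
}

// partialAggregator merges incoming values in memory and serializes each
// combined partial aggregate only once, when flushing to the Pebble batch.
type partialAggregator struct {
	pending map[string]*metricAggregate // keyed by interval + metric identity
}

func (p *partialAggregator) add(key string, m *metricAggregate) {
	if existing, ok := p.pending[key]; ok {
		existing.sum += m.sum
		existing.count += m.count
		return
	}
	p.pending[key] = m
}

func (p *partialAggregator) flushTo(batch *pebble.Batch) error {
	for key, agg := range p.pending {
		if err := batch.Merge([]byte(key), encode(agg), nil); err != nil {
			return err
		}
		delete(p.pending, key)
	}
	return nil
}
```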

> To clarify, I don't have capacity to sponsor (review and ultimately be responsible for) another component. It's still worth proposing formally though since you may find another sponsor

Got it. Thanks for the clarification.

@lahsivjar (Contributor, Author) commented:

> so we don't have to suffer the read, deserialize, write roundtrip for every operation

A clarification to this: with Pebble, in the hot path (the ConsumeMetrics step) we would not read and deserialize, but rather just perform serialization followed by a write. The new metric would be written as a merge operation. The deserialization, actual merge, and write-back would be performed during compaction or when we harvest the aggregated data.
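
So the hot path would reduce to roughly the following sketch (`marshalAggregate` and the `processor` type are hypothetical; the real code would key by interval and metric identity):

```go
package intervalpebble

import (
	"github.com/cockroachdb/pebble"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

type processor struct {
	db *pebble.DB
}

// marshalAggregate is a hypothetical serialization helper; the real code
// would encode the metrics into whatever format the merge operator expects.
func marshalAggregate(md pmetric.Metrics) ([]byte, error) {
	return (&pmetric.ProtoMarshaler{}).MarshalMetrics(md)
}

// Hot path: serialize the incoming data and hand it to Pebble as a merge
// operation; no read or deserialization happens here. The actual merge runs
// later, during compaction or at harvest time.
func (p *processor) writeHotPath(intervalKey []byte, md pmetric.Metrics) error {
	payload, err := marshalAggregate(md)
	if err != nil {
		return err
	}
	return p.db.Merge(intervalKey, payload, pebble.NoSync)
}
```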

@lahsivjar (Contributor, Author) commented:

An update: I am working on building a Pebble-based storage extension as per @djaglowski's earlier suggestion. The existing storage extension API is basic, and using it directly would hurt performance a bit because it has: a) no support for range keys, and b) string-only keys. I will know more once I have made some progress with the code.
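
For example, with string-only keys a composite (interval, resource ID) key has to be flattened into a single string, and without range keys there is no prefix iteration to efficiently pull back everything for one interval; a purely illustrative key helper:

```go
package intervalstore

import (
	"fmt"
	"time"
)

// aggregateKey flattens a composite (interval, resource ID) key into a
// single string, as the current string-only storage API requires; there is
// no range/prefix iteration to fetch all keys for one interval efficiently.
// The layout is purely illustrative.
func aggregateKey(interval time.Duration, resourceID string) string {
	return fmt.Sprintf("%s|%s", interval, resourceID)
}
```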
