rfc(decision): Batch multiple files together into single large file to improve network throughput #98

28 changes: 16 additions & 12 deletions text/0098-store-multiple-replay-segments-in-a-single-blob.md

# Summary

Recording data is sent in segments. Each segment is written to its own file. Writing files is the most expensive component of our Google Cloud Storage usage. It is also the most expensive component, in terms of time, in our processing pipeline. By merging many segment files together into a single file we can minimize our costs and maximize our processing pipeline's throughput.

# Motivation

1. Minimize costs.
2. Improve throughput.
3. Enable new features in a cost-effective manner.

# Background

In practical terms, this means 75% of our spend is allocated to writing new files.

First, a new table called "recording_byte_range" with the following structure is created:

| replay_id | segment_id | path     | start | stop  |
| --------- | ---------- | -------- | ----- | ----- |
| A | 0 | file.bin | 0 | 6241 |
| B | 0 | file.bin | 6242 | 8213 |
| A | 1 | file.bin | 8214 | 12457 |

Replay and segment ID should be self-explanatory. Path is the location of the blob in our bucket. Start and stop are integers which represent the index positions of an inclusive range. This range is a contiguous sequence of related bytes. In other words, the segment's compressed data is contained within the range.
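
To make the inclusive range semantics concrete, here is a tiny Python sketch (the function and variable names are illustrative, not part of the RFC) of how a segment's compressed bytes would be sliced out of a merged blob:

```python
def extract_segment(blob: bytes, start: int, stop: int) -> bytes:
    """Return one segment's compressed bytes from a merged blob (stop is inclusive)."""
    return blob[start : stop + 1]


# Using the example table above: replay A, segment 0 spans bytes 0..6241 of
# file.bin, so the returned slice is 6242 bytes long.
```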

Notice that each row in the example above points to the same file but with different start and stop locations. This implies that multiple segments and replays can be present in the same file. A single file can be shared by hundreds of different segments.

This table will need to support, at a minimum, one write per segment.

Second, the Session Replay recording consumer will not commit blob data to Google Cloud Storage for each segment. Instead it will buffer many segments and flush them together as a single blob to GCS. Next it will make a bulk insertion into the database for tracking.
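
A minimal sketch of what this buffer-and-flush step could look like, written in Python. The class, the helper functions, and the blob naming scheme are assumptions made for illustration, not the RFC's concrete design:

```python
import dataclasses
import uuid


@dataclasses.dataclass
class BufferedSegment:
    replay_id: str
    segment_id: int
    payload: bytes  # already-compressed segment bytes from the Kafka message


def upload_blob(path: str, blob: bytes) -> None:
    ...  # hypothetical single write of the merged blob to GCS


def bulk_insert_byte_ranges(rows: list[tuple[str, int, str, int, int]]) -> None:
    ...  # hypothetical bulk insert into the recording_byte_range table


def flush(buffer: list[BufferedSegment]) -> None:
    """Concatenate buffered segments into one blob and record their byte ranges."""
    path = f"replays/{uuid.uuid4().hex}.bin"  # illustrative naming scheme
    parts: list[bytes] = []
    rows: list[tuple[str, int, str, int, int]] = []
    offset = 0

    for segment in buffer:
        start = offset
        stop = offset + len(segment.payload) - 1  # stop is inclusive
        rows.append((segment.replay_id, segment.segment_id, path, start, stop))
        parts.append(segment.payload)
        offset = stop + 1

    upload_blob(path, b"".join(parts))  # one file write instead of many
    bulk_insert_byte_ranges(rows)       # one database round trip for the batch
```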

**Review comment:**

What will this buffering process look like?
It will have to rely on persistent state to prevent you from losing chunks in case of a failover before a file is fully ingested while parts have already been committed in Kafka.

Also, a more high-level concern with this: if the idea is to build a generic system reusable by other features, relying on a specific consumer to do the buffering has the important implication that Kafka is always going to be a moving part of your system. Have you considered doing the buffering separately?

**@cmanallen (Member, Author) replied, Jul 12, 2023:**

> What will this buffering process look like?

Like buffering in our Snuba consumers. You keep an array of messages/rows/bytes in memory. When it comes time to flush, you zip them together in some protocol-specific format. In this case the protocol is byte concatenation.

> It will have to rely on persistent state to prevent you from losing chunks in case of a failover before a file is fully ingested while parts have already been committed in Kafka.

If we assume that the replays consumer pushes the bytes to another, generic Kafka consumer which buffers the files before upload, then the persistent state will be the log. Upload failures can be re-run from the last committed offset. Persistent failures would have to be handled as part of a DLQ and would require re-running, potentially introducing a significant amount of latency between (in the replays case) a billing outcome and the replay recording being made available.

Assuming this buffering/upload step exists inside our existing consumer (i.e. not a generic service), offsets will not be committed until after the batch has been uploaded.

> Also, a more high-level concern with this: if the idea is to build a generic system reusable by other features, relying on a specific consumer to do the buffering has the important implication that Kafka is always going to be a moving part of your system. Have you considered doing the buffering separately?

I have considered that. The idea being that process A buffers a bunch of file-parts/offsets before sending the completed object to permanent storage, either through direct interaction, filestore, or a Kafka intermediary (or any number of intermediaries). The problem is then that each call site is responsible for the buffer, which is the hardest part of the problem.

Kafka is an important part of this system in my mind. When I wrote this document I relied on the guarantees it provides. I think if this is a generic service made available to the company then publishing to the topic should be a simple process for integrators. The fact that it's Kafka internally is an irrelevant implementation detail (to the caller).
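
For concreteness, a rough sketch of the "buffer, flush, then commit" ordering described above, using a confluent-kafka style consumer. The topic name, group id, flush threshold, and helper functions are illustrative assumptions, not part of the RFC:

```python
# Sketch only: a real consumer would also flush on a timer and handle rebalances.
from confluent_kafka import Consumer


def flush_to_gcs(segments: list[bytes]) -> None:
    ...  # merge, upload, and bulk insert as in the buffer-and-flush sketch above


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-recording-buffer",  # assumed group id
    "enable.auto.commit": False,            # commit manually, only after a durable flush
})
consumer.subscribe(["ingest-replay-recordings"])  # assumed topic name

buffer: list[bytes] = []
MAX_BUFFER = 500  # illustrative flush threshold

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())

    buffer.append(msg.value())  # compressed segment bytes

    if len(buffer) >= MAX_BUFFER:
        flush_to_gcs(buffer)                 # upload merged blob + insert byte ranges
        consumer.commit(asynchronous=False)  # offsets advance only after the flush succeeds
        buffer.clear()
```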


(Mermaid flowchart omitted.)

The response bytes will be decompressed, merged into a single payload, and returned.
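
A rough sketch of that read path, assuming the google-cloud-storage client (whose `download_as_bytes` accepts an inclusive start/end byte range) and zlib-compressed segments; the bucket name and the `fetch_byte_ranges` helper are hypothetical:

```python
import zlib

from google.cloud import storage


def fetch_byte_ranges(replay_id: str) -> list[tuple[str, int, int]]:
    ...  # hypothetical query of recording_byte_range, ordered by segment_id


def fetch_replay(replay_id: str) -> bytes:
    client = storage.Client()
    bucket = client.bucket("replay-recordings")  # assumed bucket name

    payload = b""
    for path, start, stop in fetch_byte_ranges(replay_id):
        blob = bucket.blob(path)
        compressed = blob.download_as_bytes(start=start, end=stop)  # ranged download
        payload += zlib.decompress(compressed)
    return payload
```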

# Drawbacks

1. Deleting data becomes tricky. See "Unresolved Questions".

# Unresolved Questions

1. Can we keep deleted data in GCS but make it inaccessible?

   - User and project deletes:
     - We would remove all capability to access it, making it functionally deleted.
   - GDPR deletes:
     - Would this require downloading the file, over-writing the subsequence of bytes, and re-uploading a new file? (A rough sketch of that overwrite appears after this list.)
     - Single replays, small projects, or if the mechanism is infrequently used could make this a valid deletion mechanism.
   - The data could be encrypted, with some encryption key stored on the metadata row, making the byte sequence unreadable upon row delete.

2. What datastore should we use to store the byte range information?
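
If the download/overwrite/re-upload route mentioned in question 1 were taken, the overwrite itself might look like the following sketch (assuming google-cloud-storage; names are illustrative). Note that re-uploading creates a new object, which would reset its retention period:

```python
from google.cloud import storage


def scrub_segment(bucket_name: str, path: str, start: int, stop: int) -> None:
    """Overwrite one segment's inclusive byte range with null bytes."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(path)

    data = bytearray(blob.download_as_bytes())             # download the whole file
    data[start : stop + 1] = b"\x00" * (stop - start + 1)  # zero the deleted range
    blob.upload_from_string(bytes(data))                   # re-upload the scrubbed file
```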
