rfc(decision): Batch multiple files together into single large file to improve network throughput #98

Open · wants to merge 21 commits into main

Changes from 1 commit
Re-write to include notes about encryption of files
cmanallen committed Jun 8, 2023
commit 812a110ab33485e5213897004c561629963db87d
108 changes: 64 additions & 44 deletions text/0098-store-multiple-replay-segments-in-a-single-blob.md
@@ -15,29 +15,43 @@ Recording data is sent in segments. Each segment is written to its own file. Wri

# Background

This document exists to inform all relevant stakeholders of our proposal and seek feedback prior to implementation.
This document was originally written to respond to a perceived problem in the Session Replays recording consumer. However, upon exploring these ideas more it was determined that this could be more generally applied to the organization as a whole. For that reason I've made many of the names generic but have also retained many references to Session Replay.

# Supporting Data

Google Cloud Storage lists the costs for writing and storing data as two separate categories. Writing a file costs $0.005 per 1000 files. Storing that file costs $0.02 per gigabyte. For the average file this works out to: $0.000000012 for storage and $0.00000005 for the write.
Google Cloud Storage lists the costs for writing and storing data as two separate categories. Writing a file costs $0.005 per 1000 files. Storing that file costs $0.02 per gigabyte. For the average Session Replay file (with a retention period of 90 days) this works out to: $0.000000012 for storage and $0.00000005 for the write.

In practical terms, this means 75% of our spend is allocated to writing new files.
cmanallen marked this conversation as resolved.


While the cost is mostly due to writes, is this cost problematic at scale? Is it a blocker for reaching higher scale, or are we simply looking for a more efficient option?

@cmanallen (Member, Author) replied:

I do not believe cost will prevent us from reaching greater scale. Write costs scale linearly. Pricing does not quite scale linearly, but if you're happy with how much GCS costs at low levels of demand you will be happy at peak demand.


# Proposal

First, a new table called "recording_byte_range" with the following structure is created:

| replay_id | segment_id | path | start | stop |
| --------- | ---------- | -------- | ----- | ----- |
| A | 0 | file.bin | 0 | 6241 |
| B | 0 | file.bin | 6242 | 8213 |
| A | 1 | file.bin | 8214 | 12457 |

Replay and segment ID should be self-explanatory. Path is the location of the blob in our bucket. Start and stop are integers which represent the index positions in an inclusive range. This range is a contiguous sequence of related bytes. In other words, the segment's compressed data is contained within the range.

Notice each row in the example above points to the same file but with different start and stop locations. This implies that multiple segments and replays can be present in the same file. A single file can be shared by hundreds of different segments.

This table will need to support, at a minimum, one write per segment.
First, a new table called "file_part_byte_range" with the following structure is created:

| id | key | path | start | stop | dek | kek_id |
| --- | --- | -------- | ----- | ----- | -------- | ------ |
| 1 | A:0 | file.bin | 0 | 6241 | Aq3[...] | 1 |
| 2 | B:0 | file.bin | 6242 | 8213 | ppT[...] | 1 |
| 3 | A:1 | file.bin | 8214 | 12457 | 99M[...] | 1 |

- The key field is a client-generated identifier.
- It is not unique.
- The value of the key field should be easily computable by your service.
@fpacifici commented on Jul 12, 2023:

Why not ensure this is unique and avoid having two keys?

@cmanallen (Member, Author) replied:

At-least-once delivery guarantees. If you're bulk inserting 1000 rows per upload then it becomes difficult to split out the rows that have duplicates.

But, thinking on this more... my plan is to use Kafka for this. Ordering guarantees could mean that the file batch is deterministic (assuming we don't have a stateful countdown timer). If 1 row exists in the database then they all must exist (therefore the batch was already written - commit offsets and move on). I suppose re-balancing would not affect this? A unique constraint could be possible.

That being said I think a deadline is important so files don't sit idle in the consumer for undefined periods of time.

@cmanallen (Member, Author) followed up on Jul 12, 2023:

Then again a read query against the unique key prior to the insert operation could satisfy this constraint. So yes I would say a unique constraint is possible here.

- In the case of Session Replay the key could be a concatenation of `replay_id` and `segment_id`.
- Illustrated above as `replay_id:segment_id`.
- Alternatively, a true composite key could be stored on a secondary table which contains a reference to the `id` of the `file_part_byte_range` row.
- Path is the location of the blob in our bucket.
- Start and stop are integers which represent the index positions in an inclusive range.
- This range is a contiguous sequence of related bytes.
- In other words, the entirety of the file part's encrypted data is contained within the range.
- The "dek" column is the **D**ata **E**ncryption **K**ey.
- The DEK is the key that was used to encrypt the byte range.
- The key itself is encrypted by the KEK.
- **K**ey **E**ncryption **K**ey.
- Encryption is explored in more detail in the following sections.
- The "kek_id" column contains the ID of the KEK used to encrypt the DEK.
- This KEK can be fetched from a remote **K**ey **M**anagement **S**ervice or a local database table.

Notice each row in the example above points to the same file but with different start and stop locations. This implies that multiple, independent parts can be present in the same file. A single file can be shared by hundreds of different parts.
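
As a concrete illustration, a row of this table could be modeled as follows. This is a minimal sketch; the class name and the `length` helper are illustrative and not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class FilePartByteRange:
    """One row of the proposed "file_part_byte_range" table (illustrative only)."""

    id: int
    key: str      # client-generated, e.g. f"{replay_id}:{segment_id}" for Session Replay
    path: str     # location of the containing blob in the bucket
    start: int    # inclusive offset of the first byte of the part
    stop: int     # inclusive offset of the last byte of the part
    dek: bytes    # data encryption key, stored encrypted with the KEK
    kek_id: int   # identifier of the KEK that encrypted the DEK

    @property
    def length(self) -> int:
        # Offsets are inclusive, so the part spans stop - start + 1 bytes.
        return self.stop - self.start + 1
```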

Second, the Session Replay recording consumer will not commit blob data to Google Cloud Storage for each segment. Instead it will buffer many segments and flush them together as a single blob to GCS. Next it will make a bulk insertion into the database for tracking.


What is this buffering process going to look like?
It will have to rely on persistent state to prevent you from losing chunks in case of failover before a file is fully ingested while parts have already been committed in Kafka.

Also, a more high-level concern with this: if the idea is to build a generic system reusable by other features, relying on a specific consumer to do the buffering has the important implication that Kafka is always going to be a moving part of your system. Have you considered doing the buffering separately?

@cmanallen (Member, Author) replied on Jul 12, 2023:

> What is this buffering process going to look like?

Like buffering would work in our Snuba consumers. You keep an array of messages/rows/bytes in memory. When it comes time to flush, you zip them together in some protocol-specific format. In this case the protocol is byte concatenation.

> It will have to rely on persistent state to prevent you from losing chunks in case of failover before a file is fully ingested while parts have already been committed in Kafka.

If we assume that the replays consumer pushes the bytes to another, generic Kafka consumer which buffers the files before upload, then the persistent state will be the log. Upload failures can be re-run from the last committed offset. Persistent failures would have to be handled as part of a DLQ and would require re-running, potentially introducing a significant amount of latency (in the replays case) between a billing outcome and the replay recording being made available.

Assuming this buffering/upload step exists inside our existing consumer (i.e. not a generic service), offsets will not be committed until after the batch has been uploaded.

> Also, a more high-level concern with this: if the idea is to build a generic system reusable by other features, relying on a specific consumer to do the buffering has the important implication that Kafka is always going to be a moving part of your system. Have you considered doing the buffering separately?

I have considered that. The idea being that process A buffers a bunch of file-parts/offsets before sending the completed object to permanent storage, either through direct interaction, filestore, or a Kafka intermediary (or any number of intermediaries). The problem is then that each call site is responsible for the buffer, which is the hardest part of the problem.

Kafka is an important part of this system in my mind. When I wrote this document I relied on the guarantees it provides. I think if this is a generic service made available to the company then publishing to the topic should be a simple process for integrators. The fact that it's Kafka internally is an irrelevant implementation detail (to the caller).
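
For readers following along, a minimal sketch of the in-memory buffer and byte-concatenation flush discussed in this thread might look like the following. The `FilePartBuffer` name, the thresholds, and the `upload_blob`/`insert_rows` callables are all hypothetical stand-ins, not existing code; offsets would only be committed after a successful flush.

```python
import time
import uuid


class FilePartBuffer:
    """Accumulates encrypted parts in memory and flushes them as one concatenated blob."""

    def __init__(self, max_bytes=10_000_000, max_age_seconds=1.0):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self._parts = []          # list of (key, encrypted bytes)
        self._size = 0
        self._first_append = None

    def append(self, key, data):
        if self._first_append is None:
            self._first_append = time.time()
        self._parts.append((key, data))
        self._size += len(data)

    @property
    def should_flush(self):
        # Flush when the buffer is large enough or has been sitting too long.
        if not self._parts:
            return False
        return (
            self._size >= self.max_bytes
            or time.time() - self._first_append >= self.max_age_seconds
        )

    def flush(self, upload_blob, insert_rows):
        """Concatenate the parts, upload once, then bulk insert one metadata row per part."""
        path = f"{uuid.uuid4().hex}.bin"
        blob = b"".join(data for _, data in self._parts)

        rows, offset = [], 0
        for key, data in self._parts:
            rows.append({"key": key, "path": path, "start": offset, "stop": offset + len(data) - 1})
            offset += len(data)

        upload_blob(path, blob)   # e.g. a single GCS write
        insert_rows(rows)         # bulk insert into file_part_byte_range
        # Kafka offsets should only be committed after both calls succeed.

        self._parts, self._size, self._first_append = [], 0, None
```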


@@ -53,36 +67,48 @@ flowchart
G --> A;
```

Third, when a client requests recording data we will look it up in the "recording_byte_range" table. From its response, we will issue as many fetch requests as there are rows in the response. These requests may target a single file or many files. The files will be fetched with a special header that instructs the service provider to only respond with a subset of the bytes. Specifically, the bytes related to our replay.
## Writing

The response bytes will be decompressed, merged into a single payload, and returned to the user as they are now.
Writing a file part is a four-step process.

# Drawbacks
First, the bytes must be encrypted with a randomly generated DEK. Second, the DEK is encrypted with a KEK. Third, the file is uploaded to the cloud storage provider. Fourth, a metadata row is written to the "file_part_byte_range" containing a key, the containing blob's filepath, start and stop offsets, and the encrypted DEK.
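
A rough sketch of those four steps, assuming AES-GCM from the `cryptography` package and hypothetical `wrap_dek`, `upload_blob`, and `insert_row` callables standing in for the KMS, storage, and database clients:

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def write_file_part(key, plaintext, wrap_dek, upload_blob, insert_row, path="file.bin", start=0):
    """Illustrates the four write steps for a single part; in practice parts are batched."""
    # 1. Encrypt the part with a freshly generated DEK (AES-256-GCM, nonce prepended).
    dek = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(dek).encrypt(nonce, plaintext, None)

    # 2. Encrypt (wrap) the DEK with the KEK, e.g. via a call to a remote KMS.
    encrypted_dek, kek_id = wrap_dek(dek)

    # 3. Upload the encrypted bytes to the cloud storage provider.
    upload_blob(path, ciphertext)

    # 4. Record where the part lives and how to decrypt it.
    insert_row({
        "key": key,
        "path": path,
        "start": start,
        "stop": start + len(ciphertext) - 1,
        "dek": encrypted_dek,
        "kek_id": kek_id,
    })
```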

1. Deleting data becomes tricky. See "Unresolved Questions".
**A Note on Aggregating File Parts**

# Unresolved Questions
It is up to the implementer to determine how many parts exist in a file. An implementer may choose to store one part per file or may store an unbounded number of parts per file.

However, if you are using this system, it is recommended that more than one part be stored per file. Otherwise it is more economical to upload the file using simpler, more direct methods.

## Reading

To read a file part, the metadata row in the "file_part_byte_range" table is fetched. Using the filepath, starting byte, and ending byte, we fetch the encrypted bytes from remote storage. The row's DEK is then decrypted with its KEK, and the decrypted DEK is used to decrypt the blob before it is returned to the user.
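
A minimal sketch of that read path, assuming the blob is served over HTTP with `Range` support and hypothetical `fetch_row` and `unwrap_dek` helpers; the nonce-prefix layout matches the write sketch above:

```python
import requests

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def read_file_part(key, fetch_row, unwrap_dek, bucket_url):
    row = fetch_row(key)  # SELECT ... FROM file_part_byte_range WHERE key = %s

    # Ask the storage provider for only the bytes belonging to this part.
    # Offsets are inclusive on both ends, matching the table's start/stop columns.
    headers = {"Range": f"bytes={row['start']}-{row['stop']}"}
    response = requests.get(f"{bucket_url}/{row['path']}", headers=headers)
    response.raise_for_status()
    ciphertext = response.content

    # Unwrap the DEK with the KEK identified by kek_id, then decrypt the part.
    dek = unwrap_dek(row["dek"], row["kek_id"])
    nonce, sealed = ciphertext[:12], ciphertext[12:]
    return AESGCM(dek).decrypt(nonce, sealed, None)
```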

## Deleting

To delete a file part, the metadata row in the "file_part_byte_range" table is deleted. With the removal of the DEK, the file part is no longer readable and is considered deleted.

1. Can we keep deleted data in GCS but make it inaccessible?
Project deletes, user deletes, GDPR deletes, and user-access TTLs are managed by deleting the metadata row in the "file_part_byte_range" table.

- User and project deletes:
- We would remove all capability to access it making it functionally deleted.
- GDPR deletes:
- Would this require downloading the file, over-writing the subsequence of bytes, and re-uploading a new file?
- Single replays, small projects, or if the mechanism is infrequently used could make this a valid deletion mechanism.
- The data could be encrypted, with some encryption key stored on the metadata row, making the byte sequence unreadable upon row delete.
File parts can be grouped into like-retention-periods and deleted manually or automatically after expiry. However, in the case of replays, storage costs are minor. We will retain our encrypted segment data for the maximum retention period of 90 days.

2. What datastore should we use to store the byte range information?
## Key Rotation

- Cassandra, Postgres, AlloyDB?
- Postgres likely won't be able to keep up long-term.
- Especially if we write multiple byte ranges per segment.
- Cassandra could be a good choice but it's not clear what operational burden this imposes on SnS and Ops.
- AlloyDB seems popular among the SnS team and could be a good choice.
- It can likely interface with the Django ORM, but it's not clear to me at the time of writing.
- Whatever database we use must support deletes.
If a KEK is compromised and needs to be rotated we will need to follow a four-step process. First, we query for every row in the "file_part_byte_range" table whose DEK was encrypted with the old KEK. Second, we decrypt every DEK with the old KEK. Third, we re-encrypt each DEK with a new KEK. Fourth, the old KEK is dropped.
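
A sketch of that four-step rotation, with hypothetical `db`, `decrypt_with_kek`, and `encrypt_with_kek` helpers:

```python
def rotate_kek(old_kek_id, new_kek_id, db, decrypt_with_kek, encrypt_with_kek):
    """Re-wraps every DEK that was encrypted with the compromised KEK."""
    # 1. Find every row whose DEK was encrypted with the old KEK.
    rows = db.fetch_all(
        "SELECT id, dek FROM file_part_byte_range WHERE kek_id = %s", [old_kek_id]
    )
    for row in rows:
        # 2. Decrypt the DEK with the old KEK.
        dek = decrypt_with_kek(row["dek"], old_kek_id)
        # 3. Re-encrypt the DEK with the new KEK and store it.
        db.execute(
            "UPDATE file_part_byte_range SET dek = %s, kek_id = %s WHERE id = %s",
            [encrypt_with_kek(dek, new_kek_id), new_kek_id, row["id"]],
        )
    # 4. The old KEK can now be dropped from the KMS.
```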

3. How fast can we encrypt and decrypt each segment?
DEKs are more complicated to rotate because rotation requires modifying the blob. However, because DEKs are unique to a byte range within a single file, we have a limited surface area for a compromised key to be exploited. To rotate a DEK: first, download the blob; second, decrypt the byte range with the compromised DEK; third, generate a new DEK; fourth, encrypt the payload with the new DEK; fifth, encrypt the new DEK with any KEK; and sixth, upload and re-write the metadata rows with the new offsets.
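
And a sketch of the six-step DEK rotation under the same assumptions, plus hypothetical `download_blob`, `upload_blob`, and `update_row` helpers (shown re-uploading the rotated part as its own blob rather than re-packing the original file):

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def rotate_dek(row, download_blob, upload_blob, unwrap_dek, wrap_dek, update_row):
    """Re-encrypts a single compromised part; other parts in the blob are untouched."""
    # 1. Download the containing blob and slice out this part's byte range.
    blob = download_blob(row["path"])
    ciphertext = blob[row["start"]:row["stop"] + 1]

    # 2. Decrypt the byte range with the compromised DEK.
    old_dek = unwrap_dek(row["dek"], row["kek_id"])
    nonce, sealed = ciphertext[:12], ciphertext[12:]
    plaintext = AESGCM(old_dek).decrypt(nonce, sealed, None)

    # 3 & 4. Generate a new DEK and re-encrypt the payload with it.
    new_dek = AESGCM.generate_key(bit_length=256)
    new_nonce = os.urandom(12)
    new_ciphertext = new_nonce + AESGCM(new_dek).encrypt(new_nonce, plaintext, None)

    # 5. Encrypt the new DEK with any available KEK.
    encrypted_dek, kek_id = wrap_dek(new_dek)

    # 6. Upload the re-encrypted part and re-write the metadata row with the new offsets.
    new_path = f"{row['path']}.rotated"
    upload_blob(new_path, new_ciphertext)
    update_row(
        row["id"],
        path=new_path,
        start=0,
        stop=len(new_ciphertext) - 1,
        dek=encrypted_dek,
        kek_id=kek_id,
    )
```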

# Drawbacks

# Unresolved Questions

1. How can we efficiently fetch a KEK?
- As currently specified we would need to reach out to a remote service.
- Can we cache the key in Redis or more permanently in a datastore like Postgres?
- This expands the attack surface for a potential intruder.
- Does KEK lookup efficiency matter on an endpoint where we're downloading KBs of blob data?
- Maybe?
- Does KEK lookup efficiency matter for ingest when it can be cached for the duration of the consumer's life?
- No.

# Extensions

@@ -134,6 +160,8 @@ with open(filename, "r") as f:

## Consumer Buffering Mechanics

The following section is highly specific to the Session Replay product.

We will continue our approach of using _at-least once processing_. Each message we receive is guaranteed to be processed to completion regardless of error or interrupt. Duplicate messages are possible under this scheme and must be accounted for in the planning of each component.

**Buffer Location and Behavior**
@@ -198,11 +226,3 @@ With a buffered approach most of the consumer's effects are accomplished in two
4. Click tracking.
- Duplicate click events will be inserted for a replay, segment pair.
- This is an acceptable outcome and will not impact search behavior.

## Security

How do we prevent users from seeing segments that do not belong to them? The short answer is test coverage. Recording another segment's byte range would be the same as generating the incorrect filename under the current system. These outcomes are prevented with robust test coverage.

In the case of bad byte range math, we do have some implicit protection. Each segment is compressed independently. Fetching a malformed byte range would yield unreadable data. A bad byte range either truncates the compression headers or includes unintelligible bytes at the beginning or end of the sequence. If we manage to decompress a subset of a valid byte range the decompressed output would be malformed JSON and would not be returnable to the user.

Additionally, each segment could be encrypted with an encryption-key that is stored on the row. Requesting an invalid byte range would yield malformed data which could not be decrypted with the encryption-key stored on the row.