rfc(decision): Batch multiple files together into single large file to improve network throughput #98

Changes from 1 commit
Clean up managing effects section slightly
cmanallen committed Jun 9, 2023
commit d63a30a95626ce0d835390453ea2c7908d668edd
7 changes: 4 additions & 3 deletions text/0098-store-multiple-replay-segments-in-a-single-blob.md
@@ -221,7 +221,7 @@ We will continue our approach of using _at-least once processing_. Each message

The buffer is kept as an in-memory list inside the consumer process. For each message we receive, we append it to the buffer and then check whether the buffer is full. If it is, we flush; otherwise we wait for the next message.
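A minimal sketch of that loop, assuming a count-based bound (the names `MAX_BUFFER_SIZE`, `upload_blob`, and `commit_offsets` are illustrative, not from the RFC):

```python
from typing import Any, List

MAX_BUFFER_SIZE = 1000  # hypothetical bound; the RFC does not fix a value

buffer: List[Any] = []

def upload_blob(messages: List[Any]) -> None:
    """Placeholder: write every buffered segment into one large blob."""

def commit_offsets(messages: List[Any]) -> None:
    """Placeholder: commit Kafka offsets only after the durable write (at-least-once)."""

def handle_message(message: Any) -> None:
    buffer.append(message)            # for each message received, append to the buffer
    if len(buffer) >= MAX_BUFFER_SIZE:
        flush()                       # buffer is full: flush; otherwise wait for more

def flush() -> None:
    if buffer:
        upload_blob(buffer)
        commit_offsets(buffer)
        buffer.clear()
```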


It seems this will constrain you to a fairly small buffer size.
It also ties the number of replicas of your consumer to the cost efficiency of the storage, which is quite undesirable. Assuming you are never going to commit on Kafka until the buffer is flushed (if you did, you would not be able to guarantee at-least-once):

  • If you increase the number of replicas for any reason, each replica takes less traffic, so it takes longer to fill the buffer.
  • Unless you tie the commit batch time and size to the number of replicas (which is undesirable; see the snuba consumer), increasing the number of replicas would increase the number of files written per unit of time.
    Replica count and file size should not be connected to each other; otherwise we will have to keep in mind a lot of affected moving parts when scaling the consumer.
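To make that trade-off concrete, here is a toy calculation assuming flushes are driven purely by the deadline (no replica fills its buffer first); all numbers are illustrative, not measurements:

```python
total_throughput = 1_000  # messages/sec across the whole topic (illustrative)
deadline = 1.0            # seconds before a partial buffer is flushed (illustrative)

for replicas in (1, 2, 4):
    # Each replica receives an equal share of traffic and flushes once per deadline.
    messages_per_file = (total_throughput / replicas) * deadline
    files_per_second = replicas / deadline
    print(f"{replicas} replica(s): {messages_per_file:.0f} msgs/file, {files_per_second:.0f} files/sec")

# 1 replica(s): 1000 msgs/file, 1 files/sec
# 2 replica(s): 500 msgs/file, 2 files/sec
# 4 replica(s): 250 msgs/file, 4 files/sec
```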

Member Author

If the time to accumulate a batch is less than the time to upload a batch, then the consumer cannot keep up and you need to add a replica. That's the only constraint. You get more efficiency at peak load, so it's best to run our replicas hot. The deadline will prevent the upload from sitting idle too long. The total scale factor will be determined by the number of machines we can throw at the problem.

Multi-processing/threading, I think, will be deadly to this project. So we will need a lot of single-threaded machines running.

I rewrote this response several times. It's as disorganized as my thoughts are on this. Happy to hear critiques.

Member Author

> Unless you tie the commit batch time and size to the number of replicas (which is undesirable - see the snuba consumer), increasing the number of replicas would increase the number of files written per unit of time.

I think this is an okay outcome. If you double the number of replicas, you halve the number of parts per file and double the number of files. That reduces cost efficiency, but throughput efficiency remains the same for each replica. Ignoring replica count, cost efficiency will ebb and flow with the variations in load we receive throughout the day.

We still come out ahead because the total number of files written per second is lower than in the current implementation, which writes one file per message.

@cmanallen (Member Author) commented Jul 18, 2023

> [...] otherwise we will have to keep in mind a lot of affected moving parts when scaling the consumer.

We should scale our replicas agnostic of the implementation of the buffer's flush mechanics. I mentioned above that cost-efficiency is a hard target, so I don't think we should target it.

A deadline should be present to guarantee regular buffer commits, and a max buffer size should exist to prevent us from using too many resources. Those two commit semantics save us from having to think about the implications of adding replicas. The throughput of a single machine may drop, but the risk of a backlog across the cluster decreases.


This is a somewhat simplified view of what's happening. In reality we will use deadline flushing in addition to count-based or resource-usage-based flushing. This ensures the buffer does not stay partially full indefinitely.
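A sketch of a combined flush predicate, assuming deadline, count, and byte-size bounds (all names and values below are illustrative):

```python
import time
from typing import Any, List

MAX_MESSAGES = 1_000       # count bound (illustrative)
MAX_BYTES = 16_000_000     # resource-usage bound (illustrative)
DEADLINE_SECONDS = 1.0     # time bound (illustrative)

def should_flush(buffer: List[Any], buffer_bytes: int, first_append_at: float) -> bool:
    """Flush when any bound is hit, so a partial buffer never sits indefinitely."""
    if not buffer:
        return False
    return (
        len(buffer) >= MAX_MESSAGES
        or buffer_bytes >= MAX_BYTES
        or time.monotonic() - first_append_at >= DEADLINE_SECONDS
    )
```

For the deadline to fire during quiet periods, the message-listening call also needs a timeout so the predicate is evaluated even when no message arrives.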

**Buffer Flush**

@@ -250,12 +250,13 @@ With a buffered approach most of the consumer's effects are accomplished in two
1. Click Tracking.
   - Click events are published to the replay-event Kafka topic.
   - This publishing step is asynchronous and relies on threading to free up the main process thread (see the sketch after this list).
   - This operation is measured in microseconds and is not anticipated to significantly impact total throughput.
   - Because we can tolerate duplicates, we can publish click events when we see the message or when we commit a batch.
   - Neither choice is anticipated to significantly impact message processing.
2. Outcome Tracking.
   - Outcome events are published to the outcomes Kafka topic.
   - This publishing step is asynchronous and relies on threading to free up the main process thread.
   - This operation only occurs for segment-0 events.
   - This operation is measured in microseconds and is not anticipated to significantly impact total throughput.
   - I am unsure whether this step can tolerate duplicates. It likely can, but if it cannot, we may have to commit during the flushing step.
3. Project Lookup.
   - Projects are retrieved by a cache lookup, or by querying PostgreSQL on a cache miss.
   - This operation typically takes >1ms to complete.
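A sketch of the fire-and-forget publishing pattern described in items 1 and 2. The RFC does not name a client library; `confluent_kafka` and the topic name here are assumptions for illustration:

```python
from confluent_kafka import Producer

# Library choice, broker address, and topic name are illustrative assumptions.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_click_event(payload: bytes) -> None:
    # produce() only enqueues the message; librdkafka's background thread
    # handles delivery, so the main processing loop is not blocked.
    producer.produce("replay-click-events", value=payload)
    producer.poll(0)  # serve any pending delivery callbacks without blocking
```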