rfc(decision): Batch multiple files together into single large file to improve network throughput #98
Open
cmanallen wants to merge 21 commits into main from rfc/store-multiple-replay-segments-in-a-single-blob
+187 −0
Commits (21, all by cmanallen):
912301d rfc(decision): Store Multiple Replay Segments in a Single Blob
7dfe76c Initial commit
f82f9e6 Add semicolons
721c28c Fix flowchart
4ff4376 Use correct byte start
4ec773a Add extension
86c1819 Remove ancillary goal motivation
9f9d123 Add more detail
8b25e3b Use clearer language
d34401e Add technical details section
8af3d50 Merge branch 'rfc/store-multiple-replay-segments-in-a-single-blob' of…
812a110 Re-write to include notes about encryption of files
f5fe8e8 Update questions section
78a0d6e Add created-at column
82f222d Add high-level overview
a7ad1d6 Improve formatting
d63a30a Clean up managing effects section slightly
354431d Add more to supporting data section and add an FAQ
15430b4 Re-write RFC to be product-neutral
71f1715 Add conclusion
3759346 To to with
This diff shows changes from commit d63a30a95626ce0d835390453ea2c7908d668edd (Clean up managing effects section slightly).
It seems this will constrain you to a fairly small buffer size.
Also, it ties the number of replicas of your consumer to the cost efficiency of the storage, which is quite undesirable.
Assuming you are never going to commit offsets to Kafka until the buffer is flushed (if you did, you could not guarantee at-least-once delivery), replica count and file size should not be coupled to each other; otherwise we will have to keep a lot of affected moving parts in mind when scaling the consumer.
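A minimal sketch of the "commit only after the buffer is flushed" constraint described above, assuming a confluent_kafka consumer; the topic, group, and size limit are illustrative and not taken from the RFC:

```python
# Minimal sketch (not the actual consumer): offsets are committed only after
# the buffered blob has been uploaded, which gives at-least-once delivery at
# the cost of holding the whole batch in memory until the flush completes.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-blob-writer",        # hypothetical group id
    "enable.auto.commit": False,              # commits happen manually, post-flush
})
consumer.subscribe(["ingest-replay-recordings"])  # hypothetical topic name

MAX_BUFFER_BYTES = 10 * 1024 * 1024  # illustrative limit
buffer: list[bytes] = []


def upload_blob(parts: list[bytes]) -> None:
    ...  # placeholder for the real storage write


while True:
    message = consumer.poll(timeout=1.0)
    if message is None or message.error():
        continue

    buffer.append(message.value())
    if sum(len(part) for part in buffer) >= MAX_BUFFER_BYTES:
        upload_blob(buffer)
        consumer.commit(asynchronous=False)  # offsets advance only after the upload
        buffer.clear()
```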
If the time to accumulate a batch is less than the time to upload a batch, then you need to add a replica. That's the only constraint. You get more efficiency at peak load, so it's best to run our replicas hot. The deadline will prevent a partially filled buffer from sitting idle for too long. The total scale factor will be determined by the number of machines we can throw at the problem.
Multi-processing/threading will, I think, be deadly to this project, so we will need a lot of single-threaded machines running.
I re-wrote this response several times; it's as disorganized as my thoughts are on this. Happy to hear critiques.
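A rough way to state that constraint with made-up numbers (none of these rates come from the RFC):

```python
import math

# Each replica must take at least as long to fill a batch as it takes to
# upload one; otherwise the buffer grows without bound and a replica must
# be added. Solving batch_bytes * replicas / ingest_rate >= upload_seconds
# for the replica count:
def min_replicas(ingest_bytes_per_sec: float,
                 target_batch_bytes: float,
                 upload_seconds_per_batch: float) -> int:
    return math.ceil(ingest_bytes_per_sec * upload_seconds_per_batch / target_batch_bytes)


# e.g. 50 MB/s of ingest, 10 MB batches, 2 s to upload a batch -> 10 replicas
print(min_replicas(50e6, 10e6, 2.0))
```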
I think this is an okay outcome. If you double the number of replicas, you halve the number of parts per file and double the number of files. That reduces cost efficiency, but the throughput efficiency remains the same for each replica. Ignoring replica count, cost efficiency will ebb and flow with the variations in load we receive throughout the day.
We still come out ahead because the total number of files written per second is lower than with the current implementation, which writes one file per message.
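Illustrative arithmetic only, with invented rates, to show how replica count trades parts per file against files written per second:

```python
MESSAGES_PER_SECOND = 1_000   # hypothetical ingest rate
FLUSH_DEADLINE_SECONDS = 10   # hypothetical flush deadline

for replicas in (4, 8):
    # Each replica flushes one file per deadline window.
    files_per_second = replicas / FLUSH_DEADLINE_SECONDS
    # Each file holds the messages one replica received in that window.
    parts_per_file = MESSAGES_PER_SECOND * FLUSH_DEADLINE_SECONDS // replicas
    print(replicas, "replicas:", files_per_second, "files/s,", parts_per_file, "parts/file")

# Current implementation writes one file per message.
print("current:", MESSAGES_PER_SECOND, "files/s")
```

With these numbers, doubling replicas from 4 to 8 doubles the files written per second (0.4 to 0.8) and halves the parts per file (2,500 to 1,250), yet both are still far below the 1,000 files per second of the one-file-per-message approach.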
We should scale our replicas agnostic of the implementation of the buffer's flush mechanics. I mentioned above that cost efficiency is a hard target, so I don't think we should target it.
A deadline should be present to guarantee regular buffer commits, and a max buffer size should exist to prevent us from using too many resources. I think those two commit semantics save us from having to think about the implications of adding replicas. The throughput of a single machine may drop, but the risk of backlog decreases across the cluster.
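A small sketch of those two flush triggers (a max size and a deadline); the class, limits, and helpers are hypothetical, not the consumer's actual implementation:

```python
import time


def upload_blob(parts: list[bytes]) -> None:
    ...  # placeholder: write the concatenated parts as one object


def commit_offsets() -> None:
    ...  # placeholder: advance the consumer's Kafka offsets


class Buffer:
    """Flushes when either the size limit or the deadline is reached."""

    def __init__(self, max_bytes: int = 10_000_000, deadline_seconds: float = 5.0):
        self.max_bytes = max_bytes
        self.deadline_seconds = deadline_seconds
        self.parts: list[bytes] = []
        self.size = 0
        self.last_flush = time.monotonic()

    def push(self, part: bytes) -> None:
        self.parts.append(part)
        self.size += len(part)
        if self.should_flush():
            self.flush()

    def should_flush(self) -> bool:
        deadline_lapsed = time.monotonic() - self.last_flush >= self.deadline_seconds
        return self.size >= self.max_bytes or deadline_lapsed

    def flush(self) -> None:
        upload_blob(self.parts)
        commit_offsets()  # commit only after the blob is durable
        self.parts, self.size = [], 0
        self.last_flush = time.monotonic()


# A real consumer loop would also call should_flush() on every poll tick, so a
# quiet partition still flushes a partially filled buffer once the deadline lapses.
```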