Reduce memory usage for large file uploads #3062
Conversation
Someone check me here, but I believe Client objects are only accessed by the Reactor thread, so I don't think there can be a thread-safety issue here, because this state is not shared.
```diff
@@ -98,6 +98,8 @@ def initialize(io, env=nil)
       @body_remain = 0

       @in_last_chunk = false
+
+      @read_buffer = +""
```
What is this? I haven't seen this syntax before.
It's the short syntax for a non-frozen String (for when string literals are frozen by default). Same as `String.new`, which would read better IMHO.
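A minimal illustration of the semantics (my own sketch, not from this diff):

```ruby
# frozen_string_literal: true

s = "text"     # with the magic comment, string literals are frozen
t = +"text"    # String#+@ returns an unfrozen copy (or self if already unfrozen)
u = String.new # always a fresh, unfrozen, empty string

s.frozen? # => true
t.frozen? # => false
u.frozen? # => false
```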
+""
also has better performance.
```ruby
Benchmark.bm do |bm|
  bm.report { 100000000.times { String.new } }
  bm.report { 100000000.times { +'' } }
end
```

```
      user     system      total        real
  4.359000   0.000000   4.359000 ( 10.255661)
  2.954000   0.000000   2.954000 (  6.286254)
```
I originally used `String.new`, but this triggered a Rubocop `Performance/UnfreezeString` error. I like `String.new` because its intention is clear and we only allocate once. But I also see the benefit of using `+""` as a rule app-wide if this is what the team prefers. Performance is important in Puma, and there are some places where this gain may be meaningful.
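For reference, a minimal example of what that cop flags (my own sketch; the cop ships with the rubocop-performance gem):

```ruby
# bad: flagged by Performance/UnfreezeString
buf = String.new
buf = "".dup

# good: unary plus returns an unfrozen string
buf = +""
```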
Thanks, it's obviously not very easily googleable 😆

Can you contribute a repro app where you see the same gain? I think it would be interesting for others to reproduce/play with this, to verify gains.
This was my take as well. It appeared that [...]
I'm happy to do that. My app uses Rails + Carrierwave + S3. I can bootstrap a simple, one-page Rails app with the same upload behavior. But I wonder if it would be worth even simplifying further to a pure Rails app with local [...]
To keep things as simple as possible, I'm building a small Sinatra app to demonstrate the same behavior. I have a proof of concept running. I'll circle back here when I have the final version pushed up with reproducible steps to demonstrate the memory consumption.
If it's just built in Sinatra, you could drop into the [...]
The demo app is up: https://github.com/willkoehler/puma_demo. I see the same essential memory behavior in this demo app as in the Rails app. However, Ruby garbage collection appears to be a bit more efficient in the demo app: memory doesn't grow in the same unbounded way over multiple uploads as I'm seeing in the Rails app. Still, the demo app shows a clear benefit with the PR vs Puma 6.0.2 during the data phase of the HTTP request.
Thanks for the demo app, I simplified it a tiny bit (just the [...]).

Puma 6.0.2: [memory output omitted]

This PR: [memory output omitted]
Forgot to mention, the file I uploaded above was 322 MB (337322761 bytes).
Nice 🙌 I like the trick of using [...]

There's a lot going on here to unpack memory-wise, and I want to make sure we're separating Puma's memory use from the rest of the app. In particular, once Puma hands the request off to Rack and Sinatra, I see additional memory allocation that we won't be able to control.

EDIT: Just skip to the next comment #3062 (comment). It's a lot clearer than what I tried to say below.

...but if you want to keep reading, here was my original thought: if we throttle the request bandwidth with [...]

With my PR I expect it will look like this: [output omitted]

I believe the last few memory logs in each of the samples above (where the memory usage jumps) are after Puma has handed the request off to Rack.

I'm going to update my app locally with your changes, reproduce your test, and see if I can add some clarity about where memory is being allocated and how this PR may help.
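One way to draw that line between Puma's data phase and the app (my own sketch, not code from this PR or the demo app; `RssLogger` is a hypothetical name) is a tiny Rack middleware that logs process RSS around the app call:

```ruby
# Sketch only: log resident set size (RSS) before and after the app runs, so
# growth during Puma's data phase can be separated from what the app allocates.
class RssLogger
  PAGE_SIZE = 4096 # typical Linux page size; /proc is Linux-only

  def initialize(app)
    @app = app
  end

  def call(env)
    warn "RSS before app: #{rss_mb} MB"
    response = @app.call(env)
    warn "RSS after app:  #{rss_mb} MB"
    response
  end

  private

  # The second field of /proc/self/statm is the resident page count.
  def rss_mb
    Integer(File.read("/proc/self/statm").split[1]) * PAGE_SIZE / (1024 * 1024)
  end
end
```

Enable it with `use RssLogger` in `config.ru`; on macOS you'd swap the `/proc` read for something like `ps -o rss= -p #{Process.pid}`.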
I updated my app and tested with [...]. Here are the results uploading a 115 MB video.

Run the app: [...]

Puma 6.0.2: [memory output omitted]

This PR: [memory output omitted]
With larger files, garbage collection sometimes kicks in and improves the memory story. But I still think this change has benefits: [list omitted]
Any chance we could commit this demo app to [...]?
By re-using a single read buffer (vs allocating a new buffer on each call to read_nonblock), we significantly reduce the memory used when receiving large request bodies, such as those generated by multipart/form-data requests with file uploads.
force-pushed from aff129c to 4806bd0
For sure 👍 I simplified the demo app and added it to [...]
force-pushed from 4806bd0 to 084eaff
Excellent. Thank you very much for your contribution.
Thanks everyone for the review and vetting 🙌 I'm running 5973781 in production. I'll report back if I see anything strange.
@willkoehler Can you please check this strange behaviour (#3145) produced by your PR?
@willkoehler, thank you for the demo-app. I used it to test [...]. I think this is a wonderful approach, except that now that the [...]. Good luck!
Hi @boazsegev, I'm glad the demo-app was helpful 🙌 This was my first Puma contribution, so the decision about the buffer size is above my "pay grade". However, it sounds plausible that a larger buffer size would be an improvement, and it's probably worth opening a separate issue to start the conversation. Feel free to tag me in the issue. I would be happy to explore the benefits with you and make a case for increasing the buffer size if it ends up improving performance or memory efficiency.
Hi @willkoehler and @nateberkopec,

As you know, I'm busy maintaining my own server (iodine)... however, please allow me to make a quick case for increasing Puma's buffer size, by comparing the 4 KB buffer to iodine's 8 KB buffer (for HTTP [...]).

Comparison performed on my machine, while using [...]:

[results omitted]
I assume non-CPU time is lost to system calls and waiting on IO, as the request is throttled. When the IO is slow enough, both show ~3-4% CPU usage. But when IO is faster (more saturated), the increased buffer means that the program spends more time running its own tasks (rather than system tasks).

Note that there are other differences (e.g., the iodine server is written in C), but at a quick glance it might indicate it's worthwhile to experiment with this.

I understand this might be better in a new issue, but I do not think I'll open one, as that would also imply the responsibility to contribute (which might take me a while to get to). Cheers!
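For anyone who wants a rough local feel for that buffer-size tradeoff, here is a hypothetical micro-benchmark (my own sketch, not the comparison above; `large_upload.bin` is an assumed test file):

```ruby
require "benchmark"

PATH = "large_upload.bin" # assumed: any large local file

[4 * 1024, 8 * 1024, 64 * 1024].each do |size|
  buf = +"" # reuse one buffer, as the PR does, while varying the read size
  elapsed = Benchmark.realtime do
    File.open(PATH, "rb") do |io|
      nil while io.read(size, buf) # IO#read fills buf in place, returns nil at EOF
    end
  end
  puts format("%6d-byte reads: %.3fs", size, elapsed)
end
```

This only exercises file IO, not sockets or syscall pressure under throttling, so treat any numbers as directional at best.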
Wondering whether this could be a config option? Maybe [...]. That allows for a typical config out-of-the-box, but users dealing with large files can increase it...
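If it did become an option, it might look something like this in `config/puma.rb` (purely hypothetical; `io_buffer_size` is not an existing Puma option, just an illustration of the idea):

```ruby
# config/puma.rb -- hypothetical option name, for illustration only
io_buffer_size 64 * 1024 # trade a little memory per connection for fewer reads
```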
The motivation for this change is to reduce memory used by a Rails application running on Heroku that handles large file uploads (~100 MB). In its current state, the application (running Puma 6.0.2 + Rails 5.2.8 + Ruby 2.7.6) starts using swap and triggers "memory quota exceeded" errors on the Heroku dyno after only a few file uploads.
This PR is a proof of concept that appears to significantly reduce the memory used by Puma during the data phase of the HTTP request. The change I made was to re-use a single read buffer in `Puma::Client#read_body` and `Puma::Client#read_chunked_body` vs allocating a new buffer on each call to `read_nonblock()` (a simplified sketch of the pattern follows the screenshots below).

I proved this change worked locally by throttling the network speed in Chrome dev tools and watching application memory usage as Puma received the file upload request. I then reproduced this experiment on a Heroku dyno, uploading a 100 MB file to my application while watching memory on the dyno with `top`. I grabbed screenshots just before Puma handed the request off to Rack to isolate any memory usage to Puma alone.

Puma v6.0.2 - Initial memory
Puma v6.0.2 - Memory at 95% uploaded
This PR - Initial memory
This PR - Memory at 95% uploaded
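To make the change concrete, here is a simplified sketch of the buffer-reuse pattern (my own illustration, not Puma's exact code; `CHUNK_SIZE` and `io` stand in for Puma's internals):

```ruby
# Before: each call allocates a fresh String for the chunk that was read.
data = io.read_nonblock(CHUNK_SIZE)

# After: IO#read_nonblock takes an optional buffer argument and fills it in
# place, so one String allocated in #initialize is reused for every read.
@read_buffer = +""
data = io.read_nonblock(CHUNK_SIZE, @read_buffer)
```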
To demonstrate this change had a positive impact on the application memory footprint overall (not just during the Puma data phase), I watched Heroku's memory usage graph while uploading 5 large (100 MB) files. This PR reduced memory usage by ~100 MB (a 25% reduction) compared to Puma 6.0.2. See screenshot below.
My biggest concern is that this solution may not be thread-safe. To test this, I ran Puma locally with `WEB_CONCURRENCY=0 bin/rails s` and performed 3 simultaneous 100 MB file uploads. Each of the files was received successfully.

One final note: I did not make a change to the `read_nonblock` call in `Puma::Client#try_to_finish`. Doing so creates a shared-memory conflict between `data`, `@buffer`, and `@read_buffer` (they are all just references to the same thing). This would need to be fixed by duping `data` before assigning it to `@buffer`. I felt that this additional complexity may be confusing and easily broken in the future.
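A hypothetical illustration of that aliasing problem (not Puma's actual code):

```ruby
@read_buffer = +""

data = io.read_nonblock(4096, @read_buffer) # data.equal?(@read_buffer) => true
@buffer = data                              # @buffer now aliases the read buffer

io.read_nonblock(4096, @read_buffer)        # the next read clobbers @buffer too

# The fix would be to copy before storing:
@buffer = data.dup
```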
Your checklist for this pull request

- If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
- If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.