Reduce memory usage for large file uploads #3062
Conversation
Someone check me here, but I believe Client objects are only accessed by the Reactor thread, so I don't think there can be a thread-safety issue here, because this state is not shared.
```diff
@@ -98,6 +98,8 @@ def initialize(io, env=nil)
       @body_remain = 0

       @in_last_chunk = false
+
+      @read_buffer = +""
```
What is this? I haven't seen this syntax before.
It's the short syntax for a non-frozen String (for when string literals are frozen by default). Same as `String.new`, which would read better IMHO.
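A minimal illustration of the semantics (my own sketch, not from this diff):

```ruby
# frozen_string_literal: true

s = "text"     # with the magic comment, string literals are frozen
t = +"text"    # String#+@ returns an unfrozen copy (or self if already unfrozen)
u = String.new # always a fresh, unfrozen, empty string

s.frozen? # => true
t.frozen? # => false
u.frozen? # => false
```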
+""
also has better performance.
```ruby
Benchmark.bm do |bm|
  bm.report { 100000000.times { String.new } }
  bm.report { 100000000.times { +'' } }
end
```

```
      user     system      total        real
  4.359000   0.000000   4.359000 ( 10.255661)
  2.954000   0.000000   2.954000 (  6.286254)
```
I originally used `String.new`, but this triggered a Rubocop `Performance/UnfreezeString` error. I like `String.new` because its intention is clear and we only allocate once. But I also see the benefit of using `+""` as a rule app-wide if this is what the team prefers. Performance is important in Puma, and there are some places where this gain may be meaningful.
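For reference, a minimal example of what that cop flags (my own sketch; the cop ships with the rubocop-performance gem):

```ruby
# bad: flagged by Performance/UnfreezeString
buf = String.new
buf = "".dup

# good: unary plus returns an unfrozen string
buf = +""
```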
Thanks, it's obviously not very easily googleable 😆

Can you contribute a repro app where you see the same gain? I think it would be interesting for others to reproduce/play with this, to verify gains.
This was my take as well. It appeared that [...]
I'm happy to do that. My app uses Rails + Carrierwave + S3. I can bootstrap a simple, one-page Rails app with the same upload behavior. But I wonder if it would be worth even simplifying further to a pure Rails app with local [...]
To keep things as simple as possible, I'm building a small Sinatra app to demonstrate the same behavior. I have a proof of concept running. I'll circle back here when I have the final version pushed up with reproducible steps to demonstrate the memory consumption.
If it's just built in Sinatra, you could drop into the [...]
The demo app is up: https://github.com/willkoehler/puma_demo. I see the same essential memory behavior in this demo app as in the Rails app. However, Ruby garbage collection appears to be a bit more efficient in the demo app: memory doesn't grow in the same unbounded way over multiple uploads as I'm seeing in the Rails app. Still, the demo app shows a clear benefit with the PR vs Puma 6.0.2 during the data phase of the HTTP request.
Thanks for the demo app, I simplified it a tiny bit (just the [...]).

Puma 6.0.2: [memory output omitted]

This PR: [memory output omitted]
Forgot to mention, the file I uploaded above was 322 MB (337322761 bytes).
Nice 🙌 I like the trick of using [...]

There's a lot going on here to unpack memory-wise, and I want to make sure we're separating Puma's memory use from the rest of the app. In particular, once Puma hands the request off to Rack and Sinatra, I see additional memory allocation that we won't be able to control.

EDIT: Just skip to the next comment #3062 (comment). It's a lot clearer than what I tried to say below.

...but if you want to keep reading, here was my original thought: if we throttle the request bandwidth with [...]

With my PR I expect it will look like this: [output omitted]

I believe the last few memory logs in each of the samples above (where the memory usage jumps) are after Puma has handed the request off to Rack.

I'm going to update my app locally with your changes, reproduce your test, and see if I can add some clarity about where memory is being allocated and how this PR may help.
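One way to draw that line between Puma's data phase and the app (my own sketch, not code from this PR or the demo app; `RssLogger` is a hypothetical name) is a tiny Rack middleware that logs process RSS around the app call:

```ruby
# Sketch only: log resident set size (RSS) before and after the app runs, so
# growth during Puma's data phase can be separated from what the app allocates.
class RssLogger
  PAGE_SIZE = 4096 # typical Linux page size; /proc is Linux-only

  def initialize(app)
    @app = app
  end

  def call(env)
    warn "RSS before app: #{rss_mb} MB"
    response = @app.call(env)
    warn "RSS after app:  #{rss_mb} MB"
    response
  end

  private

  # The second field of /proc/self/statm is the resident page count.
  def rss_mb
    Integer(File.read("/proc/self/statm").split[1]) * PAGE_SIZE / (1024 * 1024)
  end
end
```

Enable it with `use RssLogger` in `config.ru`; on macOS you'd swap the `/proc` read for something like `ps -o rss= -p #{Process.pid}`.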
I updated my app and tested with [...]. Here are the results uploading a 115 MB video.

Run the app: [...]

Puma 6.0.2: [memory output omitted]

This PR: [memory output omitted]
With larger files, garbage collection sometimes kicks in and improves the memory story. But I still think this change has benefits: [list omitted]
Any chance we could commit this demo app to [...]?
By re-using a single read buffer (vs allocating a new buffer on each call to read_nonblock), we significantly reduce the memory used when receiving large request bodies, such as those generated by multipart/form-data requests with file uploads.
force-pushed from aff129c to 4806bd0
For sure 👍 I simplified the demo app and added it to [...]
force-pushed from 4806bd0 to 084eaff
Excellent. Thank you very much for your contribution.
Thanks everyone for the review and vetting 🙌 I'm running 5973781 in production. I'll report back if I see anything strange.
@willkoehler Can you please check this strange behaviour (#3145) produced by your PR?
@willkoehler, thank you for the demo-app. I used it to test [...]. I think this is a wonderful approach, except that now that the [...]. Good luck!
Hi @boazsegev, I'm glad the demo-app was helpful 🙌 This was my first Puma contribution, so the decision about the buffer size is above my "pay grade". However, it sounds plausible that a larger buffer size would be an improvement, and it's probably worth opening a separate issue to start the conversation. Feel free to tag me in the issue. I would be happy to explore the benefits with you and make a case for increasing the buffer size if it ends up improving performance or memory efficiency.
Hi @willkoehler and @nateberkopec,

As you know, I'm busy maintaining my own server (iodine)... however, please allow me to make a quick case for increasing Puma's buffer size, by comparing the 4 KB buffer to iodine's 8 KB buffer (for HTTP [...]).

Comparison performed on my machine, while using [...]:

[results omitted]
I assume non-CPU time is lost to system calls and waiting on IO, as the request is throttled. When the IO is slow enough, both show ~3-4% CPU usage. But when IO is faster (more saturated), the increased buffer means that the program spends more time running its own tasks (rather than system tasks).

Note that there are other differences (e.g., the iodine server is written in C), but at a quick glance it might indicate it's worthwhile to experiment with this.

I understand this might be better in a new issue, but I do not think I'll open one, as that would also imply the responsibility to contribute (which might take me a while to get to). Cheers!
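For anyone who wants a rough local feel for that buffer-size tradeoff, here is a hypothetical micro-benchmark (my own sketch, not the comparison above; `large_upload.bin` is an assumed test file):

```ruby
require "benchmark"

PATH = "large_upload.bin" # assumed: any large local file

[4 * 1024, 8 * 1024, 64 * 1024].each do |size|
  buf = +"" # reuse one buffer, as the PR does, while varying the read size
  elapsed = Benchmark.realtime do
    File.open(PATH, "rb") do |io|
      nil while io.read(size, buf) # IO#read fills buf in place, returns nil at EOF
    end
  end
  puts format("%6d-byte reads: %.3fs", size, elapsed)
end
```

This only exercises file IO, not sockets or syscall pressure under throttling, so treat any numbers as directional at best.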
Wondering whether this could be a config option? Maybe [...]. That allows for a typical config out-of-the-box, but users dealing with large files can increase it...
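If it did become an option, it might look something like this in `config/puma.rb` (purely hypothetical; `io_buffer_size` is not an existing Puma option, just an illustration of the idea):

```ruby
# config/puma.rb -- hypothetical option name, for illustration only
io_buffer_size 64 * 1024 # trade a little memory per connection for fewer reads
```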
The motivation for this change is to reduce memory used by a Rails application running on Heroku that handles large file uploads (~100 MB). In its current state, the application (running Puma 6.0.2 + Rails 5.2.8 + Ruby 2.7.6) starts using swap and triggers "memory quota exceeded" errors on the Heroku dyno after only a few file uploads.
This PR is a proof of concept that appears to significantly reduce the memory used by Puma during the data phase of the HTTP request. The change I made was to re-use a single read buffer in `Puma::Client#read_body` and `Puma::Client#read_chunked_body` vs allocating a new buffer on each call to `read_nonblock()` (a simplified sketch of the pattern follows the screenshots below).

I proved this change worked locally by throttling the network speed in Chrome dev tools and watching application memory usage as Puma received the file upload request. I then reproduced this experiment on a Heroku dyno, uploading a 100 MB file to my application while watching memory on the dyno with `top`. I grabbed screenshots just before Puma handed the request off to Rack to isolate any memory usage to Puma alone.

Puma v6.0.2 - Initial memory
Puma v6.0.2 - Memory at 95% uploaded
This PR - Initial memory
This PR - Memory at 95% uploaded
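To make the change concrete, here is a simplified sketch of the buffer-reuse pattern (my own illustration, not Puma's exact code; `CHUNK_SIZE` and `io` stand in for Puma's internals):

```ruby
# Before: each call allocates a fresh String for the chunk that was read.
data = io.read_nonblock(CHUNK_SIZE)

# After: IO#read_nonblock takes an optional buffer argument and fills it in
# place, so one String allocated in #initialize is reused for every read.
@read_buffer = +""
data = io.read_nonblock(CHUNK_SIZE, @read_buffer)
```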
To demonstrate this change had a positive impact on the application memory footprint overall (not just during the Puma data phase), I watched Heroku's memory usage graph while uploading 5 large (100 MB) files. This PR reduced memory usage by ~100 MB (a 25% reduction) compared to Puma 6.0.2. See screenshot below.
My biggest concern is that this solution may not be thread-safe. To test this, I ran Puma locally with `WEB_CONCURRENCY=0 bin/rails s` and performed 3 simultaneous 100 MB file uploads. Each of the files was received successfully.

One final note: I did not make a change to the `read_nonblock` call in `Puma::Client#try_to_finish`. Doing so creates a shared-memory conflict between `data`, `@buffer`, and `@read_buffer` (they are all just references to the same thing). This would need to be fixed by duping `data` before assigning it to `@buffer`. I felt that this additional complexity may be confusing and easily broken in the future.
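A hypothetical illustration of that aliasing problem (not Puma's actual code):

```ruby
@read_buffer = +""

data = io.read_nonblock(4096, @read_buffer) # data.equal?(@read_buffer) => true
@buffer = data                              # @buffer now aliases the read buffer

io.read_nonblock(4096, @read_buffer)        # the next read clobbers @buffer too

# The fix would be to copy before storing:
@buffer = data.dup
```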
Your checklist for this pull request

- If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
- If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.