
high peak memory usage with parallel breadth first directory traversal #1550

Closed
Shnatsel opened this issue Apr 12, 2020 · 6 comments · Fixed by #1554
Labels
bug A bug.

Shnatsel commented Apr 12, 2020

What version of ripgrep are you using?

ripgrep 12.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo install ripgrep on rustc 1.42.0 (b8cedc004 2020-03-09)

What operating system are you using ripgrep on?

Linux, derived from Ubuntu 18.04

Describe your question, feature request, or bug.

rg seems to leak memory when searching for an expression in the latest versions of all crates from crates.io. The memory usage gradually increases and peaks at around 1GB.

If this is a bug, what are the steps to reproduce the behavior?

# get all crates from crates.io
git clone https://github.com/the-lean-crate/criner && cd criner && cargo run --release -- mine
# get latest version of every crate
for dir in ./criner.db/assets/*/*/*/; do echo -n "$dir"; ls "$dir" | grep '\.crate' | sort -Vr | head -n 1; done > ./latest-versions
mkdir crates
cd crates
# decompress the latest version of every crate
cat ../latest-versions | parallel ../decompress-one-crate.sh

where decompress-one-crate.sh is

#!/bin/sh
cat "$1" | tar --gunzip -xf -

Sample rg invocation exhibiting the behavior is:

rg --iglob '*.rs' -iF '#[allow('

If this is a bug, what is the actual behavior?

Memory usage gradually increases and accumulates as the search progresses, reaching as high as 1GB.

If this is a bug, what is the expected behavior?

Memory usage should not accumulate.

Shnatsel (Author)

Using a regex instead of a simple pattern doesn't seem to make a difference - this one behaves identically: rg --iglob '*.rs' -U "enum[^\\n]+\\{\\n[^\\n]+\\n\\}"

Shnatsel (Author)

Downloading all of crates.io takes 50GB of disk space, and the latest version of every crate, uncompressed, comes to 18GB. If you don't want to download all that, point me to a memory profiler and I can run it myself now that I have the corpus locally.


BurntSushi commented Apr 12, 2020

I've actually been meaning to download all crates, since it makes for a nice test corpus. So I will investigate this. Thank you for making it easy.

Without investigating, I'm pretty sure I know the culprit. If I'm right, then it's less of a leak and more of a conscious trade-off. Namely, ripgrep must buffer enough room to store at least one line contiguously in memory. If you run across a very long line, then the buffer expands to that capacity, but it will never shrink itself. ripgrep uses one of these buffers per thread, so if each buffer has to expand to a large size, none of them ever contracts. While contracting won't reduce per-thread peak memory usage, it could reduce the overall process peak memory usage in most cases.

With that said, there is no particular reason why the behavior must be this way. It's just that I've never had a good test case for it. Some kind of contraction is probably wise. Although whether contraction actually results in memory being freed back to the OS is perhaps a separate issue that I suspect is mostly out of the control of ripgrep.
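For illustration, here's a toy sketch of the trade-off described above. This is not ripgrep's actual buffer code, and the 8 MB threshold is made up:

// Toy sketch, not ripgrep's buffer implementation: a per-thread line buffer
// must hold at least one full line contiguously, so one pathological line
// forces it to grow, and without an explicit contraction step that capacity
// sticks around for the lifetime of the thread.
fn main() {
    let mut buf: Vec<u8> = Vec::new();

    // Simulate hitting a single very long (64 MB) line.
    buf.resize(64 * 1024 * 1024, b'a');
    buf.clear();

    // The allocation remains even though the buffer is logically empty.
    assert!(buf.capacity() >= 64 * 1024 * 1024);

    // A possible contraction strategy: after finishing a file, release the
    // excess if capacity has grown past some (hypothetical) threshold.
    // Whether the allocator actually hands this memory back to the OS is,
    // as noted above, out of ripgrep's control.
    const THRESHOLD: usize = 8 * 1024 * 1024;
    if buf.capacity() > THRESHOLD {
        buf.shrink_to_fit();
    }
}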

But yeah, I'll do some investigation to confirm my hypothesis. Thanks for the report!

@BurntSushi BurntSushi added bug A bug. question An issue that is lacking clarity on one or more points. labels Apr 12, 2020
@BurntSushi BurntSushi removed the question An issue that is lacking clarity on one or more points. label Apr 18, 2020
@BurntSushi BurntSushi changed the title Memory leak when searching all of crates.io high peak memory usage with parallel breadth first directory traversal Apr 18, 2020

BurntSushi commented Apr 18, 2020

OK, I did some experimenting and my guess was wrong.

It turns out that this is a result of how ripgrep implements parallel directory traversal. Namely, since it uses a queue to communicate new work among threads, this results in a breadth first traversal. In this case, the initial directory is extremely wide, because it contains the latest version of every crate. Upon entering each directory, a matcher for processing gitignore rules is created. Since this is breadth first, a gitignore matcher is created for every directory before ripgrep ever starts to process a single file. This is primarily what contributes to a large peak memory usage. You can test this hypothesis by searching without parallelism (with -j1) or by disabling gitignore exclusions (with -u/--no-ignore). Both result in very small peak memory usage.
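To make the traversal-order point concrete, here is a minimal single-threaded sketch (not ripgrep's actual walker, which runs this loop across many threads): the only difference between the two orders is whether work is popped from the front or the back of the work list.

use std::collections::VecDeque;
use std::path::PathBuf;

// `depth_first` controls only which end of the work list we pop from; with a
// very wide root directory, the breadth first variant queues up (and, in
// ripgrep, builds a gitignore matcher for) every subdirectory before the
// first file is ever searched.
fn walk(root: PathBuf, depth_first: bool) -> std::io::Result<()> {
    let mut work: VecDeque<PathBuf> = VecDeque::new();
    work.push_back(root);
    while let Some(dir) = if depth_first { work.pop_back() } else { work.pop_front() } {
        // Roughly where ripgrep would build a gitignore matcher for `dir`.
        for entry in std::fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                work.push_back(path);
            } else {
                println!("search {}", path.display());
            }
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    walk(PathBuf::from("."), true)
}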

Fixing this is unfortunately not quite so easy. @bmalehorn was actually barking up the right tree a few years ago, and even submitted a PR to switch the traversal to depth first in #450. But the upsides didn't seem so apparent to me then, and it appeared to change the output order a bit too much for my taste. But with this example in front of me, it seems clearer that switching to a depth first traversal is better than what we have today.

Since then, though, Crossbeam seems to no longer provide a thread-safe stack. One can be implemented without much code, but I'm hesitant to add unsafe, especially in an area of data structures that is outside my area of expertise.

So I think this bug is blocked on either rethinking how traversal works (I've always wanted to use rayon) or switching to a depth first traversal with minimal risk from unsafe.

Shnatsel (Author)

I've consulted my friendly neighbourhood god of concurrency, Acrimon, and their recommendation for a concurrent stack was Mutex<Vec> because a Treiber stack is not meaningfully faster than that. Faster concurrent stack implementations are possible, but "no one is really insane enough to do this stuff".
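A rough sketch of what that could look like (Work and WorkStack here are made-up stand-ins, not the crawler's actual types):

use std::path::PathBuf;
use std::sync::{Arc, Mutex};

// Made-up stand-in for the crawler's unit of work; the real type carries
// more state (ignore matchers, depth, and so on).
struct Work {
    path: PathBuf,
}

// A clonable handle to a shared LIFO work list: just a Vec behind a Mutex.
#[derive(Clone)]
struct WorkStack {
    inner: Arc<Mutex<Vec<Work>>>,
}

impl WorkStack {
    fn new() -> WorkStack {
        WorkStack { inner: Arc::new(Mutex::new(Vec::new())) }
    }

    fn push(&self, work: Work) {
        self.inner.lock().unwrap().push(work);
    }

    // Last in, first out, which gives a depth first order when the walker
    // pushes the children of whatever it just popped.
    fn pop(&self) -> Option<Work> {
        self.inner.lock().unwrap().pop()
    }
}

fn main() {
    let stack = WorkStack::new();
    stack.push(Work { path: PathBuf::from("crates") });
    stack.push(Work { path: PathBuf::from("crates/foo") });
    assert_eq!(stack.pop().unwrap().path, PathBuf::from("crates/foo"));
}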

Listing all files up front before creating the gitignore handlers would solve this, but also create a different kind of scalability limit with all the file paths being loaded in memory at once, and I'm not sure how that interacts with filesystem cache behavior.


A Mutex<Vec<Work>> is indeed worth trying and shouldn't be too hard to do. It is plausible that we won't be able to observe a performance difference.

Listing all files up front is another way of looking at it, but it is a major architectural change and it is difficult for me to reason about its performance characteristics.

BurntSushi added a commit that referenced this issue Apr 18, 2020
This replaces the use of channels in the parallel directory traversal
with a simple stack. The primary motivation for this change is to reduce
peak memory usage. In particular, when using a channel (which is a
queue), we wind up visiting files in a breadth first fashion. Using a
stack switches us to a depth first traversal. While there are no real
intrinsic differences, depth first traversal generally tends to use less
memory because directory trees are more commonly wide than they are
deep.

In particular, the queue/stack size itself is not the only concern. In
one recent case documented in #1550, a user wanted to search all Rust
crates. The directory structure was shallow but extremely wide, with a
single directory containing all crates. This in turn results in
descending into each of those directories and building a gitignore
matcher for each (since most crates have `.gitignore` files) before ever
searching a single file. This means that ripgrep has all such matchers
in memory simultaneously, which winds up using quite a bit of memory.

In a depth first traversal, peak memory usage is much lower because
gitignore matchers are built and discarded more quickly. In the case of
searching all crates, the peak memory usage decrease is dramatic. On my
system, it shrinks by an order of magnitude, from almost 1GB to 50MB. The
decline in peak memory usage is consistent across other use cases as
well, but is typically more modest. For example, searching the Linux
repo has a 50% decrease in peak memory usage and searching the Chromium
repo has a 25% decrease in peak memory usage.

Search times generally remain unchanged, although some ad hoc benchmarks
that I typically run have gotten a bit slower. As far as I can tell,
this appears to be a result of scheduling changes. Namely, the depth first
traversal seems to result in searching some very large files towards the
end of the search, which reduces the effectiveness of parallelism and
makes the overall search take longer. This seems to suggest that a stack
isn't optimal. It would instead perhaps be better to prioritize
searching larger files first, but it's not quite clear how to do this
without introducing more overhead (getting the file size for each file
requires a stat call).

Fixes #1550
BurntSushi added a commit that referenced this issue May 9, 2020