
document that memory exhaustion is possible when using parallelism #1189

Closed

domenukk opened this issue Feb 7, 2019 · 4 comments
Labels
doc (An issue with or an improvement to documentation.)

Comments


domenukk commented Feb 7, 2019

What version of ripgrep are you using?

ripgrep 0.10.0 (rev 8a7db1a)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

Precompiled msvc binary for Windows-x64

What operating system are you using ripgrep on?

Windows 7, some current patch level

Describe your question, feature request, or bug.

If I redirect the ripgrep output to a file on Windows, the memory usage of rg.exe increases slowly but steadily, probably with each match found.
After a few GB, rg.exe then crashes with a segmentation fault.
In procmon, I can see that the results only seem to be flushed to the file after the crash.
A workaround is to force single threading with -j1. This seems to be directly related to this old issue

Or are the reads and hits simply too fast to be written to disk when multi-threaded?
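
Concretely, forcing a single thread just means adding -j1 to the reproduction command shown further down; everything else stays the same:

$ ./rg.exe -j1 -uu --no-filename -o '[a-zA-Z0-9.%&’*+/=?^_`{}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*' ./Passwords/ > all_mails.txt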

If this is a bug, what are the steps to reproduce the behavior?

Trying to extract all email addresses from a large, recent password leak with 130k files, 8k folders, and 1.62 TB in total (with varying file sizes) crashes rg.

Inside git bash, run:

$ ./rg.exe -uu --no-filename -o '[a-zA-Z0-9.%&’*+/=?^_`{}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*' ./Passwords/ > all_mails.txt

For obvious reasons, I cannot include the corpus here.

If this is a bug, what is the actual behavior?

$ ./rg.exe --debug -uu --no-filename -o '[a-zA-Z0-9.%&’*+/=?^_`{}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*' ./Passwords/ > all_mails.txt
DEBUG|grep_regex::literal|grep-regex\src\literal.rs:110: required literal found: "@"
DEBUG|globset|globset\src\lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset\src\lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset\src\lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset\src\lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset\src\lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
memory allocation of 5795684597665262959 bytes failed
Segmentation fault

If this is a bug, what is the expected behavior?

The user should be able to run rg multi-threaded on any kind of large dataset.
Maybe a synchronization point when the buffer gets too big, or the choice to disable caching (I don't need the output in order, for example), could be options.

@domenukk changed the title from "Buffers never seem to be flushed on multithreading" to "Buffers don't seem to be flushed (quickly enough?) on multithreading" Feb 7, 2019
BurntSushi (Owner) commented:

This happens when the search results for a single file exceed the amount of memory available. This is fundamentally a result of combining parallelism and the requirement that the output of each file is not interleaved. It sounds like you said you'd be OK with the output from different files being interleaved, but I'm not keen on adding this option to ripgrep. Instead, you have a few work-arounds available to you, assuming we've correctly diagnosed the problem:

  • Use parallelism via xargs or some other tool, just like you would with standard grep, e.g., find ./ -print0 | xargs -0 rg foo (see the sketch after this list).
  • Use ripgrep's -m/--max-count flag to limit the number of matching lines it prints per file. This could be quite undesirable since it makes it hard to know whether a match was missed or not.
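
As a rough sketch of the xargs approach (the process count, batch size, and pattern here are only placeholders, not a recommendation from ripgrep's docs): let xargs run several rg invocations in parallel with -P, and force each invocation to a single thread with -j1, so no single process has to buffer an entire file's results before printing them. Output from concurrent invocations may interleave, which the reporter indicated is acceptable here:

$ find ./Passwords/ -type f -print0 | xargs -0 -P 4 -n 32 rg -j1 --no-filename -o 'PATTERN' > all_mails.txt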

@BurntSushi added the "question" label (An issue that is lacking clarity on one or more points.) Feb 7, 2019

domenukk commented Feb 7, 2019

Thank you for your quick response! Awesome!
Indeed, there is one file of about 80 GB, which might match your description of a single large file pretty well.
I guess I'll try my luck with the workarounds then and see what happens.

That being said, it could still be handy to have a "single file, multiple workers, as fast as possible, interleave if you must" grep mode for these rare occasions - as long as a single result always arrives in one piece in the output. But the occasion might be rare enough.

And a different idea: print which files it choked on when it crashed (just to have fewer GitHub support issues).

Keep up the good work ;)

(Oh, off-topic: that it tries to allocate 5 exabytes seems like a fun overflow somewhere below the Rust layer...)

BurntSushi (Owner) commented:

Rust's standard library doesn't allow one to easily recover from allocation failure, so it's impractical to print the file on which this choked. Of course, I agree it would be nice to improve failure modes, but I think we're stuck here.

Your proposed option is undoubtedly handy, but it's not a good fit since it's an extraordinarily niche feature with some simple work-arounds.

I'm going to mark this ticket as a doc bug, find a place to add a note about memory exhaustion being possible when parallelism is enabled, and document the work-arounds.

@BurntSushi added the "doc" label (An issue with or an improvement to documentation.) and removed the "question" label (An issue that is lacking clarity on one or more points.) Feb 7, 2019
@BurntSushi changed the title from "Buffers don't seem to be flushed (quickly enough?) on multithreading" to "document that memory exhaustion is possible when using parallelism" Feb 7, 2019
BurntSushi (Owner) commented:

Note that memory exhaustion is not unique to ripgrep, or even parallelism. Both grep and ripgrep are subject to memory exhaustion errors when searching files that have a single line that exceeds available memory. For example, something like grep ZQZQZQZQZQ /dev/sda -c -a is one way I've caused this to happen fairly reliably.
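
For anyone who wants to reproduce this without searching a raw device, one rough sketch (the file name and the 64G size are arbitrary; the size has to exceed available memory to trigger the failure, and the file will take up that much disk space) is to create a file consisting of a single enormous line and then search it for a pattern that never matches:

$ yes ZQ | tr -d '\n' | head -c 64G > one-long-line.txt
$ rg -c NOTPRESENT one-long-line.txt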
