Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should --vimgrep run in single threaded mode by default? #2505

Closed
voidus opened this issue May 5, 2023 · 11 comments
Closed

should --vimgrep run in single threaded mode by default? #2505

voidus opened this issue May 5, 2023 · 11 comments
Labels
question An issue that is lacking clarity on one or more points. rollup A PR that has been merged with many others in a rollup.

Comments

@voidus
Copy link

voidus commented May 5, 2023

I think I'm hitting #999 again

What version of ripgrep are you using?

ripgrep 13.0.0

How did you install ripgrep?

Via nix' home-manager @nixpkgs rev 402cc3633cc60dfc50378197305c984518b30773

What operating system are you using ripgrep on?

Arch linux with a lot of stuff (including rg) coming from nix

Describe your bug.

Project info:

98M	.
incl gitignore
5717
90682
excl gitignore
410
27827

 > nix run nixpkgs#time -- -v rg a | wc -l        
	Command being timed: "rg a"
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 160%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.01
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 10880
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2214
	Voluntary context switches: 211
	Involuntary context switches: 29
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
14008

> nix run nixpkgs#time -- -v rg --vimgrep a | wc -l
Command terminated by signal 9
	Command being timed: "rg --vimgrep a"
	User time (seconds): 4.67
	System time (seconds): 35.86
	Percent of CPU this job got: 85%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:47.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 26359456
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 15488
	Minor (reclaiming a frame) page faults: 7088456
	Voluntary context switches: 467064
	Involuntary context switches: 52122
	Swaps: 0
	File system inputs: 1722872
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
89447

26Gigs of memory to search through a hundred megs of stuff seems like an issue.

I ^c'd the vimgrep one once my 32GB of memory + 18GB of swap were full.

I'm running this in this subfolder: https://gitlab.com/sea-watch.org/planner/-/tree/main/backoffice but I'm working in this repo so it might not be the same.
I did move the big .direnv folder with the python virtualenvs out of the way to run the above test.

I'd be more than happy to debug this further if you have any guidance.

@BurntSushi
Copy link
Owner

This is just #999 but with different data. The walkthrough I did in that issue is precisely relevant to your case.

First, let's capture all of the results (in my case, --vimgrep uses less memory than it does for you, but it's still an exorbitant amount):

$ time rg a > /tmp/rg2505-normal.txt

real    0.020
user    0.027
sys     0.018
maxmem  13 MB
faults  0

$ time rg a --vimgrep > /tmp/rg2505-vimgrep.txt

real    6.090
user    1.063
sys     8.381
maxmem  18370 MB
faults  0

OK, that seems crazy, but how big is the actual output?

$ ls -lh /tmp/rg2505-*.txt
-rw-rw-r-- 1 andrew users 2.7M May  5 10:10 /tmp/rg2505-normal.txt
-rw-rw-r-- 1 andrew users  18G May  5 10:10 /tmp/rg2505-vimgrep.txt

Well, okay, there's your problem. Case closed. ripgrep is using a lot of memory because you've asked it to. Note the docs of --vimgrep:

--vimgrep                                
    Show results with every match on its own line, including line numbers and
    column numbers. With this option, a line with more than one match will be
    printed more than once.

That last sentence is telling you what's happening here: for every match ripgrep finds, the entire line is printed. This is presumably what editors like vim expect.

This means that even if your corpus is very small, the output as a result of --vimgrep can still be extremely large. Indeed, the relationship is worst case quadratic. You can compute the worst case concretely by multiplying the total size of the corpus (4.5 MB in this case it looks like) by the total number of bytes in the corpus (4,500,000). That gives us about 20 GB here, which is a bit more than what ripgrep actually used. It's not a precise bound, but it's pretty close.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale May 5, 2023
@BurntSushi BurntSushi added the duplicate An issue that is duplicative of another. label May 5, 2023
@voidus
Copy link
Author

voidus commented May 5, 2023

Yes, that makes sense. Sorry for not reading the other issue close enough.

I am not sure I understand the memory usage though. Shouldn't this be somewhat streamable? The output size being this large makes total sense, but the resident memory should be manageable, right?

@BurntSushi
Copy link
Owner

That's also discussed (very briefly) in #999 too. Output from each file is buffered in memory before being printed. There really isn't another choice if you want to prevent interleaving, which we do. If you disable parallelism then ripgrep will of course stream the output to stdout. For example:

$ time rg a --vimgrep > /tmp/rg2505-vimgrep.txt

real    6.080
user    0.959
sys     8.501
maxmem  18372 MB
faults  0

$ time rg -j1 a --vimgrep > /tmp/rg2505-vimgrep.txt

real    3.365
user    0.017
sys     3.340
maxmem  9 MB
faults  0

@voidus
Copy link
Author

voidus commented May 5, 2023

Thank you for pointing this out again. I know how annoying this kind of work is sometimes so I really appreciate you taking the time to answer in this way.

Do you think it might be reasonable to make -j1 default when --vimgrep is given? It seems to me that a bunch of people are running into this.

I personally thought it was just my 16Gigs beeing too little memory, so I upgraded and was a little surprised to see the same problem.

@BurntSushi
Copy link
Owner

Hmmm. I don't know, to be honest. It's an interesting idea. I'll re-open the issue. I'll note the following though:

  1. Text editor plugins and what not should be able to make the choice of whether to set -j1 or not without ripgrep doing it.
  2. It's not at all clear to me that making -j1 the default is the right trade off. If 99.999% of all rg --vimgrep runs use "reasonable" memory, then setting -j1 by default is likely to result in a massive performance regression in the overwhelmingly common case.
  3. --vimgrep is itself a strange thing, and is IMO legacy. It would probably be better if text editor plugins used ripgrep's --json flag to get structured matches. This is what VS Code does, for example.

@BurntSushi BurntSushi reopened this May 5, 2023
@BurntSushi BurntSushi added question An issue that is lacking clarity on one or more points. and removed duplicate An issue that is duplicative of another. labels May 5, 2023
@BurntSushi BurntSushi changed the title --vimgrep memory usage again should --vimgrep run in single threaded mode by default? May 5, 2023
@voidus
Copy link
Author

voidus commented May 5, 2023

Yeah I agree, it's a weird decision to make. For very short inputs, it's probably sensible, for longer ones it probably isn't. But making it conditional on the input length is weird.

I'll push to add -j1 by default in the telescope threads, I think this is where most people are bitten by this since it searches on every input. So if you type abc, it'll kick off a rg --vimgrep a in the background (at least sometimes, I think)

Again, thank you for your answer even though it's an old topic <3

@BurntSushi
Copy link
Owner

So if you type abc, it'll kick off a rg --vimgrep a in the background (at least sometimes, I think)

Ah! I see. Yeah I don't actually use any ripgrep text editor plugins (except I guess for fzf but it only uses ripgrep to list out the files, not for searching), so I don't have a good sense of how they actually work. But yeah, indeed, rg --vimgrep a is a dangerous thing to run!

This is indeed a tricky situation. I do feel like --json is probably the right answer here. --vimgrep feels fatally flawed unfortunately. There are other various things that could be done, but they all feel yucky:

  • As you mention, set -j1 based on needle length. I'd venture needle.len() <= 2 => -j1 might be good enough. Once you get beyond two bytes (assuming it's a literal), match frequency tends to drop off quite a bit in non-pathological cases. Of course, you could run a regex like .*, but then you get what you deserve I guess? Hah.
  • We could add an option to ripgrep that caps the size of the output buffer used by each thread. And if ripgrep would exceed that size it.... emits an error? (I don't like this for a lot of reasons.)
  • Don't use --vimgrep and instead just parse the normal output. Maybe with --column? I don't know how these editor plugins work and whether they really really need --vimgrep. This might be easier than going full hog with --json.

@voidus
Copy link
Author

voidus commented May 5, 2023

Good points, I guess maybe it's time to shift the whole ecosystem away from the --vimgrep format...

Without knowing the actual code or any of the pitfalls, wouldn't it be possible to have limited buffers with blocking until they are consumed? (I'm completely ignoring a possibly huge refactoring here of course)
I think the keyword is backpressure, but I only know it in the abstract, not if it would be a fit here or the intricacies of actually implementing it.

@BurntSushi
Copy link
Owner

In the abstract it is theoretically possible I believe. Playing it forward: All of the matches from a particular file need to be emitted in one contiguous chunk. Backpressure could work if one particular file resulted in a large output buffer and backpressure was based on total memory consumption. In that case, you'd wait for some of the other output buffers to get printed and then allocate their space to the one that is stuck because it ran out of room. At that point, you'd need to refactor the code to dynamically switch to streaming just the output of that one file to stdout, and once complete, switch back to the normal parallel strategy of buffering output. But what if there are multiple output buffers that are at or near capacity? You'd have to stop the entire search and focus on one buffer at a time until you've drained the buffers at capacity, and then resume the standard parallel search. In the worst case (and I kind of expect rg a --vimgrep to provoke exactly that), it will devolve to single threaded search, but with a lot of overhead.

The overall strategy is infeasible to implement from my perspective. More generally, adding backpressure to a system that wasn't designed for it at the start is quite difficult. And it probably doesn't really solve the problem here any better than using -j1 does with one exception: it would dynamically switch to -j1 only when needed. But popping up a level, all of this is a result of incidental complexity caused by how editor plugins use ripgrep in the first place. So it's definitely not worth adding backpressure IMO.

@bluss
Copy link

bluss commented Aug 2, 2023

Does --max-columns 4096 help on the test case for this issue? From my reading of my vim docs, grepprg functionality does not look past the first 4096 bytes of a line anyway. It seems like an option that can cut down the amount of work needing to be done here.

@BurntSushi
Copy link
Owner

@bluss Interesting idea. I tried that out and it doesn't seem to help unfortunately.

I thought about this and I think I'm going to leave the behavior of --vimgrep alone. Disabling parallelism for that mode seems like a pretty huge bummer. But I have added some extra docs warning about these failure modes.

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Nov 24, 2023
BurntSushi added a commit that referenced this issue Nov 25, 2023
The --vimgrep flag has some severe footguns when using a pattern that
matches very frequently. We had already written some docs to warn about
that, but now we also include a suggestion to avoid exorbitant heap
usage.

Closes #2505
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question An issue that is lacking clarity on one or more points. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

4 participants
@voidus @BurntSushi @bluss and others