investigate slowness of -f flag #497

Closed
BurntSushi opened this issue May 30, 2017 · 3 comments · Fixed by #1238
Labels
enhancement An enhancement to the functionality of the software. icebox A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon. question An issue that is lacking clarity on one or more points.

Comments

@BurntSushi
Owner

This blog post highlights a couple of interesting problems with the -f flag, or more precisely, with what happens when a large set of literal patterns is given.

The first issue is known. ripgrep should use Aho-Corasick explicitly when -F is provided with multiple patterns.

The second issue is more interesting: it appears that the order of the literal patterns provided actually impacts performance quite a bit. I wonder if the literal detection inside the regex engine is hitting a bad case.

@BurntSushi BurntSushi added enhancement An enhancement to the functionality of the software. question An issue that is lacking clarity on one or more points. labels May 30, 2017
@BurntSushi
Owner Author

BurntSushi commented May 30, 2017

The first issue is known. ripgrep should use Aho-Corasick explicitly when -F is provided with multiple patterns.

To add more context to this: right now, we take all of the patterns, join them together and sprinkle a | between each pattern. This works fine for small lists, but for large lists, it's not so great. Not only does it have to go through the entire regex parsing/compiling machinery, but the automaton construction itself is going to be quite wasteful. Instead, ripgrep should explicitly construct an Aho-Corasick automaton when it knows all of the patterns are literals, using the aho-corasick crate. While this is conceptually simple to do, it requires plumbing an aho_corasick::AcAutomaton (or perhaps aho_corasick::FullAcAutomaton for small sets of literals, for faster searching) through all of the search code, which is currently hard-coded to take a Regex. We could do this either by introducing an enum of possible strategies or by using generics. There's a lot of code that knows about a Regex, so I worry about the hit on compilation time if we force the compiler to monomorphize all of that. Using an enum requires a branch for every search, but this intuitively feels fine since the branch predictor should make that effectively free.
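For concreteness, a minimal sketch of what the enum-of-strategies approach could look like (illustrative only; it uses the aho-corasick 1.x `AhoCorasick` type rather than `AcAutomaton`, and the `Matcher`/`build` names are made up, not ripgrep's actual code):

```rust
use aho_corasick::AhoCorasick;
use regex::Regex;

// One matcher type, one branch per search.
enum Matcher {
    Regex(Regex),
    Literals(AhoCorasick),
}

impl Matcher {
    fn is_match(&self, line: &str) -> bool {
        match self {
            Matcher::Regex(re) => re.is_match(line),
            Matcher::Literals(ac) => ac.is_match(line),
        }
    }
}

fn build(patterns: &[String], fixed_strings: bool) -> Matcher {
    if fixed_strings && patterns.len() > 1 {
        // All patterns are literals: skip regex parsing/compilation entirely
        // and go straight to Aho-Corasick.
        Matcher::Literals(AhoCorasick::new(patterns).unwrap())
    } else {
        // Existing behavior: join everything into one big alternation,
        // escaping each pattern when it must be treated as a literal.
        let parts: Vec<String> = patterns
            .iter()
            .map(|p| if fixed_strings { regex::escape(p) } else { p.clone() })
            .collect();
        Matcher::Regex(Regex::new(&parts.join("|")).unwrap())
    }
}
```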

With all that said, it might be wise to tackle this in libripgrep rather than trying to force it into the existing infrastructure.

@BurntSushi BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label May 30, 2017
@BurntSushi BurntSushi added this to the libripgrep milestone May 30, 2017
@pevalme

pevalme commented Jun 28, 2017

I've found some interesting results related to the performance of rg -f on big sets of regexes.

For the experiments I'll describe, I used a 100 MB English subtitles file and a big set of 580 regular expressions.

Let's feed rg the first 100 regular expressions of the file; then those 100 regexes plus the regex "a"; and then those 100 regexes plus the regex "\.". On my machine (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz; 4 cores; 8 GB RAM), I get the following results (time given in minutes:seconds):

| Command | 100 regexes | 100 + "a" | 100 + "\." |
| --- | --- | --- | --- |
| rg -f file --dfa-size-limit=10M | 4:17 | 0:08 | 1:53 |
| rg -f file --dfa-size-limit=50M | 2:58 | 0:06 | 0:57 |
| rg -f file --dfa-size-limit=100M | 0:52 | 0:06 | 0:20 |
| rg -f file --dfa-size-limit=200M | 0:10 | 0:06 | 0:15 |
| rg -f file --dfa-size-limit=500M | 0:08 | 0:06 | 0:14 |
| rg -f file --dfa-size-limit=1G | 0:08 | 0:06 | 0:14 |

Why "a" and "\."?

Because each of these two regular expressions matches most of the lines in the file (2319098 and 2207743 out of 3414821, respectively), while there is a big difference between them: the letter 'a' tends to appear within the first few characters of a line, whereas the dot tends to appear only at the end of each line.

My guess

According to what I have read, rg does a lazy determinization of the NFA that represents the big regex (an OR of all patterns). From time to time, due to the space limit for storing the DFA, rg removes the least used parts of it, having to build them again when needed.

By adding the regex "a", the NFA reaches a final state quickly (in terms of the number of symbols read from a line before reaching it) and the required DFA stays small, never hitting the 10M bound, and thus it is built only once.

The case in which the regex "\." is added instead of "a" can be explained the same way. Since the dot tends to appear only at the end of each line, the automaton reaches a final state much later. The advantage of adding this regex is therefore smaller, since most of the transitions required to check whether a line matches the original NFA still take place. Basically, the DFA is bigger now because longer substrings must be matched against the NFA before deciding whether the whole line matches.
Nevertheless, since lines in this example are generally short, the DFA remains smaller than in the 100-regex case.

Conclusion

It seems that detecting all matches of a big OR regex in a single line is really fast. However, the algorithm that decides whether a line matches the NFA suffers from the DFA memory bound. Moreover, even when memory is not a problem, adding a simple regex that matches a line quickly greatly improves the performance of rg.

@BurntSushi
Owner Author

BurntSushi commented Jun 28, 2017

@pevalme Great analysis! I think I agree with all of it. There's probably some general advice that should be mentioned in the docs of -f that says that if you're trying to match against a large set of regexes, then you probably want to increase the DFA size limit using --dfa-size-limit. We could also just use a very large default value, since it represents a limit, to reduce the number of cases where --dfa-size-limit is needed.
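As a rough illustration of what that knob controls (this is not ripgrep's actual plumbing; it just assumes the flag's value is forwarded to the regex crate's builder, and `compile_with_big_dfa` is a made-up helper):

```rust
use regex::RegexBuilder;

// Compile one big alternation of patterns with a larger lazy-DFA cache,
// roughly what `rg --dfa-size-limit=500M` asks for.
fn compile_with_big_dfa(joined_patterns: &str) -> regex::Regex {
    RegexBuilder::new(joined_patterns)
        // Allow up to 500 MB for the lazy DFA's state cache before it starts
        // evicting and rebuilding states.
        .dfa_size_limit(500 * (1 << 20))
        // Also raise the compiled-program size limit, since a 580-pattern
        // alternation produces a large program.
        .size_limit(500 * (1 << 20))
        .build()
        .expect("patterns should form a valid regex")
}
```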

@BurntSushi BurntSushi removed the libripgrep An issue related to modularizing ripgrep into libraries. label Aug 19, 2018
@BurntSushi BurntSushi removed this from the libripgrep milestone Aug 19, 2018
@BurntSushi BurntSushi added the icebox A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon. label Jan 27, 2019
BurntSushi added a commit that referenced this issue Apr 7, 2019
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
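As a side note on that last point, a hedged sketch of the `ascii_case_insensitive` special case might look like the following (illustrative only, using the aho-corasick 1.x builder API; `build_ascii_ci` is a made-up helper, not the actual change):

```rust
use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

// Build a case-insensitive Aho-Corasick automaton, but only when every
// literal is pure ASCII; otherwise return None so the caller can fall back
// to the regex engine's Unicode-aware case folding.
fn build_ascii_ci(patterns: &[String]) -> Option<AhoCorasick> {
    if !patterns.iter().all(|p| p.is_ascii()) {
        return None;
    }
    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .build(patterns)
        .expect("literal set should build");
    Some(ac)
}
```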