investigate slowness of -f flag #497

Closed
BurntSushi opened this issue May 30, 2017 · 3 comments · Fixed by #1238
Labels
enhancement An enhancement to the functionality of the software. icebox A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon. question An issue that is lacking clarity on one or more points.

Comments

@BurntSushi
Owner

This blog post highlights a couple of interesting problems with the -f flag, or more precisely, with what happens when a large set of literal patterns is given.

The first issue is known. ripgrep should use Aho-Corasick explicitly when -F is provided with multiple patterns.

The second issue is more interesting: it appears that the order of the literal patterns provided actually impacts performance quite a bit. I wonder if the literal detection inside the regex engine is hitting a bad case.

@BurntSushi BurntSushi added enhancement An enhancement to the functionality of the software. question An issue that is lacking clarity on one or more points. labels May 30, 2017
@BurntSushi
Owner Author

BurntSushi commented May 30, 2017

The first issue is known. ripgrep should use Aho-Corasick explicitly when -F is provided with multiple patterns.

To add more context to this: right now, we take all of the patterns, join them together and sprinkle a | between each pattern. This works fine for small lists, but for large lists, it's not so great. Not only does it have to go through the entire regex parsing/compiling machinery, but the automaton construction itself is going to be quite wasteful. Instead, ripgrep should explicitly construct an Aho-Corasick automaton when it knows all of the patterns are literals, using the aho-corasick crate. While this is conceptually simple to do, it requires plumbing an aho_corasick::AcAutomaton (or perhaps aho_corasick::FullAcAutomaton for small sets of literals, for faster searching) through all of the search code, which is currently hard-coded to take a Regex. We could do this either by introducing an enum of possible strategies or by using generics. There's a lot of code that knows about a Regex, so I worry about the hit on compilation time if we force the compiler to monomorphize all of that. Using an enum requires a branch for every search, but this intuitively feels fine since the branch predictor should make that effectively free.
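For concreteness, a minimal sketch of what the enum-of-strategies approach could look like (illustrative only; it uses the aho-corasick 1.x `AhoCorasick` type rather than `AcAutomaton`, and the `Matcher`/`build` names are made up, not ripgrep's actual code):

```rust
use aho_corasick::AhoCorasick;
use regex::Regex;

// One matcher type, one branch per search.
enum Matcher {
    Regex(Regex),
    Literals(AhoCorasick),
}

impl Matcher {
    fn is_match(&self, line: &str) -> bool {
        match self {
            Matcher::Regex(re) => re.is_match(line),
            Matcher::Literals(ac) => ac.is_match(line),
        }
    }
}

fn build(patterns: &[String], fixed_strings: bool) -> Matcher {
    if fixed_strings && patterns.len() > 1 {
        // All patterns are literals: skip regex parsing/compilation entirely
        // and go straight to Aho-Corasick.
        Matcher::Literals(AhoCorasick::new(patterns).unwrap())
    } else {
        // Existing behavior: join everything into one big alternation,
        // escaping each pattern when it must be treated as a literal.
        let parts: Vec<String> = patterns
            .iter()
            .map(|p| if fixed_strings { regex::escape(p) } else { p.clone() })
            .collect();
        Matcher::Regex(Regex::new(&parts.join("|")).unwrap())
    }
}
```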

With all that said, it might be wise to tackle this in libripgrep rather than trying to force it into the existing infrastructure.

@BurntSushi BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label May 30, 2017
@BurntSushi BurntSushi added this to the libripgrep milestone May 30, 2017
@pevalme

pevalme commented Jun 28, 2017

I've found some interesting results related to the performance of rg -f on big sets of regexes.

For the experiments I'll describe, I used a 100 MB English subtitles file and a big set of 580 regular expressions.

Let's feed rg the first 100 regular expressions of the file; then those 100 regexes plus the regex "a"; and then those 100 regexes plus the regex "\.". On my machine (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz; 4 cores; 8 GB RAM), I get the following results (time given in minutes:seconds):

| Command | 100 regexes | 100 + "a" | 100 + "\." |
| --- | --- | --- | --- |
| rg -f file --dfa-size-limit=10M | 4:17 | 0:08 | 1:53 |
| rg -f file --dfa-size-limit=50M | 2:58 | 0:06 | 0:57 |
| rg -f file --dfa-size-limit=100M | 0:52 | 0:06 | 0:20 |
| rg -f file --dfa-size-limit=200M | 0:10 | 0:06 | 0:15 |
| rg -f file --dfa-size-limit=500M | 0:08 | 0:06 | 0:14 |
| rg -f file --dfa-size-limit=1G | 0:08 | 0:06 | 0:14 |

Why "a" and "\."?

Because each of these two regular expressions matches most of the lines in the file (2319098 and 2207743 out of 3414821, respectively), while there is a big difference between them: the letter 'a' tends to appear within the first few characters of a line, whereas the dot tends to appear only at the end of each line.

My guess

According to what I have read, rg does a lazy determinization of the NFA that represents the big regex (an OR of all patterns). From time to time, due to the space limit for storing the DFA, rg removes the least used parts of it, having to build them again when needed.

By adding the regex "a", the NFA reaches a final state quickly (in terms of the number of symbols read from a line before reaching it) and the required DFA stays small, never hitting the 10M bound, and thus it is built only once.

The case in which the regex "\." is added instead of "a" can be explained the same way. Since the dot tends to appear only at the end of each line, the automaton reaches a final state much later. The advantage of adding this regex is therefore smaller, since most of the transitions required to check whether a line matches the original NFA still take place. Basically, the DFA is bigger now because longer substrings must be matched against the NFA before deciding whether the whole line matches.
Nevertheless, since lines in this example are generally short, the DFA remains smaller than in the 100-regex case.

Conclusion

It seems that detecting all matches of a big OR regex in a single line is really fast. However, the algorithm that decides whether a line matches the NFA suffers from the DFA memory bound. Moreover, even when memory is not a problem, adding a simple regex that matches a line quickly greatly improves the performance of rg.

@BurntSushi
Owner Author

BurntSushi commented Jun 28, 2017

@pevalme Great analysis! I think I agree with all of it. There's probably some general advice that should be mentioned in the docs of -f that says that if you're trying to match against a large set of regexes, then you probably want to increase the DFA size limit using --dfa-size-limit. We could also just use a very large default value, since it represents a limit, to reduce the number of cases where --dfa-size-limit is needed.
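As a rough illustration of what that knob controls (this is not ripgrep's actual plumbing; it just assumes the flag's value is forwarded to the regex crate's builder, and `compile_with_big_dfa` is a made-up helper):

```rust
use regex::RegexBuilder;

// Compile one big alternation of patterns with a larger lazy-DFA cache,
// roughly what `rg --dfa-size-limit=500M` asks for.
fn compile_with_big_dfa(joined_patterns: &str) -> regex::Regex {
    RegexBuilder::new(joined_patterns)
        // Allow up to 500 MB for the lazy DFA's state cache before it starts
        // evicting and rebuilding states.
        .dfa_size_limit(500 * (1 << 20))
        // Also raise the compiled-program size limit, since a 580-pattern
        // alternation produces a large program.
        .size_limit(500 * (1 << 20))
        .build()
        .expect("patterns should form a valid regex")
}
```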

@BurntSushi BurntSushi removed the libripgrep An issue related to modularizing ripgrep into libraries. label Aug 19, 2018
@BurntSushi BurntSushi removed this from the libripgrep milestone Aug 19, 2018
@BurntSushi BurntSushi added the icebox A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon. label Jan 27, 2019
BurntSushi added a commit that referenced this issue Apr 7, 2019
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
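As a side note on that last point, a hedged sketch of the `ascii_case_insensitive` special case might look like the following (illustrative only, using the aho-corasick 1.x builder API; `build_ascii_ci` is a made-up helper, not the actual change):

```rust
use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

// Build a case-insensitive Aho-Corasick automaton, but only when every
// literal is pure ASCII; otherwise return None so the caller can fall back
// to the regex engine's Unicode-aware case folding.
fn build_ascii_ci(patterns: &[String]) -> Option<AhoCorasick> {
    if !patterns.iter().all(|p| p.is_ascii()) {
        return None;
    }
    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .build(patterns)
        .expect("literal set should build");
    Some(ac)
}
```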