A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.
This section shows the results of a curated and biased set of benchmarks. They reflect only a small subset of the benchmarks defined in this repository, but were carefully crafted in an attempt to represent a broad range of use cases, and are annotated where possible with analysis to aid in the interpretation of results.
The results begin with a summary, followed by a list of links to each benchmark group, and then finally the results for each group. Results are shown one benchmark group at a time, where a single group combines related regexes or workloads, with the intent that it is useful to see how results change across regex engines. Analysis is provided, at minimum, for every group. The analysis is heavily biased towards Rust's regex crate, as it is what this author knows best, but contributions that discuss other regex engines are very welcome.
Below each group of results are the parameters for each individual benchmark within that group. An individual benchmark may contain some analysis specific to it, but it will at least contain a summary of the benchmark details. Some parameters, such as the haystack, are usually too big to show in this README. One can use rebar to look at the haystack directly: just give the full name of the benchmark to the `rebar haystack` command. For example:
```
$ rebar haystack unicode/compile/fifty-letters
ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή
```
Similarly, the full benchmark execution details (including the haystack) can be seen with the `rebar klv` command:
```
$ rebar klv unicode/compile/fifty-letters
name:29:unicode/compile/fifty-letters
model:7:compile
pattern:7:\pL{50}
case-insensitive:5:false
unicode:4:true
haystack:106:ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή
max-iters:1:0
max-warmup-iters:1:0
max-time:1:0
max-warmup-time:1:0
```
Finally, you can run the benchmark yourself and look at results on the command line:
```
$ rebar measure -f '^unicode/compile/fifty-letters$' | tee results.csv
$ rebar cmp results.csv
```
Below are two tables summarizing the results for the regex engines benchmarked. Each regex engine includes its version at the time measurements were captured, a summary score that ranks it relative to the other regex engines across all benchmarks, and the total number of measurements collected.
The first table ranks regex engines based on search time. The second table ranks regex engines based on compile time.
The summary statistic used is the geometric mean of the speed ratios for each regex engine across all benchmarks that include it. The ratio within each benchmark is computed by taking the median of all timing samples for a regex engine and dividing it by the best median among the regex engines that participated in the benchmark. For example, given two regex engines `A` and `B` with results `35 ns` and `25 ns` on a single benchmark, `A` has a speed ratio of `1.4` and `B` has a speed ratio of `1.0`. The geometric mean reported here is then the "average" speed ratio for that regex engine across all benchmarks.
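As a concrete sketch, the per-benchmark ratio and the cross-benchmark summary might be computed as follows. This is a hypothetical reconstruction of the arithmetic described above, not rebar's actual code:

```python
import statistics

def speed_ratios(medians_ns):
    """Divide each engine's median timing by the best (smallest) median."""
    best = min(medians_ns.values())
    return {engine: m / best for engine, m in medians_ns.items()}

def summary(per_benchmark_ratios, engine):
    """Geometric mean of an engine's ratios across benchmarks that include it."""
    rs = [r[engine] for r in per_benchmark_ratios if engine in r]
    return statistics.geometric_mean(rs)

# The example from the text: A at 35ns and B at 25ns on one benchmark.
ratios = speed_ratios({"A": 35, "B": 25})
assert ratios == {"A": 1.4, "B": 1.0}
```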
Each regex engine is linked to the directory containing the runner program responsible for compiling a regex, using it in a search and reporting timing results. Each directory contains a `README` file briefly describing any engine-specific details for the runner program.
Each regex engine is also defined in `benchmarks/engines.toml`, using the same name listed in the table below. Each definition includes instructions for how to run, build, clean and obtain the version of each regex engine.
Caution: Using a single number to describe the overall performance of a regex engine is a fraught endeavor, and it is debatable whether it should be included here at all. It is included primarily because the number of benchmarks is quite large and overwhelming. It can be quite difficult to get a general sense of things without a summary statistic. In particular, a summary statistic is also useful to observe how the overall picture itself changes as changes are made to the barometer. (Whether it be by adding new regex engines or adding/removing/changing existing benchmarks.) One particular word of caution is that while geometric mean is more robust with respect to outliers than arithmetic mean, it is not unaffected by them. Therefore, it is still critical to examine individual benchmarks if one wants to better understand the performance profile of any specific regex engine or workload.
Engine | Version | Geometric mean of speed ratios | Benchmark count |
---|---|---|---|
rust/regex/meta | 0.2.0 | 1.76 | 31 |
hyperscan | 5.4.1 2023-02-22 | 2.34 | 22 |
rust/regex | 1.7.1 | 3.12 | 26 |
pcre2/jit | 10.42 2022-12-11 | 3.97 | 30 |
rust/regexold | 1.7.1 | 4.11 | 26 |
dotnet/compiled | 7.0.3 | 4.33 | 29 |
re2 | 2023-03-01 | 7.04 | 27 |
dotnet/nobacktrack | 7.0.3 | 8.42 | 26 |
regress | 0.5.0 | 25.25 | 21 |
python/regex | 2022.10.31 | 27.93 | 30 |
python/re | 3.10.9 | 28.33 | 29 |
go/regexp | 1.20.1 | 55.36 | 27 |
pcre2 | 10.42 2022-12-11 | 94.74 | 30 |
Engine | Version | Geometric mean of speed ratios | Benchmark count |
---|---|---|---|
pcre2 | 10.42 2022-12-11 | 1.00 | 9 |
regress | 0.5.0 | 3.37 | 5 |
pcre2/jit | 10.42 2022-12-11 | 4.68 | 9 |
re2 | 2023-03-01 | 9.51 | 8 |
rust/regexold | 1.7.1 | 12.58 | 9 |
go/regexp | 1.20.1 | 12.68 | 8 |
dotnet/compiled | 7.0.3 | 13.64 | 8 |
rust/regex/meta | 0.2.0 | 13.66 | 9 |
rust/regex | 1.7.1 | 13.70 | 9 |
python/re | 3.10.9 | 28.95 | 9 |
dotnet/nobacktrack | 7.0.3 | 55.20 | 5 |
python/regex | 2022.10.31 | 94.23 | 9 |
hyperscan | 5.4.1 2023-02-22 | 1564.09 | 4 |
Below is a list of links to each benchmark group in this particular barometer. Each benchmark group contains 1 or more related benchmarks. The idea of each group is to tell some kind of story about related workloads, and to give a sense of how performance changes based on the variations between each benchmark.
- literal
- literal-alternate
- date
- ruff-noqa
- lexer-veryl
- cloud-flare-redos
- unicode-character-data
- words
- aws-keys
- bounded-repeat
- unstructured-to-json
This group of benchmarks measures regex patterns that are simple literals. When possible, we also measure case insensitive versions of the same pattern. We do this across three languages: English, Russian and Chinese. For English, Unicode mode is disabled while it is enabled for Russian and Chinese. (Which mostly only matters for the case insensitive benchmarks.)
This group is mainly meant to demonstrate two things. The first is whether the regex engine does some of the most basic forms of optimization by recognizing that a pattern is just a literal, and that a full blown regex engine is probably not needed. Indeed, naively using a regex engine for this case is likely to produce measurements much worse than most regex engines. The second is how the performance of simple literal searches changes with respect to both case insensitivity and Unicode. Namely, substring search algorithms that work well on ASCII text don't necessarily also work well on UTF-8 that contains many non-ASCII codepoints. This is especially true for case insensitive searches.
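The first point can be illustrated by comparing a regex search against a dedicated substring search for the same literal. The haystack below is made up, and absolute numbers vary wildly by engine and machine; the point is only that both must agree on the answer while potentially differing greatly in cost:

```python
import re
import timeit

# A synthetic haystack with the literal near the end.
haystack = ("the game is afoot " * 50_000) + "Sherlock Holmes"
needle = "Sherlock Holmes"
pattern = re.compile(needle)

# Both approaches must agree on where the match is...
assert pattern.search(haystack).start() == haystack.find(needle)

# ...but a regex engine that fails to recognize a pure-literal pattern may
# be much slower than a dedicated substring search like str.find.
t_regex = timeit.timeit(lambda: pattern.search(haystack), number=10)
t_find = timeit.timeit(lambda: haystack.find(needle), number=10)
print(f"regex: {t_regex:.4f}s, find: {t_find:.4f}s")
```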
Notice, for example, how RE2 seems to be faster in the `sherlock-casei-ru` benchmark than in the `sherlock-ru` benchmark, even though the latter is "just" a simple substring search whereas the former is a multiple substring search. In the case of `sherlock-ru`, RE2 actually attempts a literal optimization that likely gets caught up in dealing with a high false positive rate of candidates. Whereas in the case of `sherlock-casei-ru`, no literal optimization is attempted and instead its lazy DFA is used. The high false positive rate in the simpler literal case winds up making it overall slower than it likely would be if it just used the DFA.
This is not in any way to pick on RE2. Every regex engine that does literal optimizations (and most do) will suffer from this kind of setback in one way or another.
Engine | sherlock-en | sherlock-casei-en | sherlock-ru | sherlock-casei-ru | sherlock-zh |
---|---|---|---|---|---|
dotnet/compiled | 12.5 GB/s | 6.0 GB/s | 17.5 GB/s | 5.1 GB/s | 34.3 GB/s |
dotnet/nobacktrack | 8.2 GB/s | 4.0 GB/s | 7.1 GB/s | 2.5 GB/s | 28.5 GB/s |
go/regexp | 3.9 GB/s | 43.2 MB/s | 2.1 GB/s | 32.6 MB/s | 2036.6 MB/s |
hyperscan | 33.7 GB/s | 30.5 GB/s | 4.4 GB/s | 7.3 GB/s | 50.3 GB/s |
pcre2 | 7.2 GB/s | 969.9 MB/s | 2.1 MB/s | 2044.5 KB/s | 57.9 MB/s |
pcre2/jit | 24.9 GB/s | 16.6 GB/s | 31.9 GB/s | 18.8 GB/s | 37.1 GB/s |
python/re | 3.7 GB/s | 295.7 MB/s | 6.7 GB/s | 455.3 MB/s | 11.2 GB/s |
python/regex | 3.6 GB/s | 2.8 GB/s | 4.6 GB/s | 4.1 GB/s | 6.8 GB/s |
re2 | 10.6 GB/s | 2.5 GB/s | 764.2 MB/s | 942.0 MB/s | 2.7 GB/s |
regress | 3.5 GB/s | 1189.8 MB/s | 3.6 GB/s | 314.0 MB/s | 3.6 GB/s |
rust/regex | 32.2 GB/s | 8.8 GB/s | 32.4 GB/s | 5.8 GB/s | 39.7 GB/s |
rust/regex/meta | 28.6 GB/s | 10.7 GB/s | 32.3 GB/s | 9.1 GB/s | 37.5 GB/s |
rust/regexold | 32.5 GB/s | 8.5 GB/s | 31.3 GB/s | 6.8 GB/s | 39.5 GB/s |
Show individual benchmark parameters.
sherlock-en
Parameter | Value |
---|---|
full name | curated/01-literal/sherlock-en |
model | count |
regex | Sherlock Holmes |
case-insensitive | false |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count | 513 |
sherlock-casei-en
Parameter | Value |
---|---|
full name | curated/01-literal/sherlock-casei-en |
model | count |
regex | Sherlock Holmes |
case-insensitive | true |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count | 522 |
sherlock-ru
Parameter | Value |
---|---|
full name | curated/01-literal/sherlock-ru |
model | count |
regex | Шерлок Холмс |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count | 724 |
sherlock-casei-ru
Parameter | Value |
---|---|
full name | curated/01-literal/sherlock-casei-ru |
model | count |
regex | Шерлок Холмс |
case-insensitive | true |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count | 746 |
sherlock-zh
Parameter | Value |
---|---|
full name | curated/01-literal/sherlock-zh |
model | count |
regex | 夏洛克·福尔摩斯 |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/zh-sampled.txt |
count | 30 |
This group is like `literal`, but expands the complexity from a simple literal to a small alternation of simple literals, including case insensitive variants where applicable. Once again, we do this across three languages: English, Russian and Chinese. We disable Unicode mode for English but enable it for Russian and Chinese. Enabling Unicode here generally only means that case insensitivity takes Unicode case folding rules into account.
This benchmark ups the ante when it comes to literal optimizations. Namely, for a regex engine to optimize this case, it generally needs to be capable of reasoning about literal optimizations that require one or more literals from a set to match. Many regex engines don't deal with this case well, or at all. For example, after quickly comparing the `sherlock-en` benchmark here and in the previous `literal` group, one thing that should stand out is the proportion of regex engines that now measure throughput in MB/s instead of GB/s.
One of the difficulties in optimizing for this case is that multiple substring search is difficult to do quickly. In particular, this benchmark carefully selected each alternation literal to start with a different character than the other alternation literals. This, for example, inhibits clever regex engines from noticing that all literals begin with the same byte (or small number of bytes). Consider an alternation like `foo|far|fight`. It is not hard to see that a regex engine could just scan for the letter `f` as a prefilter optimization. Here, we pick our regex such that this sort of shortcut isn't available. For the regex engine to optimize this case, it really needs to deal with the problem of multiple substring search.
Multiple substring search can be implemented via a DFA, and perhaps in some cases, quite quickly via a shift DFA. Beyond that though, multiple substring search can be implemented by other various algorithms such as Aho-Corasick or Rabin-Karp. (The standard Aho-Corasick formulation is an NFA, but it can also be converted to a DFA by pre-computing all failure transitions. This winds up with a similar result as using Thompson's construction to produce an NFA and then powerset construction to get a DFA, but the Aho-Corasick construction algorithm is usually quite a bit faster because it doesn't need to deal with a full NFA.)
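To make the idea concrete, here is a minimal sketch of the NFA formulation of Aho-Corasick (a trie plus failure links), purely for illustration; none of the engines discussed here implement it this naively:

```python
from collections import deque

def build(patterns):
    """Build an Aho-Corasick automaton in its NFA formulation."""
    goto = [{}]      # goto[state][char] -> next state (trie edges)
    out = [set()]    # patterns ending at each state
    fail = [0]       # failure links, computed by BFS below
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto[s][ch] = len(goto)
                goto.append({})
                out.append(set())
                fail.append(0)
            s = goto[s][ch]
        out[s].add(pat)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]          # follow failure links on a mismatch
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches from the failure state
    return goto, fail, out

def find_all(haystack, patterns):
    """Report (start, pattern) for every occurrence of every pattern."""
    goto, fail, out = build(patterns)
    s, hits = 0, []
    for i, ch in enumerate(haystack):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Note that, like Hyperscan, this formulation reports every occurrence of every pattern, including overlapping ones.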
The problem here is that DFA speeds may or may not help you. For example, in the case of RE2 and Rust's regex engine, both already get DFA speeds by virtue of their lazy DFAs. Indeed, in this group, RE2 performs roughly the same across all benchmarks. So even if you, say, build an Aho-Corasick DFA, it's not going to help much, if at all. So it makes sense to avoid it.
But Rust's regex crate has quite a bit higher throughput than RE2 on most of the benchmarks in this group. So how is it done? Currently, this is done via the Teddy algorithm, which was ported out of Hyperscan. It is an algorithm that makes use of SIMD to accelerate searching for a somewhat small set of literals. Most regex engines don't have this sort of optimization, and indeed, it seems like Teddy is not particularly well known. Alas, regex engines that want to move past typical DFA speeds for multiple substring search likely need some kind of vectorized algorithm to do so. (Teddy is also used by Rust's regex crate in the previous `literal` group of benchmarks for accelerating case insensitive searches. Namely, it enumerates some finite set of prefixes like `she`, `SHE`, `ShE` and so on, and then looks for matches of those as a prefilter.)
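The prefix enumeration can be sketched in a few lines. The three-character prefix length here is an arbitrary illustrative choice, not what the regex crate actually uses:

```python
from itertools import product

def case_prefixes(literal, prefix_len=3):
    """Enumerate every case variant of a literal's prefix. A prefilter can
    scan for any of these variants and hand candidate positions to the full
    case-insensitive regex engine for verification."""
    pairs = [(ch.lower(), ch.upper()) for ch in literal[:prefix_len]]
    return sorted({"".join(chars) for chars in product(*pairs)})

assert case_prefixes("sherlock") == [
    "SHE", "SHe", "ShE", "She", "sHE", "sHe", "shE", "she",
]
```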
Engine | sherlock-en | sherlock-casei-en | sherlock-ru | sherlock-casei-ru | sherlock-zh |
---|---|---|---|---|---|
dotnet/compiled | 3.7 GB/s | 446.7 MB/s | 2.2 GB/s | 784.2 MB/s | 16.2 GB/s |
dotnet/nobacktrack | 2.6 GB/s | 374.5 MB/s | 1012.0 MB/s | 293.7 MB/s | 10.5 GB/s |
go/regexp | 24.8 MB/s | 15.3 MB/s | 31.8 MB/s | 9.0 MB/s | 45.7 MB/s |
hyperscan | 17.2 GB/s | 15.3 GB/s | 4.6 GB/s | 4.0 GB/s | 20.0 GB/s |
pcre2 | 884.1 MB/s | 157.9 MB/s | 1741.0 KB/s | 1630.0 KB/s | 8.5 MB/s |
pcre2/jit | 1559.1 MB/s | 654.6 MB/s | 1198.2 MB/s | 297.2 MB/s | 2.3 GB/s |
python/re | 439.8 MB/s | 36.4 MB/s | 411.5 MB/s | 55.7 MB/s | 1031.1 MB/s |
python/regex | 302.0 MB/s | 71.6 MB/s | 302.6 MB/s | 78.6 MB/s | 874.3 MB/s |
re2 | 923.5 MB/s | 922.7 MB/s | 918.9 MB/s | 930.3 MB/s | 964.8 MB/s |
regress | 1640.2 MB/s | 310.7 MB/s | 275.8 MB/s | 115.9 MB/s | 283.1 MB/s |
rust/regex | 15.2 GB/s | 2.7 GB/s | 2.9 GB/s | 516.5 MB/s | 18.4 GB/s |
rust/regex/meta | 12.7 GB/s | 2.9 GB/s | 6.3 GB/s | 1646.6 MB/s | 14.9 GB/s |
rust/regexold | 16.7 GB/s | 2.9 GB/s | 3.0 GB/s | 453.9 MB/s | 16.4 GB/s |
Show individual benchmark parameters.
sherlock-en
Parameter | Value |
---|---|
full name | curated/02-literal-alternate/sherlock-en |
model | count |
regex | Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty |
case-insensitive | false |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count | 714 |
sherlock-casei-en
Parameter | Value |
---|---|
full name | curated/02-literal-alternate/sherlock-casei-en |
model | count |
regex | Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty |
case-insensitive | true |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count | 725 |
sherlock-ru
Parameter | Value |
---|---|
full name | curated/02-literal-alternate/sherlock-ru |
model | count |
regex | Шерлок Холмс|Джон Уотсон|Ирен Адлер|инспектор Лестрейд|профессор Мориарти |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count | 899 |
sherlock-casei-ru
Parameter | Value |
---|---|
full name | curated/02-literal-alternate/sherlock-casei-ru |
model | count |
regex | Шерлок Холмс|Джон Уотсон|Ирен Адлер|инспектор Лестрейд|профессор Мориарти |
case-insensitive | true |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count | 971 |
sherlock-zh
Parameter | Value |
---|---|
full name | curated/02-literal-alternate/sherlock-zh |
model | count |
regex | 夏洛克·福尔摩斯|约翰华生|阿德勒|雷斯垂德|莫里亚蒂教授 |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/zh-sampled.txt |
count | 207 |
This is a monster regex for extracting dates from unstructured text from the datefinder project written in Python. The regex itself was taken from printing the `DATES_PATTERN` variable in the `datefinder` project. I then removed all names from the capture groups and unnecessary escapes, and collapsed it to a single line (because not all regex engines support verbose mode).

The regex is more akin to a tokenizer, and the `datefinder` library attempts to combine these tokens into timestamps.
We measure an ASCII-only version of it and a Unicode-aware version of it. Unicode is relevant here because of case insensitivity, and because the regex makes use of the character classes `\s` and `\d`, which are bigger when they're Unicode-aware. We also measure the compilation time of each.
The results here can be a little tricky to interpret. Namely, it looks like backtrackers tend to do worse than automata oriented regex engines, but `go/regexp` uses automata and is itself quite slow here. Notice, though, that `hyperscan`, `re2` and `rust/regex` do well here. While I'm less familiar with `hyperscan`, the explanation for `re2` and `rust/regex` is obvious once you look at a profile: it's the lazy DFA. Both have implementations of a regex engine that build a DFA during search time, with at most one new transition (and one new state) being created per byte of haystack. In practice, most transitions get reused, which means that it tends to act like a real DFA most of the time for most regexes on most haystacks.
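The core idea can be sketched with a toy character-based NFA. This is a drastic simplification (real lazy DFAs operate on bytes, handle match states more carefully and bound their cache size), and all names here are invented for illustration:

```python
def lazy_dfa_search(nfa, start, accept, haystack):
    """nfa maps (state, char) -> set of next NFA states. DFA states are
    frozensets of NFA states, with transitions computed on demand and
    cached, so each distinct (DFA state, char) pair is built at most once."""
    cache = {}
    state = frozenset(start)
    for ch in haystack:
        key = (state, ch)
        if key not in cache:
            nxt = set(start)  # unanchored search: a match can start anywhere
            for s in state:
                nxt |= nfa.get((s, ch), set())
            cache[key] = frozenset(nxt)
        state = cache[key]
        if state & accept:
            return True
    return False

# A tiny NFA for the regex ab*c: 0 -a-> 1, 1 -b-> 1, 1 -c-> 2.
NFA = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {2}}
assert lazy_dfa_search(NFA, {0}, {2}, "zzabbbcz")
assert not lazy_dfa_search(NFA, {0}, {2}, "zzabbz")
```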
Compilation time of this monster regex is also all over the place. PCRE2 does the best, and Hyperscan winds up being quite slow. Once you enable Unicode mode, compilation time generally gets worse, and especially so for `re2` and `rust/regex`. In particular, both compile byte oriented automata, which means the transitions are defined over bytes and not codepoints. That means large Unicode classes like `\d` tend to balloon in size, because they get converted into UTF-8 automata.
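A quick way to see why: every digit in `0-9` is a single byte in UTF-8, but digits from other scripts (which a Unicode-aware `\d` must also match) encode to two or three bytes each, so a byte-oriented automaton has to grow transitions covering all of those multi-byte sequences:

```python
import re

# ASCII, arabic-indic, devanagari and fullwidth digits are all \d when
# Unicode-aware, but their UTF-8 encodings have different lengths.
digits = ["5", "٥", "५", "５"]
assert all(re.fullmatch(r"\d", d) for d in digits)
assert [len(d.encode("utf-8")) for d in digits] == [1, 2, 3, 3]
```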
Engine | ascii | unicode | compile-ascii | compile-unicode |
---|---|---|---|---|
dotnet/compiled | - | 1156.5 KB/s | - | 1.67ms |
go/regexp | 252.4 KB/s | - | 4.06ms | - |
hyperscan | 105.1 MB/s | - | 651.02ms | - |
pcre2 | 1218.8 KB/s | 173.8 KB/s | 114.49us | 131.45us |
pcre2/jit | 21.2 MB/s | 13.0 MB/s | 714.35us | 0.98ms |
python/re | 1082.2 KB/s | 796.4 KB/s | 4.82ms | 4.99ms |
python/regex | 1111.0 KB/s | 1004.0 KB/s | 15.38ms | 36.18ms |
re2 | 75.4 MB/s | - | 1.16ms | - |
regress | - | 2.0 MB/s | 1.05ms | - |
rust/regex | 134.2 MB/s | 133.5 MB/s | 1.95ms | 6.19ms |
rust/regex/meta | 123.2 MB/s | 120.9 MB/s | 1.49ms | 5.33ms |
rust/regexold | 137.8 MB/s | 413.7 KB/s | 1.69ms | 6.19ms |
Show individual benchmark parameters.
ascii
Parameter | Value |
---|---|
full name | curated/03-date/ascii |
model | count-spans |
regex-path | wild/date.txt |
case-insensitive | true |
unicode | false |
haystack-path | rust-src-tools-3b0d4813.txt |
count(*) | 111817 |
count(hyperscan) | 547662 |
As with many other benchmarks, Hyperscan reports all matches, even ones that are overlapping. This particular regex is too big to analyze closely, but it seems plausible one could still use it (possibly with a slightly tweaked regex) for this task.
unicode
Parameter | Value |
---|---|
full name | curated/03-date/unicode |
model | count-spans |
regex-path | wild/date.txt |
case-insensitive | true |
unicode | true |
haystack-path | rust-src-tools-3b0d4813.txt |
count(*) | 111841 |
count(dotnet/compiled) | 111825 |
`regress` is included here despite its `\d` not being Unicode-aware (as required by ECMAScript). Notably, its `\s` is Unicode-aware. (`\w` is too, but it's not used in this regex.) In this particular haystack, `\d` being ASCII-only doesn't impact the match count.

However, neither `re2` nor `go/regexp` are included here because neither `\d` nor `\s` are Unicode-aware, and `\s` being ASCII-only does impact the match count.

`hyperscan` is excluded here because the pattern results in a "too large" compilation error. As far as I know, Hyperscan doesn't expose any knobs for increasing this limit.

`dotnet/compiled` gets a different count here, but it's not clear why.
compile-ascii
Parameter | Value |
---|---|
full name | curated/03-date/compile-ascii |
model | compile |
regex-path | wild/date.txt |
case-insensitive | true |
unicode | false |
haystack | 2010-03-14 |
count(*) | 5 |
count(hyperscan) | 10 |
Notice that `regress` is now included in the ASCII benchmark, because in `compile-unicode` we specifically test that the `\d` used in this regex is Unicode-aware. `regress` does not make `\d` Unicode-aware, so it gets thrown into the ASCII group. But do note that it does appear to have some Unicode awareness.
compile-unicode
Parameter | Value |
---|---|
full name | curated/03-date/compile-unicode |
model | compile |
regex-path | wild/date.txt |
case-insensitive | true |
unicode | true |
haystack | ۲۰۱۰-۰۳-۱۴ |
count | 5 |
We use "extended arabic-indic digits" to represent the same date, `2010-03-14`, that we use for verification in `compile-ascii`. These digits are part of `\d` when it is Unicode-aware.
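Python's `re` module, which is Unicode-aware by default for `str` patterns, can demonstrate the distinction being verified:

```python
import re

date = "۲۰۱۰-۰۳-۱۴"  # 2010-03-14 in extended arabic-indic digits

# A Unicode-aware \d matches these digits...
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None
# ...while an ASCII-only \d does not.
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", date, flags=re.ASCII) is None
```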
The regex benchmarked here comes from the Ruff project, which is a Python linter written in Rust. The project uses many regexes, but we pluck one out in particular that is likely to be run more frequently than the others:
```
(\s*)((?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)
```
This is a regex that looks for `# noqa` annotations on each line. The noqa annotation generally causes the linter to ignore those lines with respect to warnings it emits. The regex also tries to extract annotations following the `noqa` that permit ignoring only specific rules in the linter.
This benchmark has a few interesting characteristics worth pointing out:
- It is line oriented, which means the haystacks it searches are likely to be small. This in turn means that the overhead of the regex engine is likely to matter more than in throughput oriented benchmarks.
- On this particular haystack (the CPython source code), the number of matches is quite small. Therefore, it is quite beneficial here to have a fast path to say "there is no match" without doing any extra work. While the number of matches here is perhaps uncharacteristically small for a Python project, you would generally expect most lines not to have `# noqa` in them, and so the presumption of a fast rejection is probably a decent assumption for this particular regex.
- Ruff uses capturing groups to pick out parts of the match, so when a match is found, the regex engine needs to report additional information beyond just the overall match spans. The spans of each matching capture group also need to be reported.
- There are no prefix (or suffix) literals in the regex to enable any straight-forward prefilter optimizations.
With respect to the point about no prefix or suffix literals, we also include a tweaked version of the regex that removes the leading `(\s*)`:

```
(?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?
```

In this case, the regex now starts with a literal, albeit one that is asked to match case insensitively. We can see pretty clearly the impact the tweaked version has on the speed for each regex engine. `pcre2/jit`, for example, improves its throughput from around 500 MB/s to 1.5 GB/s. `go/regexp` has an even more dramatic (relatively speaking) improvement.
`rust/regex` is a little different in that it's quite fast in both cases. The key optimization that applies for `rust/regex` is the "reverse inner" optimization. Even in the original regex, `rust/regex` will pluck out the `# noqa` literal and search for it case insensitively. When a candidate is found, it then searches for `(\s*)` in reverse to find the start position, and then finally does a standard forward search from that point to find the end position.
Engine | real | tweaked |
---|---|---|
dotnet/compiled | 154.3 MB/s | 506.0 MB/s |
dotnet/nobacktrack | 242.1 MB/s | 444.9 MB/s |
go/regexp | 32.7 MB/s | 712.3 MB/s |
pcre2 | 122.2 MB/s | 1415.9 MB/s |
pcre2/jit | 571.1 MB/s | 1507.5 MB/s |
python/re | 28.7 MB/s | 108.6 MB/s |
python/regex | 97.4 MB/s | 97.0 MB/s |
re2 | 535.6 MB/s | 710.9 MB/s |
rust/regex | 735.1 MB/s | 1318.4 MB/s |
rust/regex/meta | 1665.3 MB/s | 1523.8 MB/s |
rust/regexold | 196.2 MB/s | 1225.6 MB/s |
Show individual benchmark parameters.
real
Parameter | Value |
---|---|
full name | curated/04-ruff-noqa/real |
model | grep-captures |
regex | (\s*)((?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?) |
case-insensitive | false |
unicode | false |
haystack-path | wild/cpython-226484e4.py |
count | 84 |
tweaked
Parameter | Value |
---|---|
full name | curated/04-ruff-noqa/tweaked |
model | grep-captures |
regex | (?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))? |
case-insensitive | false |
unicode | false |
haystack-path | wild/cpython-226484e4.py |
count | 44 |
This group benchmarks a "lexer" where it combines a whole bunch of different patterns that identify tokens in a language into a single regex. It then uses capture groups to determine which branch of the alternation actually matched, and thus, which token matched. We also benchmark a variant of this that asks the regex engine to search for each pattern individually (most regex engines don't support this mode).
This is used by the Veryl project by way of the Parol parser generator. The regex was extracted by the Parol maintainers upon my request.
We use this regex to represent the "lexing" use case, where sometimes folks will build a pretty big regex out of a bunch of small regexes for identifying tokens. Usually the idea is that the lexer matches literally everything in the haystack (indeed, the last branch in this regex is a `.` and the first is any newline), and thus these sorts of regexes tend to be quite latency sensitive. Namely, it really matters just how much overhead is involved in reporting matches. This is likely one of the reasons why most regex engines are overall pretty slow here.
The other aspect of this that's quite difficult is the sheer number of capturing groups. There's several dozen of them, which means regex engines have to keep track of a fair bit of state to handle it.
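The capture-group trick can be illustrated with Python's `re`, which reports the branch that matched via `lastgroup`. The token set below is a toy invented for illustration, not Veryl's:

```python
import re

# One named alternation branch per token kind; whichever branch matched is
# recovered via lastgroup, just like the lexer described above.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>[0-9]+)"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<OP>[+*/=-])"
    r"|(?P<SPACE>[ \t]+)"
    r"|(?P<OTHER>.)"  # a final catch-all branch, as in the regex above
)

def tokenize(src):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(src)]

assert tokenize("x = 42") == [
    ("IDENT", "x"), ("SPACE", " "), ("OP", "="),
    ("SPACE", " "), ("NUMBER", "42"),
]
```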
You might think this would be bad for backtrackers and good for automata engines, since automata engines are supposed to be able to handle large alternations better than backtrackers. But that's not the case here. Even Python's regex engine (a backtracker), for example, beats RE2 (automata). My hypothesis for why this is, is latency. Automata engines tend to have multiple engines internally and therefore tend to have higher latency, and sometimes multiple engines run to service one search. Backtrackers tend to have one engine that handles everything. But still, shouldn't the huge alternation be disastrous for the backtracker? Perhaps, unless many of the matches occur in an early branch, which is likely the case here. Namely, the second alternation branch matches a single ASCII space, which is probably the most frequently occurring byte in the haystack. An automata engine that doesn't use a DFA (which might be the case here, because the regex is so big) will wind up spending a lot of time keeping track of all branches of the alternation, even if it doesn't need to explore all of them. In contrast, a backtracker will try one after the other, and if most cases match an early branch, the backtracker is likely to take less overall time.
Most regex engines are stuck in the 1 MB/s (or less) range. The regex crate and PCRE2's JIT get up to about 10 MB/s, with PCRE2 edging out the regex crate.
Note that the regex was lightly modified from the original to increase portability across different regex engines. For example, the `[\s--\r\n]` class was changed to `[\t\v\f ]`.
As for the second benchmark, `multi`, it uses the same patterns from each alternation in the `single` benchmark, but treats each one as a distinct pattern. Doing this requires explicit support for searching multiple regex patterns. (RE2's and Rust's regex crate "regex set" functionality is not enough for this, as it only reports which patterns match a haystack, and not where they match. That's why the lower level `regex/automata/meta` engine is used, which exposes a more powerful but more complex API.)
In the `multi` case, `rust/regex/meta` does very well, and the key reason is the abdication of capture groups as a necessary tool to determine which token matched. Namely, now we can simply use a pattern ID from the match to determine which "branch" in the original regex was taken. We no longer need to ask for or inspect capture groups. This gives a critical benefit to automata engines that support searching for multiple patterns, because it no longer requires them to use slower engines for resolving capturing groups.
Engine | single | compile-single | multi |
---|---|---|---|
dotnet/compiled | 195.3 KB/s | 270.13us | - |
go/regexp | 317.1 KB/s | 186.96us | - |
hyperscan | - | - | 17.8 MB/s |
pcre2 | 3.0 MB/s | 19.84us | - |
pcre2/jit | 13.0 MB/s | 121.71us | - |
python/re | 1572.1 KB/s | 936.11us | - |
python/regex | 1426.9 KB/s | 2.92ms | - |
re2 | 1217.7 KB/s | 141.79us | - |
rust/regex | 235.9 KB/s | 280.24us | - |
rust/regex/meta | 9.5 MB/s | 257.56us | 62.4 MB/s |
rust/regexold | 238.8 KB/s | 249.67us | - |
Show individual benchmark parameters.
single
Parameter | Value |
---|---|
full name | curated/05-lexer-veryl/single |
model | count-captures |
regex-path | wild/parol-veryl.txt |
case-insensitive | false |
unicode | false |
haystack-path | wild/parol-veryl.vl |
count | 124800 |
Note that we don't include Hyperscan here because it doesn't support the `count-captures` benchmark model. It is included in the `multi` benchmark below, which doesn't require capture groups.

Also, I tried to use `dotnet/nobacktrack` here, but it failed because the regex was too big and it wasn't obvious to me how to raise the limit.
compile-single
Parameter | Value |
---|---|
full name | curated/05-lexer-veryl/compile-single |
model | compile |
regex-path | wild/parol-veryl.txt |
case-insensitive | false |
unicode | false |
haystack | abcdefg_foobar |
count | 1 |
This measures how long it takes to compile a moderately large lexer.
multi
Parameter | Value |
---|---|
full name | curated/05-lexer-veryl/multi |
model | count-spans |
regex-path | wild/parol-veryl.txt |
case-insensitive | false |
unicode | false |
haystack-path | wild/parol-veryl.vl |
count(*) | 150600 |
count(hyperscan) | 669500 |
Hyperscan reports everything that matches, including overlapping matches, and that's why its count is higher. It is likely still serviceable for this use case, but might in practice require changing the regex to suit Hyperscan's match semantics. Still, it's a decent barometer to include it here, particularly because of its multi-regex support.
Most regex engines do not support searching for multiple patterns and finding the corresponding match offsets, which is why this benchmark has very few entries.
This benchmark uses a regex that helped cause an outage at Cloudflare. This class of vulnerability is typically called a "regular expression denial of service," or "ReDoS" for short. It doesn't always require a malicious actor to trigger. Since it can be difficult to reason about the worst case performance of a regex when using an unbounded backtracking implementation, it might happen entirely accidentally on valid inputs.
The particular regex that contributed to the outage was:
```
(?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|\-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))
```
As discussed in Cloudflare's post mortem, the specific problematic portion of the regex is:
.*(?:.*=.*)
Or more simply:
.*.*=.*
We benchmark the original regex along with the simplified variant. We also split the simplified variant into one with a short haystack (about 100 bytes) and one with a long haystack (about 10,000 bytes). The benchmark results for the original and simplified short variant should be roughly similar, but the difference between the short and long variant is where things get interesting. The automata based engines generally maintain a similar throughput for both the short and long benchmarks, but the backtrackers slow way down. This is because the backtracking algorithm for this specific regex and haystack doesn't scale linearly with increases in the size of the haystack.
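The superlinear behavior is easy to reproduce with any unbounded backtracking engine. Here is a small demonstration using Python's re module (a sketch for illustration, not part of the benchmark itself):

```python
import re
import time

# Simplified variant of the Cloudflare regex.
pattern = re.compile(r".*.*=.*")

# On a haystack with no '=', an anchored backtracking match must try every
# way of splitting the input between the two leading .* expressions before
# giving up, which takes time quadratic in the haystack length.
for n in (1_000, 2_000, 4_000):
    haystack = "x" * n
    start = time.perf_counter()
    assert pattern.match(haystack) is None
    print(f"n={n}: {time.perf_counter() - start:.4f}s")
```

Doubling the haystack length roughly quadruples the running time here, while an automata based engine stays roughly linear.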
The purpose of this benchmark is to show a real world scenario where the use of a backtracking engine can bite you in production if you aren't careful.
We include Hyperscan in this benchmark, although it is questionable to do so.
Hyperscan reports many overlapping matches from the regex used by Cloudflare
because of the trailing .*, so it is probably not a great comparison. In
particular, this regex was originally used in a firewall, so it seems likely
that it would be used in a "is a match" or "not a match" scenario. But our
benchmark here reproduces the analysis in the appendix of Cloudflare's post
mortem. The real utility in including Hyperscan here is that it demonstrates
that it is not a backtracking engine. While its throughput is not as high as
some other engines, it remains roughly invariant with respect to haystack
length, just like other automata oriented engines.
Note that rust/regex has very high throughput here because the regex is
small enough to get compiled into a full DFA. The compilation process also
"accelerates" some states, particularly the final .*. This acceleration works
by noticing that almost all of the state's transitions loop back on itself, and
only a small number transition to another state. The final .*, for example,
only leaves its state if it sees the end of the haystack or a \n. So the DFA
will actually run memchr on \n and skip right to the end of the haystack.
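A tiny illustrative sketch of that acceleration idea (not the actual DFA code): when the only transition that leaves a state is on \n, the whole run of non-\n bytes can be skipped with one memchr-style scan instead of stepping the DFA one byte at a time.

```python
def skip_dotstar(haystack: bytes, at: int) -> int:
    """Advance past a (?s-:.)* state: every byte except '\\n' loops back on
    the same state, so a single find() call (memchr under the hood in
    CPython) jumps straight to the first byte that can leave the state."""
    i = haystack.find(b"\n", at)
    return len(haystack) if i == -1 else i
```

For example, `skip_dotstar(b"x=xxxx\nrest", 0)` lands directly on the \n at offset 6 in one scan.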
Engine | original | simplified-short | simplified-long |
---|---|---|---|
dotnet/compiled | 131.0 MB/s | 845.9 MB/s | 13.5 GB/s |
dotnet/nobacktrack | 12.9 MB/s | 188.5 MB/s | 288.9 MB/s |
go/regexp | 41.0 MB/s | 44.6 MB/s | 47.1 MB/s |
hyperscan | 85.8 MB/s | 81.7 MB/s | 84.9 MB/s |
pcre2 | 2.9 MB/s | 2.8 MB/s | 30.2 KB/s |
pcre2/jit | 49.8 MB/s | 41.6 MB/s | 671.2 KB/s |
python/re | 22.2 MB/s | 22.0 MB/s | 337.9 KB/s |
python/regex | 6.2 MB/s | 6.0 MB/s | 91.8 KB/s |
re2 | 347.1 MB/s | 333.1 MB/s | 493.7 MB/s |
regress | 9.2 MB/s | 8.9 MB/s | 115.3 KB/s |
rust/regex | 457.6 MB/s | 498.8 MB/s | 588.7 MB/s |
rust/regex/meta | 570.1 MB/s | 1768.6 MB/s | 82.4 GB/s |
rust/regexold | 468.1 MB/s | 491.3 MB/s | 599.5 MB/s |
Show individual benchmark parameters.
original
Parameter | Value |
---|---|
full name | curated/06-cloud-flare-redos/original |
model | count-spans |
regex | (?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*))) |
case-insensitive | false |
unicode | false |
haystack | math x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..] |
count(*) | 107 |
count(hyperscan) | 5757 |
simplified-short
Parameter | Value |
---|---|
full name | curated/06-cloud-flare-redos/simplified-short |
model | count-spans |
regex | .*.*=.* |
case-insensitive | false |
unicode | false |
haystack | x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..] |
count(*) | 102 |
count(hyperscan) | 5252 |
simplified-long
Parameter | Value |
---|---|
full name | curated/06-cloud-flare-redos/simplified-long |
model | count-spans |
regex | .*.*=.* |
case-insensitive | false |
unicode | false |
haystack-path | cloud-flare-redos.txt |
count(*) | 10000 |
count(hyperscan) | 50004999 |
This regex parses data from UnicodeData.txt, which is part of the Unicode
Character Database. This regex was extracted from the ucd-parse crate, which
is part of the ucd-generate project.
This benchmark works by iterating over every line in the haystack and then
running the regex on each line. Every line matches the regex, so regex engines
that attempt to do some extra work to reject non-matches quickly will get
penalized. For example, rust/regex looks for a semi-colon first via its
"reverse inner" optimization, since a semi-colon is a required part of the
regex. But this optimization is just extra work here. Indeed, disabling it will
improve the throughput of rust/regex on this benchmark.
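To illustrate the idea behind such an optimization (a hypothetical sketch, not the regex crate's actual implementation), consider a toy regex \w+;\w+: the required ';' acts as a pivot, so an engine can scan for it first, then match the prefix backwards from it and the suffix forwards.

```python
import re

# Sketch of a "reverse inner" literal optimization for the toy regex \w+;\w+.
PREFIX = re.compile(r"\w+$")  # trailing word chars just before the pivot
SUFFIX = re.compile(r"\w+")   # word chars starting just after the pivot

def find_matches(haystack: str):
    results = []
    pos = 0
    while (i := haystack.find(";", pos)) != -1:
        pre = PREFIX.search(haystack, 0, i)   # '$' anchors at endpos=i
        suf = SUFFIX.match(haystack, i + 1)
        if pre and pre.end() == i and suf:
            results.append(haystack[pre.start():suf.end()])
        pos = i + 1
    return results
```

When the pivot literal is rare, most of the haystack is skipped by the substring search; when every line contains it, as in this benchmark, the extra scan is pure overhead.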
pcre2/jit does remarkably well here, and these types of regexes are one of
the many things that pcre2/jit does quickly compared to most other regex
engines.
We also include compilation time for this regex, where PCRE2 again does quite well.
Engine | parse-line | compile |
---|---|---|
dotnet/compiled | 86.0 MB/s | 41.61us |
dotnet/nobacktrack | 17.1 MB/s | 152.18us |
go/regexp | 65.4 MB/s | 39.51us |
pcre2 | 200.1 MB/s | 2.11us |
pcre2/jit | 707.4 MB/s | 12.50us |
python/re | 44.7 MB/s | 128.16us |
python/regex | 35.3 MB/s | 406.11us |
re2 | 108.4 MB/s | 14.89us |
regress | 213.0 MB/s | 5.87us |
rust/regex | 94.2 MB/s | 22.54us |
rust/regex/meta | 225.6 MB/s | 28.18us |
rust/regexold | 94.7 MB/s | 18.95us |
Show individual benchmark parameters.
parse-line
Parameter | Value |
---|---|
full name | curated/07-unicode-character-data/parse-line |
model | grep-captures |
regex-path | wild/ucd-parse.txt |
case-insensitive | false |
unicode | false |
haystack-path | wild/UnicodeData-15.0.0.txt |
count | 558784 |
compile
Parameter | Value |
---|---|
full name | curated/07-unicode-character-data/compile |
model | compile |
regex-path | wild/ucd-parse.txt |
case-insensitive | false |
unicode | false |
haystack | 249D;PARENTHESIZED LATIN SMALL LETTER B;So;0;L;<compat> 0028 [.. snip ..] |
count | 1 |
This benchmark measures how long it takes for a regex engine to find words in
a haystack. We compare one regex that finds all words, \b\w+\b, and another
regex that only looks for longer words, \b\w{12,}\b. We also compare ASCII
regexes on English text with Unicode regexes on Russian text.
The split between finding all words and finding only long words tends to
highlight the overhead of matching in each regex engine. Regex engines that
are quicker to get in and out of their match routine do better at finding all
words than regex engines that have higher overhead. For example, regress is
faster than rust/regex on all-english, but substantially slower than
rust/regex on long-english. This is likely because rust/regex is doing more
work per search call than regress, which is in part rooted in the
optimizations it performs to gain higher throughput.
Otherwise, pcre2/jit does quite well here across the board, but especially on
the Unicode variants. When comparing it against rust/regex, for example, it
is substantially faster. In the case of rust/regex, its faster DFA oriented
engines cannot handle the Unicode aware \b on non-ASCII haystacks, and this
causes rust/regex to use a slower internal engine. It's so slow in fact
that python/re and python/regex are both faster than rust/regex for the
Unicode benchmarks. For the ASCII long-english benchmark, rust/regex and
re2 both do well because most of the time is spent in the lazy DFA, which has
pretty good throughput performance when compared to a pure backtracker.
Note that several regex engines can't be used in the Unicode variants because
either they don't support a Unicode aware \w or because they don't support a
Unicode aware \b (or both).
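The difference between the two semantics is easy to see in Python, whose re module supports both (the re.ASCII flag approximates engines whose \w and \b are not Unicode aware):

```python
import re

text = "привет мир"  # Russian: "hello world"

# Unicode-aware \w and \b (Python's default for str patterns) find the words.
assert re.findall(r"\b\w+\b", text) == ["привет", "мир"]

# With ASCII-only semantics, \w is [0-9A-Za-z_], so nothing matches at all.
assert re.findall(r"\b\w+\b", text, flags=re.ASCII) == []
```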
Engine | all-english | all-russian | long-english | long-russian |
---|---|---|---|---|
dotnet/compiled | 59.2 MB/s | 94.6 MB/s | 178.6 MB/s | 114.8 MB/s |
dotnet/nobacktrack | 29.5 MB/s | 38.4 MB/s | 142.8 MB/s | 124.4 MB/s |
go/regexp | 11.2 MB/s | - | 44.7 MB/s | - |
hyperscan | 158.0 MB/s | - | 445.0 MB/s | - |
pcre2 | 95.6 MB/s | 130.3 KB/s | 70.7 MB/s | 6.2 MB/s |
pcre2/jit | 190.1 MB/s | 228.2 MB/s | 245.2 MB/s | 195.7 MB/s |
python/re | 33.6 MB/s | 43.9 MB/s | 110.1 MB/s | 113.7 MB/s |
python/regex | 24.6 MB/s | 45.1 MB/s | 36.4 MB/s | 101.9 MB/s |
re2 | 62.8 MB/s | - | 920.8 MB/s | - |
regress | 167.9 MB/s | - | 153.1 MB/s | - |
rust/regex/meta | 110.9 MB/s | 17.1 MB/s | 800.9 MB/s | 30.2 MB/s |
Show individual benchmark parameters.
all-english
Parameter | Value |
---|---|
full name | curated/08-words/all-english |
model | count-spans |
regex | \b[0-9A-Za-z_]+\b |
case-insensitive | false |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count(*) | 56691 |
count(dotnet/compiled) | 56601 |
count(dotnet/nobacktrack) | 56601 |
We specifically write out [0-9A-Za-z_] instead of using \w because some
regex engines, such as the one found in .NET, make \w Unicode aware and there
doesn't appear to be any easy way of disabling it.
Also, the .NET engine makes \b Unicode-aware, which also appears impossible
to disable. To account for that, we permit a different count.
all-russian
Parameter | Value |
---|---|
full name | curated/08-words/all-russian |
model | count-spans |
regex | \b\w+\b |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count(*) | 107391 |
count(dotnet/compiled) | 53960 |
count(dotnet/nobacktrack) | 53960 |
regress, re2 and go/regexp are excluded because \w is not Unicode aware.
hyperscan is excluded because it doesn't support a Unicode aware \b.
For dotnet/compiled, since the length of matching spans is counted in UTF-16
code units, its expected count is smaller.
long-english
Parameter | Value |
---|---|
full name | curated/08-words/long-english |
model | count-spans |
regex | \b[0-9A-Za-z_]{12,}\b |
case-insensitive | false |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count | 839 |
We specifically write out [0-9A-Za-z_] instead of using \w because some
regex engines, such as the one found in .NET, make \w Unicode aware and there
doesn't appear to be any easy way of disabling it.
Also, the fact that \b is Unicode-aware in .NET does not seem to impact the
match counts in this benchmark.
long-russian
Parameter | Value |
---|---|
full name | curated/08-words/long-russian |
model | count-spans |
regex | \b\w{12,}\b |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count(*) | 5481 |
count(dotnet/compiled) | 2747 |
count(dotnet/nobacktrack) | 2747 |
regress, re2 and go/regexp are excluded because \w is not Unicode aware.
hyperscan is excluded because it doesn't support a Unicode aware \b.
For dotnet/compiled, since the length of matching spans is counted in UTF-16
code units, its expected count is smaller.
This measures a regex for detecting AWS keys in source code. In particular, to reduce false positives, it looks for both an access key and a secret key within a few lines of one another.
We also measure a "quick" version of the regex that is used to find possible candidates by searching for things that look like an AWS access key.
The measurements here demonstrate why the pypi-aws-secrets project splits this task into two pieces. First it uses the "quick" version to identify candidates, and then it uses the "full" version to lower the false positive rate of the "quick" version. The "quick" version of the regex runs around an order of magnitude faster than the "full" version across the board. To understand why, let's look at the "quick" regex:
((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))
Given this regex, every match starts with one of ASIA, AKIA, AROA or AIDA.
This makes it quite amenable to prefilter optimizations, where a regex engine
can look for matches of one of those 4 literals, and only then use the regex
engine to confirm whether there is a match at that position. Some regex
engines will also notice that every match starts with an A and use memchr
to look for occurrences of A as a fast prefilter.
We also include compilation times to give an idea of how long it takes to compile a moderately complex regex, and how that might vary with the compilation time of a much simpler version of the regex.
Note that in all of the measurements for this group, we search the CPython
source code (concatenated into one file). We also lossily convert it to UTF-8
so that regex engines like regress
can participate in this benchmark. (The
CPython source code contains a very small amount of invalid UTF-8.)
Engine | full | quick | compile-full | compile-quick |
---|---|---|---|---|
dotnet/compiled | 502.5 MB/s | 793.1 MB/s | 104.52us | 41.36us |
dotnet/nobacktrack | - | 666.1 MB/s | - | 211.23us |
go/regexp | 110.0 MB/s | 866.9 MB/s | 64.76us | 9.97us |
hyperscan | - | 1341.2 MB/s | - | 6.97ms |
pcre2 | 969.6 MB/s | 1498.0 MB/s | 3.61us | 867.00ns |
pcre2/jit | 1241.8 MB/s | 1018.7 MB/s | 20.31us | 4.84us |
python/re | 94.9 MB/s | 165.0 MB/s | 208.10us | 48.18us |
python/regex | 101.9 MB/s | 116.9 MB/s | 685.42us | 136.74us |
re2 | 536.8 MB/s | 998.3 MB/s | 71.00us | 9.16us |
regress | 278.9 MB/s | 708.9 MB/s | 8.56us | 2.09us |
rust/regex | 664.3 MB/s | 1465.4 MB/s | 78.68us | 20.44us |
rust/regex/meta | 1735.2 MB/s | 1809.1 MB/s | 84.42us | 14.67us |
rust/regexold | 645.7 MB/s | 1440.9 MB/s | 71.60us | 18.64us |
Show individual benchmark parameters.
full
Parameter | Value |
---|---|
full name | curated/09-aws-keys/full |
model | grep-captures |
regex | (('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|").*?(\n^.*?){0,4}(('|")[a-zA-Z0-9+/]{40}('|"))+|('|")[a-zA-Z0-9+/]{40}('|").*?(\n^.*?){0,3}('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|"))+ |
case-insensitive | false |
unicode | false |
haystack-path | wild/cpython-226484e4.py |
count | 0 |
quick
Parameter | Value |
---|---|
full name | curated/09-aws-keys/quick |
model | grep |
regex | ((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16})) |
case-insensitive | false |
unicode | false |
haystack-path | wild/cpython-226484e4.py |
count | 0 |
compile-full
Parameter | Value |
---|---|
full name | curated/09-aws-keys/compile-full |
model | compile |
regex | (('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|").*?(\n^.*?){0,4}(('|")[a-zA-Z0-9+/]{40}('|"))+|('|")[a-zA-Z0-9+/]{40}('|").*?(\n^.*?){0,3}('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|"))+ |
case-insensitive | false |
unicode | false |
haystack | "AIDAABCDEFGHIJKLMNOP""aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [.. snip ..] |
count | 1 |
compile-quick
Parameter | Value |
---|---|
full name | curated/09-aws-keys/compile-quick |
model | compile |
regex | ((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16})) |
case-insensitive | false |
unicode | false |
haystack | AIDAABCDEFGHIJKLMNOP |
count | 1 |
This group of benchmarks measures how well regex engines do with bounded
repeats. Bounded repeats are sub-expressions that are permitted to match up
to some fixed number of times. For example, a{3,5} matches 3, 4 or 5
consecutive a characters. Unlike unbounded repetition operators, the regex
engine needs some way to track when the bound has reached its limit. For this
reason, many regex engines will translate a{3,5} to aaaa?a?. Given that the
bounds may be much higher than 5 and that the sub-expression may be much more
complicated than a single character, bounded repeats can quickly cause the
underlying matcher to balloon in size.
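That translation can be checked directly; for example, in Python (used here just to demonstrate the equivalence), a{3,5} and its expanded form aaaa?a? accept exactly the same strings:

```python
import re

bounded = re.compile(r"\Aa{3,5}\Z")
expanded = re.compile(r"\Aaaaa?a?\Z")

# Both forms match exactly 3, 4 or 5 consecutive 'a's.
for n in range(10):
    s = "a" * n
    assert bool(bounded.match(s)) == bool(expanded.match(s)) == (3 <= n <= 5)
```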
We measure three different types of bounded repeats:
- A search for a number of consecutive letters, both ASCII only and Unicode aware.
- A search for certain types of words surrounding a Result type in Rust source code.
- A search for consecutive words, all beginning with a capital letter.
We also include measurements for the compilation time of the last two.
Hyperscan does unusually well here, particularly for an automata oriented engine. It's plausible that it has some specific optimizations in place for bounded repeats.
rust/regex slows down quite a bit on the context regex. Namely, the context
regex is quite gnarly and its (?s:.) sub-expression coupled with the bounded
repeat causes a large portion of its transition table to get filled out. This
in turn results in more time than usual being spent actually building the
lazy DFA's transition table during a search. Typically, the lazy DFA's
transition table is built pretty quickly and then mostly reused on subsequent
searches. But in this case, the transition table exceeds the lazy DFA's cache
capacity and results in the cache getting cleared. However, the rate at which
new transitions are created is still low enough that the lazy DFA is used
instead of falling back to a slower engine.
Engine | letters-en | letters-ru | context | capitals | compile-context | compile-capitals |
---|---|---|---|---|---|---|
dotnet/compiled | 257.3 MB/s | 191.4 MB/s | 332.5 MB/s | 866.2 MB/s | 33.50us | 28.69us |
dotnet/nobacktrack | 146.0 MB/s | 110.9 MB/s | 53.4 MB/s | 552.8 MB/s | 200.63us | 48.59us |
go/regexp | 29.5 MB/s | 27.5 MB/s | 29.3 MB/s | 52.6 MB/s | 74.82us | 54.98us |
hyperscan | 733.8 MB/s | 266.1 MB/s | 499.5 MB/s | 2.6 GB/s | 24.64ms | 664.52us |
pcre2 | 70.8 MB/s | 415.9 KB/s | 29.8 MB/s | 555.0 MB/s | 4.41us | 28.36us |
pcre2/jit | 333.4 MB/s | 285.1 MB/s | 323.9 MB/s | 1537.7 MB/s | 13.17us | 36.42us |
python/re | 73.0 MB/s | - | 72.4 MB/s | 66.4 MB/s | 57.77us | 32.01us |
python/regex | 35.6 MB/s | 77.1 MB/s | 36.4 MB/s | 255.5 MB/s | 148.77us | 75.79us |
re2 | 494.6 MB/s | 6.6 MB/s | 94.2 MB/s | 982.2 MB/s | 93.31us | 120.34us |
regress | 168.3 MB/s | - | - | - | - | - |
rust/regex | 621.6 MB/s | 538.8 MB/s | 20.2 MB/s | 810.4 MB/s | 42.00us | 63.75us |
rust/regex/meta | 695.2 MB/s | 611.5 MB/s | 102.5 MB/s | 825.6 MB/s | 61.35us | 61.81us |
rust/regexold | 611.3 MB/s | 546.7 MB/s | 19.7 MB/s | 825.6 MB/s | 40.89us | 69.28us |
Show individual benchmark parameters.
letters-en
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/letters-en |
model | count |
regex | [A-Za-z]{8,13} |
case-insensitive | false |
unicode | false |
haystack-path | opensubtitles/en-sampled.txt |
count(*) | 1833 |
count(hyperscan) | 3724 |
letters-ru
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/letters-ru |
model | count |
regex | \p{L}{8,13} |
case-insensitive | false |
unicode | true |
haystack-path | opensubtitles/ru-sampled.txt |
count(*) | 3475 |
count(hyperscan) | 8570 |
context
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/context |
model | count |
regex | [A-Za-z]{10}\s+(?s:.){0,100}Result(?s:.){0,100}\s+[A-Za-z]{10} |
case-insensitive | false |
unicode | false |
haystack-path | rust-src-tools-3b0d4813.txt |
count(*) | 53 |
count(hyperscan) | 109 |
capitals
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/capitals |
model | count |
regex | (?:[A-Z][a-z]+\s*){10,100} |
case-insensitive | false |
unicode | false |
haystack-path | rust-src-tools-3b0d4813.txt |
count(*) | 11 |
count(hyperscan) | 237 |
compile-context
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/compile-context |
model | compile |
regex | [A-Za-z]{10}\s+(?s:.){0,100}Result(?s:.){0,100}\s+[A-Za-z]{10} |
case-insensitive | false |
unicode | false |
haystack | abcdefghij blah blah blah Result blib blab klmnopqrst |
count | 1 |
compile-capitals
Parameter | Value |
---|---|
full name | curated/10-bounded-repeat/compile-capitals |
model | compile |
regex | (?:[A-Z][a-z]+\s*){10,100} |
case-insensitive | false |
unicode | false |
haystack | Crazy Janey Mission Man Wild Billy Greasy Lake Hazy Davy Kil [.. snip ..] |
count(*) | 1 |
count(hyperscan) | 12 |
These benchmarks come from a task that converts unstructured log data to structured JSON data. It works by iterating over every line in the log file and parsing various parts of each line into different sections using capture groups. The regex matches every line, so any fast logic designed to reject non-matches will generally penalize regex engines here.
The original regex looks like this:
(?x)
^
(?P<timestamp>[^\ ]+\ [^\ ]+)
[\ ](?P<level>[DIWEF])[1234]:[\ ]
(?P<header>
(?:
(?:
\[ [^\]]*? \] | \( [^\)]*? \)
):[\ ]
)*
)
(?P<body>.*?)
[\ ]\{(?P<location>[^\}]*)\}
$
(The actual regex is flattened since not all engines support verbose mode. We also remove the names from each capture group.)
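For reference, the regex above runs as-is under Python's re module, which supports the same verbose syntax. The log line below is a synthetic example modeled on the shape of the benchmark's haystack:

```python
import re

# The unstructured-to-json regex from above, in Python's verbose syntax.
LOG_RE = re.compile(r"""(?x)
    ^
    (?P<timestamp>[^\ ]+\ [^\ ]+)
    [\ ](?P<level>[DIWEF])[1234]:[\ ]
    (?P<header>
        (?:
            (?:
                \[ [^\]]*? \] | \( [^\)]*? \)
            ):[\ ]
        )*
    )
    (?P<body>.*?)
    [\ ]\{(?P<location>[^\}]*)\}
    $
""")

# A synthetic line in the same shape as the benchmark's haystack.
line = "2022/06/17 06:25:22 I4: [17936]: (8f): did a thing {main.c:42}"
m = LOG_RE.match(line)
assert m is not None
assert m.group("level") == "I"
assert m.group("body") == "did a thing"
assert m.group("location") == "main.c:42"
```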
pcre2/jit does really well here. I'm not personally familiar with how
PCRE2's JIT works, but if I had to guess, I'd say there are some clever
optimizations with respect to the [^ ]+ (and similar) sub-expressions in this
regex.
Otherwise, the backtracking engines generally outperform the automata engines
in this benchmark. Interestingly, re2, go/regexp and rust/regex all
principally use their own bounded backtracking algorithms here. But it looks
like "proper" backtrackers tend to be better optimized than the ones found in
RE2 and its descendants. (Bounded backtracking does have to pay for checking
that no combination of haystack position and NFA state is visited more than
once, but even removing that check does not bring, e.g., rust/regex up to
speeds similar to other backtrackers.)
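A minimal illustration of the bounded backtracking idea (a toy, not any engine's actual code): a hand-built NFA for the regex a*a*b, where a visited set keyed by (state, position) guarantees each pair is explored at most once.

```python
# Toy Thompson NFA for the regex a*a*b. Each state is (kind, a, b):
# 'split' tries branch a then branch b, 'char' consumes byte a and goes to
# state b, 'match' accepts at the end of the haystack.
NFA = {
    0: ("split", 1, 2),    # first a*
    1: ("char", "a", 0),
    2: ("split", 3, 4),    # second a*
    3: ("char", "a", 2),
    4: ("char", "b", 5),
    5: ("match", None, None),
}

def backtrack(haystack, state=0, pos=0, seen=None):
    seen = set() if seen is None else seen
    if (state, pos) in seen:
        return False          # the bound: each (state, pos) is visited once
    seen.add((state, pos))
    kind, a, b = NFA[state]
    if kind == "match":
        return pos == len(haystack)
    if kind == "char":
        return (pos < len(haystack) and haystack[pos] == a
                and backtrack(haystack, b, pos + 1, seen))
    return backtrack(haystack, a, pos, seen) or backtrack(haystack, b, pos, seen)
```

Removing the seen check gives a classic unbounded backtracker, whose running time on inputs like "a" * n (with no b) grows quadratically for this pattern; the visited set caps the work at one unit per (state, position) pair, at the cost of maintaining the set.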
Engine | extract | compile |
---|---|---|
dotnet/compiled | 560.1 MB/s | 44.75us |
dotnet/nobacktrack | 17.7 MB/s | 505.07us |
go/regexp | 82.7 MB/s | 21.23us |
pcre2 | 206.9 MB/s | 1.35us |
pcre2/jit | 1440.3 MB/s | 7.11us |
python/re | 119.9 MB/s | 88.83us |
python/regex | 126.1 MB/s | 279.58us |
re2 | 117.9 MB/s | 9.45us |
regress | 301.1 MB/s | 4.01us |
rust/regex | 89.9 MB/s | 17.27us |
rust/regex/meta | 114.4 MB/s | 20.37us |
rust/regexold | 89.5 MB/s | 14.05us |
Show individual benchmark parameters.
extract
Parameter | Value |
---|---|
full name | curated/11-unstructured-to-json/extract |
model | grep-captures |
regex-path | wild/unstructured-to-json.txt |
case-insensitive | false |
unicode | false |
haystack-path | wild/unstructured-to-json.log |
count | 600 |
compile
Parameter | Value |
---|---|
full name | curated/11-unstructured-to-json/compile |
model | compile |
regex-path | wild/unstructured-to-json.txt |
case-insensitive | false |
unicode | false |
haystack | 2022/06/17 06:25:22 I4: [17936:140245395805952:(17998)]: (8f [.. snip ..] |
count | 1 |
It would be great to add more regex engines to this barometer. I am thinking of at least the following, but I'm generally open to any regex engine that has a reasonable build process with stable tooling:
- Perl's regex engine.
- Ruby's regex engine, or perhaps just Onigmo directly.
- ICU's regex engine.
- nim-regex.
- D's std.regex.
- CTRE. (This one may prove tricky since "compile a regex" probably means "compile a C++ program." The rebar tool supports this, but it will be annoying. If you want to add this, please file an issue to discuss an implementation path.)
- A POSIX regex engine.
- Javascript's regex engine. (Is it accessible outside of a browser?)
- java.util.regex.
- System.Text.RegularExpressions from .NET.
- NSRegularExpression, perhaps through Swift?
- Lisp's CL-PPCRE.
- A selected subset of the mess that is regex libraries for Haskell.
Here are some other regex engines I'm aware of, but I have reservations about including them:
- PHP's preg functions. This "just" uses PCRE2, which is already included in this benchmark, so it's not clear whether it's also worth measuring here too. But maybe it is. Maybe PHP introduces some interesting performance characteristics that meaningfully alter the picture presented by using PCRE2 directly.
- Julia's standard regex engine, which last I checked was also PCRE2. So a similar reasoning as for PHP applies here.
- C++'s std::regex or Boost's regex engine. These are known to be horribly slow. Maybe we should still include them for completeness.
- re2c does regex matching through code generation, so this would likely work similarly to CTRE if it were to be added. It serves a very different use case than most regex engines, so I'm not sure if it fits here, but it could be interesting.
- Regex engines embedded in grep tools like GNU grep. These may be a little tricky to correctly benchmark given the methodology here, but I think it should be possible to satisfy at least some of the models. The idea here would be to actually call the grep program and not try to rip the regex engine out of it.
- Tcl's regex library is something I've benchmarked in the past and I recall it also being extraordinarily slow. So I'm not sure if it's worth it? Also, does anyone still use Tcl?
I think the main criteria for inclusion are:
- Someone has to actually be using the regex engine for something. It's not scalable to include every regex engine someone threw up on GitHub five years ago that isn't being maintained and nobody is using.
- The build process for the regex engine needs to be somewhat reasonable, or it needs to somehow be maintained by someone else. For example, we don't build Python's regex engine. Instead, we just require that Python be installed.