This directory contains a Perl runner program for benchmarking Perl's regex engine. Perl's regex engine uses backtracking.
This runner program makes a few choices worth highlighting:
- Regexes are compiled up front using `my $re = qr/.../;`, but in order to do an iterative search, they seemingly need to be re-embedded in match syntax as `/$re/g`, since it seems the `g` flag cannot be used in combination with the `qr/.../` syntax. (This makes sense conceptually: the `g` flag is likely connected to search-time state, whereas `qr/.../` is really just about getting the regex into a compiled form.) However, using the syntax `/$re/g` to execute a search makes it look like the regex is being re-compiled. We do not want to measure regex compilation during a search if we can help it. So the question is: does `/$re/g` re-compile `$re` if `$re` is already a compiled regex object? I could not find the answer to this question.
- The `grep` and `grep-captures` models use a regex to iterate over lines. This is somewhat odd, but appears idiomatic. It is also probably the fastest way to do such a thing in Perl, although I don't know for sure. If there's a better and/or more idiomatic approach, I'd be happy for a PR where we can discuss it.
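
The two patterns described in the list above can be sketched roughly as follows. This is an illustration only, not the runner's actual code: the sub names `count_matches` and `count_matching_lines` are hypothetical, and the line-splitting regex assumes every line in the haystack ends with `\n`.

```perl
use strict;
use warnings;

# Count all non-overlapping matches of an already-compiled regex.
# (Hypothetical helper, not taken from the runner.)
sub count_matches {
    my ($re, $haystack) = @_;
    my $count = 0;
    # In scalar context, /g resumes from pos($haystack) on each iteration.
    $count++ while $haystack =~ /$re/g;
    return $count;
}

# Count the lines that contain a match, using a regex to walk the lines.
# Assumption: every line is terminated by "\n".
sub count_matching_lines {
    my ($re, $haystack) = @_;
    my $count = 0;
    while ($haystack =~ /([^\n]*)\n/g) {
        my $line = $1;
        $count++ if $line =~ $re;
    }
    return $count;
}

my $re = qr/fo+/;
my $haystack = "foo bar\nbaz\nfoo foo\n";
print count_matches($re, $haystack), "\n";         # prints 3
print count_matching_lines($re, $haystack), "\n";  # prints 2
```

Whether the `/$re/g` inside `count_matches` triggers a re-compilation is exactly the open question raised above; the sketch does not answer it.
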
While I don't know for sure, my suspicion is that the compilation benchmark
model is not implemented correctly. My guess is that Perl regexes are being
cached, and so this runner program doesn't actually measure what it's supposed
to measure. After a cursory search, I could neither confirm nor deny this
hypothesis and could find no way to clear any cache that Perl might be using.
(Python has `re.purge()`, which we use for exactly this purpose.)
Because of this, Perl's regex engine is not included in any of the curated compilation benchmarks.
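
One way to probe the caching hypothesis, sketched here under assumptions, is to time repeated compilation of the same pattern string against compilation of pattern strings that differ only by a trailing regex comment. The sub name `time_compiles` and the `(?#...)` trick are illustrative choices, not something the runner does, and a timing difference would be suggestive at best rather than proof of a cache.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Time $iters compilations of $pattern. When $vary is true, a distinct
# regex comment is appended each time so no two pattern strings are equal.
# (Hypothetical helper for experimentation only.)
sub time_compiles {
    my ($pattern, $iters, $vary) = @_;
    my $start = time();
    for my $i (1 .. $iters) {
        # (?#...) is a regex comment, so it should not change what matches.
        my $src = $vary ? $pattern . "(?#$i)" : $pattern;
        my $re = qr/$src/;
    }
    return time() - $start;
}

my $pattern = '\b(?:foo|bar|quux)+\d{2,5}\b';
printf "same string:    %.6fs\n", time_compiles($pattern, 100_000, 0);
printf "varying string: %.6fs\n", time_compiles($pattern, 100_000, 1);
```
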
Perl's documentation boasts quite impressive support for Unicode, and it can be
toggled via the `a` and `u` flags (among some others). Unfortunately, actually
getting them to work is quite a challenge, and I'm still not quite sure that
this program does it correctly. Review on this point would be appreciated.
What I do believe is true is that when Unicode is enabled in the benchmark
definition, things like `\w`/`\s`/`\d`, `\b` and case insensitivity are
all Unicode-aware, and when Unicode is disabled, all of those things are
limited to their ASCII-only interpretations. This is perfect and ultimately
works similarly to Rust's regex engine. (The main problem is that whether any
of this works correctly depends on certain properties of the strings themselves,
and most of it happens silently.)
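
As a small sketch of the behavior described above (not the runner's code), the following compares the `u` and `a` flags on a decoded string and on the raw UTF-8 bytes of the same character. Reading "properties of the strings" as "bytes versus decoded characters" is my assumption; the helper `yn` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render a match result as text. (Hypothetical helper.)
sub yn { return $_[0] ? "yes" : "no" }

my $bytes = "\xCE\xB4";               # UTF-8 bytes for U+03B4 (Greek small delta)
my $chars = decode('UTF-8', $bytes);  # the same text as one decoded character
my $digit = "\x{663}";                # U+0663 ARABIC-INDIC DIGIT THREE

my $word_u = qr/^\w$/u;  # Unicode rules
my $word_a = qr/^\w$/a;  # \w restricted to ASCII

print "lengths: ", length($bytes), " vs ", length($chars), "\n";  # 2 vs 1
print "decoded, /u: ", yn($chars =~ $word_u), "\n";    # yes
print "decoded, /a: ", yn($chars =~ $word_a), "\n";    # no
print "bytes,   /u: ", yn($bytes =~ $word_u), "\n";    # no: two characters, not one
print "digit,   /u: ", yn($digit =~ /^\d$/u), "\n";    # yes
print "digit,   /a: ", yn($digit =~ /^\d$/a), "\n";    # no
```

Note that the undecoded case fails without any warning, which is the kind of silent behavior that makes this hard to verify.
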
This is the first real Perl program I've ever written. It is probably not idiomatic. I'm open to PRs making it more idiomatic, but I'd prefer if it not become even more cryptic than it already is.