doc: updates to MODELs and .NET README
Specifically, .NET's runner program made me realize that "count the
number of bytes in a match" is not really viable in some cases. For .NET
(and also, I imagine, Java), the most natural thing to do is count the
number of UTF-16 code units.

"counting bytes" is precisely equivalent to "count UTF-8 code units" in
cases where the haystack is valid UTF-8. There are some haystacks that
aren't valid UTF-8, so "counting bytes" really is the right thing there.
But conceptually speaking, it's fine to think of `count-spans` as
summing the lengths of matches in terms of UTF-8 or UTF-16 code units.
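As a rough illustration of the distinction drawn above, the following sketch measures the same text in UTF-8 and UTF-16 code units. The helper names `utf8_units` and `utf16_units` are hypothetical and are not part of the benchmark harness; they exist only to show where the two counts agree and where they diverge.

```python
def utf8_units(s: str) -> int:
    # For valid UTF-8, the number of code units equals the number of bytes.
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids emitting a BOM.
    return len(s.encode("utf-16-le")) // 2


# ASCII, a two-byte Greek letter (U+03B4), and an astral code point (U+1F600).
for text in ["abc", "\u03b4", "\U0001F600"]:
    print(repr(text), utf8_units(text), utf16_units(text))
```

For ASCII-only text the two counts coincide, which is why many benchmark definitions would not need separate expected values.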
BurntSushi committed Mar 26, 2023
1 parent 43bf8e0 commit 8192bdc
Showing 2 changed files with 18 additions and 10 deletions.
14 changes: 11 additions & 3 deletions MODELS.md
@@ -108,9 +108,17 @@ start of a match in automata oriented engines.

## `count-spans`

-This model is like `count`, except it returns a sum of the lengths (in bytes)
-of all matches found in a single haystack. The verification step simply
-confirms that the sum matches what is expected.
+This model is like `count`, except it returns a sum of the lengths of all
+matches found in a single haystack. The verification step simply confirms that
+the sum matches what is expected.
+
+The length of a match should ideally be in terms of the number of bytes, but
+it is also permissible to count the number of code units. For example, .NET's
+regex engine can only run on sequences of UTF-16 code units, so using a length
+derived from anything other than UTF-16 code units implies an overhead cost
+that would otherwise be artificial to this benchmark. Benchmark definitions
+will need to account for this by specifying different counts expected for regex
+engines that count something other than individual bytes.

For example, given the regex `[0-9]{2}|[a-z]` and the haystack `12a!!34`, the
total sum reported should be `len(12) + len(a) + len(34) = 5`.
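
The `count-spans` model described in this hunk can be sketched as follows. This is a minimal illustration using Python's `re` module rather than any engine's actual runner program; the function name `count_spans` and its `unit` parameter are assumptions made for this example.

```python
import re


def count_spans(pattern: str, haystack: str, unit: str = "utf-8") -> int:
    """Sum the lengths of all matches, measured in the given code unit."""
    total = 0
    for m in re.finditer(pattern, haystack):
        text = m.group()
        if unit == "utf-8":
            # Bytes and UTF-8 code units are the same thing for valid UTF-8.
            total += len(text.encode("utf-8"))
        elif unit == "utf-16":
            # Two bytes per UTF-16 code unit; "utf-16-le" emits no BOM.
            total += len(text.encode("utf-16-le")) // 2
        else:
            raise ValueError(f"unsupported unit: {unit}")
    return total


# The example from the text: len(12) + len(a) + len(34).
ascii_total = count_spans(r"[0-9]{2}|[a-z]", "12a!!34")

# A non-ASCII haystack ("naive" with U+00EF, then U+03B4), where units differ.
haystack = "na\u00efve \u03b4"
utf8_total = count_spans(r"\w+", haystack, unit="utf-8")
utf16_total = count_spans(r"\w+", haystack, unit="utf-16")
print(ascii_total, utf8_total, utf16_total)
```

With an ASCII haystack both units give the same sum, so a single expected count suffices. Once matches contain non-ASCII text the sums diverge, which is the case where a benchmark definition would need a separate expected count per unit.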
14 changes: 7 additions & 7 deletions engines/dotnet/README.md
@@ -3,13 +3,13 @@ expressions][dotnet-regex].

.NET's regex engine is principally backtracking based, with an option for JIT
compilation. As of .NET 7, there is also an option to use a non-backtracking
-engine. In total, this results in three total regex engines that this runner
-program can execute: `dotnet`, `dotnet/compiled` and `dotnet/nobacktrack`. In
-general, we only measure the latter two, since it's expected that if one cares
-about performance, then they'll avoid using the pure interpreter based regex
-engine. (.NET also provides generated regexes. In theory, we could measure
-those too, but it would require writing a program to build .NET programs, which
-is perhaps more trouble than it's worth.)
+engine. In total, this results in three regex engines that this runner program
+can execute: `dotnet`, `dotnet/compiled` and `dotnet/nobacktrack`. In general,
+we only measure the latter two, since it's expected that if one cares about
+performance, then they'll avoid using the pure interpreter based regex engine.
+(.NET also provides regexes that compile to C# source code. In theory, we
+could measure those too, but it would require writing a program to build .NET
+programs, which is perhaps more trouble than it's worth.)

This program otherwise makes the following choices:
