doc: updates to MODELs and .NET README

Specifically, .NET's runner program made me realize that "count the number of bytes in a match" is not really viable in some cases. For .NET (and also, I imagine, Java), the most natural thing to do is count the number of UTF-16 code units. "Counting bytes" is precisely equivalent to "counting UTF-8 code units" when the haystack is valid UTF-8. Some haystacks aren't valid UTF-8, so "counting bytes" really is the right thing there. But conceptually speaking, it's fine to think of `count-spans` as summing the lengths of matches in terms of UTF-8 or UTF-16 code units.
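
To make the distinction concrete, here is a minimal Python sketch. It is not part of this commit or of any runner program. The function name `count_spans` and its `unit` parameter are hypothetical, chosen only for illustration. It sums match lengths in either UTF-8 or UTF-16 code units, assuming the haystack is a valid Unicode string.

```python
import re


def count_spans(pattern: str, haystack: str, unit: str = "utf8") -> int:
    """Sum the lengths of all matches, measured in the given code unit.

    Hypothetical helper for illustration only; `unit` is "utf8" or "utf16".
    """
    total = 0
    for m in re.finditer(pattern, haystack):
        text = m.group(0)
        if unit == "utf8":
            # One UTF-8 code unit is one byte.
            total += len(text.encode("utf-8"))
        elif unit == "utf16":
            # One UTF-16 code unit is two bytes; "utf-16-le" emits no BOM.
            total += len(text.encode("utf-16-le")) // 2
        else:
            raise ValueError(f"unknown unit: {unit!r}")
    return total


# ASCII-only matches: both units agree.
print(count_spans(r"[0-9]{2}|[a-z]", "12a!!34", "utf8"))   # 5
print(count_spans(r"[0-9]{2}|[a-z]", "12a!!34", "utf16"))  # 5

# Non-ASCII matches: the two units give different sums.
print(count_spans(r"\w+", "na\u00efve caf\u00e9", "utf8"))   # 11
print(count_spans(r"\w+", "na\u00efve caf\u00e9", "utf16"))  # 9
```

The ASCII case matches the `12a!!34` example in MODELS.md, where either unit yields 5. The second case shows why a benchmark definition would need a different expected count for an engine that reports UTF-16 code units.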
BurntSushi · Mar 26, 2023 · 8192bdc
1 parent 43bf8e0

Showing 2 changed files with 18 additions and 10 deletions.
diff --git a/MODELS.md b/MODELS.md
@@ -108,9 +108,17 @@ start of a match in automata oriented engines.
 
 ## `count-spans`
 
-This model is like `count`, except it returns a sum of the lengths (in bytes)
-of all matches found in a single haystack. The verification step simply
-confirms that the sum matches what is expected.
+This model is like `count`, except it returns a sum of the lengths of all
+matches found in a single haystack. The verification step simply confirms that
+the sum matches what is expected.
+
+The length of a match should ideally be in terms of the number of bytes, but
+it is also permissible to count the number of code units. For example, .NET's
+regex engine can only run on sequences of UTF-16 code units, so using a length
+derived from anything other than UTF-16 code units implies an overhead cost
+that would otherwise be artificial to this benchmark. Benchmark definitions
+will need to account for this by specifying different counts expected for regex
+engines that count something other than individual bytes.
 
 For example, given the regex `[0-9]{2}|[a-z]` and the haystack `12a!!34`, the
 total sum reported should be `len(12) + len(a) + len(34) = 5`.

diff --git a/engines/dotnet/README.md b/engines/dotnet/README.md
@@ -3,13 +3,13 @@ expressions][dotnet-regex].
 
 .NET's regex engine is principally backtracking based, with an option for JIT
 compilation. As of .NET 7, there is also an option to use a non-backtracking
-engine. In total, this results in three total regex engines that this runner
-program can execute: `dotnet`, `dotnet/compiled` and `dotnet/nobacktrack`. In
-general, we only measure the latter two, since it's expected that if one cares
-about performance, then they'll avoid using the pure interpreter based regex
-engine. (.NET also provides generated regexes. In theory, we could measure
-those too, but it would require writing a program to build .NET programs, which
-is perhaps more trouble than it's worth.)
+engine. In total, this results in three regex engines that this runner program
+can execute: `dotnet`, `dotnet/compiled` and `dotnet/nobacktrack`. In general,
+we only measure the latter two, since it's expected that if one cares about
+performance, then they'll avoid using the pure interpreter based regex engine.
+(.NET also provides regexes that compile to C# source code. In theory, we
+could measure those too, but it would require writing a program to build .NET
+programs, which is perhaps more trouble than it's worth.)
 
 This program otherwise makes the following choices: