proposal: unicode: improvement of rune-checking funcs #68064

diegommm · 2024-06-19T04:29:20Z

Proposal Details

I was writing a text parser and came across the need to check some runes repeatedly. I also used a lot functions like utf8.ValidRune, unicode.IsSpace, etc. I found some of them were slower compared to some versions of other rune-checker funcs I wrote for the parser, so I setup a repo to make a PoC with my findings. I also found that using unicode.RangeTable is for a very specific use cases, and is not very ergonomic for many mundane use cases. It's also true that I found myself calling Contains and Index family of functions in strings and bytes packages, but many times with the same "needles" (things looked up) for many different "haystacks" (where we looked them up). In some cases, these functions will create an internal, temporary data structure to aid with the lookup, but they are discarded and cannot be reused. It would be great if they could be reused somehow.

(aside: I don't want to bother with the needle/haystack terminology, feel free to point me to more Go-ish alternatives).

The proposal can be divided in two items:

Add a way to create efficient func(rune) bool functions so they can be reused many times. Bikeshedding starter:
1. unicode.IsBytesFunc([]byte) func(rune) bool
2. unicode.IsRunesFunc([]rune) func(rune) bool
3. unicode.IsStringFunc(string) func(rune) bool
Improve the performance of some implementations of these kind of functions in unicode, utf8, strings and bytes packages, especially when there are runes > utf8.RuneSelf. I would like to contribute to the project, so let me know if these are changes that are wanted in the standard library and to what extent should they be implemented (there might be other things to consider that I'm missing to strike the right balance for the project). If so, I can go through the contribution guides and do all that stuff once the "what" is agreed, we can discuss starting with code in the PoC repo (which can have some cleanup for sure).

If both items cannot be attended in this same issue, I would prefer that number 1 is addressed here, as that is more of a proposal, and number 2 sounds more like an improvement that doesn't change API and can probably be addressed in other place (and please, could you kindly point me where to do it).

benchstat results for unicode and utf8 improvements

goos: linux
goarch: amd64
pkg: github.com/diegommm/runes
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                                 │    stdlib    │                local                │
                                 │    sec/op    │   sec/op     vs base                │
Stdlib/IsDigit/rune=-1-8            1.382n ± 2%   1.296n ± 4%   -6.22% (p=0.000 n=20)
Stdlib/IsDigit/rune=48-8            1.643n ± 1%   1.215n ± 4%  -26.07% (p=0.000 n=20)
Stdlib/IsDigit/rune=97-8            1.632n ± 1%   1.201n ± 2%  -26.43% (p=0.000 n=20)
Stdlib/IsDigit/rune=209-8           1.642n ± 2%   1.210n ± 5%  -26.32% (p=0.000 n=20)
Stdlib/IsDigit/rune=26412-8         7.547n ± 2%   1.198n ± 1%  -84.12% (p=0.000 n=20)
Stdlib/IsDigit/rune=55296-8         7.686n ± 1%   1.220n ± 1%  -84.13% (p=0.000 n=20)
Stdlib/IsDigit/rune=128169-8        7.380n ± 1%   1.207n ± 1%  -83.64% (p=0.000 n=20)
Stdlib/IsDigit/rune=130041-8        6.396n ± 3%   1.212n ± 1%  -81.06% (p=0.000 n=20)
Stdlib/IsDigit/rune=33554432-8      6.628n ± 2%   1.289n ± 5%  -80.55% (p=0.000 n=20)
Stdlib/IsLower/rune=-1-8            3.033n ± 2%   1.290n ± 0%  -57.47% (p=0.000 n=20)
Stdlib/IsLower/rune=97-8            1.705n ± 3%   1.216n ± 4%  -28.65% (p=0.000 n=20)
Stdlib/IsLower/rune=209-8           1.712n ± 3%   1.211n ± 2%  -29.28% (p=0.000 n=20)
Stdlib/IsLower/rune=26412-8         8.889n ± 2%   1.195n ± 2%  -86.56% (p=0.000 n=20)
Stdlib/IsLower/rune=55296-8         9.441n ± 3%   1.213n ± 2%  -87.16% (p=0.000 n=20)
Stdlib/IsLower/rune=125251-8        7.268n ± 1%   1.194n ± 2%  -83.57% (p=0.000 n=20)
Stdlib/IsLower/rune=128169-8        7.566n ± 3%   1.317n ± 3%  -82.59% (p=0.000 n=20)
Stdlib/IsLower/rune=33554432-8      7.747n ± 5%   1.287n ± 0%  -83.39% (p=0.000 n=20)
Stdlib/IsLower/rune=0-8             1.813n ± 2%   1.289n ± 5%  -28.90% (p=0.000 n=20)
Stdlib/IsPrint/rune=-1-8           11.520n ± 3%   1.291n ± 0%  -88.79% (p=0.000 n=20)
Stdlib/IsPrint/rune=32-8            1.731n ± 2%   1.199n ± 1%  -30.73% (p=0.000 n=20)
Stdlib/IsPrint/rune=97-8            1.738n ± 1%   1.201n ± 2%  -30.92% (p=0.000 n=20)
Stdlib/IsPrint/rune=209-8           1.707n ± 4%   1.218n ± 4%  -28.63% (p=0.000 n=20)
Stdlib/IsPrint/rune=26412-8        10.620n ± 2%   1.218n ± 2%  -88.53% (p=0.000 n=20)
Stdlib/IsPrint/rune=55296-8        49.325n ± 2%   1.197n ± 2%  -97.57% (p=0.000 n=20)
Stdlib/IsPrint/rune=128169-8       39.490n ± 1%   1.197n ± 2%  -96.97% (p=0.000 n=20)
Stdlib/IsPrint/rune=917999-8       18.710n ± 3%   1.214n ± 2%  -93.51% (p=0.000 n=20)
Stdlib/IsPrint/rune=33554432-8     39.165n ± 6%   1.286n ± 5%  -96.72% (p=0.000 n=20)
Stdlib/IsTitle/rune=-1-8            1.546n ± 4%   1.295n ± 4%  -16.18% (p=0.000 n=20)
Stdlib/IsTitle/rune=97-8            1.571n ± 4%   1.295n ± 4%  -17.57% (p=0.000 n=20)
Stdlib/IsTitle/rune=209-8           1.582n ± 6%   1.286n ± 0%  -18.68% (p=0.000 n=20)
Stdlib/IsTitle/rune=453-8           5.665n ± 2%   1.213n ± 4%  -78.59% (p=0.000 n=20)
Stdlib/IsTitle/rune=8188-8          7.506n ± 6%   1.204n ± 1%  -83.96% (p=0.000 n=20)
Stdlib/IsTitle/rune=26412-8         3.200n ± 5%   1.292n ± 4%  -59.61% (p=0.000 n=20)
Stdlib/IsTitle/rune=55296-8         3.206n ± 4%   1.287n ± 0%  -59.86% (p=0.000 n=20)
Stdlib/IsTitle/rune=128169-8        3.194n ± 3%   1.287n ± 5%  -59.68% (p=0.000 n=20)
Stdlib/IsTitle/rune=33554432-8      3.005n ± 2%   1.287n ± 0%  -57.15% (p=0.000 n=20)
Stdlib/IsNumber/rune=-1-8           3.024n ± 2%   1.327n ± 3%  -56.11% (p=0.000 n=20)
Stdlib/IsNumber/rune=48-8           1.602n ± 2%   1.214n ± 2%  -24.19% (p=0.000 n=20)
Stdlib/IsNumber/rune=97-8           1.605n ± 1%   1.221n ± 5%  -23.93% (p=0.000 n=20)
Stdlib/IsNumber/rune=209-8          1.584n ± 1%   1.202n ± 2%  -24.15% (p=0.000 n=20)
Stdlib/IsNumber/rune=26412-8        8.251n ± 7%   1.202n ± 2%  -85.43% (p=0.000 n=20)
Stdlib/IsNumber/rune=55296-8        8.683n ± 3%   1.209n ± 1%  -86.08% (p=0.000 n=20)
Stdlib/IsNumber/rune=128169-8       8.423n ± 3%   1.197n ± 3%  -85.79% (p=0.000 n=20)
Stdlib/IsNumber/rune=130041-8       8.373n ± 3%   1.200n ± 1%  -85.67% (p=0.000 n=20)
Stdlib/IsNumber/rune=33554432-8     8.539n ± 6%   1.287n ± 0%  -84.92% (p=0.000 n=20)
Stdlib/IsSymbol/rune=-1-8           3.163n ± 5%   1.321n ± 2%  -58.26% (p=0.000 n=20)
Stdlib/IsSymbol/rune=36-8           1.683n ± 3%   1.214n ± 4%  -27.92% (p=0.000 n=20)
Stdlib/IsSymbol/rune=97-8           1.601n ± 1%   1.211n ± 5%  -24.36% (p=0.000 n=20)
Stdlib/IsSymbol/rune=209-8          1.581n ± 1%   1.195n ± 1%  -24.39% (p=0.000 n=20)
Stdlib/IsSymbol/rune=26412-8        9.007n ± 3%   1.216n ± 4%  -86.50% (p=0.000 n=20)
Stdlib/IsSymbol/rune=55296-8        7.319n ± 2%   1.214n ± 1%  -83.41% (p=0.000 n=20)
Stdlib/IsSymbol/rune=128169-8       8.978n ± 2%   1.211n ± 1%  -86.51% (p=0.000 n=20)
Stdlib/IsSymbol/rune=129994-8       8.339n ± 3%   1.211n ± 1%  -85.48% (p=0.000 n=20)
Stdlib/IsSymbol/rune=33554432-8     8.183n ± 2%   1.291n ± 0%  -84.22% (p=0.000 n=20)
Stdlib/IsUpper/rune=-1-8            2.943n ± 1%   1.288n ± 4%  -56.22% (p=0.000 n=20)
Stdlib/IsUpper/rune=65-8            1.754n ± 2%   1.197n ± 6%  -31.78% (p=0.000 n=20)
Stdlib/IsUpper/rune=97-8            1.746n ± 2%   1.194n ± 0%  -31.63% (p=0.000 n=20)
Stdlib/IsUpper/rune=209-8           1.730n ± 6%   1.212n ± 4%  -29.94% (p=0.000 n=20)
Stdlib/IsUpper/rune=26412-8        10.400n ± 4%   1.216n ± 4%  -88.31% (p=0.000 n=20)
Stdlib/IsUpper/rune=55296-8        10.775n ± 2%   1.204n ± 1%  -88.83% (p=0.000 n=20)
Stdlib/IsUpper/rune=125217-8        7.255n ± 3%   1.196n ± 1%  -83.52% (p=0.000 n=20)
Stdlib/IsUpper/rune=128169-8        7.401n ± 4%   1.286n ± 0%  -82.63% (p=0.000 n=20)
Stdlib/IsUpper/rune=33554432-8      7.433n ± 3%   1.290n ± 5%  -82.64% (p=0.000 n=20)
Stdlib/IsControl/rune=-1-8          1.389n ± 1%   1.296n ± 4%   -6.69% (p=0.000 n=20)
Stdlib/IsControl/rune=0-8           1.175n ± 1%   1.224n ± 4%   +4.13% (p=0.000 n=20)
Stdlib/IsControl/rune=97-8          1.190n ± 2%   1.198n ± 2%   +0.63% (p=0.050 n=20)
Stdlib/IsControl/rune=159-8         1.211n ± 4%   1.213n ± 4%        ~ (p=0.365 n=20)
Stdlib/IsControl/rune=209-8         1.175n ± 0%   1.292n ± 1%  +10.00% (p=0.000 n=20)
Stdlib/IsControl/rune=26412-8       1.406n ± 4%   1.291n ± 1%   -8.18% (p=0.000 n=20)
Stdlib/IsControl/rune=55296-8       1.387n ± 6%   1.290n ± 0%   -7.03% (p=0.000 n=20)
Stdlib/IsControl/rune=128169-8      1.421n ± 3%   1.295n ± 4%   -8.90% (p=0.000 n=20)
Stdlib/IsControl/rune=33554432-8    1.458n ± 3%   1.292n ± 4%  -11.36% (p=0.000 n=20)
Stdlib/IsGraphic/rune=-1-8         13.175n ± 4%   1.287n ± 5%  -90.23% (p=0.000 n=20)
Stdlib/IsGraphic/rune=32-8          1.737n ± 3%   1.202n ± 1%  -30.75% (p=0.000 n=20)
Stdlib/IsGraphic/rune=97-8          1.736n ± 2%   1.213n ± 4%  -30.16% (p=0.000 n=20)
Stdlib/IsGraphic/rune=209-8         1.658n ± 1%   1.217n ± 4%  -26.60% (p=0.000 n=20)
Stdlib/IsGraphic/rune=26412-8      10.055n ± 2%   1.198n ± 6%  -88.09% (p=0.000 n=20)
Stdlib/IsGraphic/rune=55296-8      46.660n ± 2%   1.213n ± 4%  -97.40% (p=0.000 n=20)
Stdlib/IsGraphic/rune=128169-8     38.305n ± 1%   1.220n ± 4%  -96.82% (p=0.000 n=20)
Stdlib/IsGraphic/rune=917999-8     18.635n ± 8%   1.196n ± 2%  -93.58% (p=0.000 n=20)
Stdlib/IsGraphic/rune=33554432-8   45.490n ± 5%   1.291n ± 1%  -97.16% (p=0.000 n=20)
Stdlib/IsLetter/rune=-1-8           3.203n ± 3%   1.289n ± 1%  -59.76% (p=0.000 n=20)
Stdlib/IsLetter/rune=65-8           1.632n ± 3%   1.197n ± 1%  -26.65% (p=0.000 n=20)
Stdlib/IsLetter/rune=97-8           1.598n ± 1%   1.201n ± 2%  -24.84% (p=0.000 n=20)
Stdlib/IsLetter/rune=209-8          1.603n ± 2%   1.214n ± 4%  -24.26% (p=0.000 n=20)
Stdlib/IsLetter/rune=26412-8       10.205n ± 9%   1.201n ± 2%  -88.24% (p=0.000 n=20)
Stdlib/IsLetter/rune=55296-8       11.520n ± 6%   1.212n ± 5%  -89.48% (p=0.000 n=20)
Stdlib/IsLetter/rune=128169-8      10.395n ± 3%   1.214n ± 1%  -88.33% (p=0.000 n=20)
Stdlib/IsLetter/rune=205743-8      10.135n ± 2%   1.215n ± 2%  -88.01% (p=0.000 n=20)
Stdlib/IsLetter/rune=33554432-8    10.180n ± 2%   1.288n ± 1%  -87.35% (p=0.000 n=20)
Stdlib/IsMark/rune=-1-8             3.040n ± 2%   1.296n ± 5%  -57.34% (p=0.000 n=20)
Stdlib/IsMark/rune=97-8             4.468n ± 2%   1.295n ± 1%  -71.00% (p=0.000 n=20)
Stdlib/IsMark/rune=209-8            4.430n ± 2%   1.311n ± 3%  -70.41% (p=0.000 n=20)
Stdlib/IsMark/rune=768-8            8.604n ± 2%   1.236n ± 2%  -85.63% (p=0.000 n=20)
Stdlib/IsMark/rune=26412-8          9.505n ± 4%   1.247n ± 2%  -86.89% (p=0.000 n=20)
Stdlib/IsMark/rune=55296-8         10.530n ± 1%   1.218n ± 1%  -88.43% (p=0.000 n=20)
Stdlib/IsMark/rune=128169-8         8.802n ± 3%   1.198n ± 2%  -86.39% (p=0.000 n=20)
Stdlib/IsMark/rune=917999-8         8.084n ± 2%   1.223n ± 2%  -84.87% (p=0.000 n=20)
Stdlib/IsMark/rune=33554432-8       8.223n ± 3%   1.300n ± 1%  -84.19% (p=0.000 n=20)
Stdlib/IsPunct/rune=-1-8            3.086n ± 8%   1.288n ± 1%  -58.25% (p=0.000 n=20)
Stdlib/IsPunct/rune=33-8            1.698n ± 4%   1.213n ± 1%  -28.59% (p=0.000 n=20)
Stdlib/IsPunct/rune=97-8            1.677n ± 5%   1.220n ± 4%  -27.22% (p=0.000 n=20)
Stdlib/IsPunct/rune=209-8           1.665n ± 4%   1.215n ± 2%  -27.05% (p=0.000 n=20)
Stdlib/IsPunct/rune=26412-8         8.579n ± 5%   1.197n ± 2%  -86.05% (p=0.000 n=20)
Stdlib/IsPunct/rune=55296-8         9.496n ± 6%   1.193n ± 2%  -87.44% (p=0.000 n=20)
Stdlib/IsPunct/rune=125279-8        7.013n ± 7%   1.198n ± 1%  -82.91% (p=0.000 n=20)
Stdlib/IsPunct/rune=128169-8        7.309n ± 1%   1.287n ± 5%  -82.39% (p=0.000 n=20)
Stdlib/IsPunct/rune=33554432-8      7.533n ± 3%   1.287n ± 5%  -82.91% (p=0.000 n=20)
Stdlib/IsSpace/rune=-1-8            3.048n ± 3%   1.288n ± 0%  -57.73% (p=0.000 n=20)
Stdlib/IsSpace/rune=9-8             1.754n ± 5%   1.196n ± 1%  -31.82% (p=0.000 n=20)
Stdlib/IsSpace/rune=97-8            1.679n ± 5%   1.208n ± 1%  -28.07% (p=0.000 n=20)
Stdlib/IsSpace/rune=209-8           1.669n ± 1%   1.195n ± 2%  -28.42% (p=0.000 n=20)
Stdlib/IsSpace/rune=12288-8         6.161n ± 2%   1.240n ± 3%  -79.87% (p=0.000 n=20)
Stdlib/IsSpace/rune=26412-8         3.161n ± 3%   1.287n ± 0%  -59.29% (p=0.000 n=20)
Stdlib/IsSpace/rune=55296-8         3.062n ± 3%   1.286n ± 1%  -57.98% (p=0.000 n=20)
Stdlib/IsSpace/rune=128169-8        3.080n ± 2%   1.291n ± 4%  -58.06% (p=0.000 n=20)
Stdlib/IsSpace/rune=33554432-8      3.067n ± 3%   1.288n ± 1%  -58.01% (p=0.000 n=20)
geomean                             4.342n        1.240n       -71.45%

The text was updated successfully, but these errors were encountered:

gabyhelp · 2024-06-19T04:30:06Z

Similar Issues

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

ianlancetaylor · 2024-06-19T05:12:52Z

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

diegommm · 2024-06-19T06:39:31Z

Doc comment

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

The three do the same but with different input types. A possible doc comment for unicode.IsStringFunc(string) func(rune) bool is:

// IsStringFunc returns a function that checks if a rune is found in the provided string. Example:
//
//  openingRune := IsStringFunc("([{<《（")
//  for s != "" {
//      i := strings.IndexFunc(s, openingRune)
//      // if i >= 0 ...
//  }

I initially considered the name IsInStringFunc, which I think may sound clearer, but then dropped the In word to make it more like unicode.IsSpace.

More on motivation using the previous example

Note that it is perfectly fine to use strings.IndexAny(s, "([{<《（"), but then in each iteration strings.IndexAny goes through some strategies and ends up doing the following:

for i, c := range s {
	if IndexRune(chars, c) >= 0 { // IndexRune("([{<《（", c)
		return i
	}
}

Then IndexRune calls IndexByte as long as it works on byte-sized c runes, which is great because it's very fast. But if my text has a lot of non-ASCII text (e.g. if you're parsing chinese, then your c will likely be ASCII very few times) then it will call Index(s, string(r)). And there it gets hairy. For our use case, it will end up calling bytealg.IndexString("([{<《（", c), but if instead of "([{<《（" we have something longer (>31 even on some -old, pre 2013- amd64) then it may still try a few more looping tricks.

Having said all that, it is no longer clear what's going to be the O of strings.IndexAny for our mostly non-ASCII case. It may be some O(x * y * z) in the worst case (I may be wrong in the * z, after all there's some architecture dependent stuff too). And that should be multiplied by the number of iterations in our enclosing for loop.

If we had the chance to cache the strategy of what needs to be searched, then that could be reused. In the shared repo, I'm proposing a solution that provides constant time when checking any rune. So calling strings.IndexFunc(s, openingRune) becomes O( utf8.RuneCountString(s) ).

diegommm · 2024-06-19T07:06:16Z

Another option:

IsInSet([]rune) func(rune) bool
IsInSetString(string) func(rune) bool
IsInSetBytes([]byte) func(rune) bool

(Thinking that we provide a "RuneSet" to match against)
And yet another one:

func IsInSet[S interface{ []rune | string | []byte }](s S) func(rune) bool {
    swtich ss := any(s).(type){
    case []rune:
    // ...
    }
}

ianlancetaylor · 2024-06-19T17:19:52Z

This sounds like something that might be appropriate for a third party library outside of the standard library. https://go.dev/doc/faq#x_in_std

diegommm · 2024-06-20T00:33:41Z

This sounds like something that might be appropriate for a third party library outside of the standard library.

I'm thinking of these functions as a way to compile a set of runes to be later matched, much like a regular expression, but for runes. The performance advantage is huge since this is done everywhere, and given that rune matching is cached and you can reuse that in parsing loops. Sure, regular expressions are an infinitely more complex case, but consider that even in the standard library there are many hot paths in parsers using functions like bytes.IndexAny and bytes.ContainsAny. Imagine if instead of providing regexp.Compile() we would just have regexp.Match(regexp string, test string) (EDIT: this is an exaggeration, just to exemplify the concept).

I found that the bitmask approach even performs better than a func checking with a switch a handful of runes. This is very common. There are also cases where we want to check if it's just any of " \t\n", so we call one of the IndexAny functions. This is also ubiquitous. And it is not always just ASCII text being queried, take as an example html/template/escape.go, where it is parsing HTML and it needs to handle some JavaScript space characters and calls bytes.ContainsAny(s[written:i1], "\n\r\u2028\u2029"). There's also the case of mime/grammar.go:

func isToken(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar) < 0
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

func isTokenChar(r rune) bool {
        // token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
        //             or tspecials>
        return r > 0x20 && r < 0x7f && !isTSpecial(r)
}

func isTSpecial(r rune) bool {
        return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}

Now consider this alternative:

var (
        isTSpecialChars = `()<>@,;:\"/[]?=`
        isTSpecial = whatever.IsInStringSet(isTSpecialChars)
        isTokenChar func(rune) bool
)

func init(){
        runes := make([]rune, 0, 0x7f - 0x21 - len(isTSpecialChars))
        for r := rune(0x21); r < 0x7f; r++ {
                if !isTSpecial(r) {
                        runes = append(runes, r)
                }
        }
        isTokenChar = whatever.IsInRunesSet(runes)
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

Note that all these function checks are done in constant time in the approach I'm proposing in the linked repo, as opposed to O(x * y) (and possibly O(x * y * z)?). In my machine, the version I'm proposing runs in 52% the time of the original, using a basic benchmark with the test strings I found in mime/mediatype_test.go (actually, I duplicated the strings and added an @ at the end to force some negatives too).

Benchmark stuff

goos: linux
goarch: amd64
pkg: test/mime-test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
          │   old.txt    │               new.txt               │
          │    sec/op    │   sec/op     vs base                │
IsToken-8   1108.0n ± 1%   574.9n ± 0%  -48.11% (p=0.000 n=10)

The code

package main

import (
        "strings"
        "testing"
)

// original

func isToken(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar) < 0
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

func isTokenChar(r rune) bool {
        return r > 0x20 && r < 0x7f && !isTSpecial(r)
}

func isTSpecial(r rune) bool {
        return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}

// changed

var (
        isTSpecialChars = `()<>@,;:\"/[]?=`
        isTSpecial2     = IsStringFunc(isTSpecialChars)
        isTokenChar2    func(rune) bool
)

func init() {
        runes := make([]rune, 0, 0x7f-0x21-len(isTSpecialChars))
        for r := rune(0x21); r < 0x7f; r++ {
                if !isTSpecial(r) {
                        runes = append(runes, r)
                }
        }
        isTokenChar2 = IsRunesFunc(runes)
}

func isNotTokenChar2(r rune) bool {
        return !isTokenChar2(r)
}

func isToken2(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar2) < 0
}

// bench stuff

var mediaTypes = []string{
        "noslash noslash @",
        "foo bar/baz foo bar/baz @",
        "foo/bar baz foo/bar baz @",
        "attachment attachment @",
        "attachment attachment @",
        "attachment attachment @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/bar foo/bar @",
        "foo/bar foo/bar @",
        "foo foo @",

        "noslash",
        "foo bar/baz",
        "foo/bar baz",
        "attachment",
        "attachment",
        "attachment",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/bar",
        "foo/bar",
        "foo",
}

func BenchmarkIsToken(b *testing.B) {
        for i := 0; i < b.N; i++ {
                for _, mt := range mediaTypes {
                        isToken(mt)
                }
        }
}

func BenchmarkIsToken2(b *testing.B) {
        for i := 0; i < b.N; i++ {
                for _, mt := range mediaTypes {
                        isToken2(mt)
                }
        }
}

This is also not something uncommon to find in the wild, I think unicode.IsSpace is a great example of what rune checking does, but we need to write parsers everywhere and we need to check for special chars as well.

Examples in standard library where this would help

This is not a final list, there are some other cases as well.

File	Func/Method	Operation	Context
cmd/cgo/ast.go	(*File).ParseGo	strings.ContainsAny(abspath, "\r\n")	2nd level nested loop
cmd/cover/cover.go	annotate	strings.ContainsAny(name, "\r\n")	1st level loop
cmd/cover/cover.go	(*Package).annotateFile	strings.ContainsAny(name, "\r\n")	1st level loop
cmd/go/internal/fsys/fsys.go	hasMeta	strings.ContainsAny(path, magicChars)	magicChars depends on OS, called twice in Glob
cmd/vendor/github.com/google/pprof/internal/symbolizer/symbolizer.go	looksLikeDemangledCPlusPlus and removeMatching	strings.ContainsAny	called in a loop, top func is Demangle
go/types/resolver.go	dir	strings.LastIndexAny(path, `/\`)	Called in a loop in (*Checker).collectObjects, called for each file being type-checked
html/template/escape.go	(*escaper).escapeText and contextAfterText	bytes.IndexAny	Called in a loop at least once, thrice worst case, whenever escaping a text node, and includes funny characters because of JavaScript
html/template/html.go	stripTags	bytes.IndexAny	called in a loop, runs in many places, processes a lot of HTML
mime/mediatype.go	consumeToken	strings.IndexFunc and strings.ContainsRune	called in many loops
mime/grammar.go	isToken	strings.IndexFunc and strings.ContainsRune	called several times in FormatMediaType
log/slog/level.go	(*Level).parse	strings.IndexAny	called in UnmarshalJSON and in MarshalText
runtime/debug/mod.go	quoteKey and quoteValue	strings.ContainsAny	called in loops to parse BuildInfo and to convert it to string
archive/tar/reader.go	(*Reader).readHeader	bytes.IndexFunc	called in a loop to read the next block header of the tar archive each time you want to advance to the next entry in the archive, which means it's called in many other loops when iterating over the entries
encoding/csv/writer.go	(*Writer).Write	strings.IndexAny and strings.ContainsAny	1st and 2nd level nested loop
encoding/pem/pem.go	bytes.ContainsAny	bytes.ContainsAny	2nd level nested loop when decoding each PEM
net/http/cookiejar/jar.go	isIP	strings.ContainsAny	Called in a loop to add each cookie, when setting, and when retrieving a cookie
net/http/cookie.go	sanitizeCookieValue and isCookieNameValid	strings.IndexFunc and strings.ContainsAny	Many loops
net/http/request.go	validMethod	strings.IndexFunc	Called every time you need to validate an HTTP method name
net/mail/message.go	(Address).String, (addrParser).consumeDisplayNameComment, ParseDate	strings.FieldsFunc, strings.IndexAny, and strings.ContainsAny	Called in many loops

diegommm added the Proposal label Jun 19, 2024

gopherbot added this to the Proposal milestone Jun 19, 2024

seankhliao changed the title ~~proposal: improvement of rune-checking funcs~~ proposal: unicode: improvement of rune-checking funcs Jun 19, 2024

This was referenced Jun 22, 2024

proposal: unicode/utf8: improve EncodeRune performance by requiring a fixed 4 bytes width #68129

Closed

unicode/utf8: improve performance of AppendRune and EncodeRune #68131

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: unicode: improvement of rune-checking funcs #68064

proposal: unicode: improvement of rune-checking funcs #68064

diegommm commented Jun 19, 2024 •

edited

Loading

gabyhelp commented Jun 19, 2024

ianlancetaylor commented Jun 19, 2024

diegommm commented Jun 19, 2024 •

edited

Loading

diegommm commented Jun 19, 2024 •

edited

Loading

ianlancetaylor commented Jun 19, 2024

diegommm commented Jun 20, 2024 •

edited

Loading

proposal: unicode: improvement of rune-checking funcs #68064

proposal: unicode: improvement of rune-checking funcs #68064

Comments

diegommm commented Jun 19, 2024 • edited Loading

Proposal Details

gabyhelp commented Jun 19, 2024

ianlancetaylor commented Jun 19, 2024

diegommm commented Jun 19, 2024 • edited Loading

Doc comment

More on motivation using the previous example

diegommm commented Jun 19, 2024 • edited Loading

ianlancetaylor commented Jun 19, 2024

diegommm commented Jun 20, 2024 • edited Loading

diegommm commented Jun 19, 2024 •

edited

Loading

diegommm commented Jun 19, 2024 •

edited

Loading

diegommm commented Jun 19, 2024 •

edited

Loading

diegommm commented Jun 20, 2024 •

edited

Loading