Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: unicode: improvement of rune-checking funcs #68064

Open
diegommm opened this issue Jun 19, 2024 · 6 comments
Open

proposal: unicode: improvement of rune-checking funcs #68064

diegommm opened this issue Jun 19, 2024 · 6 comments
Labels
Milestone

Comments

@diegommm
Copy link
Contributor

diegommm commented Jun 19, 2024

Proposal Details

I was writing a text parser and came across the need to check some runes repeatedly. I also used a lot functions like utf8.ValidRune, unicode.IsSpace, etc. I found some of them were slower compared to some versions of other rune-checker funcs I wrote for the parser, so I setup a repo to make a PoC with my findings. I also found that using unicode.RangeTable is for a very specific use cases, and is not very ergonomic for many mundane use cases. It's also true that I found myself calling Contains and Index family of functions in strings and bytes packages, but many times with the same "needles" (things looked up) for many different "haystacks" (where we looked them up). In some cases, these functions will create an internal, temporary data structure to aid with the lookup, but they are discarded and cannot be reused. It would be great if they could be reused somehow.

(aside: I don't want to bother with the needle/haystack terminology, feel free to point me to more Go-ish alternatives).

The proposal can be divided in two items:

  1. Add a way to create efficient func(rune) bool functions so they can be reused many times. Bikeshedding starter:
    1. unicode.IsBytesFunc([]byte) func(rune) bool
    2. unicode.IsRunesFunc([]rune) func(rune) bool
    3. unicode.IsStringFunc(string) func(rune) bool
  2. Improve the performance of some implementations of these kind of functions in unicode, utf8, strings and bytes packages, especially when there are runes > utf8.RuneSelf. I would like to contribute to the project, so let me know if these are changes that are wanted in the standard library and to what extent should they be implemented (there might be other things to consider that I'm missing to strike the right balance for the project). If so, I can go through the contribution guides and do all that stuff once the "what" is agreed, we can discuss starting with code in the PoC repo (which can have some cleanup for sure).

If both items cannot be attended in this same issue, I would prefer that number 1 is addressed here, as that is more of a proposal, and number 2 sounds more like an improvement that doesn't change API and can probably be addressed in other place (and please, could you kindly point me where to do it).

benchstat results for unicode and utf8 improvements

goos: linux
goarch: amd64
pkg: github.com/diegommm/runes
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                                 │    stdlib    │                local                │
                                 │    sec/op    │   sec/op     vs base                │
Stdlib/IsDigit/rune=-1-8            1.382n ± 2%   1.296n ± 4%   -6.22% (p=0.000 n=20)
Stdlib/IsDigit/rune=48-8            1.643n ± 1%   1.215n ± 4%  -26.07% (p=0.000 n=20)
Stdlib/IsDigit/rune=97-8            1.632n ± 1%   1.201n ± 2%  -26.43% (p=0.000 n=20)
Stdlib/IsDigit/rune=209-8           1.642n ± 2%   1.210n ± 5%  -26.32% (p=0.000 n=20)
Stdlib/IsDigit/rune=26412-8         7.547n ± 2%   1.198n ± 1%  -84.12% (p=0.000 n=20)
Stdlib/IsDigit/rune=55296-8         7.686n ± 1%   1.220n ± 1%  -84.13% (p=0.000 n=20)
Stdlib/IsDigit/rune=128169-8        7.380n ± 1%   1.207n ± 1%  -83.64% (p=0.000 n=20)
Stdlib/IsDigit/rune=130041-8        6.396n ± 3%   1.212n ± 1%  -81.06% (p=0.000 n=20)
Stdlib/IsDigit/rune=33554432-8      6.628n ± 2%   1.289n ± 5%  -80.55% (p=0.000 n=20)
Stdlib/IsLower/rune=-1-8            3.033n ± 2%   1.290n ± 0%  -57.47% (p=0.000 n=20)
Stdlib/IsLower/rune=97-8            1.705n ± 3%   1.216n ± 4%  -28.65% (p=0.000 n=20)
Stdlib/IsLower/rune=209-8           1.712n ± 3%   1.211n ± 2%  -29.28% (p=0.000 n=20)
Stdlib/IsLower/rune=26412-8         8.889n ± 2%   1.195n ± 2%  -86.56% (p=0.000 n=20)
Stdlib/IsLower/rune=55296-8         9.441n ± 3%   1.213n ± 2%  -87.16% (p=0.000 n=20)
Stdlib/IsLower/rune=125251-8        7.268n ± 1%   1.194n ± 2%  -83.57% (p=0.000 n=20)
Stdlib/IsLower/rune=128169-8        7.566n ± 3%   1.317n ± 3%  -82.59% (p=0.000 n=20)
Stdlib/IsLower/rune=33554432-8      7.747n ± 5%   1.287n ± 0%  -83.39% (p=0.000 n=20)
Stdlib/IsLower/rune=0-8             1.813n ± 2%   1.289n ± 5%  -28.90% (p=0.000 n=20)
Stdlib/IsPrint/rune=-1-8           11.520n ± 3%   1.291n ± 0%  -88.79% (p=0.000 n=20)
Stdlib/IsPrint/rune=32-8            1.731n ± 2%   1.199n ± 1%  -30.73% (p=0.000 n=20)
Stdlib/IsPrint/rune=97-8            1.738n ± 1%   1.201n ± 2%  -30.92% (p=0.000 n=20)
Stdlib/IsPrint/rune=209-8           1.707n ± 4%   1.218n ± 4%  -28.63% (p=0.000 n=20)
Stdlib/IsPrint/rune=26412-8        10.620n ± 2%   1.218n ± 2%  -88.53% (p=0.000 n=20)
Stdlib/IsPrint/rune=55296-8        49.325n ± 2%   1.197n ± 2%  -97.57% (p=0.000 n=20)
Stdlib/IsPrint/rune=128169-8       39.490n ± 1%   1.197n ± 2%  -96.97% (p=0.000 n=20)
Stdlib/IsPrint/rune=917999-8       18.710n ± 3%   1.214n ± 2%  -93.51% (p=0.000 n=20)
Stdlib/IsPrint/rune=33554432-8     39.165n ± 6%   1.286n ± 5%  -96.72% (p=0.000 n=20)
Stdlib/IsTitle/rune=-1-8            1.546n ± 4%   1.295n ± 4%  -16.18% (p=0.000 n=20)
Stdlib/IsTitle/rune=97-8            1.571n ± 4%   1.295n ± 4%  -17.57% (p=0.000 n=20)
Stdlib/IsTitle/rune=209-8           1.582n ± 6%   1.286n ± 0%  -18.68% (p=0.000 n=20)
Stdlib/IsTitle/rune=453-8           5.665n ± 2%   1.213n ± 4%  -78.59% (p=0.000 n=20)
Stdlib/IsTitle/rune=8188-8          7.506n ± 6%   1.204n ± 1%  -83.96% (p=0.000 n=20)
Stdlib/IsTitle/rune=26412-8         3.200n ± 5%   1.292n ± 4%  -59.61% (p=0.000 n=20)
Stdlib/IsTitle/rune=55296-8         3.206n ± 4%   1.287n ± 0%  -59.86% (p=0.000 n=20)
Stdlib/IsTitle/rune=128169-8        3.194n ± 3%   1.287n ± 5%  -59.68% (p=0.000 n=20)
Stdlib/IsTitle/rune=33554432-8      3.005n ± 2%   1.287n ± 0%  -57.15% (p=0.000 n=20)
Stdlib/IsNumber/rune=-1-8           3.024n ± 2%   1.327n ± 3%  -56.11% (p=0.000 n=20)
Stdlib/IsNumber/rune=48-8           1.602n ± 2%   1.214n ± 2%  -24.19% (p=0.000 n=20)
Stdlib/IsNumber/rune=97-8           1.605n ± 1%   1.221n ± 5%  -23.93% (p=0.000 n=20)
Stdlib/IsNumber/rune=209-8          1.584n ± 1%   1.202n ± 2%  -24.15% (p=0.000 n=20)
Stdlib/IsNumber/rune=26412-8        8.251n ± 7%   1.202n ± 2%  -85.43% (p=0.000 n=20)
Stdlib/IsNumber/rune=55296-8        8.683n ± 3%   1.209n ± 1%  -86.08% (p=0.000 n=20)
Stdlib/IsNumber/rune=128169-8       8.423n ± 3%   1.197n ± 3%  -85.79% (p=0.000 n=20)
Stdlib/IsNumber/rune=130041-8       8.373n ± 3%   1.200n ± 1%  -85.67% (p=0.000 n=20)
Stdlib/IsNumber/rune=33554432-8     8.539n ± 6%   1.287n ± 0%  -84.92% (p=0.000 n=20)
Stdlib/IsSymbol/rune=-1-8           3.163n ± 5%   1.321n ± 2%  -58.26% (p=0.000 n=20)
Stdlib/IsSymbol/rune=36-8           1.683n ± 3%   1.214n ± 4%  -27.92% (p=0.000 n=20)
Stdlib/IsSymbol/rune=97-8           1.601n ± 1%   1.211n ± 5%  -24.36% (p=0.000 n=20)
Stdlib/IsSymbol/rune=209-8          1.581n ± 1%   1.195n ± 1%  -24.39% (p=0.000 n=20)
Stdlib/IsSymbol/rune=26412-8        9.007n ± 3%   1.216n ± 4%  -86.50% (p=0.000 n=20)
Stdlib/IsSymbol/rune=55296-8        7.319n ± 2%   1.214n ± 1%  -83.41% (p=0.000 n=20)
Stdlib/IsSymbol/rune=128169-8       8.978n ± 2%   1.211n ± 1%  -86.51% (p=0.000 n=20)
Stdlib/IsSymbol/rune=129994-8       8.339n ± 3%   1.211n ± 1%  -85.48% (p=0.000 n=20)
Stdlib/IsSymbol/rune=33554432-8     8.183n ± 2%   1.291n ± 0%  -84.22% (p=0.000 n=20)
Stdlib/IsUpper/rune=-1-8            2.943n ± 1%   1.288n ± 4%  -56.22% (p=0.000 n=20)
Stdlib/IsUpper/rune=65-8            1.754n ± 2%   1.197n ± 6%  -31.78% (p=0.000 n=20)
Stdlib/IsUpper/rune=97-8            1.746n ± 2%   1.194n ± 0%  -31.63% (p=0.000 n=20)
Stdlib/IsUpper/rune=209-8           1.730n ± 6%   1.212n ± 4%  -29.94% (p=0.000 n=20)
Stdlib/IsUpper/rune=26412-8        10.400n ± 4%   1.216n ± 4%  -88.31% (p=0.000 n=20)
Stdlib/IsUpper/rune=55296-8        10.775n ± 2%   1.204n ± 1%  -88.83% (p=0.000 n=20)
Stdlib/IsUpper/rune=125217-8        7.255n ± 3%   1.196n ± 1%  -83.52% (p=0.000 n=20)
Stdlib/IsUpper/rune=128169-8        7.401n ± 4%   1.286n ± 0%  -82.63% (p=0.000 n=20)
Stdlib/IsUpper/rune=33554432-8      7.433n ± 3%   1.290n ± 5%  -82.64% (p=0.000 n=20)
Stdlib/IsControl/rune=-1-8          1.389n ± 1%   1.296n ± 4%   -6.69% (p=0.000 n=20)
Stdlib/IsControl/rune=0-8           1.175n ± 1%   1.224n ± 4%   +4.13% (p=0.000 n=20)
Stdlib/IsControl/rune=97-8          1.190n ± 2%   1.198n ± 2%   +0.63% (p=0.050 n=20)
Stdlib/IsControl/rune=159-8         1.211n ± 4%   1.213n ± 4%        ~ (p=0.365 n=20)
Stdlib/IsControl/rune=209-8         1.175n ± 0%   1.292n ± 1%  +10.00% (p=0.000 n=20)
Stdlib/IsControl/rune=26412-8       1.406n ± 4%   1.291n ± 1%   -8.18% (p=0.000 n=20)
Stdlib/IsControl/rune=55296-8       1.387n ± 6%   1.290n ± 0%   -7.03% (p=0.000 n=20)
Stdlib/IsControl/rune=128169-8      1.421n ± 3%   1.295n ± 4%   -8.90% (p=0.000 n=20)
Stdlib/IsControl/rune=33554432-8    1.458n ± 3%   1.292n ± 4%  -11.36% (p=0.000 n=20)
Stdlib/IsGraphic/rune=-1-8         13.175n ± 4%   1.287n ± 5%  -90.23% (p=0.000 n=20)
Stdlib/IsGraphic/rune=32-8          1.737n ± 3%   1.202n ± 1%  -30.75% (p=0.000 n=20)
Stdlib/IsGraphic/rune=97-8          1.736n ± 2%   1.213n ± 4%  -30.16% (p=0.000 n=20)
Stdlib/IsGraphic/rune=209-8         1.658n ± 1%   1.217n ± 4%  -26.60% (p=0.000 n=20)
Stdlib/IsGraphic/rune=26412-8      10.055n ± 2%   1.198n ± 6%  -88.09% (p=0.000 n=20)
Stdlib/IsGraphic/rune=55296-8      46.660n ± 2%   1.213n ± 4%  -97.40% (p=0.000 n=20)
Stdlib/IsGraphic/rune=128169-8     38.305n ± 1%   1.220n ± 4%  -96.82% (p=0.000 n=20)
Stdlib/IsGraphic/rune=917999-8     18.635n ± 8%   1.196n ± 2%  -93.58% (p=0.000 n=20)
Stdlib/IsGraphic/rune=33554432-8   45.490n ± 5%   1.291n ± 1%  -97.16% (p=0.000 n=20)
Stdlib/IsLetter/rune=-1-8           3.203n ± 3%   1.289n ± 1%  -59.76% (p=0.000 n=20)
Stdlib/IsLetter/rune=65-8           1.632n ± 3%   1.197n ± 1%  -26.65% (p=0.000 n=20)
Stdlib/IsLetter/rune=97-8           1.598n ± 1%   1.201n ± 2%  -24.84% (p=0.000 n=20)
Stdlib/IsLetter/rune=209-8          1.603n ± 2%   1.214n ± 4%  -24.26% (p=0.000 n=20)
Stdlib/IsLetter/rune=26412-8       10.205n ± 9%   1.201n ± 2%  -88.24% (p=0.000 n=20)
Stdlib/IsLetter/rune=55296-8       11.520n ± 6%   1.212n ± 5%  -89.48% (p=0.000 n=20)
Stdlib/IsLetter/rune=128169-8      10.395n ± 3%   1.214n ± 1%  -88.33% (p=0.000 n=20)
Stdlib/IsLetter/rune=205743-8      10.135n ± 2%   1.215n ± 2%  -88.01% (p=0.000 n=20)
Stdlib/IsLetter/rune=33554432-8    10.180n ± 2%   1.288n ± 1%  -87.35% (p=0.000 n=20)
Stdlib/IsMark/rune=-1-8             3.040n ± 2%   1.296n ± 5%  -57.34% (p=0.000 n=20)
Stdlib/IsMark/rune=97-8             4.468n ± 2%   1.295n ± 1%  -71.00% (p=0.000 n=20)
Stdlib/IsMark/rune=209-8            4.430n ± 2%   1.311n ± 3%  -70.41% (p=0.000 n=20)
Stdlib/IsMark/rune=768-8            8.604n ± 2%   1.236n ± 2%  -85.63% (p=0.000 n=20)
Stdlib/IsMark/rune=26412-8          9.505n ± 4%   1.247n ± 2%  -86.89% (p=0.000 n=20)
Stdlib/IsMark/rune=55296-8         10.530n ± 1%   1.218n ± 1%  -88.43% (p=0.000 n=20)
Stdlib/IsMark/rune=128169-8         8.802n ± 3%   1.198n ± 2%  -86.39% (p=0.000 n=20)
Stdlib/IsMark/rune=917999-8         8.084n ± 2%   1.223n ± 2%  -84.87% (p=0.000 n=20)
Stdlib/IsMark/rune=33554432-8       8.223n ± 3%   1.300n ± 1%  -84.19% (p=0.000 n=20)
Stdlib/IsPunct/rune=-1-8            3.086n ± 8%   1.288n ± 1%  -58.25% (p=0.000 n=20)
Stdlib/IsPunct/rune=33-8            1.698n ± 4%   1.213n ± 1%  -28.59% (p=0.000 n=20)
Stdlib/IsPunct/rune=97-8            1.677n ± 5%   1.220n ± 4%  -27.22% (p=0.000 n=20)
Stdlib/IsPunct/rune=209-8           1.665n ± 4%   1.215n ± 2%  -27.05% (p=0.000 n=20)
Stdlib/IsPunct/rune=26412-8         8.579n ± 5%   1.197n ± 2%  -86.05% (p=0.000 n=20)
Stdlib/IsPunct/rune=55296-8         9.496n ± 6%   1.193n ± 2%  -87.44% (p=0.000 n=20)
Stdlib/IsPunct/rune=125279-8        7.013n ± 7%   1.198n ± 1%  -82.91% (p=0.000 n=20)
Stdlib/IsPunct/rune=128169-8        7.309n ± 1%   1.287n ± 5%  -82.39% (p=0.000 n=20)
Stdlib/IsPunct/rune=33554432-8      7.533n ± 3%   1.287n ± 5%  -82.91% (p=0.000 n=20)
Stdlib/IsSpace/rune=-1-8            3.048n ± 3%   1.288n ± 0%  -57.73% (p=0.000 n=20)
Stdlib/IsSpace/rune=9-8             1.754n ± 5%   1.196n ± 1%  -31.82% (p=0.000 n=20)
Stdlib/IsSpace/rune=97-8            1.679n ± 5%   1.208n ± 1%  -28.07% (p=0.000 n=20)
Stdlib/IsSpace/rune=209-8           1.669n ± 1%   1.195n ± 2%  -28.42% (p=0.000 n=20)
Stdlib/IsSpace/rune=12288-8         6.161n ± 2%   1.240n ± 3%  -79.87% (p=0.000 n=20)
Stdlib/IsSpace/rune=26412-8         3.161n ± 3%   1.287n ± 0%  -59.29% (p=0.000 n=20)
Stdlib/IsSpace/rune=55296-8         3.062n ± 3%   1.286n ± 1%  -57.98% (p=0.000 n=20)
Stdlib/IsSpace/rune=128169-8        3.080n ± 2%   1.291n ± 4%  -58.06% (p=0.000 n=20)
Stdlib/IsSpace/rune=33554432-8      3.067n ± 3%   1.288n ± 1%  -58.01% (p=0.000 n=20)
geomean                             4.342n        1.240n       -71.45%

@gopherbot gopherbot added this to the Proposal milestone Jun 19, 2024
@gabyhelp
Copy link

@ianlancetaylor
Copy link
Contributor

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

@diegommm
Copy link
Contributor Author

diegommm commented Jun 19, 2024

Doc comment

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

The three do the same but with different input types. A possible doc comment for unicode.IsStringFunc(string) func(rune) bool is:

// IsStringFunc returns a function that checks if a rune is found in the provided string. Example:
//
//  openingRune := IsStringFunc("([{<《(")
//  for s != "" {
//      i := strings.IndexFunc(s, openingRune)
//      // if i >= 0 ...
//  }

I initially considered the name IsInStringFunc, which I think may sound clearer, but then dropped the In word to make it more like unicode.IsSpace.

More on motivation using the previous example

Note that it is perfectly fine to use strings.IndexAny(s, "([{<《("), but then in each iteration strings.IndexAny goes through some strategies and ends up doing the following:

for i, c := range s {
	if IndexRune(chars, c) >= 0 { // IndexRune("([{<《(", c)
		return i
	}
}

Then IndexRune calls IndexByte as long as it works on byte-sized c runes, which is great because it's very fast. But if my text has a lot of non-ASCII text (e.g. if you're parsing chinese, then your c will likely be ASCII very few times) then it will call Index(s, string(r)). And there it gets hairy. For our use case, it will end up calling bytealg.IndexString("([{<《(", c), but if instead of "([{<《(" we have something longer (>31 even on some -old, pre 2013- amd64) then it may still try a few more looping tricks.

Having said all that, it is no longer clear what's going to be the O of strings.IndexAny for our mostly non-ASCII case. It may be some O(x * y * z) in the worst case (I may be wrong in the * z, after all there's some architecture dependent stuff too). And that should be multiplied by the number of iterations in our enclosing for loop.

If we had the chance to cache the strategy of what needs to be searched, then that could be reused. In the shared repo, I'm proposing a solution that provides constant time when checking any rune. So calling strings.IndexFunc(s, openingRune) becomes O( utf8.RuneCountString(s) ).

@diegommm
Copy link
Contributor Author

diegommm commented Jun 19, 2024

Another option:

  • IsInSet([]rune) func(rune) bool
  • IsInSetString(string) func(rune) bool
  • IsInSetBytes([]byte) func(rune) bool

(Thinking that we provide a "RuneSet" to match against)
And yet another one:

func IsInSet[S interface{ []rune | string | []byte }](s S) func(rune) bool {
    swtich ss := any(s).(type){
    case []rune:
    // ...
    }
}

@seankhliao seankhliao changed the title proposal: improvement of rune-checking funcs proposal: unicode: improvement of rune-checking funcs Jun 19, 2024
@ianlancetaylor
Copy link
Contributor

This sounds like something that might be appropriate for a third party library outside of the standard library. https://go.dev/doc/faq#x_in_std

@diegommm
Copy link
Contributor Author

diegommm commented Jun 20, 2024

This sounds like something that might be appropriate for a third party library outside of the standard library.

I'm thinking of these functions as a way to compile a set of runes to be later matched, much like a regular expression, but for runes. The performance advantage is huge since this is done everywhere, and given that rune matching is cached and you can reuse that in parsing loops. Sure, regular expressions are an infinitely more complex case, but consider that even in the standard library there are many hot paths in parsers using functions like bytes.IndexAny and bytes.ContainsAny. Imagine if instead of providing regexp.Compile() we would just have regexp.Match(regexp string, test string) (EDIT: this is an exaggeration, just to exemplify the concept).

I found that the bitmask approach even performs better than a func checking with a switch a handful of runes. This is very common. There are also cases where we want to check if it's just any of " \t\n", so we call one of the IndexAny functions. This is also ubiquitous. And it is not always just ASCII text being queried, take as an example html/template/escape.go, where it is parsing HTML and it needs to handle some JavaScript space characters and calls bytes.ContainsAny(s[written:i1], "\n\r\u2028\u2029"). There's also the case of mime/grammar.go:

func isToken(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar) < 0
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

func isTokenChar(r rune) bool {
        // token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
        //             or tspecials>
        return r > 0x20 && r < 0x7f && !isTSpecial(r)
}

func isTSpecial(r rune) bool {
        return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}

Now consider this alternative:

var (
        isTSpecialChars = `()<>@,;:\"/[]?=`
        isTSpecial = whatever.IsInStringSet(isTSpecialChars)
        isTokenChar func(rune) bool
)

func init(){
        runes := make([]rune, 0, 0x7f - 0x21 - len(isTSpecialChars))
        for r := rune(0x21); r < 0x7f; r++ {
                if !isTSpecial(r) {
                        runes = append(runes, r)
                }
        }
        isTokenChar = whatever.IsInRunesSet(runes)
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

Note that all these function checks are done in constant time in the approach I'm proposing in the linked repo, as opposed to O(x * y) (and possibly O(x * y * z)?). In my machine, the version I'm proposing runs in 52% the time of the original, using a basic benchmark with the test strings I found in mime/mediatype_test.go (actually, I duplicated the strings and added an @ at the end to force some negatives too).

Benchmark stuff

goos: linux
goarch: amd64
pkg: test/mime-test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
          │   old.txt    │               new.txt               │
          │    sec/op    │   sec/op     vs base                │
IsToken-8   1108.0n ± 1%   574.9n ± 0%  -48.11% (p=0.000 n=10)

The code

package main

import (
        "strings"
        "testing"
)

// original

func isToken(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar) < 0
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

func isTokenChar(r rune) bool {
        return r > 0x20 && r < 0x7f && !isTSpecial(r)
}

func isTSpecial(r rune) bool {
        return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}

// changed

var (
        isTSpecialChars = `()<>@,;:\"/[]?=`
        isTSpecial2     = IsStringFunc(isTSpecialChars)
        isTokenChar2    func(rune) bool
)

func init() {
        runes := make([]rune, 0, 0x7f-0x21-len(isTSpecialChars))
        for r := rune(0x21); r < 0x7f; r++ {
                if !isTSpecial(r) {
                        runes = append(runes, r)
                }
        }
        isTokenChar2 = IsRunesFunc(runes)
}

func isNotTokenChar2(r rune) bool {
        return !isTokenChar2(r)
}

func isToken2(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar2) < 0
}

// bench stuff

var mediaTypes = []string{
        "noslash noslash @",
        "foo bar/baz foo bar/baz @",
        "foo/bar baz foo/bar baz @",
        "attachment attachment @",
        "attachment attachment @",
        "attachment attachment @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/BAR foo/BAR @",
        "foo/bar foo/bar @",
        "foo/bar foo/bar @",
        "foo foo @",

        "noslash",
        "foo bar/baz",
        "foo/bar baz",
        "attachment",
        "attachment",
        "attachment",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/BAR",
        "foo/bar",
        "foo/bar",
        "foo",
}

func BenchmarkIsToken(b *testing.B) {
        for i := 0; i < b.N; i++ {
                for _, mt := range mediaTypes {
                        isToken(mt)
                }
        }
}

func BenchmarkIsToken2(b *testing.B) {
        for i := 0; i < b.N; i++ {
                for _, mt := range mediaTypes {
                        isToken2(mt)
                }
        }
}

This is also not something uncommon to find in the wild, I think unicode.IsSpace is a great example of what rune checking does, but we need to write parsers everywhere and we need to check for special chars as well.

Examples in standard library where this would help

This is not a final list, there are some other cases as well.

File Func/Method Operation Context
cmd/cgo/ast.go (*File).ParseGo strings.ContainsAny(abspath, "\r\n") 2nd level nested loop
cmd/cover/cover.go annotate strings.ContainsAny(name, "\r\n") 1st level loop
cmd/cover/cover.go (*Package).annotateFile strings.ContainsAny(name, "\r\n") 1st level loop
cmd/go/internal/fsys/fsys.go hasMeta strings.ContainsAny(path, magicChars) magicChars depends on OS, called twice in Glob
cmd/vendor/github.com/google/pprof/internal/symbolizer/symbolizer.go looksLikeDemangledCPlusPlus and removeMatching strings.ContainsAny called in a loop, top func is Demangle
go/types/resolver.go dir strings.LastIndexAny(path, /\) Called in a loop in (*Checker).collectObjects, called for each file being type-checked
html/template/escape.go (*escaper).escapeText and contextAfterText bytes.IndexAny Called in a loop at least once, thrice worst case, whenever escaping a text node, and includes funny characters because of JavaScript
html/template/html.go stripTags bytes.IndexAny called in a loop, runs in many places, processes a lot of HTML
mime/mediatype.go consumeToken strings.IndexFunc and strings.ContainsRune called in many loops
mime/grammar.go isToken strings.IndexFunc and strings.ContainsRune called several times in FormatMediaType
log/slog/level.go (*Level).parse strings.IndexAny called in UnmarshalJSON and in MarshalText
runtime/debug/mod.go quoteKey and quoteValue strings.ContainsAny called in loops to parse BuildInfo and to convert it to string
archive/tar/reader.go (*Reader).readHeader bytes.IndexFunc called in a loop to read the next block header of the tar archive each time you want to advance to the next entry in the archive, which means it's called in many other loops when iterating over the entries
encoding/csv/writer.go (*Writer).Write strings.IndexAny and strings.ContainsAny 1st and 2nd level nested loop
encoding/pem/pem.go bytes.ContainsAny bytes.ContainsAny 2nd level nested loop when decoding each PEM
net/http/cookiejar/jar.go isIP strings.ContainsAny Called in a loop to add each cookie, when setting, and when retrieving a cookie
net/http/cookie.go sanitizeCookieValue and isCookieNameValid strings.IndexFunc and strings.ContainsAny Many loops
net/http/request.go validMethod strings.IndexFunc Called every time you need to validate an HTTP method name
net/mail/message.go (*Address).String, (*addrParser).consumeDisplayNameComment, ParseDate strings.FieldsFunc, strings.IndexAny, and strings.ContainsAny Called in many loops

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants