
proposal: unicode/utf8: improve EncodeRune performance by requiring a fixed 4 bytes width #68129

Closed
diegommm opened this issue Jun 22, 2024 · 4 comments

Comments

@diegommm
Contributor

diegommm commented Jun 22, 2024

Proposal Details

Introduction

The function EncodeRune from package unicode/utf8 requires that the provided byte slice have only the exact amount of space needed to write the given rune. This is great for space efficiency, but an alternative approach requiring a fixed 4 spare bytes, regardless of whether all of them are used, enables greater runtime optimization. The following benchstat output shows the results of such an optimization:

goos: linux
goarch: amd64
pkg: unicode/utf8
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                               │  variable   │                fixed                │
                               │   sec/op    │   sec/op     vs base                │
Encode/ASCIIRune-8               1.086n ± 1%   1.075n ± 0%   -1.06% (p=0.000 n=20)
Encode/SpanishRune-8             1.523n ± 2%   1.119n ± 2%  -26.58% (p=0.000 n=20)
Encode/JapaneseRune-8            1.933n ± 2%   1.683n ± 1%  -12.93% (p=0.000 n=20)
Encode/MaxRune-8                 2.175n ± 2%   1.741n ± 0%  -19.97% (p=0.000 n=20)
Encode/InvalidRuneMaxPlusOne-8   1.959n ± 1%   1.302n ± 1%  -33.56% (p=0.000 n=20)
Encode/InvalidRuneSurrogate-8    1.938n ± 2%   1.286n ± 0%  -33.67% (p=0.000 n=20)
Encode/InvalidRuneNegative-8     1.954n ± 1%   1.315n ± 1%  -32.73% (p=0.000 n=20)
geomean                          1.757n        1.340n       -23.74%

The current benchmarks only cover an ASCII case (a valid 1-byte rune) and a Japanese case (a valid 3-byte rune). Here, the benchmarks were extended to cover every rune case that needs to be encoded:

  • InvalidRuneMaxPlusOne is an erroneous rune overflowing MaxRune, likely the product of rune arithmetic. It is written as RuneError, which is a 3-byte rune
  • InvalidRuneSurrogate is an erroneous rune in the surrogate range; same treatment as above
  • InvalidRuneNegative is an erroneous rune with a negative value; same treatment as above
  • SpanishRune is a valid 2-byte rune
  • MaxRune is a valid 4-byte rune

Note that the cases for erroneous runes are added for completeness, and improving any implementation for those cases need not be a priority for the standard library.

Proposal alternatives

Add a new function to implement the proposed behavior

This is obviously not ideal, and may even confuse users. Having two functions do mostly the same thing with slightly different requirements may prove counterproductive. It also increases the API surface only to provide a performance improvement, which should be avoided as much as possible, even though it has happened in the past.

Change EncodeRune to require a fixed length

The doc comment in EncodeRune is:

// EncodeRune writes into p (which must be large enough) the UTF-8 encoding of the rune.
// If the rune is out of range, it writes the encoding of [RuneError].
// It returns the number of bytes written.

Here, "large enough" is never specified to mean "the least amount needed to encode the rune", which is what the current implementation requires, so it is arguable that 4 available bytes is also "large enough", since that suffices to encode any UTF-8 rune. Users shouldn't rely on the current minimum, but some may already do so; users have been known to depend on implementation internals as deep as calling runtime.morestack_noctxt, so anything is possible.

But why would someone aware of utf8.EncodeRune not reserve utf8.UTFMax bytes in the first place? The only reasonable explanation the author can think of is micromanaging the tail of a buffer to save, say, between 1 and 3 bytes. As a corollary, note that if this approach were taken, all buffers written to exclusively with utf8.EncodeRune should have a length that is a multiple of 4 bytes. Potentially wasting 3 bytes in the worst case doesn't sound that terrible, though, compared to the benefit of the improved performance.

In the standard library, for example, all buffers are already (sanely) allocated with utf8.UTFMax spare bytes before calling utf8.EncodeRune, and so should everyone else's be. If the text to be written is always ASCII, there is no reason to call utf8.EncodeRune in the first place; just use copy.
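The standard-library pattern referred to here can be sketched as follows; appendRune is a hypothetical helper written for illustration, not a stdlib function:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// appendRune sketches the common pattern: reserve utf8.UTFMax spare
// bytes, encode in place, then trim to the width actually written.
func appendRune(b []byte, r rune) []byte {
	if uint32(r) < utf8.RuneSelf {
		return append(b, byte(r)) // ASCII fast path: no EncodeRune needed
	}
	n := len(b)
	b = append(b, 0, 0, 0, 0) // grow by UTFMax so any rune fits
	return b[:n+utf8.EncodeRune(b[n:], r)]
}

func main() {
	var b []byte
	for _, r := range []rune{'a', 'Ñ', '本', utf8.MaxRune} {
		b = appendRune(b, r)
	}
	fmt.Println(len(b)) // 1+2+3+4 = 10 bytes
}
```

Since Go 1.18 the standard library exposes essentially this pattern as utf8.AppendRune.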

But varying implementations exist, so if the approach of changing EncodeRune were taken, an extra-cautious rollout could be used: first ship the change behind an environment variable defaulting to the old behavior, then flip the default to the new behavior in a later release, and finally remove the variable so the new behavior is the only one.

Ultimately, if those 1 to 3 bytes at the end of a buffer need to be saved, the following approach can do it:

var runeBytes [utf8.UTFMax]byte
n := utf8.EncodeRune(runeBytes[:], r)
copy(buf, runeBytes[:n])

Appendix: example implementation

Changes to src/unicode/utf8/utf8.go

The name EncodeRune4 is definitely not great; something like EncodeRuneFixed (with Fixed meaning "fixed length") could be used instead, if a distinction is to be made.

// EncodeRune4 writes into p (which must have a length of at least 4) the
// UTF-8 encoding of the rune. If the rune is out of range, it writes the
// encoding of [RuneError]. It returns the number of bytes written.
func EncodeRune4(p []byte, r rune) int {
       _ = p[3] // trade a runtime.panicSliceConvert for a runtime.panicIndex, which seems to be cheaper
       return utf8Bytes((*[4]byte)(p), r)
}

// utf8Bytes writes into p the UTF-8 encoding of the rune. If the rune is out of
// range, it writes the encoding of [RuneError]. It returns the number of bytes
// written.
func utf8Bytes(p *[4]byte, r rune) int {
       switch u := uint32(r); {
       case u <= rune1Max:
               p[0] = byte(r)
               return 1
       case u <= rune2Max:
               p[0] = t2 | byte(u>>6)
               p[1] = tx | byte(u)&maskx
               return 2
       case surrogateMin <= u && u <= surrogateMax: // RuneError: in surrogate range
               p[0] = 239
               p[1] = 191
               p[2] = 189
               return 3
       case u <= rune3Max:
               p[0] = t3 | byte(u>>12)
               p[1] = tx | byte(u>>6)&maskx
               p[2] = tx | byte(u)&maskx
               return 3
       case u <= MaxRune:
               p[0] = t4 | byte(u>>18)
               p[1] = tx | byte(u>>12)&maskx
               p[2] = tx | byte(u>>6)&maskx
               p[3] = tx | byte(u)&maskx
               return 4
       }
       // RuneError: overflowing MaxRune or negative. Delayed to favor the case of valid runes
       p[0] = 239
       p[1] = 191
       p[2] = 189
       return 3
}

Benchmarking

  • Setup:
    mkdir ~/go/src/go.googlesource.com
    git clone https://go.googlesource.com/go ~/go/src/go.googlesource.com/go
    cd ~/go/src/go.googlesource.com/go/src
    ./all.bash
    cd ..
    export GOROOT="$PWD"
    export PATH="$PWD/bin:$PATH"
  • Benchmarks run with: go test -timeout=30m -run=- -count=20 -bench=BenchmarkEncode > results-encode.txt
  • Results processed with: benchstat -col /implem results-encode.txt
Changes made to src/unicode/utf8/utf8_test.go for the benchmarks

diff --git a/src/unicode/utf8/utf8_test.go b/src/unicode/utf8/utf8_test.go
index 2e6aace766..810166ea39 100644
--- a/src/unicode/utf8/utf8_test.go
+++ b/src/unicode/utf8/utf8_test.go
@@ -638,18 +638,108 @@ func init() {
        longStringJapanese = strings.Repeat(japanese, 100_000/len(japanese))
 }
 
-func BenchmarkEncodeASCIIRune(b *testing.B) {
-       buf := make([]byte, UTFMax)
-       for i := 0; i < b.N; i++ {
-               EncodeRune(buf, 'a')
-       }
-}
+func BenchmarkEncode(b *testing.B) {
+       b.Run("implem=variable", func(b *testing.B) {
+               b.Run("ASCIIRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, 'a') // 1 byte
+                       }
+               })
 
-func BenchmarkEncodeJapaneseRune(b *testing.B) {
-       buf := make([]byte, UTFMax)
-       for i := 0; i < b.N; i++ {
-               EncodeRune(buf, '本')
-       }
+               b.Run("SpanishRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, 'Ñ') // 2 bytes
+                       }
+               })
+
+               b.Run("JapaneseRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, '本') // 3 bytes
+                       }
+               })
+
+               b.Run("MaxRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, MaxRune) // 4 bytes
+                       }
+               })
+
+               b.Run("InvalidRuneMaxPlusOne", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, MaxRune+1) // 3 bytes: RuneError
+                       }
+               })
+
+               b.Run("InvalidRuneSurrogate", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, 0xD800) // 3 bytes: RuneError
+                       }
+               })
+
+               b.Run("InvalidRuneNegative", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune(buf, -1) // 3 bytes: RuneError
+                       }
+               })
+       })
+
+       b.Run("implem=fixed", func(b *testing.B) {
+               b.Run("ASCIIRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, 'a') // 1 byte
+                       }
+               })
+
+               b.Run("SpanishRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, 'Ñ') // 2 bytes
+                       }
+               })
+
+               b.Run("JapaneseRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, '本') // 3 bytes
+                       }
+               })
+
+               b.Run("MaxRune", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, MaxRune) // 4 bytes
+                       }
+               })
+
+               b.Run("InvalidRuneMaxPlusOne", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, MaxRune+1) // 3 bytes: RuneError
+                       }
+               })
+
+               b.Run("InvalidRuneSurrogate", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, 0xD800) // 3 bytes: RuneError
+                       }
+               })
+
+               b.Run("InvalidRuneNegative", func(b *testing.B) {
+                       buf := make([]byte, UTFMax)
+                       for i := 0; i < b.N; i++ {
+                               EncodeRune4(buf, -1) // 3 bytes: RuneError
+                       }
+               })
+       })
 }
 
 func BenchmarkAppendASCIIRune(b *testing.B) {

@gopherbot gopherbot added this to the Proposal milestone Jun 22, 2024
@ianlancetaylor
Contributor

There is no function Encode in the unicode/utf8 package. I assume you are talking about EncodeRune.

I'm confident that changing EncodeRune to require a slice of at least length 4 would break existing working code. We shouldn't do that. The performance benefits you measure are big percentages but in actual time they are less than a nanosecond. I'm skeptical that this is measurable for anything other than micro-benchmarks.

Why is your new code faster? Does it just have fewer bounds checks? But the old code should also only have one bounds check.

Note that you don't need to use unsafe to convert, you are permitted to convert from a slice to a pointer to an array.

@seankhliao seankhliao changed the title proposal: unicode/utf8: improve Encode performance by requiring a fixed 4 bytes width proposal: unicode/utf8: improve EncodeRune performance by requiring a fixed 4 bytes width Jun 23, 2024
@diegommm
Contributor Author

diegommm commented Jun 23, 2024

There is no function Encode in the unicode/utf8 package. I assume you are talking about EncodeRune.

Fixed!

Note that you don't need to use unsafe to convert, you are permitted to convert from a slice to a pointer to an array.

Thank you! Fixed as well. Interestingly, it appears that runtime.panicSliceConvert is more expensive than runtime.panicIndex, so it still helps to do _ = p[3] before doing the conversion.
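For reference, the slice-to-array-pointer conversion (available since Go 1.17) and the effect of the up-front index can be sketched like this:

```go
package main

import "fmt"

func main() {
	s := []byte{1, 2, 3, 4, 5}
	_ = s[3]           // fails early via runtime.panicIndex if len(s) < 4
	p := (*[4]byte)(s) // Go 1.17+; would otherwise fail via runtime.panicSliceConvert
	p[0] = 9           // writes through to the slice's backing array
	fmt.Println(s[:4]) // [9 2 3 4]
}
```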

I'm confident that changing EncodeRune to require a slice of at least length 4 would break existing working code. We shouldn't do that. The performance benefits you measure are big percentages but in actual time they are less than a nanosecond. I'm skeptical that this is measurable for anything other than micro-benchmarks.

Yes, it's possible. I'll try to make more tangible tests and see what would be the result outside of micro-benchmarks.

Why is your new code faster? Does it just have fewer bounds checks? But the old code should also only have one bounds check.

Yes, both perform a single bounds check. My intention was that, with the bounds check in an inlined function, the check would be hoisted to the caller, where I hoped the compiler might have already proven the bounds hold, thus removing redundant bounds checks. Adding go:noinline causes EncodeRune4 to perform far worse, but go:noinline also makes the result indistinguishable from what it would be otherwise, so I may be wrong there.

@diegommm
Contributor Author

diegommm commented Jun 24, 2024

I made another synthetic benchmark, this time writing 1024*1024 runes of different widths (so, 1MiB for ASCII). There is no further data processing, just passing through runes to a slice of bytes.

goos: linux
goarch: amd64
pkg: github.com/diegommm/runes
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                                    │   stdlib    │             EncodeRune4             │
                                    │   sec/op    │   sec/op     vs base                │
EncodeRuneText/invalid,_negative-8    2.930m ± 3%   2.662m ± 2%   -9.13% (p=0.000 n=20)
EncodeRuneText/ASCII-8                2.115m ± 2%   1.970m ± 3%   -6.86% (p=0.000 n=20)
EncodeRuneText/2_bytes-8              2.542m ± 3%   2.129m ± 3%  -16.22% (p=0.000 n=20)
EncodeRuneText/invalid,_surrogate-8   3.134m ± 3%   2.230m ± 3%  -28.84% (p=0.000 n=20)
EncodeRuneText/3_bytes-8              3.042m ± 3%   2.686m ± 3%  -11.71% (p=0.000 n=20)
EncodeRuneText/4_bytes-8              3.367m ± 1%   2.990m ± 1%  -11.19% (p=0.000 n=20)
EncodeRuneText/invalid,_too_long-8    3.031m ± 3%   2.603m ± 1%  -14.12% (p=0.000 n=20)
geomean                               2.851m        2.444m       -14.29%

In the case of 4-byte runes, one of the most notable, there is a ~0.3ms improvement over 4MiB of test data (on slower machines it will be greater, but I don't think significantly so). Adding any interesting processing would probably leave this improvement lost in the noise.

To your point, I think it's clear that, while there is a reduction in run time, it becomes negligible compared to any meaningful work performed on such a volume of data, and the risk of breaking anyone's code would not be worth it, not to mention the increased API surface. I will close this issue as a consequence, but feel free to reopen it if you think anything else needs to be investigated; I would be happy to help.

Thank you for your time and patience!

Benchmarking code (for reference)

P.S.: Verified that the bounds check for EncodeRune4 is removed entirely after inlining it.

var runeCases = []struct {
        r    rune
        name string
        n    int // number of bytes of what it should encode to
}{
        {-1, "invalid, negative", 3},
        {10, "ASCII", 1},
        {209, "2 bytes", 2},
        {surrogateMin, "invalid, surrogate", 3},
        {65535, "3 bytes", 3},
        {maxRune, "4 bytes", 4},
        {maxRune + 11, "invalid, too long", 3},
}

func BenchmarkEncodeRuneText(b *testing.B) {
        const runesToWrite = 1024 * 1024

        for _, c := range runeCases {
                buf := make([]byte, runesToWrite*c.n+4)

                b.Run(c.name, func(b *testing.B) {
                        b.Run("implem=stdlib", func(b *testing.B) {
                                for i := 0; i < b.N; i++ {
                                        dst := buf
                                        for j := 0; j < runesToWrite; j++ {
                                                n := utf8.EncodeRune(dst, c.r)
                                                dst = dst[n:]
                                        }
                                }
                        })

                        b.Run("implem=EncodeRune4", func(b *testing.B) {
                                for i := 0; i < b.N; i++ {
                                        dst := buf
                                        for j := 0; j < runesToWrite; j++ {
                                                n := EncodeRune4(dst, c.r)
                                                dst = dst[n:]
                                        }
                                }
                        })
                })
        }
}
