
runtime: general slowdown relative from 1.22 to tip (1.23 candidate) #67822

Open
aclements opened this issue Jun 4, 2024 · 9 comments
Labels: compiler/runtime, NeedsInvestigation, release-blocker
Milestone: Go1.23
Assignee: mknyszek

@aclements
Member

aclements commented Jun 4, 2024

The performance dashboard shows a general, across-the-board regression from the last release (1.22) to current tip. The worst regressions are on the 88-core machine. For example:
[graph from the performance dashboard showing the regression on the 88-core machine]

Unfortunately, the 88-core history doesn't go far enough back to identify the original source. There are a few clear regressions on the 16-core machine, where the history does go back far enough, for example Pi/foo=shopspring/prec=100-16 and HistogramAllocation-16. It's hard to pinpoint an obvious CL in these cases, but they may give us a narrower window to reproduce on the 88-core machine.

cc @mknyszek @golang/runtime

@aclements aclements added the release-blocker and compiler/runtime labels Jun 4, 2024
@mknyszek mknyszek added the NeedsInvestigation label Jun 4, 2024
@mknyszek mknyszek modified the milestones: Backlog, Go1.23 Jun 4, 2024
@mknyszek mknyszek self-assigned this Jun 5, 2024
@JacobOaks

FWIW, we've noticed this at Uber while doing some benchmarking recently. I've posted two possibly related but more specific issues spawned from Zap's benchmarks: #67857 and #67858.

@mknyszek mknyszek added the okay-after-rc1 label Jun 7, 2024
@mknyszek
Contributor

mknyszek commented Jun 7, 2024

Moving this to okay-after-rc1 so we can gather more data from RC1 concurrently with investigating these regressions. (I also don't think the regressions, which are primarily in microbenchmarks, are severe enough to block an RC. But we should still get to the bottom of this and either fix it or make an explicit decision before the release.)

@thanm
Contributor

thanm commented Jun 11, 2024

Hi all,

I spent some time looking at this issue, specifically at one of the benchmarks that has regressed, BenchmarkHistogramAllocation from https://github.com/uber-go/tally.

TLDR: some portion of this slowdown appears to be due to alignment/layout artifacts, since a good chunk of it goes away when I randomize the order of functions in the text section.

Here's what I did as an experiment:

  • prepared two Go repos, one selecting tip of master, and the other selecting the tip of the Go 1.22 release branch.

  • in the Go 1.22 repo, I cherry-picked in CL 562157, the recently added "-randlayout" linker option, which randomly shuffles the order of functions in the text section.

  • I did two sets of runs: one comparing tip vs. 1.22 without any additional compiler or linker flags, and a second set using "-randlayout" for both 1.22 and tip.

This is on a 2-socket 96-core cloudtop VM.

Here's the benchstat output for the base comparison (not showing "B/op" or "allocs/op" since there is no difference there):

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ out.base.txt │             out.tip.txt             │
                       │    sec/op    │   sec/op     vs base                │
HistogramAllocation-96    2.230µ ± 1%   2.318µ ± 1%  +3.95% (p=0.000 n=100)

Here's what I get with randomized layout:

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ rout.base.txt │            rout.tip.txt             │
                       │    sec/op     │   sec/op     vs base                │
HistogramAllocation-96     2.229µ ± 1%   2.261µ ± 1%  +1.44% (p=0.006 n=100)

The full script I am using is below. When I get a chance later today I'll try a couple of the other benchmarks.

#!/bin/bash
# Compare Go tip vs. 1.22 on BenchmarkHistogramAllocation, with and without
# randomized text-section layout (-randlayout). Uses bash since the script
# relies on 'function' and 'local'.
set -x
set -e
#
VERSIONS="/w/go.master:tip /w/go1.22:base"
#
function doruns() {
  local GR=$1
  local TAG=$2
  local DORAND=$3

  SEEDS="101 303 505 707 909 11011 13013 15015 17017 19019"

  SAVEPATH="$PATH"
  export PATH="${GR}/bin:${PATH}"

  BARGS="-run=nothing -bench=BenchmarkHistogramAllocation"

  OTAG="out"
  if [ $DORAND = "yes" ]; then
    OTAG="rout"
  fi

  # Run benchmark to produce output.
  rm -f ${OTAG}.${TAG}.txt
  for S in $SEEDS
  do
    EXTRA=""
    if [ $DORAND = "yes" ]; then
      EXTRA="-ldflags=-randlayout=${S}"
    fi
    go test $EXTRA -count=10 $BARGS >> ${OTAG}.${TAG}.txt
  done

  export PATH="${SAVEPATH}"
}
#
SCENARIOS="no yes"
for DORAND in $SCENARIOS
do            
  for V in $VERSIONS
  do
    GR=`echo $V | cut -f1 -d:`
    TAG=`echo $V | cut -f2 -d:`
    doruns $GR $TAG $DORAND
  done
done
benchstat out.base.txt out.tip.txt > cmp.txt
benchstat rout.base.txt rout.tip.txt > rcmp.txt

@thanm
Contributor

thanm commented Jun 11, 2024

I repeated my experiment with BenchmarkMultiWriteSyncer from go.uber.org/zap/zapcore. Results below.

Without randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ out.base.txt │             out.tip.txt             │
                                            │    sec/op    │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                13.12n ± 2%   14.21n ± 2%  +8.30% (p=0.000 n=100)
MultiWriteSyncer/4_discarder-96                14.31n ± 1%   15.07n ± 1%  +5.35% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96    225.5n ± 2%   232.6n ± 1%  +3.15% (p=0.004 n=100)
geomean                                        34.86n        36.80n       +5.58%

With randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ rout.base.txt │            rout.tip.txt             │
                                            │    sec/op     │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                 13.72n ± 2%   13.86n ± 2%       ~ (p=0.086 n=100)
MultiWriteSyncer/4_discarder-96                 14.61n ± 2%   15.32n ± 1%  +4.89% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96     231.0n ± 2%   230.2n ± 2%       ~ (p=0.793 n=100)
geomean                                         35.91n        36.57n       +1.83%

So there is still a ~5% slowdown for MultiWriteSyncer/4_discarder-96 even after layout is taken into account. I will try to dig into that a bit.

@thanm
Contributor

thanm commented Jun 11, 2024

For MultiWriteSyncer/4_discarder I am really not seeing much that is actionable. I ran "pprof disasm" on the hot functions and, from what I can tell comparing 1.22 and tip, the code hasn't changed in any meaningful way between the releases. Puzzling.

@mknyszek
Contributor

I think I understand the root cause of the BenchmarkHistogramAllocation regression. See my comment at #67857 (comment). In sum: instances of the old trace map have a large enough scannable-globals footprint to affect microbenchmarks that allocate memory rapidly (roughly 256 KiB on 64-bit platforms). Those globals can end up acting as a ballast when the live heap is just high enough to push us over the minimum heap size: at, say, a 2.5 MiB live heap, the 256 KiB that disappeared with the trace map CL I landed is ~10% of the live heap.
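
To make the ballast effect concrete, here is a minimal, self-contained sketch (not the runtime's actual trace map; the ballast and sink names are invented for illustration): a large pointer-containing global counts toward the pacer's heap goal alongside the live heap, so removing ~256 KiB of globals lowers the goal and makes a rapidly allocating microbenchmark trigger GC more often when the live heap is only a few MiB.

package main

import (
    "fmt"
    "runtime"
)

// ballast stands in for the old trace map's globals: ~256 KiB of
// pointer-containing memory that the GC scans as roots and that the pacer
// counts toward the heap goal. Deleting it (as the trace map CL effectively
// did) shrinks the goal and increases GC frequency in the loop below.
var ballast [32 * 1024]*byte // 32768 pointers * 8 bytes = 256 KiB on 64-bit

var sink []*int // global sink keeps the allocations below on the heap

func main() {
    ballast[0] = new(byte) // reference the global so it isn't dead-code eliminated

    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    for i := 0; i < 1_000_000; i++ {
        sink = make([]*int, 64) // rapid, pointerful allocations, like the benchmark
    }
    runtime.ReadMemStats(&after)
    fmt.Printf("GC cycles during loop: %d\n", after.NumGC-before.NumGC)
}

Running this once as-is and once with the ballast variable removed should show the shift in GC frequency described above under the default GOGC=100 setting.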

@mknyszek
Contributor

For Pi/foo=shopspring/prec=100-16, de3a3c9 jumps out as a possible culprit, accounting for around 2pp of the regression. The only thing that CL may have made a little slower is zeroing large pointerful things; perhaps this benchmark allocates large pointerful things frequently? I can take a closer look.
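
One way to probe that hypothesis (a sketch only; the package and benchmark below are ours, not part of the Pi/shopspring benchmark) is a microbenchmark that repeatedly allocates a large pointer-containing slice, which exercises pointer-aware zeroing on the large-object allocation path:

package zerobench

import "testing"

var sink []*int64

// BenchmarkLargePointerfulAlloc repeatedly allocates a 128 KiB slice of
// pointers. Each allocation takes the large-object path and must be zeroed
// with the GC aware of its pointers, which is the cost de3a3c9 might have
// nudged upward.
func BenchmarkLargePointerfulAlloc(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = make([]*int64, 16384) // 16384 * 8 bytes = 128 KiB on 64-bit
    }
}

Comparing this before and after de3a3c9 with benchstat would show whether large pointerful allocation itself regressed, independent of the Pi benchmark.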

@mknyszek
Contributor

mknyszek commented Jun 14, 2024

I spent some time today investigating Pi/foo=shopspring/prec=100-16 at de3a3c9 but nothing is jumping out at me.

After eliminating GC from the equation (GOMEMLIMIT=1GiB GOGC=off; I confirmed no GCs ran during the timed section), the regression still persisted. I then tried 3 random link-time code layouts before and after de3a3c9, and when I aggregated those runs the difference went away. I'm inclined to believe, then, that this is not worth worrying about.
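
For anyone reproducing this, the same GC-elimination setup can also be done in-process via runtime/debug rather than environment variables; the TestMain below is an illustrative sketch, not part of the benchmark's actual code:

package pibench

import (
    "os"
    "runtime/debug"
    "testing"
)

// TestMain disables GC for the whole benchmark run, mirroring
// GOMEMLIMIT=1GiB GOGC=off on the command line.
func TestMain(m *testing.M) {
    debug.SetMemoryLimit(1 << 30) // GOMEMLIMIT=1GiB: hard backstop so "GC off" can't run away
    debug.SetGCPercent(-1)        // GOGC=off
    os.Exit(m.Run())
}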

@gopherbot gopherbot removed the okay-after-rc1 label Jun 21, 2024
@thanm
Contributor

thanm commented Jul 3, 2024

Discussed in the weekly release check-in meeting: the consensus is that we are in OK shape, so we're just leaving this open for now.
