
runtime: general slowdown relative from 1.22 to tip (1.23 candidate) #67822

Open
aclements opened this issue Jun 4, 2024 · 9 comments
Labels: compiler/runtime, NeedsInvestigation, release-blocker
Milestone: Go1.23
Assignee: mknyszek

@aclements
Member

aclements commented Jun 4, 2024

The performance dashboard shows a general, across-the-board regression from the last release (1.22) to current tip. The worst regressions are on the 88-core machine. For example:
[graph from the performance dashboard showing the regression on the 88-core machine]

Unfortunately, the 88-core history doesn't go far enough back to identify the original source. There are a few clear regressions on the 16-core machine, where the history does go back far enough, for example Pi/foo=shopspring/prec=100-16 and HistogramAllocation-16. It's hard to pinpoint an obvious CL in these cases, but they may give us a narrower window to reproduce on the 88-core machine.

cc @mknyszek @golang/runtime

@aclements aclements added the release-blocker and compiler/runtime labels Jun 4, 2024
@mknyszek mknyszek added the NeedsInvestigation label Jun 4, 2024
@mknyszek mknyszek modified the milestones: Backlog, Go1.23 Jun 4, 2024
@mknyszek mknyszek self-assigned this Jun 5, 2024
@JacobOaks

FWIW, we've noticed this at Uber while doing some benchmarking recently. I've posted two possibly related but more specific issues spawned from Zap's benchmarks: #67857 and #67858.

@mknyszek mknyszek added the okay-after-rc1 label Jun 7, 2024
@mknyszek
Contributor

mknyszek commented Jun 7, 2024

Moving this to okay-after-rc1 so we can gather more data from RC1 concurrently with investigating these regressions. (I also don't think the regressions, which are primarily in microbenchmarks, are severe enough to block an RC. But we should still get to the bottom of this and either fix it or make an explicit decision before the release.)

@thanm
Contributor

thanm commented Jun 11, 2024

Hi all,

I spent some time looking at this issue, specifically at one of the benchmarks that has regressed, BenchmarkHistogramAllocation from https://github.com/uber-go/tally.

TLDR: some portion of this slowdown appears to be due to alignment/layout artifacts, since a good chunk of it goes away when I randomize the order of functions in the text section.

Here's what I did as an experiment:

  • prepared two Go repos, one selecting tip of master, and the other selecting the tip of the Go 1.22 release branch.

  • in the Go 1.22 repo, I cherry-picked in CL 562157, the recently added "-randlayout" linker option, which randomly shuffles the order of functions in the text section.

  • I did two sets of runs: one comparing tip vs. 1.22 without any additional compiler or linker flags, and a second set using "-randlayout" for both 1.22 and tip.

This is on a 2-socket 96-core cloudtop VM.

Here's the benchstat output for the base comparison (not showing "B/op" or "allocs/op" since there is no difference there):

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ out.base.txt │             out.tip.txt             │
                       │    sec/op    │   sec/op     vs base                │
HistogramAllocation-96    2.230µ ± 1%   2.318µ ± 1%  +3.95% (p=0.000 n=100)

Here's what I get with randomized layout:

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ rout.base.txt │            rout.tip.txt             │
                       │    sec/op     │   sec/op     vs base                │
HistogramAllocation-96     2.229µ ± 1%   2.261µ ± 1%  +1.44% (p=0.006 n=100)

The full script I am using is below. When I get a chance later today I'll try a couple of the other benchmarks.

#!/bin/bash
# Compare Go tip vs. 1.22 on BenchmarkHistogramAllocation, with and without
# randomized text-section layout (-randlayout). Uses bash since the script
# relies on 'function' and 'local'.
set -x
set -e
#
VERSIONS="/w/go.master:tip /w/go1.22:base"
#
function doruns() {
  local GR=$1
  local TAG=$2
  local DORAND=$3

  SEEDS="101 303 505 707 909 11011 13013 15015 17017 19019"

  SAVEPATH="$PATH"
  export PATH="${GR}/bin:${PATH}"

  BARGS="-run=nothing -bench=BenchmarkHistogramAllocation"

  OTAG="out"
  if [ $DORAND = "yes" ]; then
    OTAG="rout"
  fi

  # Run benchmark to produce output.
  rm -f ${OTAG}.${TAG}.txt
  for S in $SEEDS
  do
    EXTRA=""
    if [ $DORAND = "yes" ]; then
      EXTRA="-ldflags=-randlayout=${S}"
    fi
    go test $EXTRA -count=10 $BARGS >> ${OTAG}.${TAG}.txt
  done

  export PATH="${SAVEPATH}"
}
#
SCENARIOS="no yes"
for DORAND in $SCENARIOS
do            
  for V in $VERSIONS
  do
    GR=`echo $V | cut -f1 -d:`
    TAG=`echo $V | cut -f2 -d:`
    doruns $GR $TAG $DORAND
  done
done
benchstat out.base.txt out.tip.txt > cmp.txt
benchstat rout.base.txt rout.tip.txt > rcmp.txt

@thanm
Contributor

thanm commented Jun 11, 2024

I repeated my experiment with BenchmarkMultiWriteSyncer from go.uber.org/zap/zapcore. Results below.

Without randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ out.base.txt │             out.tip.txt             │
                                            │    sec/op    │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                13.12n ± 2%   14.21n ± 2%  +8.30% (p=0.000 n=100)
MultiWriteSyncer/4_discarder-96                14.31n ± 1%   15.07n ± 1%  +5.35% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96    225.5n ± 2%   232.6n ± 1%  +3.15% (p=0.004 n=100)
geomean                                        34.86n        36.80n       +5.58%

With randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ rout.base.txt │            rout.tip.txt             │
                                            │    sec/op     │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                 13.72n ± 2%   13.86n ± 2%       ~ (p=0.086 n=100)
MultiWriteSyncer/4_discarder-96                 14.61n ± 2%   15.32n ± 1%  +4.89% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96     231.0n ± 2%   230.2n ± 2%       ~ (p=0.793 n=100)
geomean                                         35.91n        36.57n       +1.83%

So there is still a ~5% slowdown for MultiWriteSyncer/4_discarder-96 even after layout is taken into account. I will try to dig into that a bit.

@thanm
Contributor

thanm commented Jun 11, 2024

For MultiWriteSyncer/4_discarder I am really not seeing much that is actionable. I ran "pprof disasm" on the hot functions and, from what I can tell comparing 1.22 and tip, the code hasn't changed in any meaningful way between the releases. Puzzling.

@mknyszek
Contributor

I think I understand the root cause of the BenchmarkHistogramAllocation regression. See my comment at #67857 (comment). In sum: instances of the old trace map have a large enough scannable-globals footprint to affect microbenchmarks that allocate memory rapidly (roughly 256 KiB on 64-bit platforms). Those globals can end up acting as a ballast when the live heap is just high enough to push us over the minimum heap size: at, say, a 2.5 MiB live heap, the 256 KiB that disappeared with the trace map CL I landed is ~10% of the live heap.
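
To make the ballast effect concrete, here is a minimal, self-contained sketch (not the runtime's actual trace map; the ballast and sink names are invented for illustration): a large pointer-containing global counts toward the pacer's heap goal alongside the live heap, so removing ~256 KiB of globals lowers the goal and makes a rapidly allocating microbenchmark trigger GC more often when the live heap is only a few MiB.

package main

import (
    "fmt"
    "runtime"
)

// ballast stands in for the old trace map's globals: ~256 KiB of
// pointer-containing memory that the GC scans as roots and that the pacer
// counts toward the heap goal. Deleting it (as the trace map CL effectively
// did) shrinks the goal and increases GC frequency in the loop below.
var ballast [32 * 1024]*byte // 32768 pointers * 8 bytes = 256 KiB on 64-bit

var sink []*int // global sink keeps the allocations below on the heap

func main() {
    ballast[0] = new(byte) // reference the global so it isn't dead-code eliminated

    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    for i := 0; i < 1_000_000; i++ {
        sink = make([]*int, 64) // rapid, pointerful allocations, like the benchmark
    }
    runtime.ReadMemStats(&after)
    fmt.Printf("GC cycles during loop: %d\n", after.NumGC-before.NumGC)
}

Running this once as-is and once with the ballast variable removed should show the shift in GC frequency described above under the default GOGC=100 setting.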

@mknyszek
Contributor

For Pi/foo=shopspring/prec=100-16, de3a3c9 jumps out as a possible culprit, accounting for around 2pp of the regression. The only thing that CL may have made a little slower is zeroing large pointerful things; perhaps this benchmark allocates large pointerful things frequently? I can take a closer look.
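
One way to probe that hypothesis (a sketch only; the package and benchmark below are ours, not part of the Pi/shopspring benchmark) is a microbenchmark that repeatedly allocates a large pointer-containing slice, which exercises pointer-aware zeroing on the large-object allocation path:

package zerobench

import "testing"

var sink []*int64

// BenchmarkLargePointerfulAlloc repeatedly allocates a 128 KiB slice of
// pointers. Each allocation takes the large-object path and must be zeroed
// with the GC aware of its pointers, which is the cost de3a3c9 might have
// nudged upward.
func BenchmarkLargePointerfulAlloc(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = make([]*int64, 16384) // 16384 * 8 bytes = 128 KiB on 64-bit
    }
}

Comparing this before and after de3a3c9 with benchstat would show whether large pointerful allocation itself regressed, independent of the Pi benchmark.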

@mknyszek
Contributor

mknyszek commented Jun 14, 2024

I spent some time today investigating Pi/foo=shopspring/prec=100-16 at de3a3c9 but nothing is jumping out at me.

After eliminating GC from the equation (GOMEMLIMIT=1GiB GOGC=off; I confirmed no GCs ran during the timed section), the regression still persisted. I then tried 3 random link-time code layouts before and after de3a3c9, and when I aggregated those runs the difference went away. I'm inclined to believe, then, that this is not worth worrying about.
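
For anyone reproducing this, the same GC-elimination setup can also be done in-process via runtime/debug rather than environment variables; the TestMain below is an illustrative sketch, not part of the benchmark's actual code:

package pibench

import (
    "os"
    "runtime/debug"
    "testing"
)

// TestMain disables GC for the whole benchmark run, mirroring
// GOMEMLIMIT=1GiB GOGC=off on the command line.
func TestMain(m *testing.M) {
    debug.SetMemoryLimit(1 << 30) // GOMEMLIMIT=1GiB: hard backstop so "GC off" can't run away
    debug.SetGCPercent(-1)        // GOGC=off
    os.Exit(m.Run())
}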

@gopherbot gopherbot removed the okay-after-rc1 label Jun 21, 2024
@thanm
Contributor

thanm commented Jul 3, 2024

Discussed in the weekly release check-in meeting: the consensus is that we are in OK shape, so we're just leaving this open for now.
