Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/tls: races detected in BenchmarkCertCache, BenchmarkHandshakeServer, BenchmarkLatency and BenchmarkThroughput #67979

Closed
dmitshur opened this issue Jun 13, 2024 · 10 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented Jun 13, 2024

The macOS race builder is reporting data races found by briefly running benchmarks. For example:

 ==================
 WARNING: DATA RACE
 Write at 0x00c000152a49 by goroutine 40:
   testing.(*common).Fail()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:955 +0x104
   testing.(*common).Fail()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:947 +0x79
   testing.(*common).Errorf()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:1076 +0xae
   crypto/tls.benchmarkHandshakeServer.func2()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1346 +0x3ee
 Previous read at 0x00c000152a49 by goroutine 21:
   testing.(*B).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:686 +0xa8d
   crypto/tls.BenchmarkHandshakeServer()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1371 +0x64
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.(*B).run1.func1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:215 +0x7b
 Goroutine 40 (running) created at:
   crypto/tls.benchmarkHandshakeServer()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1336 +0xa36
   crypto/tls.BenchmarkHandshakeServer.func2.2()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1377 +0x9a
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.(*B).launch()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:316 +0x385
   testing.(*B).doBench.gowrap1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:266 +0x33
 Goroutine 21 (running) created at:
   testing.(*B).run1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:208 +0x194
   testing.(*B).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:682 +0x9d4
   testing.runBenchmarks.func1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:543 +0x85
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.runBenchmarks()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:552 +0xbb6
   testing.(*M).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:2050 +0x152c
   crypto/tls.runMain()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_test.go:400 +0xde4
   crypto/tls.TestMain()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_test.go:349 +0x10e
   main.main()
       _testmain.go:423 +0x171
 ==================
     handshake_server_test.go:1349: #4: mismatch on read: got:00000000000000000000000000000000000000000000000000000000000000000000000000000000000000 want:1403030001011603030020d6984a2d5e5b03e16c2ff4634b31c8975d4e142adba6a63188219c02e4e81183
     testing.go:1399: race detected during execution of test

From https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8745300644568482689/+/u/step/11/log/2#L333_13. It's happening quite frequently:

image

Since GOOS=darwin is a first class port, we can't release Go 1.23 without addressing this.

CC @golang/security.

@dmitshur dmitshur added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker okay-after-rc1 Used by release team to mark a release-blocker issue as okay to resolve either before or after rc1 labels Jun 13, 2024
@dmitshur dmitshur added this to the Go1.23 milestone Jun 13, 2024
@FiloSottile
Copy link
Contributor

To a first investigation, I think this is a crypto/tls benchmark failure flake that however is tickling a bug in testing's sub-sub-benchmarks package causing the data race. (Note how e.g. BenchmarkHandshakeServer has two levels of nesting.)

testing.B.Errorf is documented to be safe for concurrent use

A benchmark ends when its Benchmark function returns or calls any of the methods FailNow, Fatal, Fatalf, SkipNow, Skip, or Skipf. Those methods must be called only from the goroutine running the Benchmark function. The other reporting methods, such as the variations of Log and Error, may be called simultaneously from multiple goroutines.

and indeed the write to b.parent.failed in the race is done by b.parent.Fail ((*common).Fail()) under b.parent.mu. (Here b is the sub-sub-benchmark and b.parent is a sub-benchmark itself.)

However, the read in the race is done by the top-level benchmark's b.parent.parent.Run ((*B).Run()) on sub.failed (aka b.parent.failed) without holding b.parent.mu.

Not sure if the two levels of nesting are necessary to trigger this, to be honest, or if the correct behavior would be for Errorf to panic with Fail in goroutine after $NAME has completed instead, but the data race is not right.

@gopherbot gopherbot removed the okay-after-rc1 Used by release team to mark a release-blocker issue as okay to resolve either before or after rc1 label Jun 21, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/594255 mentions this issue: crypto/tls: replay test recordings without network

gopherbot pushed a commit that referenced this issue Jun 24, 2024
There is no reason to go across a pipe when replaying a conn recording.
This avoids the complexity of using localPipe and goroutines, and makes
handshake benchmarks more accurate, as we don't measure network
overhead.

Also note how it removes the need for -fast: operating locally we know
when the flow is over and can error out immediately, without waiting for
a read from the feeder on the other side of the pipe to timeout.

Avoids some noise in #67979, but doesn't fix the two root causes:
localPipe flakes and testing.B races.

Updates #67979

Change-Id: I153d3fa5a24847f3947823e8c3a7bc639f89bc1d
Reviewed-on: https://go-review.googlesource.com/c/go/+/594255
Auto-Submit: Filippo Valsorda <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Roland Shoemaker <[email protected]>
Reviewed-by: Joedian Reid <[email protected]>
@joedian
Copy link

joedian commented Jun 26, 2024

@rolandshoemaker is any more updates planned for this?

@rolandshoemaker
Copy link
Member

I believe @FiloSottile is continuing to work on this.

@FiloSottile
Copy link
Contributor

I am working on the crypto/tls side, but #67979 (comment) is a little out of my depth in terms of the testing package. I'll open a separate issue for it.

@thanm
Copy link
Contributor

thanm commented Jul 3, 2024

Any update on this issue? Discussed during the weekly release checkin meeting. Is this a 1.23 regression, or a more longstanding problem?

@aclements
Copy link
Member

testing.B.Errorf is supposed to be safe for concurrent use, but it's not safe to call after the benchmark has ended, which I believe is what's happening here.

The race is between a read here, when the subbenchmark ends, and a write here where Errorf is called from a separate goroutine. If I'm reading the test right, nothing is waiting for the goroutine created by benchmarkHandshakeServer to exit before allowing the benchmark to finish. Hence, the benchmark can finish and then the goroutine can call Errorf on the exited benchmark. I think this just needs a WaitGroup wait after the close(feeder).

@aclements
Copy link
Member

aclements commented Jul 3, 2024

I see as of CL 594255 the feeder goroutine is no longer there. And the Darwin builder looks happy since that landed. Thus, I think this is fixed and we can close the issue. Is there anything I'm missing?

@FiloSottile
Copy link
Contributor

testing.B.Errorf is supposed to be safe for concurrent use, but it's not safe to call after the benchmark has ended, [...]

Fair enough. I feel like a Fail in goroutine after $NAME has completed panic is still friendlier than a data race, but maybe it's not worth the complexity.

I see as of CL 594255 the feeder goroutine is no longer there. And the Darwin builder looks happy since that landed.

Huh, I thought the other benchmarks were flaking too but now that I went to rerun them it doesn't look like they are. I wonder what I had wrong. Great, closing.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 3, 2024
Mchnan pushed a commit to Mchnan/go-sylixos that referenced this issue Jul 9, 2024
There is no reason to go across a pipe when replaying a conn recording.
This avoids the complexity of using localPipe and goroutines, and makes
handshake benchmarks more accurate, as we don't measure network
overhead.

Also note how it removes the need for -fast: operating locally we know
when the flow is over and can error out immediately, without waiting for
a read from the feeder on the other side of the pipe to timeout.

Avoids some noise in golang#67979, but doesn't fix the two root causes:
localPipe flakes and testing.B races.

Updates golang#67979

Change-Id: I153d3fa5a24847f3947823e8c3a7bc639f89bc1d
Reviewed-on: https://go-review.googlesource.com/c/go/+/594255
Auto-Submit: Filippo Valsorda <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Roland Shoemaker <[email protected]>
Reviewed-by: Joedian Reid <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Projects
Status: Done
Development

No branches or pull requests

8 participants