
Share callstack b/w Caller and Stack #808

Closed · abhinav wants to merge 5 commits into master from abg/share-caller-stack

Conversation

@abhinav abhinav (Collaborator) commented Mar 27, 2020

A log entry that includes both the caller and the stack trace requests
stack information twice: once to determine the caller and once to build
the stack trace.

This change optimizes this by requesting call stack information once and
sharing it for both cases.
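The core idea, as an illustrative standalone sketch (not the PR's actual code): capture the program counters once with `runtime.Callers`, take the caller from the first frame, and format the same frames into the stack trace.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// captureOnce is an illustrative sketch only: it calls runtime.Callers a
// single time and derives both the caller and the formatted stack trace
// from that one capture.
func captureOnce() (caller runtime.Frame, stack string) {
	pcs := make([]uintptr, 64)
	// Skip runtime.Callers and captureOnce itself so the first program
	// counter belongs to our caller.
	pcs = pcs[:runtime.Callers(2, pcs)]

	// The caller is just the first captured frame.
	caller, _ = runtime.CallersFrames(pcs).Next()

	// The stack trace is built from the same program counters.
	var sb strings.Builder
	frames := runtime.CallersFrames(pcs)
	for {
		f, more := frames.Next()
		fmt.Fprintf(&sb, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
		if !more {
			break
		}
	}
	return caller, sb.String()
}

func main() {
	caller, stack := captureOnce()
	fmt.Println("caller:", caller.Function)
	fmt.Print(stack)
}
```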


I ran the benchmarks with `-count 3`.

**Before**

```
BenchmarkAddCallerHook-4         1262595               886 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4         1344021               889 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4         1000000              1085 ns/op             248 B/op          3 allocs/op
BenchmarkAddStack-4               483288              2160 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               550969              1993 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               614550              2000 ns/op             304 B/op          2 allocs/op
BenchmarkAddCallerAndStack-4      397208              2554 ns/op             552 B/op          5 allocs/op
BenchmarkAddCallerAndStack-4      462566              2578 ns/op             552 B/op          5 allocs/op
BenchmarkAddCallerAndStack-4      459589              2980 ns/op             552 B/op          5 allocs/op
```

**After**

```
BenchmarkAddCallerHook-4         1138980               960 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4         1096950               965 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4         1242375               978 ns/op             240 B/op          2 allocs/op
BenchmarkAddStack-4               868773              1388 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               857756              1438 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               865842              1348 ns/op             304 B/op          2 allocs/op
BenchmarkAddCallerAndStack-4      701884              1715 ns/op             544 B/op          4 allocs/op
BenchmarkAddCallerAndStack-4      698002              1709 ns/op             544 B/op          4 allocs/op
BenchmarkAddCallerAndStack-4      608336              1712 ns/op             544 B/op          4 allocs/op
```

On average,

  1. Adding caller alone is ~14 nanoseconds (~1.5%) slower
  2. Adding stack alone is ~659 nanoseconds (~32%) faster
  3. Adding both caller and stack is ~992 nanoseconds (~36%) faster

The second of these rightfully raises eyebrows because, at best, no
change was expected there. A cursory investigation doesn't reveal any
obvious reason for the speedup, unless there's something incredibly
wrong with the change.
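(For reference, those figures are differences of the per-benchmark averages over the three runs above; e.g. for AddStack, (2160 + 1993 + 2000) / 3 ≈ 2051 ns/op before vs. (1388 + 1438 + 1348) / 3 ≈ 1391 ns/op after, i.e. roughly 660 ns or ~32% faster.)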

This adds benchmarks to measure the cost of logging with a stack trace and
of logging with both caller and stack trace.

codecov bot commented Mar 27, 2020

Codecov Report

Merging #808 into master will increase coverage by 0.14%.
The diff coverage is 100.00%.


```diff
@@            Coverage Diff             @@
##           master     #808      +/-   ##
==========================================
+ Coverage   98.31%   98.45%   +0.14%
==========================================
  Files          43       43
  Lines        2311     2464     +153
==========================================
+ Hits         2272     2426     +154
+ Misses         32       31       -1
  Partials        7        7
```
| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| logger.go | 95.23% <100.00%> (+0.64%) | ⬆️ |
| stacktrace.go | 100.00% <100.00%> (+2.12%) | ⬆️ |
| field.go | 100.00% <0.00%> (ø) | |
| zapcore/field.go | 100.00% <0.00%> (ø) | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f994215...332434f.

A log entry that includes both the caller and the stack trace requests
stack information twice: once to determine the caller and once to build
the stack trace.

This change optimizes this by requesting call stack information once and
sharing it for both cases.

---

I ran the benchmarks with `-count 3`.

**Before**

```
BenchmarkAddCallerHook-4         1265024               912 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4         1389937               949 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4         1327134               862 ns/op             248 B/op          3 allocs/op
BenchmarkAddStack-4               565006              1911 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               644364              2180 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               547028              1896 ns/op             304 B/op          2 allocs/op
BenchmarkAddCallerAndStack-4      405864              2471 ns/op             552 B/op          5 allocs/op
BenchmarkAddCallerAndStack-4      492368              2940 ns/op             552 B/op          5 allocs/op
BenchmarkAddCallerAndStack-4      414658              2588 ns/op             552 B/op          5 allocs/op
```

**After**

```
BenchmarkAddCallerHook-4         1000000              1054 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4         1245952              1009 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4         1102441               974 ns/op             240 B/op          2 allocs/op
BenchmarkAddStack-4               908425              1269 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               774799              1345 ns/op             304 B/op          2 allocs/op
BenchmarkAddStack-4               836653              1575 ns/op             304 B/op          2 allocs/op
BenchmarkAddCallerAndStack-4      716456              1623 ns/op             544 B/op          4 allocs/op
BenchmarkAddCallerAndStack-4      562585              1826 ns/op             544 B/op          4 allocs/op
BenchmarkAddCallerAndStack-4      705979              1830 ns/op             544 B/op          4 allocs/op
```

On average,

1. Adding caller alone is ~104 nanoseconds *slower*
2. Adding stack alone is ~600 nanoseconds *faster*
3. Adding both caller and stack is ~906 nanoseconds *faster*

The second of these rightfully raises eyebrows because, at best, no
change was expected there. A cursory investigation doesn't reveal any
obvious reason for the speedup, unless there's something incredibly
wrong with the change.
The change previously caused a regression on the caller-only route because
it captured more of the stack than necessary.

This restores that route to its original performance by limiting the
depth of the captured stack.
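For illustration, the caller-only route only needs the topmost frame, so a one-element buffer suffices. A standalone sketch using only the `runtime` package (not the PR's actual helper):

```go
// callerOnly is a sketch of the caller-only fast path: capture a single
// program counter instead of walking the whole stack.
func callerOnly(skip int) (runtime.Frame, bool) {
	var pc [1]uintptr
	// skip+2 skips runtime.Callers and callerOnly itself.
	if runtime.Callers(skip+2, pc[:]) == 0 {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pc[:]).Next()
	return frame, true
}
```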

master:

```
BenchmarkAddCallerHook-4         1042035              1036 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4          916015              1096 ns/op             248 B/op          3 allocs/op
BenchmarkAddCallerHook-4         1178632              1193 ns/op             248 B/op          3 allocs/op
```

With this change:

```
BenchmarkAddCallerHook-4         1090138              1056 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4         1135695              1068 ns/op             240 B/op          2 allocs/op
BenchmarkAddCallerHook-4          896173              1122 ns/op             240 B/op          2 allocs/op
```
This tests the behavior of Callers when the limit is larger than 64 (the
value used by the pool by default).

To test this reliably, I had to add an option to disable skipping zap
frames in makeStacktrace.
This verifies that when there is no limit on the stack depth, deep
stacks are resized as expected.
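Not the PR's actual test, but a standalone sketch of the behavior being verified: when the stack is deeper than the initial 64-entry buffer, the buffer has to grow until `runtime.Callers` can record every frame.

```go
package zap_test // illustrative only; not the test added in this PR

import (
	"runtime"
	"testing"
)

// deepStack calls f with at least n extra frames on the call stack.
func deepStack(n int, f func()) {
	if n == 0 {
		f()
		return
	}
	deepStack(n-1, f)
}

func TestDeepStackIsResized(t *testing.T) {
	deepStack(100, func() {
		pc := make([]uintptr, 64) // same starting size as the pool's default
		for {
			n := runtime.Callers(1, pc)
			if n < len(pc) {
				pc = pc[:n]
				break
			}
			// The buffer filled up, so the stack may have been truncated:
			// grow the buffer and retry.
			pc = make([]uintptr, len(pc)*2)
		}
		if len(pc) <= 64 {
			t.Errorf("expected more than 64 frames, got %d", len(pc))
		}
	})
}
```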
@abhinav abhinav changed the title from "WIP: Share callstack b/w Caller and Stack" to "Share callstack b/w Caller and Stack" on Mar 27, 2020
@abhinav abhinav requested a review from prashantv March 27, 2020 23:40
```go
for {
	// Skip the call to runtime.Callers and programCounter.Callers
	// so that the program counters start at our caller.
	numFrames = runtime.Callers(skip+2, pc)
```
Collaborator


Can do `numFrames :=` since we don't use the value outside of the for loop and always replace it.

```go
var trace string
withStackDepth(tt.stackDepth, func() {
	pcs := newProgramCounters().Callers(1, tt.limit)
	trace = makeStacktrace(pcs, false /* don't skip zap */)
```
Collaborator


Trying to avoid this boolean arg (it's a test-only code path, and generally boolean args are not great). Presumably you need a fair bit of stack depth; is it possible to use something outside of zap that takes a callback, e.g.
https://golang.org/pkg/bytes/#Map
Hacky PoC: https://play.golang.org/p/bKrUH8kEreE

And keep calling that function recursively till we hit the limit we want? That should add non-zap stack frames.
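Something along those lines (my own sketch of the suggestion, not the linked playground code): recursing through `bytes.Map` puts stdlib frames on the stack before the callback runs.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
)

// withNonZapFrames invokes fn with roughly depth levels of non-zap
// (stdlib + this helper) frames on the call stack, by recursing through
// the bytes.Map callback.
func withNonZapFrames(depth int, fn func()) {
	if depth <= 0 {
		fn()
		return
	}
	bytes.Map(func(r rune) rune {
		withNonZapFrames(depth-1, fn)
		return r
	}, []byte("x"))
}

func main() {
	withNonZapFrames(40, func() {
		pc := make([]uintptr, 256)
		fmt.Println("captured", runtime.Callers(0, pc), "frames")
	})
}
```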

```
ce.Entry.Caller = zapcore.NewEntryCaller(0, "", 0, false)
} else {
frame, _ := runtime.CallersFrames(pcs[callerIdx:]).Next()
frame, _ := runtime.CallersFrames(pcs).Next()
```
Collaborator


If we're adding both caller and stack, then the stackSkip passed in is 0 (i.e., no caller skip), so the caller value used here would be incorrect, right? We wouldn't be respecting log.callerSkip.

```go
if limit < len(pc) {
	pc = pc[:limit]
} else {
	pc = make([]uintptr, limit)
```
Collaborator


This case never gets hit in prod, right? Is there any way we can refactor the change to avoid adding a case that's not necessary in prod?

@abhinav abhinav (Collaborator, Author) commented Aug 9, 2021

Closing until I have a chance to look at this again in the future.

@abhinav abhinav closed this Aug 9, 2021
@abhinav abhinav deleted the abg/share-caller-stack branch August 9, 2021 16:00
abhinav added a commit that referenced this pull request Jan 27, 2022
(Supersedes #808.)

If we use the `AddCaller(true)` option for a logger with
`AddStacktrace(level)`, which captures a stack trace for the specified
level or higher, Zap currently inspects the stack twice:

- `getCallerFrame` retrieves information about the immediate caller of
  the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace
  and build a string from it

For the case where both caller and stack trace are requested,
we can avoid redundant work by sharing information about the call stack
between the logic for the two.

To accomplish this, the following high-level pieces were added:

    type stacktrace
      captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
      func (*stacktrace) Next() (runtime.Frame, bool)

    type stackFormatter
      func newStackFormatter(*buffer.Buffer)
      func (*stackFormatter) FormatStack(*stacktrace)
      func (*stackFormatter) FormatFrame(runtime.Frame)

`captureStacktrace` captures stack traces (either just one frame or the
entire stack) that can then be inspected manually (to extract caller
information) and formatted into a string representation either
one-frame-at-a-time or wholesale with `stackFormatter`.

This allows us to reorganize the logic that applies the AddCaller and
AddStacktrace options so that it can:

- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace
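For illustration, the reorganized path might be wired up roughly like this (a pseudo-usage sketch against the signatures above; `stackDepthFull`, `bufferpool.Get`, and the surrounding `ce` entry are assumptions, not the exact code in the commit):

```go
// Capture the stack once; the requested depth depends on whether a full
// trace is needed or only the caller.
stack := captureStacktrace(skip, stackDepthFull)

// Caller: the first frame of the shared capture.
frame, more := stack.Next()
ce.Entry.Caller = zapcore.NewEntryCaller(frame.PC, frame.File, frame.Line, true)

// Stack trace: format that frame plus the remaining frames into a string.
buf := bufferpool.Get()
sf := newStackFormatter(buf)
sf.FormatFrame(frame)
if more {
	sf.FormatStack(stack)
}
ce.Entry.Stack = buf.String()
```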

---

Benchmarking:
I ran the new benchmark included in this PR, as well as the existing
stack trace/caller based benchmarks 10 times for 10 seconds each.

```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -benchtime 10s -count 10
[...]
name                       old time/op    new time/op    delta
StackField-96                 834ns ± 8%     819ns ±14%     ~     (p=0.971 n=10+10)
AddCallerHook-96              466ns ± 7%     480ns ± 1%     ~     (p=0.118 n=10+6)
AddCallerAndStacktrace-96    1.16µs ± 4%    0.78µs ± 8%  -32.60%  (p=0.000 n=9+10)
TakeStacktrace-96            2.66µs ± 2%    2.74µs ± 2%   +3.01%  (p=0.000 n=9+10)

name                       old alloc/op   new alloc/op   delta
StackField-96                  542B ± 0%      544B ± 0%   +0.33%  (p=0.000 n=10+10)
AddCallerHook-96               254B ± 0%      248B ± 0%   -2.36%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      652B ± 0%      429B ± 0%  -34.20%  (p=0.000 n=9+6)
TakeStacktrace-96              420B ± 0%      421B ± 0%   +0.24%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
StackField-96                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-96               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-96              2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Summary:

- AddCaller+AddStacktrace takes 2 fewer allocations and is ~30% faster
- AddCaller costs one allocation less
- takeStacktrace costs 3% more CPU

I think that the overall improvement is worth the slightly more expensive stack traces.

Refs GO-1181