Update README-profiling.md #810

Merged: 1 commit, Dec 28, 2021
4 changes: 4 additions & 0 deletions README-go-port.md
@@ -163,16 +163,20 @@ nil through the reader/transformer/writer sequence.

Key performance-related PRs for the Go port include:

* [https://github.com/johnkerl/miller/pull/424](#424): In C, lots of little mallocs are fine. In Go, not so much. This is not the garbage-collection penalty -- it's the penalty of _allocating_ -- lots of `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data-copies in the DSL.
* [https://github.com/johnkerl/miller/pull/765](#765): In C, prints to `stdout` are buffered a line at a time if the output is a terminal, or a few KB at a time if not (i.e. file or pipe). The cost is how often the process does a `write` system call, with the associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on, writes were all unbuffered -- several per record. Buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal.
* [https://github.com/johnkerl/miller/pull/774](#774): For CSV-lite and DKVP, this avoids using regexes to split strings when `strings.Split` will do.
* [https://github.com/johnkerl/miller/pull/779](#779): The basic idea of the Miller Go port was that the record-reader writes records one at a time over a channel to the first verb; the first verb writes records one at a time to the second verb, and so on; the last verb writes records one at a time to the record-writer. This is very simple, but for large files the Go runtime scheduler overhead is too large -- the data are chopped up into too many pieces. As of #779, records are written 500 (or fewer) per batch, and all the channels from record-reader, to verbs, to record-writer carry record-batches. This lets Miller spend more time doing its job and less time yielding to the goroutine scheduler.
* [https://github.com/johnkerl/miller/pull/786](#786): In the C version, all values were strings until operated on explicitly by a verb. In the Go port, initially, all values were type-inferred on read, with types retained throughout the processing chain. This was an incredibly elegant and empowering design decision -- central to the Go port, in fact -- but it came with the cost that _all_ fields were being scanned as float/int even if they weren't used in the processing chain. On #786, fields are left as raw strings with type "pending", and are just-in-time inferred to string/int/float only when used within the processing chain.
* [https://github.com/johnkerl/miller/pull/787](#787): This removed an unnecessary data copy in the `mlrval.String()` method. Originally this method had a non-pointer receiver to conform to the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement.
* [https://github.com/johnkerl/miller/pull/809](#809): This reduced the number of passes through fields for just-in-time type-inference. For example, for `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's int (like `123`), float (like `123.4` or `1.2e3`), or string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking of strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation.
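
The record-batching scheme from #779 can be sketched as follows. This is a minimal sketch with illustrative types and names, not Miller's actual API; only the batch size of 500 comes from the description above.

```go
package main

import "fmt"

// Record stands in for a parsed Miller record; the real type differs.
type Record map[string]string

// batchSize is the per-channel-send batch size described above.
const batchSize = 500

// reader sends records in batches so the goroutine scheduler is crossed
// once per batch rather than once per record.
func reader(records []Record, out chan<- []Record) {
	batch := make([]Record, 0, batchSize)
	for _, r := range records {
		batch = append(batch, r)
		if len(batch) == batchSize {
			out <- batch
			batch = make([]Record, 0, batchSize)
		}
	}
	if len(batch) > 0 { // flush the final partial batch
		out <- batch
	}
	close(out)
}

func main() {
	out := make(chan []Record)
	go reader(make([]Record, 1200), out)
	nBatches, nRecords := 0, 0
	for batch := range out {
		nBatches++
		nRecords += len(batch)
	}
	fmt.Println(nBatches, nRecords) // 1200 records travel as 3 batches
}
```

Verbs and the record-writer would consume and re-emit `[]Record` the same way, so each channel hop costs one scheduler interaction per few hundred records instead of one per record.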
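
The "lighter peeking" idea from #809 can be sketched like this: rule out non-numeric strings by looking at the first byte before paying for a full `strconv` parse. The function name and exact rules here are illustrative, not Miller's actual inference code.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType peeks at the leading byte to reject obvious non-numbers
// cheaply, and only then calls the more expensive strconv parsers.
func inferType(s string) string {
	if s == "" {
		return "string"
	}
	c := s[0]
	// A numeric field must start with a digit, a sign, or a dot.
	if (c < '0' || c > '9') && c != '-' && c != '+' && c != '.' {
		return "string"
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	return "string"
}

func main() {
	for _, s := range []string{"123", "123.4", "1.2e3", "hello"} {
		fmt.Printf("%s -> %s\n", s, inferType(s))
	}
}
```

For typical data where many fields are non-numeric strings, the one-byte peek means most fields never reach the parse calls at all.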

See also [./README-profiling.md](./README-profiling.md) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks).

In summary:

* #765, #774, and #787 were low-hanging fruit.
* #424 was a bit more involved, and reveals that memory allocation -- not just GC -- needs to be handled more mindfully in Go than in C.
* #779 was a bit more involved, and reveals that Go's elegant goroutine/channel processing model comes with the caveat that channelized data should not be organized in many, small pieces.
* #809 was also a bit more involved, and reveals that library functions are convenient, but profiling and analysis can sometimes reveal an opportunity for an impactful custom solution.
* #786 was a massive refactor involving about 10KLOC -- in hindsight it would have been best to do this work at the start of the Go port, not at the end.
139 changes: 29 additions & 110 deletions README-profiling.md
@@ -1,60 +1,55 @@
# How to run the profiler
# See also

Running Miller:
See also [https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations](https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks).

```
mlr --cpuprofile cpu.pprof put -f u/example.mlr then nothing ~/tmp/huge > /dev/null
```
# How to view profile data

(or whatever command-line flags).
Run the profiler:

# How to view the profiling results
```
mlr --cpuprofile cpu.pprof --csv put -f scripts/chain-1.mlr ~/tmp/big.csv > /dev/null
```

(or whatever command-line flags for Miller).

## Text mode
Text mode:

```
go tool pprof mlr cpu.pprof
top10
```

## PDF mode
Graphical mode:

```
go tool pprof --pdf mlr cpu.pprof > mlr-call-graph.pdf
mv mlr-call-graph.pdf ~/Desktop
go tool pprof -http=:8080 cpu.pprof
```

## Flame-graph mode
and let it pop open a browser window.

One-time setup:
# Benchmark scripts

Scripts:

```
export GOPATH=$HOME/go
mkdir -p $HOME/go
```
* [./scripts/make-big-files](./scripts/make-big-files) -- Create million-record data files in various formats.
* [./scripts/chain-cmps.sh](./scripts/chain-cmps.sh) -- Run a few processing scenarios on the million-record CSV file.
* [./scripts/chain-1.mlr](./scripts/chain-1.mlr) -- An example `mlr put` expression used by the previous script.
* [./scripts/time-big-files](./scripts/time-big-files) -- Runs `mlr cat` for million-record files of various file formats. Catting files isn't intrinsically interesting but it shows how input and output processing vary over file formats.
* [./scripts/time-big-file](./scripts/time-big-file) -- Helper script for `time-big-files`.
* [./scripts/chain-lengths.sh](./scripts/chain-lengths.sh) -- Run longer and longer chains of `scripts/chain-1.mlr`, showing how Miller handles multicore and concurrency.
* [./scripts/make-data-stream](./scripts/make-data-stream) -- Create an endless stream of data to be piped into Miller for steady-state load-testing: e.g. `scripts/make-data-stream | mlr ...` then look at `htop` in another window.

```
go get -u github.com/google/pprof
ll ~/go/bin/pprof
go get -u github.com/uber/go-torch
```
Notes:

```
mkdir -p ~/git/brendangregg
cd ~/git/brendangregg
git clone https://github.com/brendangregg/FlameGraph
```
* Any of the above can be run using the profiler. I find flame-graph mode particularly informative for drill-down.
* The above scripts refer to `mlr5` and `~/tmp/miller/mlr` as well as `./mlr`. The idea is that I have a copy of Miller 5.10.3 (the C implementation) saved off in my path as `mlr5`. Then I keep `~/tmp/miller` at recent HEAD. Then I have `.` on a dev branch. Comparing `mlr5` to `./mlr` shows the relative performance of the C and Go implementations. Comparing `~/tmp/miller/mlr` to `./mlr` shows the relative performance of whatever optimization I'm currently working on.
* Several of the above scripts use [justtime](https://github.com/johnkerl/scripts/blob/main/fundam/justtime) to get one-line timing information.

Per run:
# How to vary compiler versions

```
cd /path/to/mlr/go
export PATH=${PATH}:~/git/brendangregg/FlameGraph/
go-torch cpu.pprof
mv torch.svg ~/Desktop/
```
* [./scripts/compiler-versions-install](./scripts/compiler-versions-install)
* [./scripts/compiler-versions-build](./scripts/compiler-versions-build)
* [./scripts/compiler-versions-time](./scripts/compiler-versions-time)

# How to control garbage collection

@@ -69,79 +64,3 @@ GOGC=1000 GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null
# Turn off GC entirely and see where time is spent:
GOGC=off GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null
```

# Findings from the profiler as of 2021-02

* GC on: lots of GC
* GC off: lots of `duffcopy` and `madvise`
* From benchmark `u/mand.mlr`: issue is allocation created by expressions
* Things like `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output Mlrval)`
* `z = 2 + x + 4 + y * 3` results in AST, mapped to a CST, with a malloc on the output of every unary/binary/ternary function
* Idea: replace with `type BinaryFunc func(output* Mlrval, input1 *Mlrval, input2 *Mlrval)`: no allocation at all.
* Breaks the Fibonacci test since the binary-operator node is no longer re-entrant
* Idea: replace with `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output *Mlrval)`: better.
* Makes possible zero-copy eval of literal nodes, etc.
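
The signature change above can be sketched as follows. `Mlrval` here is a minimal stand-in with illustrative fields, not Miller's actual struct; the point is the difference between the value-return and pointer-return shapes.

```go
package main

import "fmt"

// Mlrval is an illustrative stand-in for Miller's variant value type.
type Mlrval struct {
	mvtype int
	intval int64
}

// Value-return style: every call copies an Mlrval out of the function,
// which is what showed up as duffcopy in the flame graphs.
type BinaryFuncByValue func(input1, input2 *Mlrval) (output Mlrval)

// Pointer-return style (the "better" idea above): the callee allocates
// once and callers pass the pointer along, enabling zero-copy evaluation
// of literal nodes while keeping each operator node re-entrant.
type BinaryFuncByPointer func(input1, input2 *Mlrval) (output *Mlrval)

func addByValue(a, b *Mlrval) Mlrval {
	return Mlrval{mvtype: 1, intval: a.intval + b.intval}
}

func addByPointer(a, b *Mlrval) *Mlrval {
	return &Mlrval{mvtype: 1, intval: a.intval + b.intval}
}

func main() {
	x := &Mlrval{mvtype: 1, intval: 2}
	y := &Mlrval{mvtype: 1, intval: 3}
	v := addByValue(x, y)
	p := addByPointer(x, y)
	fmt.Println(v.intval, p.intval)
}
```

Unlike the rejected output-parameter form `func(output, input1, input2 *Mlrval)`, the pointer-return form keeps operator nodes re-entrant, which is why it passes the Fibonacci test.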

```
for i in 100 200 300 400 500 600 700 800 900 1000 ; do
for j in 1 2 3 4 5 ; do
echo $i;
justtime GOGC=$i mlr -n put -q -f u/mand.mlr > /dev/null
done
done
```

```
100 23.754
100 23.883
100 24.021
100 24.022
100 24.305
200 20.864
200 20.211
200 19.980
200 20.251
200 20.691
300 19.140
300 18.610
300 18.793
300 19.111
300 19.027
400 18.067
400 18.274
400 18.344
400 18.378
400 18.250
500 17.791
500 17.644
500 17.814
500 18.064
500 18.403
600 17.878
600 17.892
600 18.034
600 18.125
600 18.008
700 18.153
700 18.286
700 17.342
700 21.136
700 20.729
800 19.585
800 19.116
800 17.170
800 18.549
800 18.236
900 16.950
900 17.883
900 17.532
900 17.551
900 17.804
1000 20.076
1000 20.745
1000 19.657
1000 18.733
1000 18.560
```

Sweet spot around 500. Note https://golang.org/pkg/runtime/debug/#SetGCPercent.
2 changes: 1 addition & 1 deletion scripts/time-big-file
@@ -1,6 +1,6 @@
#!/bin/bash

mlr="mlr -S"
mlr="mlr"
suffix="dkvp"

iflag=""
4 changes: 0 additions & 4 deletions todo.txt
@@ -1,10 +1,6 @@
================================================================
PUNCHDOWN LIST

* numeric-inference perf
o README-profiling.md -- update ...
o README-go-port.md re various scripts

* blockers:
- fractional-strptime
- improved regex doc w/ lots of examples