From 5ad44f008f9e90c832ffb859a6d234db399b8325 Mon Sep 17 00:00:00 2001 From: John Kerl Date: Mon, 27 Dec 2021 23:21:31 -0500 Subject: [PATCH] Update README-profiling.md --- README-go-port.md | 4 ++ README-profiling.md | 139 +++++++++--------------------------------- scripts/time-big-file | 2 +- todo.txt | 4 -- 4 files changed, 34 insertions(+), 115 deletions(-) diff --git a/README-go-port.md b/README-go-port.md index c020d34c64..e6cde483d5 100644 --- a/README-go-port.md +++ b/README-go-port.md @@ -163,6 +163,7 @@ nil through the reader/transformer/writer sequence. Key performance-related PRs for the Go port include: +* [https://github.com/johnkerl/miller/pull/424](#424): In C, lots of little mallocs are fine. In Go, not so much. This is not the garbage-collection penalty -- it's the penalty of _allocating_ -- lots of `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data-copies in the DSL. * [https://github.com/johnkerl/miller/pull/765](#765): In C, prints to `stdout` are buffered a line at a time if the output is to the terminal, or a few KB at a time if not (i.e. file or pipe). Note the cost is how often the process does a `write` system call with associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on writes were all unbuffered -- several per record. Then buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal. * [https://github.com/johnkerl/miller/pull/774](#774): For CSV-lite and DKVP, this avoids using regexes to split strings when `strings.Split` will do. 
* [https://github.com/johnkerl/miller/pull/779](#779): The basic idea of the Miller Go port was that the record-reader writes a record at a time over a channel to the first verb; the first verb writes records one at a time to the second verb, and so on; the last verb writes records one at a time to the record-writer. This is very simple, but for large files, the Go runtime scheduler overhead is too large -- data are chopped up into too many pieces. As of #779, records are written 500 (or fewer) per batch, and all the channels from record-reader, to verbs, to record-writer carry record-batches. This lets Miller spend more time doing its job and less time yielding to the goroutine scheduler. @@ -170,9 +171,12 @@ Key performance-related PRs for the Go port include: * [https://github.com/johnkerl/miller/pull/787](#787): This removed an unnecessary data copy in the `mlrval.String()` method. Originally this method had a non-pointer receiver to conform with the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement. * [https://github.com/johnkerl/miller/pull/809](#809): This reduced the number of passes through fields for just-in-time type-inference. For example, for `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's int (like `123`), float (like `123.4` or `1.2e3`), or string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking at strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation. 
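The batching idea in #779 can be sketched as follows. The types and helper here are hypothetical stand-ins (Miller's actual record and channel types differ); the point is that the channel carries slices of records, so the goroutine scheduler is crossed once per batch rather than once per record:

```go
package main

import "fmt"

type record map[string]string // toy stand-in for Miller's record type

const batchSize = 500 // batch size as in #779

// sendBatched writes records to the channel in batches of up to batchSize,
// so each channel send (and scheduler yield) covers many records.
// (Hypothetical sketch; not Miller's actual reader/verb/writer code.)
func sendBatched(records []record, out chan<- []record) {
	for start := 0; start < len(records); start += batchSize {
		end := start + batchSize
		if end > len(records) {
			end = len(records)
		}
		out <- records[start:end]
	}
	close(out)
}

func main() {
	records := make([]record, 1200)
	out := make(chan []record, 4)
	go sendBatched(records, out)
	nbatches := 0
	for range out {
		nbatches++
	}
	fmt.Println(nbatches) // 1200 records in batches of ≤500 → 3 batches
}
```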
+See also [./README-profiling.md](./README-profiling.md) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks). + In summary: * #765, #774, and #787 were low-hanging fruit. +* #424 was a bit more involved, and reveals that memory allocation -- not just GC -- needs to be handled more mindfully in Go than in C. * #779 was a bit more involved, and reveals that Go's elegant goroutine/channel processing model comes with the caveat that channelized data should not be organized in many, small pieces. * #809 was also a bit more involved, and reveals that library functions are convenient, but profiling and analysis can sometimes reveal an opportunity for an impactful, custom solution. * #786 was a massive refactor involving about 10KLOC -- in hindsight it would have been best to do this work at the start of the Go port, not at the end. diff --git a/README-profiling.md b/README-profiling.md index 19546b299b..dcf0310cdd 100644 --- a/README-profiling.md +++ b/README-profiling.md @@ -1,60 +1,55 @@ -# How to run the profiler +# See also -Running Miller: +See also [https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations](https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks). -``` -mlr --cpuprofile cpu.pprof put -f u/example.mlr then nothing ~/tmp/huge > /dev/null -``` +# How to view profile data -(or whatever command-line flags). +Run the profiler: -# How to view the profiling results +``` +mlr --cpuprofile cpu.pprof --csv put -f scripts/chain-1.mlr ~/tmp/big.csv > /dev/null +``` +(or whatever command-line flags for Miller). 
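The lighter string-peeking behind #809, mentioned above, can be sketched like this. `looksNumeric` and `inferType` are hypothetical names, not Miller's actual inference code; the idea is to reject obviously non-numeric strings by inspecting a leading byte before paying for a full parse:

```go
package main

import (
	"fmt"
	"strconv"
)

// looksNumeric is a hypothetical cheap pre-check: peek at the first byte
// (after an optional sign) to see whether a numeric parse could possibly
// succeed, avoiding expensive library calls for plainly non-numeric fields.
func looksNumeric(s string) bool {
	if s == "" {
		return false
	}
	c := s[0]
	if c == '-' || c == '+' {
		if len(s) == 1 {
			return false
		}
		c = s[1]
	}
	return (c >= '0' && c <= '9') || c == '.'
}

// inferType sketches just-in-time type inference: string unless the cheap
// peek passes, then int before float, as in `123` vs `123.4` / `1.2e3`.
func inferType(s string) string {
	if !looksNumeric(s) {
		return "string" // cheap rejection: no parse attempted
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	return "string"
}

func main() {
	for _, s := range []string{"123", "1.2e3", "hello", "-4.5"} {
		fmt.Println(s, inferType(s))
	}
}
```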
-## Text mode +Text mode: ``` go tool pprof mlr cpu.pprof top10 ``` -## PDF mode +Graphical mode: ``` -go tool pprof --pdf mlr cpu.pprof > mlr-call-graph.pdf -mv mlr-call-graph.pdf ~/Desktop +go tool pprof -http=:8080 cpu.pprof ``` -## Flame-graph mode +and let it pop open a browser window. -One-time setup: +# Benchmark scripts +Scripts: -``` -export GOPATH=$HOME/go -mkdir -p $HOME/go -``` +* [./scripts/make-big-files](./scripts/make-big-files) -- Create million-record data files in various formats. +* [./scripts/chain-cmps.sh](./scripts/chain-cmps.sh) -- Run a few processing scenarios on the million-record CSV file. + * [./scripts/chain-1.mlr](./scripts/chain-1.mlr) -- An example `mlr put` used by the previous script. +* [./scripts/time-big-files](./scripts/time-big-files) -- Runs `mlr cat` for million-record files of various file formats. Catting files isn't intrinsically interesting but it shows how input and output processing vary over file formats. + * [./scripts/time-big-file](./scripts/time-big-file) -- Helper script for the previous script. +* [./scripts/chain-lengths.sh](./scripts/chain-lengths.sh) -- Run longer and longer chains of `scripts/chain-1.mlr`, showing how Miller handles multicore and concurrency. +* [./scripts/make-data-stream](./scripts/make-data-stream) -- Create an endless stream of data to be piped into Miller for steady-state load-testing: e.g. `scripts/make-data-stream | mlr ...` then look at `htop` in another window. -``` -go get -u github.com/google/pprof -ll ~/go/bin/pprof -go get -u github.com/uber/go-torch -``` +Notes: -``` -mkdir -p ~/git/brendangregg -cd ~/git/brendangregg -git clone https://github.com/brendangregg/FlameGraph -``` +* Any of the above can be run using the profiler. I find Flame Graph mode particularly informative for drill-down. +* The above refer to `mlr5` and `~/tmp/miller/mlr` as well as `./mlr`. The idea is I have a copy of Miller 5.10.3 (the C implementation) saved off in my path as `mlr5`. 
Then I keep `~/tmp/miller` on recent HEAD. Then I have `.` on a dev branch. Comparing `mlr5` to `./mlr` shows relative performance of the C and Go implementations. Comparing `~/tmp/miller/mlr` to `./mlr` shows relative performance of whatever optimization I'm currently working on. +* Several of the above scripts use [justtime](https://github.com/johnkerl/scripts/blob/main/fundam/justtime) to get one-line timing information. -Per run: +# How to vary compiler versions -``` -cd /path/to/mlr/go -export PATH=${PATH}:~/git/brendangregg/FlameGraph/ -go-torch cpu.pprof -mv torch.svg ~/Desktop/ -``` +* [./scripts/compiler-versions-install](./scripts/compiler-versions-install) +* [./scripts/compiler-versions-build](./scripts/compiler-versions-build) +* [./scripts/compiler-versions-time](./scripts/compiler-versions-time) # How to control garbage collection @@ -69,79 +64,3 @@ GOGC=1000 GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null # Turn off GC entirely and see where time is spent: GOGC=off GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null ``` - -# Findings from the profiler as of 2021-02 - -* GC on: lots of GC -* GC off: lots of `duffcopy` and `madvise` -* From benchmark `u/mand.mlr`: issue is allocation created by expressions - * Things like `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output Mlrval)` - * `z = 2 + x + 4 + y * 3` results in AST, mapped to a CST, with a malloc on the output of every unary/binary/ternary function - * Idea: replace with `type BinaryFunc func(output* Mlrval, input1 *Mlrval, input2 *Mlrval)`: no allocation at all. - * Breaks the Fibonacci test since the binary-operator node is no longer re-entrant - * Idea: replace with `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output *Mlrval)`: better. - * Makes possible zero-copy eval of literal nodes, etc. 
- -``` -for i in 100 200 300 400 500 600 700 800 900 1000 ; do - for j in 1 2 3 4 5 ; do - echo $i; - justtime GOGC=$i mlr -n put -q -f u/mand.mlr > /dev/null - done -done -``` - -``` - 100 23.754 - 100 23.883 - 100 24.021 - 100 24.022 - 100 24.305 - 200 20.864 - 200 20.211 - 200 19.980 - 200 20.251 - 200 20.691 - 300 19.140 - 300 18.610 - 300 18.793 - 300 19.111 - 300 19.027 - 400 18.067 - 400 18.274 - 400 18.344 - 400 18.378 - 400 18.250 - 500 17.791 - 500 17.644 - 500 17.814 - 500 18.064 - 500 18.403 - 600 17.878 - 600 17.892 - 600 18.034 - 600 18.125 - 600 18.008 - 700 18.153 - 700 18.286 - 700 17.342 - 700 21.136 - 700 20.729 - 800 19.585 - 800 19.116 - 800 17.170 - 800 18.549 - 800 18.236 - 900 16.950 - 900 17.883 - 900 17.532 - 900 17.551 - 900 17.804 -1000 20.076 -1000 20.745 -1000 19.657 -1000 18.733 -1000 18.560 -``` - -Sweet spot around 500. Note https://golang.org/pkg/runtime/debug/#SetGCPercent. diff --git a/scripts/time-big-file b/scripts/time-big-file index f660391a46..fef1e07ca2 100755 --- a/scripts/time-big-file +++ b/scripts/time-big-file @@ -1,6 +1,6 @@ #!/bin/bash -mlr="mlr -S" +mlr="mlr" suffix="dkvp" iflag="" diff --git a/todo.txt b/todo.txt index 6e891c3c83..30aa087c42 100644 --- a/todo.txt +++ b/todo.txt @@ -1,10 +1,6 @@ ================================================================ PUNCHDOWN LIST -* numeric-inference perf - o README-profiling.md -- update ... - o README-go-port.md re various scripts - * blockers: - fractional-strptime - improved regex doc w/ lots of examples