From 5ad44f008f9e90c832ffb859a6d234db399b8325 Mon Sep 17 00:00:00 2001 From: John Kerl Date: Mon, 27 Dec 2021 23:21:31 -0500 Subject: [PATCH] Update README-profiling.md --- README-go-port.md | 4 ++ README-profiling.md | 139 +++++++++--------------------------------- scripts/time-big-file | 2 +- todo.txt | 4 -- 4 files changed, 34 insertions(+), 115 deletions(-) diff --git a/README-go-port.md b/README-go-port.md index c020d34c64..e6cde483d5 100644 --- a/README-go-port.md +++ b/README-go-port.md @@ -163,6 +163,7 @@ nil through the reader/transformer/writer sequence. Key performance-related PRs for the Go port include: +* [https://github.com/johnkerl/miller/pull/424](#424): In C, lots of little mallocs are fine. In Go, not so much. This is not the garbage-collection penalty -- it's the penalty of _allocating_ -- lots of `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data-copies in the DSL. * [https://github.com/johnkerl/miller/pull/765](#765): In C, prints to `stdout` are buffered a line at a time if the output is to the terminal, or a few KB at a time if not (i.e. file or pipe). Note the cost is how often the process does a `write` system call with associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on writes were all unbuffered -- several per record. Then buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal. * [https://github.com/johnkerl/miller/pull/774](#774): For CSV-lite and DKVP, this avoids using regexes to split strings when `strings.Split` will do. 
* [https://github.com/johnkerl/miller/pull/779](#779): The basic idea of the Miller Go port was that the record-reader writes a record at a time over a channel to the first verb; the first verb writes records one at a time to the second verb, and so on; the last verb writes records one at a time to the record-writer. This is very simple, but for large files, the Go runtime scheduler overhead is too large -- data are chopped up into too many pieces. As of #779, records are written 500 (or fewer) per batch, and all the channels from record-reader, to verbs, to record-writer carry record-batches. This lets Miller spend more time doing its job and less time yielding to the goroutine scheduler. @@ -170,9 +171,12 @@ Key performance-related PRs for the Go port include: * [https://github.com/johnkerl/miller/pull/787](#787): This removed an unnecessary data copy in the `mlrval.String()` method. Originally this method had a non-pointer receiver to conform with the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement. * [https://github.com/johnkerl/miller/pull/809](#809): This reduced the number of passes through fields for just-in-time type-inference. For example, for `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's int (like `123`), float (like `123.4` or `1.2e3`), or string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking at strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation. 
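The batching idea in #779 can be sketched as follows. The types and helper here are hypothetical stand-ins (Miller's actual record and channel types differ); the point is that the channel carries slices of records, so the goroutine scheduler is crossed once per batch rather than once per record:

```go
package main

import "fmt"

type record map[string]string // toy stand-in for Miller's record type

const batchSize = 500 // batch size as in #779

// sendBatched writes records to the channel in batches of up to batchSize,
// so each channel send (and scheduler yield) covers many records.
// (Hypothetical sketch; not Miller's actual reader/verb/writer code.)
func sendBatched(records []record, out chan<- []record) {
	for start := 0; start < len(records); start += batchSize {
		end := start + batchSize
		if end > len(records) {
			end = len(records)
		}
		out <- records[start:end]
	}
	close(out)
}

func main() {
	records := make([]record, 1200)
	out := make(chan []record, 4)
	go sendBatched(records, out)
	nbatches := 0
	for range out {
		nbatches++
	}
	fmt.Println(nbatches) // 1200 records in batches of ≤500 → 3 batches
}
```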
+See also [./README-profiling.md](./README-profiling.md) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks). + In summary: * #765, #774, and #787 were low-hanging fruit. +* #424 was a bit more involved, and reveals that memory allocation -- not just GC -- needs to be handled more mindfully in Go than in C. * #779 was a bit more involved, and reveals that Go's elegant goroutine/channel processing model comes with the caveat that channelized data should not be organized in many, small pieces. * #809 was also a bit more involved, and reveals that library functions are convenient, but profiling and analysis can sometimes reveal an opportunity for an impactful, custom solution. * #786 was a massive refactor involving about 10KLOC -- in hindsight it would have been best to do this work at the start of the Go port, not at the end. diff --git a/README-profiling.md b/README-profiling.md index 19546b299b..dcf0310cdd 100644 --- a/README-profiling.md +++ b/README-profiling.md @@ -1,60 +1,55 @@ -# How to run the profiler +# See also -Running Miller: +See also [https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations](https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks). -``` -mlr --cpuprofile cpu.pprof put -f u/example.mlr then nothing ~/tmp/huge > /dev/null -``` +# How to view profile data -(or whatever command-line flags). +Run the profiler: -# How to view the profiling results +``` +mlr --cpuprofile cpu.pprof --csv put -f scripts/chain-1.mlr ~/tmp/big.csv > /dev/null +``` +(or whatever command-line flags for Miller). 
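The lighter string-peeking behind #809, mentioned above, can be sketched like this. `looksNumeric` and `inferType` are hypothetical names, not Miller's actual inference code; the idea is to reject obviously non-numeric strings by inspecting a leading byte before paying for a full parse:

```go
package main

import (
	"fmt"
	"strconv"
)

// looksNumeric is a hypothetical cheap pre-check: peek at the first byte
// (after an optional sign) to see whether a numeric parse could possibly
// succeed, avoiding expensive library calls for plainly non-numeric fields.
func looksNumeric(s string) bool {
	if s == "" {
		return false
	}
	c := s[0]
	if c == '-' || c == '+' {
		if len(s) == 1 {
			return false
		}
		c = s[1]
	}
	return (c >= '0' && c <= '9') || c == '.'
}

// inferType sketches just-in-time type inference: string unless the cheap
// peek passes, then int before float, as in `123` vs `123.4` / `1.2e3`.
func inferType(s string) string {
	if !looksNumeric(s) {
		return "string" // cheap rejection: no parse attempted
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	return "string"
}

func main() {
	for _, s := range []string{"123", "1.2e3", "hello", "-4.5"} {
		fmt.Println(s, inferType(s))
	}
}
```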
-## Text mode +Text mode: ``` go tool pprof mlr cpu.pprof top10 ``` -## PDF mode +Graphical mode: ``` -go tool pprof --pdf mlr cpu.pprof > mlr-call-graph.pdf -mv mlr-call-graph.pdf ~/Desktop +go tool pprof -http=:8080 cpu.pprof ``` -## Flame-graph mode +and let it pop open a browser window. -One-time setup: +# Benchmark scripts +Scripts: -``` -export GOPATH=$HOME/go -mkdir -p $HOME/go -``` +* [./scripts/make-big-files](./scripts/make-big-files) -- Create million-record data files in various formats. +* [./scripts/chain-cmps.sh](./scripts/chain-cmps.sh) -- Run a few processing scenarios on the million-record CSV file. + * [./scripts/chain-1.mlr](./scripts/chain-1.mlr) -- An example `mlr put` used by the previous script. +* [./scripts/time-big-files](./scripts/time-big-files) -- Runs `mlr cat` for million-record files of various file formats. Catting files isn't intrinsically interesting but it shows how input and output processing vary over file formats. + * [./scripts/time-big-file](./scripts/time-big-file) -- Helper script for the previous script. +* [./scripts/chain-lengths.sh](./scripts/chain-lengths.sh) -- Run longer and longer chains of `scripts/chain-1.mlr`, showing how Miller handles multicore and concurrency. +* [./scripts/make-data-stream](./scripts/make-data-stream) -- Create an endless stream of data to be piped into Miller for steady-state load-testing: e.g. `scripts/make-data-stream | mlr ...` then look at `htop` in another window. -``` -go get -u github.com/google/pprof -ll ~/go/bin/pprof -go get -u github.com/uber/go-torch -``` +Notes: -``` -mkdir -p ~/git/brendangregg -cd ~/git/brendangregg -git clone https://github.com/brendangregg/FlameGraph -``` +* Any of the above can be run using the profiler. I find Flame Graph mode particularly informative for drill-down. +* The above refer to `mlr5` and `~/tmp/miller/mlr` as well as `./mlr`. The idea is I have a copy of Miller 5.10.3 (the C implementation) saved off in my path as `mlr5`. 
Then I keep `~/tmp/miller` on recent HEAD. Then I have `.` on a dev branch. Comparing `mlr5` to `./mlr` shows relative performance of the C and Go implementations. Comparing `~/tmp/miller/mlr` to `./mlr` shows relative performance of whatever optimization I'm currently working on. +* Several of the above scripts use [justtime](https://github.com/johnkerl/scripts/blob/main/fundam/justtime) to get one-line timing information. -Per run: +# How to vary compiler versions -``` -cd /path/to/mlr/go -export PATH=${PATH}:~/git/brendangregg/FlameGraph/ -go-torch cpu.pprof -mv torch.svg ~/Desktop/ -``` +* [./scripts/compiler-versions-install](./scripts/compiler-versions-install) +* [./scripts/compiler-versions-build](./scripts/compiler-versions-build) +* [./scripts/compiler-versions-time](./scripts/compiler-versions-time) # How to control garbage collection @@ -69,79 +64,3 @@ GOGC=1000 GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null # Turn off GC entirely and see where time is spent: GOGC=off GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null ``` - -# Findings from the profiler as of 2021-02 - -* GC on: lots of GC -* GC off: lots of `duffcopy` and `madvise` -* From benchmark `u/mand.mlr`: issue is allocation created by expressions - * Things like `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output Mlrval)` - * `z = 2 + x + 4 + y * 3` results in AST, mapped to a CST, with a malloc on the output of every unary/binary/ternary function - * Idea: replace with `type BinaryFunc func(output* Mlrval, input1 *Mlrval, input2 *Mlrval)`: no allocation at all. - * Breaks the Fibonacci test since the binary-operator node is no longer re-entrant - * Idea: replace with `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output *Mlrval)`: better. - * Makes possible zero-copy eval of literal nodes, etc. 
- -``` -for i in 100 200 300 400 500 600 700 800 900 1000 ; do - for j in 1 2 3 4 5 ; do - echo $i; - justtime GOGC=$i mlr -n put -q -f u/mand.mlr > /dev/null - done -done -``` - -``` - 100 23.754 - 100 23.883 - 100 24.021 - 100 24.022 - 100 24.305 - 200 20.864 - 200 20.211 - 200 19.980 - 200 20.251 - 200 20.691 - 300 19.140 - 300 18.610 - 300 18.793 - 300 19.111 - 300 19.027 - 400 18.067 - 400 18.274 - 400 18.344 - 400 18.378 - 400 18.250 - 500 17.791 - 500 17.644 - 500 17.814 - 500 18.064 - 500 18.403 - 600 17.878 - 600 17.892 - 600 18.034 - 600 18.125 - 600 18.008 - 700 18.153 - 700 18.286 - 700 17.342 - 700 21.136 - 700 20.729 - 800 19.585 - 800 19.116 - 800 17.170 - 800 18.549 - 800 18.236 - 900 16.950 - 900 17.883 - 900 17.532 - 900 17.551 - 900 17.804 -1000 20.076 -1000 20.745 -1000 19.657 -1000 18.733 -1000 18.560 -``` - -Sweet spot around 500. Note https://golang.org/pkg/runtime/debug/#SetGCPercent. diff --git a/scripts/time-big-file b/scripts/time-big-file index f660391a46..fef1e07ca2 100755 --- a/scripts/time-big-file +++ b/scripts/time-big-file @@ -1,6 +1,6 @@ #!/bin/bash -mlr="mlr -S" +mlr="mlr" suffix="dkvp" iflag="" diff --git a/todo.txt b/todo.txt index 6e891c3c83..30aa087c42 100644 --- a/todo.txt +++ b/todo.txt @@ -1,10 +1,6 @@ ================================================================ PUNCHDOWN LIST -* numeric-inference perf - o README-profiling.md -- update ... - o README-go-port.md re various scripts - * blockers: - fractional-strptime - improved regex doc w/ lots of examples