=============================================================== * 404 -- what broke? https://github.com/johnkerl/miller/pull/757/files * https://github.com/johnkerl/miller/issues?q=is%3Aissue+is%3Aopen+label%3Aneeds-documentation * https://squidfunk.github.io/mkdocs-material/setup/setting-up-versioning/ RELEASES * 6.4.0 ideas: ! -r splits for rename, merge-fields, cut ! summing up empty data ! emitv2 ! csv check for empty field names ! strmatch ! transposed output o mrpl exits ... o opt-in type-infers for inf, true, etc o #982 '.' ',' etc o extended field accessors for #763 and #948 (positional & json.nested) o strict mode o awk-like exit o unsparsify -f CSV by default -- ? into CSV record-writer -- ? caveat that record 1 controls all ... o mlr split -- needs an example page along with the tee DSL function - some mlr merge somehow -- ? would need a verb-API refactor ? zlen csv? ? datediff et al. ? rank ? YAML ? #908 inferencing options ? gogll ================================================================ FEATURES ---------------------------------------------------------------- STRICT MODE ? re silent zero-pass for-loops on non-collections: i theme is handling of 'absent' ? what about handling of 'error' ? * improve wording: mlr: couldn't assign variable int function return value from value absent (absent) * need $?x and @?x in the grammar & CST * flags: o mlr -z and mlr put -z o note put has -w (warn) and -W (fatal) - then strict mode includes -W? * tests: mlr --csv --from $exv put -z 'x = 1' mlr --csv --from $exv put -z 'var x = a' mlr --csv --from $exv put -z 'var x = $nonesuch' mlr --csv --from $exv put -z 'var x = $["asdf"]' mlr --csv --from $exv put -z 'var x = $[nonesuch]' mlr --csv --from $exv put -z 'var x = $[[999]]' mlr --csv --from $exv put -z 'var x = $[[[999]]]' mlr --csv --from $exv put -z 'begin { var m = $* }' mlr --csv --from $exv put -z 'var x = @nonesuch' mlr --csv --from $exv put -z 'var x = @["nonesuch"]' mlr --csv --from $exv put -z 'func f(): int {}; $x = f()' mlr --csv --from $exv put -z 'func f() {}; $x = f()' mlr --csv --from $exv put -z 'func f() {return nonesuch}; $x = f()' mlr --csv --from $exv put -z '$env = ENV[nonesuch]' mlr --csv --from $exv put -z '$env = ENV["nonesuch"]' ---------------------------------------------------------------- EXTENDED FIELD ACCESSORS * extract/simplify getter logic outside of full CST/runtime/etc * sketch out putter ... ? what does 'cut -f a.b[3]' even mean? o 'cut -xf a.b[3]' is easy enough ... o field name '@graph' is an issue :( * maybe ......... not DSL per se but a separate "parser" here o delimiters '[' ']' '.' '{' '}' o $foo o ${foo} -- in case the field name has a . in it or what have you x $["foo"] -- don't support this o $foo.bar -- treat this like $foo["bar"] o $foo["bar"] or foo[1] -- contents just a string or an integer (no expressions supported) o $[[2]] -- contents just an integer (no expressions supported) o $[[[3]]] -- contents just an integer (no expressions supported) o note $foo.bar[[[2]]][[3]] needs to work $foo["bar"]={"a":1,"b":{"x":3},"c":4} $foo["bar"][[[2]]][[1]] is "x" o parse as walk-through (with {...} special) until . [] [[]] [[[]]] - example $foo.bar[[[2]]][[3]] L1: $* -> foo L2: -> map-index ["bar"] L3: positional value 2 L4: positional name 3 o maybe ... eval as simply composition of classes - level 1 mlrmap -> mlrval -- $foo or $[[2]] or $[[[3]]] - level 2+ mlrval -> mlrval -- [], [[]], [[[]]], or . * code-comment why special classes & not just DSL o inside [] [[]] [[[]]] are just int/string (maybe slices?); no expressions o no $[...] with strings o no need for runtime.State o move POC from $cst to $mlrval -- ? or maybe its own package? * note that '.' is quite fine in CSV files etc -- also maybe in JSON ........ ? so, maybe only enable this with mlr -e? also maybe extended/non-extended at each & every verb -- ?!? ? maybe if extended/non-extended is unspecified, default it to on for json/jsonl input & off otherwise -- ? * maybe also 'raise' and 'raise_array' verbs (need better names though) ---------------------------------------------------------------- TSV etc ? also: some escapes perhaps for dkvp, xtab, pprint -- ? o nidx is a particular pure-text, leave-as-is ? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS o make/publicize a shorthand for this -- ? o --words && --lines & --paragraphs -- ? ---------------------------------------------------------------- example pages as feature catch-up: * example page for format/scan options o format/unformat x=unformat(1,2); x; string(x); is_error(x) o strmatch o =~ * separate examples section from (new) FAQs section * mlr split -- needs an example page along with the tee DSL function o narrative about in-memory, shard-out to workers, holdouts, sample/bootstrap, etc etc. o ulimit note * new slwin/ewma example entry, with ccump and pgr o slwin --prune (or somesuch) to only emit averages over full windows -- ? ---------------------------------------------------------------- inference: * inf/nan (unquoted) -> DSL o webdocs as in #933 description * for data files: --symbol-true yes --symbol-false off --symbol-infinity inf --symbol-not-available N/A ---------------------------------------------------------------- ! sysdate, sysdate_local; datediff ... ---------------------------------------------------------------- ! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927 ---------------------------------------------------------------- * make a lag-by-n and lead-by-n ---------------------------------------------------------------- ! rank ---------------------------------------------------------------- ! CSV default unsparsify ... note this must trust the *first* record ....... * CSV: acceptance somehow for \" in place of "" o & more flex more generally ---------------------------------------------------------------- ! quoted NIDX - how with whitespace regex -- ? ! quoted DKVP - what about csvlite-style -- ? needs a --dkvplite ? ! pprint emit on schema change, not all-at-end. * new columns-to-arrays and arrays-to-columns for stan format * transpose ... * DKVP/XTAB encodings/decodings for \n etc; non-lite DKVP reader -- or nah? broken already for these cases * csvstat / pandas describe * r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in * non-streaming DSL-enabled cut: https://github.com/johnkerl/miller/discussions/613 * pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both o plus corresponding docwork * -W to mlr main also -- so it can be used with .mlrrc ? json-triple-quote -- what can be done here? ? dotted-syntax support in verbs? ? full format-string parser for corner cases like "X%08lldX" ? array and string slices on LHS of assignments ? feature/shorthand for repl newline before prompt ? meta: nr, nf, keys ? BIFs as FCFs? ? IP addresses and ranges as a datatype so one could do membership tests like "if 10.10.10.10 in 10.0.0.0/8". ? push/pop/shift/unshift functions ? IIFEs: 'func f = (func(){ return udf})()' ? non-top-level func defs ? support 'x[1]["a"]' etc notation in various verbs? ? sort within nested data structures? ? ast-parex separate mlr auxents entrypoint? ? het ifmt-reading - separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ? - maybe a TBD reader -- ? - InputFormat into Context - TBD writer -- defer factory until first context? - deeper refactor pulling format out of reader/writer options entirely -- ? ? multiple-valued return/assign -- ? ? array destructure at LHS for multi-retval assign (maps too?) ? new file formats o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though) o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon ? string/array slices on assignment LHS -- ? ? zip -- but that is an archive case too not just an encoding case ? miller support for archive-traversal; directory-traversal even w/o zip -- ? ? as 6.1 follow-on work -- ? * mlr -k o various test cases o OLH re limitations o check JSON-parser x 2 -- is there really a 'restart'? - infinite-loop avoidance for sure ================================================================ UX ? NF? ! bnf fix for '[[' ']]' etc -- make it a nesting of singles. since otherwise no '[[3,4]]' literals :( ! broadly rethink os.Exit, especially as affecting mlr repl * consider expanding '(error)' to have more useful error-text * sync-print option; or (yuck) another xprint variant; or ...; emph dump/eprint * strptime w/ ...00.Z -> error * short 'asserting' functions (absent/error); and/or strict or somesuch (DSL, or global ...) * header-only csv ... reminder ... * bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402 https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command. * tilde-expand for REPL load/open: if '~' is at the start of the string, run it though 'sh -c echo' o or, simpler, just slice[1:] and prepend with os.Getenv("HOME") ? fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%" ? main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names ? trace-mode ? ? $0 as raw-record string -- ? would make mlr grep simpler and more natural ... ================================================================ DOC * issue on gnu parallel -> link to scripting page * https://stackoverflow.com/questions/71011603/is-there-a-simple-way-to-convert-a-csv-with-0-indexed-paths-as-keys-to-json-with/71015190#71015190 ! slwin / unformat narrative: o prototype in DSL o mlr put -f / mlr -s o and, if it's frequent then submit a feature request b/c other people probably also would like it! :) ! link to SE table ... https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715 * put -f link in scrpting.html * no --load in .mlrrc b/c https://github.com/johnkerl/miller/security/advisories/GHSA-mw2v-4q78-j2cw (system() in begin{}) but suggest alias * surface readme-versions.md @ install & tell them how to get current o in particular, older distros won't auto-update * how-to-contribute @ rmd * more clear miller docs == head * go version & why not generics (not all distros w/ newer go compilers) * document the fill-empty verb in https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling * mlr R numpy/scipy/pandas ref. * ffmts page <-> R read.table read.csv ... * https://github.com/agarrharr/awesome-cli-apps#data-manipulation * "Miller, if I understand correctly, was born for those who already frequented the command line, awk, for users with at least average data skills. While for me, a basic user, it made me discover how it gives the possibility of analyzing and transforming data, even complex ones, with a very low learning curve." * document cloudthings, e.g. o go.yml o codespell.yml - codespell --check-filenames --skip *.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases --ignore-words-list denom,inTerm,inout,iput,nd,nin,numer,Wit,te,wee o readthedocs triggers * doc o wut h1 spacing before/after ... o shell-commands: while-read example from issues E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator o reference-main-arithmetic: ? test stats1/step -F flag o reference-dsl-control-structures: e while (NR < 10) will never terminate as NR is only incremented between records -> and each expression is invoked once per record so once for NR=1, once for NR=2, etc. o C-style triple-for loops: loop to NR -> NO!!! o Since uninitialized out-of-stream variables default to 0 for addition/subtraction and 1 for multiplication when they appear on expression right-hand sides (not quite as in awk, where they'd default to 0 either way) <-> xlink to other page r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section ! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2 - needs UTs as well o while-read example from issues w contact re https://jsonlines.org/on_the_web/ * verslink old relnotes * single UT, hard to invoke w/ new full go.mod path go test $(ls internal/pkg/lib/*.go|grep -v test) internal/pkg/lib/unbackslash_test.go etc * file-formats: NIDX link to headerless CSV * window.mlr, window2.mlr -> doc somewhere * sec2gmt --millis/--micros/--nanos doc * sort-within-records --recursive doc * back-incompat: mlr -n put $vflag '@x=1; dump > stdout, @x' mlr -n put $vflag '@x=1; dump > stdout @x' * single cheatsheet page -- put out RFH? https://twitter.com/icymi_py/status/1426622817785765898/photo/1 * more about HTTP logs in miller -- doc and encapsulate: mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27' * hofs section on typedecls * write up how git status after test should show any missed extra-outs * docs nest simplers now that we have getoptish * zlib: n.b. brew install pigz, then pigz -z * doc shift/unshift as using [2:] and append * comment schema-change supported only in csvlite reader, not csv reader * consider -w/-W to stderr not stdout -- ? ? grep out all error message from regtest outputs & doc them all & make sure index-searchable at readthedocs =============================================================== TESTING ! ./mlr vs mlr ... ! pos/neg 0x/0b/0o UTs * RT ngrams.sh -v -o 1 one-word-list.txt * JIT UT: o JSON I/O o mlrval_cmp.go o mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa o dash-O and octal infer o populate each bifs/X_test.go for each bifs/X.go etc etc * xtab splitter UT; nidx too * hofs+typedecls RT cases * UT per se for lrec ops ================================================================ PERFORMANCE ! profile sort * JSON perf -- try alternate packages to encoding/json * more perf? - batchify source-quench -- experiment 1st - further channelize (CSV-first focus) mlrval infer vs record-put ? ? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ? ================================================================ CODE-NEATENS / QUALITY * go-style assert.Equal(t, actual, expected) in place of R style expect_equal(expected, actual) in .../*_test.go ! ast namings etc ? make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token ! broad commenting pass / TODO * []Mlrval -> []*Mlrval ? * cmp for array & map o JIT neatens - carefully read all $mlv files - check $types & $bifs as well - proofread all mlrval_cmp.go dispo mxes - update rmds x several * https://staticcheck.io/docs o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ... * funcptr away the ifs/ifsregex check in record-readers * all case-files could use top-notes * godoc neatens at func/const/etc level * unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2? * check triple-dash at mlr fill-down-h ; check others * clean up unused exitCode arg in sort/put usage. o also document pre/post conditions for flag and non-flag usages x all mappers * neaten mlr gap -g (default) print * typedecls w/ for-multi: C semantics 'k1: data k2: a:x v:1', sigh ... * fill-down make columns required. also, --all. * bitwise_and_dispositions et al should not have _absn for collections -- _erro instead * libify errors.New callsites for DSL/CST * record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being done by ChannelWriter, which is very small. opportunity to refactor. * widen DSL coverage o err-return for array/map get/put if incorrect types ... currently go-void ... ! the DSL needs a full, written-down-and-published spell-out of error-eval semantics * double-check rand-seeding o all rand invocations should go through the seeder for UT/other determinism * dev-note on why `int` not `int64` -- processor-arch & those who most need it get it ? gzout, bz2out -- ? make sure this works through tee? bleah ... ? golinter? ? go exe 17MB b/c gocc -- gogll ? ================================================================ INFO i @all-contributors please add @person for ideas i @all-contributors please add @person for documentation i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision i https://framework.frictionlessdata.io/docs/tutorials/working-with-cli/ i https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/ i go tool nm -size mlr | sort -nrk 2 i https://go.dev/play/p/hrB77U3d0S3?v=gotip * godoc notes: o go get golang.org/x/tools/cmd/godoc o dev mode: godoc -http=:6060 -goroot . o publish: godoc -http=:6060 -goroot . cd ~/tmp/bar wget -p -k http://localhost:6060/pkg mv localhost:6060 miller6 file:///Users/kerl/tmp/bar/miller6/pkg maybe publish to ISP space * pkg graph: go get github.com/kisielk/godepgraph godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png flamegraph etc double-check * more data formats: https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf