Skip to content

Commit

Permalink
mlr sparsify (#1498)
Browse files Browse the repository at this point in the history
* mlr sparsify

* regression-test cases

* typofix

* Remove mods due to processor-architecture change
  • Loading branch information
johnkerl committed Feb 18, 2024
1 parent cd6d427 commit f5eaf29
Show file tree
Hide file tree
Showing 30 changed files with 424 additions and 19 deletions.
55 changes: 36 additions & 19 deletions docs/src/reference-verbs.md
Original file line number Diff line number Diff line change
Expand Up @@ -3126,6 +3126,23 @@ a b c
9 8 7
</pre>

## sparsify

<pre class="pre-highlight-in-pair">
<b>mlr sparsify --help</b>
</pre>
<pre class="pre-non-highlight-in-pair">
Usage: mlr sparsify [options]
Unsets fields for which the key is the empty string (or, optionally, another
specified value). Only makes sense with output format not being CSV or TSV.
Options:
-s {filler string} What values to remove. Defaults to the empty string.
-f {a,b,c} Specify field names to be operated on; any other fields won't be
modified. The default is to modify all fields.
-h|--help Show this message.
Example: if input is a=1,b=,c=3 then output is a=1,c=3.
</pre>

## split

<pre class="pre-highlight-in-pair">
Expand Down Expand Up @@ -3409,14 +3426,14 @@ fields, optionally categorized by one or more fields.
<b> data/medium</b>
</pre>
<pre class="pre-non-highlight-in-pair">
x_y_cov 0.000042574820827444476
x_y_corr 0.0005042001844467462
y_y_cov 0.08461122467974003
x_y_cov 0.00004257482082749404
x_y_corr 0.0005042001844473328
y_y_cov 0.08461122467974005
y_y_corr 1
x2_xy_cov 0.04188382281779374
x2_xy_corr 0.630174342037994
x2_y2_cov -0.00030953725962542085
x2_y2_corr -0.0034249088761121966
x2_xy_cov 0.041883822817793716
x2_xy_corr 0.6301743420379936
x2_y2_cov -0.0003095372596253918
x2_y2_corr -0.003424908876111875
</pre>

<pre class="pre-highlight-in-pair">
Expand All @@ -3425,12 +3442,12 @@ x2_y2_corr -0.0034249088761121966
<b> data/medium</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2 y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2 xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2
pan 0.01702551273681908 0.5004028922897639 2081 0.00028691820445814767 1 0 2081 1 0.8781320866715662 0.11908230147563566 2081 0.41749827377311266
eks 0.0407804923685586 0.48140207967651016 1965 0.0016461239223448587 1 0 1965 1 0.8978728611690183 0.10734054433612333 1965 0.45563223864254526
wye -0.03915349075204814 0.5255096523974456 1966 0.0015051268704373607 1 0 1966 1 0.8538317334220835 0.1267454301662969 1966 0.38991721818599295
zee 0.0027812364960399147 0.5043070448033061 2047 0.000007751652858786137 1 0 2047 1 0.8524439912011013 0.12401684308018937 2047 0.39356598090006495
hat -0.018620577041095078 0.5179005397264935 1941 0.0003520036646055585 1 0 1941 1 0.8412305086345014 0.13557328318623216 1941 0.3687944261732265
a x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2 y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2 xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2
pan 0.017025512736819345 0.500402892289764 2081 0.00028691820445815624 1 -0.00000000000000002890430283104539 2081 1 0.8781320866715664 0.11908230147563569 2081 0.4174982737731127
eks 0.04078049236855813 0.4814020796765104 1965 0.0016461239223448218 1 0.00000000000000017862676354313703 1965 1 0.897872861169018 0.1073405443361234 1965 0.4556322386425451
wye -0.03915349075204785 0.5255096523974457 1966 0.0015051268704373377 1 0.00000000000000004464425401127647 1966 1 0.8538317334220837 0.1267454301662969 1966 0.3899172181859931
zee 0.0027812364960401333 0.5043070448033061 2047 0.000007751652858787357 1 0.00000000000000004819404567023685 2047 1 0.8524439912011011 0.12401684308018947 2047 0.39356598090006495
hat -0.018620577041095272 0.5179005397264937 1941 0.00035200366460556604 1 -0.00000000000000003400445761787692 1941 1 0.8412305086345017 0.13557328318623207 1941 0.3687944261732266
</pre>

Here's an example simple line-fit. The `x` and `y`
Expand Down Expand Up @@ -3516,11 +3533,11 @@ upsec_count_pca_quality 0.9999590846136102
donesec 92.33051350964094

color purple
upsec_count_pca_m -39.03009744795354
upsec_count_pca_b 979.9883413064914
upsec_count_pca_m -39.030097447953594
upsec_count_pca_b 979.9883413064917
upsec_count_pca_n 21
upsec_count_pca_quality 0.9999908956206317
donesec 25.10852919630297
donesec 25.108529196302943
</pre>

## step
Expand Down Expand Up @@ -3797,9 +3814,9 @@ distinct_count 5 5 10000 10000 10000
mode pan wye 1 0.3467901443380824 0.7268028627434533
sum 0 0 50005000 4986.019681679581 5062.057444929905
mean - - 5000.5 0.49860196816795804 0.5062057444929905
stddev - - 2886.8956799071675 0.2902925151144007 0.290880086426933
var - - 8334166.666666667 0.08426974433144456 0.08461122467974003
skewness - - 0 -0.0006899591185521965 -0.017849760120133784
stddev - - 2886.8956799071675 0.29029251511440074 0.2908800864269331
var - - 8334166.666666667 0.08426974433144457 0.08461122467974005
skewness - - 0 -0.0006899591185517494 -0.01784976012013298
minlen 3 3 1 15 13
maxlen 3 3 5 22 22
min eks eks 1 0.00004509679127584487 0.00008818962627266114
Expand Down
6 changes: 6 additions & 0 deletions docs/src/reference-verbs.md.in
Original file line number Diff line number Diff line change
Expand Up @@ -995,6 +995,12 @@ GENMD-RUN-COMMAND
mlr --ijson --opprint sort-within-records data/sort-within-records.json
GENMD-EOF

## sparsify

GENMD-RUN-COMMAND
mlr sparsify --help
GENMD-EOF

## split

GENMD-RUN-COMMAND
Expand Down
1 change: 1 addition & 0 deletions pkg/transformers/aaa_transformer_table.go
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ var TRANSFORMER_LOOKUP_TABLE = []TransformerSetup{
SkipTrivialRecordsSetup,
SortSetup,
SortWithinRecordsSetup,
SparsifySetup,
SplitSetup,
SsubSetup,
Stats1Setup,
Expand Down
192 changes: 192 additions & 0 deletions pkg/transformers/sparsify.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
package transformers

import (
"container/list"
"fmt"
"os"
"strings"

"github.com/johnkerl/miller/pkg/cli"
"github.com/johnkerl/miller/pkg/lib"
"github.com/johnkerl/miller/pkg/mlrval"
"github.com/johnkerl/miller/pkg/types"
)

// ----------------------------------------------------------------
const verbNameSparsify = "sparsify"

var SparsifySetup = TransformerSetup{
Verb: verbNameSparsify,
UsageFunc: transformerSparsifyUsage,
ParseCLIFunc: transformerSparsifyParseCLI,
IgnoresInput: false,
}

func transformerSparsifyUsage(
o *os.File,
) {
fmt.Fprintf(o, "Usage: %s %s [options]\n", "mlr", verbNameSparsify)
fmt.Fprint(o,
`Unsets fields for which the key is the empty string (or, optionally, another
specified value). Only makes sense with output format not being CSV or TSV.
`)

fmt.Fprintf(o, "Options:\n")
fmt.Fprintf(o, "-s {filler string} What values to remove. Defaults to the empty string.\n")
fmt.Fprintf(o, "-f {a,b,c} Specify field names to be operated on; any other fields won't be\n")
fmt.Fprintf(o, " modified. The default is to modify all fields.\n")
fmt.Fprintf(o, "-h|--help Show this message.\n")

fmt.Fprint(o,
`Example: if input is a=1,b=,c=3 then output is a=1,c=3.
`)
}

func transformerSparsifyParseCLI(
pargi *int,
argc int,
args []string,
_ *cli.TOptions,
doConstruct bool, // false for first pass of CLI-parse, true for second pass
) IRecordTransformer {

// Skip the verb name from the current spot in the mlr command line
argi := *pargi
verb := args[argi]
argi++

fillerString := ""
var specifiedFieldNames []string = nil

for argi < argc /* variable increment: 1 or 2 depending on flag */ {
opt := args[argi]
if !strings.HasPrefix(opt, "-") {
break // No more flag options to process
}
if args[argi] == "--" {
break // All transformers must do this so main-flags can follow verb-flags
}
argi++

if opt == "-h" || opt == "--help" {
transformerSparsifyUsage(os.Stdout)
os.Exit(0)

} else if opt == "-s" {
fillerString = cli.VerbGetStringArgOrDie(verb, opt, args, &argi, argc)

} else if opt == "-f" {
specifiedFieldNames = cli.VerbGetStringArrayArgOrDie(verb, opt, args, &argi, argc)

} else {
transformerSparsifyUsage(os.Stderr)
os.Exit(1)
}
}

*pargi = argi
if !doConstruct { // All transformers must do this for main command-line parsing
return nil
}

transformer, err := NewTransformerSparsify(
fillerString,
specifiedFieldNames,
)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}

return transformer
}

// ----------------------------------------------------------------
type TransformerSparsify struct {
fillerString string
fieldNamesSet map[string]bool
recordTransformerFunc RecordTransformerFunc
}

func NewTransformerSparsify(
fillerString string,
specifiedFieldNames []string,
) (*TransformerSparsify, error) {

tr := &TransformerSparsify{
fillerString: fillerString,
fieldNamesSet: lib.StringListToSet(specifiedFieldNames),
}
if specifiedFieldNames == nil {
tr.recordTransformerFunc = tr.transformAll
} else {
tr.recordTransformerFunc = tr.transformSome
}

return tr, nil
}

func (tr *TransformerSparsify) Transform(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
HandleDefaultDownstreamDone(inputDownstreamDoneChannel, outputDownstreamDoneChannel)

if !inrecAndContext.EndOfStream {
tr.recordTransformerFunc(
inrecAndContext,
outputRecordsAndContexts,
inputDownstreamDoneChannel,
outputDownstreamDoneChannel,
)
} else {
outputRecordsAndContexts.PushBack(inrecAndContext) // end-of-stream marker
}
}

func (tr *TransformerSparsify) transformAll(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
inrec := inrecAndContext.Record
outrec := mlrval.NewMlrmapAsRecord()

for pe := inrec.Head; pe != nil; pe = pe.Next {
if pe.Value.String() != tr.fillerString {
// Reference OK because ownership transfer
outrec.PutReference(pe.Key, pe.Value)
}
}

outrecAndContext := types.NewRecordAndContext(outrec, &inrecAndContext.Context)
outputRecordsAndContexts.PushBack(outrecAndContext)
}

// ----------------------------------------------------------------
func (tr *TransformerSparsify) transformSome(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
inrec := inrecAndContext.Record
outrec := mlrval.NewMlrmapAsRecord()

for pe := inrec.Head; pe != nil; pe = pe.Next {
if tr.fieldNamesSet[pe.Key] {
if pe.Value.String() != tr.fillerString {
// Reference OK because ownership transfer
outrec.PutReference(pe.Key, pe.Value)
}
} else {
outrec.PutReference(pe.Key, pe.Value)
}
}

outrecAndContext := types.NewRecordAndContext(outrec, &inrecAndContext.Context)
outputRecordsAndContexts.PushBack(outrecAndContext)
}
12 changes: 12 additions & 0 deletions test/cases/cli-help/0001/expout
Original file line number Diff line number Diff line change
Expand Up @@ -988,6 +988,18 @@ Options:
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.

================================================================
sparsify
Usage: mlr sparsify [options]
Unsets fields for which the key is the empty string (or, optionally, another
specified value). Only makes sense with output format not being CSV or TSV.
Options:
-s {filler string} What values to remove. Defaults to the empty string.
-f {a,b,c} Specify field names to be operated on; any other fields won't be
modified. The default is to modify all fields.
-h|--help Show this message.
Example: if input is a=1,b=,c=3 then output is a=1,c=3.

================================================================
split
Usage: mlr split [options] {filename}
Expand Down
1 change: 1 addition & 0 deletions test/cases/verb-sparsify/0001/cmd
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
mlr --c2j --from test/input/sparsify-input.csv sparsify
Empty file.
17 changes: 17 additions & 0 deletions test/cases/verb-sparsify/0001/expout
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
[
{
"a": 1,
"b": 2,
"c": 3
},
{
"a": 4,
"b": 5
},
{},
{
"a": 7,
"b": 8,
"c": 9
}
]
1 change: 1 addition & 0 deletions test/cases/verb-sparsify/0002/cmd
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
mlr --c2j --from test/input/sparsify-input.csv sparsify -f a
Empty file.
21 changes: 21 additions & 0 deletions test/cases/verb-sparsify/0002/expout
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
[
{
"a": 1,
"b": 2,
"c": 3
},
{
"a": 4,
"b": 5,
"c": ""
},
{
"b": "",
"c": ""
},
{
"a": 7,
"b": 8,
"c": 9
}
]
1 change: 1 addition & 0 deletions test/cases/verb-sparsify/0003/cmd
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
mlr --c2j --from test/input/sparsify-input.csv sparsify -f b
Empty file.
21 changes: 21 additions & 0 deletions test/cases/verb-sparsify/0003/expout
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
[
{
"a": 1,
"b": 2,
"c": 3
},
{
"a": 4,
"b": 5,
"c": ""
},
{
"a": "",
"c": ""
},
{
"a": 7,
"b": 8,
"c": 9
}
]
1 change: 1 addition & 0 deletions test/cases/verb-sparsify/0004/cmd
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
mlr --c2j --from test/input/sparsify-input.csv sparsify -f b,c
Empty file.
Loading

0 comments on commit f5eaf29

Please sign in to comment.