Removing duplicated keys from an ordered dataset, at random

This simple script takes a dataset identifier and the name or identifier of one of its fields, and creates a new dataset containing only one row for each distinct value of that field, on the assumption that all rows with the same value of the key field are contiguous.

The script selects one of each group's rows at random, using two Flatline filters in succession that, together, implement reservoir sampling within each group of rows.

More concretely, we first perform a filter with the following expression:

(let (id (f "key")
      n (count (cond-window "key" (= id (f "key")))))
  (= 0 (rand-int n)))

where "key" is the name of our key field. In the expression above, n is the number of rows below the current one with the same value for the "key" field, and we are selecting the current row with probability 1/n. In a regular reservoir sample, one would stop collecting rows once the first one is selected, but the Flatline filter would go on for all rows, so there's a pretty good chance that more than one row with the same key remains in the filtered dataset. To get our final result, we just keep the first row for each key in the filtered dataset, which is easily accomplished via the Flatline filter

(!= (f "key") (f "key" -1))

So the WhizzML script's job is mostly trivial: generate the two Flatline expressions, request a new dataset for each filter in succession, and delete the intermediate dataset.
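Concretely, the whole workflow fits in a few lines of WhizzML. The sketch below is a minimal illustration rather than the script's actual source: the function name dedup-random and its arguments are hypothetical, and it relies on the standard create-and-wait-dataset, delete and str primitives together with the API's origin_dataset and lisp_filter dataset options.

;; Minimal sketch of the workflow described above.
;; `dataset-id` and `key` are hypothetical input names.
(define (dedup-random dataset-id key)
  (let (fq (str "\"" key "\"")  ;; key field name, quoted for Flatline
        ;; first filter: per-group reservoir-style sampling
        sample-flt (str "(let (id (f " fq ") "
                        "n (count (cond-window " fq " (= id (f " fq "))))) "
                        "(= 0 (rand-int n)))")
        ;; second filter: keep only the first surviving row of each group
        first-flt (str "(!= (f " fq ") (f " fq " -1))")
        sampled (create-and-wait-dataset {"origin_dataset" dataset-id
                                          "lisp_filter" sample-flt})
        result (create-and-wait-dataset {"origin_dataset" sampled
                                         "lisp_filter" first-flt}))
    (prog (delete sampled)  ;; drop the intermediate dataset
          result)))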