
Commit

Merge pull request #1880 from jqnatividad/jsonp
`jsonp`: add `jsonp` command allowing non-nested JSON to CSV conversion with Polars
jqnatividad committed Jun 15, 2024
2 parents b15c6fd + b01ab5e commit 7cd8fcf
Showing 6 changed files with 222 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -56,6 +56,7 @@
| [join](/src/cmd/join.rs#L2) | Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast. |
| [joinp](/src/cmd/joinp.rs#L2)<br>✨🚀🐻‍❄️ | Inner, outer, cross, anti, semi & asof joins using the [Pola.rs](https://www.pola.rs) engine. Unlike the `join` command, `joinp` can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports [asof joins](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html) (which is [particularly useful for time series data](https://github.com/jqnatividad/qsv/blob/30cc920d0812a854fcbfedc5db81788a0600c92b/tests/test_joinp.rs#L509-L983)) & its output doesn't have duplicate columns. However, `joinp` doesn't have an --ignore-case option & it doesn't support right outer joins. |
| [jsonl](/src/cmd/jsonl.rs#L2)<br>🚀🔣 | Convert newline-delimited JSON ([JSONL](https://jsonlines.org/)/[NDJSON](http://ndjson.org/)) to CSV. See `tojsonl` command to convert CSV to JSONL. |
| [jsonp](/src/cmd/jsonp.rs#L2)<br> | Convert non-nested JSON to CSV. Only available with the polars feature enabled. |
| <a name="luau_deeplink"></a><br>[luau](/src/cmd/luau.rs#L2) 👑<br>✨📇🌐🔣 ![CKAN](docs/images/ckan.png) | Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a [Luau](https://luau-lang.org) [0.625](https://github.com/Roblox/luau/releases/tag/0.625) expression/script for every row of a CSV file ([sequential mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L254-L298)), or using [random access](https://www.webopedia.com/definitions/random-access/) with an index ([random access mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L367-L415)).<br>Can process a single Luau expression or [full-fledged data-wrangling scripts using lookup tables](https://github.com/dathere/qsv-lookup-tables#example) with discrete BEGIN, MAIN and END sections.<br> It is not just another qsv command, it is qsv's [Domain-specific Language](https://en.wikipedia.org/wiki/Domain-specific_language) (DSL) with [numerous qsv-specific helper functions](https://github.com/jqnatividad/qsv/blob/113eee17b97882dc368b2e65fec52b86df09f78b/src/cmd/luau.rs#L1356-L2290) to build production data pipelines. |
| [partition](/src/cmd/partition.rs#L2) | Partition a CSV based on a column value. |
| [prompt](/src/cmd/prompt.rs#L2) | Open a file dialog to pick a file. |
123 changes: 123 additions & 0 deletions src/cmd/jsonp.rs
@@ -0,0 +1,123 @@
static USAGE: &str = r#"
Convert non-nested JSON to CSV (polars feature only).

You may provide JSON data either from stdin or a file path.
This command may not work with nested JSON data.

As a basic example, say we have a file fruits.json with contents:

[
    {
        "fruit": "apple",
        "price": 2.5
    },
    {
        "fruit": "banana",
        "price": 3.0
    }
]

To convert it to CSV format, run:

qsv jsonp fruits.json

And the following is printed to the terminal:

fruit,price
apple,2.5
banana,3.0

To read fruits.json from stdin, either pass - as the input or omit the file path. For example:

cat fruits.json | qsv jsonp -

For more examples, see https://github.com/jqnatividad/qsv/blob/master/tests/test_jsonp.rs.

Usage:
    qsv jsonp [options] [<input>]
    qsv jsonp --help

jsonp options:
    --datetime-format <fmt>   The datetime format to use when writing datetimes.
                              See https://docs.rs/chrono/latest/chrono/format/strftime/index.html
                              for the list of valid format specifiers.
    --date-format <fmt>       The date format to use when writing dates.
    --time-format <fmt>       The time format to use when writing times.
    --float-precision <arg>   The number of digits of precision to use when writing floats.
    --wnull-value <arg>       The string to use when WRITING null values.

Common options:
    -h, --help                Display this message
    -o, --output <file>       Write output to <file> instead of stdout.
"#;

use std::io::{Cursor, Read, Seek, SeekFrom, Write};

use polars::prelude::*;
use serde::Deserialize;

use crate::{util, CliResult};

#[derive(Deserialize)]
struct Args {
    arg_input:            Option<String>,
    flag_datetime_format: Option<String>,
    flag_date_format:     Option<String>,
    flag_time_format:     Option<String>,
    flag_float_precision: Option<usize>,
    flag_wnull_value:     Option<String>,
    flag_output:          Option<String>,
}

pub fn run(argv: &[&str]) -> CliResult<()> {
    let args: Args = util::get_args(USAGE, argv)?;

    fn df_from_stdin() -> PolarsResult<DataFrame> {
        // Create a buffer in memory for stdin
        let mut buffer: Vec<u8> = Vec::new();
        let stdin = std::io::stdin();
        stdin.lock().read_to_end(&mut buffer)?;
        JsonReader::new(Box::new(std::io::Cursor::new(buffer))).finish()
    }

    fn df_from_path(path: String) -> PolarsResult<DataFrame> {
        JsonReader::new(std::fs::File::open(path)?).finish()
    }

    // Read from stdin when the input is "-" or no path was given
    let df = match args.arg_input.clone() {
        Some(path) => {
            if path == "-" {
                df_from_stdin()?
            } else {
                df_from_path(path)?
            }
        },
        None => df_from_stdin()?,
    };

    fn df_to_csv<W: Write>(mut writer: W, mut df: DataFrame, args: &Args) -> PolarsResult<()> {
        CsvWriter::new(&mut writer)
            .with_datetime_format(args.flag_datetime_format.clone())
            .with_date_format(args.flag_date_format.clone())
            .with_time_format(args.flag_time_format.clone())
            .with_float_precision(args.flag_float_precision)
            .with_null_value(args.flag_wnull_value.clone().unwrap_or_default())
            .include_bom(util::get_envvar_flag("QSV_OUTPUT_BOM"))
            .finish(&mut df)?;
        Ok(())
    }

    if let Some(output_path) = args.flag_output.clone() {
        let mut output = std::fs::File::create(output_path)?;
        df_to_csv(&mut output, df, &args)?;
    } else {
        // Write the CSV to an in-memory buffer, then print it to stdout
        let mut res = Cursor::new(Vec::new());
        df_to_csv(&mut res, df, &args)?;
        res.seek(SeekFrom::Start(0))?;
        let mut out = String::new();
        res.read_to_string(&mut out)?;
        println!("{out}");
    }

    Ok(())
}
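
The input-selection rule in `run` (no path or `-` means stdin, anything else is treated as a file path) and the buffering step in `df_from_stdin` can be tried out without Polars. The following is a minimal sketch using only the standard library. `InputSource`, `resolve_input`, and `buffer_reader` are hypothetical names introduced for illustration and are not part of qsv.

```rust
use std::io::{self, Cursor, Read};

/// Where the JSON should be read from (hypothetical helper type, not in qsv).
#[derive(Debug, PartialEq)]
enum InputSource {
    Stdin,
    Path(String),
}

/// Mirrors the `match` in `run`: a missing argument or "-" selects stdin.
fn resolve_input(arg_input: Option<&str>) -> InputSource {
    match arg_input {
        None | Some("-") => InputSource::Stdin,
        Some(path) => InputSource::Path(path.to_string()),
    }
}

/// Reads any reader to the end and wraps the bytes in a seekable `Cursor`,
/// similar to what `df_from_stdin` does before handing data to `JsonReader`.
fn buffer_reader<R: Read>(mut reader: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer)?;
    Ok(Cursor::new(buffer))
}
```

Calling `resolve_input(Some("-"))` and `resolve_input(None)` both yield `InputSource::Stdin`, which matches the two stdin branches in the command. Collapsing them into one match arm is a stylistic choice of this sketch, not something the commit does.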
2 changes: 2 additions & 0 deletions src/cmd/mod.rs
@@ -46,6 +46,8 @@ pub mod join;
pub mod joinp;
#[cfg(any(feature = "feature_capable", feature = "lite"))]
pub mod jsonl;
#[cfg(feature = "polars")]
pub mod jsonp;
#[cfg(feature = "luau")]
pub mod luau;
#[cfg(any(feature = "feature_capable", feature = "lite"))]
8 changes: 8 additions & 0 deletions src/main.rs
@@ -144,6 +144,10 @@ fn main() -> QsvExitCode {

enabled_commands.push_str(" jsonl Convert newline-delimited JSON files to CSV\n");

#[cfg(all(feature = "polars", feature = "feature_capable"))]
enabled_commands
.push_str(" jsonp Convert non-nested JSON to CSV (polars feature only)\n");

#[cfg(all(feature = "luau", feature = "feature_capable"))]
enabled_commands.push_str(" luau Execute Luau script on CSV data\n");

@@ -356,6 +360,8 @@ enum Command {
#[cfg(all(feature = "polars", feature = "feature_capable"))]
JoinP,
Jsonl,
#[cfg(all(feature = "polars", feature = "feature_capable"))]
JsonP,
#[cfg(all(feature = "luau", feature = "feature_capable"))]
Luau,
Partition,
@@ -445,6 +451,8 @@ impl Command {
#[cfg(all(feature = "polars", feature = "feature_capable"))]
Command::JoinP => cmd::joinp::run(argv),
Command::Jsonl => cmd::jsonl::run(argv),
#[cfg(all(feature = "polars", feature = "feature_capable"))]
Command::JsonP => cmd::jsonp::run(argv),
#[cfg(all(feature = "luau", feature = "feature_capable"))]
Command::Luau => cmd::luau::run(argv),
Command::Partition => cmd::partition::run(argv),
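
The hunks above register `jsonp` behind `#[cfg(...)]` attributes, so the command only exists in builds where both the `polars` and `feature_capable` features are enabled. Below is a minimal standalone sketch of that gating pattern. The function names and the column spacing inside the strings are assumptions for illustration. In a build without those features, the guarded statement is simply not compiled.

```rust
/// Builds a command list the way `main` does: a `push_str` guarded by
/// `#[cfg(...)]` is compiled in only when its features are enabled.
fn enabled_commands() -> String {
    let mut enabled_commands = String::new();

    enabled_commands.push_str("    jsonl       Convert newline-delimited JSON files to CSV\n");

    // Compiled only when both features are on (spacing here is illustrative).
    #[cfg(all(feature = "polars", feature = "feature_capable"))]
    enabled_commands
        .push_str("    jsonp       Convert non-nested JSON to CSV (polars feature only)\n");

    enabled_commands
}

/// The same condition evaluated as a boolean via the `cfg!` macro.
fn jsonp_enabled() -> bool {
    cfg!(all(feature = "polars", feature = "feature_capable"))
}
```

Because the attribute removes the statement at compile time, a disabled `jsonp` adds nothing to the binary, whereas `cfg!` only yields a constant `bool` and keeps both branches type-checked.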
86 changes: 86 additions & 0 deletions tests/test_jsonp.rs
@@ -0,0 +1,86 @@
use crate::workdir::Workdir;

#[test]
fn jsonp_simple() {
    let wrk = Workdir::new("jsonp_simple");
    wrk.create_from_string(
        "data.json",
        r#"[{"id":1,"father":"Mark","mother":"Charlotte","oldest_child":"Tom","boy":true},
        {"id":2,"father":"John","mother":"Ann","oldest_child":"Jessika","boy":false},
        {"id":3,"father":"Bob","mother":"Monika","oldest_child":"Jerry","boy":true}]"#,
    );
    let mut cmd = wrk.command("jsonp");
    cmd.arg("data.json");

    let got: Vec<Vec<String>> = wrk.read_stdout(&mut cmd);
    let expected = vec![
        svec!["id", "father", "mother", "oldest_child", "boy"],
        svec!["1", "Mark", "Charlotte", "Tom", "true"],
        svec!["2", "John", "Ann", "Jessika", "false"],
        svec!["3", "Bob", "Monika", "Jerry", "true"],
    ];
    assert_eq!(got, expected);
}

#[test]
fn jsonp_fruits_stats() {
    let wrk = Workdir::new("jsonp_fruits_stats");
    wrk.create_from_string(
        "data.json",
        r#"[{"field":"fruit","type":"String","is_ascii":true,"sum":null,"min":"apple","max":"strawberry","range":null,"min_length":5,"max_length":10,"mean":null,"stddev":null,"variance":null,"nullcount":0,"max_precision":null,"sparsity":0},{"field":"price","type":"Float","is_ascii":null,"sum":7,"min":"1.5","max":"3.0","range":1.5,"min_length":4,"max_length":4,"mean":2.3333,"stddev":0.6236,"variance":0.3889,"nullcount":0,"max_precision":1,"sparsity":0}]"#,
    );
    let mut cmd = wrk.command("jsonp");
    cmd.arg("data.json");

    let got: String = wrk.stdout(&mut cmd);
    let expected = r#"field,type,is_ascii,sum,min,max,range,min_length,max_length,mean,stddev,variance,nullcount,max_precision,sparsity
fruit,String,true,,apple,strawberry,,5,10,,,,0,,0
price,Float,,7,1.5,3.0,1.5,4,4,2.3333,0.6236,0.3889,0,1,0"#.to_string();
    assert_eq!(got, expected);
}

#[test]
fn jsonp_fruits_stats_fp_2() {
    let wrk = Workdir::new("jsonp_fruits_stats_fp_2");
    wrk.create_from_string(
        "data.json",
        r#"[{"field":"fruit","type":"String","is_ascii":true,"sum":null,"min":"apple","max":"strawberry","range":null,"min_length":5,"max_length":10,"mean":null,"stddev":null,"variance":null,"nullcount":0,"max_precision":null,"sparsity":0},{"field":"price","type":"Float","is_ascii":null,"sum":7,"min":"1.5","max":"3.0","range":1.5,"min_length":4,"max_length":4,"mean":2.3333,"stddev":0.6236,"variance":0.3889,"nullcount":0,"max_precision":1,"sparsity":0}]"#,
    );
    let mut cmd = wrk.command("jsonp");
    cmd.arg("data.json");
    cmd.args(&["--float-precision", "2"]);

    let got: String = wrk.stdout(&mut cmd);
    let expected = r#"field,type,is_ascii,sum,min,max,range,min_length,max_length,mean,stddev,variance,nullcount,max_precision,sparsity
fruit,String,true,,apple,strawberry,,5,10,,,,0,,0
price,Float,,7,1.5,3.0,1.50,4,4,2.33,0.62,0.39,0,1,0"#.to_string();
    assert_eq!(got, expected);
}

#[test]
// Verify that qsv stats fruits.csv has the same content as
// qsv stats fruits.csv | qsv slice --json | qsv jsonp
fn jsonp_fruits_stats_slice_jsonp() {
    let wrk = Workdir::new("jsonp_fruits_stats_slice_jsonp");
    let test_file = wrk.load_test_file("fruits.csv");

    // qsv stats fruits.csv
    let mut stats_cmd = wrk.command("stats");
    stats_cmd.arg(test_file);
    let stats_output: String = wrk.stdout(&mut stats_cmd);
    wrk.create_from_string("stats.csv", stats_output.as_str());

    // qsv slice --json
    let mut slice_cmd = wrk.command("slice");
    slice_cmd.arg("stats.csv");
    slice_cmd.arg("--json");
    let slice_output: String = wrk.stdout(&mut slice_cmd);
    wrk.create_from_string("slice.json", slice_output.as_str());

    // qsv jsonp
    let mut jsonp_cmd = wrk.command("jsonp");
    jsonp_cmd.arg("slice.json");
    let jsonp_output: String = wrk.stdout(&mut jsonp_cmd);

    assert_eq!(stats_output, jsonp_output);
}
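
In `jsonp_fruits_stats_fp_2`, passing `--float-precision 2` turns the float values `1.5`, `2.3333`, `0.6236` and `0.3889` into `1.50`, `2.33`, `0.62` and `0.39`, while the integer and string columns (`sum`, `min`, `max`) are left unchanged. The sketch below reproduces that fixed-precision formatting with Rust's standard formatter. It only illustrates the expected effect and is not how Polars implements `with_float_precision`; `format_float` is a hypothetical helper name.

```rust
/// Formats a float with a fixed number of digits after the decimal point,
/// using the standard library's `{:.*}` precision syntax.
/// Hypothetical helper for illustration; not part of qsv or Polars.
fn format_float(value: f64, precision: usize) -> String {
    format!("{:.*}", precision, value)
}
```

For instance, `format_float(1.5, 2)` pads to `1.50` and `format_float(2.3333, 2)` rounds to `2.33`, matching the `range` and `mean` cells in the expected output of the test.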
2 changes: 2 additions & 0 deletions tests/tests.rs
@@ -79,6 +79,8 @@ mod test_join;
mod test_joinp;
#[cfg(any(feature = "feature_capable", feature = "lite"))]
mod test_jsonl;
#[cfg(all(feature = "polars", not(feature = "datapusher_plus")))]
mod test_jsonp;
#[cfg(feature = "luau")]
mod test_luau;
#[cfg(any(feature = "feature_capable", feature = "lite"))]
