Skip to content

jqnatividad/qsv

qsv: Ultra-fast CSV data-wrangling CLI toolkit

Linux build status Windows build status macOS build status Security audit Downloads Clones
Discussions Docs Minimum supported Rust version Crates.io Crates.io downloads

  Table of Contents
qsv logo qsv is a command line program for
indexing, slicing, analyzing, splitting,
enriching, validating & joining CSV files.
Commands are simple, fast & composable.

* Available Commands
* Installation
* Whirlwind Tour
* Cookbook
* FAQ
* Changelog
* Performance Tuning
* Benchmarks
* NYC School of Data 2022 slides
* Sponsor

NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. It also has numerous new features & 53 additional commands/subcommands/operations (for a total of 73). See FAQ for more details.

Available commands

Command Description
apply1 Apply series of string, date, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection).
behead Drop headers from a CSV.
cat Concatenate CSV files by row or by column.
count2 Count the rows in a CSV file. (Instantaneous with an index.)
dedup34 Remove redundant rows.
enum Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
excel Exports a specified Excel/ODS sheet to a CSV file.
exclude2 Removes a set of CSV data from another set based on the specified columns.
explode Explode rows into multiple ones by splitting a column value based on the given separator.
extsort4 Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.
fetch Fetches HTML/data from web pages or web services for every row in a URL column. Comes with jql JSON query language support and optional Redis response caching.
fill Fill empty values.
fixlengths Force a CSV to have same-length records by either padding or truncating them.
flatten A flattened view of CSV records. Useful for viewing one record at a time.
e.g. qsv slice -i 5 data.csv | qsv flatten.
fmt Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
foreach1 Loop over a CSV to execute bash commands. (not available on Windows)
frequency25 Build frequency tables of each column. (Uses multithreading to go faster if an index is present.)
generate1 Generate test data by profiling a CSV using Markov decision process machine learning.
headers Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for frequency, split, stats and schema commands.
input Read CSV data with special quoting, trimming, line-skipping and UTF-8 transcoding rules.
join2 Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast.
jsonl Convert newline-delimited JSON (JSONL/NDJSON) to CSV.
lua1 Execute a Lua script over CSV lines to transform, aggregate or filter them. Embeds Lua 5.4.4.
partition Partition a CSV based on a column value.
pseudo Pseudonymise the value of the given column by replacing them with an incremental identifier.
py1 Evaluate a Python expression over CSV lines to transform, aggregate or filter them. Python's f-strings is particularly useful for extended formatting (Python 3.8+ required).
rename Rename the columns of a CSV efficiently.
replace Replace CSV data using a regex.
reverse3 Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key.
sample2 Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample).
schema5 Infer schema from CSV data and output in JSON Schema format. Uses multithreading to go faster if an index is present. See validate command.
search Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.
searchset Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.
select Select, re-order, duplicate or drop columns.
slice23 Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sniff Quickly sniffs CSV details (delimiter, quote character, number of columns, data types, header row, preamble rows).
sort4 Sorts CSV data in alphabetical, numerical, reverse or random (with optional seed) order.
split25 Split one CSV file into many CSV files of N chunks. (Uses multithreading to go faster if an index is present.)
stats235 Infer data type & compute descriptive statistics for each column in a CSV (sum, min/max, min/max length, mean, stddev, variance, quartiles, IQR, lower/upper fences, skew, median, mode, cardinality & nullcount). Uses multithreading to go faster if an index is present.
table3 Show aligned output of a CSV using elastic tabstops.
transpose3 Transpose rows/columns of a CSV.
validate4 Validate CSV data with JSON Schema (See schema command). If no jsonschema file is provided, validates if a CSV conforms to the RFC 4180 standard.

Installation

Pre-built binaries for Windows, Linux and macOS are available from GitHub.

There are three versions of qsv. qsv supports features, with the pre-built binaries enabling all valid platform features6; qsvlite has all features disabled (half the size of qsv); qsvdp is optimized for use with DataPusher+, with only DataPusher+ relevant commands and the self-update engine removed (a sixth of the size of qsv).

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing qsv using Cargo:

cargo install qsv --features full

If you encounter compilation errors, ensure you're using the exact version of the dependencies qsv was built with by issuing:

cargo install qsv --locked --features full

Compiling from this repository also works similarly:

git clone [email protected]:jqnatividad/qsv.git
cd qsv
cargo build --release --features full
# or if you encounter compilation errors
cargo build --release --locked --features full

The compiled binary will end up in ./target/release/.

To enable optional features, use cargo --features (see Feature Flags for more info):

cargo install qsv --features apply,generate,lua,fetch,foreach,python,full
# or to build qsvlite
cargo install qsv --features lite
# or to build qsvdp
cargo install qsv --features datapusher_plus

# or when compiling from a local repo
cargo build --release --features apply,generate,lua,fetch,foreach,python,full
# for qsvlite
cargo build --release --features lite
# for qsvdp
cargo build --release --features datapusher_plus

Minimum Supported Rust Version

Building qsv requires Rust stable - currently version 1.61.0.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)"\" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion

File formats

qsv recognizes UTF-8/ASCII encoded, CSV (.csv) and TSV files (.tsv and .tab). CSV files are assummed to have "," (comma) as a delimiter, and TSV files, "\t" (tab) as a delimiter. The delimiter is a single ascii character that can be set either by the --delimiter command-line option or with the QSV_DEFAULT_DELIMITER environment variable or automatically detected when QSV_SNIFF_DELIMITER is set.

When using the --output option, note that qsv will UTF-8 encode the file and automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.

JSONL/NDJSON files are also recognized and converted to CSV with the jsonl command.

The fetch command also produces JSONL files when its invoked without the --new-column option.

The sniff and validate commands produce JSON files with their --json and --pretty-json options.

The excel command recognizes Excel and Open Document Spreadsheet(ODS) files (.xls, .xlsx, .xlsm, .xlsb and .ods files).

RFC 4180

qsv validates against the RFC 4180 CSV standard. However IRL, CSV formats vary significantly and qsv is actually not strictly compliant with the specification so it can process "real-world" CSV files. qsv leverages the awesome Rust CSV library, which in turn, is built on top of the csv-core library to read CSV files.

Click here to find out how qsv conforms to the standard with csv-core.

UTF-8 Encoding

qsv requires UTF-8 encoded (of which ASCII is a subset) input files. On startup, it scans the input if it's UTF-8 encoded (for files, the first 8k; for stdin, the entire buffer), and will abort if its not unless QSV_SKIPUTF8_CHECK is set. On Linux and macOS, UTF-8 encoding is the default.

Should you need to reencode CSV/TSV files, you can use the input command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with . Alternatively, there are several utilities you can use to do so on Linux/macOS and Windows.

Windows Usage Note

Unlike other modern operating systems, Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file and trying to open it with Excel (which ignores the comma delimiter, with everything in the first column):

qsv stats wcp.csv > wcpstats.csv

Which is weird, since you would think Microsoft Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file, use the --output option instead:

qsv stats wcp.csv --output wcpstats.csv

Environment Variables

  • QSV_DEFAULT_DELIMITER - single ascii character to use as delimiter. Overrides --delimeter option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files, when not set. Note that this will also set the delimiter for qsv's output to stdout. However, using the --output option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.
  • QSV_SNIFF_DELIMITER - when set, the delimiter is automatically detected. Overrides QSV_DEFAULT_DELIMITER and --delimiter option.
  • QSV_NO_HEADERS - when set, the first row will NOT be interpreted as headers. Supersedes QSV_TOGGLE_HEADERS.
  • QSV_TOGGLE_HEADERS - if set to 1, toggles header setting - i.e. inverts qsv header behavior, with no headers being the default, and setting --no-headers will actually mean headers will not be ignored.
  • QSV_AUTOINDEX - when set, automatically create an index when none is detected. Also automatically updates stale indices.
  • QSV_SKIPUTF8_CHECK - when set, skip UTF-8 encoding check. Otherwise, qsv scans the first 8k of files. For stdin, it scans the entire buffer.
  • QSV_MAX_JOBS - number of jobs to use for multithreaded commands (currently dedup, extsort, frequency, schema, sort, split, stats and validate). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info.
  • QSV_REGEX_UNICODE - if set, makes search, searchset and replace commands unicode-aware. For increased performance, these commands are not unicode-aware and will ignore unicode values when matching and will panic when unicode characters are used in the regex.
  • QSV_RDR_BUFFER_CAPACITY - set to change reader buffer size (bytes - default when not set: 16384)
  • QSV_WTR_BUFFER_CAPACITY - set to change writer buffer size (bytes - default when not set: 65536)
  • QSV_COMMENT_CHAR - set to a comment character which will ignore any lines (including the header) that start with this character (default: comments disabled).
  • QSV_LOG_LEVEL - set to desired level (default - off, error, warn, info, trace, debug).
  • QSV_LOG_DIR - when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.
  • QSV_NO_UPDATE - prohibit self-update version check for the latest qsv release published on GitHub.
  • QSV_REDIS_CONNECTION_STRING - the fetch command can use Redis to cache responses. By default it connects to redis:127.0.0.1:6379. Set to connect to another Redis instance.
  • QSV_REDIS_TTL_SECONDS - by default, Redis cached values have a time-to-live of 2,419,200 seconds (28 days).
  • QSV_REDIS_TTL_REFRESH- set to enable cache hits to refresh TTL of cached values.

Several dependencies also have environment variables that influence qsv's performance & behavior:

NOTE: To get a list of all active qsv-relevant environment variables, run qsv --envlist.

Feature Flags

qsv has several features:

  • mimalloc (default) - use the mimalloc allocator (see Memory Allocator for more info).
  • apply - enable apply command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increases both compile time and binary size.
  • fetch - enable fetch command.
  • generate - enable generate command.
  • full - enable to build qsv.
  • lite - enable to build qsvlite.
  • datapusher_plus - enable to build qsvdp.
  • nightly - enable to turn on nightly/unstable features in the rand and regex creates when building with Rust nightly/unstable.

The following "power-user" commands can be abused and present "foot-shooting" scenarios.

  • lua - enable lua command.
  • foreach - enable foreach command (not valid for Windows).
  • python - enable py command (requires Python 3.8+). Note that qsv will automatically use the currently activated python version when run in a virtual environment.

NOTE: qsvlite, as the name implies, always has non-default features disabled. qsv can be built with any combination of the above features using the cargo --features & --no-default-features flags. The pre-built qsv binaries has all applicable features enabled for the target platform6.

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by
datHere Logo
Standards-based, best-of-breed, open source solutions
to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

Footnotes

  1. enabled by optional feature flag. Not available on qsvlite. 2 3 4 5

  2. uses an index when available. 2 3 4 5 6 7 8

  3. loads the entire CSV into memory. Note that dedup, stats & transpose have modes that do not load the entire CSV into memory. 2 3 4 5 6

  4. multithreaded even without an index. 2 3 4

  5. multithreaded when an index is available. 2 3 4

  6. The foreach feature is not available on Windows. The python feature is not enabled on cross-compiled pre-built binaries as we don't have access to a native python interpreter for those platforms (aarch64, i686, and arm) on GitHub's action runners. Compile natively on those platforms with Python 3.8+ installed, if you want to enable the python feature. 2