This is a CLI companion to DataStation (a GUI) for running SQL queries against data files. So if you want the GUI version of this, check out DataStation.
Binaries for amd64 (x86_64) are provided for each release.
On macOS or Linux, you can run the following:
$ VERSION=0.1.0
$ curl -LO "https://github.com/multiprocessio/dsq/releases/download/$VERSION/dsq-$(uname -s | awk '{ print tolower($0) }')-x64-$VERSION.zip"
$ unzip dsq*.zip
$ sudo mv dsq /usr/local/bin/dsq
Or install manually from the releases
page, unzip and add
dsq
to your $PATH
.
Download the latest Windows
release, unzip it,
and add dsq
to your $PATH
.
If you are on another platform or architecture or want to grab the latest release, you can do so with Go 1.17+:
$ go install github.com/multiprocessio/dsq@latest
You can either pipe data to dsq
or you can pass a file name to it.
If you are passing a file, it must have the usual extension for its content type.
For example:
$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"
Or:
$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"
When piping data to dsq
you need to set the -s
flag and specify
the file extension or MIME type.
For example:
$ cat testdata.csv | dsq -s csv "SELECT * FROM {} LIMIT 1"
Or:
$ cat testdata.parquet | dsq -s parquet "SELECT COUNT(1) FROM {}"
You can pass multiple files to DSQ. As long as they are supported data
files in a valid format, you can run SQL against all files as
tables. Each table can be accessed by the string {N}
where N
is the
0-based index of the file in the list of files passed on the
commandline.
For example this joins two datasets of differing origin types (CSV and JSON).
$ dsq testdata/join/users.csv testdata/join/ages.json \
"select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
As a shorthand for dsq testdata.csv "SELECT * FROM {}"
to convert
supported file types to JSON you can skip the query and the converted
JSON will be dumped to stdout.
For example:
$ dsq testdata.csv
[{...some csv data...},{...some csv data...},...]
It's easiest to show an example. Let's say you have the following JSON file called user_addresses.json
:
[
{"name": "Agarrah", "location": {"city": "Toronto", "address": { "number": 1002 }}},
{"name": "Minoara", "location": {"city": "Mexico City", "address": { "number": 19 }}},
{"name": "Fontoon", "location": {"city": "New London", "address": { "number": 12 }}}
]
You can query the nested fields like so:
$ dsq user_addresses.json 'SELECT name, "location.city" FROM {}'
And if you need to disambiguate the table:
$ dsq user_addresses.json 'SELECT name, {}."location.city" FROM {}'
Nested objects are collapsed and their new column name becomes the
JSON path to the value connected by .
. Actual dots in the path must
be escaped with a backslash. Since .
is a special character in SQL
you must quote the whole new column name.
Nested objects within arrays are still ignored/dropped by dsq
. So if
you have data like this:
[
{"field1": [1]},
{"field1": [2]},
]
You cannot access any data within field1
. You will need to
preprocess your data with some other tool.
You cannot query whole objects, you must ask for a specific path that results in a scalar value.
For example in the user_addresses.json
example above you CANNOT do this:
$ dsq user_addresses.json 'SELECT name, {}."location" FROM {}'
Because location
is not a scalar value. It is an object.
Name | File Extension(s) | Notes |
---|---|---|
CSV | csv |
|
TSV | tsv , tab |
|
JSON | json |
Must be an array of objects. |
Newline-delimited JSON | ndjson , jsonl |
|
Parquet | parquet |
|
Excel | xlsx , xls |
Currently only works if there is only one sheet. |
ODS | ods |
Currently only works if there is only one sheet. |
Apache Error Logs | text/apache2error |
Currently only works if being piped in. |
Apache Access Logs | text/apache2access |
Currently only works if being piped in. |
Nginx Access Logs | text/nginxaccess |
Currently only works if being piped in. |
Under the hood dsq uses DataStation as a library and under that hood DataStation uses SQLite to power these kinds of SQL queries on arbitrary (structured) data.
The speed column is based on rough benchmarks based on q's benchmarks. Eventually I'll do a more thorough and public benchmark.
Name | Link | Speed | Supported File Types | Engine |
---|---|---|---|---|
q | https://harelba.github.io/q/ | Fast | CSV, TSV | Uses SQLite |
textql | https://github.com/dinedal/textql | Ok | CSV, TSV | Uses SQLite |
octoql | https://github.com/cube2222/octosql | Slow | JSON, CSV, Excel, Parquet | Custom engine missing many features from SQLite |
dsq | Here | Ok | CSV, TSV, JSON, Newline-delimited JSON, Parquet, Excel, ODS (OpenOffice Calc), Logs | Uses SQLite |
Download the app and use it! Report bugs on Discord.
Before starting on any new feature though, check in on Discord!
If you want to hear about new features and how this works under the hood, sign up here.
This software is licensed under an Apache 2.0 license.