Skip to content

A SQLite extension for reading large files line-by-line (NDJSON, logs, txt, etc.)

License

Notifications You must be signed in to change notification settings

asg017/sqlite-lines

Repository files navigation

sqlite-lines

sqlite-lines is a SQLite extension for reading lines from a file or blob.

Benchmark between sqlite-lines and various other data processing tools

See Benchmarks for more info.

Usage

.load ./lines0
select line from lines_read('logs.txt');

sqlite-lines is great for line-oriented datasets, like ndjson or JSON Lines, when paired with SQLite's JSON support. Here, we calculate the top 5 country participants in Google's Quick, Draw! dataset for calendars.ndjson:

select
  line ->> '$.countrycode' as countrycode,
  count(*)
from lines_read('./calendar.ndjson')
group by 1
order by 2 desc
limit 5;
/*
┌─────────────┬──────────┐
│ countrycode │ count(*) │
├─────────────┼──────────┤
│ US          │ 141001   │
│ GB          │ 22560    │
│ CA          │ 11759    │
│ RU          │ 9250     │
│ DE          │ 8748     │
└─────────────┴──────────┘
*/

Use the SQLite CLI's fsdir() table functions with lines_read() to read lines from every file in a directory.

select
  name as file,
  lines.rowid as line_number,
  line
from fsdir('logs')
join lines_read(name) as lines
where name like '%.txt';
/*
┌─────────────────────┬──────┐
│ file  │ line_number | line │
├───────┼─────────────┤──────┤
| a.txt | 1           | x    |
| a.txt | 2           | y    |
| a.txt | 3           | z    |
| b.txt | 1           | xx   |
| b.txt | 2           | yy   |
| c.txt | 1           | xxx  |
└───────┴─────────────┴──────┘
*/

Documentation

See docs.md for a full API Reference and detailed documentation.

Installing

Language Install
Python pip install sqlite-lines PyPI
Datasette datasette install datasette-sqlite-lines Datasette
Node.js npm install sqlite-lines npm
Deno deno.land/x/sqlite_lines deno.land/x release
Ruby gem install sqlite-lines Gem
Github Release GitHub tag (latest SemVer pre-release)

The Releases page contains pre-built binaries for Linux amd64 and MacOS (amd64, no arm).

As a loadable extension

If you want to use sqlite-lines as a Runtime-loadable extension, Download the lines0.dylib (for MacOS) or lines0.so file from a release and load it into your SQLite environment.

Note: The 0 in the filename (lines0.dylib or lines0.so) denotes the major version of sqlite-lines. Currently sqlite-lines is pre v1, so expect breaking changes in future versions.

For example, if you are using the SQLite CLI, you can load the library like so:

.load ./lines0
select lines_version();
-- v0.0.1

Or in Python, using the builtin sqlite3 module:

import sqlite3

con = sqlite3.connect(":memory:")

con.enable_load_extension(True)
con.load_extension("./lines0")

print(con.execute("select lines_version()").fetchone())
# ('v0.0.1',)

Or in Node.js using better-sqlite3:

const Database = require("better-sqlite3");
const db = new Database(":memory:");

db.loadExtension("./lines0");

console.log(db.prepare("select lines_version()").get());
// { 'lines_version()': 'v0.0.1' }

Or with Datasette (using the "no filesystem" version to limit security vulnerabilities):

datasette data.db --load-extension ./lines_nofs0

Windows is not supported - yet!

From the browser with WASM/JavaScript

sqlite-lines is also distributed as a standalone SQL.js library. It's essentially a fork of the original SQL.js library, with the addition of sqlite-lines functions like lines_version() and lines().

Check out this Observable notebook for the full demonstration. The Releases page contains the JavaScript and WASM files.

The sqlite-lines CLI

sqlite-lines comes with an example CLI modeled after ndjson-cli that demos the speed and versatility of sqlite-lines. Download a pre-compiled version from the Releases page, or build yourself with:

make cli
./dist/sqlite-lines

The sqlite-lines CLI reads data from stdin and applies transformations with SQL code through its arguments.

The first argument should be a SQL expression that is used transform a single line from stdlin. The available columns are rowid, which is the "line number" that is being processed, and d, an alias for line, which is the text content of the current line (inspired by ndjson-cli). For example, to uppercase every line from a file with upper():

$ cat names.txt | sqlite-lines 'rowid || upper(d)'
1ALEX
2BRIAN
3CRAIG

This includes SQLite's new JSON -> and ->> operators for NDJSON/JSONL files:

$ cat data.ndjson | sqlite-lines 'd ->> "$.id"'
$ cat data.ndjson | sqlite-lines 'json_object("name", d ->> "$.name", "age": d ->> "$.stats.age")'

The second argument is another SQL expression that's used in the WHERE statement of the underlying SQL query to filter out lines.

# get the names of all people older than 40
cat data.ndjson | sqlite-lines 'd ->> "$.name"' 'd ->> "$.age" > 40'

The third argument is another SQL expression that's used in the GROUP BY statement of the underlying SQL query to aggregate lines.

A Note on CSV Parsing

sqlite-lines isn't a great option for CSVs. Technically you can, but the moment your data has a \n character in a field or header, then you'll get corrupted results.

Instead, you should use the "official" CSV Virtual Table, or use the .import command in the SQLite CLI.