
Shipping prebuilt SQLite databases is a security issue #98

Open
KOLANICH opened this issue Feb 15, 2023 · 8 comments
Labels: bug (Something isn't working)

@KOLANICH commented Feb 15, 2023

Describe the bug

Manipulated database files can be used for arbitrary code execution, and SQLite has never guaranteed any security in this case. It is a nice place to conceal backdoors. I understand it is not easy to replace the current SQLAlchemy-based infrastructure... but it still has to be done in the long run.

In the short run you can mitigate the issue by splitting the database into a separate package whose sdist contains the source data in JSON/CSV/TSV format, and generating the SQLite file on the end user's machine during package installation (which is why no prebuilt wheels should be shipped for that package).
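A minimal sketch of what such an install-time build step could look like, assuming one CSV file per table; the `data/` layout, the file-name-equals-table-name convention, and the `elements.db` name are illustrative, not this package's actual structure:

```python
import csv
import sqlite3
from pathlib import Path


def build_db(csv_dir: Path, db_path: Path) -> None:
    """Rebuild the SQLite file from one CSV source file per table."""
    con = sqlite3.connect(db_path)
    try:
        for csv_file in sorted(csv_dir.glob("*.csv")):
            table = csv_file.stem  # assumed convention: file name == table name
            with csv_file.open(newline="") as fh:
                reader = csv.reader(fh)
                header = next(reader)
                cols = ", ".join(f'"{c}"' for c in header)
                marks = ", ".join("?" for _ in header)
                # Columns are created untyped here; a real build step would
                # carry type information in a schema file alongside the CSVs.
                con.execute(f'CREATE TABLE "{table}" ({cols})')
                con.executemany(
                    f'INSERT INTO "{table}" ({cols}) VALUES ({marks})', reader
                )
        con.commit()
    finally:
        con.close()


if __name__ == "__main__":
    build_db(Path("data"), Path("elements.db"))
```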

@lmmentel (Owner) commented
Thanks for your thoughts.

There certainly can be concerns, but I think a combination of factors reduces the risk. First of all, all changes go through PRs, which are reviewed, and the code is tested. Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.

It certainly would be an idea to split the data layer into a separate repo, since there are implementations of this package in Julia and Rust. I'm sure they both have SQLite drivers, but it might be worth putting plain files out there.

A simple solution might be to dump the entire db into csv files corresponding to individual tables, like here. Putting it back together when building a distribution would, however, require substantial extra effort to modify the current CD pipeline. I'm not yet convinced the effort is justified.
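For reference, the dump direction is cheap; a sketch using only the stdlib, where the `elements.db` file name is a placeholder for the actual database file:

```python
import csv
import sqlite3
from pathlib import Path


def dump_tables(db_path: str, out_dir: Path) -> None:
    """Write every table in the SQLite database to its own CSV file."""
    out_dir.mkdir(exist_ok=True)
    con = sqlite3.connect(db_path)
    try:
        tables = [
            name for (name,) in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        for table in tables:
            cur = con.execute(f'SELECT * FROM "{table}"')
            with (out_dir / f"{table}.csv").open("w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(col[0] for col in cur.description)  # header row
                writer.writerows(cur)
    finally:
        con.close()


dump_tables("elements.db", Path("csv_dump"))
```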

@KOLANICH (Author) commented Feb 15, 2023

> Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.

It doesn't work this way. One just needs to run a query against a manipulated SQLite file to get an adversary's shellcode executed.

> First of all, all changes go through PRs, which are reviewed, and the code is tested.

Binary formats cannot be properly reviewed: they are binary, and they cannot currently even be properly diffed with publicly available tools. Even if we created a Kaitai Struct-based tool to do it, the diffs would still be large, since these formats are not intended to be human-readable but to be consumed only by software.

> A simple solution might be to dump the entire db into csv files corresponding to individual tables

Yeah, that's what I proposed. Having the datasets in CSV/TSV form is also beneficial, since far more software can be used to interface with them.

An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

@kalvdans (Collaborator) commented

> An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

I would recommend against HDF5: it is also a binary format, with a library that crashes on corrupt files, and there's only a single implementation of the writer library.

@lmmentel (Owner) commented

> > Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.
>
> It doesn't work this way. One just needs to run a query against a manipulated SQLite file to get an adversary's shellcode executed.

I understand your concern, but security for me is a risk-management exercise, and while this attack is certainly possible, I doubt it's very plausible in this particular case.

> > First of all, all changes go through PRs, which are reviewed, and the code is tested.
>
> Binary formats cannot be properly reviewed: they are binary, and they cannot currently even be properly diffed with publicly available tools. Even if we created a Kaitai Struct-based tool to do it, the diffs would still be large, since these formats are not intended to be human-readable but to be consumed only by software.

Binary formats in general are hard to diff, but SQLite can be diffed to some extent with sqldiff, which to date has been enough for the purpose of tracking changes made to the db.

> An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

I haven't done much investigation into the security implications of SQLite vs HDF5, but I'm guessing you're suggesting that HDF5 is safer?

I don't want to sound like I'm ignoring the problem, but this is a small research tool with a moderate community of users, so being on the pragmatic side I doubt someone will put massive effort into corrupting the db to exploit this. I admit I might be wrong here. I'd like to think we're doing our best to mitigate reasonable risks by keeping dependencies up to date, testing, and following best practices, but if that's not enough for your use case you can choose another package - I mention a few in the docs.

I would gladly accept a PR that builds the db from plain-text file formats for distribution, but it would also have to address schema drift, migrations, and relational consistency, which are currently handled by SQLAlchemy. Let me know if you would be interested in leading that effort.
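As a rough illustration of how the schema could stay authoritative in code while the distributed artifacts remain plain text - the models and CSV file names below are hypothetical stand-ins, not this package's actual tables:

```python
import csv

from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Group(Base):
    __tablename__ = "groups"
    group_id = Column(Integer, primary_key=True)
    name = Column(String)


class Element(Base):
    __tablename__ = "elements"
    atomic_number = Column(Integer, primary_key=True)
    symbol = Column(String, nullable=False)
    mass = Column(Float)
    group_id = Column(Integer, ForeignKey("groups.group_id"))


engine = create_engine("sqlite:///elements.db")
Base.metadata.create_all(engine)  # schema comes from the models, not the CSVs

with Session(engine) as session:
    # Load parent tables before children so foreign keys resolve.
    with open("groups.csv", newline="") as fh:
        session.add_all(Group(**row) for row in csv.DictReader(fh))
    with open("elements.csv", newline="") as fh:
        session.add_all(Element(**row) for row in csv.DictReader(fh))
    session.commit()
```

With this setup, adding or renaming a column shows up as an ordinary reviewable diff in the model definitions rather than a silent change in a binary file.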

@KOLANICH (Author) commented Feb 15, 2023

> it would also have to address schema drift and migrations

I don't understand why migrations have to be handled at all, if the data schema must match the way it is used in the software and the data is currently shipped with the software.

@lmmentel (Owner) commented

> > it would also have to address schema drift and migrations
>
> I don't understand why migrations have to be handled at all, if the data schema must match the way it is used in the software and the data is currently shipped with the software.

I'm assuming you still want to keep sqlite for package distribution, and if so, when the data changes the schema must be updated - otherwise it might be hard to keep any consistency. Data rarely describes itself sufficiently to infer a schema from a raw csv file. I'm not saying this could not be done, just that going that route would be a headache every time someone adds/modifies/renames a column or table.

@KOLANICH (Author) commented

BTW, I have an idea. We can use a :memory: database and import data into it from TSV on package initialization.
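A sketch of that idea, assuming the TSVs ship as package data next to the module (the `data/` layout is illustrative); it mirrors the install-time loader above but targets :memory:, so no binary file is ever distributed or written to disk:

```python
import csv
import sqlite3
from pathlib import Path

DATA_DIR = Path(__file__).parent / "data"  # hypothetical package-data layout


def load_memory_db() -> sqlite3.Connection:
    """Build the :memory: database from the shipped TSV files."""
    con = sqlite3.connect(":memory:")
    for tsv in sorted(DATA_DIR.glob("*.tsv")):
        with tsv.open(newline="") as fh:
            reader = csv.reader(fh, delimiter="\t")
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            marks = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE "{tsv.stem}" ({cols})')
            con.executemany(
                f'INSERT INTO "{tsv.stem}" ({cols}) VALUES ({marks})', reader
            )
    con.commit()
    return con


DB = load_memory_db()  # runs once, on first import of the package
```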

@lmmentel (Owner) commented

That could work. The first import statement on the Python side might take a while, since it would run the first migration and load all the data into memory. It shouldn't be too hard to check how long that would take.
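Something as small as this would give a first number (the package name is a stand-in):

```python
import time

start = time.perf_counter()
import the_package  # stand-in name; first import builds the :memory: db
print(f"first import: {time.perf_counter() - start:.2f} s")
```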
