
Shipping prebuilt SQLite databases is a security issue #98

Open
KOLANICH opened this issue Feb 15, 2023 · 8 comments
Labels: bug (Something isn't working)

@KOLANICH commented Feb 15, 2023

Describe the bug

Manipulated database files can be used for arbitrary code execution, and SQLite has never guaranteed any security in this case. It is a nice place to conceal backdoors. I understand it is not easy to replace the current SQLAlchemy-based infrastructure... but it still has to be done in the long run.

In the short run you can mitigate the issue by splitting the database into a separate package whose sdist contains the source data in JSON/CSV/TSV format, and generating the SQLite file on the end user's machine during package installation (which is why no prebuilt wheels should be shipped for that package).
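A minimal sketch of what such an install-time build step could look like, assuming one CSV file per table; the `data/` layout, the file-name-equals-table-name convention, and the `elements.db` name are illustrative, not this package's actual structure:

```python
import csv
import sqlite3
from pathlib import Path


def build_db(csv_dir: Path, db_path: Path) -> None:
    """Rebuild the SQLite file from one CSV source file per table."""
    con = sqlite3.connect(db_path)
    try:
        for csv_file in sorted(csv_dir.glob("*.csv")):
            table = csv_file.stem  # assumed convention: file name == table name
            with csv_file.open(newline="") as fh:
                reader = csv.reader(fh)
                header = next(reader)
                cols = ", ".join(f'"{c}"' for c in header)
                marks = ", ".join("?" for _ in header)
                # Columns are created untyped here; a real build step would
                # carry type information in a schema file alongside the CSVs.
                con.execute(f'CREATE TABLE "{table}" ({cols})')
                con.executemany(
                    f'INSERT INTO "{table}" ({cols}) VALUES ({marks})', reader
                )
        con.commit()
    finally:
        con.close()


if __name__ == "__main__":
    build_db(Path("data"), Path("elements.db"))
```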

@lmmentel (Owner) commented
Thanks for your thoughts.

There certainly can be concerns, but I think a combination of factors reduces the risk. First of all, all changes go through PRs, which are reviewed, and the code is tested. Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.

It certainly would be an idea to split the data layer into a separate repo, since there are implementations of this package in Julia and Rust. I'm sure they both have SQLite drivers, but it might be worth putting plain files out there.

A simple solution might be to dump the entire db into csv files corresponding to individual tables, like here. Putting it back together when building a distribution would, however, require substantial extra effort to modify the current CD pipeline. I'm not yet convinced the effort is justified.
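For reference, the dump direction is cheap; a sketch using only the stdlib, where the `elements.db` file name is a placeholder for the actual database file:

```python
import csv
import sqlite3
from pathlib import Path


def dump_tables(db_path: str, out_dir: Path) -> None:
    """Write every table in the SQLite database to its own CSV file."""
    out_dir.mkdir(exist_ok=True)
    con = sqlite3.connect(db_path)
    try:
        tables = [
            name for (name,) in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        for table in tables:
            cur = con.execute(f'SELECT * FROM "{table}"')
            with (out_dir / f"{table}.csv").open("w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(col[0] for col in cur.description)  # header row
                writer.writerows(cur)
    finally:
        con.close()


dump_tables("elements.db", Path("csv_dump"))
```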

@KOLANICH (Author) commented Feb 15, 2023

> Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.

It doesn't work this way. One just needs to run a query against a manipulated SQLite file to get an adversary's shellcode executed.

> First of all, all changes go through PRs, which are reviewed, and the code is tested.

Binary formats cannot be properly reviewed: they are binary, and they cannot currently even be properly diffed with publicly available tools. Even if we created a Kaitai Struct-based tool to do it, the diffs would still be large, since these formats are not intended to be human-readable but to be consumed only by software.

> A simple solution might be to dump the entire db into csv files corresponding to individual tables

Yeah, that's what I proposed. Having the datasets in CSV/TSV form is also beneficial, since far more software can be used to interface with them.

An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

@kalvdans (Collaborator) commented

> An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

I would recommend against HDF5: it is also a binary format, with a library that crashes on corrupt files, and there's only a single implementation of the writer library.

@lmmentel (Owner) commented

> > Second, the Python layer fetches data through SQLAlchemy, which does not really allow for executing arbitrary code.
>
> It doesn't work this way. One just needs to run a query against a manipulated SQLite file to get an adversary's shellcode executed.

I understand your concern, but security for me is a risk-management exercise, and while this attack is certainly possible, I doubt it's very plausible in this particular case.

> > First of all, all changes go through PRs, which are reviewed, and the code is tested.
>
> Binary formats cannot be properly reviewed: they are binary, and they cannot currently even be properly diffed with publicly available tools. Even if we created a Kaitai Struct-based tool to do it, the diffs would still be large, since these formats are not intended to be human-readable but to be consumed only by software.

Binary formats in general are hard to diff, but SQLite can be diffed to some extent with sqldiff, which to date has been enough for the purpose of tracking changes made to the db.

> An alternative to SQL could be HDF5 + pandas, but it would require replacing SQLAlchemy.

I haven't done much investigation into the security implications of SQLite vs HDF5, but I'm guessing you're suggesting that HDF5 is safer?

I don't want to sound like I'm ignoring the problem, but this is a small research tool with a moderate community of users, so being on the pragmatic side I doubt someone will put massive effort into corrupting the db to exploit this. I admit I might be wrong here. I'd like to think we're doing our best to mitigate reasonable risks by keeping dependencies up to date, testing, and following best practices, but if that's not enough for your use case you can choose another package - I mention a few in the docs.

I would gladly accept a PR that builds the db from plain-text file formats for distribution, but it would also have to address schema drift, migrations, and relational consistency, which are currently handled by SQLAlchemy. Let me know if you would be interested in leading that effort.
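As a rough illustration of how the schema could stay authoritative in code while the distributed artifacts remain plain text - the models and CSV file names below are hypothetical stand-ins, not this package's actual tables:

```python
import csv

from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Group(Base):
    __tablename__ = "groups"
    group_id = Column(Integer, primary_key=True)
    name = Column(String)


class Element(Base):
    __tablename__ = "elements"
    atomic_number = Column(Integer, primary_key=True)
    symbol = Column(String, nullable=False)
    mass = Column(Float)
    group_id = Column(Integer, ForeignKey("groups.group_id"))


engine = create_engine("sqlite:///elements.db")
Base.metadata.create_all(engine)  # schema comes from the models, not the CSVs

with Session(engine) as session:
    # Load parent tables before children so foreign keys resolve.
    with open("groups.csv", newline="") as fh:
        session.add_all(Group(**row) for row in csv.DictReader(fh))
    with open("elements.csv", newline="") as fh:
        session.add_all(Element(**row) for row in csv.DictReader(fh))
    session.commit()
```

With this setup, adding or renaming a column shows up as an ordinary reviewable diff in the model definitions rather than a silent change in a binary file.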

@KOLANICH (Author) commented Feb 15, 2023

> it would also have to address schema drift and migrations

I don't understand why migrations have to be handled at all, if the data schema must match the way it is used in the software and the data is currently shipped with the software.

@lmmentel (Owner) commented

> > it would also have to address schema drift and migrations
>
> I don't understand why migrations have to be handled at all, if the data schema must match the way it is used in the software and the data is currently shipped with the software.

I'm assuming you still want to keep sqlite for package distribution, and if so, when the data changes the schema must be updated - otherwise it might be hard to keep any consistency. Data rarely describes itself sufficiently to infer a schema from a raw csv file. I'm not saying this could not be done, just that going that route would be a headache every time someone adds/modifies/renames a column or table.

@KOLANICH (Author) commented

BTW, I have an idea. We can use a :memory: database and import data into it from TSV on package initialization.
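A sketch of that idea, assuming the TSVs ship as package data next to the module (the `data/` layout is illustrative); it mirrors the install-time loader above but targets :memory:, so no binary file is ever distributed or written to disk:

```python
import csv
import sqlite3
from pathlib import Path

DATA_DIR = Path(__file__).parent / "data"  # hypothetical package-data layout


def load_memory_db() -> sqlite3.Connection:
    """Build the :memory: database from the shipped TSV files."""
    con = sqlite3.connect(":memory:")
    for tsv in sorted(DATA_DIR.glob("*.tsv")):
        with tsv.open(newline="") as fh:
            reader = csv.reader(fh, delimiter="\t")
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            marks = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE "{tsv.stem}" ({cols})')
            con.executemany(
                f'INSERT INTO "{tsv.stem}" ({cols}) VALUES ({marks})', reader
            )
    con.commit()
    return con


DB = load_memory_db()  # runs once, on first import of the package
```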

@lmmentel (Owner) commented

That could work. The first import statement on the Python side might take a while, since it would run the first migration and load all the data into memory. It shouldn't be too hard to check how long that would take.
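Something as small as this would give a first number (the package name is a stand-in):

```python
import time

start = time.perf_counter()
import the_package  # stand-in name; first import builds the :memory: db
print(f"first import: {time.perf_counter() - start:.2f} s")
```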
