Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Python) database tools / chapter? #316

Open
egpbos opened this issue Nov 6, 2023 · 13 comments
Open

(Python) database tools / chapter? #316

egpbos opened this issue Nov 6, 2023 · 13 comments

Comments

@egpbos
Copy link
Member

egpbos commented Nov 6, 2023

The Python chapter contains a section "Database Interface" that doesn't seem helpful. As far as I know, what we use from Python to access databases is typically SQLAlchemy, as @bouweandela suggested. I have used this personally in the past, but it's been a long time and I don't feel confident enough on this topic to write more than just "SQLAlchemy good".

Does anyone with more recent experience want to write something about this? @HannoSpreeuw maybe? Needn't be more than a single sentence, but if you have good tips ("ALWAYS rtfm first" or "NEVER rtfm, it's horrible, try this instead" or...), that would be great as well.

Also, if anyone disagrees with the above assessment (that the currently listed database interface tools are essentially useless to our typical usecases), please correct me!

@egpbos
Copy link
Member Author

egpbos commented Nov 6, 2023

Btw, for the historians among you, it seems like the original contributor of this list is lost in the mists of time ;) https://github.com/NLeSC/guide/blame/52cf8b2fe97f571cff745f3d8a60d915deb5b4dd/software/language_specific_info.md

@egpbos egpbos mentioned this issue Nov 6, 2023
1 task
@HannoSpreeuw
Copy link
Member

Sorry, I do have some experience with SQLAlchemy from PADRE, but that was mainly about getting it to work after a Python 2 --> 3 conversion. What you need is an opinion about whether SA is the preferred Object Relational Mapper. I truly don't know.

@egpbos
Copy link
Member Author

egpbos commented Nov 7, 2023

How about @suvayu?

I think @LourensVeen has the theoretical background on ORMs, but is not familiar with SA, or at least that's the last I heard... Perhaps still enough to make a sensible recommendation on this?

@LourensVeen
Copy link
Member

I have extensive but outdated experience with PostgreSQL, and I'm using sqlite3 from Python directly through its SQL-based API. I could probably do some research and write a reasonable recommendation, but I'm also swamped again, so at the earliest in 2024...

@suvayu
Copy link
Member

suvayu commented Nov 7, 2023

I'm guessing this is the section you are referring to? My experience with both databases and SQLAlchemy is only at the level "I know enough to debug, and spot obvious mistakes".

That said, one (somewhat) low effort approach could be someone moderately experienced go through related awesome lists and select a few recommendations.

Another point would be, not to stop at just interfaces, but mention other related tools like db documentation, query optimisers/explainers, database modelling hints, and convenience utilities like DBCli.

I'll be back next week. I can help as long as someone else also joins me.

@egpbos
Copy link
Member Author

egpbos commented Nov 7, 2023

@LourensVeen so what you are saying is that sqlite3 actually is a useful one to keep there?

@suvayu ok, this is an interesting direction. It's sounding more like a chapter on databases, though. Perhaps actually something like that is better than listing random libraries that nobody may ever use? I also remember from my brief time doing MySQL + SQLAlchemy that we ended up spending a lot of time just doing raw SQL and indeed interfacing directly with the MySQL CLI tool or e.g. through good old phpMyAdmin.

I would say then that the database interfaces list can simply be removed from this chapter. If people know the general way of working with databases and they need either an ORM or some direct interface between the DB they chose and the language they use, then Google will deliver it to them. Ok, maybe it makes sense to list popular ORMs in such a chapter, among which would be SQLAlchemy, but that would be the extent of the Python-/language-specific needs then.

Agree? Disagree? Things I missed?

@egpbos
Copy link
Member Author

egpbos commented Nov 7, 2023

@wrvhage What does the Data SIG think, is this something that makes sense?

Note btw that new Guide contribution guidelines will soon be online, so please don't write a database chapter yet ;)

@LourensVeen
Copy link
Member

sqlite is excellent, and definitely recommended if you have a use case that fits it. For lots of data or complex queries or if you need a separate server, then PostgreSQL is better, but it does have a separate server so it makes your application context more complex. And then of course there are column stores and NoSQL databases that may or may not be a better match than SQL. SQL often gets overlooked these days, which is a shame because there are definitely still lots of use cases where it's just the right solution.

@suvayu
Copy link
Member

suvayu commented Nov 7, 2023

My impression is for research use cases, these days duckdb is probably a good choice, since the data is usually static and it's more about analytics.

@egpbos egpbos changed the title Python database tools (Python) database tools / chapter? Nov 7, 2023
@f-hafner
Copy link
Contributor

I came across this thread by chance, and thought I added my two cents from one of my previous research projects. There, we've been using sqlite extensively and over time built a database of almost 1TB with many tables and relatively complex queries.

What we have been struggling is aggregation queries -- they are quite slow in sqlite for this size of db. I recently experimented with DuckDB, and I could connect to the sqlite existing database and perform some aggregation queries 10x faster than through sqlite.

In short, I think combining DuckDB and sqlite could be promising in many use-cases. I guess @suvayu is right that the more static the data, the more useful is DuckDB. But if you are building a db from scratch or adding a lot of tables (which at least in the social sciences is not uncommon), I think sqlite is better because it can persist row indices which I think DuckDB cannot. For analytics, it seems one can then still use DuckDB, but perhaps one could explore the combination of the two a bit more.

@suvayu
Copy link
Member

suvayu commented Jan 26, 2024 via email

@f-hafner
Copy link
Contributor

I think I tried to replace the entire sqlite db with DuckDB in the past, but then abandoned the idea because I could not persist the row indices, but perhaps I should have tried a little more.

so yes, @suvayu , let's have a look together. I can't share the data publicly, but can give you access or prepare an extract that I can share.

@f-hafner
Copy link
Contributor

f-hafner commented Apr 25, 2024

@suvayu and I will work on the following

Issues to address for db comparisons duckdb vs sqlite

  • when resources are constrained, it should not fail -> set memory limit; spill to temp directory
  • how to control resource use (cores, threads), with an eye on personal laptop vs server
  • compare speed on large queries (aggregation) between sqlite and duckdb -- is there one area where we prefer sqlite??
  • wider use cases, like querying files (json, parquet, csv, excel)
  • lazy evaluation? in sqlite vs duckdb?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants