I feel like the article misses most of the 'meat' of bulk loading setups. e.g. Using partitioned tables, attaching partitions with little to no locking, indexing after loading, `COPY ... FREEZE`, etc.
One other note:
> Goto solution for bulk loading into PostgreSQL is the native copy command. But one limitation with the copy command is that it requires the CSV file to be placed on the server.
You can use `COPY ... FROM STDIN` and stream the data from the client; this is basically what `\copy` does in psql.
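A minimal sketch of that client-side streaming with psycopg2 (table and connection here are placeholders; any driver exposing COPY works the same way):

```python
import io

def rows_to_copy_text(rows):
    # COPY's default text format: tab-delimited columns, \N for NULL,
    # one row per newline. (Real data would also need embedded tabs,
    # newlines and backslashes escaped.)
    return "".join(
        "\t".join(r"\N" if v is None else str(v) for v in row) + "\n"
        for row in rows
    )

def copy_rows(conn, table, rows):
    # conn is an open psycopg2 connection (hypothetical); copy_expert reads
    # from any file-like object, so the data streams from the client and the
    # file never has to exist on the server's filesystem.
    buf = io.StringIO(rows_to_copy_text(rows))
    with conn.cursor() as cur:
        cur.copy_expert("COPY {} FROM STDIN".format(table), buf)
    conn.commit()
```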
One use case I've used them for is when you have a table that is 'under fire', and you need to add bulk data to it. Doing a large insert into a table like that can cause various issues. Instead I use a partitioned table, and each bulk load is a separate partition. This allows doing the bulk load to a 'clean' table out of the line of fire. You can then apply indexes, do cache warming or whatever else you may want to do to it, then attach the partition to the parent table to make it 'active'. As long as you have a covering constraint on the partition that matches the partition bound, this results in minimal impact to query performance. We do bulk loads of partitions with ~100-200M rows this way and it's barely noticeable to the query side latency.
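The sequence above can be sketched as the following SQL, generated here as strings (table and column names like `event_day` are hypothetical; the CHECK constraint matching the partition bound is what lets `ATTACH PARTITION` skip its full validation scan):

```python
def attach_load_sql(parent, part, lo, hi):
    """SQL sequence for loading a partition out of band, then attaching it."""
    return [
        # 1. Build the partition as a plain table, out of the line of fire.
        f"CREATE TABLE {part} (LIKE {parent} INCLUDING ALL);",
        # 2. ...bulk load and index {part}, warm caches, etc., here...
        # 3. A CHECK constraint matching the future partition bound means
        #    ATTACH PARTITION needs no long table scan under lock.
        f"ALTER TABLE {part} ADD CONSTRAINT {part}_bound "
        f"CHECK (event_day >= '{lo}' AND event_day < '{hi}');",
        f"ALTER TABLE {parent} ATTACH PARTITION {part} "
        f"FOR VALUES FROM ('{lo}') TO ('{hi}');",
    ]
```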
The best use case I've had for partitioning tables is when there is a lot of churn that would otherwise result in significant vacuuming. For example, a table holding the past 7 days of data would require deleting 1/7 of its rows every day.
Swapping out a child table and dropping the old one is free in that sense, but does require an exclusive lock which can cause issues if you have long running queries.
The important thing to remember is to use `COPY ... FROM STDIN`, not INSERT, for bulk loading into PostgreSQL. Most PostgreSQL drivers support COPY from the client, though it's not always available through generic database APIs.
COPY commands typically write hundreds of thousands of rows per second on a large server/cluster. It's useful to write over multiple connections, but rarely more than ~16.
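One way to fan a load out over a few connections is to pre-split the rows and run one COPY per connection (e.g. via `concurrent.futures`); a trivial round-robin splitter, assuming row order doesn't matter:

```python
def split_batches(rows, n_conns=8):
    # One batch per connection; per the note above, going much past ~16
    # connections rarely helps.
    batches = [[] for _ in range(n_conns)]
    for i, row in enumerate(rows):
        batches[i % n_conns].append(row)
    return batches
```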
The standard API for database access in Python is DB-API 2, and it doesn't include support for COPY, so each driver may implement it differently.
There's another aspect to this, and that's the format of the file to be ingested, which can be `text`, `csv` or `binary`. If you're generating the file yourself then you have a choice, but whether `binary` is faster than `csv` I just don't know.
Worth noting that you pass in data in a proprietary binary format. [1]
I've done it, not out of performance concerns, but because transcoding between proprietary binary formats with types is a lot saner than the alternative.
One caveat to keep in mind when using the binary format is that arrays of custom types are not portable across databases because the serialized array contains the OID of the custom type, which may be different on the other end.
The other thing to keep in mind is that text or CSV can be much more compact for data sets with many small integers or NULLs. On the other hand, the binary format is much more compact for timestamps and floating point numbers. In general, binary format has lower parsing overhead.
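For the curious, the binary COPY stream is simple to emit by hand. A minimal sketch for a table of int4 columns only (layout per the COPY docs: an 11-byte signature, 4-byte flags, 4-byte header-extension length, then per tuple a 2-byte field count and a length-prefixed value per field, then a 2-byte trailer of -1; all integers big-endian):

```python
import struct

HEADER = b"PGCOPY\n\xff\r\n\x00" + struct.pack("!ii", 0, 0)
TRAILER = struct.pack("!h", -1)

def int4_field(v):
    if v is None:
        return struct.pack("!i", -1)   # field length -1 marks SQL NULL
    return struct.pack("!ii", 4, v)    # 4-byte length, then the value

def copy_binary_payload(rows):
    # Sketch for int4 columns only; other types need their own binary send
    # formats, and custom types embed OIDs (the portability caveat above).
    parts = [HEADER]
    for row in rows:
        parts.append(struct.pack("!h", len(row)))  # field count for this tuple
        parts.extend(int4_field(v) for v in row)
    parts.append(TRAILER)
    return b"".join(parts)
```

Note how the compactness trade-off shows up directly: a NULL costs 4 bytes here versus 2 characters (`\N`) in text format, while an int4 always costs 8 bytes regardless of how many digits it has.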
Skipping WAL logging with pg_bulkload will win every time. However, not logging transactions is not something that should be taken lightly, nor used for every bulk load operation. For instance, if you're replicating or synchronizing data to another db, or using the logs for similar activity, then you'll need the load transactions so the same operation can be performed on that other db.
Sadly can't - I have never needed to use any bulk loading Postgres utilities, so I don't have the necessary experience. Still, I have taken a few glances at pgloader as a Common Lisp program, and I'm aware that it can migrate from X to PostgreSQL, where X includes multiple distinct databases and CSV files. That's why I've mentioned it here.
I'm slightly surprised to not see any mention of materialized views.
We're in the middle of a project in which we use file_fdw to provide an interface to CSV files which are replaced nightly, but we then also have indexed materialized views to provide faster access to the data in those files. The CSVs are only read once per day when REFRESHing the materialized views after a new set of files is dropped in.
One of the nice things about file_fdw is that it allows you to create foreign tables even when the source files don't exist on the filesystem; PG only cares about the files at the moment it tries to read them. There is one downside, though, which is that the files have to be specified via absolute path. This is something that has to be considered given the heterogeneous machine environments on which the application runs.
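The whole setup is only a handful of statements; a sketch with hypothetical names (carried as SQL strings here, since the point is the shape of the DDL):

```python
# file_fdw ships with PostgreSQL but must be created as an extension first.
# Note the absolute path in the filename option -- the downside noted above.
FDW_SETUP = """
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE raw_feed (id int, payload text)
    SERVER csv_srv
    OPTIONS (filename '/data/feed.csv', format 'csv');
CREATE MATERIALIZED VIEW feed AS SELECT * FROM raw_feed;
CREATE INDEX ON feed (id);
"""

# Run nightly, after the new CSVs have been dropped in place:
NIGHTLY_REFRESH = "REFRESH MATERIALIZED VIEW feed;"
```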
Nit: the plots are hard to follow to the point of being useless. I have to skip over an entire group (4 bars) to compare corresponding bars against each other. Bar charts work best when the bars you want to compare are adjacent, not organized into groups. Anyways, the table is far more useful here.
What about cases where there's multiple tables with foreign keys between them? Should you just drop the constraints, import tables individually, and then re-apply the constraints? Any alternatives?
Postgres allows for deferring constraints, so you can do an alter command at the start to defer them, then upon commit all of the foreign keys are checked. Makes it hugely easier to load when you don’t need to do it in dependency order.
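One caveat: `SET CONSTRAINTS` only affects constraints that were declared `DEFERRABLE` in the first place (foreign keys default to `NOT DEFERRABLE`). Assuming that's done, the load shape looks like this (hypothetical table names, carried as SQL strings):

```python
# Inside one transaction, deferred FK checks all fire at COMMIT,
# so the tables can be loaded in any order.
DEFERRED_BULK_LOAD = [
    "BEGIN;",
    "SET CONSTRAINTS ALL DEFERRED;",
    "COPY parent_table FROM STDIN;",   # hypothetical tables
    "COPY child_table FROM STDIN;",    # note: child before parent would also work
    "COMMIT;",                         # all deferred FK checks run here
]
```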
Yup.. I’ve been there and it sucks. Only realising after your load of ~1 billion records has almost completed after several hours that there were a couple of duplicated rows and the PK constraint failed.
I wish there were a way for the db to filter out or surface rows that violate a constraint while still loading the data that does meet the constraints. I guess the fix here is to run data quality checks before letting the db validate constraints, deleting or fixing rows that would have failed. But then depending on the transaction isolation level you might not be able to see uncommitted rows? I dunno
> I wish there were a way for the db to filter out or surface rows that violate a constraint while still loading the data that does meet the constraints.
You can definitely have PG ignore rows that would violate constraints by specifying "ON CONFLICT DO NOTHING"[1] as part of your INSERT statement.
If you wanted to capture those rows on the fly, one way would be to perform the load via a PL/PgSQL program and include an exception handler specific to the particular constraint violation(s) you expect to hit. But I guess then you'd have to load them one row at a time...and that's not too different from just doing that in your high-level language of choice. Also the PL/PgSQL docs specifically recommend against this idea.[2]
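A middle ground between the two: COPY into a staging table, move the rows in one set-based statement, and use `RETURNING` in a CTE to work out which staged rows were skipped. All names (`staging`, `target`, `id`) are hypothetical:

```python
# The CTE captures the keys that actually got inserted; anything in staging
# but not in that set was skipped by ON CONFLICT DO NOTHING.
SIFT_CONFLICTS = """
WITH ins AS (
    INSERT INTO target
    SELECT * FROM staging
    ON CONFLICT DO NOTHING
    RETURNING id
)
SELECT s.*
FROM staging s
WHERE s.id NOT IN (SELECT id FROM ins);
"""
```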
Depending on the use case, one can consider creating the target table as UNLOGGED, then altering it after the bulk data load. It makes for a really fast load, but the table won't survive a server crash, so altering it back with `SET LOGGED` after the data load is recommended.
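The round trip is just two ALTERs around the load (available since PostgreSQL 9.5; `bulk_target` is a hypothetical table):

```python
UNLOGGED_LOAD = [
    "ALTER TABLE bulk_target SET UNLOGGED;",  # skip WAL during the load
    "COPY bulk_target FROM STDIN;",
    "ALTER TABLE bulk_target SET LOGGED;",    # rewrites the table into the WAL,
                                              # making it crash-safe again
]
```

Note that `SET LOGGED` rewrites (and WAL-logs) the whole table, so the time saved during the load is partly paid back at the end; it still tends to win for large loads.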