Bulk loading into PostgreSQL: Options and comparison (highgo.ca)
112 points by eatonphil on Dec 12, 2020 | 28 comments



I feel like the article misses most of the 'meat' of bulk loading setups, e.g. using partitioned tables, attaching partitions with little to no locking, indexing after loading, `COPY ... FREEZE`, etc.
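For example, `COPY ... FREEZE` only applies if the target table was created or truncated in the same transaction; a minimal sketch, with a hypothetical table and file path:

    BEGIN;
    TRUNCATE staging_events;   -- or CREATE TABLE earlier in this same transaction
    COPY staging_events FROM '/data/events.csv' WITH (FORMAT csv, HEADER, FREEZE);
    COMMIT;

Rows loaded this way are written already frozen, which avoids a later anti-wraparound vacuum pass over the whole table.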

One other note:

> Goto solution for bulk loading into PostgreSQL is the native copy command. But one limitation with the copy command is that it requires the CSV file to be placed on the server.

You can use `COPY ... FROM STDIN` and stream the data from the client; this is basically what `\copy` does in psql.
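A minimal sketch of the two variants, assuming a hypothetical `measurements` table and CSV file:

    -- server-side: the file must exist on the database server
    COPY measurements FROM '/var/lib/postgresql/measurements.csv' WITH (FORMAT csv, HEADER);

    -- client-side: psql reads the local file and streams it as COPY ... FROM STDIN
    \copy measurements FROM 'measurements.csv' WITH (FORMAT csv, HEADER)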


The article did compare loading with index vs index after load.

What good do partitions do? Loading in parallel?


One use case I've used them for is when you have a table that is 'under fire' and you need to add bulk data to it. Doing a large insert into a table like that can cause various issues. Instead, I use a partitioned table, and each bulk load is a separate partition. This allows doing the bulk load into a 'clean' table out of the line of fire. You can then apply indexes, do cache warming or whatever else you may want to do to it, then attach the partition to the parent table to make it 'active'. As long as you have a covering constraint on the partition that matches the partition constraint, this results in minimal impact to query performance. We do bulk loads of partitions with ~100-200M rows this way and it's barely noticeable in query-side latency.
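A rough sketch of that workflow, assuming a hypothetical `events` parent range-partitioned by day on `event_time`; the CHECK constraint matching the partition bound is what lets ATTACH PARTITION skip the full validation scan:

    -- build the new partition as a plain table, out of the line of fire
    CREATE TABLE events_2020_12_12 (LIKE events INCLUDING DEFAULTS);
    COPY events_2020_12_12 FROM '/data/2020_12_12.csv' WITH (FORMAT csv);
    CREATE INDEX ON events_2020_12_12 (event_time);   -- index after the load

    -- a constraint implying the partition bound means ATTACH doesn't scan the table,
    -- so it only needs a brief lock on the parent
    ALTER TABLE events_2020_12_12 ADD CONSTRAINT bound_check
        CHECK (event_time >= '2020-12-12' AND event_time < '2020-12-13');
    ALTER TABLE events ATTACH PARTITION events_2020_12_12
        FOR VALUES FROM ('2020-12-12') TO ('2020-12-13');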


The best use case I've had for partitioning tables is when there is a lot of churn that would otherwise result in significant vacuuming. For example, a table holding the past 7 days of data would require deleting 1/7 of its rows every day.

Swapping out a child table and dropping the old one is free in that sense, but does require an exclusive lock which can cause issues if you have long running queries.
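A sketch of that swap, again assuming a hypothetical `events` table partitioned by day; the detach takes an ACCESS EXCLUSIVE lock on the parent, hence the caveat about long-running queries:

    ALTER TABLE events DETACH PARTITION events_2020_12_05;
    DROP TABLE events_2020_12_05;   -- no dead tuples left behind, nothing to vacuum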


The important thing to remember is to use COPY ... FROM STDIN, not INSERT, for bulk loading into PostgreSQL. Most PostgreSQL drivers support COPY from the client, though it's not always available through generic database APIs.

COPY commands typically write hundreds of thousands of rows per second on a large server/cluster. It's useful to write over multiple connections, but rarely more than ~16.


This is true. I'm the maintainer of pg8000, a Python driver for PostgreSQL. Here are the docs for how you'd do this using pg8000:

https://github.com/tlocke/pg8000#copy-from-and-to-a-file

The standard API for database access in Python is DB-API 2, and it doesn't include support for COPY, so each driver may implement it differently.

There's another aspect to this, and that's the format of the file to be ingested, as it can be 'text', 'CSV' or 'binary'. If you're generating the file yourself then you have a choice, but whether 'binary' is faster than 'CSV' I just don't know.
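The format is just a COPY option, so it's cheap to compare on your own data (hypothetical table and paths):

    COPY measurements TO '/tmp/m.txt' WITH (FORMAT text);     -- default, tab-delimited
    COPY measurements TO '/tmp/m.csv' WITH (FORMAT csv);
    COPY measurements TO '/tmp/m.bin' WITH (FORMAT binary);
    COPY measurements FROM '/tmp/m.bin' WITH (FORMAT binary);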


Worth noting that you can pass in data in a proprietary binary format. [1]

I've done it, not out of performance concerns, but because transcoding between proprietary binary formats with types is a lot saner than the alternative.

[1]: https://www.postgresql.org/docs/current/sql-copy.html#id-1.9...


One caveat to keep in mind when using the binary format is that arrays of custom types are not portable across databases because the serialized array contains the OID of the custom type, which may be different on the other end.

The other thing to keep in mind is that text or CSV can be much more compact for data sets with many small integers or NULLs. On the other hand, the binary format is much more compact for timestamps and floating point numbers. In general, binary format has lower parsing overhead.


Not logging with pg_bulkload will win every time. However, not logging transactions is not something that should be taken lightly, nor used for every bulk load operation. For instance, if you're replicating or synchronizing data to another db, or using the logs for such activity, then you'll need the load transactions in order to perform the same operation on that other db.


This article, for whatever reason, doesn't mention pgloader!

https://pgloader.io/


Written in Common Lisp, which is pretty cool.


Could you comment on how it compares to the methods benchmarked?


Sadly I can't - I have never needed to use any bulk loading Postgres utilities, so I don't have the necessary experience. Still, I have glanced at pgloader as a Common Lisp program, and I'm aware that it can migrate from X to PostgreSQL, where X includes multiple distinct databases and CSV files. That's why I've mentioned it here.


There are some good tips from a previous HN post on the same topic here: https://news.ycombinator.com/item?id=20403284


I'm slightly surprised to not see any mention of materialized views.

We're in the middle of a project in which we use file_fdw to provide an interface to CSV files which are replaced nightly, but we then also have indexed materialized views to provide faster access to the data in those files. The CSVs are only read once per day when REFRESHing the materialized views after a new set of files is dropped in.

One of the nice things about file_fdw is that it allows you to create foreign tables even when the source files don't exist on the filesystem. PG only cares about the files at the moment it's trying to read them. There is one downside, though, which is that the files have to be specified via absolute path. This is something that has to be considered given the heterogeneous machine environments on which the application runs.
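A minimal sketch of that setup, with hypothetical names and the absolute path caveat included:

    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_drop FOREIGN DATA WRAPPER file_fdw;

    CREATE FOREIGN TABLE raw_readings (
        reading_id bigint,
        taken_at   timestamptz,
        value      numeric
    ) SERVER csv_drop
      OPTIONS (filename '/srv/drops/readings.csv', format 'csv', header 'true');

    CREATE MATERIALIZED VIEW readings AS SELECT * FROM raw_readings;
    CREATE UNIQUE INDEX ON readings (reading_id);

    -- nightly, after the new files land:
    REFRESH MATERIALIZED VIEW CONCURRENTLY readings;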


Nit: the plots are hard to follow, to the point of being useless. I have to skip over an entire group (4 bars) to compare like with like. Bar charts work best when the bars you want to compare are adjacent, not organized by groups. Anyway, the table is far more useful here.


Also, the choice of colours is incredibly poor; different shades of blue don't work for people with some forms of color blindness.


What about cases where there's multiple tables with foreign keys between them? Should you just drop the constraints, import tables individually, and then re-apply the constraints? Any alternatives?


Not sure if this solves it in all cases, but foreign key constraint checks can be deferred until the end of the transaction.

https://www.postgresql.org/docs/9.1/sql-set-constraints.html


Postgres allows for deferring constraints: mark them DEFERRABLE with an alter command up front, run SET CONSTRAINTS ... DEFERRED at the start of the load, and then upon commit all of the foreign keys are checked. Makes it hugely easier to load when you don't need to do it in dependency order.
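A sketch of that, with hypothetical tables and a foreign key that has been made DEFERRABLE:

    ALTER TABLE order_items
        ALTER CONSTRAINT order_items_order_id_fkey DEFERRABLE INITIALLY IMMEDIATE;

    BEGIN;
    SET CONSTRAINTS ALL DEFERRED;      -- FK checks postponed until commit
    COPY order_items FROM '/data/order_items.csv' WITH (FORMAT csv);  -- children before parents is fine
    COPY orders      FROM '/data/orders.csv'      WITH (FORMAT csv);
    COMMIT;                            -- all deferred foreign keys are checked here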


What happens to rows that fail the constraint check after load?


It'll just throw an error when COMMITing, most likely.


And roll back the whole load? Yikes


Yup.. I've been there and it sucks. Only realising after several hours, when your load of ~1 billion records has almost completed, that there were a couple of duplicated rows and the PK constraint failed.


I wish there were a way for the db to filter out, or at least not surface, rows that violated a constraint, but still load the data that did meet the constraints. I guess the fix here is to run data quality checks before letting the db validate constraints, either deleting or fixing rows that would have failed such checks. But then, depending on the transaction level, you might not be able to see uncommitted rows? I dunno


> I wish there were a way for the db to filter out, or at least not surface, rows that violated a constraint, but still load the data that did meet the constraints.

You can definitely have PG skip rows that would violate a unique or exclusion constraint by specifying "ON CONFLICT DO NOTHING"[1] as part of your INSERT statement.
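For example (hypothetical staging and target tables), loading everything into a constraint-free staging table and then skipping duplicate keys on the way into the real one:

    CREATE TEMP TABLE staging (LIKE events INCLUDING DEFAULTS);
    COPY staging FROM '/data/events.csv' WITH (FORMAT csv);

    INSERT INTO events
    SELECT * FROM staging
    ON CONFLICT DO NOTHING;   -- silently skips rows hitting a unique/exclusion constraint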

If you wanted to capture those rows on the fly, one way would be to perform the load via a PL/PgSQL program and include an exception handler specific to the particular constraint violation(s) you expect to hit. But I guess then you'd have to load them one row at a time...and that's not too different from just doing that in your high-level language of choice. Also the PL/PgSQL docs specifically recommend against this idea.[2]

[1]: https://www.postgresql.org/docs/11/sql-insert.html#SQL-ON-CO...

[2]: https://www.postgresql.org/docs/11/plpgsql-control-structure...

EDIT: formatting and clarity


Of course! I forgot about that option. Makes sense.


Depending on the use case, one can consider creating the target table as UNLOGGED, then altering it after the bulk data load. It makes for a really fast load, but the table won't survive a server crash, so altering it back with SET LOGGED after the data loading is recommended.
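A minimal sketch, with a hypothetical table:

    CREATE UNLOGGED TABLE events (
        event_time timestamptz NOT NULL,
        payload    jsonb
    );
    COPY events FROM '/data/events.csv' WITH (FORMAT csv);
    ALTER TABLE events SET LOGGED;   -- rewrites the table through the WAL so it survives a crash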



