chapter | appendix | title |
---|---|---|
20 |
A |
Getting and Preparing OSM Data |
Osm2pgsql imports OSM data into a database. But where does this data come from? This appendix shows how to get the data and, possibly, prepare it for use with osm2pgsql.
There are several common formats for OSM data and osm2pgsql supports all of them (XML{:.extlink}, PBF{:.extlink}, and O5M{:.extlink}). If you have the option, you should prefer the PBF format, because the files are smaller than those in other formats and faster to read.
OSM data files usually have a suffix .osm
, .osm.gz
, .osm.bz2
, or
.osm.pbf
, sometimes just .pbf
. They are used in osm2pgsql "create" mode.
OSM change files usually have the suffix .osc.gz
. They are used in
osm2pgsql "append" mode.
The OpenStreetMap project publishes the so-called "planet file", which contains a dump of the current full OSM database. This dump is available in XML and PBF format.
Downloads are available on planet.osm.org{:.extlink} and several mirrors{:.extlink}.
If you are new to osm2pgsql we recommend you start experimenting with a small extract, not the planet file! The planet file is huge (tens of GBytes) and it will take many hours to import. {: .note}
Importing data into the database takes time and uses a lot of disk space. If you only need a portion of the OSM data it often makes sense to only import an extract of the data.
Geographical extracts for countries, states, cities, etc. are available from several sources{:.extlink}.
The extracts from Geofabrik{:.extlink} are very popular. They are updated daily and also offer daily change files suitable for updating an osm2pgsql database.
Always choose an extract slightly larger than the area you are interested in, otherwise you might have problems with the data at the boundary (also see Handling of Incomplete OSM Data).
If you can't find a suitable extract, see below for creating your own.
The OpenStreetMap database changes all the time. To get these changes into your database, you need to download the OSM change files, sometimes called diffs or replication diffs, which contain those changes. They are available from planet.osm.org{:.extlink}. Depending on how often you want to update your database, you can get minutely, hourly, or daily change files.
Some services offering OSM data extracts for download also offer change files suitable for updating those extracts. Geofabrik{:.extlink} has daily change files for all its updates. See the extract page for a link to the replication URL. (Note that change files go only about 3 months back. Older files are deleted.) download.openstreetmap.fr{:.extlink} has minutely change files for all its extracts.
To keep an osm2pgsql database up to date you need to know the replication
(base) URL, i.e. the URL of the directory containing a state.txt
file.
Osm2pgsql comes with a script scripts/osm2pgsql-replication
which is the easiest way to keep an osm2pgsql database up to date. The script
requires PyOsmium{:.extlink} and
Psycopg{:.extlink} (psycopg2 and psycopg3 both will work)
to be installed.
Before you can download updates, osm2pgsql-replication needs to find the starting point from which to apply the updates. Run the initialisation like this:
osm2pgsql-replication init -d <dbname>
The -d
parameter tells the replication script which database to connect
to. The script also supports all other parameters mentioned in the section
database connection including
libpq environment variables.
The only exception is the '-W' parameter for interactive password prompts.
When you need to supply a password, always use a
pgpass file.
Osm2pgsql will store a table
osm2pgsql_properties
to your database for new
imports. osm2pgsql-replication
uses that table for storing
the replication information.
By default the update server and interval will be set from the file headers,
for planet dumps minutely updates from the OSM main servers will be used. If
you want to use a different replication service, use the --server
parameter.
It is safe to repeat initialisation at any time. For example, when you want
to change the replication service, simply run the init command again with a
different --server
parameter.
Fetching updates is as simple as running:
osm2pgsql-replication update -d <dbname> -- <parameters to osm2pgsql>
This fetches data from the replication service, saves it in a temporary file
and calls osm2pgsql with the given parameters to apply the changes. Note that
osm2pgsql-replication makes sure to only fetch a limited amount of data at the
time to make sure that it does not use up too much RAM. If more data is
available it will repeat the download and call of osm2pgsql until the database
is up to date. You can change the amount of data downloaded at once with
--max-diff-size
, the default is 500MB.
Sometimes you need to run additional commands after osm2pgsql has updated the
database, for example, when you use the expiry function. You can use the
option --post-processing
to give osm2pgsql-replication a script it is
supposed to run after each call to osm2pgsql. Note that if the script fails,
then the entire update process is considered a failure and aborted.
osm2pgsql-replication works well as a systemd service that keeps your database up to date automatically. There are many ways to set up systemd. This section gives you a working example to get you started.
First set up a service that runs the updates. Add the following file as
/etc/systemd/system/osm2pgsql-update.service
:
[Unit]
Description=Keep osm2pgsql database up-to-date
[Service]
WorkingDirectory=/tmp
ExecStart=osm2pgsql-replication update -d <dbname> -- <parameters to osm2pgsql>
StandardOutput=append:/var/log/osm2pgsql-updates.log
User=<database user>
Type=simple
Restart=on-failure
RestartSec=5min
Make sure to adapt the database name, osm2pgsql parameters and the user that the
script should be run as. The Restart
parameters make sure that updates will be
tried again, when something goes wrong like there is a temporary network error.
Now add a timer script that starts the service regularly. Add the following
file as /etc/systemd/system/osm2pgsql-update.timer
:
[Unit]
Description=Trigger a osm2pgsql database update
[Timer]
OnBootSec=10
OnUnitActiveSec=1h
[Install]
WantedBy=timers.target
This timer is good to bring your database up to date once per hour. If you
use minutely updates, you can lower the value for OnUnitActiveSec
to get the
changes even more often. If you use a daily
replication service like Geofabrik, set OnUnitActiveSec to at least 1 hour!
If you prefer to run your updates only once every night, use
OnCalendar=*-*-* 02:00
instead. This will update your server every night at
2am. See the man pages of systemd.timer
for more information.
Now reload systemd so it scans the new scripts and enable the timer:
sudo systemctl daemon-reload
sudo systemctl enable osm2pgsql-update.timer
sudo systemctl start osm2pgsql-update.timer
If this script is not available in your version of osm2pgsql or you want more control over the update process, there are other options. You need a program to download the change files and keep track of where you are in the replication process. Then you load the changes into the database using osm2pgsql's "append" mode.
We recommend the
pyosmium_get_changes.py
{:.extlink}
tool from the PyOsmium{:.extlink} project.
With it, downloading all changes since you ran the program the last time is
just a single command.
When you have changes for many months or years it might make more sense to drop your database completely and re-import from a more current OSM data file instead of updating the database from change files. {:.note}
Processing a daily update in one step is fine, but if you want to update data for longer time periods update it in steps, one for each day or so. If you update too much data in one step, there is a good chance the update will take a long time or even fail. {:.note}
If you have imported an extract into an osm2pgsql database but there are no change files for the area of the extract, you can still use the replication diffs from planet.osm.org to update your database. But, because those extracts contain data for the whole planet, your database will keep accumulating more and more data outside the area of your extract that you don't really need.
Be aware that you usually can not use OSM change files directly with osm2pgsql.
The replication diffs contain all changes, including, occasionally, duplicate
changes to the same object, if it was changed more than once in the period
covered by the change file. You need to "simplify" the change files before
use, for instance with the osmium merge-changes --simplify
command. If you
use osm2pgsql-replication
or pyosmium_get_changes.py
this will be taken
care of automatically.
{:.note}
If you run osm2pgsql with --append
and the update fails for some reason,
for instance when your server crashes in the middle of it, you can just re-run
the update again and everything should come out fine.
Before some OSM data file can be given to osm2pgsql it is sometimes necessary to prepare it in some way. This chapter explains some options.
For most of these steps, the recommended application is the osmium{:.extlink} command line tool, which offers a wide range of functionality of slicing and dicing OSM data.
One handy command is osmium fileinfo
{:.extlink}
which can tell you a lot about an OSM file, including how many objects it
contains, whether it is sorted, etc.
You can create your own extracts from the planet file or from existing
extracts with the osmium extract
{:.extlink}
command. It can create extracts of OSM data from a bounding box or from a
boundary.
There are different "cutting strategies" used by osmium extract
leading to
different results. Read the man page and understand the implications of those
strategies before deciding which one to use.
Always cut an extract slightly larger than the area you are interested in, otherwise you might have problems with the data at the boundary (also see Handling of Incomplete OSM Data).
If you are working with extracts you sometimes have several extracts, lets say for different countries, that you want to load into the same database. One approach in this case is to merge those extracts first and then import the resulting file with osm2pgsql. But osm2pgsql can also read multiple files at once.
You can use the osmium merge
{:.extlink}
command for merging the input files.
Note that in any case for this to work the input files must all have their data from the same point in time. You can use this to import two or more geographical extracts into the same database. If the extracts are from different points in time and contain different versions of the same object, this will fail!
To speed up processing when you have many OSM change files, you can merge
several change files into larger files and then process the larger files with
osm2pgsql. The osmium merge-changes
{:.extlink}
command will do this for you. Make sure to use the option -s, --simplify
.
(That command also works with a single change file, if you just want to
simplify one file.)
Usually you should not merge change files for more than a day or so when doing this, otherwise the amount of changes osm2pgsql has to process in one go becomes too large.
OSM data usually uses positive numbers for the ids of nodes, ways, and relations. Negative numbers are sometimes used for inofficial OSM data, for instance when non-OSM data is mixed with OSM data to make sure the are no id clashes.
Osm2pgsql can only handle positive ids. It uses negative ids internally (for multipolygon geometries in the database).
If you have negative ids in your input file, you can renumber it first. You can
use the osmium renumber
{:.extlink} command
for this.
OSM data files are almost always sorted, first nodes in order of their ids, then ways in order of their ids, then relations in order of their ids. The planet files, change files, and usual extracts all follow this convention.
Osm2pgsql can only read OSM files ordered in this way. This allows some optimizations in the code which speed up the normal processing.
If you have an unsorted input file, you should sort it first. You can use the
osmium sort
{:.extlink}
command for this.
OpenStreetMap offers complete dumps of all OSM data{:.extlink} which include not only the current version of all objects like the normal planet dump, but also all earlier version of OSM objects including deleted ones.
Like most other OSM software, osm2pgsql can not handle this data.
For some use cases there is a workaround: Create extracts from the full history
dump for specific points in time and feed those to osm2pgsql. You can use the
osmium time-filter
{:.extlink}
command to create such extracts.