sferics/obs-processing

What is OBS processing?

This repository contains everything needed to process and store synoptic observations from a variety of sources. Out of the box it supports the DWD, KNMI and IMGW Open Data services and can parse BUFR files from many other providers. It is easily extendable via YAML configuration files and by adding your own scripts which make use of the existing framework.

How to install OBS processing

  • While in the directory of the repository, run install.sh like this:
    chmod +x install.sh && ./install.sh
  • OR if the permissions cannot be set/changed:
    bash install.sh
  • The install.sh script will install Miniconda if not present, create an environment with all necessary packages and install the plbufr package from sferics' GitHub.
  • It then defines ".githook/" as the directory for git hooks. There are currently two git hooks: the pre-commit hook automatically compiles all .py files before each commit, so at least some syntax errors can be easily avoided. The post-commit hook, on the other hand, calls scripts/export_conda_environment.sh, which exports the conda environment information to "environment.yml" and creates a "requirement.txt" file.
  • Afterwards, it compiles all .py files in the directory in order to speed up the first run of each script.
  • Lastly, it executes 3 .sql files (in "sql/") which add some essential tables, columns and values to the main database. These changes should be implemented in amalthea/main for better integration!

How to use OBS processing

Python scripts

All Python scripts offer a -h/--help option which shows their command line arguments with a brief explanation. However, in order to understand them better, you should read the following in-depth information carefully.
To be able to run these scripts, the configuration files general.yml, scripts.yml, sources.yml and clusters.yml are needed. Before the first use, create them by copying the template files from "config/templates/" to "config/" and adding your desired source and cluster configurations to the respective files. The general.yml and scripts.yml files also need to be adjusted with your desired file paths, system-specific settings etc.

Note on command line arguments

All command line arguments are defined in config/parser_args.yml and they are the same across all scripts. The only difference between scripts lies in which arguments are available.
For more details on adding/changing/removing command line arguments, please read the respective section about the YAML configuration files -> parser_args.yml.
IMPORTANT: Settings defined by command line arguments always override settings defined in the script's configuration!

Common command line arguments

-h/--help
  • show help message which explains the usage of the script briefly
-v/--verbose
  • print (more) verbose output
-d/--debug
  • run in debug mode with additional debug prints and stop points (using pdb module)
-t/--traceback
  • use traceback module to print error messages that occur on module level
-w/--no_warnings
  • suppress all warning messages
-i/--pid_file
  • use a PID file to determine whether the script is already running and which process number it has
-l/--log_level $LOG_LEVEL
  • define logging level (choose one of the following: {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET} )
-C/--config_dir $FILE_NAME
  • define a custom config directory (structure has to be the same as within config/ directory)
-k/--known_stations $LIST_OF_STATIONS
  • comma-separated list of stations to consider
-c/--clusters $LIST_OF_CLUSTERS
  • comma-separated list of clusters to consider
-m/--max_retries $RETRIES
  • maximum number of retries when writing to station databases
-n/--max_files $NUMBER_OF_FILES
  • maximum number of files to process (usually this setting applies per source)
-m/--mode $MODE
  • operation mode (can be "dev", "oper" or "test")
-s/--stage $STAGE
  • stage of forging (can be "raw", "forge", "bad" or "final")
-o/--timeout $TIMEOUT
  • timeout in seconds when trying to write to station databases
-O/--output $OUTPUT_PATH
  • define custom output path where the station databases will be saved
-P/--processes $NUMBER_OF_PROCESSES
  • use multiprocessing if -P > 1; defines number of processes to use
-T/--translation $TRANSLATION_FILE
  • define name of custom (BUFR) translation file (can be necessary for providers which use special encoding or error treatment)

decode_bufr.py

This script decodes one or several BUFR files and inserts all relevant observations into the raw databases.
It can also process entire source/dataset directories, which can be specified via the source name(s) as positional argument(s) or via the "sources.yml" configuration file.

Unique command line arguments

source
  • first and only positional argument
  • can take several sources, separated by spaces
-a/--approach $APPROACH

You may use 5 different approaches to decode the BUFR files:

  • pd: Using pdbufr package officially provided by ECMWF (very slow because it uses pandas).
  • pl: Using plbufr package forked from pdbufr by sferics (much faster because it uses polars instead).
  • gt: Also using plbufr package, but instead of creating a polars DataFrame, it uses a generator (should be equally fast).
  • us: Fastest decoding method, using BUFR keys from ecCodes, but lacking some observations like soil temperatures.
  • ex: Slower than the "us" method, but significantly faster than the pdbufr/plbufr methods. Not guaranteed to work with all files and lacking some information from DWD Open Data files!
-f/--file $FILE_PATH
  • process a single file, given by its file path
-F/--FILES $LIST_OF_FILES
  • process several files, given by their file paths, separated by a divider character (default: ";")
-D/--divider $DIVIDER
  • define a custom divider/separator character for -F
-r/--redo
  • process file(s) again, even if they have been processed already
-R/--restart
  • usually only used by the script itself when it restarts because the RAM is full, so that it knows which files are still left to process
-s/--sort_files
  • sort files with a sorting algorithm (sorted() by default)
-H/--how
  • define a sorting algorithm for the above option (has to be a Python callable and will be evaluated using eval())

Example usages

single file, redo even if already processed:

decode_bufr.py -a pl -f example_file.bufr -r

multiple files, use "," as divider character, show verbose output:

decode_bufr.py -a ex -F example_file1.bin,example_file2.bin,example_file3.bin -D "," -v

single source, consider only specific stations:

decode_bufr.py DWD -a gt -k 10381,10382,10384,10385

multiple sources, process a maximum of 100 files per source:

decode_bufr.py DWD KNMI RMI -a gt -n 100

custom config file, process all sources which are defined there and use custom output directory:

decode_bufr.py -C obs_custom.yml -O /custom/output/directory

forge_obs.py

This is a chain script which runs the scripts described below in their order of occurrence. In operational mode only, derive_obs.py runs a second time after aggregate_obs.py; export_obs.py will only be executed if -e/--export is set.

Unique command line arguments

-b/--bare
  • only print out commands and do not actually run the scripts
  • this is meant for debugging purposes only
-e/--export
  • export new observations into the old/legacy metwatch CSV format after finishing the chain (see export_obs.py for more information)
-L/--legacy_output $LEGACY_OUTPUT
  • define the old/legacy metwatch CSV output directory for export_obs.py

Example usage

Define a custom legacy output path and set the log level to "INFO":

python forge_obs.py -e -L /legacy/output/path -l INFO

reduce_obs.py

Reduce the raw observations so that only one row per dataset remains for each unique (datetime, duration, element) combination, keeping the row with the highest file ID (max(file)). Then copy all remaining elements from the raw to the forge databases [dataset, datetime, duration, element, value].

Example usage

Use 12 processes:

python reduce_obs.py -P 12

derive_obs.py

Compute derived elements like relative humidity, cloud levels or reduced pressure from (a combination of) other elements.

Unique command line arguments

-A/--aggregated
  • compute derived elements again, but this time only considering 30-minute values

Example usage

Only derive observations from a single station:

python derive_obs.py -k 10381

aggregate_obs.py

Aggregate over certain time periods/durations (like 30min, 1h, 3h, 6h, 12h, 24h) and create new elements with a "{duration}" suffix (like "TMAX12h_2m_syn"). The information about which elements to aggregate over which durations, and which elements need gap filling, is contained in config/element_aggregation.yml.

Example usage

Enable traceback prints:

python aggregate_obs.py -t

audit_obs.py

Check all observations in the forge databases and delete bad data, i.e. NaN, unknown or out-of-range values.

  • move good data to the final databases, e.g. "/oper/final" (oper mode)
  • move bad data to separate databases, e.g. "/dev/bad" (dev mode)

Example usage

Run in debug mode with debug prints and stop points:

python audit_obs.py -d

empty_obs.py

Clear forge station databases (they are temporary and get rebuilt every chain cycle).

Unique command line arguments

-B/--bad_obs
  • clear bad obs as well

Example usage

Use the above option and suppress all warnings:

python empty_obs.py -B -w

export_obs.py

Export observations from the final databases into the old/legacy metwatch CSV format.

Unique command line arguments

-L/--legacy_output $LEGACY_OUTPUT
  • define the old/legacy metwatch CSV output directory

Example usage

Define a custom directory for the legacy output:

python export_obs.py -L /legacy/output/directory

get_imgw.py

Get the latest observations from the Polish (IMGW) Open Data service.

Example usage

Show verbose output and consider only stations in cluster "poland":

python get_imgw.py -v -c poland

Description of YAML configuration files in "config/" directory

codes/

bufr/

flags_{approach}.yml

- conversion of BUFR code/flag tables into values we use

sequences.yml

- definition of WMO BUFR sequences - only needed for the "ex" approach of decode_bufr.py

synop.yml

- conversion of SYNOP codes into values we use

metar.yml

- conversion of METAR codes into values we use

element_aggregation.yml

- information about which elements to aggregate OR which to gap-fill
- consists of two sections:

duration:
- which elements to aggregate over which durations
- fallback elements can be defined (like TMP instead of TMAX)
instant:
- which elements always have the same duration
- for these elements we try to fill in the gaps (using nearby values)
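
A minimal sketch of how such a file could look; all element names, keys and values below are illustrative assumptions, not the actual schema:

duration:
  TMAX_2m_syn:              # hypothetical element name
    durations: [12h, 24h]   # aggregate over these durations -> e.g. TMAX12h_2m_syn
    fallback: TMP_2m_syn    # fall back to TMP if TMAX is missing
instant:
  SUNDUR_1h_syn:            # hypothetical element with a fixed duration
    duration: 1h            # gaps are filled using nearby values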

element_info.yml

- information about the value range of elements (lower/upper boundaries)
- also: which values to include or exclude within/outside that range (extra/exclude)
- extra is a list of values which will always be accepted, even if they are out-of-range
- exclude is defined as a regular expression ("x" means no excluded values)
- used by the audit_obs.py script only
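
A minimal sketch of how such a file could look; the element and key names below are illustrative assumptions, not the actual schema:

TMP_2m_syn:        # hypothetical element name
  lower: -80       # lower boundary of the valid value range
  upper: 60        # upper boundary
  extra: [999]     # list of out-of-range values which are accepted anyway
  exclude: x       # regular expression; "x" means no excluded values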

templates/

general.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- main configuration file template with the following sections:

general:
- most general settings which will be overwritten by all following configs
- order of priorities: general -> class -> script -> command line arguments
database:
- default configuration for the main database (usually when DatabaseClass is called for main.db)
bufr:
- default configuration for the BufrClass, higher priority than "general:" but lower than script config
obs:
- default configuration for the ObsClass, higher priority than "general:" but lower than script config
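
A minimal sketch of the section layout; the individual settings shown are assumptions, not actual defaults:

general:
  verbose: false              # assumed setting; overwritten by class, script and command line configs
  mode: dev
database:
  timeout: 5                  # assumed default for DatabaseClass (main.db)
bufr:
  approach: gt                # assumed default for the BufrClass
obs:
  output: /path/to/stations   # assumed default for the ObsClass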

scripts.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- adjust the settings of each script here as desired
- sections/keys are always the FULL script name (with .py)
- special script configuration entries in detail:

decode_bufr.py:
- TODO
forge_obs.py:
- TODO
reduce_obs.py:
- TODO
derive_obs.py:
- TODO
aggregate_obs.py:
- TODO
audit_obs.py:
- TODO
empty_obs.py:
- TODO
get_obs.py:
- TODO
get_imgw.py:
- TODO
get_knmi.py:
- TODO
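
A minimal sketch of the general layout; the settings shown are assumptions borrowed from the command line arguments of the same names:

decode_bufr.py:
  approach: gt     # cf. -a/--approach
  max_files: 100   # cf. -n/--max_files
aggregate_obs.py:
  processes: 12    # cf. -P/--processes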

sources.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- define all source-specific settings here
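
A minimal sketch with purely hypothetical keys; consult the template for the actual source settings:

DWD:
  dir: /data/bufr/DWD   # assumed: directory to scan for new BUFR files
  max_files: 100        # assumed: per-source file limit, cf. -n/--max_files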

clusters.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- define blockNumber, stationIdentifier and station types (str) for the different clusters
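
A minimal sketch using the keys named above; the exact layout is an assumption:

poland:
  blockNumber: 12               # WMO block number of the cluster's stations
  stationIdentifier: ["12345"]  # assumed format of the station list
  type: str                     # station type (a string), as mentioned above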

translations/

bufr/

{approach}.yml

- BUFR key translations for the different approaches

metwatch.yml

- translation for the legacy metwatch element names

imgw.yml

- translation for the element names of the Polish weather service (IMGW) Open Data

{other_source}.yml

- use this naming scheme if you want to add your own custom source translation files

parser_args.yml

- definition of positional and flag (e.g. -v/--verbose) command line arguments
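
A sketch of what a flag definition could look like; the key names here are assumptions:

verbose:
  flags: [-v, --verbose]            # short and long form of the flag
  help: print (more) verbose output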

station_tables/

{mode}_{stage}.yml

- definition of the table structure for the location/station databases
- the syntax is very SQL-like but simpler than a real .sql file
- all mode and stage combinations need to be present if you add custom modes/stages


Bash scripts in "scripts/" directory

export_bufr_tables.sh

Export your custom BUFR table paths to the local and conda environment variables.

export_conda_environment.sh

Export the conda environment information to "environment.yml", skipping only the "path:" and "variables:" sections because they depend on the local system. Then create a "requirement.txt" file which contains all packages needed to successfully run the Python scripts.

install.sh

Install the repository using conda and prepare everything to get started immediately. The script creates the "obs" environment, installs all needed packages and sets the right environment variables.

multi_decode_bufr.sh

This script starts the decode_bufr.py script multiple times, so a large number of files can be processed much faster.
NOTE: You have to calculate manually how many files each instance should process and define "max_files:" accordingly in the script config's "decode_bufr.py:" section (e.g. for 800 files and 8 instances, set max_files: 100).

Command line arguments

$1 $APPROACH
  • set BUFR decoding approach (default: gt)
$2 $PROCESSES
  • number of decode_bufr.py instances to start
$3 $SLEEP_TIME
  • sleep time in between script execution (wait N seconds before starting the next instance)

Example usage

Start 8 instances of decode_bufr.py, using the "ex" approach and 2 seconds of sleep time in between:

./multi_decode_bufr.sh ex 8 2
