sferics/obs-processing

What is OBS processing?

This repository contains everything needed to process and store synoptic observations from a variety of sources. Out of the box it supports the DWD, KNMI and IMGW Open Data services and can parse BUFR files from many other providers. It is easily extendable via YAML configuration files and by adding your own scripts which make use of the existing framework.

How to install OBS processing

  • While in the directory of the repository, run install.sh like this:
    chmod +x install.sh && ./install.sh
  • OR if the permissions cannot be set/changed:
    bash install.sh
  • The install.sh script will install Miniconda if not present, create an environment with all necessary packages and install the plbufr package from sferics' GitHub.
  • It then defines ".githook/" as the directory for git hooks. There are currently two git hooks: the pre-commit hook automatically compiles all .py files before each commit, so at least some syntax errors can be easily avoided. The post-commit hook, on the other hand, calls scripts/export_conda_environment.sh, which exports the conda environment information to "environment.yml" and creates a "requirement.txt" file.
  • Afterwards, it compiles all .py files in the directory in order to speed up the first run of each script.
  • Lastly, it executes 3 .sql files (in "sql/") which add some essential tables, columns and values to the main database. These changes should be implemented in amalthea/main for better integration!

How to use OBS processing

Python scripts

All Python scripts offer a -h/--help option which shows their command line arguments with a brief explanation. However, in order to understand them better, you should read the following in-depth information carefully.
To be able to run these scripts, the configuration files general.yml, scripts.yml, sources.yml and clusters.yml are needed. Before the first use, create them by copying the template files from "config/templates/" to "config/" and adding your desired source and cluster configurations to the respective files. The general.yml and scripts.yml files also need to be adjusted with your desired file paths, system-specific settings etc.

Note on command line arguments

All command line arguments are defined in config/parser_args.yml and they are the same across all scripts. The only difference between scripts lies in which arguments are available.
For more details on adding/changing/removing command line arguments, please read the respective section about the YAML configuration files -> parser_args.yml.
IMPORTANT: Settings defined by command line arguments always override settings defined in the script's configuration!

Common command line arguments

-h/--help
  • show help message which explains the usage of the script briefly
-v/--verbose
  • print (more) verbose output
-d/--debug
  • run in debug mode with additional debug prints and stop points (using pdb module)
-t/--traceback
  • use traceback module to print error messages that occur on module level
-w/--no_warnings
  • suppress all warning messages
-i/--pid_file
  • use a PID file to determine whether the script is already running and which process number it has
-l/--log_level $LOG_LEVEL
  • define logging level (choose one of the following: {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET} )
-C/--config_dir $FILE_NAME
  • define a custom config directory (structure has to be the same as within config/ directory)
-k/--known_stations $LIST_OF_STATIONS
  • comma-separated list of stations to consider
-c/--clusters $LIST_OF_CLUSTERS
  • comma-separated list of clusters to consider
-m/--max_retries $RETRIES
  • maximum number of retries when writing to station databases
-n/--max_files $NUMBER_OF_FILES
  • maximum number of files to process (usually this setting applies per source)
-m/--mode $MODE
  • operation mode (can be "dev", "oper" or "test")
-s/--stage $STAGE
  • stage of forging (can be "raw", "forge", "bad" or "final")
-o/--timeout $TIMEOUT
  • timeout in seconds when trying to write to station databases
-O/--output $OUTPUT_PATH
  • define custom output path where the station databases will be saved
-P/--processes $NUMBER_OF_PROCESSES
  • use multiprocessing if -P > 1; defines number of processes to use
-T/--translation $TRANSLATION_FILE
  • define name of custom (BUFR) translation file (can be necessary for providers which use special encoding or error treatment)

decode_bufr.py

This script decodes one or several BUFR files and inserts all relevant observations into the raw databases.
It can also process entire source/dataset directories, which can be specified via the source name(s) as positional argument(s) or via the "sources.yml" configuration file.

Unique command line arguments

source
  • first and only positional argument
  • can take several sources, separated by spaces
-a/--approach $APPROACH

You may use 5 different approaches to decode the BUFR files:

  • pd: Using pdbufr package officially provided by ECMWF (very slow because it uses pandas).
  • pl: Using plbufr package forked from pdbufr by sferics (much faster because it uses polars instead).
  • gt: Also using plbufr package, but instead of creating a polars DataFrame, it uses a generator (should be equally fast).
  • us: Fastest decoding method, using BUFR keys from ecCodes, but lacking some observations like soil temperatures.
  • ex: Slower than the "us" method, but significantly faster than the pdbufr/plbufr methods. Not guaranteed to work with all files and lacking some information from DWD Open Data files!
-f/--file $FILE_PATH
  • process a single file, given by its file path
-F/--FILES $LIST_OF_FILES
  • process several files, given by their file paths, separated by a divider character (default: ";")
-D/--divider $DIVIDER
  • define a custom divider/separator character for -F
-r/--redo
  • process file(s) again, even if they have been processed already
-R/--restart
  • usually only used by the script itself when it restarts because the RAM is full, so that it knows which files are still left to process
-s/--sort_files
  • sort files with a sorting algorithm (sorted() by default)
-H/--how
  • define a sorting algorithm for the above option (has to be a Python callable and will be evaluated using eval())

Example usages

single file, redo even if already processed:

decode_bufr.py -a pl -f example_file.bufr -r

multiple files, use "," as divider character, show verbose output:

decode_bufr.py -a ex -F example_file1.bin,example_file2.bin,example_file3.bin -D "," -v

single source, consider only specific stations:

decode_bufr.py DWD -a gt -k 10381,10382,10384,10385

multiple sources, process a maximum of 100 files per source:

decode_bufr.py DWD KNMI RMI -a gt -n 100

custom config file, process all sources which are defined there and use custom output directory:

decode_bufr.py -C obs_custom.yml -O /custom/output/directory

forge_obs.py

This is a chain script which runs the scripts described below in their order of occurrence. In operational mode only, derive_obs.py runs a second time after aggregate_obs.py; export_obs.py will only be executed if -e/--export is set.

Unique command line arguments

-b/--bare
  • only print out commands and do not actually run the scripts
  • this is meant for debugging purposes only
-e/--export
  • export new observations into the old/legacy metwatch CSV format after finishing the chain (see export_obs.py for more information)
-L/--legacy_output $LEGACY_OUTPUT
  • define the old/legacy metwatch CSV output directory for export_obs.py

Example usage

Define a custom legacy output path and set the log level to "INFO":

python forge_obs.py -e -L /legacy/output/path -l INFO

reduce_obs.py

Reduce the raw observations so that only one row per dataset remains for each unique (datetime, duration, element) combination, keeping the row with the highest file ID (max(file)). Then copy all remaining elements from the raw to the forge databases [dataset, datetime, duration, element, value].

Example usage

Use 12 processes:

python reduce_obs.py -P 12

derive_obs.py

Compute derived elements like relative humidity, cloud levels or reduced pressure from (a combination of) other elements.

Unique command line arguments

-A/--aggregated
  • compute derived elements again, but this time only considering 30-minute values

Example usage

Only derive observations from a single station:

python derive_obs.py -k 10381

aggregate_obs.py

Aggregate over certain time periods/durations (like 30min, 1h, 3h, 6h, 12h, 24h) and create new elements with a "{duration}" suffix (like "TMAX12h_2m_syn"). The information about which elements to aggregate over which durations, and which elements need gap filling, is contained in config/element_aggregation.yml.

Example usage

Enable traceback prints:

python aggregate_obs.py -t

audit_obs.py

Check all observations in the forge databases and delete bad data, i.e. NaN, unknown or out-of-range values.

  • move good data to the final databases, e.g. "/oper/final" (oper mode)
  • move bad data to separate databases, e.g. "/dev/bad" (dev mode)

Example usage

Run in debug mode with debug prints and stop points:

python audit_obs.py -d

empty_obs.py

Clear forge station databases (they are temporary and get rebuilt every chain cycle).

Unique command line arguments

-B/--bad_obs
  • clear bad obs as well

Example usage

Use the above option and suppress all warnings:

python empty_obs.py -B -w

export_obs.py

Export observations from the final databases into the old/legacy metwatch CSV format.

Unique command line arguments

-L/--legacy_output $LEGACY_OUTPUT
  • define the old/legacy metwatch CSV output directory

Example usage

Define a custom directory for the legacy output:

python export_obs.py -L /legacy/output/directory

get_imgw.py

Get the latest observations from the Polish (IMGW) Open Data service.

Example usage

Show verbose output and consider only stations in cluster "poland":

python get_imgw.py -v -c poland

Description of YAML configuration files in "config/" directory

codes/

bufr/

flags_{approach}.yml

- conversion of BUFR code/flag tables into values we use

sequences.yml

- definition of WMO BUFR sequences - only needed for the "ex" approach of decode_bufr.py

synop.yml

- conversion of SYNOP codes into values we use

metar.yml

- conversion of METAR codes into values we use

element_aggregation.yml

- information about which elements to aggregate OR which to gap-fill
- consists of two sections:

duration:
- which elements to aggregate over which durations
- fallback elements can be defined (like TMP instead of TMAX)
instant:
- which elements always have the same duration
- for these elements we try to fill in the gaps (using nearby values)
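
A minimal sketch of how such a file could look; all element names, keys and values below are illustrative assumptions, not the actual schema:

duration:
  TMAX_2m_syn:              # hypothetical element name
    durations: [12h, 24h]   # aggregate over these durations -> e.g. TMAX12h_2m_syn
    fallback: TMP_2m_syn    # fall back to TMP if TMAX is missing
instant:
  SUNDUR_1h_syn:            # hypothetical element with a fixed duration
    duration: 1h            # gaps are filled using nearby values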

element_info.yml

- information about the value range of elements (lower/upper boundaries)
- also: which values to include or exclude within/outside that range (extra/exclude)
- extra is a list of values which will always be accepted, even if they are out-of-range
- exclude is defined as a regular expression ("x" means no excluded values)
- used by the audit_obs.py script only
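
A minimal sketch of how such a file could look; the element and key names below are illustrative assumptions, not the actual schema:

TMP_2m_syn:        # hypothetical element name
  lower: -80       # lower boundary of the valid value range
  upper: 60        # upper boundary
  extra: [999]     # list of out-of-range values which are accepted anyway
  exclude: x       # regular expression; "x" means no excluded values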

templates/

general.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- main configuration file template with the following sections:

general:
- most general settings which will be overwritten by all following configs
- order of priorities: general -> class -> script -> command line arguments
database:
- default configuration for the main database (usually when DatabaseClass is called for main.db)
bufr:
- default configuration for the BufrClass, higher priority than "general:" but lower than script config
obs:
- default configuration for the ObsClass, higher priority than "general:" but lower than script config
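
A minimal sketch of the section layout; the individual settings shown are assumptions, not actual defaults:

general:
  verbose: false              # assumed setting; overwritten by class, script and command line configs
  mode: dev
database:
  timeout: 5                  # assumed default for DatabaseClass (main.db)
bufr:
  approach: gt                # assumed default for the BufrClass
obs:
  output: /path/to/stations   # assumed default for the ObsClass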

scripts.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- adjust the settings of each script here as desired
- sections/keys are always the FULL script name (with .py)
- special script configuration entries in detail:

decode_bufr.py:
- TODO
forge_obs.py:
- TODO
reduce_obs.py:
- TODO
derive_obs.py:
- TODO
aggregate_obs.py:
- TODO
audit_obs.py:
- TODO
empty_obs.py:
- TODO
get_obs.py:
- TODO
get_imgw.py:
- TODO
get_knmi.py:
- TODO
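
A minimal sketch of the general layout; the settings shown are assumptions borrowed from the command line arguments of the same names:

decode_bufr.py:
  approach: gt     # cf. -a/--approach
  max_files: 100   # cf. -n/--max_files
aggregate_obs.py:
  processes: 12    # cf. -P/--processes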

sources.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- define all source-specific settings here
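
A minimal sketch with purely hypothetical keys; consult the template for the actual source settings:

DWD:
  dir: /data/bufr/DWD   # assumed: directory to scan for new BUFR files
  max_files: 100        # assumed: per-source file limit, cf. -n/--max_files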

clusters.yml

- needs to be copied to "config/" in order to be recognized by the Python scripts
- define blockNumber, stationIdentifier and station types (str) for the different clusters
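
A minimal sketch using the keys named above; the exact layout is an assumption:

poland:
  blockNumber: 12               # WMO block number of the cluster's stations
  stationIdentifier: ["12345"]  # assumed format of the station list
  type: str                     # station type (a string), as mentioned above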

translations/

bufr/

{approach}.yml

- BUFR key translations for the different approaches

metwatch.yml

- translation for the legacy metwatch element names

imgw.yml

- translation for the element names of the Polish weather service (IMGW) Open Data

{other_source}.yml

- use this naming scheme if you want to add your own custom source translation files

parser_args.yml

- definition of positional and flag (e.g. -v/--verbose) command line arguments
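
A sketch of what a flag definition could look like; the key names here are assumptions:

verbose:
  flags: [-v, --verbose]            # short and long form of the flag
  help: print (more) verbose output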

station_tables/

{mode}_{stage}.yml

- definition of the table structure for the location/station databases
- the syntax is very SQL-like but simpler than a real .sql file
- all mode and stage combinations need to be present if you add custom modes/stages


Bash scripts in "scripts/" directory

export_bufr_tables.sh

Export your custom BUFR table paths to the local and conda environment variables.

export_conda_environment.sh

Export the conda environment information to "environment.yml", skipping only the "path:" and "variables:" sections because they depend on the local system. Then create a "requirement.txt" file which contains all packages needed to successfully run the Python scripts.

install.sh

Install the repository using conda and prepare everything to get started immediately. The script creates the "obs" environment, installs all needed packages and sets the right environment variables.

multi_decode_bufr.sh

This script starts the decode_bufr.py script multiple times, so a large number of files can be processed much faster.
NOTE: You have to calculate manually how many files each instance should process and define "max_files:" accordingly in the script config's "decode_bufr.py:" section (e.g. for 800 files and 8 instances, set max_files: 100).

Command line arguments

$1 $APPROACH
  • set BUFR decoding approach (default: gt)
$2 $PROCESSES
  • number of decode_bufr.py instances to start
$3 $SLEEP_TIME
  • sleep time in between script execution (wait N seconds before starting the next instance)

Example usage

Start 8 instances of decode_bufr.py, using the "ex" approach and 2 seconds of sleep time in between:

./multi_decode_bufr.sh ex 8 2
