Data Cube Statistics Tools

Data Cube Statistics is a an application used to calculate large scale temporal statistics on data stored using an Open Data Cube (ODC) installation. It provides a command line application which uses a YAML configuration file to specify the data range and statistics to calculate.

Contents

Main features
Installation
Usage
Configuration format
Available statistics
Custom statistics
Running with PBS job scheduler
Release Notes

Main features

Calculate continental scale statistics over a 30+ year time range
Simple yet powerful Configuration format
Scripts supplied for running in a HPC PBS based environment
Flexible tiling and chunking for efficient execution of 100+Tb jobs
Track and store full provenance record of operations
Round-trip workflow from ODC back into ODC
Supports saving to NetCDF, GeoTIFF or other GDAL supported format
Optional per-pixel metadata tracking
Out of the box support for most common statistics - see Available statistics
Able to create user defined Custom statistics
Able to handle any CRS and resolution combination (through the power of the ODC)

Installation

$ pip install git+https://github.com/opendatacube/datacube-stats/

Usage

Simplest case

At it's simplest, Data Cube Statistics only requires specifying a configuration file:

$ datacube-stats example-configuration.yaml

If a configuration file is not provided and a file named config.yaml is found in the current directory, then it will be used automatically.

More detailed usage information is also available:

$ datacube-stats --help

Parallel processing

Parallel operation is provided by executors in ODC. For example to run across 4 cores:

$ datacube-stats --parallel 4 example-configuration.yaml

Overrides for testing

For tiled jobs, you can specify a single tile as a test run:

$ datacube-stats --tile-index [X] [Y] example-configuration.yaml

Also useful when testing stats configurations, you can override the output directory:

$ datacube-stats --output-location /home/user/example_folder/ example-configuration.yaml

Listing available Statistics

$ datacube-stats --list-statistics

Configuration format

Sources

Specify the product/s of interest, measurements of interest, and any masks to be applied.

A simple example loading a single measurement from a single product:

sources:
  - product: old_wofs
    measurements: [water]
    group_by: solar_day

A (much) more complicated example which combines Landsat 5 and Landsat 7 data, with filtering based on particular flags in a Pixel Quality layer, as well as eliminating data which doesn't meet the minimum required spatial accuracy:

sources:
  - product: ls5_nbar_albers
    measurements: [blue, green, red, nir, swir1, swir2]
    group_by: solar_day
    source_filter:
      product: ls5_level1_scene
      gqa_iterative_mean_xy: [0, 1]
    masks:
      - product: ls5_pq_albers
        measurement: pixelquality
        group_by: solar_day
        fuse_func: datacube.helpers.ga_pq_fuser
        flags:
          contiguous: True
          cloud_acca: no_cloud
          cloud_fmask: no_cloud
          cloud_shadow_acca: no_cloud_shadow
          cloud_shadow_fmask: no_cloud_shadow
          blue_saturated: False
          green_saturated: False
          red_saturated: False
          nir_saturated: False
          swir1_saturated: False
          swir2_saturated: False
  - product: ls7_nbar_albers
    measurements: [blue, green, red, nir, swir1, swir2]
    group_by: solar_day
    source_filter:
      product: ls7_level1_scene
      gqa_iterative_mean_xy: [0, 1]
    masks:
      - product: ls7_pq_albers
        measurement: pixelquality
        group_by: solar_day
        fuse_func: datacube.helpers.ga_pq_fuser
        flags:
          contiguous: True
          cloud_acca: no_cloud
          cloud_fmask: no_cloud
          cloud_shadow_acca: no_cloud_shadow
          cloud_shadow_fmask: no_cloud_shadow
          blue_saturated: False
          green_saturated: False
          red_saturated: False
          nir_saturated: False
          swir1_saturated: False
          swir2_saturated: False

Discrete Values / No-Data Masking

By default, null or no-data values are automatically masked out, according to their definition in the Data Cube Product they are loaded from. In some cases this doesn't make sense, and can be disabled by specifying mask_nodata: False. For example bitfield data like PQ and WOfS Extents that have a more complicated representation of no-data, which will be handled by the statistic being run over them.

sources:
  - product: ls5_pq_albers
    group_by: solar_day
    mask_nodata: False
    fuse_func: datacube.helpers.ga_pq_fuser
    group_by: solar_day

Date ranges

Single duration

Perform statistics over a single time range. The first date is inclusive and the last date is exclusive.

date_ranges:
  start_date: 2000-01-01
  end_date: 2016-01-01

Multiple durations

Or over a sequence of time steps, for example, an output for each year over a 15 year period:

date_ranges:
  start_date: 2000-01-01
  end_date: 2016-01-01
  stats_duration: 1y
  step_size: 1y

Winter seasons in the southern hemisphere over the same 15 year period:

date_ranges:
  start_date: 2000-06-01
  end_date: 2016-09-01
  stats_duration: 3m
  step_size: 1y

Output location

Specify the base output directory where files will be written:

location: /home/user/mystats_outputs/

Output storage format

NetCDF

Able to write fully compliant NetCDF-CF, either projected or unprojected spatially, with optional `Extra metadata attributes`_.

For example, to output 100×100km tiles, with 25m per pixel:

storage:
  driver: NetCDF CF

  crs: EPSG:3577
  tile_size:
      x: 100000.0
      y: 100000.0
  resolution:
      x: 25
      y: -25
  chunking:
      x: 200
      y: 200
      time: 1
  dimension_order: [time, y, x]

GeoTIFF

Write GeoTIFF files for each defined output. Side car dataset metadata documents in YAML format will be written which include the provenance information and allow re-indexing into the Data Cube.

Output 1°×1° tiles, with 4000×4000 pixels per tile:

storage:
  driver: GeoTIFF

  crs: EPSG:4326
  tile_size:
      longitude: 1.0
      latitude: 1.0
  resolution:
      longitude: 0.00025
      latitude: -0.00025
  chunking:
      longitude: 400
      latitude: 400
      time: 1
  dimension_order: [time, latitude, longitude]

Computation/memory usage

Adjust the size of the spatial chunks that are loaded into memory. This setting can be adjusted depending on the time depth being processed, the available memory on the processing machine, and how many simultaneous tasks are being run on the machine.

computation:
  chunking:
    longitude: 1000
    latitude: 1000

Input area of interest (optional)

Shapefile

An easy way to create a Shapefile is to use GeoJSON.io, draw your region of interest, then from the top menu Save -> Shapefile to download the zipped Shapefile.

input_region:
  from_file: /home/user/mdb_floodplan/mdb_floodplain.shp

Whether the output will be gridded (tile-based, default) or not (feature-based) may be specified by setting gridded: True or gridded: False respectively. The features to generate output for may also be specified (in which case the output is feature-based),

input_region:
  from_file: /home/user/mdb_floodplan/mdb_floodplain.shp
  feature_id: [39]

Tile index

The tiling regime is determined by the tile_size parameter of the Output storage format section. A list of tiles can be passed on to tiles parameter.

input_region:
  tiles:
    - [16, -39]
    - [17, -39]

Period of interest

The time period can be specified for individual sensors to include only datasets for this period. This can be sometime useful to exclude datasets for Landsat 7 due to SLC failure.

sources:
  product: ls7_nbar_albers
  name: intertidal_low
  measurements: [blue, green, red, nir, swir1, swir2]
  group_by: solar_day
  time: [1986-01-01, 2003-05-01]

Spatial extents

Specify the maximum and minimum spatial range. You must also specify the CRS to use, normally with an EPSG code, this alters whether you are specifying x/y or latitude/longitude.

input_region:
   crs: EPSG:4326
   longitude: [147.1, 147.9]
   latitude: [-33, -34]

Note

This method of specifying extents will output a single large file, not a set of tiles.

Everywhere

Don't specify an input_region to process all available data.

GeoJSON

Use https://geojson.io/ to draw out a region of interest. Copy the geometry portion of the GeoJSON and paste it into your configuration file under input_region. An example for Australia:

input_region:
    "geometry": {
        "type": "Polygon",
        "coordinates": [ [
            [ 143.26171875, -9.88227549342994 ],
            [ 129.7265625, -9.96885060854611 ],
            [ 125.859375, -12.554563528593656 ],
            [ 119.35546875000001, -18.22935133838667 ],
            [ 111.005859375, -22.350075806124853 ],
            [ 113.818359375, -36.17335693522159 ],
            [ 117.94921874999999, -36.52729481454623 ],
            [ 130.78125, -33.06392419812064 ],
            [ 135.966796875, -37.43997405227057 ],
            [ 147.041015625, -44.59046718130883 ],
            [ 154.248046875, -34.234512362369856 ],
            [ 154.775390625, -24.5271348225978 ],
            [ 143.26171875, -9.88227549342994 ]
          ] ]
      }

Output products/which statistics

This section of the configuration file specifies both which statistics to calculate, and which files to write them out to.

For many statistics workflows, it takes longer to load the data into memory than it does to compute the result. For these cases it makes sense to perform multiple computations on the same set of data, and so output_products is a list of outputs, but at a minimum it only needs one definition.

Statistic/calculation

Specify which statistic to use, and optionally any arguments. For example, a simple mean:

statistic: simple
statistic_args:
  reduction_function: mean

Name

Define the name of the output product. eg:

name: landsat_yearly_mean

Product type

Optional field allows to specify product_type field of the output product. Defaults to !!NOTSET!!. This is needed when output is to be indexed into the datacube.

product_type: seasonal_stats

Output parameters

Any extra arguments to pass to the output driver for an individual output band:

output_params:
  zlib: True
  fletcher32: True

Datacube metadata

Specify arbitrary metadata to attach. The format section is required by datacube.

metadata:
    format:
        name: NetCDF
    platform:
        code: LANDSAT-8

Filter product

NOTE: This feature is being deprecated. We expect to remove it in the next release after release 0.9b1 and replace it with something more general.

To filter out sources that correspond to any derived products. It currently supports two methods to filter out list of dates. Filtering in hydrological months ('by_hydrological_months'), can be used to filter months from July to November for the year after the dry or wet years collected from the polygon. Specific month range can also be specified. Second method of filtering is 'by_tide_height', which uses OTPS model to get tide_height:

filter_product:
  method: by_tide_height
  args:
    tide_range: 10
    tide_percent: 20

filter_product:
  method: by_hydrological_months
  args:
    type: dry
    months: ['07', '11']

File naming

Specify a template string used to name the output files. Uses the python format() string syntax, with the following placeholders available:

Placeholder	Description
x	X Tile Index
y	Y Tile Index
epoch_start	Start date of the epoch, format using strftime syntax
epoch_end	End date of the epoch, format using strftime syntax
name	The product name given to this output product
stat_name	The name of the statistic used to compute this product

For example:

file_path_template: '{y}_{x}/LS_PQ_COUNT_3577_{y}_{x}_{epoch_start:%Y-%m-%d}_{epoch_end:%Y-%m-%d}.nc'

Will output filenames similar to:

10_15/LS_PQ_COUNT_3577_10_15_2010-01-01_2011-01-01.nc

Complete example

output_products:
 - name: landsat_seasonal_mean
   product_type: seasonal_stats
   metadata:
       format:
           name: NetCDF
   statistic: simple
   statistic_args:
     reduction_function: mean
   output_params:
     zlib: True
     fletcher32: True
   file_path_template: 'SR_N_MEAN/SR_N_MEAN_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'

 - name: landsat_seasonal_medoid
   product_type: seasonal_stats
   statistic: medoid
   output_params:
     zlib: True
     fletcher32: True
   file_path_template: 'SR_N_MEDOID/SR_N_MEDOID_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'

 - name: landsat_seasonal_percentile_10
   product_type: seasonal_stats
   statistic: percentile
   statistic_args:
     q: 10
   output_params:
     zlib: True
     fletcher32: True
   file_path_template: 'SR_N_PCT_10/SR_N_PCT_10_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'

Global attributes

Additional metadata can be specified which will be written as global attributes into the output NetCDF file. For example:

global_attributes:
  institution: Commonwealth of Australia (Geoscience Australia)
  instrument: OLI
  keywords: AU/GA,NASA/GSFC/SED/ESD/LANDSAT,ETM+,TM,OLI,EARTH SCIENCE
  keywords_vocabulary: GCMD
  platform: LANDSAT-8
  publisher_email: [email protected]
  publisher_name: Section Leader, Operations Section, NEMO, Geoscience Australia
  publisher_url: https://www.ga.gov.au
  license: CC BY Attribution 4.0 International License
  coverage_content_type: physicalMeasurement
  cdm_data_type: Grid
  product_suite: Pixel Quality 25m

Available statistics

Any reduction operation supported by xarray. eg:
- mean
- median
- percentile
High-dimensional medians implemented by the hdmedians python package
- Medoid
- Geometric median
Normalised difference statistics. eg. NDVI + statistic
Custom statistics

Custom statistics

Statistics operations in Data Cube Statistics are implemented as Python Classes, which extends datacube_stats.statistics.Statistic. Two methods should be implemented, measurements() and compute().

measurements(): Takes a list of measurements provided by the input product type, and returns a list of measurements that this class will produce when asked to compute a statistic over some data.
compute(): Takes a xarray.Dataset containing some data that has been loaded, and returns another xarray.Dataset after doing some computation. The variables on the returned dataset must match the types specified by measurements().

For example, the following implementation requires its input data to contain a variable named water, and outputs data with a single variable named count_wet of type int16. In the configuration file, we will need to pass a list of values for water that indicate "wetness" as an argument named wet_values to the statistic.

import xarray
from datacube_stats.statistics import Statistic
from datacube.model import Measurement

class CountWet(Statistic):
    def __init__(self, wet_values):
        # list of values of 'water' that we count as "wet"
        assert len(wet_values) > 0, 'no wet values provided'

        self.wet_values = wet_values

    def compute(self, data):
        wet = xarray.zeros_like(data.water)

        for val in self.wet_values:
            wet += data.water == val

        return xarray.Dataset({'count_wet': wet.sum(dim='time')},
                              attrs={'crs': data.crs})

    def measurements(self, input_measurements):
        assert 'water' in [m['name'] for m in input_measurements]

        wet = Measurement(name='count_wet', dtype='int16', nodata= -1, units='1')

        return [wet]

Suppose the package that contains this implementation is called pseudo.example, and it is available in the Python path (with the current directory added). Then the configuration file could look like (eliding location, computation, and storage specifications)

sources:
  - product: wofs_albers
    name: wofs_dry
    measurements: [water]
    group_by: solar_day

date_ranges:
  start_date: 2014-01-01
  end_date: 2014-02-01
  stats_duration: 1m
  step_size: 1m

output_products:
 - name: wet_count_summary
   product_type: wofs_statistical_summary
   metadata:
     format:
       name: NetCDF
   statistic: external
   statistic_args:
     impl: pseudo.example.CountWet

     # ignoring PQ sea mask that excludes a lot of useful data
     wet_values:
        - 128 # clear and wet
        - 132 # clear and wet and masked for sea
   output_params:
     zlib: True
     fletcher32: True
   file_path_template: 'WOFS_COUNT/{x}_{y}/WOFS_COUNT_3577_{x}_{y}_{epoch_start:%Y%m%d}_{epoch_end:%Y%m%d}.nc'

Running with PBS job scheduler

To submit a job to PBS, run datacube-stats like

$ datacube-stats --qsub="project=u46,nodes=100,walltime=5h,mem=large,queue=normal" example.yaml

The mem specification can be small, medium, or large, for 2GB, 4GB, or 8GB memory per core respectively. For more details, run

$ datacube-stats --qsub=help

Release Notes

1.0.2 release

Note: This is the last release before our planned move to make use of virtual products now available in ODC core. This version should be compatible datacube 1.805b or 06b.

Fix incompatibilities with ODC 1.8

1.0.1 release

Fix NetCDF CF driver issue with ODC 1.8

1.0 release

Minor update to the percentile algorithm
Some PyLint fixes

0.9b4 release

Note: This is another (minor) maintenance release targeting datacube 1.6.1.

Refactor that should not affect output
Add FC percentile configuration .yaml files

0.9b3 release

Note: This is another maintenance release targeting datacube 1.6.1.

Percentile now supports specifying minimum_valid_observations
One GeoTIFF file per band option is re-enabled (specify var_name in the file_path_template)
Polygon mask is now compatible with mask_nodata
Fix some tests

0.9b2 release

Note: This is a maintenance release compatible with datacube 1.6.1.

Add mangrove statistics
Minor changes to the Python API to support multi-processing
The GeoTIFF output driver name has changed to GeoTIFF from GeoTiff

0.9b1 release

Note: We expect several backwards incompatible changes to the datacube-stats package in the near future. Release 0.9b1 is intended to be the last release fully supporting configurations from earlier releases.

Add tasseled cap indices statistic
Fix GeoTiff output driver
Preliminary Python API and an xarray output driver to produce in-momery results
Support default configurations (config.yaml in the current working folder)
Support discoverable external plugins (in the current working folder)

0.9a9 release

Fix xarray sorting bug
Add ability to specify num_threads to the new_geomedian statistic
Add ability to attach custom metadata to generated datasets

0.9a8 release

Add ability for feature-based task generation from a shapefile
Fix issue with hdmedian GeoMedian statistics

0.9a7 release

Move task execution code to the digitalearthau repository
ITEM and low/high tide composite, FC percentile
Schema-validated configuration

0.9a6 release

Time filters on individual source products

0.9 release

User documentation!
List available statistics from the command line datacube-stats --list-statistics

Name		Name	Last commit message	Last commit date
Latest commit History 706 Commits
.circleci		.circleci
ansible		ansible
configurations		configurations
datacube_stats		datacube_stats
docs		docs
scripts		scripts
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
Makefile		Makefile
README.rst		README.rst
check-code.sh		check-code.sh
pylintrc		pylintrc
setup.cfg		setup.cfg
setup.py		setup.py

License

opendatacube/datacube-stats

Folders and files

Latest commit

History

Repository files navigation

Data Cube Statistics Tools

About

Topics

Resources

License

Stars

Watchers

Forks

Languages