This project builds from the Opioid Environment Policy Scan (OEPS) data warehouse stored in github.com/GeoDaCenter/opioid-policy-scan, and published on Zenodo at doi.org/10.5281/zenodo.5842465. This repo allows us to pull the final data from the latest OEPS release and push it into Google BigQuery, which will enable new ways of accessing and analyzing that data.
The backend includes commands for managing data transformation to and from different destinations. These commands are organized thematically into the following groups:
- Create and activate a Python virtual environment with venv, Conda, or your other tool of choice.
- Clone this repo:

  ```
  git clone https://github.com/healthyregions/oeps
  cd oeps/backend
  ```

- Install this package and its dependencies:

  ```
  pip install -e .
  ```

- You can now run any scripts with:

  ```
  python ./scripts/script_name.py
  ```
section in progress
To use the BigQuery script (`scripts/bq.py`) you will need to add access credentials as environment variables using the `.env` file. This file must always stay out of version control.
- Make a copy of `.env.example` and name it `.env`. Any variables defined in this file will now be available via `os.getenv('VARIABLE_NAME')`.
- Obtain a set of JSON credentials for the project, and store the file anywhere on your computer (but outside of this repository).
- In your `.env` file, update the `BQ_CREDENTIALS_FILE_PATH` variable with the full path to this file. For example: `BQ_CREDENTIALS_FILE_PATH="/home/my_username/bq_credentials/oeps-391119-5783b2b59b83.json"`
- It is also possible to set BigQuery credentials without storing a local JSON file. More info on this is in the `.env.example` file.
Once you have set up these credentials, you can test them with:

```
python ./scripts/bq.py check-credentials
```

If all is well, the command will print `ok`.
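For reference, the snippet below is a minimal sketch of how a script can pick up these credentials, assuming the `python-dotenv` and `google-cloud-bigquery` packages are installed. It is illustrative only, not the actual implementation of `bq.py`.

```python
# Minimal sketch (illustrative only): read the credential path from .env and
# build a BigQuery client with it.
import os

from dotenv import load_dotenv
from google.cloud import bigquery
from google.oauth2 import service_account

load_dotenv()  # makes variables from .env available via os.getenv()

credentials = service_account.Credentials.from_service_account_file(
    os.getenv("BQ_CREDENTIALS_FILE_PATH")
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# A trivial query confirms the credentials work.
print(list(client.query("SELECT 1").result()))
```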
The `bq.py` script can perform import and export operations on our BigQuery database.
Use the following command to load a new table into BigQuery:

```
python ./scripts/bq.py load --source ./resources
```

Where `--source` is the path to a Data Resource schema (stored as a JSON file), or a directory containing multiple Data Resource schemas. Optional flags on this command are:

- `--table-only` will create the BigQuery dataset and table based on the schema, but will not attempt to load data into it.
- `--dry-run` will validate the input dataset against the schema, but not attempt to load it.
- `--overwrite` will drop and recreate the BigQuery table if it already exists in the dataset.
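For example, a hypothetical invocation that rebuilds a single table's schema without loading any data could be `python ./scripts/bq.py load --source ./resources/S_2010.json --table-only --overwrite` (the resource file name here is purely illustrative).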
Use the following command to query the OEPS BigQuery tables:

```
python scripts/bq.py export --sql sql/states.sql --output states.shp
```

Where `states.sql` is an example of a file that holds the SQL query to perform against one or more tables. In the SQL, `PROJECT_ID` is a placeholder (it will be replaced with the actual project identifier before the query is performed), such that table references look like `PROJECT_ID.dataset_name.table_name`, or `PROJECT_ID.spatial.states2018` for the table that holds state boundaries.

- `--sql-file` path to a file whose contents are a complete SQL query.
- `--output` is the name of a file to which the query results will be written. Either .csv or .shp files can be specified, and if a spatial result is written to CSV the geometries will be in WKT format. If this argument is omitted, the query results will be printed to the console (helpful for testing queries).

You can write your own SQL into a file and use the same command to perform your query and export the results.
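To make the `PROJECT_ID` placeholder concrete, here is a hedged sketch of what a query file and the substitution step might look like. The column names and join below are illustrative assumptions, not the exact contents of `sql/states.sql`.

```python
# Illustrative only: what a query file using the PROJECT_ID placeholder might
# contain, and how the placeholder could be swapped out before running it.
example_sql = """
SELECT t.HEROP_ID, t.TotPop, s.geom
FROM `PROJECT_ID.tabular.S_2010` AS t
JOIN `PROJECT_ID.spatial.states2018` AS s
  ON t.HEROP_ID = s.HEROP_ID
"""

# bq.py replaces PROJECT_ID with the real Google Cloud project id before the
# query is performed; a plain string replacement is enough for a sketch.
sql = example_sql.replace("PROJECT_ID", "my-gcp-project")
print(sql)
```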
Use the BQ-Reference page for quick access to all table and column names.
A table definition is a JSON file that specifies
- The location of a source dataset to load
- The Google BigQuery project and table name to load into
- A thorough schema defining all fields from the source dataset and how they will be stored in BigQuery
This information is used in various contexts to
- Create (or re-create) table schemas in BigQuery
- Load data into these tables
- Export data from BigQuery into various formats and file types
The structure of a table definition is inspired by the Table Schema specification published by Frictionless Standards, with a few additions for our own use case.
The top-level properties of a table definition are:
| Property | Format | Description |
|---|---|---|
| `path` | String | Path or URL for CSV or SHP dataset to load |
| `bq_table_name` | String | Target table in BigQuery |
| `bq_dataset_name` | String | Target dataset in BigQuery |
| `fields` | List | List of definitions for all table fields |
Note that in BigQuery, a *dataset* is akin to a database in other RDBMS implementations, such that a dataset holds one or more tables. Often, tables are identified by their fully-qualified identifier: `project_id.dataset_name.table_name`.
The `fields` property is a list of one or more field objects, as described below. The only requirement of the `fields` list is that it must contain an entry for a `HEROP_ID` field. This is our unique GIS join field.
Fields are defined by a JSON object that adheres to the field descriptors portion of the table schema standard, though not all possible attributes are required or implemented.
| Property | Format | Description | OEPS Use |
|---|---|---|---|
| `name` | String | Canonical name for this column (used in BigQuery) | Required |
| `title` | String | A nicer human readable label or title for the field | Required |
| `type` | String | A string specifying the type. See types. | Required |
| `format` | String | A string specifying a format | Not Implemented |
| `example` | String | An example value for the field | Optional |
| `description` | String | A description for the field | Optional |
| `constraints` | JSON | A constraints descriptor | Not Implemented |
The following additional attributes are also supported and in some cases required:
| Property | Format | Description | OEPS Use |
|---|---|---|---|
| `src_name` | String | Name of column in source dataset | Required |
| `bq_data_type` | String | Field type for BigQuery schema | Required |
| `theme` | String | One of `Social`, `Environment`, `Economic`, `Policy`, `Outcome`, or `Geography`. See OEPS docs. | Optional |
| `comment` | String | Additional information about the data in this field | Optional |
| `source` | String | Source of the data in this field | Optional |
| `max_length` | Integer | Max length of field (used in BigQuery schema) | Optional |
For now, please see this comment for a detailed description of how `HEROP_ID` values are constructed.
A field descriptor for this field will look something like this:
```
{
  "name": "HEROP_ID",
  "src_name": "HEROP_ID",
  "type": "string",
  "example": "040US01-2018",
  "description": "A derived unique id corresponding to the relevant geographic unit.",
  "theme": "Geography",
  "bq_data_type": "STRING"
}
```
To load a shapefile you must include the following field descriptor in your `fields` list:
```
{
  "name": "geom",
  "title": "Geom",
  "type": "string",
  "src_name": "geometry",
  "bq_data_type": "GEOGRAPHY"
}
```
Note that for spatial data, Table Schema only allows `geojson` or `geopoint` as valid geographic types, while Google BigQuery uses `GEOGRAPHY`. For now, we'll just use the above configuration, and more nuances can be pursued down the road.
The following is a truncated version of a table definition for the 2010 State-level data published in OEPS v2.0. This defines a table `project_id.tabular.S_2010` with two fields, `HEROP_ID` and `TotPop`. Note also the direct URL to the raw `path` on GitHub.
```
{
  "bq_dataset_name": "tabular",
  "bq_table_name": "S_2010",
  "path": "https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/main/data_final/full_tables/S_2010.csv",
  "fields": [
    {
      "name": "HEROP_ID",
      "src_name": "HEROP_ID",
      "type": "string",
      "description": "A derived unique id corresponding to the relevant geographic unit.",
      "constraints": null,
      "theme": "Geography",
      "bq_data_type": "STRING"
    },
    {
      "name": "TotPop",
      "src_name": "TotPop",
      "type": "integer",
      "example": "7294336",
      "description": "Estimated total population",
      "constraints": null,
      "theme": "Social",
      "source": "American Community Survey 2014-2018 5 Year Estimates; 2010 Decennial Census; Integrated Public Use Microdata Service National Historic Geographic Information Systems",
      "comments": "1980, 1990, and 2000 data from respective decennial censuses downloaded from IPUMS NHGIS and aggregated upwards.",
      "bq_data_type": "INTEGER"
    }
  ]
}
```
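As a rough illustration of how a table definition like this maps onto BigQuery, the sketch below builds a client-side schema from the `fields` list and creates the (empty) table. It assumes the `google-cloud-bigquery` package and a hypothetical local path to the definition file; it is not the actual logic of `bq.py`.

```python
# Illustrative only: turn a table definition's "fields" into a BigQuery schema
# and create the empty table, similar in spirit to `bq.py load --table-only`.
import json

from google.cloud import bigquery

with open("resources/S_2010.json") as f:  # hypothetical path to the definition
    definition = json.load(f)

schema = [
    bigquery.SchemaField(field["name"], field["bq_data_type"])
    for field in definition["fields"]
]

client = bigquery.Client()  # assumes credentials are already configured
table_id = f"{client.project}.{definition['bq_dataset_name']}.{definition['bq_table_name']}"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
```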
A processing pipeline pulls cartographic boundary files from the Census FTP site and does the following:
- Merges all state subsets into a single nation-wide dataset
- Adds a couple of useful fields
- Exports to GeoJSON, Shapefile, and PMTiles formats
- Uploads to our S3 bucket for remote access
Only 500k resolution files are used.
A comprehensive example of the command is:

```
flask census get-geodata shp pmtiles geojson -y 2010 --upload --no-cache --verbose --tippecanoe-path /opt/tippecanoe/tippecanoe
```
- `census` is the command group
- `get-geodata` is this particular operation
- `shp` indicates that shapefiles will be generated. Other valid formats are `geojson` and `pmtiles`.
- `-g tract` (multiple allowed) specifies that tract geographies should be processed. Other geographies are: `state`, `county`, `zcta` (zip code tabulation area), `bg` (block group), and `place` (place geographies: cities, towns, villages).
- `-y 2010` indicates that 2010 files should be used. 2018 is also supported.
- `--upload` (optional) will upload to S3 (credentials and bucket name are set elsewhere)
- `--no-cache` (optional) will force re-download of the source files from the FTP
- `--verbose` (optional) extra print statements during the process
- `--tippecanoe-path` (required for pmtiles output) provide a full path to a local tippecanoe binary, used to generate PMTiles
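For a smaller run, a hypothetical invocation that only generates GeoJSON for 2018 state boundaries could look like `flask census get-geodata geojson -g state -y 2018`, using just the options described above.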
section in progress