This project builds from the Opioid Environment Policy Scan (OEPS) data warehouse stored in github.com/GeoDaCenter/opioid-policy-scan, and published on Zenodo at doi.org/10.5281/zenodo.5842465. This repo allows us to pull the final data from the latest OEPS release and push it into Google BigQuery, which will enable new ways of accessing and analyzing that data.
The backend includes commands for managing data transformation to and from different destinations. These commands are organized thematically into the following groups:
- Create and activate a Python virtual environment with venv, Conda, or your other tool of choice.
- Clone this repo:

  ```
  git clone https://github.com/healthyregions/oeps
  cd oeps/backend
  ```

- Install this package and its dependencies:

  ```
  pip install -e .
  ```

- You can now run any scripts with:

  ```
  python ./scripts/script_name.py
  ```
section in progress
To use the BigQuery script (`scripts/bq.py`) you will need to add access credentials as environment variables using the `.env` file. This file must always stay out of version control.
- Make a copy of `.env.example` and name it `.env`. Any variables defined in this file will now be available via `os.getenv('VARIABLE_NAME')`.
- Obtain a set of JSON credentials for the project, and store the file anywhere on your computer (but outside of this repository).
- In your `.env` file, update the `BQ_CREDENTIALS_FILE_PATH` variable with the full path to this file. For example: `BQ_CREDENTIALS_FILE_PATH="/home/my_username/bq_credentials/oeps-391119-5783b2b59b83.json"`
- It is also possible to set BigQuery credentials without storing a local JSON file. More info on this is in the `.env.example` file.
Once you have set up these credentials, you can test them with:

```
python ./scripts/bq.py check-credentials
```

If all is well, the command will print `ok`.
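For reference, the snippet below is a minimal sketch of how a script can pick up these credentials, assuming the `python-dotenv` and `google-cloud-bigquery` packages are installed. It is illustrative only, not the actual implementation of `bq.py`.

```python
# Minimal sketch (illustrative only): read the credential path from .env and
# build a BigQuery client with it.
import os

from dotenv import load_dotenv
from google.cloud import bigquery
from google.oauth2 import service_account

load_dotenv()  # makes variables from .env available via os.getenv()

credentials = service_account.Credentials.from_service_account_file(
    os.getenv("BQ_CREDENTIALS_FILE_PATH")
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# A trivial query confirms the credentials work.
print(list(client.query("SELECT 1").result()))
```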
The `bq.py` script can perform import and export operations on our BigQuery database.
Use the following command to load a new table into BigQuery:

```
python ./scripts/bq.py load --source ./resources
```

Where `--source` is the path to a Data Resource schema (stored as a JSON file), or a directory containing multiple Data Resource schemas. Optional flags on this command are:

- `--table-only` will create the BigQuery dataset and table based on the schema, but will not attempt to load data into it.
- `--dry-run` will validate the input dataset against the schema, but not attempt to load it.
- `--overwrite` will drop and recreate the BigQuery table if it already exists in the dataset.
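For example, a hypothetical invocation that rebuilds a single table's schema without loading any data could be `python ./scripts/bq.py load --source ./resources/S_2010.json --table-only --overwrite` (the resource file name here is purely illustrative).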
Use the following command to query the OEPS BigQuery tables:

```
python scripts/bq.py export --sql sql/states.sql --output states.shp
```

Where `states.sql` is an example of a file that holds the SQL query to perform against one or more tables. In the SQL, `PROJECT_ID` is a placeholder (it will be replaced with the actual project identifier before the query is performed), such that table references look like `PROJECT_ID.dataset_name.table_name`, or `PROJECT_ID.spatial.states2018` for the table that holds state boundaries.

- `--sql-file` path to a file whose contents are a complete SQL query.
- `--output` is the name of a file to which the query results will be written. Either .csv or .shp files can be specified, and if a spatial result is written to CSV the geometries will be in WKT format. If this argument is omitted, the query results will be printed to the console (helpful for testing queries).

You can write your own SQL into a file and use the same command to perform your query and export the results.
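To make the `PROJECT_ID` placeholder concrete, here is a hedged sketch of what a query file and the substitution step might look like. The column names and join below are illustrative assumptions, not the exact contents of `sql/states.sql`.

```python
# Illustrative only: what a query file using the PROJECT_ID placeholder might
# contain, and how the placeholder could be swapped out before running it.
example_sql = """
SELECT t.HEROP_ID, t.TotPop, s.geom
FROM `PROJECT_ID.tabular.S_2010` AS t
JOIN `PROJECT_ID.spatial.states2018` AS s
  ON t.HEROP_ID = s.HEROP_ID
"""

# bq.py replaces PROJECT_ID with the real Google Cloud project id before the
# query is performed; a plain string replacement is enough for a sketch.
sql = example_sql.replace("PROJECT_ID", "my-gcp-project")
print(sql)
```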
Use the BQ-Reference page for quick access to all table and column names.
A table definition is a JSON file that specifies
- The location of a source dataset to load
- The Google BigQuery project and table name to load into
- A thorough schema defining all fields from the source dataset and how they will be stored in BigQuery
This information is used in various contexts to
- Create (or re-create) table schemas in BigQuery
- Load data into these tables
- Export data from BigQuery into various formats and file types
The structure of a table definition is inspired by the Table Schema specification published by Frictionless Standards, with a few additions for our own use case.
The top-level properties of a table definition are:
| Property | Format | Description |
|---|---|---|
| `path` | String | Path or URL for CSV or SHP dataset to load |
| `bq_table_name` | String | Target table in BigQuery |
| `bq_dataset_name` | String | Target dataset in BigQuery |
| `fields` | List | List of definitions for all table fields |
Note that in BigQuery, a *dataset* is akin to a database in other RDBMS implementations, such that a dataset holds one or more tables. Often, tables are identified by their fully-qualified identifier: `project_id.dataset_name.table_name`.
The `fields` property is a list of one or more field objects, as described below. The only requirement of the `fields` list is that it must contain an entry for a `HEROP_ID` field. This is our unique GIS join field.
Fields are defined by a JSON object that adheres to the field descriptors portion of the table schema standard, though not all possible attributes are required or implemented.
| Property | Format | Description | OEPS Use |
|---|---|---|---|
| `name` | String | Canonical name for this column (used in BigQuery) | Required |
| `title` | String | A nicer human readable label or title for the field | Required |
| `type` | String | A string specifying the type. See types. | Required |
| `format` | String | A string specifying a format | Not Implemented |
| `example` | String | An example value for the field | Optional |
| `description` | String | A description for the field | Optional |
| `constraints` | JSON | A constraints descriptor | Not Implemented |
The following additional attributes are also supported and in some cases required:
| Property | Format | Description | OEPS Use |
|---|---|---|---|
| `src_name` | String | Name of column in source dataset | Required |
| `bq_data_type` | String | Field type for BigQuery schema | Required |
| `theme` | String | One of `Social`, `Environment`, `Economic`, `Policy`, `Outcome`, or `Geography`. See OEPS docs. | Optional |
| `comment` | String | Additional information about the data in this field | Optional |
| `source` | String | Source of the data in this field | Optional |
| `max_length` | Integer | Max length of field (used in BigQuery schema) | Optional |
For now, please see this comment for a detailed description of how `HEROP_ID` values are constructed.
A field descriptor for this field will look something like this:
```
{
  "name": "HEROP_ID",
  "src_name": "HEROP_ID",
  "type": "string",
  "example": "040US01-2018",
  "description": "A derived unique id corresponding to the relevant geographic unit.",
  "theme": "Geography",
  "bq_data_type": "STRING"
}
```
To load a shapefile you must include the following field descriptor in your `fields` list:
```
{
  "name": "geom",
  "title": "Geom",
  "type": "string",
  "src_name": "geometry",
  "bq_data_type": "GEOGRAPHY"
}
```
Note that for spatial data, Table Schema only allows `geojson` or `geopoint` as valid geographic types, while Google BigQuery uses `GEOGRAPHY`. For now, we'll just use the above configuration, and more nuances can be pursued down the road.
The following is a truncated version of a table definition for the 2010 State-level data published in OEPS v2.0. This defines a table `project_id.tabular.S_2010` with two fields, `HEROP_ID` and `TotPop`. Note also the direct URL to the raw `path` on GitHub.
```
{
  "bq_dataset_name": "tabular",
  "bq_table_name": "S_2010",
  "path": "https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/main/data_final/full_tables/S_2010.csv",
  "fields": [
    {
      "name": "HEROP_ID",
      "src_name": "HEROP_ID",
      "type": "string",
      "description": "A derived unique id corresponding to the relevant geographic unit.",
      "constraints": null,
      "theme": "Geography",
      "bq_data_type": "STRING"
    },
    {
      "name": "TotPop",
      "src_name": "TotPop",
      "type": "integer",
      "example": "7294336",
      "description": "Estimated total population",
      "constraints": null,
      "theme": "Social",
      "source": "American Community Survey 2014-2018 5 Year Estimates; 2010 Decennial Census; Integrated Public Use Microdata Service National Historic Geographic Information Systems",
      "comments": "1980, 1990, and 2000 data from respective decennial censuses downloaded from IPUMS NHGIS and aggregated upwards.",
      "bq_data_type": "INTEGER"
    }
  ]
}
```
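As a rough illustration of how a table definition like this maps onto BigQuery, the sketch below builds a client-side schema from the `fields` list and creates the (empty) table. It assumes the `google-cloud-bigquery` package and a hypothetical local path to the definition file; it is not the actual logic of `bq.py`.

```python
# Illustrative only: turn a table definition's "fields" into a BigQuery schema
# and create the empty table, similar in spirit to `bq.py load --table-only`.
import json

from google.cloud import bigquery

with open("resources/S_2010.json") as f:  # hypothetical path to the definition
    definition = json.load(f)

schema = [
    bigquery.SchemaField(field["name"], field["bq_data_type"])
    for field in definition["fields"]
]

client = bigquery.Client()  # assumes credentials are already configured
table_id = f"{client.project}.{definition['bq_dataset_name']}.{definition['bq_table_name']}"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
```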
A processing pipeline pulls cartographic boundary files from the Census FTP site and does the following:
- Merges all state subsets into a single nation-wide dataset
- Adds a couple of useful fields
- Exports to GeoJSON, Shapefile, and PMTiles formats
- Uploads to our S3 bucket for remote access
Only 500k resolution files are used.
A comprehensive example of the command is:

```
flask census get-geodata shp pmtiles geojson -y 2010 --upload --no-cache --verbose --tippecanoe-path /opt/tippecanoe/tippecanoe
```
- `census` is the command group
- `get-geodata` is this particular operation
- `shp` indicates that shapefiles will be generated. Other valid formats are `geojson` and `pmtiles`.
- `-g tract` (multiple allowed) specifies that tract geographies should be processed. Other geographies are: `state`, `county`, `zcta` (zip code tabulation area), `bg` (block group), and `place` (place geographies: cities, towns, villages).
- `-y 2010` indicates that 2010 files should be used. 2018 is also supported.
- `--upload` (optional) will upload to S3 (credentials and bucket name are set elsewhere)
- `--no-cache` (optional) will force re-download of the source files from the FTP
- `--verbose` (optional) extra print statements during the process
- `--tippecanoe-path` (required for pmtiles output) provide a full path to a local tippecanoe binary, used to generate PMTiles
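For a smaller run, a hypothetical invocation that only generates GeoJSON for 2018 state boundaries could look like `flask census get-geodata geojson -g state -y 2018`, using just the options described above.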
section in progress