vignettes/articles/pas_introduction.Rmd
Synoptic data provides a synopsis - a comprehensive view of something at a moment in time. This vignette demonstrates an example workflow for exploring air quality synoptic data using the AirSensor R package and data captured by PurpleAir air quality sensors.
PurpleAir sensor readings are uploaded to the cloud every 120 seconds. (Every 80 seconds prior to a May 31, 2019 firmware upgrade.) Data are processed by PurpleAir and a version of the data is displayed on the PurpleAir website.
You can generate a current PurpleAir Synoptic (PAS) object (hereafter called a `pas`) by using the `pas_createNew()` function. A `pas` object is just a large dataframe with 44 data columns and a record for each PurpleAir sensor channel (2 channels per sensor).
The `pas_createNew()` function performs the following tasks under the hood:

1. Download a raw dataset of the entire PurpleAir network that includes both metadata and recent PM2.5 averages for each deployed sensor across the globe. See `downloadParseSynopticData()` for more info.
2. Subset and enhance the raw dataset by replacing variables with more consistent, human-readable names and adding spatial metadata for each sensor, including the nearest official air quality monitor. For a more in-depth explanation, see `enhanceSynopticData()`.
To create a new `pas` object you must first properly initialize the MazamaSpatialUtils package. The following example will create a brand new `pas` object with up-to-the-minute data.

NOTE: This can take up to a minute to process.
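A minimal sketch of that setup, assuming the required spatial datasets have already been installed under `~/Data/Spatial` and that `initializeMazamaSpatialUtils()` is available to perform the initialization:

```r
library(AirSensor)
library(MazamaSpatialUtils)

# Initialize the spatial data used to add countryCode/stateCode metadata
# (assumes spatial datasets are installed under ~/Data/Spatial)
setSpatialDataDir("~/Data/Spatial")
initializeMazamaSpatialUtils()

# Download, parse and enhance the latest synoptic data
# (can take up to a minute)
pas <- pas_createNew()
```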
It is also possible to load pre-generated `pas` objects from a data archive. These objects are updated regularly throughout each day and are typically used by other package functions, primarily for the location metadata they contain. Archived `pas` objects from previous days will thus have data associated with near midnight of that date.
The archived `pas` objects can be loaded very quickly with the `pas_load()` function, which obtains `pas` objects from the archive specified with `setArchiveBaseUrl()`. When used without specifying the `datestamp` argument, `pas_load()` will obtain the most recently processed `pas` object – typically less than an hour old.
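For example (the archive URL below is a placeholder; substitute the base URL of whichever `pas` archive you have access to):

```r
library(AirSensor)

# Point the package at a pas archive (placeholder URL)
setArchiveBaseUrl("https://example.org/PurpleAir/v1")

# Most recently processed pas object
pas <- pas_load()

# Or an archived pas object for a specific day (YYYYMMDD datestamp)
pas_archived <- pas_load(datestamp = "20190701")
```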
The `pas` dataset contains 44 columns, and each row corresponds to a different PurpleAir sensor channel. For the data analysis examples we will focus on the columns labeled `stateCode`, `pm25_*`, `humidity`, `pressure`, `temperature`, and `pwfsl_closestDistance`.
The complete list of columns is given below. Names in `ALL_CAPS` have been retained from the PurpleAir .json file. Other columns have been renamed for human readability.
## [1] "ID" "label"
## [3] "DEVICE_LOCATIONTYPE" "THINGSPEAK_PRIMARY_ID"
## [5] "THINGSPEAK_PRIMARY_ID_READ_KEY" "THINGSPEAK_SECONDARY_ID"
## [7] "THINGSPEAK_SECONDARY_ID_READ_KEY" "latitude"
## [9] "longitude" "pm25"
## [11] "lastSeenDate" "sensorType"
## [13] "flag_hidden" "isOwner"
## [15] "humidity" "temperature"
## [17] "pressure" "age"
## [19] "parentID" "flag_highValue"
## [21] "flag_attenuation_hardware" "Ozone1"
## [23] "pm25_current" "pm25_10min"
## [25] "pm25_30min" "pm25_1hr"
## [27] "pm25_6hr" "pm25_1day"
## [29] "pm25_1week" "statsLastModifiedDate"
## [31] "statsLastModifiedInterval" "countryCode"
## [33] "stateCode" "timezone"
## [35] "deviceID" "locationID"
## [37] "deviceDeploymentID" "airDistrict"
## [39] "pwfsl_closestDistance" "pwfsl_closestMonitorID"
## [41] "sensorManufacturer" "targetPollutant"
## [43] "technologyType" "communityRegion"
Let’s take a quick peek at some of the PM2.5 data:
# Extract and round just the PM2.5 data
pm25_data <-
pas %>%
select(starts_with("pm25_")) %>%
round(1)
# Combine sensor label and pm2.5 data
bind_cols(label = pas$label, pm25_data) %>%
head(10) %>%
knitr::kable(
col.names = c("label", "current", "10 min", "30 min", "1 hr", "6 hr", "1 day", "1 wk"),
caption = "PAS PM2.5 Values"
)
label | current | 10 min | 30 min | 1 hr | 6 hr | 1 day | 1 wk |
---|---|---|---|---|---|---|---|
Hazelwood canary | 40.0 | 39.2 | 40.0 | 42.5 | 60.3 | 75.4 | 45.5 |
Hazelwood canary B | 38.5 | 37.4 | 38.1 | 40.5 | 57.0 | 72.0 | 43.2 |
WC Hillside | 47.4 | 49.2 | 49.9 | 50.7 | 75.2 | 100.6 | 67.1 |
WC Hillside B | 43.2 | 45.3 | 45.7 | 46.3 | 66.9 | 90.0 | 60.9 |
#ValleyClimate | 92.9 | 93.5 | 95.3 | 98.0 | 140.2 | 140.3 | 71.5 |
#ValleyClimate B | 106.7 | 102.0 | 104.0 | 106.5 | 151.6 | 151.8 | 76.6 |
‘S’ St Between Inyo and Mono | 77.7 | 73.2 | 71.1 | 71.5 | 98.4 | 96.9 | 56.2 |
‘S’ St Between Inyo and Mono B | 91.2 | 85.9 | 83.8 | 84.6 | 116.2 | 113.3 | 62.6 |
(Indoors) Lansing St | 0.9 | 0.8 | 0.9 | 1.2 | 3.9 | 5.7 | 4.1 |
(Indoors) Lansing St B | 0.2 | 0.4 | 0.4 | 0.7 | 2.8 | 4.3 | 3.2 |
## `pas` PM2.5 Data

To visually explore a region, we can use our `pas` data with the `pas_leaflet()` function to plot an interactive leaflet map. By default, `pas_leaflet()` will map the coordinates of each PurpleAir sensor and the hourly PM2.5 data. Clicking on a sensor will show sensor metadata.
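In its simplest form this is a one-liner (a sketch; the map itself is interactive and not reproduced here):

```r
# Interactive map of all sensors, colored by hourly PM2.5
pas %>%
  pas_leaflet()
```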
If we want to narrow our selection, for example to California, we can look at which locations have a moderate-to-unhealthy 6-hour average air quality rating with the following short script that uses the `%>%` "pipe" operator:
pas %>%
pas_filter(stateCode == 'CA') %>%
pas_filter(pm25_6hr >= 25.0) %>%
pas_leaflet(parameter = "pm25_6hr")
This code pipes our `pas` data into `pas_filter()`, where we can set our selection criteria. The `stateCode` is the ISO 3166-2 state code, which tells `pas_filter()` to subset for only those stations in California. The `pm25_6hr >= 25.0` filter selects those records where the 6-hour average is 25.0 or above. The final function in the pipeline plots the remaining sensors colored by `pm25_6hr`.
## `pas` Auxiliary Data

We can also explore and utilize other PurpleAir sensor data. Check the `pas_leaflet()` documentation for all supported parameters.
Here is an example of humidity data captured from PurpleAir sensors across the state of California.
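A sketch of how such a map can be produced, assuming `"humidity"` is one of the parameters accepted by `pas_leaflet()`:

```r
# Humidity readings from California sensors
pas %>%
  pas_filter(stateCode == 'CA') %>%
  pas_leaflet(parameter = "humidity")
```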
Because the `pas` object is a dataframe, we can use functionality from various other R packages. For example, `pas` data is compatible with "tidyverse" syntax, so we can use dplyr, ggplot2, and sf to help transform, summarize, and visualize `pas` data in a tidy way.

Below, we demonstrate the flexibility of AirSensor's `pas` data objects with a few examples of exploratory data analysis pipelines, focusing on the state of California and its PurpleAir sensor data.
# Load the extra libraries
library(sf)
library(ggplot2)
# Load spatial data with US counties
MazamaSpatialUtils::setSpatialDataDir('~/Data/Spatial/')
MazamaSpatialUtils::loadSpatialData("USCensusCounties")
# Subset to California and convert to an 'sf' spatial object
ca_counties <-
subset(USCensusCounties, stateCode == "CA") %>%
st_as_sf()
# Subset the 'pas' data to California and convert to `sf`
ca_pas <-
pas %>%
pas_filter(stateCode == 'CA') %>%
# Be sure to set the reference system
st_as_sf(
coords = c('longitude', 'latitude'), # identify coordinate variables
crs = st_crs(ca_counties) # set the reference system
)
Using our California county spatial data, we can count the number of PurpleAir sensors that exist in each region "feature" (aka polygon). This is an important metric for determining statistical confidence – i.e. the more sensors a region contains, the more confidence we can have in any county-wide, aggregated statistic.
(The `mutate()` function comes from dplyr while `st_contains()` comes from sf. Some familiarity with these packages is required.)
# Count the number of PurpleAir sensors in each County polygon feature
n_sensors <-
ca_counties %>%
mutate(sensorCount = lengths(st_contains(., ca_pas)))
Let’s take a look at the counties that contain the most PurpleAir sensors.
n_sensors %>%
select(sensorCount, name) %>%
filter(sensorCount > 50) %>%
st_drop_geometry() %>%
arrange(desc(sensorCount)) %>%
knitr::kable(
col.names = c("Sensor Count", "County"),
caption = "Counties with >50 Sensors"
)
Sensor Count | County |
---|---|
1365 | Santa Clara |
1279 | Alameda |
991 | Los Angeles |
990 | San Mateo |
863 | San Francisco |
742 | Contra Costa |
546 | Marin |
470 | Sonoma |
311 | Sacramento |
198 | Santa Cruz |
192 | Ventura |
188 | Solano |
187 | Riverside |
182 | Orange |
150 | San Bernardino |
146 | El Dorado |
120 | Yolo |
119 | San Diego |
112 | Placer |
96 | Fresno |
96 | Shasta |
92 | San Luis Obispo |
88 | Monterey |
82 | Santa Barbara |
82 | Napa |
76 | Nevada |
70 | Kern |
54 | Siskiyou |
52 | Mendocino |
We can visualize the number of PurpleAir sensors per county using the tmap package. Here we create a "choropleth" map, where each county is filled with a color representing the number of PurpleAir sensors it contains.
# Set up breaks
my_breaks = c(0,20,50,100,200,500,1000,Inf)
my_labels <- c("0 to 20", "20 to 50", "50 to 100", "100 to 200",
"200 to 500", "500 to 1,000", "1,000 or more")
tm_shape(n_sensors) +
tm_polygons(
col = "sensorCount",
palette = "YlOrBr",
breaks = my_breaks
)
Sometimes a barplot is preferable for reading off values. Here is an example showing the top 15 counties with the most active PurpleAir sensors.
# Sort the counties by the number of PurpleAir sensors and keep the top 15
sorted_n_sensors <-
n_sensors %>%
arrange(desc(sensorCount)) %>%
head(15)
ggplot(sorted_n_sensors) +
geom_col(aes(
x = reorder(name, sensorCount),
y = sensorCount,
# Bin the counts with the breaks and labels defined above
fill = cut(sensorCount, breaks = my_breaks, labels = my_labels)
)) +
scale_fill_brewer(
palette = "YlOrBr",
direction = 1
) +
labs(fill = "Sensor Count") +
xlab('County') +
theme_minimal() +
coord_flip()
Our `pas` object contains the latest ~2-minute resolution PM2.5 data (`pm25_current`) as well as other, longer-interval averages of PM2.5, e.g. `pm25_1week`, which we can use to calculate region-averaged weekly PM2.5 sensor data.
# Join the PurpleAir and county data and calculate the mean per county
mean_pm25 <-
ca_counties %>%
st_join(ca_pas) %>%
group_by(name) %>%
summarise(pm25_weeklyMean = round(mean(pm25_1week, na.rm = TRUE), 1))
mean_pm25 %>%
select(name, pm25_weeklyMean) %>%
st_drop_geometry() %>%
arrange(desc(pm25_weeklyMean)) %>%
head(20) %>%
knitr::kable(
col.names = c("County", "Mean Sensor PM2.5"),
caption = "Counties with Highest weekly PM2.5"
)
County | Mean Sensor PM2.5 |
---|---|
Plumas | 156.1 |
Madera | 141.7 |
Butte | 104.1 |
Mariposa | 102.3 |
Yuba | 93.1 |
Lassen | 87.7 |
Tuolumne | 83.1 |
Siskiyou | 73.4 |
Trinity | 67.1 |
Sutter | 65.7 |
Mono | 65.3 |
Calaveras | 65.0 |
Fresno | 62.7 |
Glenn | 62.6 |
Del Norte | 62.5 |
Tehama | 62.4 |
Shasta | 60.6 |
Amador | 59.9 |
San Joaquin | 59.8 |
Tulare | 57.4 |
It is important to clarify that this data does not represent a region’s “spatially averaged” air quality. It is only a simple summary of data from the available PurpleAir sensors without any regard for how they are spatially distributed.
In a similar fashion to that illustrated above, we can also create a choropleth map of per-county 1-week PurpleAir averages.
# Set up breaks
my_breaks <- seq(0, 40, 5)
my_labels <- paste(my_breaks[1:8], my_breaks[2:9], sep = " to ")
tm_shape(mean_pm25) +
tm_polygons(
col = "pm25_weeklyMean",
palette = "YlOrBr",
breaks = my_breaks
)
Just as before, a barplot is sometimes a better choice. Below is an example plotting the top 15 county-level 1-week PurpleAir averages.
# Sort the weekly averages per county and select the top 15
top15_mean_pm25 <-
mean_pm25 %>%
arrange(desc(pm25_weeklyMean)) %>%
head(n = 15)
# NOTE: The color legend may not be identical to the legend above because we
# NOTE: are only plotting a subset of all counties.
ggplot(top15_mean_pm25) +
geom_col(aes(
x = reorder(name, pm25_weeklyMean),
y = pm25_weeklyMean,
fill = cut(pm25_weeklyMean, breaks = my_breaks, labels = my_labels)
)) +
scale_fill_brewer(
palette = "YlOrBr",
direction = 1
) +
labs(fill = "Weekly Mean PM2.5") +
xlab('County') +
theme_minimal() +
coord_flip()
Happy Exploring!
Mazama Science