Treemendous: An R Package for Standardizing Taxonomical Names of Tree Species

Treemendous is an open-source package for the R programming environment that provides a toolset for standardizing tree species names and translating between different databases according to four publicly available backbones: World Flora Online (WFO), Botanic Gardens Conservation International (BGCI), the World Checklist of Vascular Plants (WCVP) and the Global Biodiversity Information Facility (GBIF). The package simultaneously leverages information and relationships across all these backbones to increase matching rates and minimize data loss, while ensuring the resulting species are accepted and consistent with a single reference backbone. It provides a flexible workflow in which users can chain together different functionalities, ranging from simple matching against a single backbone to graph-based iterative matching that uses synonym-accepted relations across all backbones in the database. In addition, the package allows users to translate one tree species list into another, streamlining the assimilation of new data into preexisting datasets or models.

This README provides installation instructions and worked examples. For more detailed information, please refer to the reference manual or the publication associated with this package [Add link to pub].

Reference Manual

The latest version of the reference manual is available here.

Package Installation

library(devtools)
install_github("speckerf/treemendous")

If you encounter problems installing devtools, install Treemendous with the remotes package instead:

remotes::install_github("speckerf/treemendous")

Alternative Installation: Run with Docker

If the installation was not successful for any reason, we provide a Docker image with the package preinstalled. The image is available on Docker Hub as 'speckerf/treemendous'.

Download Docker Desktop Client

  • Download and install Docker Desktop from https://www.docker.com/products/docker-desktop/, or follow your system-specific Docker installation instructions (e.g., for some Linux distributions).
  • Start the Docker Desktop application.
  • Keep in mind that the image was built for AMD64 and might cause issues on machines with Apple silicon (M1/M2). On these machines, go to Docker Settings > Features in development and turn on "Use Rosetta for x86/amd64 emulation on Apple Silicon" before pulling the Docker image.

The major steps are described below.

Pull the image

Open a terminal and navigate to the desired location. Then pull the image from Docker Hub with:

docker pull speckerf/treemendous

Run the container

Start the Docker container with:

docker run \
           -p 8888:8787 \
           -e PASSWORD=password \
           speckerf/treemendous

Then open http://localhost:8888/ in your browser.

  • This should open an RStudio interface. Log in with username 'rstudio' and password 'password'.

Load the package with library(treemendous).

Note that the Docker container cannot see any data on your local machine unless you mount a directory. To mount your current working directory, use the following command (if $(pwd) doesn't work in your terminal, use the absolute path instead):

docker run --rm \
           -p 8888:8787 \
           -e PASSWORD=password \
           -v "$(pwd)":/home/rstudio \
           speckerf/treemendous

Example

Species List Preparation

All functions of Treemendous require the species name to be split into two columns, Genus and Species, with the genus name capitalized. For example, given the two species Acer platanoides and Fagus sylvatica, you can create the input tibble as follows:

### Species list preparation
library(tidyverse)
species <- c('Acer platanoides', 'Fagus sylvatica')
input <- species %>%
  tibble::as_tibble_col(column_name = 'binomial') %>%
  tidyr::separate(col = 'binomial', into = c('Genus', 'Species'))
input

Other useful functions for creating the input tibble include:

readr::read_csv('path') # import data
dplyr::select(Genus, Species) # select columns
dplyr::distinct(Genus, Species) # remove duplicate binomials
dplyr::rename('Genus' = 'old_genus_name',
                'Species' = 'old_species_name') # rename columns
dplyr::mutate(Genus = stringr::str_to_title(Genus)) # capitalize Genus
dplyr::mutate(Species = stringr::str_remove(Species, ".*?\\s")) # remove everything before first space
tidyr::drop_na(c('Genus', 'Species')) # remove rows with NAs in either column
dplyr::arrange(Genus, Species) # sort names
dplyr::bind_rows(x, y) # concatenate two tibbles
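
The helpers listed above are fragments meant to be used inside a pipeline. As a minimal sketch of how they might be chained, the following example cleans a small made-up table. The table raw, its column names old_genus_name and old_species_name, and its values are hypothetical and only serve as an illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical raw table: column names and values are made up for illustration
raw <- tibble::tibble(
  old_genus_name   = c("acer", "Fagus", "acer", NA),
  old_species_name = c("Acer platanoides", "sylvatica", "Acer platanoides", "alba")
)

input <- raw %>%
  dplyr::rename(Genus = old_genus_name, Species = old_species_name) %>%
  dplyr::mutate(Genus = stringr::str_to_title(Genus)) %>%              # capitalize Genus
  dplyr::mutate(Species = stringr::str_remove(Species, ".*?\\s")) %>%  # drop a leading genus, if present
  tidyr::drop_na(Genus, Species) %>%                                   # remove incomplete rows
  dplyr::distinct(Genus, Species) %>%                                  # remove duplicate binomials
  dplyr::arrange(Genus, Species)                                       # sort names
input
```

The order of the steps is a design choice: capitalizing and trimming before distinct() ensures that spelling variants such as 'acer' and 'Acer' collapse into a single row.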

FIA: Standardize species names from the U.S. Forest Inventory and Analysis program.

The package ships with an example dataset, fia, containing 2171 different tree species names. Suppose we want to standardize these names according to a particular backbone, selected with the backbone argument. The function summarize_output() then provides a summary of the matching process.

library(treemendous)
result <- fia %>% matching(backbone = 'BGCI')
summarize_output(result)

Of the 2171 species names in total, we were able to match 1822 according to the backbone BGCI: 1779 names matched exactly, and 43 matched via fuzzy- and suffix-matching. Besides information about the matching process, the output contains the old names (prefix Orig.) as well as the matched names (prefix Matched.), as follows:

result %>% 
  dplyr::slice_head(n=3) %>%
  dplyr::select(1:5)
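
As a quick sanity check, the counts quoted above can be combined into an overall match rate. This sketch uses only the numbers stated in this README and does not call the package. The variable names are illustrative.

```r
# Counts quoted in the text above for backbone = 'BGCI'
n_total <- 2171
n_exact <- 1779
n_fuzzy <- 43    # fuzzy- and suffix-matching combined

n_matched  <- n_exact + n_fuzzy    # total number of matched names
match_rate <- n_matched / n_total  # fraction of input names matched

n_matched
round(100 * match_rate, 1)         # match rate in percent
```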

We can further increase the number of matched species by using the functions matching() followed by enforce_matching(). Here, we specify the backbone BGCI.

result <- fia %>% 
  matching(backbone = 'BGCI') %>% 
  enforce_matching(backbone = 'BGCI')
result %>% summarize_output()

Now we are able to match 2097 species names in total, with 275 of them matched via enforce_matching(). Note that the number of distinct matched species names is lower, at 2044, because several input species were matched to the same species in the target database BGCI.
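
To see which input names collapse onto the same matched name, one option is to count how often each matched binomial occurs. The sketch below runs on a small made-up tibble rather than on the real output. The column names Orig.Genus, Orig.Species, Matched.Genus and Matched.Species are assumed from the prefixes described above, and the taxon names are placeholders.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for matching output: names and values are placeholders
result_mock <- tibble::tibble(
  Orig.Genus      = c("Genusa", "Genusa", "Genusb"),
  Orig.Species    = c("alpha",  "alfa",   "beta"),
  Matched.Genus   = c("Genusa", "Genusa", "Genusb"),
  Matched.Species = c("alpha",  "alpha",  "beta")
)

# Number of distinct matched binomials
n_distinct_matched <- result_mock %>%
  dplyr::distinct(Matched.Genus, Matched.Species) %>%
  nrow()

# Input rows that share their matched binomial with at least one other input row
collapsed <- result_mock %>%
  dplyr::add_count(Matched.Genus, Matched.Species, name = "n_inputs") %>%
  dplyr::filter(n_inputs > 1)

n_distinct_matched
collapsed
```

On real output, rows without a match would probably need to be excluded first, since their Matched. columns are presumably empty.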

Note that if we choose a backbone other than BGCI, species can be matched to names that are not accepted (synonyms). We can resolve these synonyms after matching by calling the function resolve_synonyms(). The output then additionally contains the accepted species names (prefix Accepted.), as well as a column Accepted.Backbone, which states according to which backbone the synonym was resolved.

result <- fia %>% 
  matching('WFO') %>% 
  resolve_synonyms('WFO')
result %>% 
  dplyr::slice_head(n=3) %>% 
  dplyr::select(dplyr::matches('Orig|Matched|Accepted'), -'matched')

Note that this produces the warning message "Please consider calling highlight_flags() to investigate potential ambiguities upon resolving synonyms to accepted names". Ambiguities may have arisen while resolving synonyms in your dataset, and we suggest using highlight_flags() to learn more and to decide whether to check them manually. The highlight_flags() function should be used separately from the others, because it returns only the species that carry at least one flag rather than the full dataset. Note also that each entry can have multiple flags:

flags <- result %>% highlight_flags('WFO')
flags %>% 
  dplyr::slice_head(n=3) %>% 
  dplyr::select(dplyr::matches('Acc|ambiguity|link'))

We can see the full breakdown of these flags as follows:

flags %>% dplyr::select(dplyr::contains("WFO")) %>% dplyr::summarize_all(.funs = sum)

The bulk of these flags denote an infraspecific_ambiguity, which can generally be ignored, provided that the user did not manually truncate any trinomials to binomials for input. The 37 infraspecific_link flags are likewise typically not problematic, as these simply highlight cases where the input binomial differs from the output binomial via a trinomial link at some point in the graph. The remaining 142 authorship_ambiguity flags are the most problematic, as these indicate taxa that have multiple conflicting matches. These should be explored manually and used with caution.
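
A minimal sketch of how the entries needing manual review might be extracted is shown below. It runs on a made-up tibble. The exact flag column names in the real output are not stated here, so the names WFO_infraspecific_ambiguity, WFO_infraspecific_link and WFO_authorship_ambiguity are assumptions, as is the idea that the flags are stored as logical columns.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for highlight_flags() output: column names are assumed
flags_mock <- tibble::tibble(
  Orig.Genus   = c("Genusa", "Genusb", "Genusc"),
  Orig.Species = c("alpha", "beta", "gamma"),
  WFO_infraspecific_ambiguity = c(TRUE, FALSE, TRUE),
  WFO_infraspecific_link      = c(FALSE, FALSE, TRUE),
  WFO_authorship_ambiguity    = c(FALSE, TRUE, TRUE)
)

# Keep rows where any authorship-ambiguity column is TRUE
to_review <- flags_mock %>%
  dplyr::filter(dplyr::if_any(dplyr::matches("authorship_ambiguity"), ~ .x))

to_review
```

Selecting columns with matches() rather than by exact name keeps the filter working if several backbones each contribute their own authorship-ambiguity column.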

Instead of using a single backbone, the user can choose any subset of the backbones c('BGCI', 'WFO', 'WCVP', 'GBIF'), or use all of them by calling matching() without any argument. While matching() treats all backbones as equally important, the function sequential_matching() calls matching() for individual backbones sequentially, in the order given. For every species, the matched backbone is provided in the column Matched.Backbone.

result <- fia %>% 
  sequential_matching(sequential_backbones = c('BGCI', 'WFO', 'WCVP'))
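
To check how many names each backbone contributed, the Matched.Backbone column can be tabulated. The following sketch uses a made-up tibble in place of the real output. The values, and the assumption that unmatched rows carry NA in Matched.Backbone, are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for sequential_matching() output
seq_mock <- tibble::tibble(
  Orig.Genus       = c("Genusa", "Genusb", "Genusc", "Genusd"),
  Orig.Species     = c("alpha", "beta", "gamma", "delta"),
  Matched.Backbone = c("BGCI", "BGCI", "WFO", NA)
)

# Number of input names matched per backbone (NA = assumed unmatched)
backbone_counts <- seq_mock %>%
  dplyr::count(Matched.Backbone, name = "n_species")

backbone_counts
```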

Remember that matching() and sequential_matching() match against any species in the database and can therefore return synonyms rather than accepted species. To obtain only accepted species, call resolve_synonyms() after the matching function.

Translate species names between two databases.

Researchers often need to integrate multi-modal data from different sources for their analyses. Here, we demonstrate the function translate_trees(), which allows a user to directly translate names from an input database to a target database. First, we resolve both databases individually according to a single backbone (WFO) and compare the resolved names. Then, we use translate_trees() to translate the input species names into the target names.

input <- tibble::tibble(
  Genus = c('Aria', 'Ardisia', 'Malus'),
  Species = c('umbellata', 'japonica', 'sylvestris')
)
target <- tibble::tibble(
  Genus = c('Sorbus', 'Ardisia', 'Malus'),
  Species = c('umbellata', 'montana', 'orientalis')
)
input %>%
  matching(backbone = 'WFO') %>%
  resolve_synonyms('WFO') %>%
  dplyr::select(1:6)
target %>%
  matching(backbone = 'WFO') %>%
  resolve_synonyms('WFO') %>%
  dplyr::select(1:6)

Resolving both sets individually leads to a mismatch: Malus orientalis and Malus sylvestris were resolved to two different names. Now let's see whether translate_trees() can be used to match all three species:

translate_trees(df = input, target = target) %>% 
  dplyr::select(1:4) 

Essentially, all three species names can be translated from the input set to the target set. Incorporating the knowledge of the desired target names, the function leverages the information about synonym-accepted relations in the three backbones WFO, WCVP and GBIF and is able to translate Malus sylvestris into Malus orientalis.
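
The mismatch found when resolving both lists individually can also be detected programmatically by comparing the accepted names of the two resolved tables. The sketch below uses two made-up tables with placeholder taxon names. The column names Accepted.Genus and Accepted.Species are assumed from the Accepted. prefix described earlier.

```r
library(dplyr)
library(tibble)

# Hypothetical resolved tables: taxon names are placeholders, not real results
input_resolved <- tibble::tibble(
  Accepted.Genus   = c("Genusa", "Genusb", "Genusc"),
  Accepted.Species = c("alpha", "beta", "gamma")
)
target_resolved <- tibble::tibble(
  Accepted.Genus   = c("Genusa", "Genusb", "Genusc"),
  Accepted.Species = c("alpha", "beta", "delta")
)

# Input names whose accepted name has no counterpart in the target list
unpaired <- dplyr::anti_join(
  input_resolved, target_resolved,
  by = c("Accepted.Genus", "Accepted.Species")
)

unpaired
```

Rows that remain in unpaired are the candidates for which translate_trees() may still find a link through synonym-accepted relations.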

Overview of functionality

Please refer to the documentation for a detailed description of the functions: treemendous_1.1.1.pdf
