Treemendous is an open-source software package for the R programming environment that provides a toolset for standardizing tree species names and translating between different databases according to four publicly available backbones: World Flora Online (WFO), Botanic Gardens Conservation International (BGCI), the World Checklist of Vascular Plants (WCVP) and the Global Biodiversity Information Facility (GBIF). The package simultaneously leverages information and relationships across all these backbones to increase matching rates and minimize data loss, while ensuring the resulting species are accepted and consistent with a single reference backbone. The package provides a flexible workflow depending on the use case, in which users can chain together different functionalities ranging from simple matching to a single backbone, to graph-based iterative matching using synonym-accepted relations across all backbones in the database. In addition, the package allows users to 'translate' one tree species list into another, streamlining the assimilation of new data into preexisting datasets or models. This readme provides installation instructions and worked examples. For more detailed information, please refer to the reference manual or the publication associated with this package [Add link to pub].
The latest version of the reference manual is available here.
library(devtools)
install_github("speckerf/treemendous")
If you encounter problems installing devtools, install the package with remotes instead:
remotes::install_github("speckerf/treemendous")
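The two installation routes above can be combined into one guarded call. The following is a minimal sketch; `install_treemendous()` is a hypothetical convenience wrapper and not part of the package. It uses only `requireNamespace()`, `install.packages()` and `remotes::install_github()`.

```r
# Hypothetical convenience wrapper (not part of treemendous): installs the
# package from GitHub only if it is not already available.
install_treemendous <- function() {
  if (requireNamespace("treemendous", quietly = TRUE)) {
    return(invisible(TRUE))  # already installed, nothing to do
  }
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")  # lightweight alternative to devtools
  }
  remotes::install_github("speckerf/treemendous")
  invisible(requireNamespace("treemendous", quietly = TRUE))
}

# install_treemendous()  # uncomment to run; needs internet access
```

The call is left commented out so that sourcing the snippet does not trigger an installation by accident.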
If for any reason the installation was not successful, we provide a Docker image with the package already preinstalled. The Docker image is available on Dockerhub at 'speckerf/treemendous'.
Download Docker Desktop Client
- Download and install here https://www.docker.com/products/docker-desktop/, or follow your system-specific Docker installation (e.g., for some Linux distributions)
- Start Docker Desktop Application
- Keep in mind the image was built for AMD64 and might cause issues on machines with Apple M1/M2. For these machines, go to Docker settings > Features in development and turn on "Use Rosetta for x86/amd64 emulation on Apple Silicon" before pulling the Docker image.
The major steps are described below:
Pull the image
Open a terminal and navigate to the desired location. Then pull the image from Dockerhub with:
docker pull speckerf/treemendous
Run the container
Start the container, mapping port 8787 inside the container to port 8888 on your machine:
docker run \
-p 8888:8787 \
-e PASSWORD=password \
speckerf/treemendous
Go to your browser: open http://localhost:8888/
- this should open an RStudio interface: log in with username 'rstudio' and password 'password'
Load the package with library(treemendous).
Note that the Docker container cannot see any data on your local machine unless you mount a directory. To mount your current working directory, use the following command (if $(pwd) doesn't work in your terminal, you can use the absolute path):
docker run --rm \
-p 8888:8787 \
-e PASSWORD=password \
-v $(pwd):/home/rstudio \
speckerf/treemendous
All functions of Treemendous require the species name to be split into two columns, Genus and Species, with the former being capitalized. Assuming you have two species, Acer platanoides and Fagus sylvatica, you can create the input tibble by calling:
### Species list preparation
library(tidyverse)
species <- c('Acer platanoides', 'Fagus sylvatica')
input <- species %>%
tibble::as_tibble_col(column_name = 'binomial') %>%
tidyr::separate(col = 'binomial', into = c('Genus', 'Species'))
input
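Before passing a table to the matching functions, it can help to check it against the requirements stated above. The sketch below uses a hypothetical helper, `check_input()`, which is not part of Treemendous. It assumes that "capitalized" means the first letter is upper case and the remaining letters are lower case, and it returns the rows that violate this or that contain a missing Species.

```r
# Hypothetical helper (not part of treemendous): returns the rows of `df`
# that do not satisfy the input requirements described above.
check_input <- function(df) {
  if (!all(c("Genus", "Species") %in% names(df))) {
    stop("Input needs the columns 'Genus' and 'Species'.")
  }
  # Assumption: capitalized = first letter upper case, rest lower case
  expected <- paste0(toupper(substr(df$Genus, 1, 1)),
                     tolower(substring(df$Genus, 2)))
  capitalized <- df$Genus == expected
  bad <- is.na(capitalized) | !capitalized | is.na(df$Species)
  df[bad, , drop = FALSE]
}

# Toy example: the second genus is not capitalized
example_input <- data.frame(
  Genus = c("Acer", "fagus"),
  Species = c("platanoides", "sylvatica"),
  stringsAsFactors = FALSE
)
problems <- check_input(example_input)
problems
```

An empty result means the table has the expected shape. Any rows returned can be fixed with the helper functions listed in this section.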
Other useful functions for creating the input tibble include:
readr::read_csv('path') # import data
dplyr::select(Genus, Species) # select columns
dplyr::distinct(Genus, Species) # remove duplicate binomials
dplyr::rename('Genus' = 'old_genus_name',
'Species' = 'old_species_name') # rename columns
dplyr::mutate(Genus = stringr::str_to_title(Genus)) # capitalize Genus
dplyr::mutate(Species = stringr::str_remove(Species, ".*?\\s")) # remove everything before first space
tidyr::drop_na(c('Genus', 'Species')) # remove rows with NAs
dplyr::arrange(Genus, Species) # sort names
dplyr::bind_rows(x, y) # concatenate two tibbles
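The helper functions above are typically chained into a single pipeline. The following sketch shows one possible order on a small in-memory table. The raw data and the column names `old_genus_name` and `old_species_name` are placeholders taken from the rename() example, not from a real dataset.

```r
library(dplyr)

# Placeholder raw data with inconsistent case, a duplicate and a missing genus
raw <- tibble::tibble(
  old_genus_name   = c("acer", "FAGUS", "acer", NA),
  old_species_name = c("platanoides", "sylvatica", "platanoides", "alba")
)

input_clean <- raw %>%
  dplyr::rename(Genus = old_genus_name, Species = old_species_name) %>%
  tidyr::drop_na(Genus, Species) %>%                      # remove incomplete rows
  dplyr::mutate(Genus = stringr::str_to_title(Genus)) %>% # capitalize Genus
  dplyr::distinct(Genus, Species) %>%                     # remove duplicate binomials
  dplyr::arrange(Genus, Species)                          # sort names

input_clean
```

Capitalizing before calling distinct() matters here: otherwise "acer" and "Acer" would be kept as two different names.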
Along with the package comes an example dataset fia. Here, we match its species names against the single backbone BGCI and summarize the output:
library(treemendous)
result <- fia %>% matching(backbone = 'BGCI')
summarize_output(result)
The first rows and columns of the output can be inspected as follows:
result %>%
dplyr::slice_head(n=3) %>%
dplyr::select(1:5)
We can further increase the number of matched species by using the functions matching() followed by enforce_matching(). Here, we specify the backbone BGCI.
result <- fia %>%
matching(backbone = 'BGCI') %>%
enforce_matching(backbone = 'BGCI')
result %>% summarize_output()
Now, we are able to match more species than with matching() alone.
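To compare two workflows numerically, the share of matched names can be computed directly from the output table. The sketch below assumes that the output contains a logical column `matched` (the column name appears in the examples of this readme, its type is an assumption). `match_rate()` is a hypothetical helper, and the toy table only stands in for real output of matching().

```r
# Toy stand-in for the output of matching(); only the `matched` column is
# assumed, and the rows are placeholders rather than real species.
result_toy <- tibble::tibble(
  Genus   = c("Genus_a", "Genus_b", "Genus_c", "Genus_d"),
  Species = c("alpha", "beta", "gamma", "delta"),
  matched = c(TRUE, TRUE, FALSE, TRUE)
)

# Hypothetical helper (not part of treemendous): counts and share of matches
match_rate <- function(result) {
  c(n_matched = sum(result$matched),
    n_total   = nrow(result),
    share     = mean(result$matched))
}

rate <- match_rate(result_toy)
rate
```

Applying the same helper to the output of matching() and to the output of matching() followed by enforce_matching() would show how many additional names the second step recovers.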
Note that if we choose a backbone other than BGCI, species can be matched to names that are not accepted (synonyms). We can resolve these synonyms after matching the species names with the function resolve_synonyms(). The output then additionally contains the accepted species names (prefix Accepted.), as well as a column Accepted.Backbone, which states according to which backbone the synonym was resolved.
result <- fia %>%
matching('WFO') %>%
resolve_synonyms('WFO')
result %>%
dplyr::slice_head(n=3) %>%
dplyr::select(dplyr::matches('Orig|Matched|Accepted'), -'matched')
Note that this produces the warning message "Please consider calling highlight_flags() to investigate potential ambiguities upon resolving synonyms to accepted names". Ambiguities may have been resolved in your dataset, and we suggest using highlight_flags() to inspect them and decide whether you want to check them manually. Call highlight_flags() separately from the other functions, because it returns only the species that carry at least one flag and not the full dataset. Note also that each entry can have multiple flags:
flags <- result %>% highlight_flags('WFO')
flags %>%
dplyr::slice_head(n=3) %>%
dplyr::select(dplyr::matches('Acc|ambiguity|link'))
We can see the full breakdown of these flags as follows:
flags %>% dplyr::select(dplyr::contains("WFO")) %>% dplyr::summarize_all(.funs = sum)
The bulk of these flags denote an infraspecific_ambiguity, which can generally be ignored, provided that the user did not manually truncate any trinomials to binomials for input.
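The breakdown can also be reshaped into one row per flag, which is easier to sort and read when there are many flag columns. This is a sketch on a toy table: the column names `WFO_flag_one` and `WFO_flag_two` are hypothetical and only mimic logical flag columns whose names contain "WFO".

```r
library(dplyr)

# Toy stand-in for the output of highlight_flags(); column names are
# hypothetical placeholders, not the package's real flag names.
flags_toy <- tibble::tibble(
  Genus        = c("Genus_a", "Genus_b", "Genus_c"),
  WFO_flag_one = c(TRUE, TRUE, FALSE),
  WFO_flag_two = c(FALSE, TRUE, FALSE)
)

flag_counts <- flags_toy %>%
  dplyr::select(dplyr::contains("WFO")) %>%
  tidyr::pivot_longer(dplyr::everything(),
                      names_to = "flag", values_to = "raised") %>%
  dplyr::group_by(flag) %>%
  dplyr::summarise(n = sum(raised), .groups = "drop") %>%  # count per flag
  dplyr::arrange(dplyr::desc(n))                           # most frequent first

flag_counts
```

The long format puts the most frequent flag in the first row, so it is immediately visible which kind of ambiguity dominates.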
Instead of using a single backbone, the user can decide to use any subset of the backbones c('BGCI', 'WFO', 'WCVP', 'GBIF') or use all of them by simply calling matching() without any argument. While matching() treats all backbones as equally important, the function sequential_matching() can be used to call matching() for individual backbones sequentially. For every species, the matched backbone is provided in the column Matched.Backbone.
result <- fia %>%
sequential_matching(sequential_backbones = c('BGCI', 'WFO', 'WCVP'))
Remember that matching() and sequential_matching() match any species in the database and thus can provide matches to synonyms rather than accepted species. To return only accepted species, use resolve_synonyms() after the matching function.
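The idea behind sequential matching can be illustrated with a toy example: each name is looked up in the first backbone, and only names that are still unmatched are passed on to the next one. The sketch below is a conceptual illustration with exact string matching and made-up backbone contents. It is not the implementation of sequential_matching(), and `sequential_sketch()` is a hypothetical name.

```r
# Toy backbones (made-up contents, not the real BGCI or WFO data)
backbones <- list(
  BGCI = tibble::tibble(Genus = "Acer", Species = "platanoides"),
  WFO  = tibble::tibble(Genus = c("Acer", "Fagus"),
                        Species = c("platanoides", "sylvatica"))
)

species <- tibble::tibble(
  Genus   = c("Acer", "Fagus", "Quercus"),
  Species = c("platanoides", "sylvatica", "robur")
)

# Conceptual sketch: try the backbones in the given order and record the
# first backbone in which the exact binomial is found.
sequential_sketch <- function(df, backbones) {
  df$Matched.Backbone <- NA_character_
  key_df <- paste(df$Genus, df$Species)
  for (name in names(backbones)) {
    key_bb <- paste(backbones[[name]]$Genus, backbones[[name]]$Species)
    hit <- is.na(df$Matched.Backbone) & key_df %in% key_bb
    df$Matched.Backbone[hit] <- name
  }
  df
}

seq_result <- sequential_sketch(species, backbones)
seq_result
```

In this toy example the first name is attributed to BGCI even though it also occurs in the second backbone, which shows how the order of the backbones sets their priority.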
Researchers often need to integrate multi-modal data from different sources for their analyses. Here, we demonstrate the function translate_trees(), which allows a user to directly translate names from an input database to a target database. First, we resolve both databases individually according to a single backbone (WFO) and compare the resolved names. Then, we use translate_trees() to translate the input species names into the target names.
input <- tibble::tibble(
Genus = c('Aria', 'Ardisia', 'Malus'),
Species = c('umbellata', 'japonica', 'sylvestris')
)
target <- tibble::tibble(
Genus = c('Sorbus', 'Ardisia', 'Malus'),
Species = c('umbellata', 'montana', 'orientalis')
)
input %>%
matching(backbone = 'WFO') %>%
resolve_synonyms('WFO') %>%
dplyr::select(1:6)
target %>%
matching(backbone = 'WFO') %>%
resolve_synonyms('WFO') %>%
dplyr::select(1:6)
Resolving both sets individually leads to a mismatch - Malus orientalis and Malus sylvestris were resolved to two different names. Now let's see whether translate_trees can be used to match all three species:
translate_trees(df = input, target = target) %>%
dplyr::select(1:4)
Essentially, all three species names can be translated from the input set to the target set. Incorporating the knowledge of the desired target names, the function leverages the information about synonym-accepted relations in the three backbones WFO, WCVP and GBIF and is able to translate Malus sylvestris into Malus orientalis.
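The principle of translating through synonym-accepted relations can be sketched as a small graph search: names are nodes, each synonym-accepted relation is an edge, and an input name can be translated if a target name is reachable from it. The code below is a conceptual illustration only. The edge list uses placeholder names rather than real taxonomy, `reachable()` and `translate_sketch()` are hypothetical helpers, and none of this is claimed to be the implementation of translate_trees().

```r
# Toy synonym-accepted relations (placeholder names, not real taxonomy)
edges <- data.frame(
  synonym  = c("Genus_a alpha", "Genus_b alpha", "Genus_c beta"),
  accepted = c("Genus_x alpha", "Genus_x alpha", "Genus_y beta"),
  stringsAsFactors = FALSE
)

# All names reachable from `start` by following edges in either direction
reachable <- function(start, edges) {
  seen <- character(0)
  queue <- start
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% seen) next
    seen <- c(seen, current)
    neighbours <- c(edges$accepted[edges$synonym == current],
                    edges$synonym[edges$accepted == current])
    queue <- c(queue, setdiff(neighbours, seen))
  }
  seen
}

# For each input name, return the first connected target name (or NA)
translate_sketch <- function(input, target, edges) {
  vapply(input, function(name) {
    hits <- intersect(reachable(name, edges), target)
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1))
}

translated <- translate_sketch(
  input  = c("Genus_a alpha", "Genus_c beta"),
  target = c("Genus_b alpha", "Genus_z gamma"),
  edges  = edges
)
translated
```

Here the first input name is translated into a target name because both are synonyms of the same accepted name, while the second input name has no connected target and stays untranslated.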
Please refer to the documentation for a detailed description of the functions: treemendous_1.1.1.pdf