vignettes/spatial-transcriptomics.Rmd

---
title: "Applying RCTD to Spatial Transcriptomics Data"
author: "Dylan Cable"
date: "March 17th, 2021"
output:  
  html_document:
    keep_md: yes
  rmarkdown::html_vignette:
    keep_md: yes
vignette: >
  %\VignetteIndexEntry{spatial-transcriptomics}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup}
library(spacexr)
library(Matrix)
```

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  out.width = "100%"
)
```
## Introduction
Robust Cell Type Decomposition, or RCTD, is an statistical method for learning cell types from spatial transcriptomics data. In this Vignette, we will assign cell types to a cerebellum Slide-seq dataset. We will define cell type profiles using an annotated single nucleus RNA-sequencing (snRNA-seq) cerebellum dataset. Please note that the reference could also be a single-cell dataset or a cell type-specific bulk RNA-seq dataset.

## Data Preprocessing
Let's begin by loading in the data to be usable for RCTD. <br/>

### Single-Cell Reference
In order to run RCTD, the first step is to process the single cell reference. The reference is created using the RCTD `Reference` constructor function, which requires 3 parameters: <br/>
1. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/cell names. Counts should be untransformed count-level data. <br/>
2. cell_types: 	
A named (by cell barcode) factor of cell type for each cell. The 'levels' of the factor would be the possible cell type identities. <br/>
3. nUMI: 	
Optional, a named (by cell barcode) list of total counts or UMI's appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel. <br/>

The reference may come from various data types, but it needs to be loaded into the R envirnoment. Here, we emphasize that our protocol for loading in the counts, cell_types, and nUMI obejects does not need to be the way that you load your objects into R. In this case our reference is stored in the 'Reference/Vignette' folder as two csv files:<br/>
1. meta_data.csv: a CSV file (with 3 columns, with headers "barcode", "cluster", and "nUMI") containing the nUMI and cell type assignment for each cell.<br/>
 <br/>
2. dge.csv: a Digital Gene Expression (DGE) (barcodes by gene counts) CSV file in the standard 10x format. <br/>
We use the `Reference` constructor function:
```{r scRNA}
### Load in/preprocess your data, this might vary based on your file type
refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference
counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI)
cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list

### Create the Reference object
reference <- Reference(counts, cell_types, nUMI)

## Examine reference object (optional)
print(dim(reference@counts)) #observe Digital Gene Expression matrix
table(reference@cell_types) #number of occurences for each cell type

## Save RDS object (optional)
saveRDS(reference, file.path(refdir,'SCRef.rds'))
```

### Spatial Transcriptomics data
Next, we will load the Spatial Transcriptomics data into a `SpatialRNA` object. Similarly to the reference, we will first load the data into R, and then pass the data into the RCTD `SpatialRNA` constructor function, which requires 3 parameters: <br/>
1. coords: A numeric data.frame (or matrix) representing the spatial pixel locations. rownames are barcodes/pixel names, and there should be two columns for 'x' and for 'y'.<br/>
2. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/pixel names. Counts should be untransformed count-level data. <br/>
3. nUMI: 	
Optional, a named (by pixel barcode) list of total counts or UMI's appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel. <br/>

These elements can be constructed differently, depending on how you store your spatial transcriptomic data. In our case (one of many ways to do it), our (Slide-seq) data is stored in the 'SpatialRNA/Vignette' file as two csv files: <br/>
1. BeadLocationsForR.csv: a CSV file (with 3 columns, with headers "barcodes", "xcoord", and "ycoord") containing the spatial locations of the pixels. <br/>
2. MappedDGEForR.csv: a DGE (gene counts by barcodes) CSV file. Represents raw counts at each pixel. <br/>
```{r SpatialRNA, results = 'hide', fig.height = 6, fig.width = 6}
datadir <- system.file("extdata",'SpatialRNA/Vignette',package = 'spacexr') # directory for sample Slide-seq dataset
counts <- read.csv(file.path(datadir,"MappedDGEForR.csv")) # load in counts matrix
coords <- read.csv(file.path(datadir,"BeadLocationsForR.csv"))
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
rownames(coords) <- coords$barcodes; coords$barcodes <- NULL # Move barcodes to rownames
nUMI <- colSums(counts) # In this case, total counts per pixel is nUMI

### Create SpatialRNA object
puck <- SpatialRNA(coords, counts, nUMI)

## Examine SpatialRNA object (optional)
print(dim(puck@counts)) # observe Digital Gene Expression matrix
hist(log(puck@nUMI,2)) # histogram of log_2 nUMI
print(head(puck@coords)) # start of coordinate data.frame
barcodes <- colnames(puck@counts) # pixels to be used (a list of barcode names). 

# This list can be restricted if you want to crop the puck e.g. 
# puck <- restrict_puck(puck, barcodes) provides a basic plot of the nUMI of each pixel
# on the plot:
plot_puck_continuous(puck, barcodes, puck@nUMI, ylimit = c(0,round(quantile(puck@nUMI,0.9))), 
                     title ='plot of nUMI') 
```
The RDS file 'puck.RDS' saves the 'SpatialRNA' file we have created, and from now on it can be loaded in by the init_RCTD function. 
## Running RCTD

### Creating RCTD Object

We are now ready to create an `RCTD` object using the `create.RCTD` function. We simply need to pass in the `SpatialRNA` and scRNA-seq objects. There are several configuration options that can be set with this function: 

* `max_cores:` for parallel processing, the number of cores used. If set to 1, parallel processing is not used. The system will additionally be checked for number of available cores. Note, that we recommend setting `max_cores` to at least `4` or `8` to improve efficiency.
* `gene_cutoff, fc_cutoff, gene_cutoff_reg, fc_cutoff_reg: ` are used for differentially expressed gene selection, with `gene_cutoff` filtering for average expression and `fc_cutoff` filtering for log-fold-change across cell types.
* `UMI_min, UMI_max: ` are the minimum and maximum read depth for pixels in the `SpatialRNA` dataset.

### Running RCTD

Now, we are ready to run RCTD, using the `run.RCTD` function. This function is equivalent to sequentially running the functions `fitBulk`, `choose_sigma_c`, and `fitPixels`. The `doublet_mode` argument sets whether RCTD will be run in 'doublet mode' (at most 1-2 cell types per pixel), 'full mode' (no restrictions on number of cell types), or 'multi mode' (finitely many cell types per pixel, e.g. 3 or 4).

```{r DEgenes}
myRCTD <- create.RCTD(puck, reference, max_cores = 1)
myRCTD <- run.RCTD(myRCTD, doublet_mode = 'doublet')
```

## RCTD results

The results of RCTD are located in the `@results` field. Of particular interest is `@results$weights`, a data frame of cell type weights for each pixel (for full mode). This section will generate various plots which can be found in `resultsdir`. The results of 'doublet_mode' are stored in `@results$results_df` and `@results$weights_doublet`, the weights of each cell type. 
More specifically, the `results_df` object contains one column per pixel (barcodes as rownames). Important columns are:

* `spot_class`, a factor variable representing RCTD's classification in doublet mode: "singlet" (1 cell type on pixel), "doublet_certain" (2 cell types on pixel), "doublet_uncertain" (2 cell types on pixel, but only confident of 1), "reject" (no prediction given for pixel). 
* Next, the `first_type` column gives the first cell type predicted on the bead (for all spot_class conditions except "reject"). 
* The `second_type column` gives the second cell type predicted on the bead for doublet spot_class conditions (not a confident prediction for "doublet_uncertain").

Note that in multi-mode, results consists of a list of results for each pixel, which contains `all_weights` (weights from full-mode), `cell_type_list` (cell types on multi-mode), `conf_list` (which cell types are confident on multi-mode) and `sub_weights` (proportions of cell types on multi-mode).

Note, some of the plots are not displayed here, but rather saved as pdf files in the 'RCTD_Plots' directory.

```{r results, results = 'hide',  fig.width = 8, fig.height=8}
results <- myRCTD@results
# normalize the cell type proportions to sum to 1.
norm_weights = normalize_weights(results$weights) 
cell_type_names <- myRCTD@cell_type_info$info[[2]] #list of cell type names
spatialRNA <- myRCTD@spatialRNA
resultsdir <- 'RCTD_Plots' ## you may change this to a more accessible directory on your computer.
dir.create(resultsdir)
```


```{r results2, results = 'hide', fig.width = 8, fig.height=8}
# make the plots 
# Plots the confident weights for each cell type as in full_mode (saved as 
# 'results/cell_type_weights_unthreshold.pdf')
plot_weights(cell_type_names, spatialRNA, resultsdir, norm_weights) 
# Plots all weights for each cell type as in full_mode. (saved as 
# 'results/cell_type_weights.pdf')
plot_weights_unthreshold(cell_type_names, spatialRNA, resultsdir, norm_weights) 
# Plots the weights for each cell type as in doublet_mode. (saved as 
# 'results/cell_type_weights_doublets.pdf')
plot_weights_doublet(cell_type_names, spatialRNA, resultsdir, results$weights_doublet, 
                     results$results_df) 
# Plots the number of confident pixels of each cell type in 'full_mode'. (saved as 
# 'results/cell_type_occur.pdf')
plot_cond_occur(cell_type_names, resultsdir, norm_weights, spatialRNA)
# makes a map of all cell types, (saved as 
# 'results/all_cell_types.pdf')
plot_all_cell_types(results$results_df, spatialRNA@coords, cell_type_names, resultsdir) 


# doublets
#obtain a dataframe of only doublets
doublets <- results$results_df[results$results_df$spot_class == "doublet_certain",] 
# Plots all doublets in space (saved as 
# 'results/all_doublets.pdf')
plot_doublets(spatialRNA, doublets, resultsdir, cell_type_names) 
# Plots all doublets in space for each cell type (saved as 
# 'results/all_doublets_type.pdf')
plot_doublets_type(spatialRNA, doublets, resultsdir, cell_type_names) 
# a table of frequency of doublet pairs 
doub_occur <- table(doublets$second_type, doublets$first_type) 
# Plots a stacked bar plot of doublet ocurrences (saved as 
# 'results/doublet_stacked_bar.pdf')

plot_doub_occur_stack(doub_occur, resultsdir, cell_type_names) 

```