This repository contains two snakemake pipelines to preprocess 10x Visium data:
- single-cell: Pipeline to filter single-cell RNA-seq data and create deconvolution references.
- visium: Pipeline to filter and normalise Visium spatial transcriptomics data, align slides, perform cell type deconvolution, identify tumour spots, infer cancer subclones and compute beyondcell scores.
These two pipelines must be executed in sequential order since the output of single-cell is the input of visium.
This code was developed to preprocess the data analysed in the study "Spatial Transcriptomics in Breast Cancer Reveals Tumour Microenvironment-Driven Drug Responses and Clonal Therapeutic Heterogeneity" Nevertheless, the code is modular so that it can be customised for other analyses.
From the Single Cell Portal (ID: SCP1039), we downloaded single-cell RNA-seq (scRNA-seq) data from 26 primary breast cancer tumours (11 luminal, 5 HER2+ and 10 TNBCs)[1].
The single-cell pipeline takes two inputs:
- Merged raw expression counts stored in
matrix.mtx.gz
,features.tsv.gz
andbarcodes.tsv.gz
files. - Merged metadata with cell type annotations stored in
.csv
format.
And performs the following steps:
- Creates a merged Seurat object with data from all tumours.
- Filters the cells according to user-supplied thresholds.
- Splits the filtered object into different Seurat objects according to a specified metadata column (in our case, the breast cancer subtype).
- Creates a deconvolution reference for each split level using RCTD [2].
- Creates an integrated object with all data.
The output of the single-cell pipeline is the deconvolution references required in the visium pipeline. They are stored in the resultspath/spacexr/
folder as {split_level}/{split_level}_reference.rds
files.
We downloaded luminal and TNBC Visium ST data (CID4290, CID4535, 1142243F, 1160920F, CID4465 and CID44971) [1] from Zenodo (DOI: 10.5281/zenodo.4739739). HER2+ Visium ST data (738811QB, 1168993F and V19L29), licensed under the Creative Commons Attribution licence, were downloaded from the 10x Genomics Datasets portal [3–6]. All these data included raw counts and matching tissue images. Wu et al. dataset also included metadata information with pathologist annotations for each spot. 10x Genomics provided an image with pathologist annotations for patient 738811QB, and we manually labelled each spot using Loupe Browser v5.0.1 [7].
By leveraging a new version of Beyondcell [8] (cnio-bu/beyondcell), a tool for identifying tumour cell subpopulations with distinct drug response patterns, we predicted sensitivity to over 1,200 drugs while accounting for the spatial context and interaction between the tumour and TME compartments. Moreover, we also used Beyondcell to compute spot-wise functional enrichment scores and identify niche-specific biological functions. To do so, we used the preprocessed Seurat objects generated in this pipeline and two collections of gene signatures:
-
SSc breast: Collection of gene signatures used to predict sensitivity to > 1,200 drugs derived from breast cancer cell lines. The code used to create the SSc breast collection is available at cnio-bu/SSc-breast.
-
Functional signatures: Collection of gene signatures used to compute enrichment in different biological pathways.
The output of the visium pipeline, together with these two collections of signatures, can be downloaded from Zenodo (DOI: 10.5281/zenodo.10638906).
The visium pipeline takes several inputs:
-
Sample-specific Visium data stored in standard files:
- Raw expression counts stored in
matrix.mtx.gz
,features.tsv.gz
andbarcodes.tsv.gz
files. - Metadata with pathologist annotations, if available, stored in
.csv
format. - Spatial data stored in
tissue_lowres_image.png
,scalefactors_json.json
andtissue_positions_list.csv
files.
- Raw expression counts stored in
-
Deconvolution references generated by the single-cell pipeline.
-
Drug sensitivity signatures collection (SSc breast) in
.gmt
format. -
Functional signatures collection in
.gmt
format.
And performs the following steps:
-
Creates a Seurat object for each sample, with several slides if available.
-
Filters the spots according to sample-specific user-supplied thresholds.
-
Aligns the slides of the same sample with PASTE [9].
-
Normalises the expression counts using the SCTransform function from Seurat, regressing the cell cycle effect if necessary.
-
Deconvolves the cell types in each spot using using RCTD.
-
Estimates the tumour purity of each spot with ESTIMATE [10].
-
Labels as "tumour" the spots that meet at least 2 of these criteria:
- Annotated as "Ductal carcinoma in situ" or containing the "cancer" keyword according to the pathologist.
- Scaled ESTIMATE score <=0.4.
- Proportion of deconvoluted cancer cells >=0.6.
-
Infers the clonal composition of the tumour spots in each sample using SCEVAN [11] and the non-tumour spots as reference.
-
Computes beyondcell scores for the drug sensitivity and functional signatures.
Software Versions
Software | Use | Version |
---|---|---|
Seurat | Data preprocessing, filtering, normalisation and integration | v4.3.0.1 |
RCTD | Spot deconvolution | v2.2.1 |
PASTE | Slide alignment | v1.3.0 |
ESTIMATE | Tumour purity estimation | v1.0.13 |
SCEVAN | Cancer subclone inference | v1.0.1 |
Beyondcell | Spot-wise drug sensitivity prediction or functional enrichment | v2.2.0 |
References
-
Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53:1334–47.
-
Cable DM, Murray E, Zou LS, Goeva A, Macosko EZ, Chen F, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat Biotechnol. 2022;40:517–26.
-
10x Genomics. Human Breast Cancer (Block A Section 1), Spatial Gene Expression Dataset by Space Ranger 1.1.0. 2020. Accessed 16 Jun 2023.
-
10x Genomics. Human Breast Cancer (Block A Section 2), Spatial Gene Expression Dataset by Space Ranger 1.1.0. 2020. Accessed 16 Jun 2023.
-
10x Genomics. Human Breast Cancer: Ductal Carcinoma In Situ, Invasive Carcinoma (FFPE), Spatial Gene Expression Dataset by Space Ranger 1.3.0. 2021. Accessed 16 Jun 2023.
-
10x Genomics. Human Breast Cancer: Visium Fresh Frozen, Whole Transcriptome, Spatial Gene Expression Dataset by Space Ranger 1.3.0. 2022. Accessed 16 Jun 2023.
-
10x Genomics. Loupe Browser. Version 5.0.1. 2021.
-
Fustero-Torre C, Jiménez-Santos MJ, García-Martín S, Carretero-Puche C, García-Jimeno L, Ivanchuk V, et al. Beyondcell: targeting cancer therapeutic heterogeneity in single-cell RNA-seq data. Genome Med. 2021;13:187.
-
Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat Methods. 2022;19:567-575.
-
Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.
-
De Falco A, Caruso F, Su X-D, Iavarone A, Ceccarelli M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat Commun. 2023;14:1074.
The code in this repository extensively uses Snakemake's integration with the conda package manager to take care of software requirements and dependencies automatically.
First, install conda by following the installation instructions.
Next, use the git clone command to create a local copy:
git clone https://github.com/cnio-bu/ST-preprocess
And create a snakemake environment:
conda env create -f envs/snakemake.yaml
You need to modify the configuration file:
-
config.yaml
: Contains all pipeline parameters. You must specify:- The location of the data, output and log files.
- The limits to filter out low quality cells.
- The metadata column to split the matrix by and the levels for which a deconvolution reference must be computed.
Please note that the data folder must have this structure:
├── data
│ ├── meta
│ │ ├── metadata.csv
├── filtered_cnts
│ ├── matrix.mtx.gz
│ ├── features.tsv.gz
│ ├── barcodes.tsv.gz
Also, create a named conda environment for RCTD:
# Create an environment with dependencies from the YAML receipt
conda env create -f envs/spacexr.yaml
# Activate the RCTD environment
conda activate spacexr
# Activate R and install RCTD from within R
R
options(timeout = 600000000) ### set this to avoid timeout error
devtools::install_github("dmcable/spacexr", build_vignettes = FALSE)
q()
# Deactivate the RCTD environment
conda deactivate
You need to modify the configuration files:
-
config.yaml
: Contains all pipeline parameters. You must specify:- The location of the data, output and log files.
- The location of the
patients.tsv
file. - The directory where the deconvolution references are stored.
- The location of the drug sensitivity signatures collection (SSc breast).
- The location of the functional signatures collection.
-
patients.tsv
: Contains sample-specific parameters. You must specify:- patient: The sample name.
- slide: The number of slides.
- gene_column: Column of
features.tsv
where gene names are. - ref_deconvolution: Prefix of the deconvolution reference to use (i.e.
{split_level}
). - mt_limits, rb_limits, count_limits, feat_limits: The limits to filter out low-quality spots.
- regress_cell_cycle: Whether or not regress the cell cycle effect.
- expr_thres: Beyondcell parameter. Minimum fraction of signature genes that must be expressed in a spot to compute its beyondcell score.
Please note that the data folder must have this structure:
├── data
│ ├── filtered_cnts
│ │ ├── {patient}_filtered_count_matrix
│ │ │ ├── slide1
│ │ │ │ ├── matrix.mtx.gz
│ │ │ │ ├── features.tsv.gz
│ │ │ │ ├── barcodes.tsv.gz
│ │ │ ├── slide2
│ │ │ │ ├── matrix.mtx.gz
│ │ │ │ ├── features.tsv.gz
│ │ │ │ ├── barcodes.tsv.gz
│ ├── meta
│ │ ├── {patient}_metadata_slide1.csv
│ │ ├── {patient}_metadata_slide2.csv
│ ├── imgs
│ │ ├── {patient}_spatial
│ │ │ ├── slide1
│ │ │ │ ├── tissue_lowres_image.png
│ │ │ │ ├── scalefactors_json.json
│ │ │ │ ├── tissue_positions_list.csv
│ │ │ ├── slide2
│ │ │ │ ├── tissue_lowres_image.png
│ │ │ │ ├── scalefactors_json.json
│ │ │ │ ├── tissue_positions_list.csv
Also, create named conda environments for ESTIMATE and SCEVAN:
# Create environments with dependencies from the YAML receipts
conda env create -f envs/estimate.yaml
conda env create -f envs/scevan.yaml
# Install ESTIMATE from within R
conda activate estimate
R
install.packages("estimate", repos="http:https://R-Forge.R-project.org")
q()
conda deactivate
# Install SCEVAN from within R
conda activate scevan
R
devtools::install_github("miccec/yaGST")
devtools::install_github("AntonioDeFalco/SCEVAN")
q()
conda deactivate
Once each pipeline is configured, the user just needs to move to src/single-cell
or src/visium
folders and enter these commands:
conda activate snakemake
snakemake --use-conda -j 10
conda deactivate
The mandatory arguments are:
- --use-conda: To install and use the conda environments.
- -j: Number of threads/jobs provided to snakemake.
- María José Jiménez-Santos
If you have any questions, feel free to submit an issue.