ST-preprocess

This repository contains two snakemake pipelines to preprocess 10x Visium data:

single-cell: Pipeline to filter single-cell RNA-seq data and create deconvolution references.
visium: Pipeline to filter and normalise Visium spatial transcriptomics data, align slides, perform cell type deconvolution, identify tumour spots, infer cancer subclones and compute beyondcell scores.

These two pipelines must be executed in sequential order since the output of single-cell is the input of visium.

Description

This code was developed to preprocess the data analysed in the study "Spatial Transcriptomics in Breast Cancer Reveals Tumour Microenvironment-Driven Drug Responses and Clonal Therapeutic Heterogeneity" Nevertheless, the code is modular so that it can be customised for other analyses.

Single-cell pipeline

From the Single Cell Portal (ID: SCP1039), we downloaded single-cell RNA-seq (scRNA-seq) data from 26 primary breast cancer tumours (11 luminal, 5 HER2+ and 10 TNBCs)[1].

The single-cell pipeline takes two inputs:

Merged raw expression counts stored in matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files.
Merged metadata with cell type annotations stored in .csv format.

And performs the following steps:

Creates a merged Seurat object with data from all tumours.
Filters the cells according to user-supplied thresholds.
Splits the filtered object into different Seurat objects according to a specified metadata column (in our case, the breast cancer subtype).
Creates a deconvolution reference for each split level using RCTD [2].
Creates an integrated object with all data.

The output of the single-cell pipeline is the deconvolution references required in the visium pipeline. They are stored in the resultspath/spacexr/ folder as {split_level}/{split_level}_reference.rds files.

Visium pipeline

We downloaded luminal and TNBC Visium ST data (CID4290, CID4535, 1142243F, 1160920F, CID4465 and CID44971) [1] from Zenodo (DOI: 10.5281/zenodo.4739739). HER2+ Visium ST data (738811QB, 1168993F and V19L29), licensed under the Creative Commons Attribution licence, were downloaded from the 10x Genomics Datasets portal [3–6]. All these data included raw counts and matching tissue images. Wu et al. dataset also included metadata information with pathologist annotations for each spot. 10x Genomics provided an image with pathologist annotations for patient 738811QB, and we manually labelled each spot using Loupe Browser v5.0.1 [7].

By leveraging a new version of Beyondcell [8] (cnio-bu/beyondcell), a tool for identifying tumour cell subpopulations with distinct drug response patterns, we predicted sensitivity to over 1,200 drugs while accounting for the spatial context and interaction between the tumour and TME compartments. Moreover, we also used Beyondcell to compute spot-wise functional enrichment scores and identify niche-specific biological functions. To do so, we used the preprocessed Seurat objects generated in this pipeline and two collections of gene signatures:

SSc breast: Collection of gene signatures used to predict sensitivity to > 1,200 drugs derived from breast cancer cell lines. The code used to create the SSc breast collection is available at cnio-bu/SSc-breast.
Functional signatures: Collection of gene signatures used to compute enrichment in different biological pathways.

The output of the visium pipeline, together with these two collections of signatures, can be downloaded from Zenodo (DOI: 10.5281/zenodo.10638906).

The visium pipeline takes several inputs:

Sample-specific Visium data stored in standard files:
- Raw expression counts stored in matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files.
- Metadata with pathologist annotations, if available, stored in .csv format.
- Spatial data stored in tissue_lowres_image.png, scalefactors_json.json and tissue_positions_list.csv files.
Deconvolution references generated by the single-cell pipeline.
Drug sensitivity signatures collection (SSc breast) in .gmt format.
Functional signatures collection in .gmt format.

And performs the following steps:

Creates a Seurat object for each sample, with several slides if available.
Filters the spots according to sample-specific user-supplied thresholds.
Aligns the slides of the same sample with PASTE [9].
Normalises the expression counts using the SCTransform function from Seurat, regressing the cell cycle effect if necessary.
Deconvolves the cell types in each spot using using RCTD.
Estimates the tumour purity of each spot with ESTIMATE [10].
Labels as "tumour" the spots that meet at least 2 of these criteria:
- Annotated as "Ductal carcinoma in situ" or containing the "cancer" keyword according to the pathologist.
- Scaled ESTIMATE score <=0.4.
- Proportion of deconvoluted cancer cells >=0.6.
Infers the clonal composition of the tumour spots in each sample using SCEVAN [11] and the non-tumour spots as reference.
Computes beyondcell scores for the drug sensitivity and functional signatures.

Software Versions

Software	Use	Version
Seurat	Data preprocessing, filtering, normalisation and integration	v4.3.0.1
RCTD	Spot deconvolution	v2.2.1
PASTE	Slide alignment	v1.3.0
ESTIMATE	Tumour purity estimation	v1.0.13
SCEVAN	Cancer subclone inference	v1.0.1
Beyondcell	Spot-wise drug sensitivity prediction or functional enrichment	v2.2.0

References

Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53:1334–47.
Cable DM, Murray E, Zou LS, Goeva A, Macosko EZ, Chen F, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat Biotechnol. 2022;40:517–26.
10x Genomics. Human Breast Cancer (Block A Section 1), Spatial Gene Expression Dataset by Space Ranger 1.1.0. 2020. Accessed 16 Jun 2023.
10x Genomics. Human Breast Cancer (Block A Section 2), Spatial Gene Expression Dataset by Space Ranger 1.1.0. 2020. Accessed 16 Jun 2023.
10x Genomics. Human Breast Cancer: Ductal Carcinoma In Situ, Invasive Carcinoma (FFPE), Spatial Gene Expression Dataset by Space Ranger 1.3.0. 2021. Accessed 16 Jun 2023.
10x Genomics. Human Breast Cancer: Visium Fresh Frozen, Whole Transcriptome, Spatial Gene Expression Dataset by Space Ranger 1.3.0. 2022. Accessed 16 Jun 2023.
10x Genomics. Loupe Browser. Version 5.0.1. 2021.
Fustero-Torre C, Jiménez-Santos MJ, García-Martín S, Carretero-Puche C, García-Jimeno L, Ivanchuk V, et al. Beyondcell: targeting cancer therapeutic heterogeneity in single-cell RNA-seq data. Genome Med. 2021;13:187.
Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat Methods. 2022;19:567-575.
Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.
De Falco A, Caruso F, Su X-D, Iavarone A, Ceccarelli M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat Commun. 2023;14:1074.

Installation

The code in this repository extensively uses Snakemake's integration with the conda package manager to take care of software requirements and dependencies automatically.

First, install conda by following the installation instructions.

Next, use the git clone command to create a local copy:

git clone https://github.com/cnio-bu/ST-preprocess

And create a snakemake environment:

conda env create -f envs/snakemake.yaml

How to run

Single-cell pipeline set up

You need to modify the configuration file:

config.yaml: Contains all pipeline parameters. You must specify:
- The location of the data, output and log files.
- The limits to filter out low quality cells.
- The metadata column to split the matrix by and the levels for which a deconvolution reference must be computed.

Please note that the data folder must have this structure:

├── data
│   ├── meta
│   │   ├── metadata.csv
├── filtered_cnts
│   ├── matrix.mtx.gz
│   ├── features.tsv.gz
│   ├── barcodes.tsv.gz

Also, create a named conda environment for RCTD:

# Create an environment with dependencies from the YAML receipt
conda env create -f envs/spacexr.yaml

# Activate the RCTD environment
conda activate spacexr

# Activate R and install RCTD from within R
R
options(timeout = 600000000) ### set this to avoid timeout error
devtools::install_github("dmcable/spacexr", build_vignettes = FALSE)
q()

# Deactivate the RCTD environment
conda deactivate

Visium pipeline set up

You need to modify the configuration files:

config.yaml: Contains all pipeline parameters. You must specify:
- The location of the data, output and log files.
- The location of the patients.tsv file.
- The directory where the deconvolution references are stored.
- The location of the drug sensitivity signatures collection (SSc breast).
- The location of the functional signatures collection.
patients.tsv: Contains sample-specific parameters. You must specify:
- patient: The sample name.
- slide: The number of slides.
- gene_column: Column of features.tsv where gene names are.
- ref_deconvolution: Prefix of the deconvolution reference to use (i.e. {split_level}).
- mt_limits, rb_limits, count_limits, feat_limits: The limits to filter out low-quality spots.
- regress_cell_cycle: Whether or not regress the cell cycle effect.
- expr_thres: Beyondcell parameter. Minimum fraction of signature genes that must be expressed in a spot to compute its beyondcell score.

Please note that the data folder must have this structure:

├── data
│   ├── filtered_cnts
│   │   ├── {patient}_filtered_count_matrix
│   │   │   ├── slide1
│   │   │   │   ├── matrix.mtx.gz
│   │   │   │   ├── features.tsv.gz
│   │   │   │   ├── barcodes.tsv.gz
│   │   │   ├── slide2
│   │   │   │   ├── matrix.mtx.gz
│   │   │   │   ├── features.tsv.gz
│   │   │   │   ├── barcodes.tsv.gz
│   ├── meta
│   │   ├── {patient}_metadata_slide1.csv
│   │   ├── {patient}_metadata_slide2.csv
│   ├── imgs
│   │   ├── {patient}_spatial
│   │   │   ├── slide1
│   │   │   │   ├── tissue_lowres_image.png
│   │   │   │   ├── scalefactors_json.json 
│   │   │   │   ├── tissue_positions_list.csv
│   │   │   ├── slide2
│   │   │   │   ├── tissue_lowres_image.png
│   │   │   │   ├── scalefactors_json.json 
│   │   │   │   ├── tissue_positions_list.csv

Also, create named conda environments for ESTIMATE and SCEVAN:

# Create environments with dependencies from the YAML receipts
conda env create -f envs/estimate.yaml
conda env create -f envs/scevan.yaml

# Install ESTIMATE from within R
conda activate estimate
R
install.packages("estimate", repos="http:https://R-Forge.R-project.org")
q()
conda deactivate

# Install SCEVAN from within R
conda activate scevan
R
devtools::install_github("miccec/yaGST")
devtools::install_github("AntonioDeFalco/SCEVAN")
q()
conda deactivate

Execution

Once each pipeline is configured, the user just needs to move to src/single-cell or src/visium folders and enter these commands:

conda activate snakemake
snakemake --use-conda -j 10
conda deactivate

The mandatory arguments are:

--use-conda: To install and use the conda environments.
-j: Number of threads/jobs provided to snakemake.

Authors

María José Jiménez-Santos

Support

If you have any questions, feel free to submit an issue.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
envs		envs
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ST-preprocess

Description

Single-cell pipeline

Visium pipeline

Installation

How to run

Single-cell pipeline set up

Visium pipeline set up

Execution

Authors

Support

About

Releases 1

Packages

Languages

License

cnio-bu/ST-preprocess

Folders and files

Latest commit

History

Repository files navigation

ST-preprocess

Description

Single-cell pipeline

Visium pipeline

Installation

How to run

Single-cell pipeline set up

Visium pipeline set up

Execution

Authors

Support

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages