Haplotype-aware pantranscriptome analyses using spliced pangenome graphs

This repository contains the scripts that were used to generate the results presented in "Haplotype-aware pantranscriptome analyses using spliced pangenome graphs", bioRxiv (2021).

For more up-to-date information on how to run the different methods, please go to the GitHub pages of the vg toolkit and rpvg. The spliced pangenome graphs and pantranscriptomes (haplotype-specific transcripts) presented in the paper are available to download in the Data section for use in other projects.

This repository is organized in four subdirectories.

The installation_and_demo directory contains installation directions for vg and rpvg. It also includes a short demo of using the tools for transcriptomic inference, with example data included.
The scripts directory contains the scripts used for analysis and plotting in this project. It is further subdivided by the language the scripts are written in. Note that the scripts in the bash subdirectory are not the exact scripts we used. They have been simplified to make them easier for others to use, mainly by removing hard-coded paths and replacing environment-defined variables with variables that can be easily edited.
The originals directory contains the raw, unedited bash scripts, as well as the log files. These files are not particularly user-friendly because they contain many hard-coded paths, but we have included them for transparency and reproducibility. By looking at the scripts and log files you can see exactly how each method was run in the paper. Most of the log files include a short header that specifies the Docker image that was used. The Docker files used for the Docker containers are available in the dockerfiles directory. For the log files without this header, it should be clear from the script itself which version was used.
The dockerfiles directory contains recommended Docker files for running scripts in this repository.

Data

Here you can find links to the data used in the paper. This includes both raw data and data constructed as part of the analyses in the paper. The constructed data included here are either not guaranteed to be reproducible (subsampled transcript annotation and simulated reads) or deemed potentially useful in other projects (graphs, pantranscriptomes and indexes).

Graphs, pantranscriptomes and indexes

The spliced pangenome graphs, pantranscriptomes and indexes:

https://cgl.gi.ucsc.edu/data/vgrna/pantranscriptomes/

Genome

The GRCh38 (primary assembly) reference genome:

https://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Transcripts

The GENCODE v29 (primary assembly) transcript annotation:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz
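
The two reference files listed so far (the GRCh38 genome and the GENCODE v29 annotation) are large single-file downloads. The following is a minimal sketch of fetching them with the Python standard library. It is not part of this repository: the helper names `local_name` and `fetch` are hypothetical, and the sketch assumes both servers offer the files over HTTPS at the paths shown.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# URLs taken from the Genome and Transcripts sections (HTTPS access is an assumption).
GENOME_URL = (
    "https://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/"
    "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
)
ANNOTATION_URL = (
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/"
    "gencode.v29.primary_assembly.annotation.gtf.gz"
)


def local_name(url):
    """Return the file name component of a URL (hypothetical helper)."""
    return Path(urlparse(url).path).name


def fetch(url, dest_dir="."):
    """Download `url` into `dest_dir` unless the file is already there."""
    dest = Path(dest_dir) / local_name(url)
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest


# Example usage (commented out because both files are large):
# fetch(GENOME_URL, "data")
# fetch(ANNOTATION_URL, "data")
```

The destination directory must exist before `fetch` is called. The same helper should work for the other single-file links in this section, but directory listings such as the cgl.gi.ucsc.edu pages need to be browsed for the individual files first.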

The subsampled (80%) GENCODE v29 transcript annotation:

https://cgl.gi.ucsc.edu/data/vgrna/transcript_annotation/
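
To illustrate why a subsampled annotation has to be shared rather than regenerated, here is a small sketch of transcript-level subsampling of GTF records. It is an assumption about what such a step could look like, not the procedure used in the paper. The function name `subsample_gtf` and its parameters are hypothetical. With a fixed `seed` the result is repeatable, but a different seed or random number generator selects a different 80% of transcripts.

```python
import random
import re

# GTF attribute of the form: transcript_id "ENST...";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def subsample_gtf(lines, fraction=0.8, seed=0):
    """Keep the records of a random `fraction` of transcripts (hypothetical helper)."""
    # Collect transcript IDs in order of first appearance.
    ids = []
    seen = set()
    for line in lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            ids.append(match.group(1))

    # Draw the transcripts to keep with a seeded generator.
    rng = random.Random(seed)
    kept = set(rng.sample(ids, round(len(ids) * fraction)))

    # Lines without a transcript_id (headers, gene records) pass through unchanged.
    result = []
    for line in lines:
        match = TRANSCRIPT_ID.search(line)
        if match is None or match.group(1) in kept:
            result.append(line)
    return result
```

The function works on an in-memory list of lines, so for the full GENCODE file the lines would first be read from the decompressed annotation, for example with `gzip.open(path, "rt")`.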

Variants and haplotypes

The 1000 Genomes Project variants and haplotypes lifted to GRCh38:

https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/

The IPD-IMGT/HLA gene allele sequences:

https://github.com/ANHIG/IMGTHLA (release 3.43.0)

Reads

The simulated RNA-seq reads:

https://cgl.gi.ucsc.edu/data/vgrna/simulated_data/

The real RNA-seq reads:

https://www.ncbi.nlm.nih.gov/sra/?term=SRR1153470 (NA12878)
https://www.encodeproject.org/experiments/ENCSR000AED/ (NA12878, replicate 1)
https://github.com/nanopore-wgs-consortium/CHM13/ (CHM13, replicate 1)
https://www.encodeproject.org/experiments/ENCSR146ZKR/ (Adrenal gland)
https://www.encodeproject.org/experiments/ENCSR825GWD/ (Sigmoid colon)
https://www.encodeproject.org/experiments/ENCSR686JJB/ (Adipose tissue)
https://www.encodeproject.org/experiments/ENCSR502OTI/ (Psoas muscle)
https://www.encodeproject.org/experiments/ENCSR995BHD/ (Aorta)
https://www.internationalgenome.org/data-portal/data-collection/geuvadis (NA07051, NA11832, NA11840, NA11930, NA12287, NA12775, NA12889, NA06994, NA07037, NA07357, NA11829, NA11893, NA12006, NA12043, NA12234, NA12272 & NA12275)
https://www.internationalgenome.org/data-portal/data-collection/structural-variation (NA19238, NA19239, NA19240, HG00512, HG00513, HG00514, HG00731, HG00732 & HG00733)
https://www.ncbi.nlm.nih.gov/sra/?term=SRR12765650 (intestine, African American individual)
https://www.ncbi.nlm.nih.gov/sra/?term=SRR12765534 (intestine, European American individual)

The Iso-Seq alignments:

https://www.encodeproject.org/experiments/ENCSR706ANY/ (NA12878, all replicates)

Mapping and expression data

Mapping benchmark tables and haplotype-specific expression estimates:

https://doi.org/10.5281/zenodo.7234454
