- PDIVAS is a pathogenicity predictor for deep-intronic variants causing aberrant splicing.
- Deep-intronic variants can create pathogenic pseudoexons or extend existing exons, which disrupts normal gene expression and can cause Mendelian diseases.
- PDIVAS efficiently prioritizes the causal candidates from a vast number of deep-intronic variants detected by whole-genome sequencing.
- PDIVAS predictions cover variants in protein-coding genes on autosomes and the X chromosome.
- This command-line interface is compatible with variant files in VCF format.
PDIVAS is a random forest model that classifies variants as pathogenic or benign using the following features:
- Splicing predictions from SpliceAI (Jaganathan et al., Cell 2019) and MaxEntScan (Yeo and Burge, J. Comput. Biol. 2004)
(*) The output module of SpliceAI was customized for PDIVAS features (see Option 2 for details).
- Human splicing constraint score from ConSplice (Cormier et al., BMC Bioinformatics 2022).
Kurosawa et al. BMC Genomics 2023
[email protected] (Ryo Kurosawa at Kyoto University)
For a quick start with PDIVAS, please use the score-precomputed file here. Possible rare SNVs and short indels (1-4 nt) in Mendelian disease genes (n=4,512) are comprehensively annotated in the file. To annotate your VCF file, run commands such as the ones below.
0. Installation
conda install -c bioconda vcfanno
git clone https://github.com/brentp/vcfanno.git
1. Setting score-precomputed files
(Download the score-precomputed file above and create a configuration file, following https://github.com/brentp/vcfanno)
vi ./conf.toml
Write the following:
[[annotation]]
file="./PDIVAS_precomputed/GRCh38/PDIVAS_precomputed_short_GRCh38.vcf.gz"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = ["PDIVAS"]
ops=["self"]
names=["PDIVAS"]
2. Perform PDIVAS annotation
# Move to your working directory. (The case below is the directory in this repository.)
cd examples
# Perform annotation
vcfanno -lua ./vcfanno/example/custom.lua ./conf.toml ./ex.vcf > output_precomp.vcf
# Compare output_precomp.vcf with output_precomp_expect.vcf.gz to confirm that the annotation succeeded.
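The comparison with the expected output can be scripted. The following is a minimal sketch, not part of PDIVAS: it assumes the precomputed score is stored under the INFO key `PDIVAS` (the name set in conf.toml) and treats the value as an opaque string. The helper names `open_text`, `pdivas_by_variant`, and `diff_annotations` are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed VCF as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def pdivas_by_variant(lines):
    """Map (CHROM, POS, REF, ALT) to the raw PDIVAS INFO value (None if absent)."""
    result = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = cols[0], cols[1], cols[3], cols[4], cols[7]
        value = None
        for entry in info.split(";"):
            if entry.startswith("PDIVAS="):  # assumed INFO key
                value = entry[len("PDIVAS="):]
        result[(chrom, pos, ref, alt)] = value
    return result


def diff_annotations(observed_lines, expected_lines):
    """Return the sorted variant keys whose PDIVAS value differs between two VCFs."""
    obs = pdivas_by_variant(observed_lines)
    exp = pdivas_by_variant(expected_lines)
    return sorted(k for k in obs.keys() | exp.keys() if obs.get(k) != exp.get(k))
```

To use it on the example files, open both with `open_text("output_precomp.vcf")` and `open_text("output_precomp_expect.vcf.gz")` and pass the two file objects to `diff_annotations`; an empty list means the annotations agree.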
For more comprehensive annotation than the pre-computed files provide, run PDIVAS as described below.
0-1. Installation
# It is best to create new conda environments for the PDIVAS installation.
# Solving the environments can take some time.
conda create -n PDIVAS -c bioconda -c conda-forge spliceai tensorflow==2.6.2 pdivas bcftools vcfanno
conda create -n VEP -c conda-forge -c bioconda perl==5.26.2 ensembl-vep==105
Successful installation was verified with anaconda version 23.3.1.
0-2. Setting up custom usage
- For the output-customized SpliceAI in the PDIVAS conda environment
git clone https://github.com/shiro-kur/PDIVAS.git
cd PDIVAS/Customed_SpliceAI
cp ./__main__for_customed_SpliceAI.py installed_path/__main__.py
cp ./utils_for_customed-SpliceAI.py installed_path/utils.py
cp -rf ./annotations_for_customed_SpliceAI installed_path/annotations
# Example of installed_path: ~/miniconda3/envs/ex/lib/python3.9/site-packages/spliceai
# Files and directories included in the spliceai directory by default:
# __init__.py __main__.py __pycache__ annotations models utils.py
# The successfully customized result is shown in examples/~~.vcf
- For VEP custom usage
  - Download the VEP cache file (version>=107; it should correspond to your installed VEP version). Follow the "Manually downloading caches" section of the instructions below.
    (https://asia.ensembl.org/info/docs/tools/vep/script/vep_cache.html)
  - To set up the MaxEntScan plugin, follow the instructions below.
    (https://asia.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#maxentscan)
  - Download the ConSplice score file from here. The file was edited from the original score file by Cormier et al. (BMC Bioinformatics 2022).
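The installed_path placeholder in the cp commands above is the directory of the spliceai package inside the active environment. One way to find it is sketched below using only the Python standard library; the helper name `package_dir` is hypothetical. Run it with the PDIVAS environment activated.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory of an importable package, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a single-file module rather than a package
    return Path(list(spec.submodule_search_locations)[0])


# Prints the spliceai package directory, or None if spliceai is not installed
# in the environment that runs this script.
print(package_dir("spliceai"))
```

The printed directory should contain the files listed above (`__main__.py`, `utils.py`, `annotations`, and so on) before they are overwritten.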
1. Preprocess the VCF (split multi-allelic sites into biallelic sites)
conda activate PDIVAS
bcftools norm -m - multi.vcf > bi.vcf
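To illustrate what this preprocessing does, here is a deliberately simplified sketch of splitting one multi-allelic VCF line into biallelic lines. It is not a replacement for bcftools: the function name `split_multiallelic` is hypothetical, and INFO and sample columns are copied unchanged, whereas `bcftools norm -m -` also rewrites per-allele values and genotypes.

```python
def split_multiallelic(line):
    """Split one VCF data line with comma-separated ALT alleles into biallelic lines.

    Simplification: every column other than ALT is copied as-is.
    """
    cols = line.rstrip("\n").split("\t")
    alts = cols[4].split(",")  # column 5 of a VCF line holds the ALT alleles
    return ["\t".join(cols[:4] + [alt] + cols[5:]) for alt in alts]


# A site with two ALT alleles (G and T) becomes two lines, one per allele.
for out in split_multiallelic("chr1\t100\t.\tA\tG,T\t.\t.\t."):
    print(out)
```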
2. Add gene annotations, MaxEntScan scores, and ConSplice scores with VEP.
conda activate VEP
vep \
--cache --offline --cache_version 107 --assembly GRCh38 --hgvs --pick_allele_gene \
--fasta ./references/hg38.fa.gz --vcf --force \
--custom ./references/ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz,ConSplice,bed,overlap,0 \
--plugin MaxEntScan,./references/MaxEntScan/fordownload,SWA,NCSS \
--fields "Consequence,SYMBOL,Gene,INTRON,HGVSc,STRAND,ConSplice,MES-SWA_acceptor_diff,MES-SWA_acceptor_alt,MES-SWA_donor_diff,MES-SWA_donor_alt" \
--compress_output bgzip \
-i ./examples/ex.vcf.gz -o ./examples/ex_vep.vcf.gz
3. Add output-customized SpliceAI scores
conda activate PDIVAS
spliceai -I examples/ex_vep.vcf.gz -O examples/ex_vep_AI.vcf -R hg38.fa -A grch38 -D 300 -M 1
4. Perform the detection of deep-intronic variants and PDIVAS prediction
pdivas predict -I examples/ex_vep_AI.vcf -O examples/ex_vep_AI_PD.vcf.gz -F off
5. (Optional) Convert the PDIVAS-annotated VCF file to a TSV file (one gene annotation per line)
pdivas vcf2tsv -I examples/ex_vep_AI_PD.vcf.gz -O examples/ex_vep_AI_PD.tsv
1. $ pdivas predict
Required parameters:
-I : Input VCF (.vcf/.vcf.gz) with variants of interest.
-O : Output VCF (.vcf/.vcf.gz) with PDIVAS predictions in the format GENE_ID|PDIVAS_score.
Variants in multiple genes have separate predictions for each gene.
Optional parameters:
-F : Filtering function (off/on). Output all variants (-F off; default) or only deep-intronic variants with PDIVAS scores (-F on).
Details of PDIVAS INFO field:
ID | Description |
---|---|
GENE_ID | Ensembl gene ID based on GENCODE V41(GRCh38) or V19(GRCh37) |
PDIVAS | Predicted result: Pattern 1: a float value from 0.000 to 1.000 (the higher, the more deleterious). Exceptions (output with '-F off'; filtered out with '-F on'): Pattern 2: 'wo_annots', variants lacking VEP or SpliceAI annotations. Pattern 3: 'out_of_scope', variants outside the PDIVAS annotation scope (chrY, non-coding genes, or non-deep-intronic variants). Pattern 4: 'no_gene_match', variants without a matching gene annotation between VEP and SpliceAI. |
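The patterns in the table can be handled when reading the output VCF. The sketch below makes several assumptions that the text does not confirm: that the INFO key is `PDIVAS`, that predictions for multiple genes are comma-separated, and that each prediction has the form GENE_ID|value. The function names and the gene ID in the example are hypothetical.

```python
# The three exception labels listed in the table (Patterns 2 to 4).
EXCEPTION_LABELS = {"wo_annots", "out_of_scope", "no_gene_match"}


def parse_value(raw):
    """Return a float score (Pattern 1) or the raw label (Patterns 2 to 4)."""
    if raw in EXCEPTION_LABELS:
        return raw
    try:
        return float(raw)
    except ValueError:
        return raw  # unknown text is kept as-is rather than guessed at


def parse_pdivas(info, key="PDIVAS", sep=","):
    """Return a list of (gene_id, score_or_label) from a VCF INFO string.

    `key` and `sep` are assumptions about the output layout.
    """
    prefix = key + "="
    for entry in info.split(";"):
        if entry.startswith(prefix):
            records = []
            for item in entry[len(prefix):].split(sep):
                gene_id, bar, raw = item.partition("|")
                if not bar:  # no GENE_ID part present
                    gene_id, raw = None, item
                records.append((gene_id, parse_value(raw)))
            return records
    return []  # no PDIVAS annotation on this line


print(parse_pdivas("DP=10;PDIVAS=ENSG00000000001|0.812"))
```

Keeping exception labels as strings and scores as floats lets downstream code filter on `isinstance(value, float)` before applying a threshold.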
2. $ pdivas vcf2tsv
Required parameters:
-I : *Input VCF (.vcf/.vcf.gz) with VEP, SpliceAI, and PDIVAS annotations.
-O : Path and file name of the output TSV file.
*The input VCF is valid only if it was generated through this pipeline.
More details are in Kurosawa et al., medRxiv 2023.
Threshold | Sensitivity (*1) | candidates/individual (*2) |
---|---|---|
>=0.082 | 95% | 26.8 |
>=0.151 | 90% | 14.5 |
>=0.340 | 85% | 6.7 |
>=0.501 | 80% | 4.1 |
>=0.575 | 75% | 3.0 |
>=0.763 | 70% | 1.2 |
(*1) Sensitivities were calculated on curated pathogenic deep-intronic variants in a test dataset.
(*2) Candidates of pathogenic deep-intronic variants were obtained through the process described below. (WGS: Whole-genome sequencing)
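As an illustration of how the threshold table might be applied, the sketch below picks the strictest tabulated threshold that still meets a desired sensitivity and tests a score against it. The threshold and sensitivity values are copied from the table; the function names `threshold_for` and `passes` are hypothetical and not part of the pdivas command-line interface.

```python
# (threshold, sensitivity) pairs copied from the table above.
THRESHOLDS = [
    (0.082, 0.95),
    (0.151, 0.90),
    (0.340, 0.85),
    (0.501, 0.80),
    (0.575, 0.75),
    (0.763, 0.70),
]


def threshold_for(min_sensitivity):
    """Return the highest tabulated threshold whose sensitivity is at least min_sensitivity."""
    eligible = [t for t, s in THRESHOLDS if s >= min_sensitivity]
    if not eligible:
        raise ValueError("no tabulated threshold reaches the requested sensitivity")
    return max(eligible)


def passes(score, min_sensitivity):
    """True if a PDIVAS score meets the threshold chosen for the given sensitivity."""
    return score >= threshold_for(min_sensitivity)
```

For example, `threshold_for(0.90)` selects the >=0.151 row of the table. A lower threshold keeps more true positives at the cost of more candidates per individual, so the choice depends on how many candidates can be followed up.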