Annotate all variants in the VCF file with exomiser score and ACMG #553

Open
Bioinf-usr opened this issue Mar 13, 2024 · 5 comments

@Bioinf-usr

Bioinf-usr commented Mar 13, 2024

Hi,

What is the best approach to annotate all the available variants in the VCF file with exomiser scores?

I am using this yaml configuration for the analysis. However, some of the variants are still being filtered out. Any suggestions to keep all the variants?

# These are all the possible options for running exomiser. Use this as a template for
# your own set-up.
---
analysis:
    # hg19 or hg38 - ensure that the application has been configured to run the specified assembly otherwise it will halt.
    genomeAssembly: hg38
    vcf: examples/Pfeiffer.vcf
    ped:
    proband:
    hpoIds: ['HP:0001156', 'HP:0001363', 'HP:0011304', 'HP:0010055']
    # These are the default settings, with values representing the maximum minor allele frequency in percent (%) permitted for an
    # allele to be considered as a causative candidate under that mode of inheritance.
    # If you just want to analyse a sample under a single inheritance mode, delete/comment-out the others. For AUTOSOMAL_RECESSIVE
    # or X_RECESSIVE ensure *both* relevant HOM_ALT and COMP_HET modes are present.
    # In cases where you do not want any cut-offs applied an empty map should be used e.g. inheritanceModes: {}
    inheritanceModes: {}
  #FULL or PASS_ONLY
    analysisMode: PASS_ONLY
  # Possible frequencySources:
  # Thousand Genomes project http://www.1000genomes.org/
  #   THOUSAND_GENOMES,
  # ESP project http://evs.gs.washington.edu/EVS/
  #   ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
  # ExAC project http://exac.broadinstitute.org/about
  #   EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
  #   EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
  #   EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
  #   EXAC_OTHER
  # Possible frequencySources:
  # Thousand Genomes project - http://www.1000genomes.org/ (THOUSAND_GENOMES)
  # TOPMed - https://www.nhlbi.nih.gov/science/precision-medicine-activities (TOPMED)
  # UK10K - http://www.uk10k.org/ (UK10K)
  # ESP project - http://evs.gs.washington.edu/EVS/ (ESP_)
  #   ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
  # ExAC project http://exac.broadinstitute.org/about (EXAC_)
  #   EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
  #   EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
  #   EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
  #   EXAC_OTHER
  # gnomAD - http://gnomad.broadinstitute.org/ (GNOMAD_E, GNOMAD_G)
    frequencySources: [
        THOUSAND_GENOMES,
        TOPMED,
        UK10K,
        ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
        EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
        EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
        EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
        EXAC_OTHER,
        GNOMAD_E_AFR,
        GNOMAD_E_AMR,
#        GNOMAD_E_ASJ,
        GNOMAD_E_EAS,
        GNOMAD_E_FIN,
        GNOMAD_E_NFE,
        GNOMAD_E_OTH,
        GNOMAD_E_SAS,
        GNOMAD_G_AFR,
        GNOMAD_G_AMR,
      #        GNOMAD_G_ASJ,
        GNOMAD_G_EAS,
        GNOMAD_G_FIN,
        GNOMAD_G_NFE,
        GNOMAD_G_OTH,
        GNOMAD_G_SAS
    ]
  # Possible pathogenicitySources: (POLYPHEN, MUTATION_TASTER, SIFT), (REVEL, MVP), CADD, REMM
  # REMM is trained on non-coding regulatory regions
  # *WARNING* if you enable CADD or REMM ensure that you have downloaded and installed the CADD/REMM tabix files
  # and updated their location in the application.properties. Exomiser will not run without this.
    pathogenicitySources: [REVEL, MVP]
  # this is the standard exomiser order.
    #all steps are optional
    steps: [
        #intervalFilter: {interval: 'chr10:123256200-123256300'},
        # or for multiple intervals:
        #intervalFilter: {intervals: ['chr10:123256200-123256201', 'chr2:14535991-14535992', 'chr14:134253-134254']},
        # or using a BED file - NOTE this should be 0-based, Exomiser otherwise uses 1-based coordinates in line with VCF
        #intervalFilter: {bed: /full/path/to/bed_file.bed},
        #genePanelFilter: {geneSymbols: ['FGFR1','FGFR2']},
        #variantEffectFilter: {remove: [NON_CODING_TRANSCRIPT_EXON_VARIANT, THREE_PRIME_UTR_EXON_VARIANT, INTERGENIC_VARIANT, CODING_TRANSCRIPT_INTRON_VARIANT, NON_CODING_TRANSCRIPT_INTRON_VARIANT, UPSTREAM_GENE_VARIANT, FIVE_PRIME_UTR_EXON_VARIANT, FIVE_PRIME_UTR_INTRON_VARIANT, DOWNSTREAM_GENE_VARIANT, REGULATORY_REGION_VARIANT, SYNONYMOUS_VARIANT, THREE_PRIME_UTR_INTRON_VARIANT]},
        #failedVariantFilter: {},
        #regulatoryFeatureFilter: {},
        #qualityFilter: {minQuality: 50.0},
        #frequencyFilter:  {},
        pathogenicityFilter: { keepNonPathogenic: true },
        inheritanceFilter: {},
        omimPrioritiser: {},
        hiPhivePrioritiser: {},
    ]
outputOptions:
    outputContributingVariantsOnly: false
    #numGenes options: 0 = all or specify a limit e.g. 500 for the first 500 results
    numGenes: 0
    # Path to the desired output directory. Will default to the 'results' subdirectory of the exomiser install directory
    outputDirectory: results
    # Filename for the output files. Will default to {input-vcf-filename}-exomiser
    outputFileName: Pfeiffer-hiphive-exome-PASS_ONLY
    #out-format options: HTML, JSON, TSV_GENE, TSV_VARIANT, VCF (default: HTML)
    outputFormats: [HTML, JSON, TSV_GENE, TSV_VARIANT, VCF]

Thank you.

@julesjacobsen
Contributor

When you say 'Exomiser scores', which ones do you mean? Are you trying to use Exomiser to annotate all the variants with other scores too? Exomiser wasn't developed to act as an intermediate pipeline step that annotates everything. It actively tries to remove as much as possible, returning only the variants that pass the specified filters, ranked in order of decreasing score. Hacking around this isn't going to provide the best results and will require a sizeable chunk of RAM if run on a genome.

Running Exomiser with analysisMode: FULL will return all input variants and state which filters failed in the FILTER field. If you don't run the frequencyFilter, the frequency data will not be added, so make sure it is enabled but set to 100%. The frequency is a component of the variantScore, which in turn is a component of the combinedScore, so keep this step in; otherwise your results will be bad.

# This is 
---
analysis:
    # hg19 or hg38 - ensure that the application has been configured to run the specified assembly otherwise it will halt.
    genomeAssembly: hg38
    vcf: examples/Pfeiffer.vcf
    ped:
    proband:
    hpoIds: ['HP:0001156', 'HP:0001363', 'HP:0011304', 'HP:0010055']
    # These are the default settings, with values representing the maximum minor allele frequency in percent (%) permitted for an
    # allele to be considered as a causative candidate under that mode of inheritance.
    # If you just want to analyse a sample under a single inheritance mode, delete/comment-out the others. For AUTOSOMAL_RECESSIVE
    # or X_RECESSIVE ensure *both* relevant HOM_ALT and COMP_HET modes are present.
    # In cases where you do not want any cut-offs applied an empty map should be used e.g. inheritanceModes: {}
    inheritanceModes: {}
  #FULL or PASS_ONLY
    analysisMode: FULL
    frequencySources: [
        THOUSAND_GENOMES,
        TOPMED,
        UK10K,
        ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
        GNOMAD_E_AFR,
        GNOMAD_E_AMR,
#        GNOMAD_E_ASJ,
        GNOMAD_E_EAS,
#        GNOMAD_E_FIN,
        GNOMAD_E_NFE,
 #       GNOMAD_E_OTH,
        GNOMAD_E_SAS,
        GNOMAD_G_AFR,
        GNOMAD_G_AMR,
      #        GNOMAD_G_ASJ,
        GNOMAD_G_EAS,
#        GNOMAD_G_FIN,
        GNOMAD_G_NFE,
#        GNOMAD_G_OTH,
        GNOMAD_G_SAS
    ]
    #keep these steps or you will have incomplete results
    steps: [
        frequencyFilter:  { maxFrequency: 100.0},
        pathogenicityFilter: { keepNonPathogenic: true },
        inheritanceFilter: {},
        omimPrioritiser: {},
        hiPhivePrioritiser: {},
    ]
outputOptions:
    outputContributingVariantsOnly: false
    #numGenes options: 0 = all or specify a limit e.g. 500 for the first 500 results
    numGenes: 0
    # Path to the desired output directory. Will default to the 'results' subdirectory of the exomiser install directory
    outputDirectory: results
    # Filename for the output files. Will default to {input-vcf-filename}-exomiser
    outputFileName: Pfeiffer-hiphive-exome-PASS_ONLY
    #out-format options: HTML, JSON, TSV_GENE, TSV_VARIANT, VCF (default: HTML)
    outputFormats: [HTML, JSON, TSV_GENE, TSV_VARIANT, VCF]
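As a rough way to check what a FULL-mode run actually wrote, the sketch below tallies the values found in the FILTER column of an output VCF. It relies only on the standard VCF layout (FILTER is the seventh tab-separated column, with multiple values separated by semicolons). The helper name `count_filter_values` is made up for illustration, and the exact filter labels Exomiser writes are not assumed here.

```python
from collections import Counter


def count_filter_values(vcf_lines):
    """Tally every value seen in the FILTER column of a VCF.

    `vcf_lines` is any iterable of text lines, e.g. an open file handle.
    Header lines (starting with '#') and blank lines are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            # Not a complete VCF record; ignore rather than guess.
            continue
        # FILTER may hold several semicolon-separated values.
        for value in fields[6].split(";"):
            counts[value] += 1
    return counts


# Usage (the path is an assumption; adjust it to your own output file):
# with open("results/your-output.vcf") as handle:
#     for name, n in count_filter_values(handle).most_common():
#         print(name, n)
```

If every record is present but some carry a failed-filter label, the variants were annotated rather than dropped, which is what FULL mode is described as doing above.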

@Bioinf-usr
Author

Hi,
[Image: Screenshot 2024-04-03 at 23 22 49]

Thank you for your response. Indeed, I want to annotate all the variants with the scores in the attached screenshot. Basically, I don't want to filter out any variants. I'll give it a try with the config that you provided.

Thank you.

@Bioinf-usr
Author

Bioinf-usr commented Apr 23, 2024

Hello,

I tried the config, but the number of variants shown in the log does not match the number of variants in the final TSV file. I don't understand why variants are still being filtered out.

For example, here is the run log:

2024-04-23T14:50:35.654+02:00  INFO 54623 --- [           main] o.m.e.c.analysis.AbstractAnalysisRunner  : FREQUENCY_FILTER: pass=49831 fail=0
2024-04-23T14:50:35.654+02:00  INFO 54623 --- [           main] o.m.e.c.analysis.AbstractAnalysisRunner  : PATHOGENICITY_FILTER: pass=49831 fail=0
2024-04-23T14:50:38.476+02:00  INFO 54623 --- [           main] o.m.e.c.analysis.AbstractAnalysisRunner  : Scoring genes
2024-04-23T14:50:39.912+02:00  INFO 54623 --- [           main] o.m.e.c.analysis.AbstractAnalysisRunner  : Analysed sample 40306300739 with 12583 genes containing 49831 filtered variants
2024-04-23T14:50:39.914+02:00  INFO 54623 --- [           main] o.m.e.c.analysis.AbstractAnalysisRunner  : Finished analysis in 6m 3s 894ms (363894 ms)
2024-04-23T14:50:39.914+02:00  INFO 54623 --- [           main] o.m.e.cli.ExomiserCommandLineRunner      : Writing results...
2024-04-23T14:51:28.738+02:00 DEBUG 54623 --- [           main] o.s.b.a.ApplicationAvailabilityBean      : Application availability state ReadinessState changed to ACCEPTING_TRAFFIC
2024-04-23T14:51:28.814+02:00  INFO 54623 --- [           main] o.monarchinitiative.exomiser.cli.Main    : Exomising finished - Bye!

According to this, 49831 variants should be written to the final TSV file, but I only have 37973. Could you please let me know what the reason could be and how to skip that filtering as well?
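To make the comparison above repeatable, here is a minimal sketch that pulls the pass/fail counts out of log lines shaped like the ones quoted and counts the data rows in a TSV file. The regular expression is written against the log format shown in this comment only; the function names are hypothetical, and the row counter assumes the first non-empty line of the TSV is a header.

```python
import re

# Matches e.g. "FREQUENCY_FILTER: pass=49831 fail=0" anywhere in a log line.
FILTER_LINE = re.compile(r"(\w+_FILTER): pass=(\d+) fail=(\d+)")


def parse_filter_counts(log_lines):
    """Return {filter_name: (passed, failed)} for every matching log line."""
    counts = {}
    for line in log_lines:
        match = FILTER_LINE.search(line)
        if match:
            name, passed, failed = match.groups()
            counts[name] = (int(passed), int(failed))
    return counts


def count_data_rows(tsv_lines):
    """Count non-empty lines after the header (assumed to be the first one)."""
    rows = [line for line in tsv_lines if line.strip()]
    return max(len(rows) - 1, 0)


# Usage (both paths are assumptions):
# with open("exomiser.log") as log, open("results/your-output.variants.tsv") as tsv:
#     print(parse_filter_counts(log))
#     print(count_data_rows(tsv))
```

A row count alone does not say which variants differ, so it is only a first check before comparing the records themselves.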

@Bioinf-usr
Author

Hello,

Just wondering if you had any update regarding this issue?

Thank you.

@julesjacobsen
Contributor

julesjacobsen commented May 21, 2024

Are you able to share which variants are missing? Is this a multi-sample VCF?
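One possible way to answer both questions is sketched below: it lists the sample columns in the input VCF and reports which (contig, position, ref, alt) keys from the VCF have no matching row in the variants TSV. The TSV column names (`CONTIG`, `START`, `REF`, `ALT`) are assumptions passed in as parameters, so check them against your own header; all function names are hypothetical. Multi-allelic VCF records are split into one key per ALT allele, and a leading `chr` is dropped on both sides. If alleles are trimmed or normalised in the output, indels may show up as missing even when they are present.

```python
import csv


def _strip_chr(contig):
    return contig[3:] if contig.startswith("chr") else contig


def vcf_sample_names(vcf_lines):
    """Sample names are the columns after FORMAT on the #CHROM header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[9:]
    return []


def vcf_variant_keys(vcf_lines):
    """One (contig, pos, ref, alt) key per ALT allele in the VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((_strip_chr(chrom), int(pos), ref, alt))
    return keys


def tsv_variant_keys(tsv_lines, contig="CONTIG", start="START", ref="REF", alt="ALT"):
    """Keys from a tab-separated file; the column names are assumptions."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    if reader.fieldnames is None:
        return set()
    # Tolerate a header line that starts with '#'.
    reader.fieldnames = [name.lstrip("#") for name in reader.fieldnames]
    return {
        (_strip_chr(row[contig]), int(row[start]), row[ref], row[alt])
        for row in reader
    }


def missing_variants(vcf_lines, tsv_lines, **columns):
    """Sorted keys present in the VCF but absent from the TSV."""
    return sorted(vcf_variant_keys(vcf_lines) - tsv_variant_keys(tsv_lines, **columns))
```

More than one name returned by `vcf_sample_names` means the VCF is multi-sample. Pass open file handles (or lists of lines) for the input VCF and the variants TSV to `missing_variants` to get the list to share.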
