VCF Genotyping

McCortex can be used to genotype a VCF using read kmers from unmapped reads. This avoids issues arising from read mapping, however choosing a good kmer size may be important for regions of low complexity or samples with low coverage.

You need a VCF, the reference genome and uncleaned graph files for your samples (see Graph construction). Genopyting is done with two commands: vcfcov adds kmer counts to each allele; vcfgeno assigns a genotype to each variants using the kmer counts. VCFs can be multi-allelic and may contain insertions, deletions, SNPs or MNPs (multi-nucleotide polymorphisms).

Example:

# Build graphs
mccortex31 build -m 60G -k 21 -s NA12878 -1 NA12878.fa NA12878.k21.ctx
mccortex31 build -m 60G -k 21 -s NA12345 -1 NA12345.fa NA12345.k21.ctx

# Add kmer counts and genotype, passing coverage with -D
mccortex31 vcfcov -m 1G -r ref.fa -o calls.cov.vcf calls.vcf NA12878.k21.ctx NA12345.k21.ctx
mccortex31 vcfgeno -D 55,61 -o calls.gt.vcf calls.cov.vcf

Notes:

-r <ref.fa> can be specified multiple times if your reference is in more than one file.
The VCF must be sorted, and the reference must contain all chromosomes referenced in the VCF (you can sort a vcf with <mccortex-dir>/libs/biogrok/vcf-sort)
The VCF must have duplicate entries removed. Indels need to be left-aligned to ensure duplicates are removed.
The order of the chromosomes in the VCF and reference do not matter and do not need to be the same.
Input/output formats can be any of: vcf, vcf.gz, bcf

Why uncleaned graph files? There's no need to remove low coverage unitigs or tips from graphs beforehand as genotyping includes it's own error model. Cleaning a graph may remove real sequence.

How much memory? A human sample can run with <1GB of RAM. We only load kmers that are informative in genotyping. vcfcov can be run in stages to add kmer counts from a subset of samples at a time. vcfgeno uses negligible memory.

Example of running `vcfcov` in stages

Samples are s0.ctx, s1.ctx, s2.ctx, ref: ref.fa, VCF: calls.vcf.

Intermediate VCF files created: calls.s1.vcf, calls.s1.2.vcf, calls.s1.2.3.vcf

Output VCF: calls.gt.vcf

mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.vcf calls.vcf s1.ctx
mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.2.vcf calls.s1.vcf s2.ctx
mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.2.3.vcf calls.s1.2.vcf s3.ctx
mccortex31 vcfgeno -D 30,31,25 -o calls.gt.vcf calls.s1.2.3.vcf

`vcfcov` help:

For each sample in the graph, we add a sample to the VCF with the same name. For each sample we add two fields:

K31R: mean coverage of kmers unique to the reference
K31A: mean coverage of kmers unique to the alternative allele

vcfcov works with biallelic and multiallelic VCFs. The VCF must be sorted and will be printed in the order it was read in. With no --out argument vcfcov prints to STDOUT.

Coverage is calculated by assembling all local haplotypes and finding which kmers uniquely support an alternative allele. We then report the number and mean kmer coverage on each reference and alternative allele.

usage: mccortex31 vcfcov [options] <in.vcf> <in.ctx> [in2.ctx ...]

  Get coverage of a VCF in the cortex graphs. VCF must be sorted by position. 
  It is recommended to use uncleaned graphs.

  -h, --help             This help message
  -q, --quiet            Silence status output normally printed to STDERR
  -f, --force            Overwrite output files
  -m, --memory <mem>     Memory to use
  -n, --nkmers <kmers>   Number of hash table entries (e.g. 1G ~ 1 billion)
  -o, --out <out.vcf>    Output file [default: STDOUT]
  -O, --out-fmt <f>      Format vcf|vcfgz|bcf|ubcf
  -r, --ref <ref.fa>     Reference file [required]
  -L, --max-var-len <A>  Only use alleles <= A bases long [default: 100]
  -N, --max-nvars <N>    Limit haplotypes to <= N variants [default: 8]
  -M, --low-mem          Two-passes of VCF to only load needed kmers [default]
  -H, --high-mem         One-pass of VCF, all kmers loaded (when streaming VCF)

`vcfgeno` help:

After adding sample coverage for variants using the vcfcov command -- one sample at a time or all at once -- you can genotype a VCF with the vcfgeno command:

mccortex31 vcfgeno -m 60G --out out.vcf.gz --cov 40 in.vcf graph.ctx

vcfgeno needs to know the expected kmer coverage, this can be calculated from genome coverage -D,--cov, or specified with -C,--kcov. Adding -r (--rm-cov) will remove the annotations added by vcfcov after setting the genotype.

usage: mccortex31 vcfgeno [options] <in.vcf>

  Genotype a VCF after running vcfcov to add sample coverage. VCF does not
  need to be sorted. Currently only supports biallelic and ploidy 1 or 2.

  -h, --help           This help message
  -q, --quiet          Silence status output normally printed to STDERR
  -f, --force          Overwrite output files
  -o, --out <out.vcf>  Output file [default: STDOUT]
  -O, --out-fmt <f>    Format vcf|vcfgz|bcf|ubcf
  -E, --err <E>        List of sample error rates (per bp) [0.01]
  -D, --cov <C>        List of genome coverage per colour
  -C, --kcov <C>       List of kmer coverage per colour
  -P, --ploidy <P>     <ploidy> or sample:chr:ploidy (can be used repeatedly)
                       'sample' and 'chr' can be comma-separated lists
                       '.' means all. Be careful: applied in order.
  -l, --llk            Print all log likelihoods
  -r, --rm-cov         Remove tags set by 'vcfcov' command
  -R, --read-len <R>   List to override read lengths [optional]

  Notes: 
  1. Lists are comma-separated. If given a single value it will apply to all colours.
  2. User must give exactly one of --cov, --kcov
  3. 'kmer coverage' is calculated by: D * (R - k + 1) / R  where:
       D is sequence depth (X);  R is read length;  k is kmer size
     Or calculate with: `mccortex31 view <graph.ctx>`
     Or look at output from: mccortex31 clean <graph.ctx>

 Human ploidy example:
   mccortex31 vcfgeno \
      --cov 21,42,30 \
      --ploidy .:.:2 --ploidy .:Y:0 --ploidy John,Tom:X,Y:1 \
      in.vcf > out.vcf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCF Genotyping

Example:

Example of running `vcfcov` in stages

`vcfcov` help:

`vcfgeno` help:

Clone this wiki locally

VCF Genotyping

Example:

Example of running vcfcov in stages

vcfcov help:

vcfgeno help:

Clone this wiki locally

Example of running `vcfcov` in stages

`vcfcov` help:

`vcfgeno` help: