-
Notifications
You must be signed in to change notification settings - Fork 25
VCF Genotyping
McCortex can be used to genotype a VCF using read kmers from unmapped reads. This avoids issues arising from read mapping, however choosing a good kmer size may be important for regions of low complexity or samples with low coverage.
You need a VCF, the reference genome and uncleaned graph files for your samples (see Graph construction). Genopyting is done with two commands: vcfcov
adds kmer counts to each allele; vcfgeno
assigns a genotype to each variants using the kmer counts. VCFs can be multi-allelic and may contain insertions, deletions, SNPs or MNPs (multi-nucleotide polymorphisms).
# Build graphs
mccortex31 build -m 60G -k 21 -s NA12878 -1 NA12878.fa NA12878.k21.ctx
mccortex31 build -m 60G -k 21 -s NA12345 -1 NA12345.fa NA12345.k21.ctx
# Add kmer counts and genotype, passing coverage with -D
mccortex31 vcfcov -m 1G -r ref.fa -o calls.cov.vcf calls.vcf NA12878.k21.ctx NA12345.k21.ctx
mccortex31 vcfgeno -D 55,61 -o calls.gt.vcf calls.cov.vcf
Notes:
-
-r <ref.fa>
can be specified multiple times if your reference is in more than one file. - The VCF must be sorted, and the reference must contain all chromosomes referenced in the VCF (you can sort a vcf with
<mccortex-dir>/libs/biogrok/vcf-sort
) - The VCF must have duplicate entries removed. Indels need to be left-aligned to ensure duplicates are removed.
- The order of the chromosomes in the VCF and reference do not matter and do not need to be the same.
- Input/output formats can be any of: vcf, vcf.gz, bcf
Why uncleaned graph files? There's no need to remove low coverage unitigs or tips from graphs beforehand as genotyping includes it's own error model. Cleaning a graph may remove real sequence.
How much memory?
A human sample can run with <1GB of RAM. We only load kmers that are informative in genotyping. vcfcov
can be run in stages to add kmer counts from a subset of samples at a time. vcfgeno
uses negligible memory.
Samples are s0.ctx
, s1.ctx
, s2.ctx
, ref: ref.fa
, VCF: calls.vcf
.
Intermediate VCF files created: calls.s1.vcf
, calls.s1.2.vcf
, calls.s1.2.3.vcf
Output VCF: calls.gt.vcf
mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.vcf calls.vcf s1.ctx
mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.2.vcf calls.s1.vcf s2.ctx
mccortex31 vcfcov -m 1G -r ref.fa -o calls.s1.2.3.vcf calls.s1.2.vcf s3.ctx
mccortex31 vcfgeno -D 30,31,25 -o calls.gt.vcf calls.s1.2.3.vcf
For each sample in the graph, we add a sample to the VCF with the same name. For each sample we add two fields:
-
K31R
: mean coverage of kmers unique to the reference -
K31A
: mean coverage of kmers unique to the alternative allele
vcfcov
works with biallelic and multiallelic VCFs. The VCF must be sorted and will be printed in the order it was read in. With no --out
argument vcfcov
prints to STDOUT.
Coverage is calculated by assembling all local haplotypes and finding which kmers uniquely support an alternative allele. We then report the number and mean kmer coverage on each reference and alternative allele.
usage: mccortex31 vcfcov [options] <in.vcf> <in.ctx> [in2.ctx ...]
Get coverage of a VCF in the cortex graphs. VCF must be sorted by position.
It is recommended to use uncleaned graphs.
-h, --help This help message
-q, --quiet Silence status output normally printed to STDERR
-f, --force Overwrite output files
-m, --memory <mem> Memory to use
-n, --nkmers <kmers> Number of hash table entries (e.g. 1G ~ 1 billion)
-o, --out <out.vcf> Output file [default: STDOUT]
-O, --out-fmt <f> Format vcf|vcfgz|bcf|ubcf
-r, --ref <ref.fa> Reference file [required]
-L, --max-var-len <A> Only use alleles <= A bases long [default: 100]
-N, --max-nvars <N> Limit haplotypes to <= N variants [default: 8]
-M, --low-mem Two-passes of VCF to only load needed kmers [default]
-H, --high-mem One-pass of VCF, all kmers loaded (when streaming VCF)
After adding sample coverage for variants using the vcfcov
command -- one sample at a time or all at once -- you can genotype a VCF with the vcfgeno
command:
mccortex31 vcfgeno -m 60G --out out.vcf.gz --cov 40 in.vcf graph.ctx
vcfgeno
needs to know the expected kmer coverage, this can be calculated from genome coverage -D,--cov
, or specified with -C,--kcov
. Adding -r
(--rm-cov
) will remove the annotations added by vcfcov after setting the genotype.
usage: mccortex31 vcfgeno [options] <in.vcf>
Genotype a VCF after running vcfcov to add sample coverage. VCF does not
need to be sorted. Currently only supports biallelic and ploidy 1 or 2.
-h, --help This help message
-q, --quiet Silence status output normally printed to STDERR
-f, --force Overwrite output files
-o, --out <out.vcf> Output file [default: STDOUT]
-O, --out-fmt <f> Format vcf|vcfgz|bcf|ubcf
-E, --err <E> List of sample error rates (per bp) [0.01]
-D, --cov <C> List of genome coverage per colour
-C, --kcov <C> List of kmer coverage per colour
-P, --ploidy <P> <ploidy> or sample:chr:ploidy (can be used repeatedly)
'sample' and 'chr' can be comma-separated lists
'.' means all. Be careful: applied in order.
-l, --llk Print all log likelihoods
-r, --rm-cov Remove tags set by 'vcfcov' command
-R, --read-len <R> List to override read lengths [optional]
Notes:
1. Lists are comma-separated. If given a single value it will apply to all colours.
2. User must give exactly one of --cov, --kcov
3. 'kmer coverage' is calculated by: D * (R - k + 1) / R where:
D is sequence depth (X); R is read length; k is kmer size
Or calculate with: `mccortex31 view <graph.ctx>`
Or look at output from: mccortex31 clean <graph.ctx>
Human ploidy example:
mccortex31 vcfgeno \
--cov 21,42,30 \
--ploidy .:.:2 --ploidy .:Y:0 --ploidy John,Tom:X,Y:1 \
in.vcf > out.vcf