Utility for Nanopore Current Alignment to Large Expanses of DNA (version 4)
A toolkit for nanopore signal alignment, analysis, and visualization
Performs accurate basecaller-guided nanopore signal alignment, similar to nanopolish eventalign or tombo resquiggle, to map nanopore signal segments to reference nucleotides. Also supports conversion of any signal alignments to an efficient BAM format, allowing for interactive visualizations, modification detection, and other signal analyses.
For real-time targeted sequencing via rapid signal mapping, see UNCALLED
pip install uncalled4
OR
git clone https://github.com/skovaka/uncalled4.git
cd uncalled4
pip install .
Requires python >= 3.8 and GCC>=4.9.2
Uncalled4 is currently only tested Linux, primarily on Ubuntu 18.04 and 20.04
The example/
directory contains scripts to download example data and test each command:
cd uncalled4/example
./download.sh
./run.sh
Uncalled4 signal alignment is guided by BAM alignments output by the basecaller with "move" tags (Dorado --emit-moves --emit-sam
or Guppy --moves_out
), which include mv:
, ts:
, and related BAM tags. This BAM file should contain reference-aligned reads (Dorado --reference
or guppy --align_ref
), unmapped will be ignored. To (re)align reads while preserving these tags, you can use samtools fastq -T "mv,ts,pi,sp,ns" <file>.bam > file.fastq
(v1.16 or newer) to convert the tags to FASTQ format, then align using minimap2 -y ...
to propagate the tags to the SAM file. Guppy only includes BAM tag on primary alignments by default. See scripts/bamprep.py to copy tags to supplemental and secondary aligments.
Uncalled4 primarily stores signal alignments in BAM alignment tags, including per-reference signal coordinates and current summary statistics required for most signal analyses. A sorted and indexed (via samtools index
) Uncalled4 BAM file is required for most visualization and analysis commands. Nanopolish, f5c, or Tombo alignments can be converted to Uncalled4 format via the convert
command.
Signal alignment requires a pore model to map k-mers to expected current. Uncalled4 will attempt to automatically detect the appropriate pore model from the input data, but may require you to specify a preset pore model or custom pore model. This can be specified using the --pore-model
flag or --flowcell
and --kit
flags. uncalled4 train
trains new pore models, either starting from a initialization pore model, or from scratch using basecaller moves.
Preliminary support for RNA004 alignment is implemented on the dev
branch. Signal alignment quality appears to be better than RNA002, but the exact implementation may change before migrating to the main branch.
Perform DTW alignment guided by basecalled alignments
uncalled4 align [-h] [--ref REF | --self] [--reads READS [READS ...]] --bam-in [BAM_IN] [-p PROCESSES] [--flowcell FLOWCELL]
[--kit KIT] [--basecaller-profile BASECALLER_PROFILE] [--rna] [--ordered-out] [-f] [--kmer-shift KMER_SHIFT]
[--bam-chunksize BAM_CHUNKSIZE] [--out-name OUT_NAME] [-r] [-l READ_FILTER] [-x READ_INDEX] [-n MAX_READS]
[--count-events] [--del-max DEL_MAX] [--ins-max INS_MAX] [--unmask-splice] [--method METHOD]
[-c {abs_diff,z_score,norm_pdf}] [-b BAND_WIDTH] [-s BAND_SHIFT] [--mvcmp-mask MVCMP_MASK]
[--max-norm-dist MAX_NORM_DIST] [--max-sd MAX_SD] [--min-aln-length MIN_ALN_LENGTH] [-N {ref_mom,model_mom}]
[--zero-ts] [-C CONFIG] [-o [BAM_OUT] | --tsv-out [TSV_OUT] | --eventalign-out [EVENTALIGN_OUT]] [-m PORE_MODEL]
[--bam-f5c] [--tsv-cols TSV_COLS] [--tsv-na [TSV_NA]] [--tsv-noref] [--eventalign-flags EVENTALIGN_FLAGS]
[--norm-iterations NORM_ITERATIONS] [--mask-skips [MASK_SKIPS]] [--skip-cost SKIP_COST] [--stay-cost STAY_COST]
[--move-cost MOVE_COST]
required arguments:
--ref [fasta] | --self Reference to align to, or perform signal-to-read self-alignment
--reads [path] Paths to fast5, slow5, or pod5 files, or to directories containing those files (optionally recursive)
--bam-in [bam_in] BAM input file (or "-"/no argument for stdin) (default: None)
output arguments (one required):
-o, --bam-out BAM output file
--tsv-out TSV output file (or "-"/no argument for stdout)
--eventalign-out Eventalign (nanopolish) output file (or "-"/no argument for stdout)
selected optional arguments:
-p, --processess Number of parallel processes (default: 1)
--flowcell FLOWCELL Flowcell used for sequencing (e.g. FLO-MIN106)
--kit KIT Kit used for sequencing (e.g. SQK-LSK109)
-m, --pore-model Custom pore model
--rna Required for custom RNA pore models
--tsv-cols TSV_COLS TSV file output alignment layers (comma-separated, can also include "read_id" (default: dtw)
--tsv-na [TSV_NA] Missing value representation for TSV output (default: *)
--tsv-noref Will NOT output reference coordinates to TSV if True (default: False)
--eventalign-flags Eventalign optional flags (comma-separated list of ""print-read-names", "signal-index", "samples")
-h, --help Print full command line usage
--bam-in
must be a BAM file produced by Dorado using --emit-moves --emit-sam
flags or the Guppy --moves_out
flag.
The --self
option is an alternative to --ref
, and performs signal-to-read alignment to the basecalled read, which must be defined in the SEQ
field in the input BAM file.
Convert between signal alignment file formats
uncalled4 convert [-h] [--bam-in BAM_IN | --eventalign-in [EVENTALIGN_IN] | --tombo-in TOMBO_IN]
[--eventalign-out [EVENTALIGN_OUT] | --tsv-out [TSV_OUT] |
--m6anet-out [M6ANET_OUT]] [--tsv-cols TSV_COLS]
[--eventalign-flags EVENTALIGN_FLAGS]
[--mask-skips [MASK_SKIPS]] [--ref FASTA] [--reads READS [READS ...]]
[-l READ_FILTER] [-x READ_INDEX] [-r] [--rna] [-R REF_BOUNDS] [-f] [-a]
Generally only one --*-in
and one --*-out
option should be specified, with the exception of --bam-out
where a template bam file should be specified via --bam-in
.
nanopolish or f5c eventalign
should be run with the --signal-index
and --scale-events
options, and can be converted with uncalled4 convert --eventalign-in <eventalign.txt> --bam-in <mm2.bam>
, where <mm2.bam>
is the exact BAM file used to guide the eventalign command.
--m6anet-out
efficently implements m6anet dataprep
for sorted Uncalled4 BAM files. This should be used with an m6Anet model trained on Uncalled4 alignments
Iteratively train a new k-mer pore model
Accepts most of the same paramters as uncalled4 align
, in addition to number of iterations. First iteration must use some starting pore model, while subsequent iterations use the pore model from the previous iteration.
uncalled4 train [-h] [-i TRAIN_ITERATIONS] [-m INIT_MODEL] [--init-mode INIT_MODE] [--moves-avg MOVES_AVG] [-k KMER_LEN]
[--kmer-samples KMER_SAMPLES] [--buffer-size BUFFER_SIZE] [-d MAX_MOVES_DIST] [--train-mean] [--out-dir OUT_DIR] [-a]
[--skip-dtw] [--mask-skips [MASK_SKIPS]] [--norm-iterations NORM_ITERATIONS] [--skip-cost SKIP_COST]
[--stay-cost STAY_COST] [--move-cost MOVE_COST] [--ref REF | --self] [--reads READS [READS ...]] --bam-in [BAM_IN]
[-p PROCESSES] [--flowcell FLOWCELL] [--kit KIT] [--basecaller-profile BASECALLER_PROFILE] [--rna] [--ordered-out]
[-f] [--kmer-shift KMER_SHIFT] [--bam-chunksize BAM_CHUNKSIZE] [--out-name OUT_NAME] [-r] [-l READ_FILTER]
[-x READ_INDEX] [-n MAX_READS] [--count-events] [--del-max DEL_MAX] [--ins-max INS_MAX] [--unmask-splice]
[--method METHOD] [-c {abs_diff,z_score,norm_pdf}] [-b BAND_WIDTH] [-s BAND_SHIFT] [--mvcmp-mask MVCMP_MASK]
[--max-norm-dist MAX_NORM_DIST] [--max-sd MAX_SD] [--min-aln-length MIN_ALN_LENGTH] [-N {ref_mom,model_mom}]
[--zero-ts] [-C CONFIG]
All visualizations are generated using Plotly.
Plot signal-to-reference alignment dotplots
uncalled4 dotplot [-h] [--bam-in BAM_IN [BAM_IN ...]] [-o OUT_PREFIX] [--ref REF] [--names NAMES] [--reads READS [READS ...]]
[-x READ_INDEX] [-r] [--rna] [-f {svg,png,pdf}] [-R REGION] [-l READ_FILTER] [-L LAYERS] [-p PORE_MODEL]
[--multi-background] [--no-model] [--svg] [-C CONFIG]
options:
-h, --help show this help message and exit
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
-o OUT_PREFIX, --out-prefix OUT_PREFIX
If included will output images with specified prefix, otherwise will display interactive plot. (default: None)
--ref REF Reference FASTA file, must match --bam-in reference (default: None)
--names NAMES Names of tracks to read from input(s) (default: None)
--reads READS [READS ...]
Paths to FAST5, SLOW5, or POD5 files, or to directories containing those files (optionally recursive) (default:
None)
-x READ_INDEX, --read-index READ_INDEX
File containing a mapping of read IDs to filenames (default: None)
-r, --recursive Recursively search 'paths' for FAST5, SLOW5, or POD5 files (default: False)
--rna Should be set for direct RNA data (default: None)
-f {svg,png,pdf}, --out-format {svg,png,pdf}
Image output format. Only has an effect with -o option. (default: svg)
-R REGION, --region REGION
Only load reads which overlap these coordinates (default: None)
-l READ_FILTER, --read-filter READ_FILTER
List of read IDs to load, or file containing one read ID per line (default: None)
-L LAYERS, --layers LAYERS
-p PORE_MODEL, --pore-model PORE_MODEL
Model preset name or TSV filename (default: None)
--multi-background Will plot multiple stacked background colors for multiple tracks if True (default: False)
--no-model Will not plot the expected reference signal if True (default: False)
--svg Make SVG-friendly figures (default: False)
-C CONFIG, --config CONFIG
Configuration file in TOML format (default: None)
Plot alignment tracks and per-reference statistics
Trackplots are defined by a series of panels displaying different layers. A mat
panel display a heatmap of layer values for each ref/read coordinate on each track. A box
panel displays boxplots of layer summary statistics for each track. line
and scatter
panels display refstats
summary statistics, specified by <layer>.<statistic>
(e.g. current.mean
, model_diff.median
).
uncalled4 trackplot [-h] -R REGION --bam-in BAM_IN [BAM_IN ...] [--ref REF] [--read-paths READ_PATHS [READ_PATHS ...]]
[-x READ_INDEX] [-r] [--rna] [--pore-model PORE_MODEL] [--svg] [-f] [-l READ_FILTER]
[-H PANEL_HEIGHTS [PANEL_HEIGHTS ...]] [--shared-refs-only] [--shared-reads-only] [--share-reads] [--hover-read]
[-o OUTFILE] [-C CONFIG] [--bases] [--mat LAYER] [--box LAYER] [--line LAYER.STAT] [--scatter LAYER.STAT]
options:
-h, --help show this help message and exit
-R REGION, --region REGION
Only load reads which overlap these coordinates (default: None)
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
--ref REF Reference FASTA file, must match --bam-in reference (default: None)
--read-paths READ_PATHS [READ_PATHS ...]
Paths to FAST5, SLOW5, or POD5 files, or to directories containing those files (optionally recursive) (default:
None)
-x READ_INDEX, --read-index READ_INDEX
File containing a mapping of read IDs to filenames (default: None)
-r, --recursive Recursively search 'paths' for FAST5, SLOW5, or POD5 files (default: False)
--rna Should be set for direct RNA data (default: None)
--pore-model PORE_MODEL
Model preset name or TSV filename (default: )
--svg Make SVG-friendly figures (default: False)
-f, --full-overlap If true will only include reads which fully cover reference bounds (default: False)
-l READ_FILTER, --read-filter READ_FILTER
List of read IDs to load, or file containing one read ID per line (default: None)
-H PANEL_HEIGHTS [PANEL_HEIGHTS ...], --panel-heights PANEL_HEIGHTS [PANEL_HEIGHTS ...]
Relative height of each panel (default: None)
--shared-refs-only If true will only contain reference positions where all tracks have sufficient coverage (see min_coverage) (default:
False)
--shared-reads-only If true will only contain reads shared between all tracks (default: False)
--share-reads If True will only display reads shared by all alignment tracks with shared y-axis (default: False)
--hover-read If True will display read_id in mat hover (default: False)
-o OUTFILE, --outfile OUTFILE
Output file (default: None)
-C CONFIG, --config CONFIG
Configuration file in TOML format (default: None)
--bases Display a ref-by-read matrix of specified alignment layer (default: None)
--mat LAYER Display a ref-by-read matrix of specified alignment layer (default: None)
--box LAYER Display a boxplot of specified layer (default: None)
--line LAYER.STAT Display a line plot of specifed layer summary statistic (default: None)
--scatter LAYER.STAT Display a line plot of specifed layer summary statistic (default: None)
Interactive signal alignment genome browser
Integrates trackplot and dotplot for interactive alignment browsing
uncalled4 browser [-h] -R REGION [--bam-in BAM_IN [BAM_IN ...]] [--shared-reads-only] [--reads READS [READS ...]] [--ref REF]
[-x READ_INDEX] [-r] [--rna] [-l READ_FILTER] [-f] [--pore-model PORE_MODEL] [--names NAMES] [-p PORT] [-o OUTFILE]
[-C CONFIG]
options:
-h, --help show this help message and exit
-R REGION, --region REGION
Reference coordinates to visualize (chr:start-end) (default: None)
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
--shared-reads-only If true will only contain reads shared between all tracks (default: False)
--reads READS [READS ...]
Paths to FAST5, SLOW5, or POD5 files, or to directories containing those files (optionally recursive) (default:
None)
--ref REF Reference FASTA file, must match --bam-in reference (default: None)
-x READ_INDEX, --read-index READ_INDEX
File containing a mapping of read IDs to filenames (default: None)
-r, --recursive Recursively search 'paths' for FAST5, SLOW5, or POD5 files (default: False)
--rna Should be set for direct RNA data (default: None)
-l READ_FILTER, --read_filter READ_FILTER
Only load reads which overlap these coordinates (default: None)
-f, --full-overlap If true will only include reads which fully cover reference bounds (default: False)
--pore-model PORE_MODEL
Model preset name or TSV filename (default: )
--names NAMES Names of tracks to read from input(s) (default: None)
-p PORT, --port PORT Browser port (default: 8000)
-o OUTFILE, --outfile OUTFILE
Output file (default: None)
-C CONFIG, --config CONFIG
Configuration file in TOML format (default: None)
These functions compute statistics over reference and read coordinates. refstats
computes summary and comparison statistics (e.g. Kolmogorov-Smirnov test) over reference coordinates, while layerstats
maintains the read-by-reference dimensions of the DTW alignments.
(Coming soon: readstats
to compute read-level statistics)
Calculate per-reference-coordinate statistics
uncalled4 refstats [-h] [--bam-in BAM_IN [BAM_IN ...]] [--layers LAYERS [LAYERS ...]] [--stats REFSTATS [REFSTATS ...]]
[-t TRACKS] [-R REGION] [--min-coverage MIN_COVERAGE] [--bed-filter BED_FILTER] [--ref-chunksize REF_CHUNKSIZE]
[--aln-chunksize ALN_CHUNKSIZE] [-c] [--ref REF] [-m PORE_MODEL] [-p PROCESSES] [-o OUTFILE]
required arguments:
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
--layers LAYERS [LAYERS ...]
Comma-separated list of layers over which to compute summary statistics (default: [])
--stats REFSTATS [REFSTATS ...]
Comma-separated list of summary statistics to compute. Some statisitcs (ks) can only be used if exactly two tracks
are provided {q5,kurt,q75,KS,q95,max,var,skew,median,stdv,mean,q25,min} (default: None)
options:
-h, --help show this help message and exit
-t TRACKS, --tracks TRACKS
Names of tracks to read from input(s) (default: None)
-R REGION, --region REGION
Only load reads which overlap these coordinates (default: None)
--min-coverage MIN_COVERAGE
Reference positions with less than this coverage will be excluded from each track (or all tracks if shared_refs_only
is true) (default: 1)
--bed-filter BED_FILTER
Only parse regions in BED file (default: None)
--ref-chunksize REF_CHUNKSIZE
Number of reference coordinates to query for iteration (default: 10000)
--aln-chunksize ALN_CHUNKSIZE
Number of alignments to query for iteration (default: 500)
-c, --cov Output track coverage for each reference position (default: False)
--ref REF Reference FASTA file, must match --bam-in reference (default: None)
-m PORE_MODEL, --pore-model PORE_MODEL
Model preset name or TSV filename (default: None)
-p PROCESSES, --processes PROCESSES
Number of parallel processes (default: 1)
-o OUTFILE, --outfile OUTFILE
Compute per-read summary statistics
uncalled4 readstats [-h] [--bam-in BAM_IN [BAM_IN ...]] [--layers LAYERS [LAYERS ...]] [--stats STATS [STATS ...]] [-R REGION]
[-s SUMMARY_STATS] [-C CONFIG]
options:
-h, --help show this help message and exit
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
--layers LAYERS [LAYERS ...]
Which layers to compute statistics (default: None)
--stats STATS [STATS ...]
Summary statistics to compute (any builtin numpy function, e.g. mean, std, etc) (default: None)
-R REGION, --region REGION
Only load reads which overlap these coordinates (default: None)
-s SUMMARY_STATS, --summary-stats SUMMARY_STATS
Summary statistics to compute for "model_diff" command. (default: ['mean'])
-C CONFIG, --config CONFIG
Configuration file in TOML format (default: None)
Compute distance between alignments of the same reads
uncalled4 compare [-h] [--bam-in BAM_IN [BAM_IN ...]] [-t TRACKS] [-l READ_FILTER] [-R REGION] [-m] [--tsv-cols TSV_COLS]
[--tsv-na [TSV_NA]] [--tsv-noref] [--tsv-out [TSV_OUT]] [-C CONFIG]
options:
-h, --help show this help message and exit
--bam-in BAM_IN [BAM_IN ...]
BAM input file (default: None)
-t TRACKS, --tracks TRACKS
Names of tracks to read from input(s) (default: None)
-l READ_FILTER, --read-filter READ_FILTER
Only load reads which overlap these coordinates (default: None)
-R REGION, --region REGION
Only load reads which overlap these coordinates (default: None)
-m, --moves Compare against basecalled alignment. If two tracks input will look for "moves" group in second track, otherwise
will look in the first track. (default: False)
--tsv-cols TSV_COLS TSV file output alignment layers (comma-separated, can also include "read_id" (default: dtwcmp,mvcmp)
--tsv-na [TSV_NA] Missing value representation for TSV output (default: *)
--tsv-noref Will NOT output reference coordinates to TSV if True (default: False)
--tsv-out [TSV_OUT], -o [TSV_OUT]
TSV output file (or "-"/no argument for stdout) (default: -)
-C CONFIG, --config CONFIG
Configuration file in TOML format (default: None)
Uncalled4 pore models map k-mers to their expected current characteristics for a specific sequencing chemistry, minimally defining the expected mean current (current.mean
) for each k-mer. Pore models trained by Uncalled4 include means and standard deviations for per-kmer current means, current standard deviations, and dwell time measured in raw sample length: current.mean
/current.stdv
, current_sd.mean
/current_sd.stdv
, length.mean
/length.stdv
. In addition to per-kmer statistics, each model has a defined k-mer shift
used to define the central base that has the most effect on the current, the picoamp mean and standard deviation (pa_mean
and pa_stdv
) that can be used to scale the normalized current values to picoamps, and other model-specifc parameters like sample_rate
and bases_per_sec
. A reverse
parameter is also included, which is set to True
for RNA to indicated reversed sequencing direction.
Four pore model presets are provided by Uncalled4: dna_r10.4.1_400bps_9mer
, dna_r9.4.1_400bps_6mer
, rna_r9.4.1_70bps_5mer
, and rna004_130bps_9mer
. These are stored efficently in binary NumPy "npz" files, and can be converted to TSV format using the provided model2tsv.py script.
Custom pore models can be provided in TSV format via the --pore-model
command line argument, which should minimally include columns named kmer
and current.mean
. Column names for current levels are also aliased to support Nanopolish and other similar models, so current.mean
can be level_mean
or mean
, current_sd.mean
can be sd_mean
or stdv
, etc. K-mer offsets can also be defined for custom pore models using the --kmer-shift
option. All pore models are automatically normalized such that the mean and standard deviation of current.mean
is 0 and 1 respecively, which is required for BAM encoding, and the resulting BAM file will store scaling factors to convert to the original input values in pa_mean
and pa_stdv
.
Uncalled4 will attempt to automatically detect the sequencing chemistry based on POD5/FAST5/BLOW5 metadata, which is required even with custom pore models to determine the appropriate offset to use for basecaller moves metadata. If this cannot be automatically detected, the --basecaller-profile
must also be provided, which is defined similar to the pore model presets but without a defined k-mer length: either dna_r10.4.1_400bps
, dna_r10.4.1_260bps
, dna_r9.4.1_400bps
, rna_r9.4.1_70bps
, or rna004_130bps
. If you are using a sequenicng chemistry which does not have an appropirate preset, please submit an issue and we can implement one.
Uncalled4 stores signal alignments as a set of layers associated with read and reference coordinates. Each layer is derived from the read signal (e.g. current
), the reference sequence (e.g. kmer
), or other layers (e.g. model_diff
). Layers are organized into layer groups: dtw
for signal alignments, bcaln
for projected basecalled alignments, and cmp
for alignment comparisons. Several subcommands detailed above take layer names as input, which should generaly be in the form <group>.<layer>
. Below is a table of currently available layers:
Group | Layer | Description |
---|---|---|
dtw | current | Normalized mean read signal current |
dtw | current_sd | Normalized read signal current standard deviation |
dtw | start | Read signal sample start index |
dtw | length | Read signal sample length |
dtw | dwell | Signal dwell time (ms/nt, proportional to length) |
dtw | model_diff | Difference between predicted (via a pore model) and observed current |
dtw | abs_diff | Absolute value of dtw.model_diff |
dtw | kmer | Binarized reference k-mer |
dtw | events | Number of raw signal events aligned ("stays" > 1, "skips" < 1) |
dtw | base | Binarized reference base |
moves | start | Estimated read signal sample start index |
moves | length | Estimated read signal sample length |
moves | indel | Number of inserted (>0) or deleted (<0) nucleotides |
mvcmp | mean_ref_dist | Mean reference distance between signal alignment and ref-moves |
mvcmp | jaccard | Raw sample jaccard distances between signal alignment and ref-moves |
dtwcmp | mean_ref_dist | Mean reference distance between two alignments (must first run layerstats compare ) |
dtwcmp | jaccard | Raw sample jaccard distances between two alignments (must first run layerstats compare ) |
All tracks must be written to the same database for multi-track visualization and analysis (e.g. comparing alignments, calculating KS statistics). You can merge multiple databases into a single file using uncalled db merge
- v4.1.1: Bug fixes and added example output
- v4.1.0: Major update.
- Added RNA004 support
- Added signal-to-read alignment via
align --self
- Changed r10.4.1 output coordinates to be centered on central base
- Changed all positional arguments to flags
- v4.0.0: Pre-print release For earlier development history, see https://github.com/skovaka/UNCALLED/tree/dev4