New kallisto index

@pmelsted pmelsted released this 27 Jun 10:54
· 52 commits to master since this release
a38143d

kallisto index

The improved kallisto index reduces memory consumption for large FASTA files and features a d-list option to improve k-mer mapping specificity. Additionally, new input and output features have been added as well as support for sample barcodes (which can be recorded in addition to cell barcodes).

New features

  • kallisto quant-tcc: This new command can run the EM algorithm on a supplied transcripts-compatibility counts (TCC) matrix file, such as that generated by "bustools count", to generate transcript-level estimates. When a gene-mapping file is supplied, gene-level abundances will also be outputted. Effective length normalization will only be performed if a kallisto index is supplied and if fragment length information is provided.
  • New technologies were added to "kallisto bus": -x SmartSeq3 (--tag can be used to supply a 5′ tag sequence that identifies UMI-containing reads), -x BDWTA (BD Rhapsody), -x Visium (10x Visium), -x SPLIT-SEQ (SPLiT-seq preprocessing), and -x BULK (for preprocessing non-demultiplexed bulk RNA-seq files).
  • "kallisto bus" can be run with -x BULK specified: In this case, it will either process a batch file (supplied via --batch), as the old "kallisto pseudo" did, or process FASTQ files supplied directly on the command line, treating each FASTQ file (or each pair of FASTQ files, if --paired is specified) as an individual sample. This is useful for generating BUS files when each sample is in a separate FASTQ file. Together with bustools and kallisto quant-tcc, this feature effectively deprecates the old "kallisto pseudo".
  • Strand-specificity is now enabled by default for the 10X, SureCell, CelSeq, BD Rhapsody, and Smart-seq3 UMI technologies (unstranded remains the default for other technologies). The user can override this by supplying the --fr-stranded, --rf-stranded, or --unstranded option.
  • Various performance improvements (mostly in data ingestion throughput).
  • When "kallisto bus" is run on paired-end reads (specified via the option --paired), a minimal form of the kallisto index is written to a file named index.saved and the fragment length distributions are written to flens.txt. This is so kallisto quant-tcc can perform effective length normalization should the need arise.
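
As a rough illustration of what running the EM algorithm on a TCC matrix means, here is a minimal Python sketch for a single sample. It is a toy version written for these notes, not kallisto's implementation: the function name em_tcc and the toy data are hypothetical, and effective length normalization is left out entirely.

```python
def em_tcc(ec_to_transcripts, ec_counts, n_transcripts, n_iter=200):
    """Toy EM: estimate per-transcript counts from equivalence-class counts.

    ec_to_transcripts[i] lists the transcript indices compatible with class i,
    and ec_counts[i] is the number of reads assigned to that class.
    No effective length normalization is applied (an assumption of this sketch).
    """
    # Start from a uniform abundance for every transcript.
    alpha = [1.0 / n_transcripts] * n_transcripts
    total = float(sum(ec_counts))
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for transcripts, count in zip(ec_to_transcripts, ec_counts):
            # E-step: split this class's count in proportion to current abundances.
            denom = sum(alpha[t] for t in transcripts)
            if denom == 0.0:
                continue
            for t in transcripts:
                expected[t] += count * alpha[t] / denom
        # M-step: renormalise the expected counts into abundances.
        alpha = [c / total for c in expected]
    return [a * total for a in alpha]


# Toy data: class 0 is unique to transcript 0, class 1 is unique to
# transcript 1, and class 2 is compatible with both.
ecs = [[0], [1], [0, 1]]
counts = [30, 10, 20]
estimates = em_tcc(ecs, counts, n_transcripts=2)
print(estimates)
```

The ambiguous reads in the shared class end up split according to the evidence from the unique classes, which is the behaviour the EM step is meant to provide.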

New index

  • A new index is used that is incompatible with the old index, and users should upgrade to this new index for kallisto v0.50.0
  • With the new index, users can set the minimizer length (--min-size), which can be used to tune indexing runtime and memory usage.
  • --max-ec-size has been added so that users can cap the size of equivalence classes (i.e. the number of transcripts compatible with a given k-mer); k-mers that exceed this size aren't considered in the pseudoalignment. This can reduce memory usage and increase runtime performance (with some loss of information if --max-ec-size is too small).
  • The --threads option is now enabled for kallisto index, so indices can be built with multiple threads (to improve runtime).
  • --d-list can be used to supply a FASTA file from which distinguishing flanking k-mers (DFKs) will be extracted (to act as a general k-mer filter for improving mapping specificity).
  • A --distinguish option has been added. With it, no polyA trimming, etc. occurs: each target is indexed as-is, and targets are distinguished from one another by target name (e.g. two targets can have the same name and be indexed together as a single target).
  • kallisto inspect can output more information: minimizer length, number of unitigs, max EC size, number of ECs discarded (i.e. over the --max-ec-size threshold), and number of D-listed elements (DFKs)
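
To make the --max-ec-size idea concrete, the following Python sketch builds a k-mer to transcript-set map for a few toy sequences and drops every k-mer whose set is larger than the cap. It only illustrates the filtering concept: the function name, the toy sequences, and the choice to ignore reverse complements and minimizers are all assumptions of this sketch, not a description of how kallisto builds its index.

```python
from collections import defaultdict


def kmer_equivalence_classes(transcripts, k, max_ec_size):
    """Map each k-mer to its compatible transcripts, capping the class size.

    Returns (kept, discarded_kmers): the k-mers whose transcript set fits
    within max_ec_size, and how many k-mers were dropped for exceeding it.
    """
    kmer_to_targets = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            kmer_to_targets[seq[i:i + k]].add(name)

    kept = {}
    discarded_kmers = 0
    for kmer, targets in kmer_to_targets.items():
        if len(targets) > max_ec_size:
            # Too many compatible transcripts: this k-mer is ignored.
            discarded_kmers += 1
        else:
            kept[kmer] = frozenset(targets)
    return kept, discarded_kmers


# Hypothetical toy transcripts that all share the prefix ACGT.
toy = {"t1": "ACGTAC", "t2": "ACGTTT", "t3": "ACGTGG"}
kept, discarded_kmers = kmer_equivalence_classes(toy, k=4, max_ec_size=2)
print(len(kept), discarded_kmers)
```

Here the shared k-mer ACGT is compatible with three transcripts, so a cap of 2 removes it while the transcript-specific k-mers are retained.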

New input features

  • --inleaved option added to kallisto bus to support reading in interleaved FASTQ input
  • Streaming FASTQ reads directly into kallisto bus is enabled by supplying - in lieu of FASTQ files
  • -x technology string: The bustools-style technology string can read RX:Z: UMIs from FASTQ header comments when given something like 0,0,8:RX:1,0,0 (i.e. RX can be supplied in the UMI portion of the technology string).
  • --numReads can be set to terminate after a certain number of reads have been processed
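
The technology string example above, 0,0,8:RX:1,0,0, can be read as three colon-separated parts (barcode, UMI, sequence). The Python sketch below parses such a string and pulls an RX:Z: value out of a FASTQ header comment. It assumes each non-RX part is a list of file,start,stop triplets; both function names and the example header are hypothetical and are not part of kallisto.

```python
def parse_technology_string(tech):
    """Split a barcode:umi:sequence technology string into its parts.

    Each part is assumed to be either the literal token RX or a
    comma-separated list of (file, start, stop) integer triplets.
    """
    parts = tech.split(":")
    if len(parts) != 3:
        raise ValueError("expected three parts: barcode:umi:sequence")
    parsed = {}
    for label, part in zip(("barcode", "umi", "sequence"), parts):
        if part == "RX":
            # The value is taken from the RX:Z: tag in the header comment.
            parsed[label] = "RX"
            continue
        values = [int(v) for v in part.split(",")]
        if len(values) % 3 != 0:
            raise ValueError(f"{label} part is not a list of triplets: {part}")
        parsed[label] = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return parsed


def rx_umi_from_header(header):
    """Return the RX:Z: value from a FASTQ header comment, or None."""
    for field in header.split()[1:]:
        if field.startswith("RX:Z:"):
            return field[len("RX:Z:"):]
    return None


layout = parse_technology_string("0,0,8:RX:1,0,0")
umi = rx_umi_from_header("@read1 RX:Z:ACGTACGT")
print(layout, umi)
```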

New sample barcode feature

  • --batch-barcodes in kallisto bus will encode the batch ID as a unique nucleotide sequence in the hidden metadata of the barcode column of the BUS file (i.e. serving as a sample barcode).
  • --batch in kallisto bus now allows a technology string to be supplied. If --batch-barcodes is not supplied, only the barcodes extracted from the technology string are stored in the BUS file (i.e. sample barcodes aren't recorded). If -1 is supplied in the barcode part of the technology string, only the batch-specific barcodes (i.e. sample barcodes) are stored, and they are stored directly in the BUS file (not in the hidden metadata, unless --batch-barcodes is supplied).
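
One simple way to picture "encoding a batch ID as a unique nucleotide sequence" is a base-4 encoding of the batch index over the alphabet A, C, G, T. The Python sketch below shows that idea only. The actual encoding and sequence length used by kallisto are not stated in these notes, so the function name, the default length of 16, and the digit order are all assumptions.

```python
def encode_batch_id(batch_index, length=16):
    """Encode a non-negative batch index as a fixed-length nucleotide string.

    Uses base 4 with A=0, C=1, G=2, T=3 (a hypothetical scheme chosen for
    illustration; it guarantees distinct indices get distinct sequences).
    """
    if batch_index < 0 or batch_index >= 4 ** length:
        raise ValueError("batch index does not fit in the requested length")
    bases = "ACGT"
    digits = []
    for _ in range(length):
        digits.append(bases[batch_index % 4])
        batch_index //= 4
    # Most significant digit first.
    return "".join(reversed(digits))


sample_barcodes = [encode_batch_id(i, length=4) for i in range(8)]
print(sample_barcodes)
```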

New output features

  • The kallisto quant-tcc command can output exactly what "kallisto quant" does (including with bootstraps for sleuth) for each barcode, either into separate abundance.tsv files (if --matrix-to-files is specified) or into separate directories, each containing an abundance.tsv file (if --matrix-to-directories is specified). Also, h5ad will be produced if kallisto was compiled with that option (unless --plaintext is supplied to quant-tcc).
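
For downstream use, a per-barcode abundance.tsv file can be read with a few lines of Python. The sketch below assumes the usual "kallisto quant" column layout (target_id, length, eff_length, est_counts, tpm) and uses a made-up in-memory table in place of a real file; how the per-barcode files or directories are named is not covered here.

```python
import csv
import io


def read_abundance(handle):
    """Read an abundance.tsv-style table into {target_id: (est_counts, tpm)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["target_id"]: (float(row["est_counts"]), float(row["tpm"]))
        for row in reader
    }


# Made-up example content; real values come from kallisto quant-tcc output.
example = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t820.5\t45\t750000\n"
    "tx2\t1200\t1020.5\t15\t250000\n"
)
abundance = read_abundance(io.StringIO(example))
print(abundance)
```

With a real file, the same function can be called on an open file handle, for example inside `with open(path, newline="") as handle:`.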

Other new features

  • Progress is outputted every 1M reads
  • --aa option enabled in kallisto bus and kallisto index for amino acid mapping to nucleotide (functionalities to be described in a paper)

New compilation options

  • HTSLIB is no longer enabled by default; to enable it, use cmake .. -DUSE_BAM=ON
  • Zlib is still supported and used by default, but the better zlib-ng is included and can be used by supplying the corresponding cmake option (-DZLIBNG=ON).
  • Compilation flags to enable all features are as follows: cmake .. -DZLIBNG=ON -DUSE_BAM=ON -DBUILD_FUNCTESTING=ON -DUSE_HDF5=ON

End of support for existing bulk RNAseq features

  • --bias, --fusion, --genomebam, and --pseudobam in kallisto quant and kallisto bus are no longer supported; users who need these features should use v0.48.0.
  • The --gfa, --gtf, and --bed options in kallisto inspect are no longer supported; users who need these features should use v0.48.0.