kallisto quant/pseudo counts differ #272

jnotwell · 2020-06-19T23:19:32Z

I'm trying the kallisto pseudo --quant --batch command that was used in the recent isoform-level analysis preprint.

I found that it produced different counts than the kallisto quant command - is there a reason for this difference? Steps to reproduce below:

Download the sequencing reads and kallisto index:

$ wget https://data.nemoarchive.org/biccn/lab/zeng/transcriptome/scell/SMARTer/raw/MOp/LS-15395_S48_E1-50.fastq.tar

$ tar -xvf LS-15395_S48_E1-50.fastq.tar

$ wget https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/ensembl-96/mus_musculus.tar.gz

$ tar -zxvf mus_musculus.tar.gz

Next, quantify with kallisto quant:

$ kallisto version
kallisto, version 0.46.2

$ kallisto quant -i mus_musculus/transcriptome.idx -o LS-15395_S48_E1-50 LS-15395_S48_E1-50_R1.fastq.gz LS-15395_S48_E1-50_R2.fastq.gz 

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] running in paired-end mode
[quant] will process pair 1: LS-15395_S48_E1-50_R1.fastq.gz
                             LS-15395_S48_E1-50_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 2,604,792 reads, 2,289,341 reads pseudoaligned
[quant] estimated average fragment length: 182.784
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 996 rounds

$ cat LS-15395_S48_E1-50/abundance.tsv | tail -n+2 | datamash sum 4
2289340.9621364
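
For readers without datamash, the same total can be computed in Python. This is a minimal sketch, assuming the standard abundance.tsv header (target_id, length, eff_length, est_counts, tpm), where est_counts is the fourth column summed above. The function name `sum_est_counts` is made up for illustration.

```python
import csv


def sum_est_counts(lines):
    """Sum the est_counts column of a kallisto abundance.tsv.

    `lines` is any iterable of text lines (for example an open file).
    Assumes a tab-separated file whose header contains `est_counts`.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return sum(float(row["est_counts"]) for row in reader)


# Usage (path taken from the command above):
# with open("LS-15395_S48_E1-50/abundance.tsv", newline="") as fh:
#     print(sum_est_counts(fh))
```

Reading the column by name rather than by position keeps the sketch independent of column order.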

Now, perform a similar quantification with kallisto pseudo:

$ ls LS-15395_S48_E1-50_R1.fastq.gz | sed "s/_/\t/3" | awk '{print $1 "\t" $1 "_" $2 "\t" $1 "_R2.fastq.gz"}' > batch.tsv
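
The one-liner above writes a single tab-separated line to batch.tsv: sample ID, R1 file, R2 file. A rough Python equivalent is sketched below, assuming every read-1 file name ends in `_R1.fastq.gz` and its mate in `_R2.fastq.gz`. The helper names `batch_line` and `write_batch` are hypothetical. Unlike the sed expression, it strips the suffix instead of counting underscores.

```python
from pathlib import Path

# Assumed naming convention, taken from the file names in this issue.
R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def batch_line(r1_name: str) -> str:
    """Return one batch line: sample ID, R1 file, R2 file (tab-separated)."""
    if not r1_name.endswith(R1_SUFFIX):
        raise ValueError(f"expected a name ending in {R1_SUFFIX}: {r1_name}")
    sample_id = r1_name[: -len(R1_SUFFIX)]
    return "\t".join([sample_id, r1_name, sample_id + R2_SUFFIX])


def write_batch(r1_names, out_path="batch.tsv"):
    """Write one batch line per read-1 file name to out_path."""
    text = "".join(batch_line(name) + "\n" for name in r1_names)
    Path(out_path).write_text(text)


# Usage:
# write_batch(["LS-15395_S48_E1-50_R1.fastq.gz"])
```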

$ kallisto pseudo -i mus_musculus/transcriptome.idx -o LS-15395_S48_E1-50_batch --quant --batch=batch.tsv  

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] running in paired-end mode
[quant] will process pair 1: LS-15395_S48_E1-50_R1.fastq.gz
                             LS-15395_S48_E1-50_R2.fastq.gz
[quant] finding pseudoalignments for all files ... done
[quant] processed 2,604,792 reads, 2,289,341 reads pseudoaligned
[quant] Running EM algorithm for each cell .. done

We can load the total counts in Python:

>>> import numpy as np

>>> from scipy.io import mmread

>>> np.sum(mmread('LS-15395_S48_E1-50_batch/matrix.abundance.mtx'))
2196198.893168244

The total count produced by kallisto quant matches the number of pseudoaligned reads, but the total produced by kallisto pseudo does not.
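
To put a number on the gap, here is a small sketch that compares each total against the 2,289,341 pseudoaligned reads reported in both logs. The values are copied from the output above. The function name `shortfall` is made up for illustration.

```python
# Values copied from the logs and outputs in this issue.
PSEUDOALIGNED_READS = 2_289_341
QUANT_TOTAL = 2289340.9621364      # sum of est_counts from kallisto quant
PSEUDO_TOTAL = 2196198.893168244   # sum of matrix.abundance.mtx from kallisto pseudo


def shortfall(total, reads=PSEUDOALIGNED_READS):
    """Return (missing counts, missing fraction) relative to pseudoaligned reads."""
    missing = reads - total
    return missing, missing / reads


for name, total in [("quant", QUANT_TOTAL), ("pseudo", PSEUDO_TOTAL)]:
    missing, fraction = shortfall(total)
    print(f"{name}: missing {missing:.3f} counts ({fraction:.2%})")
```

With these numbers, kallisto quant is within a fraction of a count of the read total, while kallisto pseudo falls short by roughly 93,000 counts, about 4% of the pseudoaligned reads.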


Yenaled · 2023-01-23T20:05:56Z

This issue is old, but it was fixed in version 0.48.0, so I'm closing it. Also, pseudo is now deprecated in favor of "kallisto bus" (which can now do the exact same preprocessing as pseudo).

Yenaled closed this as completed Jan 23, 2023
