kallisto quant/pseudo counts differ #272

Closed
jnotwell opened this issue Jun 19, 2020 · 1 comment

Comments

@jnotwell

I'm trying the kallisto pseudo --quant --batch command that was used in the recent isoform-level analysis preprint.

I found that it produced different counts than the kallisto quant command - is there a reason for this difference? Steps to reproduce below:

Download the sequencing reads and kallisto index:

$ wget https://data.nemoarchive.org/biccn/lab/zeng/transcriptome/scell/SMARTer/raw/MOp/LS-15395_S48_E1-50.fastq.tar

$ tar -xvf LS-15395_S48_E1-50.fastq.tar

$ wget https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/ensembl-96/mus_musculus.tar.gz

$ tar -zxvf mus_musculus.tar.gz

Next, quantify with kallisto quant:

$ kallisto version
kallisto, version 0.46.2

$ kallisto quant -i mus_musculus/transcriptome.idx -o LS-15395_S48_E1-50 LS-15395_S48_E1-50_R1.fastq.gz LS-15395_S48_E1-50_R2.fastq.gz 

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] running in paired-end mode
[quant] will process pair 1: LS-15395_S48_E1-50_R1.fastq.gz
                             LS-15395_S48_E1-50_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 2,604,792 reads, 2,289,341 reads pseudoaligned
[quant] estimated average fragment length: 182.784
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 996 rounds

$ cat LS-15395_S48_E1-50/abundance.tsv | tail -n+2 | datamash sum 4
2289340.9621364
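
For anyone without datamash, the same total can be computed in Python. This is only a minimal sketch: `total_est_counts` is a hypothetical helper name, and it assumes the abundance.tsv header names the fourth column `est_counts`.

```python
import csv


def total_est_counts(lines):
    """Sum the est_counts column of a kallisto abundance.tsv.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes the header row names the fourth column `est_counts`.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return sum(float(row["est_counts"]) for row in reader)


# Usage (path taken from the kallisto quant command above):
# with open("LS-15395_S48_E1-50/abundance.tsv", newline="") as handle:
#     print(total_est_counts(handle))
```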

Now, perform a similar quantification with kallisto pseudo:

$ ls LS-15395_S48_E1-50_R1.fastq.gz | sed "s/_/\t/3" | awk '{print $1 "\t" $1 "_" $2 "\t" $1 "_R2.fastq.gz"}' > batch.tsv
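
The one-liner above writes a single tab-separated row: sample ID, R1 file, R2 file. A rough Python equivalent is sketched below. `batch_row` is a hypothetical helper, and it assumes the R1 file name ends in `_R1.fastq.gz` and that the R2 file differs only in that suffix.

```python
def batch_row(r1_name, r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Build one batch-file row: sample ID, R1 file, R2 file (tab-separated).

    Assumption: the sample ID is the R1 file name with `r1_suffix` removed.
    """
    if not r1_name.endswith(r1_suffix):
        raise ValueError(f"{r1_name!r} does not end with {r1_suffix!r}")
    sample_id = r1_name[: -len(r1_suffix)]
    return "\t".join([sample_id, r1_name, sample_id + r2_suffix])


# Usage: write the same batch.tsv as the shell pipeline above.
# with open("batch.tsv", "w") as handle:
#     handle.write(batch_row("LS-15395_S48_E1-50_R1.fastq.gz") + "\n")
```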

$ kallisto pseudo -i mus_musculus/transcriptome.idx -o LS-15395_S48_E1-50_batch --quant --batch=batch.tsv  

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] running in paired-end mode
[quant] will process pair 1: LS-15395_S48_E1-50_R1.fastq.gz
                             LS-15395_S48_E1-50_R2.fastq.gz
[quant] finding pseudoalignments for all files ... done
[quant] processed 2,604,792 reads, 2,289,341 reads pseudoaligned
[quant] Running EM algorithm for each cell .. done

We can load the total counts in python:

>>> import numpy as np

>>> from scipy.io import mmread

>>> np.sum(mmread('LS-15395_S48_E1-50_batch/matrix.abundance.mtx'))
2196198.893168244

The counts produced by kallisto quant match the number of pseudoaligned reads, but the counts produced by kallisto pseudo do not.
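
To put a number on the gap, here is a small sketch using only the totals reported above. `shortfall` is a hypothetical helper, not part of kallisto. With these inputs the kallisto pseudo total falls short of the pseudoaligned read count by roughly 93,000 counts, about 4%.

```python
# Totals copied from the output above.
pseudoaligned = 2289341            # reads pseudoaligned (same in both logs)
quant_total = 2289340.9621364      # sum of est_counts from kallisto quant
pseudo_total = 2196198.893168244   # sum of matrix.abundance.mtx from kallisto pseudo


def shortfall(total, expected):
    """Return (missing counts, missing fraction) relative to `expected`."""
    missing = expected - total
    return missing, missing / expected


for label, total in [("quant", quant_total), ("pseudo", pseudo_total)]:
    missing, fraction = shortfall(total, pseudoaligned)
    print(f"{label}: {missing:.1f} counts missing ({fraction:.2%})")
```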

@Yenaled
Collaborator

Yenaled commented Jan 23, 2023

This was a while ago, but it was fixed in version 0.48.0, so I'm closing this issue. Also, pseudo is now deprecated in favor of "kallisto bus" (which can now do the exact same preprocessing as pseudo).

Yenaled closed this as completed Jan 23, 2023