does kallisto require unsorted reads? #396

Open
anoronh4 opened this issue Sep 5, 2023 · 6 comments

Comments

anoronh4 commented Sep 5, 2023

We have a situation in which many of our samples have UMIs, and we'd like to deduplicate before running kallisto. The question is: does kallisto require unsorted reads? Our STAR-aligned BAMs are position-sorted, so when we convert a BAM to FASTQ the reads will also be sorted. Can these sorted reads serve as input as-is for kallisto?

Collaborator

Yenaled commented Sep 6, 2023

kallisto doesn't require unsorted reads unless you're doing paired-end (R1+R2) alignment to get TPMs (in which case the first few thousand reads are used to estimate fragment length). In most cases, you can use your reads as-is.

But why not just deduplicate UMIs using kallisto bustools?

Author

anoronh4 commented Sep 6, 2023

We have UMIs in bulk RNA-seq, not scRNA-seq. I guess we went for vanilla kallisto first because bustools seemed to be specifically for scRNA-seq with cell and molecular barcodes, but I think my colleague found your post on running it on bulk RNA with UMIs: https://www.biostars.org/p/9554392/#9559079 . Ours is a little different because we don't have an R3 FASTQ (just R1 and R2), and we have been using UMI-tools to put the UMI in the header, which obviously won't work for kallisto-D and bustools. Do you have any guidance on how to prepare our bulk reads for kallisto-D/bustools? (Getting the R3 read at the demultiplexing step is currently not an option.)

As for our current application, it probably does matter that the reads are position-sorted, because we are running kallisto quant on paired-end data. If we stay in paired-end mode, I guess we can use -l and -s to bypass this method of estimation.

Collaborator

Yenaled commented Sep 6, 2023

A couple of things:

  1. In the current stable version (0.50.0; I have already merged kallisto-D into master), you can use bulk + UMIs with kallisto bus. You just have to supply the correct technology string to kallisto bus --paired -x, e.g. -1,0,0:0,0,10:0,10,0,1,0,0. Here, -1,0,0 means ignore barcodes (every read gets the same arbitrary barcode); 0,0,10 means extract the first 10 bp of R1 as the UMI; and 0,10,0,1,0,0 means extract position 10 onward in R1 as the first read in the pair (0,10,0) and everything in R2 as the second read in the pair (1,0,0).

  2. You can extract UMIs from the read headers directly (if the read headers are formatted as @readname RX:Z:TACGAGATCA) by formatting the technology string as -1,0,0:RX:0,10,0,1,0,0 (i.e. put RX into the UMI field of the technology string).

Then you can use quant-tcc (as I did in my biostars post) to get the abundance estimates (e.g. TPMs) like you would do in standard bulk RNA-seq.

Both of these features are currently undocumented; I'm still writing the documentation for them.
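To make the technology string easier to read, here is a minimal Python sketch of how the three colon-separated fields (barcode, UMI, reads) break down into file,start,stop triplets, as described above. This is only an illustration, not kallisto's own parser: the names `Segment`, `parse_triplets`, `parse_technology` and `cut` are hypothetical, and the example reads are made up. It assumes a stop of 0 means "to the end of the read" and a file index of -1 means "not present".

```python
from typing import Dict, List, NamedTuple, Union


class Segment(NamedTuple):
    file: int   # FASTQ file index (0 = R1, 1 = R2); -1 = field not present
    start: int  # 0-based start position within the read
    stop: int   # exclusive end position; 0 is taken to mean "end of read"


def parse_triplets(field: str) -> List[Segment]:
    """Split 'f,s,e,f,s,e,...' into a list of (file, start, stop) triplets."""
    numbers = [int(value) for value in field.split(",")]
    if len(numbers) % 3 != 0:
        raise ValueError(f"expected file,start,stop triplets, got {field!r}")
    return [Segment(*numbers[i:i + 3]) for i in range(0, len(numbers), 3)]


def parse_technology(tech: str) -> Dict[str, Union[str, List[Segment]]]:
    """Parse 'barcode:umi:reads'; the UMI field may be a header tag such as RX."""
    barcode, umi, reads = tech.split(":")
    return {
        "barcode": parse_triplets(barcode),
        "umi": umi if umi.isalpha() else parse_triplets(umi),
        "reads": parse_triplets(reads),
    }


def cut(fastq_reads: List[str], segment: Segment) -> str:
    """Return the part of the read that a triplet selects (illustrative only)."""
    if segment.file < 0:
        return ""
    return fastq_reads[segment.file][segment.start:(segment.stop or None)]


# Made-up read pair: 10 bp UMI at the start of R1, then the biological read.
r1 = "ACGTACGTACGGGGCCCCTTTT"
r2 = "TTTTAAAACCCC"

spec = parse_technology("-1,0,0:0,0,10:0,10,0,1,0,0")
umi = "".join(cut([r1, r2], s) for s in spec["umi"])
pair = [cut([r1, r2], s) for s in spec["reads"]]
print(umi, pair)
```

With the header-based string `-1,0,0:RX:0,10,0,1,0,0`, the same sketch returns the tag name `RX` in the UMI slot instead of a list of triplets.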

Author

anoronh4 commented Sep 7, 2023

Oh OK, so it looks like we don't need an R3 read file in that case, if I'm reading that right, which is great! I think we would prefer the first method for now, since UMI-tools doesn't really follow that header convention. For example:

@IID:212:FID:2:1101:1072:1000_TGCTTA 1:N:0:TAACCGGT+ATCGTCTC

where TGCTTA is the UMI. So if we went with the first option, is there also support for dual UMIs, for example 3 bp in R1 and 3 bp in R2? In the example above, TGC comes from the beginning of R1 and TTA comes from the beginning of R2.
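For readers who would rather take the header route, the following Python sketch shows one way a UMI-tools style header like the one above could be rewritten into the `@readname RX:Z:<UMI>` form mentioned earlier in the thread. It is an illustration under assumptions, not a tested recipe for kallisto: the function name `to_rx_header` is hypothetical, the header is assumed to look like `@<name>_<UMI> <comment>`, and the original Illumina comment is dropped because the thread does not say whether it may follow the RX tag.

```python
import re

# Assumed header layout: "@<name>_<UMI>" optionally followed by a comment.
HEADER = re.compile(
    r"^@(?P<name>\S+)_(?P<umi>[ACGTN]+)(?:\s+(?P<comment>.*))?$"
)


def to_rx_header(header: str) -> str:
    """Move the trailing _UMI from the read name into an RX:Z: tag."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"no UMI suffix found in header: {header!r}")
    # The original comment is dropped here (an assumption, see the text above).
    return f"@{match.group('name')} RX:Z:{match.group('umi')}"


original = "@IID:212:FID:2:1101:1072:1000_TGCTTA 1:N:0:TAACCGGT+ATCGTCTC"
converted = to_rx_header(original)
print(converted)
```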

Collaborator

Yenaled commented Sep 7, 2023

Yes, you would use this:

-1,0,0:0,0,3,1,0,3:0,3,0,1,3,0

This means you take the first 3 bp of both R1 and R2 as the UMI, and align your biological reads from position 3 onward in R1 and R2.
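As a rough illustration of what that string selects, here is a small Python sketch that applies the UMI triplets (0,0,3 and 1,0,3) and the read triplets (0,3,0 and 1,3,0) to a made-up read pair. The helper `cut` and the example sequences are hypothetical. The sketch also assumes the two UMI pieces are joined in the order they are listed (R1 first, then R2), which the thread does not state explicitly.

```python
# Triplets are (file index, start, stop); stop 0 is taken to mean "end of read".
UMI_SEGMENTS = [(0, 0, 3), (1, 0, 3)]    # first 3 bp of R1 and of R2
READ_SEGMENTS = [(0, 3, 0), (1, 3, 0)]   # position 3 onward in R1 and R2


def cut(reads, segment):
    """Return the slice of the read that one triplet selects."""
    file_index, start, stop = segment
    return reads[file_index][start:(stop or None)]


# Made-up pair: R1 starts with TGC and R2 starts with TTA, as in the header example.
reads = ["TGCAAAACCCCGGGG", "TTAGGGGTTTTCCCC"]

# Assumption: UMI pieces are concatenated in the order listed in the string.
umi = "".join(cut(reads, segment) for segment in UMI_SEGMENTS)
biological = [cut(reads, segment) for segment in READ_SEGMENTS]
print(umi, biological)
```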

Author

anoronh4 commented Sep 8, 2023

That's super helpful, thanks so much! I'm impressed at how flexible kallisto and its related tools are. It's going to save us a lot of storage and processing time, I think.
