uninformative error when using -s 0 #96

Open
mschilli87 opened this issue Jan 4, 2016 · 6 comments

Comments

@mschilli87

After updating from a version without the --sd/-s option to one with that parameter, I first tried to reproduce some old data using the same -l value as before and -s 0. I thought this should, in theory, correspond to the fixed-length behaviour applied before.
Obviously this is not supported, and I don't really care because the actual estimate of the SD will never be zero. However, the error message I got was quite confusing:

Error: cannot supply mean/sd without supplying both -l and -s
Error: fragment length mean and sd must be supplied for single-end reads using -l and -s

Given that my call contained -l 300 -s 0, it was hard to understand why it was failing.
I had to inspect the code to find out that 0.0 is used as the initial value and is then tested against to check whether the parameter was set.
If there really is no way to support --sd=0 (initializing to a negative value?), the error/help messages could be adjusted to tell the user that -s (and -l) have to be greater than zero.
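
To illustrate the distinction asked for here, a minimal Python sketch follows. It is not kallisto's actual C++ code: the function name `check_fragment_params` and the message wording are hypothetical. The only idea it shows is using `None` instead of `0.0` as the "not supplied" marker, so that an explicit `-s 0` can be rejected with a message that names the real problem.

```python
from typing import Optional


def check_fragment_params(mean: Optional[float], sd: Optional[float]) -> None:
    """Validate hypothetical -l/-s values; None means the flag was not given."""
    # Case 1: exactly one of the two flags is missing.
    if (mean is None) != (sd is None):
        raise ValueError("-l and -s must be supplied together")
    # Case 2: both flags are missing (single-end mode needs them).
    if mean is None:
        raise ValueError("single-end reads require both -l and -s")
    # Case 3: both flags were given, so complain about the values themselves.
    if mean <= 0:
        raise ValueError(f"-l must be greater than zero (got {mean})")
    if sd <= 0:
        raise ValueError(f"-s must be greater than zero (got {sd})")
```

With this layout, `check_fragment_params(300.0, 0.0)` fails with a message about `-s` needing to be greater than zero rather than claiming the flag was never supplied.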

@pimentel
Contributor

pimentel commented Jan 4, 2016

Thanks! Good point. I'll clean this up later today.

@Thyra

Thyra commented Mar 2, 2021

I tried to map a dataset today that actually has an SD of 0 (all reads are 94bp long, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7609031) and encountered the same error with kallisto version 0.46.0. I am a little confused though: Why does kallisto calculate these values itself for paired-end read sequences but not for single-end ones? Is there some biological or methodological caveat that I'm overlooking?

@mschilli87
Author

mschilli87 commented Mar 5, 2021

@Thyra: As far as I understand (I haven't looked at the paper or code in ages), this boils down to the way length normalization is done in kallisto: For PE data, each fragment carries information about its actual length (via the distance between the mates). So I assume the effective count for each fragment can be derived from that and then summed up per feature.
For SE reads, this information simply is not available. The legacy solution was to assume a fixed length for all features, but kallisto models it with a more realistic distribution (a truncated Gaussian IIRC). Thus, population statistics are required to parameterize that distribution: in this case, the mean and SD.
For PE data, those can of course also be calculated, but should not be required to perform the actual quantification.
I think kallisto does in fact report the mean fragment length for PE runs but I don't remember the sd being reported as well.
If you'd like to get that information for downstream analyses I guess a patch would be quite straightforward. But I'd hate the default log format to change between versions as I have some code actually parsing those (I know: my fault).
Maybe there is more information in the HDF5 output, I didn't check. But AFAIK this is soon to be retired and replaced by another format completely.
I hope my random thoughts help you out, otherwise sorry for the noise. 😉

edit: Also, on second read, please double check you don't confuse read length and fragment length. While all your reads may be exactly 94 bp long, the fragments they have been derived from could very well have been (and likely were) longer than that and varying in size.
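
To make the truncated-Gaussian idea above concrete, here is a small Python sketch. It assumes a common definition of effective length (transcript length minus the mean fragment length, with the fragment-length distribution truncated at the transcript length, plus one). Whether this matches kallisto's implementation in every detail is an assumption, and both function names are made up for illustration.

```python
import math


def truncated_mean(mean: float, sd: float, max_len: int) -> float:
    """Mean of a discretized Gaussian restricted to fragment lengths 1..max_len."""
    if sd <= 0:
        # A zero SD would divide by zero below, hence the "greater than zero" rule.
        raise ValueError("sd must be greater than zero")
    lengths = range(1, max_len + 1)
    weights = [math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in lengths]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the allowed length nearest the mean.
        return float(min(max(round(mean), 1), max_len))
    return sum(x * w for x, w in zip(lengths, weights)) / total


def effective_length(transcript_len: int, mean: float, sd: float) -> float:
    """Assumed definition: transcript length - truncated mean fragment length + 1."""
    return transcript_len - truncated_mean(mean, sd, transcript_len) + 1
```

For a long transcript the truncation hardly matters, so `effective_length(5000, 200, 20)` is very close to 5000 - 200 + 1. For a transcript shorter than the mean fragment length, the truncated mean drops below the transcript length, which keeps the effective length positive.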

@Thyra

Thyra commented Mar 7, 2021

@mschilli87 Oh, I was totally unaware of the difference between fragment length and read length, thanks for pointing that out! (sorry, I'm a complete noob when it comes to transcriptomics). Do you have a suggestion on how to choose mean and SD fragment lengths for single-end SRA data then? From what I've understood there isn't really a way to calculate/estimate these parameters from single-end reads unless you have access to the raw data and not everybody might publish these values in their manuscripts either (at least not the SD)?

@mschilli87
Author

@Thyra: Sorry for the late reply. I usually have access to Bioanalyzer profiles. You could contact the authors; they might have more data than is available online. Or you take an educated guess and hope for the best. I found that most of my final conclusions from DGE analyses do not depend too much on those parameters: Even if I change them quite a bit from what I believe to be 'the best' guess, most genes typically are unaffected. Just be aware that you might get some bias, especially for shorter transcripts. Not much you can do AFAICT.
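
A rough back-of-the-envelope sketch of why shorter transcripts are more sensitive to the guess. It uses the simple approximation "effective length = transcript length - mean fragment length + 1" and ignores the SD and any truncation, so it is not kallisto's exact model; the function names and the example numbers are illustrative only.

```python
def approx_effective_length(transcript_len: int, mean_frag_len: float) -> float:
    """Rough approximation: transcript length - mean fragment length + 1, floored at 1."""
    return max(transcript_len - mean_frag_len + 1.0, 1.0)


def relative_change(transcript_len: int, guess_a: float, guess_b: float) -> float:
    """Relative change in approximate effective length when the mean guess moves from a to b."""
    a = approx_effective_length(transcript_len, guess_a)
    b = approx_effective_length(transcript_len, guess_b)
    return (b - a) / a


# Compare a short, a medium, and a long transcript for guesses of 200 vs 250 bp.
for length in (400, 1000, 5000):
    print(length, round(relative_change(length, 200.0, 250.0), 3))
```

Under this approximation, moving the guess from 200 to 250 bp shrinks the effective length by about 25% for a 400 bp transcript, about 6% for 1000 bp, and about 1% for 5000 bp.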

@Thyra

Thyra commented Mar 17, 2021

@mschilli87 OK, that sounds like a reasonable strategy. THANK YOU! :-)
