uninformative error when using -s 0 #96

mschilli87 · 2016-01-04T10:13:45Z

After updating from a version without the --sd/-s option to one with it, I first tried to reproduce some old data using the same -l value as before and -s 0. I thought that, in theory, this should correspond to the fixed-length behaviour applied before.
Obviously this is not supported, and I also don't really care because the actual estimate of the SD will never be zero. However, the error message I got was quite confusing:

Error: cannot supply mean/sd without supplying both -l and -s
Error: fragment length mean and sd must be supplied for single-end reads using -l and -s

Given that my call contained -l 300 -s 0, it was hard to understand why it was failing.
I had to inspect the code to find out that 0.0 is used as the initial value, which is then tested against to check whether the parameter was set.
If there really is no way to support --sd=0 (initializing to a negative value?), the error/help messages could be adjusted to tell the user that -s (and -l) have to be greater than zero.
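
A minimal sketch of the sentinel problem described above, written in Python rather than taken from kallisto's code. The function names are hypothetical. The first two messages are copied from this report; the third is only a suggested wording. Using None as the "not set" marker keeps "not supplied" distinct from "supplied as 0", so the message can say what is actually wrong:

```python
import argparse

def parse_fragment_args(argv):
    """Parse -l/-s so that 'not supplied' (None) differs from 'supplied as 0'."""
    parser = argparse.ArgumentParser(prog="quant-sketch")  # hypothetical program name
    parser.add_argument("-l", "--fragment-length", type=float, default=None)
    parser.add_argument("-s", "--sd", type=float, default=None)
    args = parser.parse_args(argv)
    return args.fragment_length, args.sd

def check_fragment_args(mean, sd):
    """Return an error message for single-end input, or None if the pair is usable."""
    if (mean is None) != (sd is None):
        # Exactly one of the two options was given.
        return "Error: cannot supply mean/sd without supplying both -l and -s"
    if mean is None:
        # Neither option was given.
        return ("Error: fragment length mean and sd must be supplied "
                "for single-end reads using -l and -s")
    if mean <= 0 or sd <= 0:
        # Both were given, but at least one value is out of range (suggested wording).
        return "Error: -l and -s must both be greater than zero (got -l {} -s {})".format(mean, sd)
    return None

print(check_fragment_args(*parse_fragment_args(["-l", "300", "-s", "0"])))
```

With this check, a call containing -l 300 -s 0 is reported as an out-of-range value instead of a missing option.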

pimentel · 2016-01-04T19:22:00Z

Thanks! Good point. I'll clean this up later today.

Thyra · 2021-03-02T11:18:04Z

I tried to map a dataset today that actually has an SD of 0 (all reads are 94bp long, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7609031) and encountered the same error with kallisto version 0.46.0. I am a little confused though: Why does kallisto calculate these values itself for paired-end read sequences but not for single-end ones? Is there some biological or methodological caveat that I'm overlooking?

mschilli87 · 2021-03-05T18:09:42Z

@Thyra: As far as I understand (didn't look at the paper or code in ages), this boils down to the way length normalization is done in kallisto: For PE data each fragment contains information about its actual length (by the distance between the mates). So I assume the effective count for each fragment can be derived from that and then summed up per feature.
For SE reads, this information simply is not available. The legacy solution was assuming a fixed length for all features but kallisto models it by a more realistic distribution (a truncated Gaussian IIRC). Thus, population statistics are required to parameterize that distribution. In that case, the mean and sd.
For PE data, those can of course also be calculated, but should not be required to perform the actual quantification.
I think kallisto does in fact report the mean fragment length for PE runs but I don't remember the sd being reported as well.
If you'd like to get that information for downstream analyses I guess a patch would be quite straightforward. But I'd hate the default log format to change between versions as I have some code actually parsing those (I know: my fault).
Maybe there is more information in the HDF5 output, I didn't check. But AFAIK this is soon to be retired and replaced by another format completely.
I hope my random thoughts help you out, otherwise sorry for the noise. 😉

edit: Also, on second read, please double check you don't confuse read length and fragment length. While all your reads may be exactly 94 bp long, the fragments they have been derived from could very well have been (and likely were) longer than that and varying in size.
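
To make the length-normalization point above concrete, here is a rough Python sketch of an effective length under a discretized Gaussian truncated to the transcript length. It illustrates the general idea only and is not claimed to match kallisto's exact computation; both function names are made up for this example.

```python
import math

def truncated_gaussian_weights(mean, sd, max_len):
    """Gaussian weights over fragment lengths 1..max_len, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((f - mean) / sd) ** 2) for f in range(1, max_len + 1)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no probability mass at or below max_len")
    return [w / total for w in raw]

def effective_length(transcript_len, mean, sd):
    """Expected number of positions where a fragment of random length can start."""
    weights = truncated_gaussian_weights(mean, sd, transcript_len)
    # A fragment of length f fits at (transcript_len - f + 1) start positions.
    return sum(w * (transcript_len - f + 1) for f, w in enumerate(weights, start=1))

# A realistic spread versus a near-zero one for a 1000 bp transcript.
print(effective_length(1000, 300, 20))
print(effective_length(1000, 300, 0.01))
```

In this sketch, a very small sd puts all weight on a single fragment length, which reproduces the fixed-length behaviour (transcript length minus fragment length plus one), while sd = 0 would divide by zero. That is one plausible reason a model of this kind needs a strictly positive sd.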

Thyra · 2021-03-07T13:27:57Z

@mschilli87 Oh, I was totally unaware of the difference between fragment length and read length, thanks for pointing that out! (sorry, I'm a complete noob when it comes to transcriptomics). Do you have a suggestion on how to choose mean and SD fragment lengths for single-end SRA data then? From what I've understood there isn't really a way to calculate/estimate these parameters from single-end reads unless you have access to the raw data and not everybody might publish these values in their manuscripts either (at least not the SD)?

mschilli87 · 2021-03-16T11:11:51Z

@Thyra: Sorry for the late reply. I usually have access to Bioanalyzer profiles. You could contact the authors, they might have more data than available online. Or you take an educated guess and hope for the best. I found that most of my final conclusions from DGE analyses do not depend too much on those parameters: Even if I change them quite a bit from what I believe to be 'the best' guess, most genes typically are unaffected. Just be aware that especially for shorter transcripts or so you might have some bias. Not much you can do AFAICT.
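
The point about shorter transcripts can be checked with a small Python sketch. It assumes the effective length is the transcript length minus the mean of a Gaussian truncated to the transcript, plus one, which is a simplification and not necessarily what kallisto does. The guesses (150 vs 250 with sd 20) and the transcript lengths are arbitrary illustration values, and the function names are hypothetical.

```python
import math

def effective_length(transcript_len, mean, sd):
    """Transcript length minus the truncated-Gaussian mean fragment length, plus 1."""
    lengths = range(1, transcript_len + 1)
    weights = [math.exp(-0.5 * ((f - mean) / sd) ** 2) for f in lengths]
    truncated_mean = sum(f * w for f, w in zip(lengths, weights)) / sum(weights)
    return transcript_len - truncated_mean + 1

def fold_change(transcript_len, guess_a, guess_b, sd):
    """Ratio of effective lengths under two guesses for the mean fragment length."""
    return (effective_length(transcript_len, guess_a, sd)
            / effective_length(transcript_len, guess_b, sd))

short = fold_change(400, 150, 250, 20)    # short transcript
long_ = fold_change(4000, 150, 250, 20)   # long transcript
print(short, long_)
```

Under these assumptions the same change in the guess shifts the effective length of the 400 bp transcript far more than that of the 4000 bp one, so length-normalized estimates for short transcripts are the ones most exposed to a wrong guess.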

Thyra · 2021-03-17T15:53:00Z

@mschilli87 OK, that sounds like a reasonable strategy. THANK YOU! :-)
