Pseudobam output (very) occasionally prints two alignments in one line #70

Open
vals opened this issue Aug 18, 2015 · 1 comment

vals commented Aug 18, 2015

I've been looking into using the pseudoalignment output for some downstream analysis. Occasionally two alignment records end up on the same line, which causes samtools to fail when parsing the SAM file.

For one file I've been running Kallisto on, this has happened all three times I've run it.

The first time I ran Kallisto, the line in the sam file that caused the problem was

SRR1161560.21043094:CELL_TGATGCGC:UMI_GGGG  0   ENSMUST00000183090  308 255SRR1161560.21199582:CELL_GTCTTATC:UMI_TGAA   0   ENSMUST00000119438  473 255 51M *   0   0   AGTTCCCTGGCCGCCAGAAGATCCACATCTCAAAGAAGTGGGGCTTCACCA @??BDDDDFDHFHFHGEIIII@G?DGFFBGGIEAGI@@DCGIIIIIIGIGH NH:i:8

Another read is inserted right after the mapping quality field (255) of the SAM record. I didn't see anything special about these reads in the input FASTQ file, so I tried just running Kallisto again. This time a similar thing happened, but with a different read!

SRR1161559.26399301:CELL_CATCAGAA:UMI_AAAA  16  ENSMUST00000115104  22954   255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [email protected]:CELL_TGATGGAA:UMI_AAAA 16  ENSMUST00000187290  345 255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA BDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDFJJJHHHHHFFFFFCCC NH:i:129

This time a second read is inserted partway into the base quality string.

And just for good measure I tried a third time, and again this phenomenon happened.

SRR1161560.26326741:CELL_TGATTGAA:UMI_TGCG  0   ENSMUST00000082409  656 255 51M *   0   0   TCAGAGTTCTACTAAAATTTCACTTCACATCAAAACATCACTTCGGATTTG @@@DSRR1161560.26345311:CELL_CATCAGAA:UMI_AAAA  16  ENSMUST00000193553  242 255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDFHJJJJHHGHHFFFFF@C@ NH:i:12
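
For anyone who wants to locate such lines before handing the file to samtools, here is a minimal sketch of a checker. It assumes plain tab-separated SAM text as described in the SAM specification (11 mandatory fields, then optional TAG:TYPE:VALUE fields). The function names `check_sam_line` and `find_malformed_sam_lines` are made up for this example and are not part of Kallisto or samtools. It only flags suspicious lines and does not try to repair them.

```python
import re

# Optional SAM fields look like TAG:TYPE:VALUE, e.g. "NH:i:8".
TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:.*$")


def check_sam_line(line):
    """Return a reason string if the alignment line looks malformed, else None."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return "fewer than 11 mandatory fields"
    # FLAG, POS and MAPQ must be plain non-negative integers.
    for name, idx in (("FLAG", 1), ("POS", 3), ("MAPQ", 4)):
        if not fields[idx].isdigit():
            return f"{name} is not an integer: {fields[idx]!r}"
    seq, qual = fields[9], fields[10]
    # SEQ and QUAL must have equal length unless one of them is '*'.
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        return "SEQ and QUAL lengths differ"
    # Everything after the 11th field must be a well-formed optional tag.
    for tag in fields[11:]:
        if not TAG_RE.match(tag):
            return f"unexpected extra field: {tag!r}"
    return None


def find_malformed_sam_lines(lines):
    """Return (line_number, reason) for every suspicious alignment line."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        # Skip header lines (they start with '@') and empty lines.
        if line.startswith("@") or not line.strip():
            continue
        reason = check_sam_line(line)
        if reason is not None:
            problems.append((lineno, reason))
    return problems
```

Since an open file object yields lines, it can be passed directly, e.g. `find_malformed_sam_lines(open("out.sam"))` with a hypothetical `out.sam`. The first merged line above would be reported because its MAPQ field starts with 255 and runs into the next read name, and the third because the QUAL field no longer matches the length of SEQ.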

I had a very quick look at the code, but couldn't see where anything particularly fishy would happen. It has the characteristics of a race condition, but since Kallisto is single-threaded, that's probably not it.

The input is a single-end FASTQ file with 63,982,233 reads.

What could cause this?

@pmelsted

Thanks for reporting this. It definitely looks like a race condition, although it is a bit weird. I'll see if I can reproduce it.
