Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

ascidwu · 2024-06-14T19:44:11Z

Sorry if this isn't the right place to ask this, but in the Event-level Filters section of the Arriba readthedocs, it states for multimappers that "Multi-mapping reads are reduced to a single alignment. For this purpose, only the alignment with the best alignment score is retained."
However, in read_chimeric_alignments.cpp, the comment above remove_malformed_alignments states "remove alignments when supplementary flags are missing or when there are too many/few alignment records".

My question is: how many is "too many alignment records" for a SAM record to be considered malformed, and thus not considered, vs getting passed to the multimapper filter?
Thank you for the help!

suhrig · 2024-06-14T20:26:16Z

Removing too many/few alignments in the context of malformed alignments means getting rid of alignments which Arriba cannot match properly. For example, in paired-end sequencing there should always be a first and a second mate. When Arriba finds two first mates, but only one second mate, the alignments are considered malformed. Another example is a split read with a missing supplementary alignment or two supplementary alignments. There should be exactly one.

This has nothing to do with multimapping reads (i.e., secondary alignments). Multimapping reads pass the removal of malformed alignments provided that Arriba can always match the paired-end mates and supplementary alignments.

Is this explanation clear?

ascidwu · 2024-06-14T22:07:51Z

Perfectly clear, thank you!

…

On Fri, Jun 14, 2024, 4:26 PM suhrig ***@***.***> wrote: Removing too many/few alignments in the context of malformed alignments means getting rid of alignments which Arriba cannot match properly. For example, in paired-end sequencing there should always be a first and a second mate. When Arriba finds two first mates, but only one second mate, the alignments are considered malformed. Another example is a split read with a missing supplementary alignment or two supplementary alignments. There should be exactly one. This has nothing to do with multimapping reads (i.e., secondary alignments). Multimapping reads pass the removal of malformed alignments provided that Arriba can always match the paired-end mates and supplementary alignments. Is this explanation clear? — Reply to this email directly, view it on GitHub <#243 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AXWGOOTKA4GLINI2R35BTK3ZHNGX5AVCNFSM6AAAAABJK5EKKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRYG4YTMMRYHE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

-- *The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material. Thank you.*

ascidwu · 2024-06-15T02:45:43Z

Actually I did have one further question regarding the malformed alignments; in the samples I'm currently running, I believe Arriba is reporting an abnormal amount of malformed SAM entries; around 38M malformed entries per 100M reads (~85% of these reads are properly mapped with STAR). Firstly, is this an expected number of malformed entries, and second, is there some way to determine where the malformed alignments are coming from?

I'm using a custom reference that has a few additional contigs with heavily duplicated sequence, could reads mapping to these contigs be throwing Arriba off?

suhrig · 2024-06-15T09:30:17Z

If you're using STAR, there should be only a small percentage of malformed alignments. Which version do you use? What are your alignment parameters?

I can prepare a debug version of Arriba to get some stats about why the reads are discarded as malformed.

ascidwu · 2024-06-15T13:07:33Z

I'm running STAR with the same parameters as in the demo workflow, but outputting to a separate BAM. I can provide the STAR/Arriba log files if that would be helpful.
STAR \ --runThreadN 16 \ --genomeDir /path/to/dir --genomeLoad NoSharedMemory \ --readFilesIn read1 read2 --readFilesCommand zcat \ --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 1 \ --outFilterMultimapNmax 50 --peOverlapNbasesMin 10 --alignSplicedMateMapLminOverLmate 0.5 --alignSJstitchMismatchNmax 5 -1 5 5 \ --chimSegmentMin 10 --chimOutType WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 \ --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 50 \ --outFileNamePrefix outprefix

suhrig · 2024-06-15T14:41:03Z

If you're using the parameters from the demo, then that's not the problem. Which version of STAR do you use?

ascidwu · 2024-06-15T18:48:45Z

I'm on STAR version: 2.7.11b

suhrig · 2024-06-16T09:55:26Z

If you're using STAR, there should be only a small percentage of malformed alignments.

I take that back. I just ran some tests and realized that I had forgotten about a particular situation where STAR likes to make split read alignments that make no sense. Namely, when the insert size is smaller than the read length, STAR attempts to align the overhang, which is actually adapter sequence. If it finds a place in the genome where the adapter sequence aligns, then it will make a supplementary alignment. Of course, this alignment has no biological foundation, which is why Arriba discards them as malformed. It is malformed, because the end of the read which points to the adapter should never be aligned as a supplementary. Here is an example:

This happens more often with libraries where the insert size is smaller than the read length. So if your library has a small insert size, it's not unexpected to see a large number of reads affected by this. Check out the output from Arriba where it says Estimating fragment length. If the mate gap mean is small (close to zero or negative) or if the mate gap stddev is high (close to or higher than the mean), then there will be a lot of paired-end mates where the read length is bigger than the fragment size, which means the reads contain part of the adapter sequence.

ascidwu · 2024-06-17T02:47:43Z

I see, thank you. Indeed, Arriba reports a negative mate gap mean, and a high mate gap stddev. Would adapter trimming help in this case?

suhrig · 2024-06-17T06:23:19Z

Yes, probably. It should certainly make the warning about malformed alignments go away and could improve the detection rate slightly.

ascidwu · 2024-06-24T13:40:20Z

Adapter trimming with skewer successfully reduced the number of malformed alignments reported by Arriba to <0.01% of the aligned reads. Thank you for your help!

ascidwu closed this as completed Jun 15, 2024

ascidwu reopened this Jun 15, 2024

ascidwu closed this as completed Jun 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

ascidwu commented Jun 14, 2024 •

edited

Loading

suhrig commented Jun 14, 2024

ascidwu commented Jun 14, 2024 via email

ascidwu commented Jun 15, 2024 •

edited

Loading

suhrig commented Jun 15, 2024

ascidwu commented Jun 15, 2024

suhrig commented Jun 15, 2024

ascidwu commented Jun 15, 2024

suhrig commented Jun 16, 2024

ascidwu commented Jun 17, 2024

suhrig commented Jun 17, 2024

ascidwu commented Jun 24, 2024

Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

Comments

ascidwu commented Jun 14, 2024 • edited Loading

suhrig commented Jun 14, 2024

ascidwu commented Jun 14, 2024 via email

ascidwu commented Jun 15, 2024 • edited Loading

suhrig commented Jun 15, 2024

ascidwu commented Jun 15, 2024

suhrig commented Jun 15, 2024

ascidwu commented Jun 15, 2024

suhrig commented Jun 16, 2024

ascidwu commented Jun 17, 2024

suhrig commented Jun 17, 2024

ascidwu commented Jun 24, 2024

ascidwu commented Jun 14, 2024 •

edited

Loading

ascidwu commented Jun 15, 2024 •

edited

Loading