Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multimapper filter vs malformed SAM records - how many alignments is "too many"? #243

Closed
ascidwu opened this issue Jun 14, 2024 · 11 comments

Comments

@ascidwu
Copy link

ascidwu commented Jun 14, 2024

Sorry if this isn't the right place to ask this, but in the Event-level Filters section of the Arriba readthedocs, it states for multimappers that "Multi-mapping reads are reduced to a single alignment. For this purpose, only the alignment with the best alignment score is retained."
However, in read_chimeric_alignments.cpp, the comment above remove_malformed_alignments states "remove alignments when supplementary flags are missing or when there are too many/few alignment records".

My question is: how many is "too many alignment records" for a SAM record to be considered malformed, and thus not considered, vs getting passed to the multimapper filter?
Thank you for the help!

@suhrig
Copy link
Owner

suhrig commented Jun 14, 2024

Removing too many/few alignments in the context of malformed alignments means getting rid of alignments which Arriba cannot match properly. For example, in paired-end sequencing there should always be a first and a second mate. When Arriba finds two first mates, but only one second mate, the alignments are considered malformed. Another example is a split read with a missing supplementary alignment or two supplementary alignments. There should be exactly one.

This has nothing to do with multimapping reads (i.e., secondary alignments). Multimapping reads pass the removal of malformed alignments provided that Arriba can always match the paired-end mates and supplementary alignments.

Is this explanation clear?

@ascidwu
Copy link
Author

ascidwu commented Jun 14, 2024 via email

@ascidwu ascidwu closed this as completed Jun 15, 2024
@ascidwu ascidwu reopened this Jun 15, 2024
@ascidwu
Copy link
Author

ascidwu commented Jun 15, 2024

Actually I did have one further question regarding the malformed alignments; in the samples I'm currently running, I believe Arriba is reporting an abnormal amount of malformed SAM entries; around 38M malformed entries per 100M reads (~85% of these reads are properly mapped with STAR). Firstly, is this an expected number of malformed entries, and second, is there some way to determine where the malformed alignments are coming from?

I'm using a custom reference that has a few additional contigs with heavily duplicated sequence, could reads mapping to these contigs be throwing Arriba off?

@suhrig
Copy link
Owner

suhrig commented Jun 15, 2024

If you're using STAR, there should be only a small percentage of malformed alignments. Which version do you use? What are your alignment parameters?

I can prepare a debug version of Arriba to get some stats about why the reads are discarded as malformed.

@ascidwu
Copy link
Author

ascidwu commented Jun 15, 2024

I'm running STAR with the same parameters as in the demo workflow, but outputting to a separate BAM. I can provide the STAR/Arriba log files if that would be helpful.
STAR \ --runThreadN 16 \ --genomeDir /path/to/dir --genomeLoad NoSharedMemory \ --readFilesIn read1 read2 --readFilesCommand zcat \ --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 1 \ --outFilterMultimapNmax 50 --peOverlapNbasesMin 10 --alignSplicedMateMapLminOverLmate 0.5 --alignSJstitchMismatchNmax 5 -1 5 5 \ --chimSegmentMin 10 --chimOutType WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 \ --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 50 \ --outFileNamePrefix outprefix

@suhrig
Copy link
Owner

suhrig commented Jun 15, 2024

If you're using the parameters from the demo, then that's not the problem. Which version of STAR do you use?

@ascidwu
Copy link
Author

ascidwu commented Jun 15, 2024

I'm on STAR version: 2.7.11b

@suhrig
Copy link
Owner

suhrig commented Jun 16, 2024

If you're using STAR, there should be only a small percentage of malformed alignments.

I take that back. I just ran some tests and realized that I had forgotten about a particular situation where STAR likes to make split read alignments that make no sense. Namely, when the insert size is smaller than the read length, STAR attempts to align the overhang, which is actually adapter sequence. If it finds a place in the genome where the adapter sequence aligns, then it will make a supplementary alignment. Of course, this alignment has no biological foundation, which is why Arriba discards them as malformed. It is malformed, because the end of the read which points to the adapter should never be aligned as a supplementary. Here is an example:

chimeric_alignment_from_overhang

This happens more often with libraries where the insert size is smaller than the read length. So if your library has a small insert size, it's not unexpected to see a large number of reads affected by this. Check out the output from Arriba where it says Estimating fragment length. If the mate gap mean is small (close to zero or negative) or if the mate gap stddev is high (close to or higher than the mean), then there will be a lot of paired-end mates where the read length is bigger than the fragment size, which means the reads contain part of the adapter sequence.

@ascidwu
Copy link
Author

ascidwu commented Jun 17, 2024

I see, thank you. Indeed, Arriba reports a negative mate gap mean, and a high mate gap stddev. Would adapter trimming help in this case?

@suhrig
Copy link
Owner

suhrig commented Jun 17, 2024

Yes, probably. It should certainly make the warning about malformed alignments go away and could improve the detection rate slightly.

@ascidwu
Copy link
Author

ascidwu commented Jun 24, 2024

Adapter trimming with skewer successfully reduced the number of malformed alignments reported by Arriba to <0.01% of the aligned reads. Thank you for your help!

@ascidwu ascidwu closed this as completed Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants