Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MD tag generation incorrect? #37

Closed
semenko opened this issue Mar 31, 2023 · 4 comments
Closed

MD tag generation incorrect? #37

semenko opened this issue Mar 31, 2023 · 4 comments
Assignees

Comments

@semenko
Copy link

semenko commented Mar 31, 2023

I ran into an odd issue parsing biscuit alignments with pysam (pysam-developers/pysam#1180) — and I wonder if the MD tags generated by biscuit are correctly inserting a 0 after deleted bases followed by mismatches.

This is described a bit in this comment: pysam-developers/pysam#895 (comment)

Specifically I'm running into this issue looking at this tiny bam and read A00354:758:HVTT7DSX3:3:2333:29405:16219 with MD tag 18A5A5A1C3A13^C32^CCTAA10A1C1C32

@jamorrison
Copy link

According to this comment in the hts-spec issue seeking to clarify the MD tag, BWA (which BISCUIT is based on) should adhere to the spec for formatting MD tags (i.e., adding 0's after deletions that are followed by mismatches). I also looked at the reads with deletions in them in your BAM and the size of the deletions match up with the number of bases printed in the MD tag. So, I don't think the deletions are the cause of the error you're seeing.

That said, if the pysam issue discussion turns up the BISCUIT MD tag is malformed, let me know and I'll figure out a fix.

@jmarshall
Copy link

Please see the second half of pysam-developers/pysam#1180 (comment).

It would appear that the MD values emitted by biscuit, while they are syntactically correct w.r.t. the ‘0’ separators, are not quite correct. (Or perhaps that there are ambiguity characters in the reference at these positions, which biscuit is handling in some way?)

@jamorrison
Copy link

jamorrison commented Apr 6, 2023

Comparing the actual reference to the reference inferred by pysam in pysam-developers/pysam#1180 (comment), it looks like the difference is due to the bisulfite conversion.

   ||| (and more down the string)
   vvv
TAACCCTAACCCTACCCTAACCCTAACCCTAAC (reference)
TAATTTTAATTTTATTTTaATTTTaATTTTaAc (inferred from MD)

And in looking through the code, it appears that BISCUIT does not properly account for the C>T / G>A conversion for the MD tag.

I'll work on getting a fix put together for this.

EDIT: I previously mentioned that it seemed like the NM tag was "handled properly." This is partially correct, as described below.

jamorrison added a commit that referenced this issue Apr 6, 2023
This brings the BISCUIT MD tag into alignment with the MD hts-spec.
Note, the NM tag only shows the number of non-cytosine-conversion
mismatches. To recreate the NM tag produced by samtools calmd, add the
NM tag and ZC tag together.
@jamorrison
Copy link

I made a commit that now brings the BISCUIT MD tag into alignment with the hts-spec (i.e., the MD tag is the same in BISCUIT as samtools calmd). This is available on the master branch now and will be in the next release of BISCUIT. Note, that for GC-rich regions, the MD tags can be quite long (potentially up to twice as long as SEQ).

As for the NM tag, the decision was made to leave this as the number of non-cytosine-conversion mismatches, which falls under the gray area of the hts-spec (see the italicized section below):

NM:i:count

Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch.
Note this means that ambiguity codes in both sequence and reference that match each other, such as N in both, or compatible codes such as A and R, are still counted as mismatches.
The special sequence base = will always be considered to be a match, even if the reference is ambiguous at that point.
Alignment reference skips, padding, soft and hard clipping (N, P, S and H CIGAR operations) do not count as mismatches, but insertions and deletions count as one mismatch per base.

Note that historically this has been ill-defined and both data and tools exist that disagree with this definition. (emphasis added)

The NM tag that would be given by samtools calmd can be recreated by adding the NM tag value and the ZC tag value. I will update the documentation to clarify this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants