Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reported CIGAR string is for a suboptimal alignment in some cases. #2

Open
macieksk opened this issue Dec 4, 2019 · 1 comment
Open

Comments

@macieksk
Copy link

macieksk commented Dec 4, 2019

Hi,

The example code here uses an aligner with the following scoring scheme:
Match:2, Mismatch:-1, all_gaps_penalties=1

import ssw_aligner

query='CAGACAATCAGCATGTTTCCGGCAGCGCCGGTAG'
target='TTCCACCATTTGTCCGGACCGGGC'

def get_striped_ed_aligner(query_seq,match_score=2):
    return ssw_aligner.StripedSmithWaterman(
      query_seq,
      gap_open_penalty=1,
      gap_extend_penalty=1,
      match_score=match_score,
      mismatch_score=-1,
      suppress_sequences=True,
      score_only=False,
      score_size=2, ## score is < 255 this should be 0; >255 then 1; 2 don't know
      zero_index=True,
      mask_length=0,   ## Turn off suboptimal
      mask_auto=False, ## Turn off suboptimal
    )
algn=get_striped_ed_aligner(query,match_score=2)(target)

cigcnt=defaultdict(lambda:0)
def add_tl(d,t,l): d[t]+=l; return
[add_tl(cigcnt,t,l) for l, t in algn._tuples_from_cigar()]
print(cigcnt)

print(algn.__repr__())

This results with CIGAR counts and the alignment stats:

defaultdict(<function <lambda> at 0x7f35c0ec2048>, {'M': 19, 'I': 5, 'D': 1})
{
    'optimal_alignment_score': 28,
    'suboptimal_alignment_score': 0,
    'query_begin': 7,
    'query_end': 30,
    'target_begin': 1,
    'target_end_optimal': 21,
    'target_end_suboptimal': -1,
    'cigar': '7M1I2M1D5M1I1M3I4M',
    'query_sequence': '',
    'target_sequence': ''
}

The issue here is that the CIGAR string returned in the alignment above is for a suboptimal alignment of score 26, see the computations below:

#CAGACAA |TCAGCATGTT TCCGG CAGCGCCGG| TAG
#      T |TCCACAT TTGTCCGG  A   CCGG| GC
# 'cigar':  '7M    1I2M1D 5M1I1M 3I  4M',
#                           M=5+2+2+5+1+4=19  Mismatch=2  Match=17 I=5 D=1
#                           score=2*17-2-5-1=26
print(2*17-2-5-1)
26

The optimal alignment with the reported score has a different CIGAR string, see below.

#CAGACAA |TC  AG  CATGTT  TCCGGCAGCGCCGG| TAG
#      T |TC CA   CAT TTG TCCGG A   CCGG| GC
# 'cigar':'2M1D1M1I3M1I2M1D  5M1I1M3I  4M',
#                           M=18 Mismatch=0 I=6 D=2  
#                           score=2*18-6-2=28
print(2*18-6-2)
28

Is it possible to fix it such that the reported CIGAR string is always for an optimal alignment?

@macieksk
Copy link
Author

macieksk commented Dec 4, 2019

Possibly solved in scikit-bio/scikit-bio#1679 ... though seems not quite from tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant