Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SeqLib::BamWriter.WriteRecord() seems to be truncating the qname field #33

Open
bkohrn opened this issue Dec 7, 2018 · 2 comments
Open

Comments

@bkohrn
Copy link

bkohrn commented Dec 7, 2018

When I create a new bam record, the new bam record seems to truncate the read name at 23 characters. I am using the following code (note that this is part of a larger program; I just stopped it in the middle to check the file quickly):

SeqLib::BamReader in_bam_file;
SeqLib::BamWriter temp_bam = SeqLib::BamWriter(SeqLib::SAM);
in_bam_file.Open(input);
bool testOpen = temp_bam.Open(prefix + ".temp.bam");
temp_bam.SetHeader(in_bam_file.Header());
temp_bam.WriteHeader(); 
SeqLib::BamRecord line;
SeqLib::BamRecord temp_read1_entry;
SeqLib::BamRecord temp_bam_entry;
while (in_bam_file.GetNextRecord(line)) {
    if (paired_end_count % 2 == 1) {
        temp_read1_entry = line;
    }
    if (paired_end_count % 2 == 0) {
        std::string tagName;
        SeqLib::BamRecord temp_bam_entry;
        if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) > line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + \
            line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + "#ab";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) < line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + \
            temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + "#ba";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) == line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            paired_end_count++;
            continue;
        }
        // Write entries for Read 1
        std::string tempQualities = temp_read1_entry.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = temp_read1_entry;
        temp_bam_entry.AddZTag("X?", temp_read1_entry.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(temp_read1_entry.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
       
        // Write entries for Read 2
        tempQualities = line.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = line;
        temp_bam_entry.AddZTag("X?", line.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(line.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
        temp_bam.Close();
        abort();

When I look at the resulting file, I see this:

@hd VN:1.5 SO:queryname
@rg ID:A SM:testNonRef
abcdefghijklmnopqrstuvw 77 * 0 0 * * 0 0 GGGGTGCCCCCAAAGCGGGCCCGTGGGGGCCTCTGGGATGATGATGATGCACCTCGGCCATCCCAATTCGAGGAGGACCTGGCA DGHIIIIIIIIIIHIIIIIHHHIIHIIIGIIIIIIIIIHIIIIIIIIIIIIIIFIIGIIIGIIIIIHHIIIHIIIIIIGIHHII RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC
abcdefghijklmnopqrstuvw 141 * 0 0 * * 0 0 AAGAGAATGAGGGTCAAGGGTCAGGGCCAGAATCACTGTAGGCCACACAGGCTGTACGGGGTGGAAGGGCAGTCCTTTGTAGGA DGHHIIIIIIIHIHHIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIGHIIIIIHIHHIIIIIFHFGHHIG RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC

Based on my understanding, the read name should be "abcdefghijklmnopqrstuvwxyz", but it isn't.

Thanks for any help you can give,
Brendan

@martinjvickers
Copy link

martinjvickers commented Dec 7, 2018

That looks like a valid BAM to me, albeit unaligned to a genome. The 77 * 0 0 * * 0 0 part are columns 2-9 of the BAM/SAM specification. see section 1.4.

EDIT: Ignore above, I see what you're saying now, it's missing xyz

@martinjvickers
Copy link

I'm not entirely sure what is causing it for you, but I've just edited a program I made to see if SetQname is truncating but it doesn't seem to be;

BamRecordVector results;
bwa_ga.AlignSequence(seq, id, results, hardclip, secondary_cutoff, secondary_cap);

for(auto& i : results)
{
   i.SetQname("abcdefghijklmnopqrstuvwxyz");
   writer.WriteRecord(i);
}
abcdefghijklmnopqrstuvwxyz      16      chr4    5360249 60      101M    *       0       0       TAAATTAGGTTTGGATTTTGTGATAAGTTTAATTAATTATATTTATTTTAATGTTACGATGTTATATTATANTTTAGNATGCACTTAATNNNAGTNCNNTA   *       NA:i:2  NM:i:9  SB:i:0  AS:i:80

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants