Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

truncated scaffolds from gff files #70

Open
simarilion opened this issue Oct 9, 2022 · 4 comments
Open

truncated scaffolds from gff files #70

simarilion opened this issue Oct 9, 2022 · 4 comments

Comments

@simarilion
Copy link

I'm sure I'm missing something obvious here.....

I'd like to view a long (583,249 bp long) but feature-poor scaffold (7 features). These features are described in the attached gff file which specifies the full length of the scaffold on the second line as:

##sequence-region scaffold59 1 583249

I can generate output files OK, but they don't display the entire length of the scaffold - they always stop at the end of the last feature (position 483,172), i.e. missing about 100,000 bp of straight black line (example output also attached).

Here's an example of one of my attempts to use dan_feature_viewer to get this working:


from dna_features_viewer import BiopythonTranslator
from BCBio import GFF
in_file = "scaffold59_original.gff"

in_handle = open(in_file)
gff_records = []
for rec in GFF.parse(in_handle):
#print(rec)
gff_records.append(rec)
in_handle.close()

#fig, ax = plt.subplots()

start = 0
end = 583249
record = BiopythonTranslator().translate_record(gff_records[0][start:end])
ax, _ = record.plot(figure_width=10, strand_in_label_threshold=7)
ax.figure.savefig("dna_FV_scaffold59_original_test2.png", bbox_inches="tight")


scaffold59_original_gff.txt

dna_FV_scaffold59.pdf

@simarilion
Copy link
Author

Kind of figured a workaround out - if one includes the fasta formatted sequence at the end of the gff file the entire length gets plotted, so after the last gff feature annotation include the following:

##FASTA

scaffold59
CAGTAAATTTGACTGGGTCCAAACATAGCTGGTATCACATATATTTTGCGATACCTACACACGTGATGATGATTGATTTC
ACACATCTTTAGCAAAATAAGGatttagtttgttatttaaatttgtttattttcattgtgactataatttatttttttcg
ta.....

Would be nice if this wasn't necessary as this information is present in the second line of the off file (##sequence-region scaffold59 1 583249)....

Otherwise this is a great package - thanks for providing it to the community!

@Zulko
Copy link
Member

Zulko commented Oct 10, 2022

Just in case: DnaFeatureViewer normally uses the length given by the biopython record. Maybe this is a case where the record imported by GFF doesn't have the right size (meaning the problem is not with DFV). What does len(records[0]) print?

In any case, another workaround to your fasta solution is to manually give the sequence length:

record = BiopythonTranslator().translate_record(gff_records[0])
record.sequence_length = 583249

@simarilion
Copy link
Author

simarilion commented Oct 10, 2022 via email

@Zulko
Copy link
Member

Zulko commented Oct 10, 2022

Great! To be clear, I wasn't suggesting that the length in the file is incorrect, but that the record produced by BCBio.GFF has the incorrect size, which would mean the issue is with the BCBio library and DnaFeatureViewer isn't at fault.

Could you check the value returned by len(record[0])?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants