Gene sorting compared to fasta #83

Open
stephan-nylinder opened this issue Feb 6, 2024 · 4 comments

Comments

@stephan-nylinder

Describe the bug
When running EMBLmyGFF3 on a fasta file with a specific order of the genes, it re-orders the genes in the flat file output in what seems to be alphanumerical order based on the gene names. Input gene order is not retained.

General (please complete the following information):

  • EMBLmyGFF3 2.2
  • EMBLmyGFF3 installation/use: Manual
  • OS: macOS and Windows 11

To Reproduce
Any fasta+gff
Any input parameters

Expected behavior
The gene order in the flat file output should be guaranteed to match the order in the fasta input.

Screenshots
None

Additional context
None

@Juke34
Collaborator

Juke34 commented Feb 6, 2024

This behavior is related to biopython. Could you tell me which version of biopython you are using?
See #70.

@stephan-nylinder
Author

Seems to be 1.80

@percyfal

Chiming in here. We noted the issue of contigs not being sorted according to the input order of the assembly file. The culprit here is not biopython, but rather BCBio.GFF. In EMBLmyGFF.py you have the following. At line 1365:

seq_dict = SeqIO.to_dict( SeqIO.parse(infasta, "fasta") )

I have verified that the input order is preserved.
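As a minimal illustration of why the order survives this step: a plain Python dict (3.7+) iterates in insertion order, so a dict filled in fasta order keeps that order until something sorts its keys. The contig names below are hypothetical, and the dict is only a stand-in for the real seq_dict.

```python
# Stand-in for seq_dict: a plain dict keyed by record ID.
# The IDs are hypothetical; they only mimic a fasta file whose
# contigs are not in alphanumerical order.
fasta_ids = ["contig_10", "contig_2", "contig_1"]
seq_dict = {seq_id: f"<record {seq_id}>" for seq_id in fasta_ids}

# Iterating the dict returns keys in insertion (fasta) order.
insertion_order = list(seq_dict.keys())

# Sorting the keys discards that order in favour of string order.
sorted_order = sorted(seq_dict.keys())

print(insertion_order)
print(sorted_order)
```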

Later on, at line 1484:

for record in GFF.parse(infile, base_dict=seq_dict):

is where the input order is scrambled. GFF.parse calls a function GFF.parse_in_parts which looks as follows:

def parse_in_parts(self, gff_files, base_dict=None, limit_info=None,
                   target_lines=None):
    """Parse a region of a GFF file specified, returning info as generated.

    target_lines -- The number of lines in the file which should be used
    for each partial parse. This should be determined based on available
    memory.
    """
    for results in self.parse_simple(gff_files, limit_info, target_lines):
        if base_dict is None:
            cur_dict = dict()
        else:
            cur_dict = copy.deepcopy(base_dict)
        cur_dict = self._results_to_features(cur_dict, results)
        all_ids = list(cur_dict.keys())
        all_ids.sort()
        for cur_id in all_ids:
            yield cur_dict[cur_id]

By commenting out the line all_ids.sort() I can preserve the input order.
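For anyone who would rather not patch the installed BCBio.GFF, one possible workaround is to re-order the parsed records on the calling side. The sketch below is only an illustration: restore_input_order is a hypothetical helper, not part of EMBLmyGFF3 or BCBio.GFF, and it assumes each record exposes an .id attribute matching the fasta IDs. It collects all records before yielding any, so it gives up the streaming behaviour of the generator.

```python
from types import SimpleNamespace


def restore_input_order(records, input_ids):
    """Return records sorted to follow input_ids (hypothetical helper).

    Records whose ID is not in input_ids are placed at the end,
    keeping their relative order (sorted() is stable).
    """
    rank = {seq_id: i for i, seq_id in enumerate(input_ids)}
    return sorted(records, key=lambda rec: rank.get(rec.id, len(rank)))


# Toy records standing in for parsed GFF records, in sorted order.
parsed = [SimpleNamespace(id=name)
          for name in ["contig_1", "contig_10", "contig_2"]]
fasta_order = ["contig_10", "contig_2", "contig_1"]

reordered_ids = [rec.id for rec in restore_input_order(parsed, fasta_order)]
print(reordered_ids)
```

Under those assumptions, the loop at line 1484 could iterate over restore_input_order(GFF.parse(infile, base_dict=seq_dict), list(seq_dict)) instead of over GFF.parse directly.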

@Juke34
Collaborator

Juke34 commented Mar 28, 2024

Excellent, thank you for the feedback @percyfal !
