Gene sorting compared to fasta #83

stephan-nylinder · 2024-02-06T10:09:21Z

Describe the bug
When running EMBLmyGFF3 on a fasta file with a specific order of the genes, it re-orders the genes in the flat file output in what seems to be alphanumerical order based on the gene names. Input gene order is not retained.

General (please complete the following information):

EMBLmyGFF3 2.2
EMBLmyGFF3 installation/use: Manual
OS: macOS and Windows 11

To Reproduce
Any fasta+gff
Any input parameters

Expected behavior
The gene order in the flat file output should be guaranteed to match the order in the fasta input.

Screenshots
None

Additional context
None

Juke34 · 2024-02-06T10:31:05Z

This behavior is related to biopython. Could you tell me which version of biopython you use?
See #70.

stephan-nylinder · 2024-02-06T11:18:52Z

Seems to be 1.80

percyfal · 2024-03-28T10:56:24Z

Chiming in here. We also noticed that contigs are not output in the input order of the assembly file. The culprit here is not biopython, but rather BCBio.GFF. In EMBLmyGFF3.py you have the following. At line 1365:

seq_dict = SeqIO.to_dict( SeqIO.parse(infasta, "fasta") )

I have verified that the input order is preserved at this point.

Later on, at line 1484:

for record in GFF.parse(infile, base_dict=seq_dict):

is where the input order is scrambled. GFF.parse calls a function GFF.parse_in_parts which looks as follows:

def parse_in_parts(self, gff_files, base_dict=None, limit_info=None,
                   target_lines=None):
    """Parse a region of a GFF file specified, returning info as generated.

    target_lines -- The number of lines in the file which should be used
    for each partial parse. This should be determined based on available
    memory.
    """
    for results in self.parse_simple(gff_files, limit_info, target_lines):
        if base_dict is None:
            cur_dict = dict()
        else:
            cur_dict = copy.deepcopy(base_dict)
        cur_dict = self._results_to_features(cur_dict, results)
        all_ids = list(cur_dict.keys())
        all_ids.sort()
        for cur_id in all_ids:
            yield cur_dict[cur_id]

By commenting out the line all_ids.sort() I can preserve input order.
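For anyone who would rather not patch BCBio.GFF locally, here is a minimal sketch of a possible workaround on the caller side: collect the parsed records and put them back in the order of the fasta dictionary. It uses only the standard library, with SimpleNamespace objects standing in for SeqRecord. The helper name restore_input_order and the contig names are hypothetical, and the sketch assumes each record exposes an id attribute matching its key in seq_dict.

```python
from types import SimpleNamespace


def restore_input_order(records, ordered_ids):
    """Return records sorted by the position of their id in ordered_ids.

    Records whose id is not in ordered_ids are placed last and keep their
    relative order, because sorted() is stable.
    """
    rank = {seq_id: i for i, seq_id in enumerate(ordered_ids)}
    return sorted(records, key=lambda rec: rank.get(rec.id, len(rank)))


# Hypothetical fasta order; a plain dict keeps insertion order (Python 3.7+).
fasta_order = ["contig10", "contig2", "contig1"]
seq_dict = {seq_id: SimpleNamespace(id=seq_id) for seq_id in fasta_order}

# Mimic the all_ids.sort() step quoted above: ids come out in string order.
scrambled = [seq_dict[seq_id] for seq_id in sorted(seq_dict)]

# Put the records back in fasta order.
restored = restore_input_order(scrambled, seq_dict.keys())

print([rec.id for rec in scrambled])
print([rec.id for rec in restored])
```

In EMBLmyGFF3 the same idea would presumably be applied to the records coming out of GFF.parse(infile, base_dict=seq_dict) before looping over them. Note that this holds all records in memory at once, which may or may not be acceptable for large assemblies.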

Juke34 · 2024-03-28T12:12:01Z

Excellent, thank you for the feedback @percyfal !

percyfal mentioned this issue Mar 28, 2024

GFF.parse modifies input order of base_dict chapmanb/bcbb#144

Open