replace base split by regex split to fix ; inside quotes #215

DevangThakkar · 2023-05-14T06:24:43Z

This PR fixes #212 where semicolons inside quoted string values were causing issues. I was not able to replicate the AttributeStringError mentioned in the issue since the usage was not specified but this is what the outputs look like before and after the change. I used the file mentioned in the issue for this test.

BEFORE:

>>> import gffutils
>>> db = gffutils.create_db("GCF_mini.gtf", "GCF.db")
<redacted_path>/gffutils/create.py:770: UserWarning: It appears you have a gene feature in your GTF file. You may want to use the `disable_infer_genes=True` option to speed up database creation
  warnings.warn(
>>> print(list(db.all_features())[1].attributes)
gene_id: ['BSU_00010']
transcript_id: ['unassigned_transcript_1']
db_xref: ['EnsemblGenomes-Gn:BSU00010', 'EnsemblGenomes-Tr:CAB11777', 'GOA:P05648', 'InterPro:IPR001957', 'InterPro:IPR003593', 'InterPro:IPR010921', 'InterPro:IPR013159', 'InterPro:IPR013317', 'InterPro:IPR018312', 'InterPro:IPR020591', 'InterPro:IPR024633', 'InterPro:IPR027417', 'PDB:4TPS', 'SubtiList:BG10065', 'UniProtKB/Swiss-Prot:P05648']
experiment: ['publication(s) with functional evidences', ' PMID:2167836', ' 2846289', ' 12682299', ' 16120674', ' 1779750', ' 28166228']
gbkey: ['CDS']
gene: ['dnaA']
locus_tag: ['BSU_00010']
note: ['"Evidence 1a: Function from experimental evidences in the studied strain']
PubMedId:: ['2167836', ' 2846289', ' 12682299', ' 16120674', ' 1779750', ' 28166228']
Product: ['type f : factor"']
product: ['chromosomal replication initiator informational ATPase']
protein_id: ['NP_387882.1']
transl_table: ['11']
exon_number: ['1']
: []

AFTER:

>>> import gffutils
>>> db = gffutils.create_db("GCF_mini.gtf", "GCF.db")
<redacted_path>/gffutils/create.py:770: UserWarning: It appears you have a gene feature in your GTF file. You may want to use the `disable_infer_genes=True` option to speed up database creation
  warnings.warn(
>>> print(list(db.all_features())[1].attributes)
gene_id: ['BSU_00010']
transcript_id: ['unassigned_transcript_1']
db_xref: ['EnsemblGenomes-Gn:BSU00010', 'EnsemblGenomes-Tr:CAB11777', 'GOA:P05648', 'InterPro:IPR001957', 'InterPro:IPR003593', 'InterPro:IPR010921', 'InterPro:IPR013159', 'InterPro:IPR013317', 'InterPro:IPR018312', 'InterPro:IPR020591', 'InterPro:IPR024633', 'InterPro:IPR027417', 'PDB:4TPS', 'SubtiList:BG10065', 'UniProtKB/Swiss-Prot:P05648']
experiment: ['publication(s) with functional evidences', ' PMID:2167836', ' 2846289', ' 12682299', ' 16120674', ' 1779750', ' 28166228']
gbkey: ['CDS']
gene: ['dnaA']
locus_tag: ['BSU_00010']
note: ['Evidence 1a: Function from experimental evidences in the studied strain; PubMedId: 2167836', ' 2846289', ' 12682299', ' 16120674', ' 1779750', ' 28166228; Product type f : factor']
product: ['chromosomal replication initiator informational ATPase']
protein_id: ['NP_387882.1']
transl_table: ['11']
exon_number: ['1']
: []

Let me know what you think!

daler · 2023-07-04T02:27:47Z

Thanks for this. I'm worried about the performance hit though. Typically, the string splitting is what takes the most time when creating a db, and python regular expressions are notoriously slow.

This benchmarking code:

import re
import timeit
r = re.compile(f''';(?=(?:[^"]|"[^"]*")*$)''')

a = '''gene_id "BSU_00010"; transcript_id "unassigned_transcript_1"; db_xref "EnsemblGenomes-Gn:BSU00010"; db_xref "EnsemblGenomes-Tr:CAB11777"; db_xref "GOA:P05648"; db_xref "InterPro:IPR001957"; db_xref "InterPro:IPR003593"; db_xref "InterPro:IPR010921"; db_xref "InterPro:IPR013159"; db_xref "InterPro:IPR013317"; db_xref "InterPro:IPR018312";  '''

def split(a):
    a.split()

def regex(a):
    r.split(a)


print("split: " , timeit.Timer(f"split('{a}')", "from __main__ import split").timeit())
print("regex: " , timeit.Timer(f"regex('{a}')", "from __main__ import regex").timeit())

gives this on my laptop:

split:  0.7770832080277614
regex:  23.40327387099387

So for this example, the regex is 30x slower! That means a human GTF, instead of taking ~15 min to build a db, would take over 7 hrs. And based on some initial testing, the gap between the methods grows with increasing attribute string length.

Unfortunately without getting the performance closer to what str.split is, handling a corner case like this is not worth slowing everything down.

Maybe there could be a fallback option, where the regex only happens under certain circumstances? I think the issue in #212 is not hit until after parsing, but maybe there would be some way of detecting, at parse time, cases where it should try harder with the regex.

DevangThakkar · 2023-07-10T02:12:32Z

That makes sense, I had not thought about the speed implications of regex. An alternative solution is to allow for this using an additional term in the dialect. Again, given that this is an edge case, I didn't want to slow everything down for this - so I followed the logic in the wormbase_gff2 example where the user needs to infer the dialect using helpers.infer_dialect() and pass that dialect to the created db or to gffutils.create_db(... , dialect=dialect, ...)

The changes I have made are:

constants.py

add a new dialect term semicolon_in_quotes which defaults to False.

parser.py

add an argument infer_dialect_call to the function definition _split_keyvals() with a default value of False.
if a dialect was provided, regex is used only if semicolon_in_quotes is explicitly set to True.
if a dialect was not provided, regex is not used unless the function was called from helpers.infer_dialect() with the infer_dialect_call explicitly set to True. If the call was made from helpers.infer_dialect(), we see if there is a difference between the regex split and the default split. The two would be different if there is a semicolon inside quotes, in which case semicolon_in_quotes is set to True.

helpers.py

add an argument infer_dialect_call to the function call _split_keyvals() with the value set to True.

Let me know what you think.

replace base split by regex split to fix ; inside quotes

5cc2b93

daler mentioned this pull request Jul 4, 2023

Field separators in quoted attributes cause error #212

Open

DevangThakkar added 3 commits July 10, 2023 01:49

create a new dialect token to avoid slowing down everything

817bbe3

undo code duplication

0eb9617

undo code duplication

8543575

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace base split by regex split to fix ; inside quotes #215

replace base split by regex split to fix ; inside quotes #215

DevangThakkar commented May 14, 2023

daler commented Jul 4, 2023

DevangThakkar commented Jul 10, 2023

replace base split by regex split to fix ; inside quotes #215

Are you sure you want to change the base?

replace base split by regex split to fix ; inside quotes #215

Conversation

DevangThakkar commented May 14, 2023

daler commented Jul 4, 2023

DevangThakkar commented Jul 10, 2023