-
Notifications
You must be signed in to change notification settings - Fork 882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor smiles parsing code to use C++ lexer and parser (#7015) #7597
Conversation
This attempts to improve the SMILES parsing procedure to provide more information about bad inputs by pointing to general location of the offending token. The improved error messages currently only apply to syntax errors, inputs with extra close parentheses, and inputs with too large numbers, but this will provide a way to extend that in the future Some examples of the improved error messages | old version | new version | |----------------------------------------|------------------------| |syntax error while parsing: c%()ccccc%() | syntax error while parsing input:<br /> c%()ccccc%() <br /> ---^ | |syntax error while parsing: c%(100000)ccccc%(100000) | syntax error while parsing input: <br /> c%(100000)ccccc%(100000) <br /> --------^ | |syntax error while parsing: COc(c1)cccc1C#|syntax error while parsing input: <br /> COc(c1)cccc1C# <br /> -------------^ | |extra close parentheses while parsing: C) | extra close parentheses while parsing input: <br /> C) <br /> -^ | | extra close parentheses while parsing: C1C)foo | extra close parentheses while parsing input: <br /> C1C)foo <br /> ---^ | | number too large while parsing: [555555555555555555C] | number too large while parsing input: <br /> [555555555555555555C] <br /> -^ | While this required using C++ lexers and parsers generated by flex and bison, it doesn't affect the tokenization and parsing logic as a whole. It does, however, improve thread safety for SMILES parsing by removing usage of global states e.g. reducing the scope of the `debugParse` parameter through removing the use of the `yysmiles_debug` global. The C++ lexer requires the FlexLexer.h header, which is installed by flex, so I added it to the repository to support building with RDK_USE_FLEXBISON turned off.
…udes to the top to resolve that
Hi @whosayn I realized that we don't actually have any direct tests that the SMILES parser works in multiple threads.
I think it would be a useful addition to this PR. |
0a53a22
to
7473f81
Compare
@greglandrum good point! I updated the test file as suggested and everything works fine. |
Hi @whosayn I tried a general first test to see the impact of this change on the SMILES parsing: parsing 50K smiles without doing sanitization.
Using master this takes: but with this branch it is: That's about a 20% performance hit. The tests were run on my linux box, using g++ as the compiler. I haven't gone through the code looking for obvious places to speed things up, but I think something needs to be done about the slowdown before this can be merged. Paying a performance price for an important bug fix is generally ok, but in this case it's refactoring/cleanup and it doesn't seem like the price is worth it. |
@greglandrum that's a bummer. I had hoped going the C++ route would help tidy up the interface and push us in the direction of more safety, but I haven't been able to mitigate the performance issues. I'll close the associated issue and open another one to just focus on the improved logging |
Reference Issue
Implements #7015
What does this implement/fix? Explain your changes.
This attempts to improve the SMILES parsing procedure to provide more information about bad inputs by pointing to general location of the offending token. The improved error messages currently only apply to syntax errors, inputs with extra close parentheses, and inputs with too large numbers, but this will provide a way to extend that in the future
Some examples of the improved error messages
c%()ccccc%()
---^
c%(100000)ccccc%(100000)
--------^
COc(c1)cccc1C#
-------------^
C)
-^
C1C)foo
---^
[555555555555555555C]
-^
While this required using C++ lexers and parsers generated by flex and bison, it doesn't affect the tokenization and parsing logic as a whole. It does, however, improve thread safety for SMILES parsing by removing usage of global states e.g. reducing the scope of the
debugParse
parameter through removing the use of theyysmiles_debug
global.The C++ lexer requires the FlexLexer.h header, which is installed by flex, so I added it to the repository to support building with RDK_USE_FLEXBISON turned off.