-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ambiguous hydrogen counts #12
Comments
@ChayaSt - is "corner case 1" really a problem? I agree that the toolkits are inconsistent which is perhaps "annoying" but ... your SMILES string is, er, "nonsense", isn't it? Specifically I also feel that way about "corner case 2", but my "I am not a chemist" disclaimer applies even more strongly here. In this case it just comes from the fact that "I expect we will only see iodine acting like a 'normal halogen' and being a terminal atom which always has valence one and a single bond to any other atom." So... I'm not terribly concerned if iodine ends up with weird numbers of hydrogens when it occurs in crazy environments. (Is there some reason this is "less crazy" than I think?) Thanks. |
@davidlmobley, I mostly agree with you that we can ignore these kinds of crazy corner cases, but it is annoying that different toolkits treat them differently. The bigger concern is the first 2 issues, where protonation states of a molecule with geometry is not conserved and that a toolkit is internally inconsistent or toolkits are inconsistent with where it places H when it is ambiguous. That's why I'm proposing restricting input to explicit H SMILES (it doesn't fully solve the third issue but that is something we can have a note about in the documentation). The problem is that while there is a way to check openeye molecules that the SMILES used to create the molecule has explicit hydrogen, I did not yet find an equivalent check for RDKit. |
I'd ask the RDKit issue tracker. And I think any cases where a toolkit is internally inconsistent should also be raised as bugs/issues with either toolkit's support folks. |
@ChayaSt So, right now I don't check for explicit hydrogens, and in fact I rely on the I agree that restricting to explicit H SMILES is a good idea here. I think that the corner cases are good to document, and I agree with you and David that it probably isn't critical at this stage to expect proper behavior for these specific ones. |
We want to ensure that incoming SMILES have unambiguous protonation / tautomer states.
There are several issues to look out for:
This might be an issue for inputs that include geometry and can be resolved with sticking to explicit hydrogen SMILES. Protonation state change issue choderalab/perses#375
@wiederm has seen cases where this can happen. Explicit hydrogens should also resolve this.
This is not always the case. The corner cases I found:
Corner case 1
c-1-c-c-c-c-c-1'
Using explicit hydrogen SMILES helps here.
When reading this SMILES, OE adds one H per carbon, RDKit adds 2 H per carbon.
output:
Corner case 2
'O1CC2=C(I1C)C=CC=C2'
RDKit adds 2 Hs to I
Using the same snippet as above, explicit hydrogen SMILES does not resolve the inconsistency. This is probably a bug in RDKit.
Edit: An issue already exists on this rdkit/rdkit#1569. It doesn't look like this will get fixed anytime soon.
This issue cannot always be resolved with requiring explicit hydrogen.
Given all these issues, I think it makes sense to require explicit hydrogen SMILES for CMILES.
To check if a SMILES has explicit H, I can check if
atom.GetImplicitHCount() > 0
since all implicit H are set to zero when explicit hydrogens are added.However, there is no equivalent in RDKit that I am aware of. When RDKit reads an explicit hydrogen SMILES, it seems to set all hydrogens to implicit.
@j-wags, how do you check that SMILES have explicit hydrogens for RDKit?
The text was updated successfully, but these errors were encountered: