Tokens for double-letter atoms (Cl and Br) #6
Comments
Hi Albert, yes indeed, that is certainly something one could try in order to improve performance.
Hi,
The model is currently working fine for me, but I am curious how two-letter atoms (such as Cl and Br) are handled during encoding and decoding. I have looked at the one_hot_encoder module, and it seems they are treated as two tokens (e.g. "C" and "l" for a chlorine atom). Please correct me if I am wrong, but I could not see them being handled the way I expected, i.e. by replacing each two-letter atom with a single dummy character before one-hot encoding.
If chlorine is indeed treated as two tokens, wouldn't that confuse the network, since its "C" token is the same one used for aliphatic carbon?
Albert
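
To make the suggestion concrete, here is a minimal sketch of the dummy-character idea described above. It is not the repository's one_hot_encoder code. The placeholder characters "L" and "R", the function names, and the tiny charset are all assumptions chosen for illustration. The sketch also assumes the data contains no bracketed elements whose symbols start with "L" or "R" (such as [Li]), since those would collide with the placeholders.

```python
# Hypothetical mapping: two-letter element symbols -> single placeholder characters.
# "L" and "R" are assumed to be otherwise unused in the SMILES strings of the dataset.
DOUBLE_TO_SINGLE = {"Cl": "L", "Br": "R"}
SINGLE_TO_DOUBLE = {one: two for two, one in DOUBLE_TO_SINGLE.items()}


def encode_double_letters(smiles):
    """Replace each two-letter atom with its single-character placeholder."""
    for two, one in DOUBLE_TO_SINGLE.items():
        smiles = smiles.replace(two, one)
    return smiles


def decode_double_letters(text):
    """Restore two-letter atoms from their placeholders."""
    for one, two in SINGLE_TO_DOUBLE.items():
        text = text.replace(one, two)
    return text


def one_hot(text, charset):
    """Return one row per character, with a single 1 at that character's index."""
    index = {ch: i for i, ch in enumerate(charset)}
    rows = []
    for ch in text:
        row = [0] * len(charset)
        row[index[ch]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    smiles = "ClCCBr"
    replaced = encode_double_letters(smiles)  # "LCCR"
    print(replaced)
    # Chlorine and carbon now occupy different columns of the encoding.
    print(one_hot(replaced, charset=["C", "L", "R"]))
    print(decode_double_letters(replaced))  # "ClCCBr"
```

With this preprocessing, chlorine gets its own column in the one-hot matrix instead of sharing the "C" column with aliphatic carbon, and the original string is recovered by applying the inverse mapping after decoding.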