Tokens for double letters atom (Cl and Br) #6

Open
albertma-evotec opened this issue Mar 17, 2020 · 1 comment

Comments

@albertma-evotec

Hi,

The model is currently working OK for me, but I am curious how two-letter atoms (like Cl and Br) are handled in encoding/decoding. I have looked at the one_hot_encoder module, and it seems they are treated as two tokens (e.g. "C" and "l" for a chlorine atom). Please correct me if I am wrong; I could not see them being handled the way I expected, i.e. by replacing these two-letter atoms with a dummy character before the one-hot encoding.
If chlorine is indeed treated as two tokens, wouldn't that confuse the network, since its "C" token conflicts with the aliphatic carbon C?

Albert

@robinlingwood
Owner

Hi Albert,

Yes indeed, that is certainly something one could try to improve the performance.
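
For anyone who wants to try this, here is a minimal sketch of the substitution discussed above. It is not the repository's code: the function names, the `TWO_LETTER_ATOMS` mapping, and the placeholder characters "Q" and "J" are all assumptions made for illustration (Q and J were picked because neither letter appears in any element symbol, so they should not collide with other atoms). The `one_hot` helper is a stand-in for whatever the one_hot_encoder module actually does.

```python
# Hypothetical placeholder mapping (an assumption, not from the repository).
# "Q" and "J" do not occur in any element symbol, so they cannot clash
# with other atoms in a SMILES string.
TWO_LETTER_ATOMS = {"Cl": "Q", "Br": "J"}


def encode_two_letter_atoms(smiles):
    """Replace each two-letter atom with its single-character placeholder."""
    for atom, placeholder in TWO_LETTER_ATOMS.items():
        smiles = smiles.replace(atom, placeholder)
    return smiles


def decode_two_letter_atoms(encoded):
    """Restore the two-letter atoms from their placeholders."""
    for atom, placeholder in TWO_LETTER_ATOMS.items():
        encoded = encoded.replace(placeholder, atom)
    return encoded


def one_hot(encoded, charset):
    """Stand-in one-hot encoder: one row per character, one column per charset entry."""
    index = {ch: i for i, ch in enumerate(charset)}
    return [
        [1 if index[ch] == i else 0 for i in range(len(charset))]
        for ch in encoded
    ]


smiles = "ClCCBr"
encoded = encode_two_letter_atoms(smiles)  # chlorine and bromine are now one token each
vectors = one_hot(encoded, charset="CQJ")
restored = decode_two_letter_atoms(encoded)
```

With this preprocessing, chlorine gets its own column in the one-hot matrix instead of sharing the "C" column with aliphatic carbon. The charset used for training would need to include the placeholder characters, and generated strings would need to be passed through the decode step before being interpreted as SMILES.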
