Encoding new unseen molecules #11

Open
manavsingh415 opened this issue Jul 4, 2021 · 2 comments

@manavsingh415

Hi. When trying to create 512-dimensional vector representations of some new molecules (which the encoder may not have seen during training), I get the following error:

Traceback (most recent call last):
  File "encode.py", line 56, in <module>
    encode(**args)
  File "encode.py", line 35, in encode
    latent = model.transform(model.vectorize(mols_in))
  File "/content/latent-gan/ddc_pub/ddc_v3.py", line 1042, in vectorize
    return self.smilesvec1.transform(mols_test)
  File "/content/latent-gan/molvecgen/vectorizers.py", line 145, in transform
    one_hot[i,j+offset,charidx] = 1
IndexError: index -201 is out of bounds for axis 1 with size 138

I am using the pretrained ChEMBL encoder. Any ideas about how to resolve this? Thanks.

@muammar

muammar commented Feb 8, 2022

Did you find a solution to this?

@muammar

muammar commented Feb 8, 2022

Because the README explicitly mentions that the token length limit is 128, I decided to use SmilesVectorizer from molvecgen and removed all SMILES whose token vector is longer than that limit.

Suppose your data frame is called data in the example below.

from tqdm import tqdm
from rdkit import Chem
from molvecgen.vectorizers import SmilesVectorizer

TOKEN_LENGTH_LIMIT = 128  # token length limit stated in the latent-gan README

# `data` is assumed to be a pandas DataFrame with a SMILES column
remove = []

for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    if mol is None:  # also drop SMILES that RDKit cannot parse
        remove.append(index)
        continue

    # Fit a fresh vectorizer on the single molecule to get its token length
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])

    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        remove.append(index)

print(
    f"There are {len(remove)} SMILES with a token length larger than {TOKEN_LENGTH_LIMIT}"
)

data.drop(remove, inplace=True)
data.to_csv("preprocessed.csv", index=False, header=False)

And now it worked.
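As a side note (an assumption on my part, not something stated in the README): since the traceback shows the model delegating to its smilesvec1 vectorizer, you may be able to read the actual limit off the pretrained encoder instead of hardcoding 128, something like:

# Hypothetical check: assumes `model` is the DDC model loaded in encode.py
# and that its smilesvec1 attribute (seen in the traceback) exposes maxlength
print("Pretrained encoder one-hot length:", model.smilesvec1.maxlength)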


The other option, if too many molecules are discarded because their token length is larger than 128, is to retrain the autoencoder.
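To decide between the two options, a quick look at the discarded fraction helps; this is just a sketch building on the remove list and data frame from the snippet above (the 10% cutoff is only illustrative):

# Run this before the data.drop(...) call above, so len(data) is the full size.
# Rough heuristic: a small discarded fraction means dropping is cheap;
# a large one means retraining with a longer token limit may pay off.
discarded_fraction = len(remove) / len(data)
print(f"Discarding {discarded_fraction:.1%} of the dataset")

if discarded_fraction > 0.10:  # illustrative cutoff, pick your own
    print("Consider retraining the autoencoder with a larger token limit")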

Good luck.
