
README error #34

Open
muammar opened this issue Apr 28, 2022 · 7 comments

Comments


muammar commented Apr 28, 2022

After you generate the vocabulary in the first step of the README,

python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt 

the next line should be:

python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single

Otherwise, you get the following error:

IndexError: tuple index out of range

orubaba commented Apr 30, 2022

I have a question: how long does the training take to finish? I have been running `python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single` for a whole day and it has not completed. Is there something I am doing wrong?


muammar commented May 2, 2022

> I have a question: how long does the training take to finish? I have been running `python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single` for a whole day and it has not completed. Is there something I am doing wrong?

That's not normal; it took me a couple of hours. I had to reduce the number of CPUs used (`--ncpu`) because the run was exhausting my workstation's RAM, and I have 256 GB.
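One way to keep peak memory bounded, besides lowering `--ncpu`, is to stream results out of the worker pool instead of materializing them all at once. A minimal sketch (the `tensorize` and `preprocess_lazy` names here are illustrative stand-ins, not the actual functions in preprocess.py):

```python
from multiprocessing import Pool

def tensorize(smiles):
    # Stand-in for the real per-molecule work in preprocess.py;
    # here it just returns the string length.
    return len(smiles)

def preprocess_lazy(data, ncpu=4, chunksize=100):
    # Pool.map() holds every result in RAM at once; Pool.imap() streams
    # results back as they finish, so consuming them lazily (and writing
    # them to disk as you go) keeps peak memory roughly proportional to
    # ncpu * chunksize rather than to the whole dataset.
    with Pool(ncpu) as pool:
        yield from pool.imap(tensorize, data, chunksize=chunksize)
```

For example, `list(preprocess_lazy(["CC", "CCO"], ncpu=2))` processes the inputs in parallel while only a small window of results is in flight at any time.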


orubaba commented May 3, 2022

Wow, thanks. I was relying on my laptop with 16 GB of RAM to do the work; it seems that was an ambitious thought. Now I see why I wasn't making any headway.


muammar commented May 3, 2022

> Wow, thanks. I was relying on my laptop with 16 GB of RAM to do the work; it seems that was an ambitious thought. Now I see why I wasn't making any headway.

The ChEMBL dataset is huge, and I think the script is doing its work but keeping everything in memory, so at some point you will run out of RAM. There are libraries, like Dask, that let you work with data larger than RAM, but you would need to integrate one yourself. If you read the preprocess.py script, you will see it does a single pickle.dump at the end of the preprocessing procedure; if you found a way to write results out earlier instead of waiting until the end, you could invoke the garbage collector and free memory as you go.
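The idea above can be sketched as follows: append each batch to the pickle file as soon as it is ready, rather than accumulating everything for one final dump. (This is an illustrative sketch, not the actual preprocess.py code; `dump_incrementally` and `load_incrementally` are hypothetical names.)

```python
import gc
import pickle

def dump_incrementally(batches, path):
    # Instead of building one giant in-memory list and calling
    # pickle.dump() once at the end, write each batch to the file the
    # moment it is ready, then free it.
    with open(path, "wb") as f:
        for batch in batches:
            pickle.dump(batch, f)  # appends one pickle record per batch
            del batch
            gc.collect()  # reclaim the freed memory sooner

def load_incrementally(path):
    # Read the batches back one pickle record at a time; a pickle file
    # can hold several consecutive records.
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

The consumer side would then iterate over `load_incrementally(path)` instead of unpickling one monolithic object, so neither writer nor reader ever holds the whole dataset in RAM.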


orubaba commented May 5, 2022

Thanks so much for the suggestion. I am trying to run the get_vocab.py code on a much-reduced subset of the ChEMBL dataset but got this error:

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f6c291da0a0>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed"

I have checked online but I haven't worked it out. Kindly assist.
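For what it's worth, this particular PicklingError typically means an exception raised inside a worker (here a Boost.Python `ArgumentError`, e.g. from RDKit) could not be serialized back to the parent process. One common workaround, sketched below with hypothetical `safe_tensorize`/`run` names rather than the project's actual code, is to catch everything inside the worker and return a plain, picklable sentinel:

```python
from multiprocessing import Pool

def safe_tensorize(smiles):
    # Exceptions from C++ extensions (such as Boost.Python.ArgumentError)
    # often cannot be pickled, so letting them propagate out of a pool
    # worker crashes the pool with a PicklingError. Catching them here and
    # returning an ordinary tuple keeps the result stream picklable.
    try:
        if not smiles:
            raise ValueError("empty SMILES")  # stand-in for an RDKit failure
        return len(smiles)
    except Exception as exc:
        return ("error", repr(exc))

def run(data, ncpu=2):
    with Pool(ncpu) as pool:
        return pool.map(safe_tensorize, data)
```

Failed inputs then show up as `("error", ...)` entries you can filter out afterwards, instead of aborting the whole run.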


muammar commented Jun 14, 2022

> Thanks so much for the suggestion. I am trying to run the get_vocab.py code on a much-reduced subset of the ChEMBL dataset but got this error: multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f6c291da0a0>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed" - I have checked online but I haven't worked it out. Kindly assist.

See #33


muammar commented Jun 14, 2022

Forget my message above; the script is not using multiprocessing at all.
