
Running TextGCN processing crashes due to high RAM usage #7

Open
rcmcabral opened this issue Jun 17, 2020 · 1 comment

Comments


rcmcabral commented Jun 17, 2020

Hi. I'm exploring the TextGCN implementation in the toolkit. I saw the sample using the Bible text, but decided to use the toolkit package instead since it is easier to run on Google Colab. I managed to clone the repo and import the library. Using the data from the Bible sample, the code runs until the train_and_fit(config) part. My setup is as follows:

config = Config(task='classification') # loads default argument parameters as above
config.train_data = 't_bbe.csv' # sets training data path
config.infer_data = 't_bbe.csv' # sets infer data path
config.num_classes = 66 # sets number of prediction classes
config.batch_size = 32
config.model_no = 0
config.lr = 0.001 # change learning rate
config.num_epochs = 10
config.max_vocab_len = 400

I set train_data and infer_data to the same csv file first just to see if I could get the model to run, but it seems I can't get through preprocessing. train_and_fit(config) runs up to building the document-word edges, but then RAM usage spikes to 12 GB and Colab crashes (running with GPU). The output before crashing is as follows:

06/16/2020 07:40:01 PM [INFO]: Loading data...
06/16/2020 07:40:01 PM [INFO]: Building datasets and graph from raw data... Note this will take quite a while...
06/16/2020 07:40:01 PM [INFO]: Preparing data...
06/16/2020 07:40:18 PM [INFO]: Calculating Tf-idf...
06/16/2020 07:40:19 PM [INFO]: Building graph (No. of document, word nodes: 62206, 400)...
06/16/2020 07:40:19 PM [INFO]: Adding document nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Adding word nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Building document-word edges...
100%|██████████| 62206/62206 [04:25<00:00, 234.45it/s]

I initially didn't set a value for max_vocab_len, but it couldn't get past 3% on building document-word edges. Limiting it to 400, I was able to reach 100%, but it crashes right after that. I'm afraid setting it any lower would essentially remove most of the data.

My actual data has around double the number of documents of the Bible sample, so I was wondering if there is a way to minimize RAM consumption and get it to work without needing more than 12 GB of RAM.

--
Edit: I tried using the suggested dataset (IMDB Sentiment Classification) with a max vocab of 200, but it crashes while building the adjacency matrix as well.
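
For reference, here is my rough back-of-envelope estimate of the memory needed if the full adjacency over the document and word nodes from the log above were ever held as a dense float64 array (just my own arithmetic, not a claim about how the toolkit actually stores it):

import numpy as np

n_nodes = 62206 + 400  # document nodes + word nodes from the log above
dense_bytes = n_nodes ** 2 * np.dtype(np.float64).itemsize
print(round(dense_bytes / 1e9, 1), "GB")  # ~31.4 GB (still ~15.7 GB at float32)

Either way, that is well above Colab's 12 GB.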

plkmo (Owner) commented Jun 19, 2020

I have updated the script so that building the document-word edges is done on the fly, but this only slightly reduced RAM usage. Graph preprocessing is quite intensive since it is quadratic in the number of nodes (in this case 62206 document + 400 word nodes), so I'm afraid there's probably no way to reduce it further without decreasing the number of nodes.
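
To make the scaling concrete (a simplified sketch only, not the actual preprocessing code; docs below is stand-in data for the t_bbe.csv verses): the tf-idf document-word weights themselves can be kept in scipy's sparse format, where memory grows with the number of non-zero entries, but the adjacency still spans all document + word nodes, which is where the quadratic cost comes from.

from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in for the list of document strings loaded from t_bbe.csv
docs = ["in the beginning god made the heaven and the earth",
        "and the earth was waste and without form"]

vectorizer = TfidfVectorizer(max_features=400)  # analogous to max_vocab_len = 400
tfidf = vectorizer.fit_transform(docs).tocoo()  # sparse (n_docs x n_words) matrix

# each non-zero tf-idf value corresponds to one document-word edge
doc_ids, word_ids, weights = tfidf.row, tfidf.col, tfidf.data
print(tfidf.nnz, "doc-word edges vs", tfidf.shape[0] * tfidf.shape[1], "dense entries")

So lowering max_vocab_len (and hence the total node count) remains the main lever for fitting into 12 GB.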
