
Running TextGCN processing crashes due to high RAM usage #7

Open
rcmcabral opened this issue Jun 17, 2020 · 1 comment

Comments


rcmcabral commented Jun 17, 2020

Hi. I'm exploring the TextGCN implementation in the toolkit. I saw the sample using the Bible text, but decided to use the toolkit package instead since it is easier to run on Google Colab. I managed to clone the repo and import the library. Using the data from the Bible sample, the code runs until the train_and_fit(config) part. My setup is as follows:

config = Config(task='classification') # loads default argument parameters as above
config.train_data = 't_bbe.csv' # sets training data path
config.infer_data = 't_bbe.csv' # sets infer data path
config.num_classes = 66 # sets number of prediction classes
config.batch_size = 32
config.model_no = 0
config.lr = 0.001 # change learning rate
config.num_epochs = 10
config.max_vocab_len = 400

I set train_data and infer_data to the same csv file first just to see if I could get the model to run, but it seems I can't get through preprocessing. train_and_fit(config) runs up to building the document-word edges, but then RAM usage spikes to 12 GB and Colab crashes (running with GPU). The output before crashing is as follows:

06/16/2020 07:40:01 PM [INFO]: Loading data...
06/16/2020 07:40:01 PM [INFO]: Building datasets and graph from raw data... Note this will take quite a while...
06/16/2020 07:40:01 PM [INFO]: Preparing data...
06/16/2020 07:40:18 PM [INFO]: Calculating Tf-idf...
06/16/2020 07:40:19 PM [INFO]: Building graph (No. of document, word nodes: 62206, 400)...
06/16/2020 07:40:19 PM [INFO]: Adding document nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Adding word nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Building document-word edges...
100%|██████████| 62206/62206 [04:25<00:00, 234.45it/s]

I initially didn't set a value for max_vocab_len, but it couldn't get past 3% on building document-word edges. Limiting it to 400, I was able to reach 100%, but it crashes right after that. I'm afraid setting it any lower would essentially remove most of the data.

My actual data has around double the number of documents of the Bible sample, so I was wondering if there is a way to minimize RAM consumption and get it to work without needing more than 12 GB of RAM.

--
Edit: I tried using the suggested dataset (IMDB Sentiment Classification) with a max vocab of 200, but it crashes while building the adjacency matrix as well.
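
For reference, here is my rough back-of-envelope estimate of the memory needed if the full adjacency over the document and word nodes from the log above were ever held as a dense float64 array (just my own arithmetic, not a claim about how the toolkit actually stores it):

import numpy as np

n_nodes = 62206 + 400  # document nodes + word nodes from the log above
dense_bytes = n_nodes ** 2 * np.dtype(np.float64).itemsize
print(round(dense_bytes / 1e9, 1), "GB")  # ~31.4 GB (still ~15.7 GB at float32)

Either way, that is well above Colab's 12 GB.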

plkmo (Owner) commented Jun 19, 2020

I have updated the script so that building the document-word edges is done on the fly, but this only slightly reduced RAM usage. Graph preprocessing is quite intensive since it is quadratic in the number of nodes (in this case 62206 document + 400 word nodes), so I'm afraid there's probably no way to reduce it further without decreasing the number of nodes.
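
To make the scaling concrete (a simplified sketch only, not the actual preprocessing code; docs below is stand-in data for the t_bbe.csv verses): the tf-idf document-word weights themselves can be kept in scipy's sparse format, where memory grows with the number of non-zero entries, but the adjacency still spans all document + word nodes, which is where the quadratic cost comes from.

from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in for the list of document strings loaded from t_bbe.csv
docs = ["in the beginning god made the heaven and the earth",
        "and the earth was waste and without form"]

vectorizer = TfidfVectorizer(max_features=400)  # analogous to max_vocab_len = 400
tfidf = vectorizer.fit_transform(docs).tocoo()  # sparse (n_docs x n_words) matrix

# each non-zero tf-idf value corresponds to one document-word edge
doc_ids, word_ids, weights = tfidf.row, tfidf.col, tfidf.data
print(tfidf.nnz, "doc-word edges vs", tfidf.shape[0] * tfidf.shape[1], "dense entries")

So lowering max_vocab_len (and hence the total node count) remains the main lever for fitting into 12 GB.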
