
Why does memory keep increasing during training? #14

Open
hugh920 opened this issue Apr 5, 2022 · 10 comments

hugh920 commented Apr 5, 2022

Dear author, thanks for your code. But when I reproduced it, I found that memory kept increasing and training eventually failed with an out-of-memory error. What could be the cause?
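For reference, one pattern that is known to cause this kind of steady growth in PyTorch training loops is accumulating the loss tensor itself rather than its detached value, which keeps every iteration's autograd graph alive. A minimal sketch of the fix, using a placeholder model and data rather than this repository's actual code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data purely to illustrate the pattern; not the repo's code.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                  torch.randint(0, 2, (64,))),
                    batch_size=8)

running_loss = 0.0
for features, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    # Anti-pattern: `running_loss += loss` would keep every iteration's
    # autograd graph alive, so memory grows with each step/epoch.
    running_loss += loss.item()  # .item() drops the graph reference
```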

@dgbarclay

Likewise, trying to find a fix now.

My process keeps getting killed, even when running with 25GB RAM on Google Colab.

@dgbarclay

@hugh920 could you let me know if you find a fix in the meantime? 👍


hugh920 commented Apr 6, 2022

@dgbarclay When you train, does your memory increase with each epoch? How many epochs have you reached so far?

@dgbarclay

@hugh920 Mine is being killed whilst parsing the data; it doesn't even reach the start of training. It seems to happen within the block at line 203 of util.py. Are you able to begin training? Have you modified the code?
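I haven't worked out exactly what that block in util.py does, but if the features are being read into RAM in one go, one workaround is a Dataset that reads a single sample per `__getitem__`. A rough sketch, with placeholder file path and HDF5 keys (not the repo's real layout):

```python
import h5py
import torch
from torch.utils.data import Dataset

class LazyFeatureDataset(Dataset):
    """Reads one sample per __getitem__ instead of loading the whole
    feature array up front. "features"/"labels" keys are placeholders."""

    def __init__(self, h5_path):
        self.h5_path = h5_path
        with h5py.File(h5_path, "r") as f:
            self.length = f["features"].shape[0]
        self._file = None  # opened lazily (and separately in each worker)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        x = torch.as_tensor(self._file["features"][idx])
        y = torch.as_tensor(self._file["labels"][idx])
        return x, y
```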


hugh920 commented Apr 7, 2022

@dgbarclay Mine trains without modifying the code, but because of the growing memory it failed in the second epoch. I reduced batch_size and simplified the model structure a little so that it could keep running. I noticed that memory increased during the first two epochs and stabilized after the third; I don't understand why.
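To make the pattern easier to see, I log memory at the end of every epoch. A small sketch using psutil (assuming it is installed; where you call it is up to you):

```python
import os
import psutil
import torch

def log_memory(tag: str) -> None:
    # Host RSS of this process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    msg = f"[{tag}] host RSS: {rss_gb:.2f} GB"
    if torch.cuda.is_available():
        # Peak GPU memory allocated by tensors so far.
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        msg += f", peak GPU alloc: {peak_gb:.2f} GB"
    print(msg)

# e.g. call log_memory(f"epoch {epoch}") at the end of every epoch
```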

dgbarclay commented Apr 7, 2022

@hugh920 Okay, I have not made it that far yet. I was running out of memory while building the DataLoader, so I'm having to refactor a little. Are you able to push your version so I can compare the two? It would help me out loads, cheers.
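One approach I'm considering for the refactor is memory-mapping the feature files so nothing large is pulled into RAM when the loader is constructed. A rough sketch, with placeholder .npy file paths rather than the repository's actual ones:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MmapFeatureDataset(Dataset):
    """Memory-maps .npy feature/label files so rows are paged in on demand
    instead of loaded into RAM when the DataLoader is built."""

    def __init__(self, feature_path, label_path):
        self.features = np.load(feature_path, mmap_mode="r")
        self.labels = np.load(label_path, mmap_mode="r")

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # np.array(...) copies just this row out of the memory map.
        x = torch.as_tensor(np.array(self.features[idx]))
        y = torch.as_tensor(np.array(self.labels[idx]))
        return x, y
```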

dgbarclay commented Apr 7, 2022

@hugh920 are you able to run eval_nus_wide.sh without failure? Ultimately I just need to run this model to take image queries and produce predictions; are you able to get the model into that state?


hugh920 commented Apr 10, 2022

@dgbarclay I took the ALF out and just used FLF, which didn't work well. It may not be what you need.


hugh920 commented Apr 12, 2022

@dgbarclay I also ran into processes being killed while loading data on other projects today. I noticed that the GPU is idle while the data is being loaded; the DataLoader is built on the CPU, so it is probably the CPU's processing power that can't keep up. It has nothing to do with memory size or the GPU.
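If it really is CPU-bound rather than a leak, spreading the parsing across DataLoader worker processes can help, provided each sample is loaded lazily (each worker holds its own copy of the dataset object, so this only pays off when the dataset itself is lightweight). A small sketch with a placeholder dataset, not this repo's NUS-WIDE loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset so the snippet runs on its own.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # parse/decode samples in parallel CPU processes
    pin_memory=True,  # faster host-to-GPU copies when training on CUDA
)
```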

@akshitac8 (Owner)

Is the issue solved?
