
Why does memory keep increasing during training? #14

Open
hugh920 opened this issue Apr 5, 2022 · 10 comments

hugh920 commented Apr 5, 2022

Dear author, thanks for your code. But when I reproduced it, I found that memory kept increasing and training eventually failed with an out-of-memory error. What could be the cause?
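For reference, one pattern that is known to cause this kind of steady growth in PyTorch training loops is accumulating the loss tensor itself rather than its detached value, which keeps every iteration's autograd graph alive. A minimal sketch of the fix, using a placeholder model and data rather than this repository's actual code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data purely to illustrate the pattern; not the repo's code.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                  torch.randint(0, 2, (64,))),
                    batch_size=8)

running_loss = 0.0
for features, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    # Anti-pattern: `running_loss += loss` would keep every iteration's
    # autograd graph alive, so memory grows with each step/epoch.
    running_loss += loss.item()  # .item() drops the graph reference
```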

@dgbarclay

Likewise, trying to find a fix now.

My process keeps getting killed, even when running with 25GB RAM on Google Colab.

@dgbarclay

@hugh920 could you let me know if you find a fix in the meantime? 👍


hugh920 commented Apr 6, 2022

@dgbarclay When you train, does your memory increase with each epoch? How many epochs have you reached so far?

@dgbarclay

@hugh920 Mine is being killed whilst parsing the data; it doesn't even reach the start of training. It seems to happen within the block at line 203 of util.py. Are you able to begin training? Have you modified the code?
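I haven't worked out exactly what that block in util.py does, but if the features are being read into RAM in one go, one workaround is a Dataset that reads a single sample per `__getitem__`. A rough sketch, with placeholder file path and HDF5 keys (not the repo's real layout):

```python
import h5py
import torch
from torch.utils.data import Dataset

class LazyFeatureDataset(Dataset):
    """Reads one sample per __getitem__ instead of loading the whole
    feature array up front. "features"/"labels" keys are placeholders."""

    def __init__(self, h5_path):
        self.h5_path = h5_path
        with h5py.File(h5_path, "r") as f:
            self.length = f["features"].shape[0]
        self._file = None  # opened lazily (and separately in each worker)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        x = torch.as_tensor(self._file["features"][idx])
        y = torch.as_tensor(self._file["labels"][idx])
        return x, y
```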


hugh920 commented Apr 7, 2022

@dgbarclay Mine trains without modifying the code, but because of the growing memory it failed in the second epoch. I reduced batch_size and simplified the model structure a little so that it could keep running. I noticed that memory increased during the first two epochs and stabilized after the third; I don't understand why.
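To make the pattern easier to see, I log memory at the end of every epoch. A small sketch using psutil (assuming it is installed; where you call it is up to you):

```python
import os
import psutil
import torch

def log_memory(tag: str) -> None:
    # Host RSS of this process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    msg = f"[{tag}] host RSS: {rss_gb:.2f} GB"
    if torch.cuda.is_available():
        # Peak GPU memory allocated by tensors so far.
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        msg += f", peak GPU alloc: {peak_gb:.2f} GB"
    print(msg)

# e.g. call log_memory(f"epoch {epoch}") at the end of every epoch
```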

dgbarclay commented Apr 7, 2022

@hugh920 Okay, I have not made it that far yet. I was running out of memory while building the DataLoader, so I'm having to refactor a little. Are you able to push your version so I can compare the two? It would help me out loads, cheers.
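One approach I'm considering for the refactor is memory-mapping the feature files so nothing large is pulled into RAM when the loader is constructed. A rough sketch, with placeholder .npy file paths rather than the repository's actual ones:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MmapFeatureDataset(Dataset):
    """Memory-maps .npy feature/label files so rows are paged in on demand
    instead of loaded into RAM when the DataLoader is built."""

    def __init__(self, feature_path, label_path):
        self.features = np.load(feature_path, mmap_mode="r")
        self.labels = np.load(label_path, mmap_mode="r")

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # np.array(...) copies just this row out of the memory map.
        x = torch.as_tensor(np.array(self.features[idx]))
        y = torch.as_tensor(np.array(self.labels[idx]))
        return x, y
```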

dgbarclay commented Apr 7, 2022

@hugh920 are you able to run eval_nus_wide.sh without failure? Ultimately I just need to run this model to take image queries and produce predictions; are you able to get the model into that state?


hugh920 commented Apr 10, 2022

@dgbarclay I took the ALF out and just used FLF, which didn't work well. It may not be what you need.


hugh920 commented Apr 12, 2022

@dgbarclay I also ran into processes being killed while loading data on other projects today. I noticed that the GPU is idle while the data is being loaded; the DataLoader is built on the CPU, so it is probably the CPU's processing power that can't keep up. It has nothing to do with memory size or the GPU.
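If it really is CPU-bound rather than a leak, spreading the parsing across DataLoader worker processes can help, provided each sample is loaded lazily (each worker holds its own copy of the dataset object, so this only pays off when the dataset itself is lightweight). A small sketch with a placeholder dataset, not this repo's NUS-WIDE loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset so the snippet runs on its own.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # parse/decode samples in parallel CPU processes
    pin_memory=True,  # faster host-to-GPU copies when training on CUDA
)
```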

@akshitac8 (Owner)

Is the issue solved?
