
How much system RAM is required per GPU for the InterHand3D dataset? #672

pablovela5620 opened this issue May 25, 2021 · 6 comments

@pablovela5620

Looking at the provided log, it looks like 8 Titan X GPUs were used to train on the InterHand dataset with a batch size of 16 and 2 workers per GPU.

The full InterHand dataset is pretty massive (over 1 million images), and my understanding is that each worker on each GPU loads the entire dataset into system RAM (not GPU VRAM). So even with, say, 128 GB of RAM, 8 GPUs * 2 workers would add up to a huge amount of system RAM. Am I understanding this correctly? I haven't had a chance to test yet.

How much system RAM did the machine used for training have? It seems very difficult to retrain on a multi-GPU system without a really significant amount of system RAM (>256 GB?).
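
For context, here is a rough, self-contained sketch (not mmpose code; the dummy dataset and its size are made up) of the kind of check one could run to see what each DataLoader worker actually holds in resident memory:

import os

import psutil
import torch
from torch.utils.data import DataLoader, Dataset


class DummyAnnotations(Dataset):
    """Stand-in for a dataset whose annotations sit in one big Python list."""

    def __init__(self, num_samples=100000):
        self.db = [{'joints': torch.zeros(42, 3)} for _ in range(num_samples)]

    def __len__(self):
        return len(self.db)

    def __getitem__(self, idx):
        return self.db[idx]['joints']


if __name__ == '__main__':
    loader = DataLoader(DummyAnnotations(), batch_size=16, num_workers=2)
    for i, _batch in enumerate(loader):
        if i == 1:
            # On Linux the workers are forked from the main process, so they
            # start out sharing its pages (copy-on-write); their RSS grows as
            # pages get touched, not because each worker reloads the dataset.
            main = psutil.Process(os.getpid())
            print('main RSS (GB):', main.memory_info().rss / 1e9)
            for child in main.children(recursive=True):
                print('worker RSS (GB):', child.memory_info().rss / 1e9)
            break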

@innerlee added the "question (Further information is requested)" label on May 25, 2021
@innerlee
Contributor

"one loads up the entire dataset into system ram"

This is not the case.

@pablovela5620
Author

Understood. I have since had a chance to try training the model using the provided config, on a machine with 128 GB of RAM and 2 A6000 GPUs.

When I run on a single GPU using
python tools/train.py configs/hand3d/InterNet/interhand3d/res50_interhand3d_all_256x256.py
it uses about 30 GB of RAM to load and train the network. The reason I assumed the entire dataset was being loaded into system RAM is the large amount of RAM used during distributed training.

After running tools/dist_train.sh, I ran into the following problem.

Using the provided config with dist_train and changing only the number of GPUs and workers:

  • 1 GPU, 2 workers: ~68 GB of RAM
  • 2 GPUs, 1 worker: ~90 GB of RAM
  • 2 GPUs, 2 workers: process killed with an out-of-memory error (I have 128 GB of RAM in total)

With this testing in mind, I have the following questions:

  1. How do I manage the amount of RAM used without sacrificing the number of workers?
  2. Is this a typical amount of RAM for this dataset?
  3. What if I want to use the 30 fps version of the dataset (13 million images vs. 1.3 million, so around 10 times larger)? My guess is this would increase the amount of RAM needed enormously.

I really appreciate the help!

@ly015
Member

ly015 commented May 26, 2021

@zengwang430521 Could you please check this issue?

@zengwang430521
Collaborator

zengwang430521 commented May 27, 2021

Hi @pablovela5620,
We load all the annotations into memory before training, and this costs a lot of memory. So if you find memory insufficient, you can use fewer workers.
I'm afraid our implementation may not be suitable for the 30-fps version at the moment, because it is simply too massive.
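
As a minimal sketch, assuming the standard data dict used across mmpose configs, only workers_per_gpu needs to be lowered (the other entries stand in for what is already in the provided config):

data = dict(
    samples_per_gpu=16,  # batch size per GPU, unchanged from the provided config
    workers_per_gpu=1,   # fewer dataloader workers per GPU -> less host RAM used
    # train=dict(...), val=dict(...), test=dict(...) stay as in the provided config
)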

@innerlee added the "enhancement (New feature or request)" label and removed the "question (Further information is requested)" label on May 27, 2021
@innerlee assigned zengwang430521 and unassigned ly015 on May 27, 2021
@innerlee
Contributor

@zengwang430521 The implementation could be improved.

@pablovela5620
Author

pablovela5620 commented May 27, 2021

@zengwang430521 So with the current implementation, it seems like there are basically two solutions when using distributed single-node training:

  1. Reduce the number of workers (in my case I can only use 1)
  2. Buy more RAM

I did notice that distributed training with 1 GPU results in much higher RAM usage than normal training with 1 GPU (~68 GB vs. ~30 GB). I'm not totally sure why; some clarity here would be appreciated.

Also, how much RAM did the 8-GPU, 2-worker machine use when training on the InterHand3D dataset?

If I were to modify the dataset implementation (so that I could get it working with the 30 fps version), it seems like more of a design decision spanning all of the mmpose hand datasets. I may be completely wrong here (please correct me if I am), but the use of xtcocotools in HandBaseDataset

from xtcocotools.coco import COCO

self.coco = COCO(ann_file)
self.img_ids = self.coco.getImgIds()

basically loads the entire annotation file into memory for any dataset that depends on it. Also, looking at Interhand2D/Interhand3D and others, the code run when _get_db() is called,

with open(self.camera_file, 'r') as f:
    cameras = json.load(f)
with open(self.joint_file, 'r') as f:
    joints = json.load(f)

is what eats up all the system memory inside the gt_db object. This seems consistent with all the other datasets as well: the entire dataset is loaded first, and then the augmentation/preprocessing pipelines are run.

So rather than loading the entire dataset up front, I would have to override __getitem__(self, idx) to load each sample on demand rather than all at once? Does this make sense, or are there other considerations I should be looking at, and downsides to not loading everything at once? (A rough sketch of what I mean is below.)
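
The sketch is purely illustrative and not mmpose's actual API: it assumes the annotations have first been split offline into one small JSON file per sample (the directory layout and class name here are hypothetical).

import json
import os

from torch.utils.data import Dataset


class LazyHandDataset(Dataset):
    """Hypothetical dataset that keeps only sample ids in memory."""

    def __init__(self, ann_dir, pipeline=None):
        self.ann_dir = ann_dir
        # Only the list of sample ids lives in RAM, not the annotations.
        self.sample_ids = sorted(
            name[:-5] for name in os.listdir(ann_dir) if name.endswith('.json'))
        self.pipeline = pipeline

    def __len__(self):
        return len(self.sample_ids)

    def __getitem__(self, idx):
        # Read a single sample's annotation from disk on demand.
        ann_path = os.path.join(self.ann_dir, f'{self.sample_ids[idx]}.json')
        with open(ann_path, 'r') as f:
            results = json.load(f)
        # Hand off to the usual augmentation/preprocessing pipeline.
        return self.pipeline(results) if self.pipeline else results

The obvious trade-off is that every __getitem__ call now does a small random disk read instead of an in-memory lookup, so data loading gets slower unless the annotation files sit on fast storage or in the OS page cache.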
