
Multi GPU Training #42

Closed

tharinduk90 opened this issue Sep 24, 2020 · 2 comments

@tharinduk90

tharinduk90 commented Sep 24, 2020

I want to do multi_view_reconstruction on real images, and so far I have gotten good results. Now I want to speed up training with multiple GPUs, because for complex models it takes more than 10 hours on my local PC (6 GB of GPU memory).

For multi-GPU training I added `multi_gpu: true` to the config (ours_depth_mvs.yaml). For the multi-GPU test I used a p3.8xlarge instance from AWS (4 GPUs, each with 16 GB of memory). The config file is as follows:

```yaml
data:
  path: data/DTU
  ignore_image_idx: []
  classes: ['scan244']
  dataset_name: DTU
  n_views: 51
  input_type: null
  train_split: null
  val_split: null
  test_split: null
  cache_fields: True
  split_model_for_images: true
  depth_range: [0., 1400.]
  img_extension: png
  img_extension_input: jpg
  depth_extension: png
  mask_extension: png
model:
  c_dim: 0
  encoder: null
  patch_size: 2
  lambda_image_gradients: 1.
  lambda_depth: 1.
  lambda_normal: 0.1
training:
  out_dir: out/multi_view_reconstruction/angel/ours_depth_mvs
  n_training_points: 2048
  n_eval_points: 8000
  model_selection_metric: mask_intersection
  model_selection_mode: maximize
  batch_size: 1
  batch_size_val: 1
  scheduler_milestones: [3000, 5000]
  scheduler_gamma: 0.5
  depth_loss_on_world_points: True
  validate_every: 5000
  visualize_every: 10000
  multi_gpu: true
generation:
  upsampling_steps: 4
  refinement_step: 30
```

But when I check the usage of the GPUs, the result is as follows.

[Screenshot of GPU utilization]

Only GPU 0 is used.

I have already checked #9.
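
For reference, a quick sanity check (not part of the DVR code, just a plain PyTorch snippet) that all four GPUs of the p3.8xlarge are actually visible to PyTorch inside the training environment:

```python
import torch

# Confirm that PyTorch can see all four GPUs on the instance.
print(torch.cuda.is_available(), torch.cuda.device_count())  # expected: True 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```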

  1. Can you help me with multi-GPU training? Can you provide guidance on how to achieve it?
  2. Can we increase batch_size and batch_size_val beyond one for multi_view_reconstruction?
@m-niemeyer
Collaborator

Hi @tharinduk90, thanks for your interest in the project!

Unfortunately, we have not thoroughly tested multi-GPU training as we never used it - we always trained on a single GPU. As you already mentioned, it might be interesting for achieving results faster and with lower per-GPU memory consumption.

Regarding the batch sizes: in the multi-view reconstruction experiments, the batch size now indicates the number of images that are sampled, whereas in the single-view reconstruction experiments it defines the number of objects to sample. This behavior is controlled by the `split_model_for_images` argument, e.g. here.
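
Regarding question 1, a minimal, untested sketch of the usual PyTorch pattern for spreading the forward pass over several GPUs with `torch.nn.DataParallel` (the model here is just a stand-in; whether DVR's trainer works correctly with a wrapped module is exactly what we have not verified):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 3)  # stand-in for the actual DVR model

# Replicate the module on every visible GPU; the input batch is scattered
# across them and the outputs are gathered back on GPU 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to('cuda')

x = torch.randn(8, 128, device='cuda')
out = model(x)
```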

Good luck with your research!

@tharinduk90
Author

@m-niemeyer, thank you very much for your reply.

In the multi-view experiment, if I set batch_size = 2 and batch_size_val = 2, it gives the following error (since the batch size refers to the number of images sampled, I expected this to work).
Is there anything I'm doing wrong? Can you help me with this?

```
Traceback (most recent call last):
  File "train.py", line 129, in <module>
    loss = trainer.train_step(batch, it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 112, in train_step
    loss = self.compute_loss(data, it=it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 405, in compute_loss
    p_world_hat_sparse, mask_pred_sparse, normals) = self.model(
  File "/home/liveroom/anaconda3/envs/dvr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 83, in forward
    normals = self.get_normals(p_world.detach(), mask_pred, c=c)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 117, in get_normals
    c = c.unsqueeze(1).repeat(1, points.shape[1], 1)[mask]
IndexError: The shape of the mask [2, 1024] at index 0 does not match the shape of the indexed tensor [1, 1024, 0] at index 0
```
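
For reference, here is a minimal standalone snippet that seems to reproduce the same shape mismatch. The shapes are an assumption inferred from the error message, not taken from the DVR code: with `encoder: null` and `c_dim: 0`, the latent code `c` apparently keeps a batch dimension of 1 while the mask follows the configured batch_size of 2.

```python
import torch

# Assumed shapes, inferred from the traceback above:
c = torch.empty(1, 0)                          # latent code: [batch=1, c_dim=0]
mask = torch.ones(2, 1024, dtype=torch.bool)   # mask: [batch_size=2, n_points=1024]

c = c.unsqueeze(1).repeat(1, mask.shape[1], 1)  # -> shape [1, 1024, 0]
c[mask]  # IndexError: mask [2, 1024] does not match indexed tensor [1, 1024, 0]
```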
