
Multi GPU Training #42

Closed

tharinduk90 opened this issue Sep 24, 2020 · 2 comments

@tharinduk90

tharinduk90 commented Sep 24, 2020

I want to do multi_view_reconstruction on real images, and so far I have gotten good results. Now I want to speed up training with multiple GPUs, because for complex models it takes more than 10 hours on my local PC (6 GB of GPU memory).

For multi-GPU training I added `multi_gpu: true` to the config (ours_depth_mvs.yaml). For the multi-GPU test I used a p3.8xlarge instance from AWS (4 GPUs, each with 16 GB of memory). The config file is as follows:

```yaml
data:
  path: data/DTU
  ignore_image_idx: []
  classes: ['scan244']
  dataset_name: DTU
  n_views: 51
  input_type: null
  train_split: null
  val_split: null
  test_split: null
  cache_fields: True
  split_model_for_images: true
  depth_range: [0., 1400.]
  img_extension: png
  img_extension_input: jpg
  depth_extension: png
  mask_extension: png
model:
  c_dim: 0
  encoder: null
  patch_size: 2
  lambda_image_gradients: 1.
  lambda_depth: 1.
  lambda_normal: 0.1
training:
  out_dir: out/multi_view_reconstruction/angel/ours_depth_mvs
  n_training_points: 2048
  n_eval_points: 8000
  model_selection_metric: mask_intersection
  model_selection_mode: maximize
  batch_size: 1
  batch_size_val: 1
  scheduler_milestones: [3000, 5000]
  scheduler_gamma: 0.5
  depth_loss_on_world_points: True
  validate_every: 5000
  visualize_every: 10000
  multi_gpu: true
generation:
  upsampling_steps: 4
  refinement_step: 30
```

But when I check the usage of the GPUs, the result is as follows.

[Screenshot of GPU utilization]

Only GPU 0 is used.

I have already checked #9.
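
For reference, a quick sanity check (not part of the DVR code, just a plain PyTorch snippet) that all four GPUs of the p3.8xlarge are actually visible to PyTorch inside the training environment:

```python
import torch

# Confirm that PyTorch can see all four GPUs on the instance.
print(torch.cuda.is_available(), torch.cuda.device_count())  # expected: True 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```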

  1. Can you help me with multi-GPU training? Can you provide guidance on how to achieve it?
  2. Can we increase batch_size and batch_size_val beyond one for multi_view_reconstruction?
@m-niemeyer
Collaborator

Hi @tharinduk90, thanks for your interest in the project!

Unfortunately, we have not thoroughly tested multi-GPU training as we never used it - we always trained on a single GPU. As you already mentioned, it might be interesting for achieving results faster and with lower per-GPU memory consumption.

Regarding the batch sizes: in the multi-view reconstruction experiments, the batch size now indicates the number of images that are sampled, whereas in the single-view reconstruction experiments it defines the number of objects to sample. This behavior is controlled by the `split_model_for_images` argument, e.g. here.
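
Regarding question 1, a minimal, untested sketch of the usual PyTorch pattern for spreading the forward pass over several GPUs with `torch.nn.DataParallel` (the model here is just a stand-in; whether DVR's trainer works correctly with a wrapped module is exactly what we have not verified):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 3)  # stand-in for the actual DVR model

# Replicate the module on every visible GPU; the input batch is scattered
# across them and the outputs are gathered back on GPU 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to('cuda')

x = torch.randn(8, 128, device='cuda')
out = model(x)
```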

Good luck with your research!

@tharinduk90
Author

@m-niemeyer, thank you very much for your reply.

In the multi-view experiment, if I set batch_size = 2 and batch_size_val = 2, it gives the following error (since the batch size refers to the number of images sampled, I expected this to work).
Is there anything I'm doing wrong? Can you help me with this?

```
Traceback (most recent call last):
  File "train.py", line 129, in <module>
    loss = trainer.train_step(batch, it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 112, in train_step
    loss = self.compute_loss(data, it=it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 405, in compute_loss
    p_world_hat_sparse, mask_pred_sparse, normals) = self.model(
  File "/home/liveroom/anaconda3/envs/dvr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 83, in forward
    normals = self.get_normals(p_world.detach(), mask_pred, c=c)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 117, in get_normals
    c = c.unsqueeze(1).repeat(1, points.shape[1], 1)[mask]
IndexError: The shape of the mask [2, 1024] at index 0 does not match the shape of the indexed tensor [1, 1024, 0] at index 0
```
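
For reference, here is a minimal standalone snippet that seems to reproduce the same shape mismatch. The shapes are an assumption inferred from the error message, not taken from the DVR code: with `encoder: null` and `c_dim: 0`, the latent code `c` apparently keeps a batch dimension of 1 while the mask follows the configured batch_size of 2.

```python
import torch

# Assumed shapes, inferred from the traceback above:
c = torch.empty(1, 0)                          # latent code: [batch=1, c_dim=0]
mask = torch.ones(2, 1024, dtype=torch.bool)   # mask: [batch_size=2, n_points=1024]

c = c.unsqueeze(1).repeat(1, mask.shape[1], 1)  # -> shape [1, 1024, 0]
c[mask]  # IndexError: mask [2, 1024] does not match indexed tensor [1, 1024, 0]
```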
