
smpl_mesh_root_align #158
Closed · ZhengdiYu opened this issue Feb 13, 2022 · 19 comments

@ZhengdiYu commented Feb 13, 2022

Hi, I noticed that your ROMP_HRNet_32.pkl was trained with smpl_mesh_root_align=False. But in v1.yml, smpl_mesh_root_align is not set, so it takes its default value, True.

So my questions are:

  1. (Solved ✔) At first I found my model had the same mesh-shift issue as the ResNet one; then I found the reason: image.yml was originally designed for ROMP_HRNet_32.pkl, which was trained with smpl_mesh_root_align=False. If we want to test on images with a model trained from the pre-trained HRNet backbone and v1.yml, then smpl_mesh_root_align in image.yml should also be set to True, just like for ResNet (see issue #106, "Question about the released Resnet-50 trained models"). So this is solved.

  2. When should smpl_mesh_root_align be True or False? Why did you set it to True for v1.yml and ResNet, although it was False for ROMP_HRNet_32.pkl? I think it doesn't matter for the 3D joint loss, as long as we do another alignment before calculating MPJPE/PA-MPJPE. And for the 2D part, the weak camera parameters will automatically be learnt to project those 3D joints to align with the 2D ground truth, as long as the setting is consistent throughout.

  3. When fine-tuning from your model ROMP_HRNet_32.pkl using v1_hrnet_3dpw_ft.yml, smpl_mesh_root_align also takes its default value True. However, ROMP_HRNet_32.pkl was trained with smpl_mesh_root_align=False.

As we know from question 1, if we use different settings of smpl_mesh_root_align, the visualization will be shifted, so I think this could be a problem for training and fine-tuning.

And I tried to train with smpl_mesh_root_align=False from scratch, but it ended up with the error below:

```
Traceback (most recent call last):
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/rctv12/projects/ROMP/multi-person/romp/train.py", line 148, in <module>
    main()
  File "/home2/rctv12/projects/ROMP/multi-person/romp/train.py", line 145, in main
    trainer.train()
  File "/home2/rctv12/projects/ROMP/multi-person/romp/train.py", line 33, in train
    self.train_epoch(epoch)
  File "/home2/rctv12/projects/ROMP/multi-person/romp/train.py", line 94, in train_epoch
    self.train_log_visualization(outputs, loss, run_time, data_time, losses, losses_dict, epoch, iter_index)
  File "/home2/rctv12/projects/ROMP/multi-person/romp/train.py", line 74, in train_log_visualization
    vis_cfg={'settings': ['save_img'], 'vids': vis_ids, 'save_dir':self.train_img_dir, 'save_name':save_name, 'verrors': [vis_errors], 'error_names':['E']})
  File "/home2/rctv12/projects/ROMP/multi-person/romp/lib/models/../utils/../visualization/visualization.py", line 102, in visulize_result
    rendered_imgs = self.visualize_renderer_verts_list(per_img_verts_list, images=org_imgs.copy(), trans=mesh_trans)
  File "/home2/rctv12/projects/ROMP/multi-person/romp/lib/models/../utils/../visualization/visualization.py", line 62, in visualize_renderer_verts_list
    rendered_img = self.renderer(verts, faces, colors=color, focal_length=args().focal_length, cam_params=cam_params)
  File "/home2/rctv12/projects/ROMP/multi-person/romp/lib/models/../utils/../visualization/renderer_pt3d.py", line 102, in __call__
    images = self.renderer(meshes)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/pytorch3d/renderer/mesh/renderer.py", line 59, in forward
    fragments = self.rasterizer(meshes_world, **kwargs)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/pytorch3d/renderer/mesh/rasterizer.py", line 168, in forward
    meshes_proj = self.transform(meshes_world, **kwargs)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/pytorch3d/renderer/mesh/rasterizer.py", line 147, in transform
    verts_world, eps=eps
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/pytorch3d/transforms/transform3d.py", line 336, in transform_points
    points_out = _broadcast_bmm(points_batch, composed_matrix)
  File "/home2/rctv12/miniconda3/envs/ROMP/lib/python3.7/site-packages/pytorch3d/transforms/transform3d.py", line 753, in _broadcast_bmm
    return a.bmm(b)
RuntimeError: expected scalar type Half but found Float
```

I'm still debugging anyway.

@Arthur151 (Owner) commented Feb 13, 2022

Thanks~
About smpl_mesh_root_align: as long as it is set consistently between training and testing, either way is OK. You are right about this.

This bug is pretty easy to fix. It is caused by a multiplication between different data types (Half and Float). Please check the data types where the bug occurred and cast them with .float(). The easiest way is to cast all outputs with .float(), which I have done in the released code. There must be some changes in your code; please cast these outputs to float.
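
For illustration, a minimal sketch of this failure mode and the fix (my own example, not ROMP code):

```python
import torch

# bmm between mismatched dtypes reproduces the error in the traceback above.
a = torch.randn(1, 10, 3, dtype=torch.float16)
b = torch.randn(1, 3, 4, dtype=torch.float32)
try:
    a.bmm(b)
except RuntimeError as e:
    print(e)             # dtype-mismatch error, e.g. "expected scalar type ..."

out = a.float().bmm(b)   # the suggested fix: cast to float32 before the bmm
print(out.dtype)         # torch.float32
```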

@ZhengdiYu (Author) commented Feb 13, 2022

> Thanks~ About your questions: 2. Either way is OK, you are right about this. 3. This bug is pretty easy to fix. It is caused by a multiplication between different data types (Half and Float). Please check the data types where the bug occurred and cast them with .float(). The easiest way is to cast all outputs with .float(), which I have done in the released code. There must be some changes in your code; please cast these outputs to float.

  1. I understand that it should be consistent. This is exactly what I wanted to say. The problem is that it doesn't seem to be consistent in your current training guidance, which doesn't mention that we should change smpl_mesh_root_align to False when fine-tuning your ROMP_HRNet_32.pkl.

  2. I do understand what caused it, as it's clearly written in the log. What really confused me is that I didn't make any change to your code except setting smpl_mesh_root_align to False; I was trying to reproduce your results, so I only changed settings, never code. This error only occurs when I set smpl_mesh_root_align to False.

  3. Could you confirm that you used smpl_mesh_root_align=False to train ROMP_HRNet_32.pkl? Have you encountered this before?

It seems that smpl_mesh_root_align only controls this part:
[screenshot of the relevant code]

I don't understand why this error would occur if this part is skipped. I will debug to find out where things went wrong, and will be back soon~

@ZhengdiYu (Author) commented Feb 13, 2022

Hey, I spent my afternoon on it and finally located the issue, OMG... hhh. It seems to be a bug in your current released code:

The output vertices of lbs are already float16, and the subtraction inside the smpl_mesh_root_align branch promotes them to float32. So this is why the error only occurs when smpl_mesh_root_align is False: if we skip that branch, the output stays float16, which triggers the PyTorch3D error above:

smpl.py:
[screenshots of the relevant code in smpl.py]

Going further, that's because in the lbs function:
[screenshots of the relevant lbs code]

Even if A and W are both float32, T is still float16, resulting in float16 verts as well. Pretty weird.
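
For reference, a minimal sketch of the dtype promotion at play here (my own example; the shapes are illustrative):

```python
import torch

verts = torch.randn(6890, 3, dtype=torch.float16)  # like the lbs output
root = torch.randn(1, 3, dtype=torch.float32)      # like the root joint

# Mixed-dtype subtraction promotes the result to float32, which is why the
# root-alignment branch silently "fixes" the dtype...
aligned = verts - root
print(aligned.dtype)      # torch.float32

# ...while skipping the branch leaves float16 verts for PyTorch3D to choke on.
# Explicit fix: cast the lbs output to float32 regardless of the branch.
verts = verts.float()
```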

Anyway, I have fixed this now. But the questions above still confuse me:

Do you have any suggestions about the questions above (the fine-tuning config and ROMP_HRNet_32.pkl use different smpl_mesh_root_align settings, which is inconsistent)? Maybe we can further improve this after fixing the bug?

All in all, I think there are some bugs in the current released code:

  1. ROMP_HRNet_32.pkl and the fine-tuning config use inconsistent smpl_mesh_root_align settings.
  2. The output vertices of lbs in smpl.py are float16; only when smpl_mesh_root_align is enabled does the subtraction promote them to float32.
  3. Some other path bugs and design issues from my previous issue #132 (Gaussian kernel size).

@Arthur151 (Owner) commented:

Thanks for your great work.
About the bugs,

  1. It may not be well solved. ROMP_HRNet_32.pkl is kept for reproducing some previous results.
  2. Thanks for your work. I have fixed it at this line in the last commit. Next time, could you please submit a pull request to fix it?
  3. The design and SMPL calling are fixed. Sorry for missing that issue.

@ZhengdiYu (Author) commented:

> Thanks for your great work. About the bugs,
> 1. It may not be well solved. ROMP_HRNet_32.pkl is kept for reproducing some previous results.
> 2. Thanks for your work. I have fixed it at this line in the last commit. Next time, could you please submit a pull request to fix it?
> 3. The design and SMPL calling are fixed. Sorry for missing that issue.

You're welcome.

I have one more question: since we're going to do pelvis keypoint alignment before evaluation anyway, why don't we directly align to the pelvis keypoint to begin with, instead of aligning to the root joint?

@Arthur151 (Owner) commented Feb 14, 2022

You might notice that in different evaluations we have to align to different roots, because different evaluation benchmarks define the root joint differently. They may share the same name (pelvis), but they are really at different positions.
To match the pelvis definition of most in-the-wild 2D pose datasets during training, I chose the current root joint for alignment. It may not be the perfect choice, but I think it is clearly better than the original SMPL pelvis.
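
For concreteness, a minimal sketch of the root alignment done before computing MPJPE (my own example; root_idx is whatever joint index the benchmark defines as the pelvis):

```python
import torch

def root_aligned_mpjpe(pred, gt, root_idx=0):
    """pred, gt: (N, J, 3) joints; returns the mean per-joint error."""
    pred = pred - pred[:, root_idx:root_idx + 1]   # root-align predictions
    gt = gt - gt[:, root_idx:root_idx + 1]         # root-align ground truth
    return (pred - gt).norm(dim=-1).mean()
```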

@ZhengdiYu (Author) commented Feb 15, 2022

@Arthur151 Great! Thanks for your great help! And I have another question, about your result_parser.py:

```python
def match_params(self, outputs, meta_data, cfg):
    gt_keys = ['params', 'full_kp2d', 'kp_3d', 'subject_ids', 'valid_masks']
    exclude_keys = ['heatmap', 'centermap', 'AE_joints', 'person_centers', 'all_person_detected_mask']
    center_gts_info = process_gt_center(meta_data['person_centers'])
    center_preds_info = self.centermap_parser.parse_centermap(outputs['center_map'])
    mc_centers = self.match_gt_pred(center_gts_info, center_preds_info, outputs['center_map'].device, cfg['is_training'])
    batch_ids, flat_inds, person_ids = mc_centers['batch_ids'], mc_centers['flat_inds'], mc_centers['person_ids']
    if len(batch_ids) == 0:  # no predicted center matched a GT center
        if 'new_training' in cfg:
            if cfg['new_training']:
                outputs['detection_flag'] = torch.Tensor([False for _ in range(len(meta_data['batch_ids']))]).cuda()
                outputs['reorganize_idx'] = meta_data['batch_ids'].cuda()
                return outputs, meta_data
        # fallback: sample a single entry at batch 0, middle of the map
        batch_ids, flat_inds = torch.zeros(1).long().to(outputs['center_map'].device), (torch.ones(1) * self.map_size**2 / 2.).to(outputs['center_map'].device).long()
        person_ids = batch_ids.clone()
    outputs['detection_flag'] = torch.Tensor([True for _ in range(len(batch_ids))]).cuda()
    if 'params_maps' in outputs and 'params_pred' not in outputs:
        outputs['params_pred'] = self.parameter_sampling(outputs['params_maps'], batch_ids, flat_inds, use_transform=True)
    outputs, meta_data = self.reorganize_data(outputs, meta_data, exclude_keys, gt_keys, batch_ids, person_ids)
    outputs['centers_pred'] = torch.stack([flat_inds % args().centermap_size, flat_inds // args().centermap_size], 1)
    return outputs, meta_data
```

As I understand it, len(batch_ids)==0 means there are no matched results. But why do you then set batch_ids and person_ids to 0 and flat_inds to the middle of the map (the fallback branch above)? This makes outputs['detection_flag'] permanently True, because len(batch_ids) will always be at least 1.

This means that if no person is detected, we sample a meaningless parameter from batch 0 at that fallback position and compute the loss against person 0 of batch 0. And the param loss and keypoint loss will always be calculated because outputs['detection_flag'] is always True, which I think doesn't make any sense.

Am I missing something? Is it a feature or a bug?
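
For concreteness, a small sketch of where that hard-coded fallback index lands (assuming a 64×64 centermap, and using the same x/y decoding as centers_pred above):

```python
map_size = 64                          # assumed centermap resolution
flat_ind = int(map_size**2 / 2)        # 2048: the hard-coded fallback position
x, y = flat_ind % map_size, flat_ind // map_size
print(x, y)                            # (0, 32): left edge of the middle row
```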


  1. About the aforementioned inconsistency in smpl_mesh_root_align between the fine-tuning config and ROMP_HRNet_32.pkl: what do you mean by "It may not be well solved"? Do you mean the config file shouldn't be fixed to be consistent? I know ROMP_HRNet_32.pkl is kept, but we could just fix the fine-tuning config to be consistent with the model instead of changing the model. Did you get your results for Table 3 using an inconsistent smpl_mesh_root_align (I did use this before and got a similar one)? I am now trying to fix the fine-tuning config to use the same smpl_mesh_root_align as ROMP_HRNet_32.pkl to see if I can get a better result.

  2. In the current released code, model_return_loss only seems to control whether the detection flag is summed and where PA-MPJPE is calculated. But I haven't figured out why it also appears here:
    if detection_flag or args().model_return_loss:

    It seems to force the calculation of keypoints_loss and params_loss even when detection_flag is False, which is not clear to me. Anyway, as I said, in the current version detection_flag is permanently True, so this doesn't really matter.

However, why is model_return_loss True for the fine-tuning config but False for v1.yml? The same goes for the argument new_training. Are they features of a newer version? It seems that only when new_training is True and zero persons are detected can the detection flag be False. But I think there is another problem: this is inconsistent with the comment on the new_training cfg, 'learning centermap only in first few iterations for stable training', because the number of detected persons could still be larger than 0, in which case the detection flag will be True again and the param loss will still be calculated by that time...

Anyway, I'm still trying to fully understand your code. Maybe the fused multi-version features make some parts a bit hard to understand now.

@Arthur151 (Owner) commented:

  1. I set it to detect at least one subject during training. ROMP is trained on multiple GPUs, and PyTorch achieves this by dividing the data across devices and gathering the outputs back; this requires the gathered dicts to have the same keys for normal concatenation. Therefore, I set it to detect at least one subject to make the pipeline work (see the sketch after this list).
  2. I don't think it would make much difference.
  3. Please refer to 1.
  4. It is developed for stable training. I wrote the code to make it adapt to, and stay stable in, more cases.
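
A minimal sketch of that constraint (my own example, assuming a CUDA machine; not ROMP code): PyTorch's parallel gather merges per-replica output dicts key by key, so every replica must return the same keys.

```python
import torch
from torch.nn.parallel.scatter_gather import gather

# Two replica outputs with identical keys gather into one dict cleanly...
out_a = {'params_pred': torch.randn(1, 85, device='cuda'),
         'detection_flag': torch.ones(1, device='cuda')}
out_b = {'params_pred': torch.randn(2, 85, device='cuda'),
         'detection_flag': torch.ones(2, device='cuda')}
merged = gather([out_a, out_b], target_device=0)

# ...but if one replica drops a key (e.g. no detection), gathering fails.
del out_b['detection_flag']
gather([out_a, out_b], target_device=0)  # raises ValueError: mismatched keys
```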

@ZhengdiYu (Author) commented:

> 1. I set it to detect at least one subject during training. ROMP is trained on multiple GPUs, and PyTorch achieves this by dividing the data across devices and gathering the outputs back; this requires the gathered dicts to have the same keys for normal concatenation. Therefore, I set it to detect at least one subject to make the pipeline work.
> 2. I don't think it would make much difference.
> 3. Please refer to 1.
> 4. It is developed for stable training. I wrote the code to make it adapt to, and stay stable in, more cases.

Thanks for your prompt reply:

About 1 and 3: thanks, this makes sense. But the problem still exists:

Why don't you use the corresponding flat_inds of batch 0, person 0 to sample from params_map, instead of always using (torch.ones(1)*self.map_size**2/2.)? Won't this wrong position information affect training (if the actual flat_inds is far from the sampling position)? This essentially samples parameters from where they shouldn't be.

Also, this operation conflicts with the case where persons are detected:

  1. If we have detected/matched at least one person, say a person at batch_ids A, person_ids B, flat_inds C, we use its GT position C to sample params from the params_map of batch A (pred_param_maps[A, C]), and then compute the loss between that and the GT params/pose of person B in batch A (GT_params[A, B]). This clearly uses the corresponding flat_inds to sample parameters.

Why not keep the same protocol and use the GT position for sampling, instead of the meaningless fixed position flat_inds=(torch.ones(1)*self.map_size**2/2.)? (See the sketch below.)
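
A hypothetical sketch of what I mean (my own code, not ROMP's; the layout of center_gts_info is an assumption):

```python
# In the len(batch_ids) == 0 branch of match_params, sample at the first
# ground-truth center instead of a fixed map position.
gt_batch_ids, gt_person_ids, gt_centers = center_gts_info  # assumed layout
batch_ids = gt_batch_ids[:1]
person_ids = gt_person_ids[:1]
y, x = gt_centers[0]                                       # (row, col) cell
flat_inds = (y * self.map_size + x).view(1).long()
```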


About 2: does this mean that the camera map is very easy to learn? Even if we use a wrong projection at first, all the pj_2d will be shifted from where they should be, but the model will adjust to the new alignment very quickly.

@Arthur151 (Owner) commented Feb 15, 2022

Thanks for your suggestion about directly using flat_inds; I will try it later.
The case where ROMP cannot detect a single person in the whole batch is pretty rare; it may not happen during normal training. But your suggestion makes sense. Thanks!

It is indeed easy for the model to learn to shift to the new alignment.

@Arthur151 (Owner) commented:

@ZhengdiYu
BTW, I mentioned your contributions to ROMP as this. Is this OK?

@ZhengdiYu (Author) commented Feb 17, 2022

> @ZhengdiYu BTW, I mentioned your contributions to ROMP as this. Is this OK?

Thank you for your compliments. What I've done is trivial compared to you and the other contributors. I'm just reading code and asking questions all the time; not even helpful yet.

@ZhengdiYu (Author) commented Feb 17, 2022

I'm new to this area, so I have more questions than others; hope you don't mind. Thank you for your help all along. And here comes another one, about coordinate systems:

Q1. Why do we sometimes need alignment like smpl_mesh_root_align and sometimes not? If we set it to False, I think we're essentially learning the real 3D position in the camera coordinate system instead of the root-relative position. In that case, I think we don't even need to predict weak camera parameters; we only need the camera intrinsics to convert to (u, v) coordinates. (The same goes for root-relative 3D pose.)
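
For reference, a minimal sketch of the weak-perspective projection under discussion (one common convention; not necessarily ROMP's exact code):

```python
import torch

def weak_perspective_project(j3d, cam):
    """j3d: (N, J, 3) root-relative joints; cam: (N, 3) holding (s, tx, ty)."""
    s, t = cam[:, :1], cam[:, 1:]          # scale and 2D translation
    return s[:, :, None] * (j3d[..., :2] + t[:, None, :])
```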

Why am I asking this? If I now directly put those 3D meshes into MeshLab for visualization, and smpl_mesh_root_align is True (so the model predicts root-relative 3D coordinates), the pelvises of all the meshes are centered together. Is there a way to put them into their correct positions in 3D space?

Your visualization seems to achieve this, so I just looked into your visualization code. You seem to convert the weak camera parameters into perspective camera parameters to recover the original 3D positions, so we can actually recover the relative positions in 3D space even without GT root information. Am I understanding this correctly?
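
If I understand it right, the conversion looks something like this common formulation (the focal length and image size are assumed defaults, not necessarily ROMP's exact constants):

```python
import numpy as np

def weak_to_perspective_trans(cam, focal_length=443.4, img_size=512.):
    """cam: (s, tx, ty) weak-perspective params for a square input image."""
    s, tx, ty = cam
    tz = 2. * focal_length / (img_size * s)   # depth recovered from the scale
    return np.array([tx, ty, tz])             # root translation, camera space
```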

Actually, I have just tried your ROMP_HRNet_32.pkl to get some meshes using your image.py demo. This model should have been trained with smpl_mesh_root_align=False, but the meshes still all take the pelvis as (0, 0, 0) and are merged together; I don't know why. Still trying to figure this out.
[screenshot of the merged, pelvis-centered meshes]


I think the current evaluation metrics in this area are not fully thought-through or reasonable for multi-person evaluation:

Q2.1 For multi-person evaluation, why don't we recover all the people to their real 3D positions in a common space and evaluate them all at once, instead of aligning and evaluating each one separately as in the single-person case (each person aligned by their own pelvis and evaluated against their corresponding GT)? I think that kind of evaluation ignores the relative position information: the people could be at random positions in space and still score well quantitatively, since each is aligned to its own GT anyway.

Q2.2 Regarding evaluation, how did you deal with missed detections? I think the quantitative results are only computed over successful detections, so a method could miss a lot of people and still report better numbers.


Q3. BTW, I suddenly noticed that in our previous discussion, when you spent one night verifying 'loading the pre-trained backbone or not', the MPJPE in your training log seemed strange: it is too small compared to your previous log, and to his log and mine:

> @ZhengdiYu Hi, I have tried last night. Till now, the log is similar to yours. V1_hrnet_nopretrain_check_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log You are right. I might have loaded the pre-trained backbone before. Sorry for the mistake. Thanks a lot for pointing it out. I have found the reason for the NaN loss: it is caused by the pkp2d loss. I will try to fix this in the next update.

@Arthur151 (Owner) commented Feb 17, 2022

About Q1: please note that our input is random internet images without camera parameters. This un-calibrated setting forces us to use a weak-perspective camera model. But recently some efforts have been made to use a perspective camera, like SPEC.
About Q2: maybe our new work, BEV, can solve this problem to some extent: https://arxiv.org/abs/2112.08274
Q3: That is my log; it is just what it looks like.

@ZhengdiYu (Author) commented Feb 18, 2022

About Q1,

  1. Yes, predicting camera parameters can deal with un-calibrated cameras. But I was just trying to point out that we could directly learn the un-aligned mesh/kp3d, which would give us the ability to put the meshes into their correct relative positions rather than a set of pelvis-centered person meshes like in the image I presented.
  2. How did you convert your meshes to the camera coordinate system? Did you convert the weak-perspective camera to a perspective camera?
  3. Why does your ROMP_HRNet_32.pkl still produce pelvis-aligned meshes (as shown in the image I presented) even though it was trained with smpl_mesh_root_align=False?

About Q3,

Yes, it's your log. I was just wondering why your training MPJPE (47, 48) is nearly 1/3 of the validation MPJPE (140, 150). In your previous log, his log, and my log, this ratio is about 1:1 (e.g. 90:95), not 1:3. Did you use a different loss calculation?

@Arthur151 (Owner) commented Feb 18, 2022

  1. I chose to decouple 3D translation and 3D pose estimation and learn them individually.
  2. Meshes are placed in 3D coordinates via the predicted camera parameters.
  3. I don't get your point.

Oh, sorry, I haven't figured out the reason. I edit the code every day; I can't remember which change causes this difference.

@ZhengdiYu (Author) commented Feb 22, 2022

> 1. I chose to decouple 3D translation and 3D pose estimation and learn them individually.
> 2. Meshes are placed in 3D coordinates via the predicted camera parameters.
> 3. I don't get your point.
>
> Oh, sorry, I haven't figured out the reason. I edit the code every day; I can't remember which change causes this difference.

Sorry for my late response:

  1. What do you mean by "decouple 3D translation and 3D pose estimation and learn them individually"? Do you mean the MPJPE/PA-MPJPE loss and the param loss? Or do you mean the camera parameters and the MPJPE/PA-MPJPE loss?
  2. I think you are referring to this part, right?
    def plot_multi_meshes(self, vertices, cam_params, img, mesh_colors=None, interactive_show=False, rotate_cam=False):
  3. I just wonder why all the saved output meshes are pelvis-aligned instead of separated in space. I used to think that if I set smpl_mesh_root_align=False the meshes would not be merged (aligned/put) together like in the image I presented, but even with smpl_mesh_root_align=False they are still merged together. Is this a feature of SMPL?

For example, if we have two GT person annotations in an image, namely (θ1, β1) and (θ2, β2), and we put them through the SMPL layer and output two meshes without any alignment operation, will they be in the correct relative 3D positions, or simply placed together around the pelvis (or root)?

@Arthur151 (Owner) commented Feb 22, 2022

  1. Supervising the root-relative pose lets the model learn pose independently. In contrast, supervising the un-aligned mesh/kp3d would force the model to learn the 3D translation and the 3D pose simultaneously, which couples the two factors.
  2. Yes.
  3. Yes: theta controls only the pose and beta controls only the shape; neither of them can affect the root position. Therefore, all SMPL meshes are always root-aligned.
    I didn't use the trans of the h36m dataset.

I suggest that you go deeper into the SMPL model, e.g. by reading the code. Then you would not have such confusions.
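
To illustrate point 3, a minimal sketch using the smplx package as a stand-in for ROMP's SMPL layer (the model path is hypothetical):

```python
import torch
import smplx  # assumed SMPL implementation; ROMP's own layer behaves the same

smpl = smplx.create('path/to/smpl_models', model_type='smpl', batch_size=2)
theta = torch.zeros(2, 72)   # poses (θ1, θ2)
beta = torch.zeros(2, 10)    # shapes (β1, β2)
out = smpl(global_orient=theta[:, :3], body_pose=theta[:, 3:], betas=beta)
verts = out.vertices         # (2, 6890, 3): both meshes share the same origin

# Only a per-person translation separates them in space; θ and β never do.
trans = torch.tensor([[-1., 0., 3.], [1., 0., 3.]])
verts_world = verts + trans[:, None, :]
```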

@ZhengdiYu (Author) commented Feb 22, 2022

> 1. Supervising the root-relative pose lets the model learn pose independently. In contrast, supervising the un-aligned mesh/kp3d would force the model to learn the 3D translation and the 3D pose simultaneously, which couples the two factors.
> 2. Yes.
> 3. Yes: theta controls only the pose and beta controls only the shape; neither of them can affect the root position. Therefore, all SMPL meshes are always root-aligned. I didn't use the trans of the h36m dataset.
>
> I suggest that you go deeper into the SMPL model, e.g. by reading the code. Then you would not have such confusions.

Thanks for your quick reply, and thanks for the suggestion; I will look into it tomorrow.

And by the way, one last thing to confirm: did you preprocess H36M's theta and beta to be root-centered at (0, 0, 0)? I think that without trans, the root position is not originally at (0, 0, 0), and it is in the world coordinate system.

So just to confirm: I guess that for H36M you only transform theta and beta to the camera coordinate system and use them directly without trans, and you didn't set the root to (0, 0, 0), right?
