Several Issues Encountered During Model Training #1

Open
AlvinYH opened this issue Jun 14, 2024 · 4 comments
AlvinYH commented Jun 14, 2024

Thank you for publicly releasing your code! However, I encountered several problems while training the model:

  1. At https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L909, it appears that the mask and the dense face tensor are not on the same device. I resolved this by moving the mask to the GPU (see the device sketch below).
  2. At https://github.com/czh-98/STAR/blob/master/lib/trainer.py#L691, modifying the retarget_pose attribute in the trainer class does not seem to alter its value. This causes a bug at https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L874 because retarget_pose remains None. I'm unsure of the underlying reason, but I fixed this by encapsulating the function that sets retarget_pose within the dlmesh class.
  3. I couldn't locate data/FLAME_masks/FLAME.obj after downloading the FLAME Vertex Masks and FLAME Mediapipe Landmark files as described in the README. Could you provide specific instructions on how to obtain this file?

Thank you!
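
For reference, here is a minimal sketch of the device fix from point 1, assuming placeholder names (mask, dense_faces) rather than the repo's actual variables:

```python
import torch

# Hypothetical reproduction of the device mismatch around lib/dlmesh.py#L909.
# `mask` and `dense_faces` are placeholder names, not the repo's exact variables.
device = "cuda" if torch.cuda.is_available() else "cpu"
dense_faces = torch.randint(0, 100, (200, 3), device=device)  # faces already on the GPU
mask = torch.zeros(200, dtype=torch.bool)                     # created on CPU by default
mask[:50] = True

# Mixing devices when indexing can raise a RuntimeError, so move the mask
# to wherever the faces already live before using it:
mask = mask.to(dense_faces.device)
masked_faces = dense_faces[mask]
```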

czh-98 (Owner) commented Jun 14, 2024

Hi, thanks for your attention.

  • For 1, I checked the device, and the tensors are indeed on different devices. Although it still works on my own server, I have fixed it.
  • For 2, I ran the code on my server. The training script does not raise errors and does update the pose, so I do not know the reason... 😂
  • For 1 and 2, I think it might be because you have a multi-GPU server, while I only tested on a single-GPU server, so I would suggest running the script with CUDA_VISIBLE_DEVICES=0 python xxx.
  • For 3, I just uploaded the FLAME.obj file for convenience.

Let me know if you have any other questions :)

Jackiemin233 commented

@AlvinYH For question 2, I think you may be using torch 2.0+. I have debugged the code and found the cause: on lines 111-113 in train.py, self.model is compiled by torch and turned from a DLMesh into an optimized-module wrapper, which cannot read the initialized retarget_pose sequence. I deleted these three lines and it works for me. :)
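
To illustrate the behaviour described above, here is a minimal sketch (not the repo's actual train.py; DLMesh here is a stand-in with only the relevant attribute). On torch 2.0.x, setting an attribute on the compiled wrapper does not necessarily reach the wrapped module:

```python
import torch
import torch.nn as nn

class DLMesh(nn.Module):
    """Stand-in for lib/dlmesh.py's DLMesh; only the attribute matters here."""
    def __init__(self):
        super().__init__()
        self.retarget_pose = None  # set later by the trainer

model = DLMesh()
compiled = torch.compile(model)       # roughly what train.py lines 111-113 presumably do

# The trainer writes through the compiled wrapper, but (on torch 2.0.x) the write
# lands on the wrapper itself, so the original DLMesh still sees None:
compiled.retarget_pose = "some_pose"
print(model.retarget_pose)            # None -> dlmesh.py later fails because retarget_pose is None

# Workarounds: skip torch.compile entirely (as suggested above), or write through
# the wrapped module via the wrapper's _orig_mod attribute:
compiled._orig_mod.retarget_pose = "some_pose"
print(model.retarget_pose)            # "some_pose"
```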

AlvinYH (Author) commented Jun 16, 2024

@czh-98 Thanks for your reply! As @Jackiemin233 mentioned, I did use torch 2.0+, and deleting lines 111-113 fixed the bug. Thank you both!
However, when using torch 2.0+, I encountered an in-place operation error:
RuntimeError: one of the variables needed for gradient computation has been modified by an in-place operation: [torch.cuda.FloatTensor [25193]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead.
This error traces back to line 161 in lib/guidance/shape_reg.py during the computation of the Laplacian smoothness loss. I have since downgraded to torch 1.12 and resumed training, but I'm curious about the cause of this bug and whether there is a solution other than downgrading the torch version.

czh-98 (Owner) commented Jun 20, 2024


I tried torch 2.0+ and noticed this issue is due to the in-place operation loss[get_flame_vertex_idx()] *= 5. I modified the code to avoid such operations, i.e., changing a += b to a = a + b, so it should now also work with torch 2.0+.
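
A minimal sketch of the out-of-place rewrite, assuming placeholder shapes and names (verts and flame_idx are illustrative; get_flame_vertex_idx() is only referenced by name):

```python
import torch

# Placeholder setup: a per-vertex quantity whose backward pass saves its output,
# as with the norm behind LinalgVectorNormBackward0 in the error message.
verts = torch.randn(25193, 3, requires_grad=True)
loss = verts.norm(dim=-1)            # per-vertex loss term

flame_idx = torch.arange(100)        # stand-in for get_flame_vertex_idx()

# In-place weighting mutates a tensor autograd still needs and can trip the
# version check on newer torch:
#   loss[flame_idx] *= 5

# Out-of-place equivalent: multiply by a weight tensor instead of mutating loss.
weights = torch.ones_like(loss)
weights[flame_idx] = 5.0
loss = loss * weights

loss.mean().backward()               # backward succeeds without the in-place error
```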
