Several Issues Encountered During Model Training #1

Open
AlvinYH opened this issue Jun 14, 2024 · 4 comments
AlvinYH commented Jun 14, 2024

Thank you for publicly releasing your code! However, I encountered several problems while training the model:

  1. At https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L909, it appears that the mask and the dense face tensor are not on the same device. I resolved this by moving the mask to the GPU (see the device sketch below).
  2. At https://github.com/czh-98/STAR/blob/master/lib/trainer.py#L691, modifying the retarget_pose attribute in the trainer class does not seem to alter its value. This causes a bug at https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L874 because retarget_pose remains None. I'm unsure of the underlying reason, but I fixed this by encapsulating the function that sets retarget_pose within the dlmesh class.
  3. I couldn't locate data/FLAME_masks/FLAME.obj after downloading the FLAME Vertex Masks and FLAME Mediapipe Landmark files as described in the README. Could you provide specific instructions on how to obtain this file?

Thank you!
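
For reference, here is a minimal sketch of the device fix from point 1, assuming placeholder names (mask, dense_faces) rather than the repo's actual variables:

```python
import torch

# Hypothetical reproduction of the device mismatch around lib/dlmesh.py#L909.
# `mask` and `dense_faces` are placeholder names, not the repo's exact variables.
device = "cuda" if torch.cuda.is_available() else "cpu"
dense_faces = torch.randint(0, 100, (200, 3), device=device)  # faces already on the GPU
mask = torch.zeros(200, dtype=torch.bool)                     # created on CPU by default
mask[:50] = True

# Mixing devices when indexing can raise a RuntimeError, so move the mask
# to wherever the faces already live before using it:
mask = mask.to(dense_faces.device)
masked_faces = dense_faces[mask]
```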

czh-98 (Owner) commented Jun 14, 2024

Hi, thanks for your attention.

  • For 1, I checked the device, and the tensors are indeed on different devices. Although it still works on my own server, I have fixed it.
  • For 2, I ran the code on my server. The training script does not raise errors and does update the pose, so I do not know the reason... 😂
  • For 1 and 2, I think it might be because you have a multi-GPU server, while I only tested on a single-GPU server, so I would suggest running the script with CUDA_VISIBLE_DEVICES=0 python xxx.
  • For 3, I just uploaded the FLAME.obj file for convenience.

Let me know if you have any other questions :)

Jackiemin233 commented

@AlvinYH For question 2, I think you may be using torch 2.0+. I have debugged the code and found the cause: on lines 111-113 in train.py, self.model is compiled by torch and turned from a DLMesh into an optimized-module wrapper, which cannot read the initialized retarget_pose sequence. I deleted these three lines and it works for me. :)
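
To illustrate the behaviour described above, here is a minimal sketch (not the repo's actual train.py; DLMesh here is a stand-in with only the relevant attribute). On torch 2.0.x, setting an attribute on the compiled wrapper does not necessarily reach the wrapped module:

```python
import torch
import torch.nn as nn

class DLMesh(nn.Module):
    """Stand-in for lib/dlmesh.py's DLMesh; only the attribute matters here."""
    def __init__(self):
        super().__init__()
        self.retarget_pose = None  # set later by the trainer

model = DLMesh()
compiled = torch.compile(model)       # roughly what train.py lines 111-113 presumably do

# The trainer writes through the compiled wrapper, but (on torch 2.0.x) the write
# lands on the wrapper itself, so the original DLMesh still sees None:
compiled.retarget_pose = "some_pose"
print(model.retarget_pose)            # None -> dlmesh.py later fails because retarget_pose is None

# Workarounds: skip torch.compile entirely (as suggested above), or write through
# the wrapped module via the wrapper's _orig_mod attribute:
compiled._orig_mod.retarget_pose = "some_pose"
print(model.retarget_pose)            # "some_pose"
```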

AlvinYH (Author) commented Jun 16, 2024

@czh-98 Thanks for your reply! As @Jackiemin233 mentioned, I did use torch 2.0+, and deleting lines 111-113 fixed the bug. Thank you both!
However, when using torch 2.0+, I encountered an in-place operation error:
RuntimeError: one of the variables needed for gradient computation has been modified by an in-place operation: [torch.cuda.FloatTensor [25193]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead.
This error traces back to line 161 in lib/guidance/shape_reg.py during the computation of the Laplacian smoothness loss. I have since downgraded to torch 1.12 and resumed training, but I'm curious about the cause of this bug and whether there is a solution other than downgrading the torch version.

czh-98 (Owner) commented Jun 20, 2024


I tried torch 2.0+ and noticed this issue is due to the in-place operation loss[get_flame_vertex_idx()] *= 5. I modified the code to avoid such operations, i.e., changing a += b to a = a + b, so it should now also work with torch 2.0+.
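
A minimal sketch of the out-of-place rewrite, assuming placeholder shapes and names (verts and flame_idx are illustrative; get_flame_vertex_idx() is only referenced by name):

```python
import torch

# Placeholder setup: a per-vertex quantity whose backward pass saves its output,
# as with the norm behind LinalgVectorNormBackward0 in the error message.
verts = torch.randn(25193, 3, requires_grad=True)
loss = verts.norm(dim=-1)            # per-vertex loss term

flame_idx = torch.arange(100)        # stand-in for get_flame_vertex_idx()

# In-place weighting mutates a tensor autograd still needs and can trip the
# version check on newer torch:
#   loss[flame_idx] *= 5

# Out-of-place equivalent: multiply by a weight tensor instead of mutating loss.
weights = torch.ones_like(loss)
weights[flame_idx] = 5.0
loss = loss * weights

loss.mean().backward()               # backward succeeds without the in-place error
```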
