
Marlin With Wav2Lip #27

Open
vishalsantoshi opened this issue Jul 13, 2024 · 0 comments
vishalsantoshi commented Jul 13, 2024

Is there any official source for this integration? I have a query but am not sure this is the right forum. @i-am-shreya @ControlNet, as in this part of the paper:

Lip Synchronization (LS) is another line of research that requires facial-region-specific spatio-temporal synchronization. This downstream adaptation further demonstrates the adaptation capability of MARLIN for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip [57] with MARLIN, and adjust the temporal window accordingly, i.e. from 5 frames to T frames. For evaluation, we use the LRS2 [22] dataset, which has 45,838 train, 1,082 val, and 1,243 test videos. Following the prior literature [57, 74], we use Lip-Sync Error-Distance (LSE-D ↓), Lip-Sync Error-Confidence (LSE-C ↑), and Frechet Inception Distance (FID ↓) [38] as evaluation metrics.

Did you folks train a Wav2Lip model with a MARLIN encoder, and if yes, which of the following did you do?

  • The flattened face sequences are processed by the MARLIN encoder's extract_features method to produce the final face feature map.
  • Only the final output of extract_features is used in the forward pass (sketched below).
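
If it was the former, the shape I have in mind is the sketch below. It is only my reading of the marlin-pytorch README; the checkpoint name, the (B, 3, T, 224, 224) input shape, and the encode_faces helper are my assumptions, not your code:

```python
import torch
from marlin_pytorch import Marlin

# Load a pretrained MARLIN encoder (ViT-B, per the marlin-pytorch README).
marlin = Marlin.from_online("marlin_vit_base_ytf")

def encode_faces(face_sequences: torch.Tensor) -> torch.Tensor:
    # face_sequences: assumed (B, 3, T, 224, 224) clip of T cropped face
    # frames, i.e. MARLIN's T-frame window in place of Wav2Lip's 5 frames.
    feats = marlin.extract_features(face_sequences)
    # Only this final output of extract_features reaches the decoder;
    # no intermediate transformer-block outputs are kept.
    return feats
```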

OR

  • Intermediate feature storage: the extract_features method is modified to store selected intermediate outputs from the transformer blocks, so that the number of stored features matches the number of CNN decoder blocks.
  • Integration with the decoder blocks: during the forward pass of the Wav2Lip model, the decoder blocks process the audio embeddings.
  • At each decoder block, the corresponding intermediate feature map from face_features is concatenated with the current decoder output.
  • The features are accessed in reverse order to match the original processing sequence (sketched below).
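
If it was the latter, I imagine a U-Net-style fusion like the sketch below. MarlinWithTaps, the tap_indices, and the assumption that the encoder exposes patch_embed and blocks (as VideoMAE-style ViTs do) are all mine, and whatever projection/reshaping is needed so the token features match the conv decoder activations is elided:

```python
import torch
import torch.nn as nn

class MarlinWithTaps(nn.Module):
    """Store selected intermediate transformer-block outputs so their
    count matches the number of CNN decoder blocks."""
    def __init__(self, encoder: nn.Module, tap_indices):
        super().__init__()
        self.encoder = encoder
        self.tap_indices = set(tap_indices)  # e.g. {2, 5, 8, 11} for 4 decoder blocks

    def forward(self, x: torch.Tensor):
        face_features = []
        h = self.encoder.patch_embed(x)               # tokenize the face clip
        for i, blk in enumerate(self.encoder.blocks):
            h = blk(h)
            if i in self.tap_indices:
                face_features.append(h)               # keep this intermediate output
        return face_features                          # len == number of decoder blocks

def decode_with_skips(audio_embedding, face_features, decoder_blocks):
    # Features are consumed in reverse order, so the deepest encoder
    # output is fused with the first (coarsest) decoder block.
    x = audio_embedding
    for block, skip in zip(decoder_blocks, reversed(face_features)):
        x = block(x)
        x = torch.cat([x, skip], dim=1)               # concat, as in Wav2Lip's decoder
    return x
```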