This project is based on SoundSketch and implements Wav2Lip for video lip synthesis: given a source video and a driving audio track, it generates lip shapes synchronized to the voice. A configurable enhancement step sharpens the synthesized lip (or full-face) region to improve the clarity of the generated lips, and the DAIN deep-learning frame-interpolation algorithm adds intermediate frames to smooth the lip-motion transitions between frames, making the synthesized lip shapes smoother, more realistic, and more natural.
```bash
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements.txt

# If you want to use the DAIN model for frame interpolation, you also need to install PaddlePaddle.
# CUDA 11.2
python -m pip install paddlepaddle-gpu==2.3.2.post112 \
    -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
```
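As a quick sanity check after installation, a small stdlib-only sketch like the following can report which of the required packages are importable (the names are import names, e.g. `paddle` for `paddlepaddle-gpu`; this helper is not part of the project):

```python
import importlib.util

def check_deps(names=("torch", "torchvision", "torchaudio", "paddle")):
    """Return {import name: importable?} for each required package."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for pkg, ok in check_deps().items():
        print(f"{pkg}: {'installed' if ok else 'MISSING'}")
```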
```
SoundSketch-Video-LipSync
├── checkpoints
│   ├── BFM_Fitting
│   ├── DAIN_weight
│   ├── hub
│   ├── ...
├── dian_output
│   ├── ...
├── examples
│   ├── audio
│   ├── video
├── results
│   ├── ...
├── src
│   ├── ...
├── sync_show
├── third_part
│   ├── ...
├── ...
├── inference.py
├── README.md
```
```bash
python inference.py --driven_audio <audio.wav> \
                    --source_video <video.mp4> \
                    --enhancer <none,lip,face> \
                    --use_DAIN \
                    --time_step 0.5
```

- `--enhancer`: enhancement region, one of `none`, `lip`, `face` (default: `lip`)
- `--use_DAIN`: enable DAIN frame interpolation; this uses a large amount of GPU memory and is time-consuming
- `--time_step`: frame-interpolation step; default `0.5`, i.e. 25fps -> 50fps; `0.25` gives 25fps -> 100fps
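To drive the same CLI from a script, e.g. to batch-process several audio/video pairs, the command line above can be assembled programmatically. This is only a sketch that assumes the flags documented here, nothing beyond them:

```python
import subprocess

def build_cmd(audio, video, enhancer="lip", use_dain=False, time_step=0.5):
    """Assemble the inference.py invocation documented above."""
    cmd = ["python", "inference.py",
           "--driven_audio", str(audio),
           "--source_video", str(video),
           "--enhancer", enhancer]
    if use_dain:
        cmd += ["--use_DAIN", "--time_step", str(time_step)]
    return cmd

def run_pairs(pairs, dry_run=True):
    """Print (and optionally run) one inference per (audio, video) pair."""
    for audio, video in pairs:
        cmd = build_cmd(audio, video, use_dain=True)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```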
The synthesized results are saved in the `./sync_show` directory:

- `original.mp4`: the original video
- `sync_none.mp4`: synthesis without any enhancement
- `none_dain_50fps.mp4`: DAIN frame interpolation only, 25fps -> 50fps
- `lip_dain_50fps.mp4`: lip-area enhancement for a clearer lip shape + DAIN interpolation, 25fps -> 50fps
- `face_dain_50fps.mp4`: full-face enhancement for a clearer lip shape + DAIN interpolation, 25fps -> 50fps

Videos generated by the different methods for comparison:

- `our.mp4`: generated by this project (SoundSketch-Video-LipSync)
- `SoundSketch.mp4`: generated by the full SoundSketch pipeline
- `retalking.mp4`: generated by VideoReTalking
- `wav2lip.mp4`: generated by Wav2Lip
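The relation between `--time_step` and the output frame rates in the file names above is simple: DAIN synthesizes a frame at each fractional timestamp, so the output rate is the input rate divided by the time step. As a one-line sketch:

```python
def output_fps(input_fps: float, time_step: float) -> float:
    """DAIN inserts a frame every time_step fraction, so the rate scales by 1/time_step."""
    return input_fps / time_step

print(output_fps(25, 0.5))   # 50.0
print(output_fps(25, 0.25))  # 100.0
```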
lip_sync.mp4
When the videos are spliced together for display, the frame rate is unified to 25fps, so the effect of frame interpolation is not visible here; for details, compare the individual videos in the `./sync_show` directory.

Comparison of this project's lip synthesis with SoundSketch, VideoReTalking, and Wav2Lip:
| our | SoundSketch |
|---|---|
| our_sync.mp4 | sadtalker_sync.mp4 |

| retalking | wav2lip |
|---|---|
| retalking_sync.mp4 | wa2lip_sync.mp4 |
The videos shown in this README have been resized; to compare the originals, view the synthesized videos of each category in the `./sync_show` directory.

The pretrained checkpoints directory should look like this:
```
├── checkpoints
│   ├── BFM_Fitting
│   ├── DAIN_weight
│   ├── hub
│   ├── auido2exp_00300-model.pth
│   ├── auido2pose_00140-model.pth
│   ├── epoch_20.pth
│   ├── facevid2vid_00189-model.pth.tar
│   ├── GFPGANv1.3.pth
│   ├── GPEN-BFR-512.pth
│   ├── mapping_00109-model.pth.tar
│   ├── ParseNet-latest.pth
│   ├── RetinaFace-R50.pth
│   ├── shape_predictor_68_face_landmarks.dat
│   ├── wav2lip.pth
```
Download the pretrained checkpoints from one of the following:

- Baidu Netdisk: https://pan.baidu.com/s/15-zjk64SGQnRT9qIduTe2A (extraction code: klfv)
- Google Drive: https://drive.google.com/file/d/1lW4mf5YNtS4MAD7ZkAauDDWp2N3_Qzs7/view?usp=sharing
- Quark Netdisk: https://pan.quark.cn/s/2a1042b1d046 (extraction code: zMBP)

```bash
# Download the archive and extract it into the project root
# (extraction is required for the Google Drive and Quark Netdisk downloads).
cd SoundSketch-Video-LipSync
tar -zxvf checkpoints.tar.gz
```
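After extracting, a short sketch like the following can confirm that the expected checkpoint files are present. The file list mirrors the tree above, and `checkpoints` as the directory name follows the project layout; the helper itself is not part of the project:

```python
from pathlib import Path

EXPECTED = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "epoch_20.pth",
    "facevid2vid_00189-model.pth.tar",
    "GFPGANv1.3.pth",
    "GPEN-BFR-512.pth",
    "mapping_00109-model.pth.tar",
    "ParseNet-latest.pth",
    "RetinaFace-R50.pth",
    "shape_predictor_68_face_landmarks.dat",
    "wav2lip.pth",
]

def missing_checkpoints(ckpt_dir="checkpoints"):
    """Return the expected checkpoint files that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    print("All checkpoints found." if not missing else f"Missing: {missing}")
```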
- SadTalker-Video-Lip-Sync: https://github.com/Zz-ww/SadTalker-Video-Lip-Sync
- SADTalker: https://github.com/Winfredy/SADTalker
- VideoReTalking: https://github.com/vinthony/video-retalking
- DAIN: https://arxiv.org/abs/1904.00830
- PaddleGAN: https://github.com/PaddlePaddle/PaddleGAN