
A Microbenchmark for Talking-Face Synthesis

This repository contains the datasets and testing scripts for talking-face synthesis.

A microbenchmark lets researchers evaluate new algorithms quickly. This repository is easy to customize and can be applied to other audio-visual talking-face datasets.

Datasets

The benchmark contains three videos of English speakers and three videos of Chinese speakers.

File Structure

├── driving_audios
│   ├── [9.3M] may_english_audio.aac
│   ├── [3.3M] macron_english_trim_audio.aac
│   ├── [3.5M] obama1_english_audio.aac
│   ├── [780K] laoliang_chinese_50s_audio.mp3
│   ├── [4.3M] luoxiang_chinese_audio.mp3
│   └── [8.9M] zuijiapaidang_chinese_audio.mp3
├── source_images
│   ├── [294K] may.png
│   ├── [202K] macron.png
│   ├── [213M] obama1.png
│   ├── [206K] zuijiapaidang.png
│   ├── [175K] luoxiang.png
│   └── [204K] laoliang.png
└── reference_videos
    ├── [56M] obama1_english.mp4, 00:03:38.16, 25fps, 450x450, 46 sentences
    ├── [96M] may_english.mp4, 00:04:02.97, 25fps, 512x512, 35 sentences
    ├── [24M] macron_english_trim.mp4, 00:03:31.92, 25fps, 512x512, 49 sentences
    ├── [3.6M] laoliang_chinese_50s.mp4, 00:00:49.85, 30fps, 410x380, 40 sentences
    ├── [14M] luoxiang_chinese.mp4, 00:04:40.01, 25fps, 350x500, 32 sentences
    └── [28M] zuijiapaidang_chinese.mp4, 00:09:41.98, 30fps, 460x450, 85 sentences

English Speakers
obama1_english.mp4 may_english.mp4 macron_english_trim.mp4
<iframe src="https://drive.google.com/file/d/1g-T1nvL0KqBkInIRVSSbOvmC1LiCB36o/preview"></iframe> <iframe src="https://drive.google.com/file/d/1UMQZP7j8ORLJpHYiUMc-FexDp_SX7386/preview"></iframe> <iframe src="https://drive.google.com/file/d/1ReG45fm8wnz_a3ZJ3qOhPJGgS8LywKaS/preview"></iframe>
Chinese Speakers
laoliang_chinese_50s.mp4 luoxiang_chinese.mp4 zuijiapaidang_chinese.mp4
<iframe src="https://drive.google.com/file/d/1jk9gX2R7KcD_Q2WF-zs7e2Es3lfKBCpK/preview"></iframe> <iframe src="https://drive.google.com/file/d/1d1haMYyA9mH0Wc1NgkEAuHtk30KpLJME/preview"></iframe> <iframe src="https://drive.google.com/file/d/1H-DhAj2K8EESbCUWvr6ylcUqKIFVJ94k/preview"></iframe>

Benchmark

To measure the performance of Wav2Lip and SadTalker, we run them on all videos and evaluate the outputs with the following metrics:

  • Sync↑: the confidence score from SyncNet (lip-sync);
  • PSNR↑: peak signal-to-noise ratio (identity preservation);
  • SSIM↑: structural similarity index (identity preservation);
  • FID↓: Fréchet inception distance (image quality).
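As an illustration of what the PSNR metric measures, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), for 8-bit pixel values. It is not the benchmark's own implementation (the scores in this README come from off-the-shelf tools); the function name `psnr` and the toy pixel values are assumptions made for the example.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Return PSNR in dB between two equal-length sequences of pixel values.

    Hypothetical helper for illustration only; real pipelines compute this
    per frame over full images and then average across frames.
    """
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    if not reference:
        raise ValueError("frames must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "frames" (made-up values, not benchmark data).
reference = [52, 55, 61, 59]
distorted = [54, 55, 60, 57]
print(round(psnr(reference, distorted), 2))
```

Higher values mean the generated frame is numerically closer to the reference frame, which is why PSNR is marked with ↑ above.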

Implementation (off-the-shelf tools)

  1. Sync: syncnet_python
  2. PSNR, SSIM: ffmpeg-quality-metrics
  3. FID, IS: IQA-PyTorch
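For readers who want to see what a PSNR or SSIM comparison looks like at the FFmpeg level, the sketch below builds a command line that uses FFmpeg's built-in `psnr` and `ssim` filters directly. This is a swapped-in illustration, not the ffmpeg-quality-metrics invocation used for the numbers in this README. The helper name `ffmpeg_metric_command` is hypothetical, and pairing `obama1_Wav2Lip.mp4` with `obama1_english.mp4` is an assumption about which files are compared. The filters expect both inputs to have the same resolution, so outputs may need rescaling first.

```python
import shlex


def ffmpeg_metric_command(distorted, reference, metric):
    """Build (but do not run) an ffmpeg command comparing two videos.

    Hypothetical helper: `metric` selects FFmpeg's built-in `psnr` or `ssim`
    filter, which prints per-stream averages to the log when ffmpeg finishes.
    """
    if metric not in ("psnr", "ssim"):
        raise ValueError("metric must be 'psnr' or 'ssim'")
    return [
        "ffmpeg",
        "-i", distorted,                      # input 0: generated video
        "-i", reference,                      # input 1: ground-truth video
        "-lavfi", f"[0:v][1:v]{metric}",      # compare the two video streams
        "-f", "null", "-",                    # discard the decoded output
    ]


cmd = ffmpeg_metric_command("obama1_Wav2Lip.mp4", "obama1_english.mp4", "psnr")
print(shlex.join(cmd))
```

The command list can be passed to `subprocess.run` on a machine where FFmpeg is installed; it is only printed here so the example runs without FFmpeg.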

Qualitative Results for One-shot Pipelines

English Speakers
obama1_Wav2Lip.mp4
PSNR: 32.287, SSIM: 0.951, FID: 18.993
may_Wav2Lip.mp4
PSNR: 32.572, SSIM: 0.936, FID: 33.941
macron_Wav2Lip.mp4
PSNR: 35.737, SSIM: 0.969, FID: 6.121
<iframe src="https://drive.google.com/file/d/159jlICcQEs5A-_bxnH752fjL49P4uzuw/preview"></iframe> <iframe src="https://drive.google.com/file/d/195V0U8rjnce4aujAI2AZhpCwqKddXHGA/preview"></iframe> <iframe src="https://drive.google.com/file/d/1Z0bIbqmVgNdECxgYLedUPVpW6uwquE1z/preview"></iframe>
Chinese Speakers
laoliang_Wav2Lip.mp4
PSNR: 31.444, SSIM: 0.939, FID: 19.192
luoxiang_Wav2Lip.mp4
PSNR: 34.367, SSIM: 0.971, FID: 23.631
zuijiapaidang_Wav2Lip.mp4
PSNR: 20.364, SSIM: 0.783, FID: 49.04
<iframe src="https://drive.google.com/file/d/1SKfceJZ_142bETjqc-FyCtem-SSFlWI4/preview"></iframe> <iframe src="https://drive.google.com/file/d/15Dt0-5rRbWiYDW4GuzfZGxK8ndjk2MOy/preview"></iframe> <iframe src="https://drive.google.com/file/d/12iFMIexJkpG9dDmatfFD9yd-LG-bk1dw/preview"></iframe>
English Speakers
obama1_SadTalker.mp4
PSNR: 20.587, SSIM: 0.754, FID: 24.051
may_SadTalker.mp4
PSNR: 19.211, SSIM: 0.701, FID: 46.182
macron_SadTalker.mp4
PSNR: 18.729, SSIM: 0.763, FID: 98.982
<iframe src="https://drive.google.com/file/d/1xw0gsxCIGJOKpdAudHM1M5mc7qFaQnBv/preview"></iframe> <iframe src="https://drive.google.com/file/d/1wAFcDyK_Yma4pBHNQZAUJzWEzIsL6rS0/preview"></iframe> <iframe src="https://drive.google.com/file/d/1y8NmIkXmgCXYKXxJKAEhYwjsh1LSiTiq/preview"></iframe>
Chinese Speakers
laoliang_SadTalker.mp4
PSNR: 18.536, SSIM: 0.672, FID: 52.362
luoxiang_SadTalker.mp4
PSNR: 14.363, SSIM: 0.598, FID: 104.221
zuijiapaidang_SadTalker.mp4
PSNR: 17.359, SSIM: 0.725, FID: 4.781
<iframe src="https://drive.google.com/file/d/1i5fu_iYkg98a6vRvPw7tg8Z2mRvp4PV3/preview"></iframe> <iframe src="https://drive.google.com/file/d/1Ln5WBpa2PMWT0vDMfB0M_Una_o5j2QL3/preview"></iframe> <iframe src="https://drive.google.com/file/d/1m8itAbvVVi5kx67_00mUo7vpTGs0gwpw/preview"></iframe>

Quantitative Results for One-shot Pipelines

English Speakers Chinese Speakers
Pipeline Sync↑ PSNR↑ SSIM↑ FID↓ Pipeline Sync↑ PSNR↑ SSIM↑ FID↓
Wav2Lip xxx 33.532 0.952 19.685 Wav2Lip xxx 28.725 0.897 30.621
SadTalker xxx 19.509 0.739 56.407 SadTalker xxx 16.753 0.665 68.120
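The Wav2Lip row for English speakers matches the arithmetic mean of the three per-video scores listed in the qualitative results, so averaging over videos appears to be the aggregation used. That is an inference from the numbers, not something the README states. A minimal sketch of the calculation, with the hypothetical helper `mean_scores`:

```python
from statistics import fmean

# Per-video Wav2Lip scores for the English speakers, copied from the
# qualitative results in this README.
wav2lip_english = {
    "obama1": {"PSNR": 32.287, "SSIM": 0.951, "FID": 18.993},
    "may": {"PSNR": 32.572, "SSIM": 0.936, "FID": 33.941},
    "macron": {"PSNR": 35.737, "SSIM": 0.969, "FID": 6.121},
}


def mean_scores(per_video):
    """Average each metric over all videos, rounded to three decimals."""
    metrics = next(iter(per_video.values())).keys()
    return {
        metric: round(fmean(scores[metric] for scores in per_video.values()), 3)
        for metric in metrics
    }


print(mean_scores(wav2lip_english))
# {'PSNR': 33.532, 'SSIM': 0.952, 'FID': 19.685}
```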

Because NeRF-based renderers (GeneFace and ER-NeRF) are person-dependent, we train each of them separately on the first 3 minutes of the macron video and of the zuijiapaidang video.
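One way to cut the first 3 minutes out of a reference video is FFmpeg's `-t` option with stream copy. The sketch below is an assumption about how such a split could be produced, not the authors' script: the helpers `to_seconds` and `trim_command` and the output name `macron_english_train.mp4` are hypothetical. It also works out how much footage is left after the cut, using the durations from the file structure.

```python
def to_seconds(timestamp):
    """Convert 'HH:MM:SS.ss' or 'MM:SS.ss' to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def trim_command(source, target, seconds=180):
    """Build (but do not run) an ffmpeg command keeping the first `seconds`.

    With `-c copy` the streams are not re-encoded, so the cut lands on the
    nearest keyframe rather than on an exact frame.
    """
    return ["ffmpeg", "-i", source, "-t", str(seconds), "-c", "copy", target]


# Durations taken from the reference_videos listing.
durations = {
    "macron_english_trim.mp4": "00:03:31.92",
    "zuijiapaidang_chinese.mp4": "09:41.98",
}

for name, stamp in durations.items():
    remaining = round(to_seconds(stamp) - 180, 2)
    print(name, "seconds left after the first 3 minutes:", remaining)

print(trim_command("macron_english_trim.mp4", "macron_english_train.mp4"))
```

For the macron video this leaves roughly half a minute beyond the training portion, while the zuijiapaidang video leaves several minutes.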

Qualitative Results for Few-shot Pipelines

English Speakers
macron_GeneFace.mp4 macron_ER-NeRF.mp4
Chinese Speakers
zuijiapaidang_GeneFace.mp4 zuijiapaidang_ER-NeRF.mp4

Quantitative Results for Few-shot Pipelines

macron (English) zuijiapaidang (Chinese)
Pipeline Sync↑ PSNR↑ SSIM↑ FID↓ IS↑ Pipeline Sync↑ PSNR↑ SSIM↑ FID↓ IS↑
GeneFace xxx xxx xxx xxx xxx GeneFace xxx xxx xxx xxx xxx
ER-NeRF xxx xxx xxx xxx xxx ER-NeRF xxx xxx xxx xxx xxx

External Links

  1. Extract Frames using FFmpeg: A Comprehensive Guide
  2. Whisper Web: ML-powered speech recognition directly in your browser
  3. moviepy.video.fx.all.crop
  4. Trim Video: Trim or cut video of any format