
Computer Vision - AV HuBERT Research

About

This is my submission for the Computer Vision module of the University of Portsmouth MEng Computer Science course. This research explores the SOTA AV-HuBERT model, which is used for ASR via lip reading and also works as a useful feature extractor over input video frames for audio-visual tasks.

This research explores using these extracted features for phoneme prediction and mel-spectrogram synthesis. Phoneme classification confusion is assessed to determine where the classifiers fall short.
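The confusion analysis itself lives in main.ipynb (see Directory Breakdown below) and isn't reproduced here. As a minimal sketch, assuming frame-level ground-truth and predicted phoneme IDs are available as integer arrays (the arrays below are hypothetical placeholders), the per-phoneme confusion can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical inputs: frame-level ARPABET phoneme IDs (ground truth vs. predicted).
y_true = np.array([3, 3, 7, 12, 12, 12, 7])
y_pred = np.array([3, 7, 7, 12, 12, 3, 7])

# Row-normalised confusion matrix: cm[i, j] is the fraction of frames of
# phoneme i that the classifier labelled as phoneme j.
cm = confusion_matrix(y_true, y_pred, normalize="true")

# The most-confused pair is the largest off-diagonal entry.
off_diag = cm - np.diag(np.diag(cm))
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"most confused: true class index {i} -> predicted class index {j}")
```

Row-normalising makes the off-diagonal entries directly interpretable as per-phoneme error rates, which is what "where classifiers fall short" asks.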

Link to: Paper

Directory Breakdown

  • av_hubert/: Meta AV HuBERT submodule
  • stable-ts/: OpenAI Whisper with word-level timestamp generation
  • lib/: Collection of utility files used for dataset preprocessing and dataset source downloading
  • split.py: Splits a source MP4 video into 10-second clips, since the AV HuBERT model works best on clips of up to 10 seconds (a minimal sketch of the idea follows this list).
  • main.ipynb: Contains all of the initial experimental code for this project:
    • AV HuBERT Feature Extraction (BASE, Self-Trained LARGE): Generates features for 10-second clips
    • SKLearn and PyTorch classifier training code
    • Dataset Handling Code (Load phonemes, audio features, raw dlib facial landmarks, OpenAI Whisper Large word-level timestamps)
    • Auxiliary mel-spectrogram prediction experiments for more robust training
  • base_vox_433h.pt: AV HuBERT BASE model
  • self_large_vox_433h.pt: Self-Trained AV HuBERT LARGE model (Best performing)
  • phoneme_dict.txt: ARPABET phoneme dictionary
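split.py's exact implementation isn't reproduced here. As a minimal sketch of the splitting step it performs, assuming the ffmpeg binary is on PATH (the function name below is hypothetical), ffmpeg's segment muxer can cut an MP4 into 10-second clips:

```python
import subprocess
from pathlib import Path

def split_into_clips(src: str, out_dir: str, clip_secs: int = 10) -> None:
    """Cut src into clip_secs-long MP4 segments using ffmpeg's segment muxer."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c", "copy",                    # stream copy: no re-encoding
            "-f", "segment",                 # segment muxer
            "-segment_time", str(clip_secs), # target clip length in seconds
            "-reset_timestamps", "1",        # each clip starts at t=0
            str(Path(out_dir) / "clip_%04d.mp4"),
        ],
        check=True,
    )

split_into_clips("lecture.mp4", "clips/")
```

With -c copy no re-encoding happens, so cuts snap to the nearest keyframe and clip lengths are only approximately 10 s; re-encoding (e.g. -c:v libx264) gives exact cut points at the cost of speed.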

Models

Different models are explored over the dlib and AV HuBERT features:

  1. PyTorch deep neural network: a single hidden layer (256 or 512 hidden dimensions) with ReLU activation, followed by a softmax output. An additional projection from the hidden layer predicts mel-spectrogram features (see the sketch after this list).
  2. Support Vector Machine (linear, radial basis, polynomial, and sigmoid kernels)
  3. Random Forest
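The notebook's training code isn't reproduced here. A minimal sketch of the model described in item 1, assuming 768-dim AV HuBERT BASE features, 39 ARPABET phoneme classes, and 80 mel bins (all three sizes are illustrative assumptions, as is the class name):

```python
import torch
import torch.nn as nn

class PhonemeNet(nn.Module):
    """One-hidden-layer classifier with an auxiliary mel-spectrogram head."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 256,
                 n_phonemes: int = 39, n_mels: int = 80):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.phoneme_head = nn.Linear(hidden_dim, n_phonemes)  # logits for softmax
        self.mel_head = nn.Linear(hidden_dim, n_mels)          # auxiliary projection

    def forward(self, x):
        h = self.hidden(x)
        return self.phoneme_head(h), self.mel_head(h)

model = PhonemeNet()
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

feats = torch.randn(8, 768)            # a batch of per-frame visual features
phonemes = torch.randint(0, 39, (8,))  # frame-level phoneme targets
mels = torch.randn(8, 80)              # frame-level mel-spectrogram targets

logits, mel_pred = model(feats)
loss = ce(logits, phonemes) + mse(mel_pred, mels)  # joint objective
loss.backward()
```

Softmax is applied implicitly inside nn.CrossEntropyLoss. The baselines from items 2 and 3 can be trained on the same feature matrices via sklearn.svm.SVC(kernel=...) and sklearn.ensemble.RandomForestClassifier.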

Visual Features

This work explores two main types of visual features:

  1. AV HuBERT Embeddings (generated from the fine-tuned checkpoints base_vox_433h and self_large_vox_433h, pre-trained on LRS3 and VoxCeleb2)
    • BASE (768 dim)
    • Self-Trained LARGE (1024 dim)
  2. Base dlib facial landmarks
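The notebook's embedding extraction isn't reproduced here. A minimal sketch following the upstream AV-HuBERT feature-extraction pattern, assuming the avhubert modules are importable (e.g. by running from within av_hubert/avhubert) and that the video is already cropped to the mouth region; the input tensor below is a placeholder:

```python
import torch
from fairseq import checkpoint_utils

# Importing these registers the AV-HuBERT task/model with fairseq
# (per the upstream README; run from within av_hubert/avhubert).
import hubert_pretraining, hubert  # noqa: F401

ckpt_path = "self_large_vox_433h.pt"
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
# Fine-tuned checkpoints wrap the encoder in a seq2seq model; unwrap it.
model = models[0].encoder.w2v_model if hasattr(models[0], "decoder") else models[0]
model.eval()

# Placeholder input: 1-channel 88x88 mouth-ROI crops, shaped (B, C, T, H, W).
frames = torch.randn(1, 1, 250, 88, 88)  # ~10 s of video at 25 fps

with torch.no_grad():
    # Video-only features: the audio stream is simply omitted.
    feature, _ = model.extract_finetune(
        source={"video": frames, "audio": None},
        padding_mask=None,
        output_layer=None,
    )
print(feature.shape)
```

The final dimension of feature should match the embedding sizes listed above: 768 for BASE, 1024 for Self-Trained LARGE.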

Datasets

Two datasets are used for this work:

  1. Jordan Peterson Lecture (30fps): duration of roughly 11 min 24 s (684 s), giving a sequence length of roughly 20,500 frames (684 s × 30 fps).
  2. Jordan Peterson (24fps), "The False Appeal of Communism": a Shorts clip of Jordan Peterson discussing communism, chosen for the variety of phonemes present within the dataset. Duration of roughly 51 seconds, giving a sequence length of 1,233 frames.
