- The University of Texas at Austin
- Austin, TX
- https://jasonppy.github.io/
- @PuyuanPeng
- in/puyuan-peng-a5ab8a29b
Highlights
- Pro
Stars
Official Code for SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Ego4DSounds: A diverse egocentric dataset with high action-audio correspondence
🦇 Encoder of BAT (Learning to Reason about Spatial Sounds with Large Language Models)
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [EMNLP 2024]
Inference and training library for high-quality TTS models.
Practice tasks for the CompLING lab internship application.
The official repository of Dynamic-SUPERB.
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Code for SpeechTokenizer, presented in the paper "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models."
The best way to write secure and reliable applications. Write nothing; deploy nowhere.
An open source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io/vallex/
Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"
INTERSPEECH 2023-2024 Papers: A complete collection of influential and exciting research papers from the INTERSPEECH 2023 and 2024 conferences. Explore the latest advances in speech and language processing.
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation model.
Prompting Whisper for Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation
Syllable Segmentation and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".
Phoneme segmentation using pre-trained speech models
Layer-wise analysis of self-supervised pre-trained speech representations