Automatically update text-to-speech (TTS) papers daily using GitHub Actions (updated every 12 hours)

liutaocode/TTS-arxiv-daily

Updated on 2024.11.15

Usage instructions: here

This page is modified from here

Table of Contents
  1. TTS

TTS

Publish Date Title Authors PDF Code
2024-11-12 Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models Dongrui Han et.al. 2411.07563 null
2024-11-11 Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities Snehasish Paul Shivali Chauhan et.al. 2411.06970 null
2024-11-10 Debatts: Zero-Shot Debating Text-to-Speech Synthesis Yiqiao Huang et.al. 2411.06540 null
2024-11-07 CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR Kadir Burak Buldu et.al. 2411.04671 null
2024-11-04 EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector Deok-Hyeon Cho et.al. 2411.02625 link
2024-11-09 Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis Shijia Liao et.al. 2411.01156 link
2024-10-31 Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Ioannis Tsiamas et.al. 2410.24019 null
2024-10-30 Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis Théodor Lemerle et.al. 2410.23320 link
2024-10-29 Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech Eric Battenberg et.al. 2410.22179 null
2024-10-29 Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding Bohan Li et.al. 2410.21951 null
2024-10-29 RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis Kehan Sui et.al. 2410.21641 null
2024-10-28 Asynchronous Tool Usage for Real-Time Agents Antonio A. Ginart et.al. 2410.21620 null
2024-10-28 Enhancing TTS Stability in Hebrew using Discrete Semantic Units Ella Zeldes et.al. 2410.21502 null
2024-10-28 Mitigating Unauthorized Speech Synthesis for Voice Protection Zhisheng Zhang et.al. 2410.20742 link
2024-10-27 Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation Maohao Shen et.al. 2410.20336 null
2024-10-24 Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis Suparna De et.al. 2410.19199 null
2024-10-24 STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin et.al. 2410.18607 link
2024-10-24 Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts ChaeHun Park et.al. 2410.18444 null
2024-10-23 ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams Srija Anand et.al. 2410.17901 null
2024-10-22 Continuous Speech Tokenizer in Text To Speech Yixing Li et.al. 2410.17081 null
2024-10-22 Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap Guanrou Yang et.al. 2410.16726 null
2024-10-21 Continuous Speech Synthesis using per-token Latent Diffusion Arnon Turetzky et.al. 2410.16048 null
2024-10-18 A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages Sujitha Sathiyamoorthy et.al. 2410.14197 null
2024-10-18 Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech Shuwei He et.al. 2410.14101 link
2024-10-17 Enhancing Crowdsourced Audio for Text-to-Speech Models José Giraldo et.al. 2410.13357 null
2024-10-17 DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech Jan Melechovsky et.al. 2410.13342 null
2024-10-17 DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis Yu Gu et.al. 2410.13288 null
2024-10-17 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation Sreyan Ghosh et.al. 2410.13198 null
2024-10-16 ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs Rui-Chen Zheng et.al. 2410.12359 null
2024-10-14 IsoChronoMeter: A simple and effective isochronic translation evaluation metric Nikolai Rozanov et.al. 2410.11127 null
2024-10-14 DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization Yinghao Aaron Li et.al. 2410.11097 null
2024-10-12 Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling Rui Liu et.al. 2410.09524 null
2024-10-10 Unsupervised Data Validation Methods for Efficient Model Training Yurii Paniv et.al. 2410.07880 null
2024-10-15 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching Yushen Chen et.al. 2410.06885 link
2024-10-09 Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch Teodora Răgman et.al. 2410.06787 null
2024-10-09 Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS Onkar Kishor Susladkar et.al. 2410.06608 null
2024-10-09 Can DeepFake Speech be Reliably Detected? Hongbin Liu et.al. 2410.06572 null
2024-10-07 SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech Minchan Kim et.al. 2410.04690 null
2024-10-06 HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis Yuto Nishimura et.al. 2410.04380 null
2024-10-10 SONAR: A Synthetic AI-Audio Detection Framework and Benchmark Xiang Li et.al. 2410.04324 link
2024-10-05 Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System Ze Li et.al. 2410.04017 null
2024-10-01 Recent Advances in Speech Language Models: A Survey Wenqian Cui et.al. 2410.03751 null
2024-10-04 Generative Semantic Communication for Text-to-Speech Synthesis Jiahao Zheng et.al. 2410.03459 null
2024-10-04 Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens Jinzheng Zhao et.al. 2410.03298 null
2024-10-04 Narrative Player: Reviving Data Narratives with Visuals Zekai Shao et.al. 2410.03268 null
2024-10-04 MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech Taejun Bak et.al. 2410.03192 null
2024-10-01 Augmentation through Laundering Attacks for Audio Spoof Detection Hashim Ali et.al. 2410.01108 null
2024-10-01 Zero-Shot Text-to-Speech from Continuous Text Streams Trung Dang et.al. 2410.00767 null
2024-10-01 EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control Haozhe Chen et.al. 2410.00316 link
2024-09-30 Word-wise intonation model for cross-language TTS systems Tomilov A. A. et.al. 2409.20374 null
2024-09-27 Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim et.al. 2409.18622 null
2024-09-26 Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control Ryuichi Yamamoto et.al. 2409.17452 null
2024-09-25 Exploring synthetic data for cross-speaker style transfer in style representation based TTS Lucas H. Ueda et.al. 2409.17364 null
2024-09-25 Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions Kun Zhou et.al. 2409.16681 null
2024-09-25 Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation Siyin Wang et.al. 2409.16644 null
2024-09-24 FastTalker: Jointly Generating Speech and Conversational Gestures from Text Zixin Guo et.al. 2409.16404 null
2024-09-24 Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling Ville Heilala et.al. 2409.16376 null
2024-09-24 Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech Yunji Chu et.al. 2409.16203 null
2024-09-24 NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers Nohil Park et.al. 2409.15760 null
2024-09-24 VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance Jiheum Yeom et.al. 2409.15759 null
2024-09-24 StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis Zhiyong Chen et.al. 2409.15741 null
2024-09-23 A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection Lam Pham et.al. 2409.15180 null
2024-09-23 LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation Hieu-Thi Luong et.al. 2409.14743 null
2024-09-20 Zero-shot Cross-lingual Voice Transfer for TTS Fadi Biadsy et.al. 2409.13910 null
2024-09-20 On the Feasibility of Fully AI-automated Vishing Attacks João Figueiredo et.al. 2409.13793 null
2024-09-19 Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Sebastião Quintas et.al. 2409.12745 null
2024-09-19 Preference Alignment Improves Language Model-Based TTS Jinchuan Tian et.al. 2409.12403 null
2024-09-18 Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference Edresson Casanova et.al. 2409.12117 null
2024-09-18 Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems Anusha Prakash et.al. 2409.11915 null
2024-09-18 DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech Xin Qi et.al. 2409.11835 null
2024-09-18 Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation Haohan Guo et.al. 2409.11630 null
2024-09-17 SpMis: An Investigation of Synthetic Spoken Misinformation Detection Peizhuo Liu et.al. 2409.11308 null
2024-09-19 The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Samee Arif et.al. 2409.11261 link
2024-09-17 Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora Francesco Nespoli et.al. 2409.11107 null
2024-09-16 Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization Xiaoxue Gao et.al. 2409.10157 null
2024-09-16 StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion Yinghao Aaron Li et.al. 2409.10058 null
2024-09-15 Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning Siqi Sun et.al. 2409.09891 null
2024-09-14 E1 TTS: Simple and Fast Non-Autoregressive TTS Zhijun Liu et.al. 2409.09351 null
2024-09-14 Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation Changjin Han et.al. 2409.09311 null
2024-09-14 SafeEar: Content Privacy-Preserving Audio Deepfake Detection Xinfeng Li et.al. 2409.09272 link
2024-09-13 AccentBox: Towards High-Fidelity Zero-Shot Accent Generation Jinzuomu Zhong et.al. 2409.09098 null
2024-09-17 HLTCOE JHU Submission to the Voice Privacy Challenge 2024 Henry Li Xinyuan et.al. 2409.08913 null
2024-09-13 Text-To-Speech Synthesis In The Wild Jee-weon Jung et.al. 2409.08711 null
2024-09-14 Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions Amila Indika et.al. 2409.07945 null
2024-09-12 Full-text Error Correction for Chinese Speech Recognition with Large Language Model Zhiyuan Tang et.al. 2409.07790 null
2024-09-11 SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis Helin Wang et.al. 2409.07556 link
2024-09-11 D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack Hong-Hanh Nguyen-Le et.al. 2409.07390 null
2024-09-11 Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT Kazuki Yamauchi et.al. 2409.07265 null
2024-09-11 Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment Tien-Hong Lo et.al. 2409.07151 null
2024-09-10 Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models Xin Jing et.al. 2409.06451 null
2024-09-10 What happens to diffusion model likelihood when your model is conditional? Mattias Cross et.al. 2409.06364 null
2024-09-10 VoiceWukong: Benchmarking Deepfake Voice Detection Ziwei Yan et.al. 2409.06348 null
2024-09-09 AS-Speech: Adaptive Style For Speech Synthesis Zhipeng Li et.al. 2409.05730 null
2024-09-09 IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS Ashwin Sankar et.al. 2409.05356 link
2024-09-10 Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion Zhengyang Chen et.al. 2409.05004 null
2024-09-01 Sample-Efficient Diffusion for Text-To-Speech Synthesis Justin Lovelace et.al. 2409.03717 link
2024-09-10 LAST: Language Model Aware Speech Tokenization Arnon Turetzky et.al. 2409.03701 null
2024-09-05 FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications Hao-Han Guo et.al. 2409.03283 null
2024-09-04 Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems Jeongmin Liu et.al. 2409.02517 null
2024-09-03 VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka Li-Wei Chen et.al. 2409.01548 null
2024-09-02 A multilingual training strategy for low resource Text to Speech Asma Amalas et.al. 2409.01217 null
2024-09-02 A Framework for Synthetic Audio Conversations Generation using Large Language Models Kaung Myat Kyaw et.al. 2409.00946 null
2024-09-02 SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis Haohan Guo et.al. 2409.00933 link
2024-09-01 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer Yuancheng Wang et.al. 2409.00750 link
2024-08-30 SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection Ismail Rasim Ulgen et.al. 2408.17432 null
2024-08-30 AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge Kirill Borodin et.al. 2408.17352 null
2024-08-30 Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model Zhen Ye et.al. 2408.17175 link
2024-08-30 Utilizing Speaker Profiles for Impersonation Audio Detection Hao Gu et.al. 2408.17009 null
2024-08-29 Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis Zehai Tu et.al. 2408.16373 null
2024-08-28 Multi-modal Adversarial Training for Zero-Shot Voice Cloning John Janiczek et.al. 2408.15916 null
2024-08-29 Easy, Interpretable, Effective: openSMILE for voice deepfake detection Octavian Pascu et.al. 2408.15775 null
2024-08-28 VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling Yixuan Zhou et.al. 2408.15676 link
2024-08-28 VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech Heeseung Kim et.al. 2408.14739 null
2024-08-27 StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech Haowei Lou et.al. 2408.14713 null
2024-08-27 DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance Jinhyeok Yang et.al. 2408.14423 null
2024-08-26 Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard Wonjune Kang et.al. 2408.13970 null
2024-08-28 SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models Dongchao Yang et.al. 2408.13893 null
2024-08-22 Positional Description for Numerical Normalization Deepanshu Gupta et.al. 2408.12430 null
2024-08-22 VoiceX: A Text-To-Speech Framework for Custom Voices Silvan Mertes et.al. 2408.12170 null
2024-08-13 Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation Yinghao Aaron Li et.al. 2408.11849 null
2024-08-20 EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech Xin Qi et.al. 2408.10852 null
2024-08-20 SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS Karl El Hajal et.al. 2408.10771 null
2024-08-20 Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting Hyun Jin Park et.al. 2408.10463 null
2024-08-17 Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition Samuele Cornell et.al. 2408.09215 link
2024-08-14 PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation Sang-Hoon Lee et.al. 2408.07547 link
2024-08-13 SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis Osamu Take et.al. 2408.06858 link
2024-08-13 PRESENT: Zero-Shot Text-to-Prosody Control Perry Lam et.al. 2408.06827 link
2024-08-12 FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks Min Ma et.al. 2408.06227 null
2024-08-11 VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing Chunyu Qiang et.al. 2408.05758 null
2024-08-06 Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training Hawraz A. Ahmad et.al. 2408.03887 null
2024-08-03 ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features Peng Cheng et.al. 2408.01808 link
2024-08-01 Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation Xinhan Di et.al. 2408.00284 null
2024-07-18 Handling Numeric Expressions in Automatic Speech Recognition Christian Huber et.al. 2408.00004 null
2024-07-31 On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Nick Rossenbach et.al. 2407.21476 null
2024-07-29 Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks Mahmoud Salhab et.al. 2407.18571 null
2024-07-25 On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Nick Rossenbach et.al. 2407.17997 null
2024-07-24 Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model Jan Lehečka et.al. 2407.17167 null
2024-07-23 Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments Pai Zhu et.al. 2407.16840 null
2024-07-19 Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 Chun Xu et.al. 2407.14212 null
2024-07-18 Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models Weiqin Li et.al. 2407.13509 null
2024-07-22 TTSDS -- Text-to-Speech Distribution Score Christoph Minixhofer et.al. 2407.12707 link
2024-07-17 Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech Haibin Wu et.al. 2407.12229 link
2024-07-16 A Language Modeling Approach to Diacritic-Free Hebrew TTS Amit Roth et.al. 2407.12206 null
2024-07-17 Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding Chuanhao Sun et.al. 2407.09370 link
2024-07-11 Autoregressive Speech Synthesis without Vector Quantization Lingwei Meng et.al. 2407.08551 null
2024-07-10 Source Tracing of Audio Deepfake Systems Nicholas Klein et.al. 2407.08016 null
2024-07-07 ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation Ruibo Fu et.al. 2407.05421 null
2024-07-09 CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens Zhihao Du et.al. 2407.05407 null
2024-07-04 Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis Cong-Thanh Do et.al. 2407.04047 null
2024-07-04 Optimizing a-DCF for Spoofing-Robust Speaker Verification Oğuzhan Kurnaz et.al. 2407.04034 null
2024-07-04 On the Effectiveness of Acoustic BPE in Decoder-Only TTS Bohan Li et.al. 2407.03892 null
2024-07-14 CATT: Character-based Arabic Tashkeel Transformer Faris Alasmary et.al. 2407.03236 link
2024-07-02 Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization Yuchen Hu et.al. 2407.02243 null
2024-07-02 TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations Xiaoxue Gao et.al. 2407.01927 null
2024-07-01 Lightweight Zero-shot Text-to-Speech with Mixture of Adapters Kenichi Fujita et.al. 2407.01291 null
2024-06-30 NAIST Simultaneous Speech Translation System for IWSLT 2024 Yuka Ko et.al. 2407.00826 null
2024-06-30 An Attribute Interpolation Method in Speech Synthesis by Model Merging Masato Murata et.al. 2407.00766 null
2024-06-30 FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis Yinlin Guo et.al. 2407.00753 null
2024-07-02 Open-Source Conversational AI with SpeechBrain 1.0 Mirco Ravanelli et.al. 2407.00463 null
2024-06-27 Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models Borodin Kirill Nikolayevich et.al. 2406.19243 null
2024-06-27 DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability Hyun Joon Park et.al. 2406.19135 link
2024-06-26 Automatic Speech Recognition for Hindi Anish Saha et.al. 2406.18135 null
2024-06-26 A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons Tzu-Yun Hung et.al. 2406.18089 null
2024-06-29 LLM-Driven Multimodal Opinion Expression Identification Bonian Jia et.al. 2406.18088 null
2024-06-26 E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS Sefik Emre Eskimez et.al. 2406.18009 link
2024-06-25 Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment Paarth Neekhara et.al. 2406.17957 null
2024-06-22 A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge Xiaopeng Wang et.al. 2406.17801 null
2024-06-25 High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model Joun Yeop Lee et.al. 2406.17310 null
2024-06-25 Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation Yingting Li et.al. 2406.17257 null
2024-06-24 Exploring the Capability of Mamba in Speech Applications Koichi Miyazaki et.al. 2406.16808 null
2024-06-25 Towards Zero-Shot Text-To-Speech for Arabic Dialects Khai Duy Doan et.al. 2406.16751 null
2024-06-22 TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers Yakun Song et.al. 2406.15752 link
2024-06-21 InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions Yu Nakagome et.al. 2406.14890 null
2024-06-21 GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech Wenbin Wang et.al. 2406.14875 null
2024-06-21 DASB - Discrete Audio and Speech Benchmark Pooneh Mousavi et.al. 2406.14294 null
2024-06-18 Instruction Data Generation and Unsupervised Adaptation for Speech Language Models Vahid Noroozi et.al. 2406.12946 null
2024-06-17 DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer Keon Lee et.al. 2406.11427 null
2024-06-16 NAST: Noise Aware Speech Tokenization for Speech Language Models Shoval Messica et.al. 2406.11037 link
2024-06-16 Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis Xuehao Zhou et.al. 2406.10844 null
2024-06-14 Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice Shubham Gupta et.al. 2406.10422 null
2024-06-14 UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner Dongchao Yang et.al. 2406.10056 link
2024-06-14 MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model Jiatong Shi et.al. 2406.09869 null
2024-06-13 DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage Kyra Wang et.al. 2406.08820 null
2024-06-13 Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems Zhengyang Chen et.al. 2406.08812 null
2024-06-13 DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing Neha Sahipjohn et.al. 2406.08802 null
2024-06-12 Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis Wing-Zin Leung et.al. 2406.08568 link
2024-06-12 Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data Yuma Shirahata et.al. 2406.08111 null
2024-06-12 VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech Ashishkumar Gudmalwar et.al. 2406.08076 null
2024-06-12 LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning Masaya Kawamura et.al. 2406.07969 link
2024-06-12 VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment Bing Han et.al. 2406.07855 null
2024-06-12 EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech Deok-Hyeon Cho et.al. 2406.07803 link
2024-06-11 The Interspeech 2024 Challenge on Speech Processing Using Discrete Units Xuankai Chang et.al. 2406.07725 null
2024-06-11 Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? Qingkai Fang et.al. 2406.07289 null
2024-06-11 AudioMarkBench: Benchmarking Robustness of Audio Watermarking Hongbin Liu et.al. 2406.06979 link
2024-06-11 Controlling Emotion in Text-to-Speech with Natural Language Prompts Thomas Bott et.al. 2406.06406 link
2024-06-10 Meta Learning Text-to-Speech Synthesis in over 7000 Languages Florian Lux et.al. 2406.06403 link
2024-06-10 MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance Semin Kim et.al. 2406.05965 null
2024-06-11 WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark Linhan Ma et.al. 2406.05763 link
2024-06-09 An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS Xiaofei Wang et.al. 2406.05699 null
2024-06-11 Text-aware and Context-aware Expressive Audiobook Speech Synthesis Dake Guo et.al. 2406.05672 null
2024-06-08 Autoregressive Diffusion Transformer for Text-to-Speech Synthesis Zhijun Liu et.al. 2406.05551 null
2024-06-08 VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers Sanyuan Chen et.al. 2406.05370 null
2024-06-07 Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis Ryan Langman et.al. 2406.05298 null
2024-06-07 XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model Edresson Casanova et.al. 2406.04904 link
2024-06-07 TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking Junzuo Zhou et.al. 2406.04840 null
2024-06-07 Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study Chong Zhang et.al. 2406.04633 null
2024-06-06 Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis Théodor Lemerle et.al. 2406.04467 link
2024-06-06 Total-Duration-Aware Duration Modeling for Text-to-Speech Systems Sefik Emre Eskimez et.al. 2406.04281 null
2024-06-06 Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining Jinlong Xue et.al. 2406.03714 null
2024-06-06 Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model Jinlong Xue et.al. 2406.03706 null
2024-06-05 Style Mixture of Experts for Expressive Text-To-Speech Synthesis Ahad Jawaid et.al. 2406.03637 null
2024-06-07 Harder or Different? Understanding Generalization of Audio Deepfake Detection Nicolas M. Müller et.al. 2406.03512 null
2024-06-05 LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes Trung Dang et.al. 2406.02897 null
2024-06-04 Seed-TTS: A Family of High-Quality Versatile Speech Generation Models Philip Anastassiou et.al. 2406.02430 link
2024-06-05 SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models Dongchao Yang et.al. 2406.02328 null
2024-06-04 BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation Hui-Peng Du et.al. 2406.02162 null
2024-06-04 Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis Kun Zhou et.al. 2406.02009 null
2024-06-03 ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec Shengpeng Ji et.al. 2406.01205 link
2024-06-03 Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training Jan Melechovsky et.al. 2406.01018 null
2024-06-02 Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback Chen Chen et.al. 2406.00654 null
2024-05-31 Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Vicky Zayats et.al. 2405.18669 null
2024-05-28 TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation Chenyang Le et.al. 2405.17809 link
2024-05-27 RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis Haoxiang Shi et.al. 2405.17028 null
2024-05-24 Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition Zijin Gu et.al. 2405.15216 null
2024-05-23 Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models Jingyi Chen et.al. 2405.14632 null
2024-05-22 A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Yue Li et.al. 2405.13477 null
2024-05-20 Multi-speaker Text-to-speech Training with Speaker Anonymized Data Wen-Chin Huang et.al. 2405.11767 null
2024-05-19 VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications Mikhail Konenkov et.al. 2405.11537 null
2024-05-18 Exploring speech style spaces with language models: Emotional TTS without emotion labels Shreeram Suresh Chandra et.al. 2405.11413 null
2024-05-16 Faces that Speak: Jointly Synthesising Talking Face and Speech from Text Youngjoon Jang et.al. 2405.10272 null
2024-05-16 Building a Luganda Text-to-Speech Model From Crowdsourced Data Sulaiman Kagumire et.al. 2405.10211 null
2024-05-16 Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model Siyang Wang et.al. 2405.09768 null
2024-05-15 Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer Weifei Jin et.al. 2405.09470 null
2024-05-15 Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis Sho Inoue et.al. 2405.09171 null
2024-05-14 PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset Yang Hou et.al. 2405.08838 link
2024-04-30 Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech Hankun Wang et.al. 2404.19723 null
2024-04-29 MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis Xiang Li et.al. 2404.18398 null
2024-04-28 USAT: A Universal Speaker-Adaptive Text-to-Speech Approach Wenbin Wang et.al. 2404.18094 link
2024-04-27 TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality Tiantian Feng et.al. 2404.17983 null
2024-04-26 An RFP dataset for Real, Fake, and Partially fake audio detection Abdulazeez AlAli et.al. 2404.17721 null
2024-04-23 StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations Sen Liu et.al. 2404.14946 null
2024-04-23 Retrieval-Augmented Audio Deepfake Detection Zuheng Kang et.al. 2404.13892 null
2024-04-14 Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling Quanxiu Wang et.al. 2404.09192 null
2024-04-11 Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network Mayura Manawadu et.al. 2404.07807 null
2024-04-18 Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness Xincan Feng et.al. 2404.06714 link
2024-04-10 CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations Leying Zhang et.al. 2404.06690 null
2024-04-10 The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge Yiwei Guo et.al. 2404.06079 null
2024-04-07 Cross-Domain Audio Deepfake Detection: Dataset and Analysis Yuang Li et.al. 2404.04904 null
2024-04-06 HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks Yingting Li et.al. 2404.04645 link
2024-04-18 Open vocabulary keyword spotting through transfer learning from speech synthesis Kesavaraj V et.al. 2404.03914 null
2024-04-06 RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis Detai Xin et.al. 2404.03204 null
2024-04-03 CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech Jaehyeon Kim et.al. 2404.02781 null
2024-04-13 PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders Yu Pan et.al. 2404.02702 null
2024-03-31 Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation Rohan Chaudhury et.al. 2404.01339 link
2024-03-28 A Review of Multi-Modal Large Language and Vision Models Kilian Carolan et.al. 2404.01322 null
2024-04-09 KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis Adal Abilbekov et.al. 2404.01033 link
2024-03-31 CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models Xiang Li et.al. 2404.00569 link
2024-03-25 VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild Puyuan Peng et.al. 2403.16973 link
2024-03-20 Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning Shivam Ratnakant Mhaskar et.al. 2403.15469 null
2024-03-20 UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge Wataru Nakata et.al. 2403.13720 null
2024-03-20 Building speech corpus with diverse voice characteristics for its prompt-based representation Aya Watanabe et.al. 2403.13353 null
2024-03-17 Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations Claudio Pinhanez et.al. 2403.11209 null
2024-03-17 EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech Ziqi Liang et.al. 2403.08164 null
2024-03-09 HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling Chunhui Wang et.al. 2403.05989 null
2024-03-05 AttentionStitch: How Attention Solves the Speech Editing Problem Antonios Alexos et.al. 2403.04804 null
2024-03-07 Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation Sai Akarsh et.al. 2403.04178 null
2024-03-27 NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models Zeqian Ju et.al. 2403.03100 null
2024-03-04 Brilla AI: AI Contestant for the National Science and Maths Quiz George Boateng et.al. 2403.01699 link
2024-03-02 Towards Accurate Lip-to-Speech Synthesis in-the-Wild Sindhu Hegde et.al. 2403.01087 null
2024-02-29 Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data Takaaki Saeki et.al. 2402.18932 null
2024-02-26 An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation Ahmet Gunduz et.al. 2402.16380 link
2024-02-22 Efficient data selection employing Semantic Similarity-based Graph Structures for model training Roxana Petcu et.al. 2402.14888 null
2024-02-22 Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition Rendi Chevi et.al. 2402.14523 null
2024-02-19 On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models Miri Varshavsky-Hassid et.al. 2402.12423 null
2024-02-19 Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting Haolin Chen et.al. 2402.12220 link
2024-02-18 Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru Zining Wang et.al. 2402.11571 null
2024-02-14 MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech Shengpeng Ji et.al. 2402.09378 null
2024-02-15 BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Mateusz Łajszczak et.al. 2402.08093 null
2024-03-04 Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like Naoyuki Kanda et.al. 2402.07383 null
2024-02-09 A New Approach to Voice Authenticity Nicolas M. Müller et.al. 2402.06304 null
2024-02-08 Unified Speech-Text Pretraining for Spoken Dialog Modeling Heeseung Kim et.al. 2402.05706 null
2024-02-05 Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations Álvaro Martín-Cortinas et.al. 2402.03407 null
2024-02-02 Natural language guidance of high-fidelity text-to-speech with synthetic annotations Dan Lyth et.al. 2402.01912 null
2024-01-23 Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization Wei-Ping Huang et.al. 2402.01692 null
2024-02-01 Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech Dong Yang et.al. 2402.00288 null
2024-02-01 PAM: Prompting Audio-Language Models for Audio Quality Assessment Soham Deshmukh et.al. 2402.00282 link
2024-01-31 Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2 Jiatong Shi et.al. 2401.17619 link
2024-01-28 MunTTS: A Text-to-Speech System for Mundari Varun Gumma et.al. 2401.15579 null
2024-01-30 VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech Chenpeng Du et.al. 2401.14321 null
2024-01-25 Text to speech synthesis Harini s et.al. 2401.13891 null
2024-01-25 SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation Dong Zhang et.al. 2401.13527 link
2024-01-22 Benchmarking Large Multimodal Models against Common Corruptions Jiawei Zhang et.al. 2401.11943 link
2024-01-22 Adversarial speech for voice privacy protection from Personalized Speech generation Shihao Chen et.al. 2401.11857 null
2024-02-16 Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis Vinotha R et.al. 2401.11771 null
2024-01-19 Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech Abhinav Garg et.al. 2401.10465 null
2024-02-28 MLAAD: The Multi-Language Audio Anti-Spoofing Dataset Nicolas M. Müller et.al. 2401.09512 null
2024-01-15 MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory Robert G. Kimelman et.al. 2401.07967 null
2024-01-14 ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering Yakun Song et.al. 2401.07333 null
2024-01-12 Multi-Task Learning for Front-End Text Processing in TTS Wonjune Kang et.al. 2401.06321 link
2024-01-11 End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 Aniket Tathe et.al. 2401.06183 null
2024-01-11 Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection Lian Huang et.al. 2401.05614 null
2024-01-10 Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters Kenichi Fujita et.al. 2401.05111 null
2024-01-07 Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments Zhonghao Shi et.al. 2401.03581 null
2024-01-07 Transfer the linguistic representations from TTS to accent conversion with non-parallel data Xi Chen et.al. 2401.03538 null
2024-01-03 Incremental FastPitch: Chunk-based High Quality Text to Speech Muyang Du et.al. 2401.01755 null
2024-01-03 Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction Minchan Kim et.al. 2401.01498 null
2023-12-18 Assisting Blind People Using Object Detection with Vocal Feedback Heba Najm et.al. 2401.01362 null
2023-12-30 Boosting Large Language Model for Speech Synthesis: An Empirical Study Hongkun Hao et.al. 2401.00246 null
2024-01-01 Normalization of Lithuanian Text Using Regular Expressions Pijus Kasparaitis et.al. 2312.17660 null
2023-12-27 AE-Flow: AutoEncoder Normalizing Flow Jakub Mosiński et.al. 2312.16552 null
2023-12-22 Creating New Voices using Normalizing Flows Piotr Bilinski et.al. 2312.14569 null
2023-12-22 ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations Cheng Gong et.al. 2312.14398 null
2023-12-19 External Knowledge Augmented Polyphone Disambiguation Using Large Language Model Chen Li et.al. 2312.11920 null
2023-12-17 A review-based study on different Text-to-Speech technologies Md. Jalal Uddin Chowdhury et.al. 2312.11563 null
2024-01-31 MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis Wenhao Guan et.al. 2312.10687 null
2024-02-22 Amphion: An Open-Source Audio, Music and Speech Generation Toolkit Xueyao Zhang et.al. 2312.09911 link
2023-12-11 Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism Georgios Milis et.al. 2312.06613 link
2023-12-08 An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis Via Nielson et.al. 2312.05415 null
2023-12-06 Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis Zehua Chen et.al. 2312.03491 null
2023-12-02 Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning Raviraj Joshi et.al. 2312.01107 null
2023-12-02 Code-Mixed Text to Speech Synthesis under Low-Resource Constraints Raviraj Joshi et.al. 2312.01103 null
2023-11-29 Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes Pavel Korshunov et.al. 2311.17655 null
2024-02-06 Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech Enting Zhou et.al. 2311.14816 link
2023-12-07 Guided Flows for Generative Modeling and Decision Making Qinqing Zheng et.al. 2311.13443 null
2023-11-27 HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis Sang-Hoon Lee et.al. 2311.12454 link
2023-11-18 Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots Farideh Majidi et.al. 2311.11116 null
2023-11-18 Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys Gabriel Cosache et.al. 2311.11030 null
2023-11-17 A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness Mathias Vogel et.al. 2311.10804 null
2023-11-16 Improving fairness for spoken language understanding in atypical speech with Text-to-Speech Helin Wang et.al. 2311.10149 link
2024-02-02 DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation Jianzong Wang et.al. 2311.07965 null
2023-11-12 ChatAnything: Facetime Chat with LLM-Enhanced Personas Yilin Zhao et.al. 2311.06772 null
2023-11-11 NewsGPT: ChatGPT Integration for Robot-Reporter Abdelhadi Hireche et.al. 2311.06640 link
2023-11-08 Synthetic Speaking Children -- Why We Need Them and How to Make Them Muhammad Ali Farooq et.al. 2311.06307 null
2023-09-25 Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image Minki Kang et.al. 2311.05844 null
2023-11-07 Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning Rishabh Jain et.al. 2311.04313 link
2023-11-07 Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment Jakir Hasan et.al. 2311.03792 null
2023-11-08 Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction Minchan Kim et.al. 2311.02898 null
2023-11-02 Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations Hanglei Zhang et.al. 2311.01260 null
2023-11-02 E3 TTS: Easy End-to-End Diffusion-based Text to Speech Yuan Gao et.al. 2311.00945 null
2023-10-31 An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation Yingjie Zhou et.al. 2310.20251 link
2023-10-27 Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN Neeraj Kumar et.al. 2310.18169 null
2023-10-25 ArTST: Arabic Text and Speech Transformer Hawau Olamide Toyin et.al. 2310.16621 link
2023-10-25 Generative Pre-training for Speech with Flow Matching Alexander H. Liu et.al. 2310.16338 null
2023-10-23 DPP-TTS: Diversifying prosodic features of speech via determinantal point processes Seongho Joo et.al. 2310.14663 null
2023-10-22 An overview of text-to-speech systems and media applications Mohammad Reza Hasanabadi et.al. 2310.14301 null
2023-10-14 Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling Tiberiu Boros et.al. 2310.09636 link
2023-10-14 Attentive Multi-Layer Perceptron for Non-autoregressive Generation Shuyang Jiang et.al. 2310.09512 link
2023-12-22 Crowdsourced and Automatic Speech Prominence Estimation Max Morrison et.al. 2310.08464 link
2023-10-12 On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition Nick Rossenbach et.al. 2310.08132 null
2023-10-12 Vec-Tok Speech: speech vectorization and tokenization for neural speech generation Xinfa Zhu et.al. 2310.07246 link
2023-10-10 Prosody Analysis of Audiobooks Charuta Pethe et.al. 2310.06930 null
2023-10-09 JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions Detai Xin et.al. 2310.06072 null
2024-01-09 Unified speech and gesture synthesis using flow matching Shivam Mehta et.al. 2310.05181 null
2023-10-08 Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset Ze Liu et.al. 2310.04982 null
2023-10-11 LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT Jiaming Wang et.al. 2310.04673 null
2024-01-22 Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis Jae-Sung Bae et.al. 2310.03538 null
2023-10-07 The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains Erica Cooper et.al. 2310.02640 null
2023-10-02 Towards human-like spoken dialogue generation between AI agents from written dialogue Kentaro Mitsui et.al. 2310.01088 null
2023-10-01 Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech Dareen Alharthi et.al. 2310.00706 null
2024-03-11 Fewer-token Neural Speech Codec with Time-invariant Codes Yong Ren et.al. 2310.00014 link
2024-01-31 ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech Wenhao Guan et.al. 2309.17056 null
2023-09-29 Low-Resource Self-Supervised Learning with SSL-Enhanced TTS Po-chun Hsu et.al. 2309.17020 null
2023-09-29 Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features Yuxiang Zhang et.al. 2309.16954 null
2023-12-18 High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models Chunyu Qiang et.al. 2309.15512 null
2024-01-09 BiSinger: Bilingual Singing Voice Synthesis Huali Zhou et.al. 2309.14089 link
2023-10-07 HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS Dake Guo et.al. 2309.13907 null
2023-09-24 VoiceLDM: Text-to-Speech with Environmental Context Yeonghyeon Lee et.al. 2309.13664 null
2023-09-24 Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control Aya Watanabe et.al. 2309.13509 null
2023-09-22 DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis Yu Gu et.al. 2309.12792 null
2023-09-22 Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts Shun Lei et.al. 2309.11977 null
2023-09-21 The Impact of Silence on Speech Anti-Spoofing Yuxiang Zhang et.al. 2309.11827 null
2023-09-21 Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech Rui Liu et.al. 2309.11724 link
2023-09-20 Speak While You Think: Streaming Speech Synthesis During Text Generation Avihu Dekel et.al. 2309.11210 null
2023-09-20 Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model Xinyu Zhou et.al. 2309.11000 link
2023-09-19 Exploring Speech Enhancement for Low-resource Speech Synthesis Zhaoheng Ni et.al. 2309.10795 null
2023-09-19 Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition Ziyang Ma et.al. 2309.10294 null
2023-09-17 Augmenting text for spoken language understanding with Large Language Models Roshan Sharma et.al. 2309.09390 null
2023-09-16 FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework Jianzong Wang et.al. 2309.08837 null
2023-09-15 Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech Dariusz Piotrowski et.al. 2309.08255 null
2023-09-15 HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods Hyun-seo Shin et.al. 2309.08208 link
2023-12-27 PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions Reo Shimizu et.al. 2309.08140 null
2023-09-15 Diversity-based core-set selection for text-to-speech with linguistic and acoustic features Kentaro Seki et.al. 2309.08127 null
2023-09-14 Direct Text to Speech Translation System using Acoustic Units Victoria Mingote et.al. 2309.07478 null
2023-10-07 FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec Zhihao Du et.al. 2309.07405 link
2023-09-13 DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation Zhichao Wu et.al. 2309.06787 null
2023-09-11 Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP Jinzuomu Zhong et.al. 2309.05423 link
2024-01-16 VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching Yiwei Guo et.al. 2309.05027 link
2023-09-08 Cross-Utterance Conditioned VAE for Speech Generation Yang Li et.al. 2309.04156 null
2023-09-07 Large-Scale Automatic Audiobook Creation Brendan Walsh et.al. 2309.03926 null
2023-09-11 GRASS: Unified Generation Model for Speech-to-Semantic Tasks Aobo Xia et.al. 2309.02780 null
2023-09-12 MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 Zhihang Xu et.al. 2309.02743 null
2023-10-12 PromptTTS 2: Describing and Generating Voices with Text Prompt Yichong Leng et.al. 2309.02285 null
2023-09-04 A Comparative Analysis of Pretrained Language Models for Text-to-Speech Marcel Granero-Moya et.al. 2309.01576 null
2023-09-02 DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin Tao Li et.al. 2309.00883 null
2023-12-18 Learning Speech Representation From Contrastive Token-Acoustic Pretraining Chunyu Qiang et.al. 2309.00424 null
2023-09-01 The FruitShell French synthesis system at the Blizzard 2023 Challenge Xin Qi et.al. 2309.00223 null
2023-08-31 QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning Haohan Guo et.al. 2309.00126 null
2024-01-23 SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models Xin Zhang et.al. 2308.16692 link
2023-08-31 Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis Weiqin Li et.al. 2308.16593 null
2023-08-31 Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information Jie Chen et.al. 2308.16577 null
2023-08-31 LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech Jie Chen et.al. 2308.16569 null
2023-08-30 CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis Yi Meng et.al. 2308.16021 null
2023-09-01 The DeepZen Speech Synthesis System for Blizzard Challenge 2023 Christophe Veaux et.al. 2308.15945 null
2023-08-28 Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech Hyungchan Yoon et.al. 2308.14909 null
2023-09-04 Rep2wav: Noise Robust text-to-speech Using self-supervised representations Qiushi Zhu et.al. 2308.14553 null
2023-08-28 TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models Shengpeng Ji et.al. 2308.14430 link
2023-09-02 Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder Xuyuan Li et.al. 2308.13365 null
2023-08-24 Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations Wenbin Wang et.al. 2308.13007 null
2023-09-22 Sparks of Large Audio Models: A Survey and Outlook Siddique Latif et.al. 2308.12792 null
2023-10-25 SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Seamless Communication et.al. 2308.11596 link
2023-08-31 Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models Heyang Xue et.al. 2308.10428 null
2023-08-16 AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis Hrishikesh Viswanath et.al. 2308.08577 null
2023-08-14 SpeechX: Neural Codec Language Model as a Versatile Speech Transformer Xiaofei Wang et.al. 2308.06873 null
2023-08-12 Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation Zhichao Wang et.al. 2308.06457 link
2023-09-09 AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining Haohe Liu et.al. 2308.05734 link
2023-08-09 Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay Leixian Shen et.al. 2308.04703 null
2023-08-08 Towards an AI to Win Ghana's National Science and Maths Quiz George Boateng et.al. 2308.04333 link
2023-08-08 WonderFlow: Narration-Centric Design of Animated Data Videos Yun Wang et.al. 2308.04040 null
2023-08-04 Let's Give a Voice to Conversational Agents in Virtual Reality Michele Yin et.al. 2308.02665 link
2023-08-03 Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation Minsu Kim et.al. 2308.01831 link
2023-08-02 SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis Ramanan Sivaguru et.al. 2308.01018 null
2023-07-07 Artificial Eye for the Blind Abhinav Benagi et.al. 2308.00801 null
2023-07-31 Multilingual context-based pronunciation learning for Text-to-Speech Giulia Comini et.al. 2307.16709 null
2023-07-31 Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech Guangyan Zhang et.al. 2307.16679 null
2023-07-31 Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings Manuel Sam Ribeiro et.al. 2307.16643 null
2023-07-31 DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training Hyung-Seok Oh et.al. 2307.16549 link
2023-07-31 VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design Jungil Kong et.al. 2307.16430 null
2023-07-30 Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation Yuanhao Chen et.al. 2307.16199 link
2023-07-29 METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer Xinfa Zhu et.al. 2307.15951 null
2023-12-18 Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding Chunyu Qiang et.al. 2307.15484 null
2023-07-20 SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer Daegyeom Kim et.al. 2307.10550 link
2023-07-18 SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs Yinghao Aaron Li et.al. 2307.09435 null
2023-09-28 Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts Ziyue Jiang et.al. 2307.07218 null
2023-07-13 Controllable Emphasis with zero data for text-to-speech Arnaud Joly et.al. 2307.07062 null
2023-07-11 On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis Siyang Wang et.al. 2307.05132 null
2023-07-10 The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task Kun Song et.al. 2307.04630 null
2023-10-07 ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading Yujia Xiao et.al. 2307.00782 null
2023-06-28 EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech Daria Diatlova et.al. 2307.00024 link
2023-06-29 High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units Junchen Lu et.al. 2306.17005 null
2023-06-28 UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data Heeseung Kim et.al. 2306.16083 link
2023-10-19 Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Matthew Le et.al. 2306.15687 null
2023-06-27 GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech Yahuan Cong et.al. 2306.15304 null
2023-06-25 DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech Sen Liu et.al. 2306.14145 null
2023-06-21 Visual-Aware Text-to-Speech Mohan Zhou et.al. 2306.12020 null
2023-06-21 Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer Jakub Swiatkowski et.al. 2306.11662 null
2023-06-16 Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation Kishor Kayyar Lakshminarayana et.al. 2306.10152 null
2023-06-16 CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages Frederico S. Oliveira et.al. 2306.10097 null
2023-06-14 Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation Zheng Liang et.al. 2306.08588 null
2023-06-14 Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects Xinghua Qu et.al. 2306.08219 link
2023-11-20 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models Yinghao Aaron Li et.al. 2306.07691 null
2024-01-18 UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding Chenpeng Du et.al. 2306.07547 null
2023-06-13 PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling Ji-Sang Hwang et.al. 2306.07489 null
2023-06-09 Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech Shijun Wang et.al. 2306.05709 null
2023-06-08 VIFS: An End-to-End Variational Inference for Foley Sound Synthesis Junhyeok Lee et.al. 2306.05004 link
2023-07-11 Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge Wenhao Guan et.al. 2306.04301 null
2023-06-06 Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias Ziyue Jiang et.al. 2306.03509 null
2023-08-02 Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis Zhenhui Ye et.al. 2306.03504 null
2023-06-05 Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis Dengfeng Ke et.al. 2306.02593 null
2023-06-05 Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model Hoyeon Lee et.al. 2306.02579 null
2023-06-05 Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming Xinlei Niu et.al. 2306.02568 link
2023-06-02 Towards Robust FastSpeech 2 by Modelling Residual Multimodality Fabian Kögel et.al. 2306.01442 link
2023-05-30 Towards Selection of Text-to-speech Data to Augment ASR Training Shuo Liu et.al. 2306.00998 null
2023-06-01 EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis Haobin Tang et.al. 2306.00648 null
2023-06-01 The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech Phat Do et.al. 2306.00535 null
2023-05-31 Text-to-Speech Pipeline for Swiss German -- A comparison Tobias Bollinger et.al. 2305.19750 null
2023-05-31 XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech Linh The Nguyen et.al. 2305.19709 link
2023-06-01 PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions Guanghou Liu et.al. 2305.19522 null
2023-05-30 Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages Phat Do et.al. 2305.19396 null
2023-05-30 Make-A-Voice: Unified Voice Synthesis With Discrete Representation Rongjie Huang et.al. 2305.19269 null
2023-05-30 STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions Michel Plüss et.al. 2305.18855 null
2023-05-30 LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus Yuma Koizumi et.al. 2305.18802 null
2023-10-09 An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization Fei Kong et.al. 2305.18355 link
2023-05-29 ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation Ambuj Mehrish et.al. 2305.18028 link
2023-05-29 Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis Erik Ekstedt et.al. 2305.17971 null
2023-07-25 StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation Kun Song et.al. 2305.17732 null
2023-05-28 Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS Sewade Ogun et.al. 2305.17724 link
2023-07-19 Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing Julia Kaiwen Lau et.al. 2305.17445 link
2023-05-26 DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction Vineet Bhat et.al. 2305.16957 null
2023-05-25 Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion Rui Liu et.al. 2305.16353 link
2023-05-22 Text Generation with Speech Synthesis for ASR Data Augmentation Zhuangqun Huang et.al. 2305.16333 null
2023-05-25 VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation Tianrui Wang et.al. 2305.16107 null
2023-05-25 Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration Rustem Yeshpanov et.al. 2305.15749 link
2024-02-05 LAraBench: Benchmarking Arabic AI with Large Language Models Ahmed Abdelali et.al. 2305.14982 null
2023-05-23 EfficientSpeech: An On-Device Text to Speech Model Rowel Atienza et.al. 2305.13905 link
2023-05-23 ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models Minki Kang et.al. 2305.13831 null
2023-05-22 U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech Xin Jing et.al. 2305.13195 null
2023-05-25 EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels Kari Ali Noriy et.al. 2305.13137 link
2023-05-22 ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer Huadai Liu et.al. 2305.12708 null
2023-05-21 VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages Shivam Mhaskar et.al. 2305.12518 null
2023-05-26 Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus Detai Xin et.al. 2305.12442 link
2023-05-20 ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios Yuyue Wang et.al. 2305.12200 null
2023-05-19 MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting Neil Shah et.al. 2305.11926 null
2024-02-20 Data Redaction from Conditional Generative Models Zhifeng Kong et.al. 2305.11351 null
2023-05-18 Parameter-Efficient Learning for Text-to-Speech Accent Adaptation Li-Jen Yang et.al. 2305.11320 link
2023-05-19 Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation Martijn Bartelds et.al. 2305.10951 link
2023-09-30 Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data Yusheng Tian et.al. 2305.10891 link
2023-05-18 FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs Won Jang et.al. 2305.10823 null
2023-05-18 CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training Zhenhui Ye et.al. 2305.10763 null
2023-08-29 A unified front-end framework for English text-to-speech synthesis Zelin Ying et.al. 2305.10666 null
2023-09-19 Controllable Speaking Styles Using a Large Language Model Atli Thor Sigurgeirsson et.al. 2305.10321 null
2023-05-23 Better speech synthesis through scaling James Betker et.al. 2305.07243 link
2023-10-29 CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model Zhen Ye et.al. 2305.06908 link
2023-05-08 Accented Text-to-Speech Synthesis with Limited Data Xuehao Zhou et.al. 2305.04816 null
2023-05-03 M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis Jinlong Xue et.al. 2305.02269 null
2023-05-30 A Review of Deep Learning Techniques for Speech Processing Ambuj Mehrish et.al. 2305.00359 null
2023-04-26 Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis Ye-Xin Lu et.al. 2304.13270 null
2023-04-25 Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge Chenpeng Du et.al. 2304.13121 null
2023-04-24 Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model Kenichi Fujita et.al. 2304.11976 null
2023-04-23 DiffVoice: Text-to-Speech with Latent Diffusion Zhijun Liu et.al. 2304.11750 null
2023-04-23 SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model Jianzong Wang et.al. 2304.11547 null
2023-05-30 NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers Kai Shen et.al. 2304.09116 null
2023-04-16 A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers Juan Zuluaga-Gomez et.al. 2304.07842 null
2023-04-13 Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis Shun Lei et.al. 2304.06359 null
2023-04-10 Enhancing Speech-to-Speech Translation with Multiple TTS Targets Jiatong Shi et.al. 2304.04618 null
2023-04-07 ArmanTTS single-speaker Persian dataset Mohammd Hasan Shamgholi et.al. 2304.03585 null
2023-04-03 Ensemble prosody prediction for expressive speech synthesis Tian Huey Teh et.al. 2304.00714 null
2023-03-29 AraSpot: Arabic Spoken Command Spotting Mahmoud Salhab et.al. 2303.16621 link
2023-03-28 Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages Seongyeon Park et.al. 2303.15669 link
2023-03-27 Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis Karren Yang et.al. 2303.14885 null
2023-03-24 Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis Takuhiro Kaneko et.al. 2303.13909 null
2023-04-02 A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI Chenshuang Zhang et.al. 2303.13336 null
2023-03-20 Code-Switching Text Generation and Injection in Mandarin-English ASR Haibin Yu et.al. 2303.10949 null
2023-03-14 Controlling High-Dimensional Data With Sparse Input Dan Andrei Iliescu et.al. 2303.09446 null
2023-03-09 Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports Hyunseung Chung et.al. 2303.09395 link
2023-03-15 Cross-speaker Emotion Transfer by Manipulating Speech Style Latents Suhee Jo et.al. 2303.08329 null
2023-03-14 QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis Haobin Tang et.al. 2303.07682 null
2023-03-10 An End-to-End Neural Network for Image-to-Audio Transformation Liu Chen et.al. 2303.06078 null
2023-03-09 Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation Qi Chen et.al. 2303.05322 link
2023-03-07 Do Prosody Transfer Models Transfer Prosody? Atli Thor Sigurgeirsson et.al. 2303.04289 null
2023-03-07 Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling Ziqiang Zhang et.al. 2303.03926 null
2023-03-02 Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding Yingting Li et.al. 2303.03267 link
2023-03-08 FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model Ruiqing Xue et.al. 2303.02939 null
2023-08-14 Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations Yuma Koizumi et.al. 2303.01664 null
2023-03-11 Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities Shijun Wang et.al. 2303.01508 null
2023-12-17 ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations Neil Shah et.al. 2303.01261 null
2023-03-02 LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion Chunfeng Wang et.al. 2303.01086 null
2023-03-02 Leveraging Large Text Corpora for End-to-End Speech Summarization Kohei Matsuura et.al. 2303.00978 null
2023-03-01 DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction Raviteja Anantha et.al. 2303.00171 null
2023-02-28 ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus Ajinkya Kulkarni et.al. 2303.00069 null
2023-02-28 Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners Jocelyn Huang et.al. 2302.14523 null
2023-06-12 CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis Ji-Hoon Kim et.al. 2302.14370 null
2023-05-19 UniFLG: Unified Facial Landmark Generator from Text or Speech Kentaro Mitsui et.al. 2302.14337 null
2023-02-27 Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech Jiyoung Lee et.al. 2302.13700 link
2023-02-27 Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech Dong Yang et.al. 2302.13652 null
2023-02-27 Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow Yoonhyung Lee et.al. 2302.13458 null
2023-06-06 PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS Junhyeok Lee et.al. 2302.12391 link
2023-02-21 Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition Leyuan Qu et.al. 2302.09723 null
2023-02-23 QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion Houjian Guo et.al. 2302.08296 link
2023-02-13 Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages Sudhanshu Srivastava et.al. 2302.06227 null
2023-02-08 A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech Li-Wei Chen et.al. 2302.04215 link
2023-02-07 Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision Eugene Kharitonov et.al. 2302.03540 null
2023-02-15 MAC: A unified framework boosting low resource automatic speech recognition Zeping Min et.al. 2302.03498 null
2023-06-25 InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt Dongchao Yang et.al. 2301.13662 link
2023-03-01 UzbekTagger: The rule-based POS tagger for Uzbek language Maksud Sharipov et.al. 2301.12711 null
2023-05-27 Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining Takaaki Saeki et.al. 2301.12596 link
2023-01-31 Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker Navjot Kaur et.al. 2301.12331 link
2023-01-26 On granularity of prosodic representations in expressive text-to-speech Mikolaj Babianski et.al. 2301.11446 null
2023-01-26 Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study Massa Baali et.al. 2301.09099 link
2023-01-20 Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions Yinghao Aaron Li et.al. 2301.08810 null
2023-01-11 Modelling low-resource accents without accent-specific TTS frontend Georgi Tinchev et.al. 2301.04606 null
2022-12-11 BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm Yu-Wen Chen et.al. 2301.04120 link
2023-01-10 UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion Haogeng Liu et.al. 2301.03801 null
2023-01-10 Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation Abdullah Shahid et.al. 2301.03751 null
2023-09-19 Applying Automated Machine Translation to Educational Video Courses Linden Wang et.al. 2301.03141 null
2023-01-06 Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition David M. Chan et.al. 2301.02736 null
2023-01-05 Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers Chengyi Wang et.al. 2301.02111 link
2022-12-11 MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset Kailin Liang et.al. 2301.00657 link
2022-12-30 ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech Zehua Chen et.al. 2212.14518 null
2022-12-29 StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models Yinghao Aaron Li et.al. 2212.14227 link
2022-12-22 HMM-based data augmentation for E2E systems for building conversational speech synthesis systems Ishika Gupta et.al. 2212.11982 null
2022-12-21 ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement Wei-Ning Hsu et.al. 2212.11377 null
2022-12-20 TTS-Guided Training for Accent Conversion Without Parallel Data Yi Zhou et.al. 2212.10204 null
2023-06-28 Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling Tuomo Raitio et.al. 2212.10075 null
2022-12-16 Speech Aware Dialog System Technology Challenge (DSTC11) Hagen Soltau et.al. 2212.08704 null
2022-12-16 Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder Yusuke Yasuda et.al. 2212.08329 null
2022-12-16 Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language Yusuke Yasuda et.al. 2212.08321 null
2022-12-15 RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis Shinhyeok Oh et.al. 2212.07939 link
2022-12-14 Probing Deep Speaker Embeddings for Speaker-related Tasks Zifeng Zhao et.al. 2212.07068 null
2022-12-08 SpeechLMScore: Evaluating speech generation using speech language model Soumi Maiti et.al. 2212.04559 link
2023-04-04 Learning to Dub Movies via Hierarchical Prosody Models Gaoxiang Cong et.al. 2212.04054 link
2022-12-07 Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning Ankur Debnath et.al. 2212.03558 null
2022-12-07 Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue Daxin Tan et.al. 2212.03398 null
2022-12-06 UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis Yi Lei et.al. 2212.01546 null
2022-11-30 SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech Byoung Jin Choi et.al. 2211.16866 null
2022-11-29 Controllable speech synthesis by learning discrete phoneme-level prosodic representations Nikolaos Ellinas et.al. 2211.16307 null
2023-05-25 Evaluating and reducing the distance between synthetic and real speech distributions Christoph Minixhofer et.al. 2211.16049 null
2022-11-26 Contextual Expressive Text-to-Speech Jianhong Tu et.al. 2211.14548 null
2022-12-05 Efficient Incremental Text-to-Speech on GPUs Muyang Du et.al. 2211.13939 null
2023-03-21 Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems? Xuan Shi et.al. 2211.13868 link
2022-11-23 IMaSC -- ICFOSS Malayalam Speech Corpus Deepa P Gopinath et.al. 2211.12796 null
2022-11-22 PromptTTS: Controllable Text-to-Speech with Text Descriptions Zhifang Guo et.al. 2211.12171 null
2022-11-04 Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech Xin Zhang et.al. 2211.09731 null
2023-02-17 Towards Building Text-To-Speech Systems for the Next Billion Users Gokul Karthik Kumar et.al. 2211.09536 link
2023-02-16 EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance Yiwei Guo et.al. 2211.09496 null
2022-11-17 Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation Chunyu Qiang et.al. 2211.09495 null
2022-11-17 NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis Hyeong-Seok Choi et.al. 2211.09407 null
2023-03-14 Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models Minki Kang et.al. 2211.09383 null
2023-01-04 Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation Xin Yuan et.al. 2211.09365 null
2022-11-14 SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech Perry Lam et.al. 2211.07283 null
2023-05-24 Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing Jacob J Webber et.al. 2211.06989 null
2023-05-29 OverFlow: Putting flows on top of neural transducers for better TTS Shivam Mehta et.al. 2211.06892 link
2023-05-29 Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations Yoori Oh et.al. 2211.06160 null
2022-12-04 ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech Xiaoran Fan et.al. 2211.03545 link
2022-11-07 Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder Jan Melechovsky et.al. 2211.03316 link
2022-11-06 Parallel Attention Forcing for Machine Translation Qingyun Dou et.al. 2211.03237 null
2022-11-06 An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space Jihwan Lee et.al. 2211.03078 null
2022-11-04 NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS Dongchao Yang et.al. 2211.02448 null
2022-11-04 Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts Detai Xin et.al. 2211.02336 null
2023-04-16 Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS Ziqi Liang et.al. 2211.01948 null
2022-11-01 Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages Anusha Prakash et.al. 2211.01338 null
2023-05-28 DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP Kun Song et.al. 2211.01087 null
2022-11-22 Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement Wei Song et.al. 2211.00967 null
2022-11-01 Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers Cheng-Ping Hsieh et.al. 2211.00585 link
2023-06-11 Generating Multilingual Gender-Ambiguous Text-to-Speech Voices Konstantinos Markopoulos et.al. 2211.00375 null
2023-05-07 Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features Alexandra Vioni et.al. 2211.00342 null
2022-11-02 Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS Kun Song et.al. 2210.17349 null
2024-02-27 Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation Nikolaos Ellinas et.al. 2210.17264 null
2022-10-31 Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection Luigi Attorresi et.al. 2210.17222 null
2022-10-31 Structured State Space Decoder for Speech Recognition and Synthesis Koichi Miyazaki et.al. 2210.17098 null
2022-10-28 Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders Jason Fong et.al. 2210.16045 null
2023-02-21 Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform Masaya Kawamura et.al. 2210.15975 link
2023-02-22 Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis Yuma Shirahata et.al. 2210.15964 null
2022-10-28 Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation Nobuyuki Morioka et.al. 2210.15868 null
2023-03-15 Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech Takaaki Saeki et.al. 2210.15447 null
2022-10-27 Explicit Intensity Control for Accented Text-to-speech Rui Liu et.al. 2210.15364 null
2022-10-27 FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis Yifan Hu et.al. 2210.15360 link
2022-10-26 Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection Kentaro Seki et.al. 2210.14850 null
2022-10-25 Semi-Supervised Learning Based on Reference Model for Low-resource TTS Xulong Zhang et.al. 2210.14723 null
2022-10-26 Cover Reproducible Steganography via Deep Generative Models Kejiang Chen et.al. 2210.14632 null
2022-10-26 Improving Speech-to-Speech Translation Through Unlabeled Text Xuan-Phi Nguyen et.al. 2210.14514 null
2022-10-26 The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge Yuhao Liang et.al. 2210.14448 null
2022-10-25 Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data Xulong Zhang et.al. 2210.13803 null
2023-09-17 HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation Chunhui Wang et.al. 2210.12740 null
2022-10-21 Low-Resource Multilingual and Zero-Shot Multispeaker TTS Florian Lux et.al. 2210.12223 link
2022-10-21 Adaptive re-calibration of channel-wise features for Adversarial Audio Classification Vardhan Dongre et.al. 2210.11722 null
2022-10-20 Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS Chunyu Qiang et.al. 2210.11429 null
2022-10-17 Towards Relation Extraction From Speech Tongtong Wu et.al. 2210.08759 link
2023-02-08 Generating Synthetic Speech from SpokenVocab for Speech Translation Jinming Zhao et.al. 2210.08174 link
2022-10-17 LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge Yan Jia et.al. 2210.07749 null
2022-10-20 Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy Sarina Meyer et.al. 2210.07002 link
2022-10-13 Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar Aolan Sun et.al. 2210.06877 null
2022-10-12 Can we use Common Voice to train a Multi-Speaker TTS system? Sewade Ogun et.al. 2210.06370 null
2023-06-01 SQuId: Measuring Speech Naturalness in Many Languages Thibault Sellam et.al. 2210.06324 null
2022-11-22 Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech Byoung Jin Choi et.al. 2210.05979 null
2022-10-06 An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era Andreas Triantafyllopoulos et.al. 2210.03538 null
2022-09-29 Facial Landmark Predictions with Applications to Metaverse Qiao Han et.al. 2209.14698 link
2022-09-26 Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech Yusuke Nakai et.al. 2209.12549 null
2022-09-22 EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models Perry Lam et.al. 2209.10890 null
2022-09-22 MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline Yifan Hu et.al. 2209.10848 link
2022-09-22 Controllable Accented Text-to-Speech Synthesis Rui Liu et.al. 2209.10804 null
2022-09-16 TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection Davide Salvi et.al. 2209.08000 null
2022-09-14 Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset Michael Chinen et.al. 2209.06358 null
2022-09-08 SANIP: Shopping Assistant and Navigation for the visually impaired Shubham Deshmukh et.al. 2209.03570 null
2022-09-07 Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech Huu-Tien Dang et.al. 2209.02971 null
2022-09-02 Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model Jennifer Drexler Fox et.al. 2209.01250 null
2022-08-28 Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks Lev Finkelstein et.al. 2208.13183 null
2022-10-04 Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale Aditya Agarwal et.al. 2208.09796 null
2022-08-21 Visualising Model Training via Vowel Space for Text-To-Speech Systems Binu Abeysinghe et.al. 2208.09775 link
2022-08-15 Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0 Mohammed Salah Al-Radhi et.al. 2208.07122 null
2022-12-28 Speech Synthesis with Mixed Emotions Kun Zhou et.al. 2208.05890 null
2022-08-03 A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis Qibing Bai et.al. 2208.02189 null
2022-07-29 Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation Giulia Comini et.al. 2207.14607 null
2022-07-25 Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis Raul Fernandez et.al. 2207.12262 null
2022-07-01 A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese Song Zhang et.al. 2207.12089 null
2022-07-20 When Is TTS Augmentation Through a Pivot Language Useful? Nathaniel Robinson et.al. 2207.09889 link
2022-07-11 LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech Harshvardhan Anand et.al. 2207.07118 null
2022-07-13 ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech Rongjie Huang et.al. 2207.06389 link
2022-07-13 Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech Zhengxi Liu et.al. 2207.06088 null
2022-07-13 SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate Nabarun Goswami et.al. 2207.06011 null
2022-07-13 Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS Yookyung Shin et.al. 2207.06000 null
2022-07-13 A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System Yi-Chiao Wu et.al. 2207.05913 null
2022-07-12 Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition Rodolfo Zevallos et.al. 2207.05498 null
2022-07-12 End-to-end speech recognition modeling from de-identified data Martin Flechl et.al. 2207.05469 null
2022-07-11 Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data Naoki Makishima et.al. 2207.04659 null
2022-07-11 DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders Yanqing Liu et.al. 2207.04646 null
2023-01-02 Dreamento: an open-source dream engineering toolbox for sleep EEG wearables Mahdad Jafarzadeh Esfahani et.al. 2207.03977 link
2022-07-07 BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus Josh Meyer et.al. 2207.03546 link
2022-07-05 Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion Yi Lei et.al. 2207.01832 null
2022-07-04 BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model Brooke Stephenson et.al. 2207.01718 null
2022-07-04 Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) Ariadna Sanchez et.al. 2207.01547 null
2022-07-04 Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) Ziyao Zhang et.al. 2207.01507 null
2023-03-13 DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech Keon Lee et.al. 2207.01063 link
2022-07-02 Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need Daniel Korzekwa et.al. 2207.00774 null
2022-07-01 Building African Voices Perez Ogayo et.al. 2207.00688 link
2022-07-01 Automatic Evaluation of Speaker Similarity Deja Kamil et.al. 2207.00344 null
2022-08-03 Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding Wei-Ping Huang et.al. 2206.15427 null
2022-06-30 R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS Kyle Kastner et.al. 2206.15276 null
2022-07-01 Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems Hyun-Wook Yoon et.al. 2206.15067 null
2022-06-30 TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder Eunwoo Song et.al. 2206.14984 null
2022-06-29 Improving Deliberation by Text-Only and Semi-Supervised Training Ke Hu et.al. 2206.14716 null
2022-06-29 Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody Peter Makarov et.al. 2206.14643 null
2022-06-28 Expressive, Variable, and Controllable Duration Modelling in TTS Ammar Abbas et.al. 2206.14165 null
2022-06-28 Comparison of Speech Representations for the MOS Prediction System Aki Kunikoshi et.al. 2206.13817 null
2022-06-22 A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data Raviraj Joshi et.al. 2206.13240 null
2022-06-25 Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations Chin-Cheng Hsu et.al. 2206.12662 null
2022-10-21 Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech Florian Lux et.al. 2206.12229 link
2022-06-24 SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech Hyunjae Cho et.al. 2206.12132 null
2022-06-24 End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue Kentaro Mitsui et.al. 2206.12040 null
2022-05-29 Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning Sameea Naeem et.al. 2206.11860 null
2022-06-21 Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS Kenta Udagawa et.al. 2206.10256 null
2022-06-24 Towards Optimizing OCR for Accessibility Peya Mowar et.al. 2206.10254 null
2022-06-16 Automatic Prosody Annotation with Pre-Trained Text-Speech Model Ziqian Dai et.al. 2206.07956 link
2022-11-16 NatiQ: An End-to-end Text-to-Speech System for Arabic Ahmed Abdelali et.al. 2206.07373 null
2022-06-15 Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning Rui Liu et.al. 2206.07229 link
2022-12-12 A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation Junhui Zhang et.al. 2206.04922 null
2022-06-09 Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos Alexander Waibel et.al. 2206.04523 null
2022-06-07 FlexLip: A Controllable Text-to-Lip System Dan Oneata et.al. 2206.03206 null
2022-10-11 UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder Jiachen Lian et.al. 2206.02512 null
2023-10-19 Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech Ziyue Jiang et.al. 2206.02147 link
2022-11-02 AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation Kun Song et.al. 2206.00208 null
2022-05-31 Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish Alp Öktem et.al. 2205.15599 link
2023-11-20 StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis Yinghao Aaron Li et.al. 2205.15439 link
2022-05-30 Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data Sungwon Kim et.al. 2205.15370 null
2022-05-26 QSpeech: Low-Qubit Quantum Speech Application Toolkit Zhenhou Hong et.al. 2205.13221 link
2022-11-10 T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation Paul-Ambroise Duquenne et.al. 2205.12216 null
2022-05-20 PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit Hui Zhang et.al. 2205.12007 link
2022-05-24 TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS Xulong Zhang et.al. 2205.11824 null
2022-10-12 GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Rongjie Huang et.al. 2205.07211 link
2022-05-13 Talking Face Generation with Multilingual TTS Hyoung-Kyu Song et.al. 2205.06421 null
2022-05-10 NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality Xu Tan et.al. 2205.04421 link
2022-05-09 Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech Yang Li et.al. 2205.04120 link
2022-05-09 ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence Sangshin Oh et.al. 2205.04104 null
2022-07-14 Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss Efthymios Georgiou et.al. 2204.13437 null
2022-04-25 SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech Zhenhui Ye et.al. 2204.11792 null
2022-04-22 LibriS2S: A German-English Speech-to-Speech Translation Corpus Pedro Jeuris et.al. 2204.10593 link
2022-07-05 Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation Ryo Terashima et.al. 2204.10020 null
2022-04-21 FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis Rongjie Huang et.al. 2204.09934 link
2022-04-20 Audio Deep Fake Detection System with Neural Stitching for ADD 2022 Rui Yan et.al. 2204.08720 null
2022-04-14 Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech Cong Zhang et.al. 2204.07228 null
2022-12-09 Study of Indian English Pronunciation Variabilities relative to Received Pronunciation Priyanshi Pal et.al. 2204.06502 null
2022-04-12 Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch Hanbin Bae et.al. 2204.05753 null
2023-01-30 The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance Lin Zhang et.al. 2204.05177 null
2022-10-27 Fine-grained Noise Control for Multispeaker Speech Synthesis Karolos Nikitaras et.al. 2204.05070 null
2022-08-31 Karaoker: Alignment-free singing voice synthesis with speech training data Panos Kakoulidis et.al. 2204.04127 null
2022-08-15 Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech Jae-Sung Bae et.al. 2204.04004 null
2022-04-07 Arabic Text-To-Speech (TTS) Data Preparation Hala Al Masri et.al. 2204.03255 null
2022-04-07 Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis Yutian Wang et.al. 2204.03238 null
2022-08-24 SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis Georgia Maniati et.al. 2204.03040 null
2022-09-13 Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation Sravya Popuri et.al. 2204.02967 null
2022-07-02 Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification Jin Woo Lee et.al. 2204.02639 null
2023-08-28 Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech Hyungchan Yoon et.al. 2204.02172 null
2022-09-07 Deliberation Model for On-Device Spoken Language Understanding Duc Le et.al. 2204.01893 null
2022-12-14 Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck Youngsik Eom et.al. 2204.01387 null
2022-11-11 Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis Yixuan Zhou et.al. 2204.00990 null
2022-06-30 VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature Chenpeng Du et.al. 2204.00768 null
2022-04-01 AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios Yihan Wu et.al. 2204.00436 null
2022-04-01 Text-To-Speech Data Augmentation for Low Resource Speech Recognition Rodolfo Zevallos et.al. 2204.00291 null
2022-07-19 Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech Guangyan Zhang et.al. 2203.17190 null
2022-03-31 An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer Wenlin Dai et.al. 2203.16954 link
2022-07-11 WavThruVec: Latent speech representation as intermediate features for neural speech synthesis Hubert Siuzdak et.al. 2203.16930 null
2022-03-31 A Character-level Span-based Model for Mandarin Prosodic Structure Prediction Xueyuan Chen et.al. 2203.16922 link
2022-07-01 JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech Dan Lim et.al. 2203.16852 link
2022-03-31 Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset Zehui Yang et.al. 2203.16844 null
2022-03-31 NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism Jingbei Li et.al. 2203.16838 link
2022-03-31 Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition Anirudh Gupta et.al. 2203.16823 null
2022-04-21 Does Audio Deepfake Detection Generalize? Nicolas M. Müller et.al. 2203.16263 null
2022-03-30 End to End Lip Synchronization with a Temporal AutoEncoder Yoav Shalev et.al. 2203.16224 link
2022-08-15 Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition Junrui Ni et.al. 2203.15796 link
2022-06-29 DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning Takaaki Saeki et.al. 2203.15683 null
2022-11-05 Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation Rendi Chevi et.al. 2203.15643 link
2022-10-06 Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus Minchan Kim et.al. 2203.15447 null
2022-07-11 VoiceMe: Personalized voice generation in TTS Pol van Rijn et.al. 2203.15379 link

(back to top)
