GitHub - liutaocode/TTS-arxiv-daily: Automatically update text-to-speech (TTS) papers daily using GitHub Actions (updates every 12 hours)

Updated on 2024.11.15

Usage instructions: here

This page is modified from here

Table of Contents

TTS

TTS

Publish Date	Title	Authors	PDF	Code
2024-11-12	Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models	Dongrui Han et.al.	2411.07563	null
2024-11-11	Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities	Snehasish Paul Shivali Chauhan et.al.	2411.06970	null
2024-11-10	Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Yiqiao Huang et.al.	2411.06540	null
2024-11-07	CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR	Kadir Burak Buldu et.al.	2411.04671	null
2024-11-04	EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector	Deok-Hyeon Cho et.al.	2411.02625	link
2024-11-09	Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis	Shijia Liao et.al.	2411.01156	link
2024-10-31	Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?	Ioannis Tsiamas et.al.	2410.24019	null
2024-10-30	Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Théodor Lemerle et.al.	2410.23320	link
2024-10-29	Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech	Eric Battenberg et.al.	2410.22179	null
2024-10-29	Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding	Bohan Li et.al.	2410.21951	null
2024-10-29	RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis	Kehan Sui et.al.	2410.21641	null
2024-10-28	Asynchronous Tool Usage for Real-Time Agents	Antonio A. Ginart et.al.	2410.21620	null
2024-10-28	Enhancing TTS Stability in Hebrew using Discrete Semantic Units	Ella Zeldes et.al.	2410.21502	null
2024-10-28	Mitigating Unauthorized Speech Synthesis for Voice Protection	Zhisheng Zhang et.al.	2410.20742	link
2024-10-27	Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation	Maohao Shen et.al.	2410.20336	null
2024-10-24	Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis	Suparna De et.al.	2410.19199	null
2024-10-24	STTATTS: Unified Speech-To-Text And Text-To-Speech Model	Hawau Olamide Toyin et.al.	2410.18607	link
2024-10-24	Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts	ChaeHun Park et.al.	2410.18444	null
2024-10-23	ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams	Srija Anand et.al.	2410.17901	null
2024-10-22	Continuous Speech Tokenizer in Text To Speech	Yixing Li et.al.	2410.17081	null
2024-10-22	Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap	Guanrou Yang et.al.	2410.16726	null
2024-10-21	Continuous Speech Synthesis using per-token Latent Diffusion	Arnon Turetzky et.al.	2410.16048	null
2024-10-18	A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages	Sujitha Sathiyamoorthy et.al.	2410.14197	null
2024-10-18	Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech	Shuwei He et.al.	2410.14101	link
2024-10-17	Enhancing Crowdsourced Audio for Text-to-Speech Models	José Giraldo et.al.	2410.13357	null
2024-10-17	DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech	Jan Melechovsky et.al.	2410.13342	null
2024-10-17	DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2410.13288	null
2024-10-17	Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation	Sreyan Ghosh et.al.	2410.13198	null
2024-10-16	ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs	Rui-Chen Zheng et.al.	2410.12359	null
2024-10-14	IsoChronoMeter: A simple and effective isochronic translation evaluation metric	Nikolai Rozanov et.al.	2410.11127	null
2024-10-14	DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization	Yinghao Aaron Li et.al.	2410.11097	null
2024-10-12	Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling	Rui Liu et.al.	2410.09524	null
2024-10-10	Unsupervised Data Validation Methods for Efficient Model Training	Yurii Paniv et.al.	2410.07880	null
2024-10-15	F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching	Yushen Chen et.al.	2410.06885	link
2024-10-09	Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Teodora Răgman et.al.	2410.06787	null
2024-10-09	Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Onkar Kishor Susladkar et.al.	2410.06608	null
2024-10-09	Can DeepFake Speech be Reliably Detected?	Hongbin Liu et.al.	2410.06572	null
2024-10-07	SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech	Minchan Kim et.al.	2410.04690	null
2024-10-06	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Yuto Nishimura et.al.	2410.04380	null
2024-10-10	SONAR: A Synthetic AI-Audio Detection Framework and Benchmark	Xiang Li et.al.	2410.04324	link
2024-10-05	Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System	Ze Li et.al.	2410.04017	null
2024-10-01	Recent Advances in Speech Language Models: A Survey	Wenqian Cui et.al.	2410.03751	null
2024-10-04	Generative Semantic Communication for Text-to-Speech Synthesis	Jiahao Zheng et.al.	2410.03459	null
2024-10-04	Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Jinzheng Zhao et.al.	2410.03298	null
2024-10-04	Narrative Player: Reviving Data Narratives with Visuals	Zekai Shao et.al.	2410.03268	null
2024-10-04	MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech	Taejun Bak et.al.	2410.03192	null
2024-10-01	Augmentation through Laundering Attacks for Audio Spoof Detection	Hashim Ali et.al.	2410.01108	null
2024-10-01	Zero-Shot Text-to-Speech from Continuous Text Streams	Trung Dang et.al.	2410.00767	null
2024-10-01	EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control	Haozhe Chen et.al.	2410.00316	link
2024-09-30	Word-wise intonation model for cross-language TTS systems	Tomilov A. A. et.al.	2409.20374	null
2024-09-27	Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech	Youngjae Kim et.al.	2409.18622	null
2024-09-26	Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control	Ryuichi Yamamoto et.al.	2409.17452	null
2024-09-25	Exploring synthetic data for cross-speaker style transfer in style representation based TTS	Lucas H. Ueda et.al.	2409.17364	null
2024-09-25	Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions	Kun Zhou et.al.	2409.16681	null
2024-09-25	Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation	Siyin Wang et.al.	2409.16644	null
2024-09-24	FastTalker: Jointly Generating Speech and Conversational Gestures from Text	Zixin Guo et.al.	2409.16404	null
2024-09-24	Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling	Ville Heilala et.al.	2409.16376	null
2024-09-24	Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech	Yunji Chu et.al.	2409.16203	null
2024-09-24	NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers	Nohil Park et.al.	2409.15760	null
2024-09-24	VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance	Jiheum Yeom et.al.	2409.15759	null
2024-09-24	StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Zhiyong Chen et.al.	2409.15741	null
2024-09-23	A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection	Lam Pham et.al.	2409.15180	null
2024-09-23	LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation	Hieu-Thi Luong et.al.	2409.14743	null
2024-09-20	Zero-shot Cross-lingual Voice Transfer for TTS	Fadi Biadsy et.al.	2409.13910	null
2024-09-20	On the Feasibility of Fully AI-automated Vishing Attacks	João Figueiredo et.al.	2409.13793	null
2024-09-19	Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space	Sebastião Quintas et.al.	2409.12745	null
2024-09-19	Preference Alignment Improves Language Model-Based TTS	Jinchuan Tian et.al.	2409.12403	null
2024-09-18	Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference	Edresson Casanova et.al.	2409.12117	null
2024-09-18	Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems	Anusha Prakash et.al.	2409.11915	null
2024-09-18	DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech	Xin Qi et.al.	2409.11835	null
2024-09-18	Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation	Haohan Guo et.al.	2409.11630	null
2024-09-17	SpMis: An Investigation of Synthetic Spoken Misinformation Detection	Peizhuo Liu et.al.	2409.11308	null
2024-09-19	The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives	Samee Arif et.al.	2409.11261	link
2024-09-17	Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora	Francesco Nespoli et.al.	2409.11107	null
2024-09-16	Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization	Xiaoxue Gao et.al.	2409.10157	null
2024-09-16	StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Yinghao Aaron Li et.al.	2409.10058	null
2024-09-15	Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning	Siqi Sun et.al.	2409.09891	null
2024-09-14	E1 TTS: Simple and Fast Non-Autoregressive TTS	Zhijun Liu et.al.	2409.09351	null
2024-09-14	Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation	Changjin Han et.al.	2409.09311	null
2024-09-14	SafeEar: Content Privacy-Preserving Audio Deepfake Detection	Xinfeng Li et.al.	2409.09272	link
2024-09-13	AccentBox: Towards High-Fidelity Zero-Shot Accent Generation	Jinzuomu Zhong et.al.	2409.09098	null
2024-09-17	HLTCOE JHU Submission to the Voice Privacy Challenge 2024	Henry Li Xinyuan et.al.	2409.08913	null
2024-09-13	Text-To-Speech Synthesis In The Wild	Jee-weon Jung et.al.	2409.08711	null
2024-09-14	Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions	Amila Indika et.al.	2409.07945	null
2024-09-12	Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Zhiyuan Tang et.al.	2409.07790	null
2024-09-11	SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Helin Wang et.al.	2409.07556	link
2024-09-11	D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack	Hong-Hanh Nguyen-Le et.al.	2409.07390	null
2024-09-11	Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT	Kazuki Yamauchi et.al.	2409.07265	null
2024-09-11	Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment	Tien-Hong Lo et.al.	2409.07151	null
2024-09-10	Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models	Xin Jing et.al.	2409.06451	null
2024-09-10	What happens to diffusion model likelihood when your model is conditional?	Mattias Cross et.al.	2409.06364	null
2024-09-10	VoiceWukong: Benchmarking Deepfake Voice Detection	Ziwei Yan et.al.	2409.06348	null
2024-09-09	AS-Speech: Adaptive Style For Speech Synthesis	Zhipeng Li et.al.	2409.05730	null
2024-09-09	IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS	Ashwin Sankar et.al.	2409.05356	link
2024-09-10	Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion	Zhengyang Chen et.al.	2409.05004	null
2024-09-01	Sample-Efficient Diffusion for Text-To-Speech Synthesis	Justin Lovelace et.al.	2409.03717	link
2024-09-10	LAST: Language Model Aware Speech Tokenization	Arnon Turetzky et.al.	2409.03701	null
2024-09-05	FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications	Hao-Han Guo et.al.	2409.03283	null
2024-09-04	Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems	Jeongmin Liu et.al.	2409.02517	null
2024-09-03	VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Li-Wei Chen et.al.	2409.01548	null
2024-09-02	A multilingual training strategy for low resource Text to Speech	Asma Amalas et.al.	2409.01217	null
2024-09-02	A Framework for Synthetic Audio Conversations Generation using Large Language Models	Kaung Myat Kyaw et.al.	2409.00946	null
2024-09-02	SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis	Haohan Guo et.al.	2409.00933	link
2024-09-01	MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer	Yuancheng Wang et.al.	2409.00750	link
2024-08-30	SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection	Ismail Rasim Ulgen et.al.	2408.17432	null
2024-08-30	AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge	Kirill Borodin et.al.	2408.17352	null
2024-08-30	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	Zhen Ye et.al.	2408.17175	link
2024-08-30	Utilizing Speaker Profiles for Impersonation Audio Detection	Hao Gu et.al.	2408.17009	null
2024-08-29	Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis	Zehai Tu et.al.	2408.16373	null
2024-08-28	Multi-modal Adversarial Training for Zero-Shot Voice Cloning	John Janiczek et.al.	2408.15916	null
2024-08-29	Easy, Interpretable, Effective: openSMILE for voice deepfake detection	Octavian Pascu et.al.	2408.15775	null
2024-08-28	VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling	Yixuan Zhou et.al.	2408.15676	link
2024-08-28	VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech	Heeseung Kim et.al.	2408.14739	null
2024-08-27	StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech	Haowei Lou et.al.	2408.14713	null
2024-08-27	DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance	Jinhyeok Yang et.al.	2408.14423	null
2024-08-26	Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard	Wonjune Kang et.al.	2408.13970	null
2024-08-28	SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models	Dongchao Yang et.al.	2408.13893	null
2024-08-22	Positional Description for Numerical Normalization	Deepanshu Gupta et.al.	2408.12430	null
2024-08-22	VoiceX: A Text-To-Speech Framework for Custom Voices	Silvan Mertes et.al.	2408.12170	null
2024-08-13	Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation	Yinghao Aaron Li et.al.	2408.11849	null
2024-08-20	EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech	Xin Qi et.al.	2408.10852	null
2024-08-20	SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS	Karl El Hajal et.al.	2408.10771	null
2024-08-20	Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting	Hyun Jin Park et.al.	2408.10463	null
2024-08-17	Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition	Samuele Cornell et.al.	2408.09215	link
2024-08-14	PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation	Sang-Hoon Lee et.al.	2408.07547	link
2024-08-13	SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis	Osamu Take et.al.	2408.06858	link
2024-08-13	PRESENT: Zero-Shot Text-to-Prosody Control	Perry Lam et.al.	2408.06827	link
2024-08-12	FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks	Min Ma et.al.	2408.06227	null
2024-08-11	VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing	Chunyu Qiang et.al.	2408.05758	null
2024-08-06	Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training	Hawraz A. Ahmad et.al.	2408.03887	null
2024-08-03	ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features	Peng Cheng et.al.	2408.01808	link
2024-08-01	Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation	Xinhan Di et.al.	2408.00284	null
2024-07-18	Handling Numeric Expressions in Automatic Speech Recognition	Christian Huber et.al.	2408.00004	null
2024-07-31	On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition	Nick Rossenbach et.al.	2407.21476	null
2024-07-29	Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Mahmoud Salhab et.al.	2407.18571	null
2024-07-25	On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures	Nick Rossenbach et.al.	2407.17997	null
2024-07-24	Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model	Jan Lehečka et.al.	2407.17167	null
2024-07-23	Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments	Pai Zhu et.al.	2407.16840	null
2024-07-19	Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2	Chun Xu et.al.	2407.14212	null
2024-07-18	Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Weiqin Li et.al.	2407.13509	null
2024-07-22	TTSDS -- Text-to-Speech Distribution Score	Christoph Minixhofer et.al.	2407.12707	link
2024-07-17	Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech	Haibin Wu et.al.	2407.12229	link
2024-07-16	A Language Modeling Approach to Diacritic-Free Hebrew TTS	Amit Roth et.al.	2407.12206	null
2024-07-17	Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding	Chuanhao Sun et.al.	2407.09370	link
2024-07-11	Autoregressive Speech Synthesis without Vector Quantization	Lingwei Meng et.al.	2407.08551	null
2024-07-10	Source Tracing of Audio Deepfake Systems	Nicholas Klein et.al.	2407.08016	null
2024-07-07	ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation	Ruibo Fu et.al.	2407.05421	null
2024-07-09	CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Zhihao Du et.al.	2407.05407	null
2024-07-04	Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Cong-Thanh Do et.al.	2407.04047	null
2024-07-04	Optimizing a-DCF for Spoofing-Robust Speaker Verification	Oğuzhan Kurnaz et.al.	2407.04034	null
2024-07-04	On the Effectiveness of Acoustic BPE in Decoder-Only TTS	Bohan Li et.al.	2407.03892	null
2024-07-14	CATT: Character-based Arabic Tashkeel Transformer	Faris Alasmary et.al.	2407.03236	link
2024-07-02	Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Yuchen Hu et.al.	2407.02243	null
2024-07-02	TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations	Xiaoxue Gao et.al.	2407.01927	null
2024-07-01	Lightweight Zero-shot Text-to-Speech with Mixture of Adapters	Kenichi Fujita et.al.	2407.01291	null
2024-06-30	NAIST Simultaneous Speech Translation System for IWSLT 2024	Yuka Ko et.al.	2407.00826	null
2024-06-30	An Attribute Interpolation Method in Speech Synthesis by Model Merging	Masato Murata et.al.	2407.00766	null
2024-06-30	FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis	Yinlin Guo et.al.	2407.00753	null
2024-07-02	Open-Source Conversational AI with SpeechBrain 1.0	Mirco Ravanelli et.al.	2407.00463	null
2024-06-27	Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models	Borodin Kirill Nikolayevich et.al.	2406.19243	null
2024-06-27	DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability	Hyun Joon Park et.al.	2406.19135	link
2024-06-26	Automatic Speech Recognition for Hindi	Anish Saha et.al.	2406.18135	null
2024-06-26	A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons	Tzu-Yun Hung et.al.	2406.18089	null
2024-06-29	LLM-Driven Multimodal Opinion Expression Identification	Bonian Jia et.al.	2406.18088	null
2024-06-26	E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS	Sefik Emre Eskimez et.al.	2406.18009	link
2024-06-25	Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment	Paarth Neekhara et.al.	2406.17957	null
2024-06-22	A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge	Xiaopeng Wang et.al.	2406.17801	null
2024-06-25	High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model	Joun Yeop Lee et.al.	2406.17310	null
2024-06-25	Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation	Yingting Li et.al.	2406.17257	null
2024-06-24	Exploring the Capability of Mamba in Speech Applications	Koichi Miyazaki et.al.	2406.16808	null
2024-06-25	Towards Zero-Shot Text-To-Speech for Arabic Dialects	Khai Duy Doan et.al.	2406.16751	null
2024-06-22	TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers	Yakun Song et.al.	2406.15752	link
2024-06-21	InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions	Yu Nakagome et.al.	2406.14890	null
2024-06-21	GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech	Wenbin Wang et.al.	2406.14875	null
2024-06-21	DASB - Discrete Audio and Speech Benchmark	Pooneh Mousavi et.al.	2406.14294	null
2024-06-18	Instruction Data Generation and Unsupervised Adaptation for Speech Language Models	Vahid Noroozi et.al.	2406.12946	null
2024-06-17	DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer	Keon Lee et.al.	2406.11427	null
2024-06-16	NAST: Noise Aware Speech Tokenization for Speech Language Models	Shoval Messica et.al.	2406.11037	link
2024-06-16	Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis	Xuehao Zhou et.al.	2406.10844	null
2024-06-14	Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice	Shubham Gupta et.al.	2406.10422	null
2024-06-14	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	Dongchao Yang et.al.	2406.10056	link
2024-06-14	MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model	Jiatong Shi et.al.	2406.09869	null
2024-06-13	DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage	Kyra Wang et.al.	2406.08820	null
2024-06-13	Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems	Zhengyang Chen et.al.	2406.08812	null
2024-06-13	DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing	Neha Sahipjohn et.al.	2406.08802	null
2024-06-12	Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis	Wing-Zin Leung et.al.	2406.08568	link
2024-06-12	Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data	Yuma Shirahata et.al.	2406.08111	null
2024-06-12	VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech	Ashishkumar Gudmalwar et.al.	2406.08076	null
2024-06-12	LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning	Masaya Kawamura et.al.	2406.07969	link
2024-06-12	VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Bing Han et.al.	2406.07855	null
2024-06-12	EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech	Deok-Hyeon Cho et.al.	2406.07803	link
2024-06-11	The Interspeech 2024 Challenge on Speech Processing Using Discrete Units	Xuankai Chang et.al.	2406.07725	null
2024-06-11	Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Qingkai Fang et.al.	2406.07289	null
2024-06-11	AudioMarkBench: Benchmarking Robustness of Audio Watermarking	Hongbin Liu et.al.	2406.06979	link
2024-06-11	Controlling Emotion in Text-to-Speech with Natural Language Prompts	Thomas Bott et.al.	2406.06406	link
2024-06-10	Meta Learning Text-to-Speech Synthesis in over 7000 Languages	Florian Lux et.al.	2406.06403	link
2024-06-10	MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance	Semin Kim et.al.	2406.05965	null
2024-06-11	WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark	Linhan Ma et.al.	2406.05763	link
2024-06-09	An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS	Xiaofei Wang et.al.	2406.05699	null
2024-06-11	Text-aware and Context-aware Expressive Audiobook Speech Synthesis	Dake Guo et.al.	2406.05672	null
2024-06-08	Autoregressive Diffusion Transformer for Text-to-Speech Synthesis	Zhijun Liu et.al.	2406.05551	null
2024-06-08	VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Sanyuan Chen et.al.	2406.05370	null
2024-06-07	Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis	Ryan Langman et.al.	2406.05298	null
2024-06-07	XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model	Edresson Casanova et.al.	2406.04904	link
2024-06-07	TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking	Junzuo Zhou et.al.	2406.04840	null
2024-06-07	Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study	Chong Zhang et.al.	2406.04633	null
2024-06-06	Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis	Théodor Lemerle et.al.	2406.04467	link
2024-06-06	Total-Duration-Aware Duration Modeling for Text-to-Speech Systems	Sefik Emre Eskimez et.al.	2406.04281	null
2024-06-06	Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining	Jinlong Xue et.al.	2406.03714	null
2024-06-06	Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model	Jinlong Xue et.al.	2406.03706	null
2024-06-05	Style Mixture of Experts for Expressive Text-To-Speech Synthesis	Ahad Jawaid et.al.	2406.03637	null
2024-06-07	Harder or Different? Understanding Generalization of Audio Deepfake Detection	Nicolas M. Müller et.al.	2406.03512	null
2024-06-05	LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes	Trung Dang et.al.	2406.02897	null
2024-06-04	Seed-TTS: A Family of High-Quality Versatile Speech Generation Models	Philip Anastassiou et.al.	2406.02430	link
2024-06-05	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	Dongchao Yang et.al.	2406.02328	null
2024-06-04	BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation	Hui-Peng Du et.al.	2406.02162	null
2024-06-04	Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis	Kun Zhou et.al.	2406.02009	null
2024-06-03	ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec	Shengpeng Ji et.al.	2406.01205	link
2024-06-03	Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training	Jan Melechovsky et.al.	2406.01018	null
2024-06-02	Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback	Chen Chen et.al.	2406.00654	null
2024-05-31	Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities	Vicky Zayats et.al.	2405.18669	null
2024-05-28	TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	Chenyang Le et.al.	2405.17809	link
2024-05-27	RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis	Haoxiang Shi et.al.	2405.17028	null
2024-05-24	Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition	Zijin Gu et.al.	2405.15216	null
2024-05-23	Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models	Jingyi Chen et.al.	2405.14632	null
2024-05-22	A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction	Yue Li et.al.	2405.13477	null
2024-05-20	Multi-speaker Text-to-speech Training with Speaker Anonymized Data	Wen-Chin Huang et.al.	2405.11767	null
2024-05-19	VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications	Mikhail Konenkov et.al.	2405.11537	null
2024-05-18	Exploring speech style spaces with language models: Emotional TTS without emotion labels	Shreeram Suresh Chandra et.al.	2405.11413	null
2024-05-16	Faces that Speak: Jointly Synthesising Talking Face and Speech from Text	Youngjoon Jang et.al.	2405.10272	null
2024-05-16	Building a Luganda Text-to-Speech Model From Crowdsourced Data	Sulaiman Kagumire et.al.	2405.10211	null
2024-05-16	Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model	Siyang Wang et.al.	2405.09768	null
2024-05-15	Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer	Weifei Jin et.al.	2405.09470	null
2024-05-15	Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis	Sho Inoue et.al.	2405.09171	null
2024-05-14	PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset	Yang Hou et.al.	2405.08838	link
2024-04-30	Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech	Hankun Wang et.al.	2404.19723	null
2024-04-29	MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis	Xiang Li et.al.	2404.18398	null
2024-04-28	USAT: A Universal Speaker-Adaptive Text-to-Speech Approach	Wenbin Wang et.al.	2404.18094	link
2024-04-27	TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality	Tiantian Feng et.al.	2404.17983	null
2024-04-26	An RFP dataset for Real, Fake, and Partially fake audio detection	Abdulazeez AlAli et.al.	2404.17721	null
2024-04-23	StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations	Sen Liu et.al.	2404.14946	null
2024-04-23	Retrieval-Augmented Audio Deepfake Detection	Zuheng Kang et.al.	2404.13892	null
2024-04-14	Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling	Quanxiu Wang et.al.	2404.09192	null
2024-04-11	Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network	Mayura Manawadu et.al.	2404.07807	null
2024-04-18	Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness	Xincan Feng et.al.	2404.06714	link
2024-04-10	CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations	Leying Zhang et.al.	2404.06690	null
2024-04-10	The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge	Yiwei Guo et.al.	2404.06079	null
2024-04-07	Cross-Domain Audio Deepfake Detection: Dataset and Analysis	Yuang Li et.al.	2404.04904	null
2024-04-06	HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks	Yingting Li et.al.	2404.04645	link
2024-04-18	Open vocabulary keyword spotting through transfer learning from speech synthesis	Kesavaraj V et.al.	2404.03914	null
2024-04-06	RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Detai Xin et.al.	2404.03204	null
2024-04-03	CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech	Jaehyeon Kim et.al.	2404.02781	null
2024-04-13	PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders	Yu Pan et.al.	2404.02702	null
2024-03-31	Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation	Rohan Chaudhury et.al.	2404.01339	link
2024-03-28	A Review of Multi-Modal Large Language and Vision Models	Kilian Carolan et.al.	2404.01322	null
2024-04-09	KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis	Adal Abilbekov et.al.	2404.01033	link
2024-03-31	CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Xiang Li et.al.	2404.00569	link
2024-03-25	VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild	Puyuan Peng et.al.	2403.16973	link
2024-03-20	Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning	Shivam Ratnakant Mhaskar et.al.	2403.15469	null
2024-03-20	UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge	Wataru Nakata et.al.	2403.13720	null
2024-03-20	Building speech corpus with diverse voice characteristics for its prompt-based representation	Aya Watanabe et.al.	2403.13353	null
2024-03-17	Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations	Claudio Pinhanez et.al.	2403.11209	null
2024-03-17	EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech	Ziqi Liang et.al.	2403.08164	null
2024-03-09	HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling	Chunhui Wang et.al.	2403.05989	null
2024-03-05	AttentionStitch: How Attention Solves the Speech Editing Problem	Antonios Alexos et.al.	2403.04804	null
2024-03-07	Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation	Sai Akarsh et.al.	2403.04178	null
2024-03-27	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Zeqian Ju et.al.	2403.03100	null
2024-03-04	Brilla AI: AI Contestant for the National Science and Maths Quiz	George Boateng et.al.	2403.01699	link
2024-03-02	Towards Accurate Lip-to-Speech Synthesis in-the-Wild	Sindhu Hegde et.al.	2403.01087	null
2024-02-29	Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data	Takaaki Saeki et.al.	2402.18932	null
2024-02-26	An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation	Ahmet Gunduz et.al.	2402.16380	link
2024-02-22	Efficient data selection employing Semantic Similarity-based Graph Structures for model training	Roxana Petcu et.al.	2402.14888	null
2024-02-22	Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition	Rendi Chevi et.al.	2402.14523	null
2024-02-19	On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models	Miri Varshavsky-Hassid et.al.	2402.12423	null
2024-02-19	Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting	Haolin Chen et.al.	2402.12220	link
2024-02-18	Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru	Zining Wang et.al.	2402.11571	null
2024-02-14	MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech	Shengpeng Ji et.al.	2402.09378	null
2024-02-15	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	Mateusz Łajszczak et.al.	2402.08093	null
2024-03-04	Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like	Naoyuki Kanda et.al.	2402.07383	null
2024-02-09	A New Approach to Voice Authenticity	Nicolas M. Müller et.al.	2402.06304	null
2024-02-08	Unified Speech-Text Pretraining for Spoken Dialog Modeling	Heeseung Kim et.al.	2402.05706	null
2024-02-05	Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations	Álvaro Martín-Cortinas et.al.	2402.03407	null
2024-02-02	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	Dan Lyth et.al.	2402.01912	null
2024-01-23	Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization	Wei-Ping Huang et.al.	2402.01692	null
2024-02-01	Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech	Dong Yang et.al.	2402.00288	null
2024-02-01	PAM: Prompting Audio-Language Models for Audio Quality Assessment	Soham Deshmukh et.al.	2402.00282	link
2024-01-31	Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2	Jiatong Shi et.al.	2401.17619	link
2024-01-28	MunTTS: A Text-to-Speech System for Mundari	Varun Gumma et.al.	2401.15579	null
2024-01-30	VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech	Chenpeng Du et.al.	2401.14321	null
2024-01-25	Text to speech synthesis	Harini s et.al.	2401.13891	null
2024-01-25	SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation	Dong Zhang et.al.	2401.13527	link
2024-01-22	Benchmarking Large Multimodal Models against Common Corruptions	Jiawei Zhang et.al.	2401.11943	link
2024-01-22	Adversarial speech for voice privacy protection from Personalized Speech generation	Shihao Chen et.al.	2401.11857	null
2024-02-16	Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis	Vinotha R et.al.	2401.11771	null
2024-01-19	Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech	Abhinav Garg et.al.	2401.10465	null
2024-02-28	MLAAD: The Multi-Language Audio Anti-Spoofing Dataset	Nicolas M. Müller et.al.	2401.09512	null
2024-01-15	MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory	Robert G. Kimelman et.al.	2401.07967	null
2024-01-14	ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering	Yakun Song et.al.	2401.07333	null
2024-01-12	Multi-Task Learning for Front-End Text Processing in TTS	Wonjune Kang et.al.	2401.06321	link
2024-01-11	End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2	Aniket Tathe et.al.	2401.06183	null
2024-01-11	Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection	Lian Huang et.al.	2401.05614	null
2024-01-10	Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters	Kenichi Fujita et.al.	2401.05111	null
2024-01-07	Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments	Zhonghao Shi et.al.	2401.03581	null
2024-01-07	Transfer the linguistic representations from TTS to accent conversion with non-parallel data	Xi Chen et.al.	2401.03538	null
2024-01-03	Incremental FastPitch: Chunk-based High Quality Text to Speech	Muyang Du et.al.	2401.01755	null
2024-01-03	Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction	Minchan Kim et.al.	2401.01498	null
2023-12-18	Assisting Blind People Using Object Detection with Vocal Feedback	Heba Najm et.al.	2401.01362	null
2023-12-30	Boosting Large Language Model for Speech Synthesis: An Empirical Study	Hongkun Hao et.al.	2401.00246	null
2024-01-01	Normalization of Lithuanian Text Using Regular Expressions	Pijus Kasparaitis et.al.	2312.17660	null
2023-12-27	AE-Flow: AutoEncoder Normalizing Flow	Jakub Mosiński et.al.	2312.16552	null
2023-12-22	Creating New Voices using Normalizing Flows	Piotr Bilinski et.al.	2312.14569	null
2023-12-22	ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations	Cheng Gong et.al.	2312.14398	null
2023-12-19	External Knowledge Augmented Polyphone Disambiguation Using Large Language Model	Chen Li et.al.	2312.11920	null
2023-12-17	A review-based study on different Text-to-Speech technologies	Md. Jalal Uddin Chowdhury et.al.	2312.11563	null
2024-01-31	MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis	Wenhao Guan et.al.	2312.10687	null
2024-02-22	Amphion: An Open-Source Audio, Music and Speech Generation Toolkit	Xueyao Zhang et.al.	2312.09911	link
2023-12-11	Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism	Georgios Milis et.al.	2312.06613	link
2023-12-08	An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis	Via Nielson et.al.	2312.05415	null
2023-12-06	Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis	Zehua Chen et.al.	2312.03491	null
2023-12-02	Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning	Raviraj Joshi et.al.	2312.01107	null
2023-12-02	Code-Mixed Text to Speech Synthesis under Low-Resource Constraints	Raviraj Joshi et.al.	2312.01103	null
2023-11-29	Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes	Pavel Korshunov et.al.	2311.17655	null
2024-02-06	Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech	Enting Zhou et.al.	2311.14816	link
2023-12-07	Guided Flows for Generative Modeling and Decision Making	Qinqing Zheng et.al.	2311.13443	null
2023-11-27	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	Sang-Hoon Lee et.al.	2311.12454	link
2023-11-18	Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots	Farideh Majidi et.al.	2311.11116	null
2023-11-18	Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys	Gabriel Cosache et.al.	2311.11030	null
2023-11-17	A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness	Mathias Vogel et.al.	2311.10804	null
2023-11-16	Improving fairness for spoken language understanding in atypical speech with Text-to-Speech	Helin Wang et.al.	2311.10149	link
2024-02-02	DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation	Jianzong Wang et.al.	2311.07965	null
2023-11-12	ChatAnything: Facetime Chat with LLM-Enhanced Personas	Yilin Zhao et.al.	2311.06772	null
2023-11-11	NewsGPT: ChatGPT Integration for Robot-Reporter	Abdelhadi Hireche et.al.	2311.06640	link
2023-11-08	Synthetic Speaking Children -- Why We Need Them and How to Make Them	Muhammad Ali Farooq et.al.	2311.06307	null
2023-09-25	Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image	Minki Kang et.al.	2311.05844	null
2023-11-07	Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning	Rishabh Jain et.al.	2311.04313	link
2023-11-07	Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment	Jakir Hasan et.al.	2311.03792	null
2023-11-08	Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction	Minchan Kim et.al.	2311.02898	null
2023-11-02	Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations	Hanglei Zhang et.al.	2311.01260	null
2023-11-02	E3 TTS: Easy End-to-End Diffusion-based Text to Speech	Yuan Gao et.al.	2311.00945	null
2023-10-31	An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation	Yingjie Zhou et.al.	2310.20251	link
2023-10-27	Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN	Neeraj Kumar et.al.	2310.18169	null
2023-10-25	ArTST: Arabic Text and Speech Transformer	Hawau Olamide Toyin et.al.	2310.16621	link
2023-10-25	Generative Pre-training for Speech with Flow Matching	Alexander H. Liu et.al.	2310.16338	null
2023-10-23	DPP-TTS: Diversifying prosodic features of speech via determinantal point processes	Seongho Joo et.al.	2310.14663	null
2023-10-22	An overview of text-to-speech systems and media applications	Mohammad Reza Hasanabadi et.al.	2310.14301	null
2023-10-14	Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling	Tiberiu Boros et.al.	2310.09636	link
2023-10-14	Attentive Multi-Layer Perceptron for Non-autoregressive Generation	Shuyang Jiang et.al.	2310.09512	link
2023-12-22	Crowdsourced and Automatic Speech Prominence Estimation	Max Morrison et.al.	2310.08464	link
2023-10-12	On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition	Nick Rossenbach et.al.	2310.08132	null
2023-10-12	Vec-Tok Speech: speech vectorization and tokenization for neural speech generation	Xinfa Zhu et.al.	2310.07246	link
2023-10-10	Prosody Analysis of Audiobooks	Charuta Pethe et.al.	2310.06930	null
2023-10-09	JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions	Detai Xin et.al.	2310.06072	null
2024-01-09	Unified speech and gesture synthesis using flow matching	Shivam Mehta et.al.	2310.05181	null
2023-10-08	Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset	Ze Liu et.al.	2310.04982	null
2023-10-11	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Jiaming Wang et.al.	2310.04673	null
2024-01-22	Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis	Jae-Sung Bae et.al.	2310.03538	null
2023-10-07	The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains	Erica Cooper et.al.	2310.02640	null
2023-10-02	Towards human-like spoken dialogue generation between AI agents from written dialogue	Kentaro Mitsui et.al.	2310.01088	null
2023-10-01	Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech	Dareen Alharthi et.al.	2310.00706	null
2024-03-11	Fewer-token Neural Speech Codec with Time-invariant Codes	Yong Ren et.al.	2310.00014	link
2024-01-31	ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech	Wenhao Guan et.al.	2309.17056	null
2023-09-29	Low-Resource Self-Supervised Learning with SSL-Enhanced TTS	Po-chun Hsu et.al.	2309.17020	null
2023-09-29	Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features	Yuxiang Zhang et.al.	2309.16954	null
2023-12-18	High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models	Chunyu Qiang et.al.	2309.15512	null
2024-01-09	BiSinger: Bilingual Singing Voice Synthesis	Huali Zhou et.al.	2309.14089	link
2023-10-07	HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS	Dake Guo et.al.	2309.13907	null
2023-09-24	VoiceLDM: Text-to-Speech with Environmental Context	Yeonghyeon Lee et.al.	2309.13664	null
2023-09-24	Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control	Aya Watanabe et.al.	2309.13509	null
2023-09-22	DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2309.12792	null
2023-09-22	Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts	Shun Lei et.al.	2309.11977	null
2023-09-21	The Impact of Silence on Speech Anti-Spoofing	Yuxiang Zhang et.al.	2309.11827	null
2023-09-21	Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech	Rui Liu et.al.	2309.11724	link
2023-09-20	Speak While You Think: Streaming Speech Synthesis During Text Generation	Avihu Dekel et.al.	2309.11210	null
2023-09-20	Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model	Xinyu Zhou et.al.	2309.11000	link
2023-09-19	Exploring Speech Enhancement for Low-resource Speech Synthesis	Zhaoheng Ni et.al.	2309.10795	null
2023-09-19	Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition	Ziyang Ma et.al.	2309.10294	null
2023-09-17	Augmenting text for spoken language understanding with Large Language Models	Roshan Sharma et.al.	2309.09390	null
2023-09-16	FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework	Jianzong Wang et.al.	2309.08837	null
2023-09-15	Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech	Dariusz Piotrowski et.al.	2309.08255	null
2023-09-15	HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods	Hyun-seo Shin et.al.	2309.08208	link
2023-12-27	PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions	Reo Shimizu et.al.	2309.08140	null
2023-09-15	Diversity-based core-set selection for text-to-speech with linguistic and acoustic features	Kentaro Seki et.al.	2309.08127	null
2023-09-14	Direct Text to Speech Translation System using Acoustic Units	Victoria Mingote et.al.	2309.07478	null
2023-10-07	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	Zhihao Du et.al.	2309.07405	link
2023-09-13	DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation	Zhichao Wu et.al.	2309.06787	null
2023-09-11	Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP	Jinzuomu Zhong et.al.	2309.05423	link
2024-01-16	VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching	Yiwei Guo et.al.	2309.05027	link
2023-09-08	Cross-Utterance Conditioned VAE for Speech Generation	Yang Li et.al.	2309.04156	null
2023-09-07	Large-Scale Automatic Audiobook Creation	Brendan Walsh et.al.	2309.03926	null
2023-09-11	GRASS: Unified Generation Model for Speech-to-Semantic Tasks	Aobo Xia et.al.	2309.02780	null
2023-09-12	MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023	Zhihang Xu et.al.	2309.02743	null
2023-10-12	PromptTTS 2: Describing and Generating Voices with Text Prompt	Yichong Leng et.al.	2309.02285	null
2023-09-04	A Comparative Analysis of Pretrained Language Models for Text-to-Speech	Marcel Granero-Moya et.al.	2309.01576	null
2023-09-02	DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin	Tao Li et.al.	2309.00883	null
2023-12-18	Learning Speech Representation From Contrastive Token-Acoustic Pretraining	Chunyu Qiang et.al.	2309.00424	null
2023-09-01	The FruitShell French synthesis system at the Blizzard 2023 Challenge	Xin Qi et.al.	2309.00223	null
2023-08-31	QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning	Haohan Guo et.al.	2309.00126	null
2024-01-23	SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models	Xin Zhang et.al.	2308.16692	link
2023-08-31	Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis	Weiqin Li et.al.	2308.16593	null
2023-08-31	Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information	Jie Chen et.al.	2308.16577	null
2023-08-31	LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech	Jie Chen et.al.	2308.16569	null
2023-08-30	CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis	Yi Meng et.al.	2308.16021	null
2023-09-01	The DeepZen Speech Synthesis System for Blizzard Challenge 2023	Christophe Veaux et.al.	2308.15945	null
2023-08-28	Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech	Hyungchan Yoon et.al.	2308.14909	null
2023-09-04	Rep2wav: Noise Robust text-to-speech Using self-supervised representations	Qiushi Zhu et.al.	2308.14553	null
2023-08-28	TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models	Shengpeng Ji et.al.	2308.14430	link
2023-09-02	Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder	Xuyuan Li et.al.	2308.13365	null
2023-08-24	Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations	Wenbin Wang et.al.	2308.13007	null
2023-09-22	Sparks of Large Audio Models: A Survey and Outlook	Siddique Latif et.al.	2308.12792	null
2023-10-25	SeamlessM4T: Massively Multilingual & Multimodal Machine Translation	Seamless Communication et.al.	2308.11596	link
2023-08-31	Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models	Heyang Xue et.al.	2308.10428	null
2023-08-16	AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis	Hrishikesh Viswanath et.al.	2308.08577	null
2023-08-14	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	Xiaofei Wang et.al.	2308.06873	null
2023-08-12	Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation	Zhichao Wang et.al.	2308.06457	link
2023-09-09	AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining	Haohe Liu et.al.	2308.05734	link
2023-08-09	Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay	Leixian Shen et.al.	2308.04703	null
2023-08-08	Towards an AI to Win Ghana's National Science and Maths Quiz	George Boateng et.al.	2308.04333	link
2023-08-08	WonderFlow: Narration-Centric Design of Animated Data Videos	Yun Wang et.al.	2308.04040	null
2023-08-04	Let's Give a Voice to Conversational Agents in Virtual Reality	Michele Yin et.al.	2308.02665	link
2023-08-03	Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation	Minsu Kim et.al.	2308.01831	link
2023-08-02	SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Ramanan Sivaguru et.al.	2308.01018	null
2023-07-07	Artificial Eye for the Blind	Abhinav Benagi et.al.	2308.00801	null
2023-07-31	Multilingual context-based pronunciation learning for Text-to-Speech	Giulia Comini et.al.	2307.16709	null
2023-07-31	Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Guangyan Zhang et.al.	2307.16679	null
2023-07-31	Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings	Manuel Sam Ribeiro et.al.	2307.16643	null
2023-07-31	DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training	Hyung-Seok Oh et.al.	2307.16549	link
2023-07-31	VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design	Jungil Kong et.al.	2307.16430	null
2023-07-30	Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation	Yuanhao Chen et.al.	2307.16199	link
2023-07-29	METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer	Xinfa Zhu et.al.	2307.15951	null
2023-12-18	Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding	Chunyu Qiang et.al.	2307.15484	null
2023-07-20	SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer	Daegyeom Kim et.al.	2307.10550	link
2023-07-18	SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Yinghao Aaron Li et.al.	2307.09435	null
2023-09-28	Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts	Ziyue Jiang et.al.	2307.07218	null
2023-07-13	Controllable Emphasis with zero data for text-to-speech	Arnaud Joly et.al.	2307.07062	null
2023-07-11	On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis	Siyang Wang et.al.	2307.05132	null
2023-07-10	The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task	Kun Song et.al.	2307.04630	null
2023-10-07	ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading	Yujia Xiao et.al.	2307.00782	null
2023-06-28	EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech	Daria Diatlova et.al.	2307.00024	link
2023-06-29	High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Junchen Lu et.al.	2306.17005	null
2023-06-28	UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data	Heeseung Kim et.al.	2306.16083	link
2023-10-19	Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale	Matthew Le et.al.	2306.15687	null
2023-06-27	GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech	Yahuan Cong et.al.	2306.15304	null
2023-06-25	DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech	Sen Liu et.al.	2306.14145	null
2023-06-21	Visual-Aware Text-to-Speech	Mohan Zhou et.al.	2306.12020	null
2023-06-21	Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer	Jakub Swiatkowski et.al.	2306.11662	null
2023-06-16	Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation	Kishor Kayyar Lakshminarayana et.al.	2306.10152	null
2023-06-16	CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages	Frederico S. Oliveira et.al.	2306.10097	null
2023-06-14	Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation	Zheng Liang et.al.	2306.08588	null
2023-06-14	Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects	Xinghua Qu et.al.	2306.08219	link
2023-11-20	StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models	Yinghao Aaron Li et.al.	2306.07691	null
2024-01-18	UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding	Chenpeng Du et.al.	2306.07547	null
2023-06-13	PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling	Ji-Sang Hwang et.al.	2306.07489	null
2023-06-09	Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech	Shijun Wang et.al.	2306.05709	null
2023-06-08	VIFS: An End-to-End Variational Inference for Foley Sound Synthesis	Junhyeok Lee et.al.	2306.05004	link
2023-07-11	Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge	Wenhao Guan et.al.	2306.04301	null
2023-06-06	Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias	Ziyue Jiang et.al.	2306.03509	null
2023-08-02	Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis	Zhenhui Ye et.al.	2306.03504	null
2023-06-05	Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis	Dengfeng Ke et.al.	2306.02593	null
2023-06-05	Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model	Hoyeon Lee et.al.	2306.02579	null
2023-06-05	Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming	Xinlei Niu et.al.	2306.02568	link
2023-06-02	Towards Robust FastSpeech 2 by Modelling Residual Multimodality	Fabian Kögel et.al.	2306.01442	link
2023-05-30	Towards Selection of Text-to-speech Data to Augment ASR Training	Shuo Liu et.al.	2306.00998	null
2023-06-01	EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis	Haobin Tang et.al.	2306.00648	null
2023-06-01	The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech	Phat Do et.al.	2306.00535	null
2023-05-31	Text-to-Speech Pipeline for Swiss German -- A comparison	Tobias Bollinger et.al.	2305.19750	null
2023-05-31	XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech	Linh The Nguyen et.al.	2305.19709	link
2023-06-01	PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions	Guanghou Liu et.al.	2305.19522	null
2023-05-30	Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages	Phat Do et.al.	2305.19396	null
2023-05-30	Make-A-Voice: Unified Voice Synthesis With Discrete Representation	Rongjie Huang et.al.	2305.19269	null
2023-05-30	STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions	Michel Plüss et.al.	2305.18855	null
2023-05-30	LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus	Yuma Koizumi et.al.	2305.18802	null
2023-10-09	An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization	Fei Kong et.al.	2305.18355	link
2023-05-29	ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation	Ambuj Mehrish et.al.	2305.18028	link
2023-05-29	Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis	Erik Ekstedt et.al.	2305.17971	null
2023-07-25	StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation	Kun Song et.al.	2305.17732	null
2023-05-28	Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS	Sewade Ogun et.al.	2305.17724	link
2023-07-19	Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing	Julia Kaiwen Lau et.al.	2305.17445	link
2023-05-26	DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction	Vineet Bhat et.al.	2305.16957	null
2023-05-25	Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion	Rui Liu et.al.	2305.16353	link
2023-05-22	Text Generation with Speech Synthesis for ASR Data Augmentation	Zhuangqun Huang et.al.	2305.16333	null
2023-05-25	VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	Tianrui Wang et.al.	2305.16107	null
2023-05-25	Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration	Rustem Yeshpanov et.al.	2305.15749	link
2024-02-05	LAraBench: Benchmarking Arabic AI with Large Language Models	Ahmed Abdelali et.al.	2305.14982	null
2023-05-23	EfficientSpeech: An On-Device Text to Speech Model	Rowel Atienza et.al.	2305.13905	link
2023-05-23	ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	Minki Kang et.al.	2305.13831	null
2023-05-22	U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech	Xin Jing et.al.	2305.13195	null
2023-05-25	EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels	Kari Ali Noriy et.al.	2305.13137	link
2023-05-22	ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer	Huadai Liu et.al.	2305.12708	null
2023-05-21	VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	Shivam Mhaskar et.al.	2305.12518	null
2023-05-26	Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus	Detai Xin et.al.	2305.12442	link
2023-05-20	ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios	Yuyue Wang et.al.	2305.12200	null
2023-05-19	MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting	Neil Shah et.al.	2305.11926	null
2024-02-20	Data Redaction from Conditional Generative Models	Zhifeng Kong et.al.	2305.11351	null
2023-05-18	Parameter-Efficient Learning for Text-to-Speech Accent Adaptation	Li-Jen Yang et.al.	2305.11320	link
2023-05-19	Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation	Martijn Bartelds et.al.	2305.10951	link
2023-09-30	Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data	Yusheng Tian et.al.	2305.10891	link
2023-05-18	FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs	Won Jang et.al.	2305.10823	null
2023-05-18	CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training	Zhenhui Ye et.al.	2305.10763	null
2023-08-29	A unified front-end framework for English text-to-speech synthesis	Zelin Ying et.al.	2305.10666	null
2023-09-19	Controllable Speaking Styles Using a Large Language Model	Atli Thor Sigurgeirsson et.al.	2305.10321	null
2023-05-23	Better speech synthesis through scaling	James Betker et.al.	2305.07243	link
2023-10-29	CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model	Zhen Ye et.al.	2305.06908	link
2023-05-08	Accented Text-to-Speech Synthesis with Limited Data	Xuehao Zhou et.al.	2305.04816	null
2023-05-03	M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis	Jinlong Xue et.al.	2305.02269	null
2023-05-30	A Review of Deep Learning Techniques for Speech Processing	Ambuj Mehrish et.al.	2305.00359	null
2023-04-26	Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis	Ye-Xin Lu et.al.	2304.13270	null
2023-04-25	Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge	Chenpeng Du et.al.	2304.13121	null
2023-04-24	Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Kenichi Fujita et.al.	2304.11976	null
2023-04-23	DiffVoice: Text-to-Speech with Latent Diffusion	Zhijun Liu et.al.	2304.11750	null
2023-04-23	SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model	Jianzong Wang et.al.	2304.11547	null
2023-05-30	NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	Kai Shen et.al.	2304.09116	null
2023-04-16	A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers	Juan Zuluaga-Gomez et.al.	2304.07842	null
2023-04-13	Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis	Shun Lei et.al.	2304.06359	null
2023-04-10	Enhancing Speech-to-Speech Translation with Multiple TTS Targets	Jiatong Shi et.al.	2304.04618	null
2023-04-07	ArmanTTS single-speaker Persian dataset	Mohammd Hasan Shamgholi et.al.	2304.03585	null
2023-04-03	Ensemble prosody prediction for expressive speech synthesis	Tian Huey Teh et.al.	2304.00714	null
2023-03-29	AraSpot: Arabic Spoken Command Spotting	Mahmoud Salhab et.al.	2303.16621	link
2023-03-28	Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages	Seongyeon Park et.al.	2303.15669	link
2023-03-27	Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis	Karren Yang et.al.	2303.14885	null
2023-03-24	Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis	Takuhiro Kaneko et.al.	2303.13909	null
2023-04-02	A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI	Chenshuang Zhang et.al.	2303.13336	null
2023-03-20	Code-Switching Text Generation and Injection in Mandarin-English ASR	Haibin Yu et.al.	2303.10949	null
2023-03-14	Controlling High-Dimensional Data With Sparse Input	Dan Andrei Iliescu et.al.	2303.09446	null
2023-03-09	Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports	Hyunseung Chung et.al.	2303.09395	link
2023-03-15	Cross-speaker Emotion Transfer by Manipulating Speech Style Latents	Suhee Jo et.al.	2303.08329	null
2023-03-14	QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis	Haobin Tang et.al.	2303.07682	null
2023-03-10	An End-to-End Neural Network for Image-to-Audio Transformation	Liu Chen et.al.	2303.06078	null
2023-03-09	Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation	Qi Chen et.al.	2303.05322	link
2023-03-07	Do Prosody Transfer Models Transfer Prosody?	Atli Thor Sigurgeirsson et.al.	2303.04289	null
2023-03-07	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Ziqiang Zhang et.al.	2303.03926	null
2023-03-02	Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding	Yingting Li et.al.	2303.03267	link
2023-03-08	FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model	Ruiqing Xue et.al.	2303.02939	null
2023-08-14	Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations	Yuma Koizumi et.al.	2303.01664	null
2023-03-11	Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities	Shijun Wang et.al.	2303.01508	null
2023-12-17	ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations	Neil Shah et.al.	2303.01261	null
2023-03-02	LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion	Chunfeng Wang et.al.	2303.01086	null
2023-03-02	Leveraging Large Text Corpora for End-to-End Speech Summarization	Kohei Matsuura et.al.	2303.00978	null
2023-03-01	DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction	Raviteja Anantha et.al.	2303.00171	null
2023-02-28	ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus	Ajinkya Kulkarni et.al.	2303.00069	null
2023-02-28	Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners	Jocelyn Huang et.al.	2302.14523	null
2023-06-12	CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis	Ji-Hoon Kim et.al.	2302.14370	null
2023-05-19	UniFLG: Unified Facial Landmark Generator from Text or Speech	Kentaro Mitsui et.al.	2302.14337	null
2023-02-27	Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech	Jiyoung Lee et.al.	2302.13700	link
2023-02-27	Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech	Dong Yang et.al.	2302.13652	null
2023-02-27	Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow	Yoonhyung Lee et.al.	2302.13458	null
2023-06-06	PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS	Junhyeok Lee et.al.	2302.12391	link
2023-02-21	Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition	Leyuan Qu et.al.	2302.09723	null
2023-02-23	QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion	Houjian Guo et.al.	2302.08296	link
2023-02-13	Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages	Sudhanshu Srivastava et.al.	2302.06227	null
2023-02-08	A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech	Li-Wei Chen et.al.	2302.04215	link
2023-02-07	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	Eugene Kharitonov et.al.	2302.03540	null
2023-02-15	MAC: A unified framework boosting low resource automatic speech recognition	Zeping Min et.al.	2302.03498	null
2023-06-25	InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt	Dongchao Yang et.al.	2301.13662	link
2023-03-01	UzbekTagger: The rule-based POS tagger for Uzbek language	Maksud Sharipov et.al.	2301.12711	null
2023-05-27	Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining	Takaaki Saeki et.al.	2301.12596	link
2023-01-31	Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker	Navjot Kaur et.al.	2301.12331	link
2023-01-26	On granularity of prosodic representations in expressive text-to-speech	Mikolaj Babianski et.al.	2301.11446	null
2023-01-26	Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study	Massa Baali et.al.	2301.09099	link
2023-01-20	Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions	Yinghao Aaron Li et.al.	2301.08810	null
2023-01-11	Modelling low-resource accents without accent-specific TTS frontend	Georgi Tinchev et.al.	2301.04606	null
2022-12-11	BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm	Yu-Wen Chen et.al.	2301.04120	link
2023-01-10	UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion	Haogeng Liu et.al.	2301.03801	null
2023-01-10	Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation	Abdullah Shahid et.al.	2301.03751	null
2023-09-19	Applying Automated Machine Translation to Educational Video Courses	Linden Wang et.al.	2301.03141	null
2023-01-06	Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition	David M. Chan et.al.	2301.02736	null
2023-01-05	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Chengyi Wang et.al.	2301.02111	link
2022-12-11	MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset	Kailin Liang et.al.	2301.00657	link
2022-12-30	ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech	Zehua Chen et.al.	2212.14518	null
2022-12-29	StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models	Yinghao Aaron Li et.al.	2212.14227	link
2022-12-22	HMM-based data augmentation for E2E systems for building conversational speech synthesis systems	Ishika Gupta et.al.	2212.11982	null
2022-12-21	ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Wei-Ning Hsu et.al.	2212.11377	null
2022-12-20	TTS-Guided Training for Accent Conversion Without Parallel Data	Yi Zhou et.al.	2212.10204	null
2023-06-28	Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling	Tuomo Raitio et.al.	2212.10075	null
2022-12-16	Speech Aware Dialog System Technology Challenge (DSTC11)	Hagen Soltau et.al.	2212.08704	null
2022-12-16	Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder	Yusuke Yasuda et.al.	2212.08329	null
2022-12-16	Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Yusuke Yasuda et.al.	2212.08321	null
2022-12-15	RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis	Shinhyeok Oh et.al.	2212.07939	link
2022-12-14	Probing Deep Speaker Embeddings for Speaker-related Tasks	Zifeng Zhao et.al.	2212.07068	null
2022-12-08	SpeechLMScore: Evaluating speech generation using speech language model	Soumi Maiti et.al.	2212.04559	link
2023-04-04	Learning to Dub Movies via Hierarchical Prosody Models	Gaoxiang Cong et.al.	2212.04054	link
2022-12-07	Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning	Ankur Debnath et.al.	2212.03558	null
2022-12-07	Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue	Daxin Tan et.al.	2212.03398	null
2022-12-06	UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis	Yi Lei et.al.	2212.01546	null
2022-11-30	SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech	Byoung Jin Choi et.al.	2211.16866	null
2022-11-29	Controllable speech synthesis by learning discrete phoneme-level prosodic representations	Nikolaos Ellinas et.al.	2211.16307	null
2023-05-25	Evaluating and reducing the distance between synthetic and real speech distributions	Christoph Minixhofer et.al.	2211.16049	null
2022-11-26	Contextual Expressive Text-to-Speech	Jianhong Tu et.al.	2211.14548	null
2022-12-05	Efficient Incremental Text-to-Speech on GPUs	Muyang Du et.al.	2211.13939	null
2023-03-21	Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?	Xuan Shi et.al.	2211.13868	link
2022-11-23	IMaSC -- ICFOSS Malayalam Speech Corpus	Deepa P Gopinath et.al.	2211.12796	null
2022-11-22	PromptTTS: Controllable Text-to-Speech with Text Descriptions	Zhifang Guo et.al.	2211.12171	null
2022-11-04	Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech	Xin Zhang et.al.	2211.09731	null
2023-02-17	Towards Building Text-To-Speech Systems for the Next Billion Users	Gokul Karthik Kumar et.al.	2211.09536	link
2023-02-16	EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance	Yiwei Guo et.al.	2211.09496	null
2022-11-17	Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation	Chunyu Qiang et.al.	2211.09495	null
2022-11-17	NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis	Hyeong-Seok Choi et.al.	2211.09407	null
2023-03-14	Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Minki Kang et.al.	2211.09383	null
2023-01-04	Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation	Xin Yuan et.al.	2211.09365	null
2022-11-14	SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech	Perry Lam et.al.	2211.07283	null
2023-05-24	Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing	Jacob J Webber et.al.	2211.06989	null
2023-05-29	OverFlow: Putting flows on top of neural transducers for better TTS	Shivam Mehta et.al.	2211.06892	link
2023-05-29	Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations	Yoori Oh et.al.	2211.06160	null
2022-12-04	ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Xiaoran Fan et.al.	2211.03545	link
2022-11-07	Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder	Jan Melechovsky et.al.	2211.03316	link
2022-11-06	Parallel Attention Forcing for Machine Translation	Qingyun Dou et.al.	2211.03237	null
2022-11-06	An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space	Jihwan Lee et.al.	2211.03078	null
2022-11-04	NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS	Dongchao Yang et.al.	2211.02448	null
2022-11-04	Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts	Detai Xin et.al.	2211.02336	null
2023-04-16	Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS	Ziqi Liang et.al.	2211.01948	null
2022-11-01	Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages	Anusha Prakash et.al.	2211.01338	null
2023-05-28	DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP	Kun Song et.al.	2211.01087	null
2022-11-22	Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement	Wei Song et.al.	2211.00967	null
2022-11-01	Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers	Cheng-Ping Hsieh et.al.	2211.00585	link
2023-06-11	Generating Multilingual Gender-Ambiguous Text-to-Speech Voices	Konstantinos Markopoulos et.al.	2211.00375	null
2023-05-07	Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features	Alexandra Vioni et.al.	2211.00342	null
2022-11-02	Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS	Kun Song et.al.	2210.17349	null
2024-02-27	Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation	Nikolaos Ellinas et.al.	2210.17264	null
2022-10-31	Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection	Luigi Attorresi et.al.	2210.17222	null
2022-10-31	Structured State Space Decoder for Speech Recognition and Synthesis	Koichi Miyazaki et.al.	2210.17098	null
2022-10-28	Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders	Jason Fong et.al.	2210.16045	null
2023-02-21	Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform	Masaya Kawamura et.al.	2210.15975	link
2023-02-22	Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis	Yuma Shirahata et.al.	2210.15964	null
2022-10-28	Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation	Nobuyuki Morioka et.al.	2210.15868	null
2023-03-15	Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Takaaki Saeki et.al.	2210.15447	null
2022-10-27	Explicit Intensity Control for Accented Text-to-speech	Rui Liu et.al.	2210.15364	null
2022-10-27	FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis	Yifan Hu et.al.	2210.15360	link
2022-10-26	Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection	Kentaro Seki et.al.	2210.14850	null
2022-10-25	Semi-Supervised Learning Based on Reference Model for Low-resource TTS	Xulong Zhang et.al.	2210.14723	null
2022-10-26	Cover Reproducible Steganography via Deep Generative Models	Kejiang Chen et.al.	2210.14632	null
2022-10-26	Improving Speech-to-Speech Translation Through Unlabeled Text	Xuan-Phi Nguyen et.al.	2210.14514	null
2022-10-26	The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge	Yuhao Liang et.al.	2210.14448	null
2022-10-25	Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data	Xulong Zhang et.al.	2210.13803	null
2023-09-17	HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation	Chunhui Wang et.al.	2210.12740	null
2022-10-21	Low-Resource Multilingual and Zero-Shot Multispeaker TTS	Florian Lux et.al.	2210.12223	link
2022-10-21	Adaptive re-calibration of channel-wise features for Adversarial Audio Classification	Vardhan Dongre et.al.	2210.11722	null
2022-10-20	Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS	Chunyu Qiang et.al.	2210.11429	null
2022-10-17	Towards Relation Extraction From Speech	Tongtong Wu et.al.	2210.08759	link
2023-02-08	Generating Synthetic Speech from SpokenVocab for Speech Translation	Jinming Zhao et.al.	2210.08174	link
2022-10-17	LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge	Yan Jia et.al.	2210.07749	null
2022-10-20	Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy	Sarina Meyer et.al.	2210.07002	link
2022-10-13	Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar	Aolan Sun et.al.	2210.06877	null
2022-10-12	Can we use Common Voice to train a Multi-Speaker TTS system?	Sewade Ogun et.al.	2210.06370	null
2023-06-01	SQuId: Measuring Speech Naturalness in Many Languages	Thibault Sellam et.al.	2210.06324	null
2022-11-22	Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech	Byoung Jin Choi et.al.	2210.05979	null
2022-10-06	An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era	Andreas Triantafyllopoulos et.al.	2210.03538	null
2022-09-29	Facial Landmark Predictions with Applications to Metaverse	Qiao Han et.al.	2209.14698	link
2022-09-26	Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech	Yusuke Nakai et.al.	2209.12549	null
2022-09-22	EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models	Perry Lam et.al.	2209.10890	null
2022-09-22	MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline	Yifan Hu et.al.	2209.10848	link
2022-09-22	Controllable Accented Text-to-Speech Synthesis	Rui Liu et.al.	2209.10804	null
2022-09-16	TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection	Davide Salvi et.al.	2209.08000	null
2022-09-14	Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset	Michael Chinen et.al.	2209.06358	null
2022-09-08	SANIP: Shopping Assistant and Navigation for the visually impaired	Shubham Deshmukh et.al.	2209.03570	null
2022-09-07	Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech	Huu-Tien Dang et.al.	2209.02971	null
2022-09-02	Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model	Jennifer Drexler Fox et.al.	2209.01250	null
2022-08-28	Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks	Lev Finkelstein et.al.	2208.13183	null
2022-10-04	Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale	Aditya Agarwal et.al.	2208.09796	null
2022-08-21	Visualising Model Training via Vowel Space for Text-To-Speech Systems	Binu Abeysinghe et.al.	2208.09775	link
2022-08-15	Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0	Mohammed Salah Al-Radhi et.al.	2208.07122	null
2022-12-28	Speech Synthesis with Mixed Emotions	Kun Zhou et.al.	2208.05890	null
2022-08-03	A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis	Qibing Bai et.al.	2208.02189	null
2022-07-29	Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation	Giulia Comini et.al.	2207.14607	null
2022-07-25	Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis	Raul Fernandez et.al.	2207.12262	null
2022-07-01	A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese	Song Zhang et.al.	2207.12089	null
2022-07-20	When Is TTS Augmentation Through a Pivot Language Useful?	Nathaniel Robinson et.al.	2207.09889	link
2022-07-11	LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech	Harshvardhan Anand et.al.	2207.07118	null
2022-07-13	ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech	Rongjie Huang et.al.	2207.06389	link
2022-07-13	Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech	Zhengxi Liu et.al.	2207.06088	null
2022-07-13	SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate	Nabarun Goswami et.al.	2207.06011	null
2022-07-13	Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS	Yookyung Shin et.al.	2207.06000	null
2022-07-13	A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System	Yi-Chiao Wu et.al.	2207.05913	null
2022-07-12	Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition	Rodolfo Zevallos et.al.	2207.05498	null
2022-07-12	End-to-end speech recognition modeling from de-identified data	Martin Flechl et.al.	2207.05469	null
2022-07-11	Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data	Naoki Makishima et.al.	2207.04659	null
2022-07-11	DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders	Yanqing Liu et.al.	2207.04646	null
2023-01-02	Dreamento: an open-source dream engineering toolbox for sleep EEG wearables	Mahdad Jafarzadeh Esfahani et.al.	2207.03977	link
2022-07-07	BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus	Josh Meyer et.al.	2207.03546	link
2022-07-05	Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion	Yi Lei et.al.	2207.01832	null
2022-07-04	BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model	Brooke Stephenson et.al.	2207.01718	null
2022-07-04	Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)	Ariadna Sanchez et.al.	2207.01547	null
2022-07-04	Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)	Ziyao Zhang et.al.	2207.01507	null
2023-03-13	DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech	Keon Lee et.al.	2207.01063	link
2022-07-02	Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need	Daniel Korzekwa et.al.	2207.00774	null
2022-07-01	Building African Voices	Perez Ogayo et.al.	2207.00688	link
2022-07-01	Automatic Evaluation of Speaker Similarity	Deja Kamil et.al.	2207.00344	null
2022-08-03	Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding	Wei-Ping Huang et.al.	2206.15427	null
2022-06-30	R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Kyle Kastner et.al.	2206.15276	null
2022-07-01	Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems	Hyun-Wook Yoon et.al.	2206.15067	null
2022-06-30	TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder	Eunwoo Song et.al.	2206.14984	null
2022-06-29	Improving Deliberation by Text-Only and Semi-Supervised Training	Ke Hu et.al.	2206.14716	null
2022-06-29	Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody	Peter Makarov et.al.	2206.14643	null
2022-06-28	Expressive, Variable, and Controllable Duration Modelling in TTS	Ammar Abbas et.al.	2206.14165	null
2022-06-28	Comparison of Speech Representations for the MOS Prediction System	Aki Kunikoshi et.al.	2206.13817	null
2022-06-22	A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data	Raviraj Joshi et.al.	2206.13240	null
2022-06-25	Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations	Chin-Cheng Hsu et.al.	2206.12662	null
2022-10-21	Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech	Florian Lux et.al.	2206.12229	link
2022-06-24	SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech	Hyunjae Cho et.al.	2206.12132	null
2022-06-24	End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue	Kentaro Mitsui et.al.	2206.12040	null
2022-05-29	Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning	Sameea Naeem et.al.	2206.11860	null
2022-06-21	Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS	Kenta Udagawa et.al.	2206.10256	null
2022-06-24	Towards Optimizing OCR for Accessibility	Peya Mowar et.al.	2206.10254	null
2022-06-16	Automatic Prosody Annotation with Pre-Trained Text-Speech Model	Ziqian Dai et.al.	2206.07956	link
2022-11-16	NatiQ: An End-to-end Text-to-Speech System for Arabic	Ahmed Abdelali et.al.	2206.07373	null
2022-06-15	Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning	Rui Liu et.al.	2206.07229	link
2022-12-12	A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation	Junhui Zhang et.al.	2206.04922	null
2022-06-09	Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos	Alexander Waibel et.al.	2206.04523	null
2022-06-07	FlexLip: A Controllable Text-to-Lip System	Dan Oneata et.al.	2206.03206	null
2022-10-11	UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder	Jiachen Lian et.al.	2206.02512	null
2023-10-19	Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech	Ziyue Jiang et.al.	2206.02147	link
2022-11-02	AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation	Kun Song et.al.	2206.00208	null
2022-05-31	Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish	Alp Öktem et.al.	2205.15599	link
2023-11-20	StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis	Yinghao Aaron Li et.al.	2205.15439	link
2022-05-30	Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data	Sungwon Kim et.al.	2205.15370	null
2022-05-26	QSpeech: Low-Qubit Quantum Speech Application Toolkit	Zhenhou Hong et.al.	2205.13221	link
2022-11-10	T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation	Paul-Ambroise Duquenne et.al.	2205.12216	null
2022-05-20	PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit	Hui Zhang et.al.	2205.12007	link
2022-05-24	TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS	Xulong Zhang et.al.	2205.11824	null
2022-10-12	GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech	Rongjie Huang et.al.	2205.07211	link
2022-05-13	Talking Face Generation with Multilingual TTS	Hyoung-Kyu Song et.al.	2205.06421	null
2022-05-10	NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	Xu Tan et.al.	2205.04421	link
2022-05-09	Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech	Yang Li et.al.	2205.04120	link
2022-05-09	ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	Sangshin Oh et.al.	2205.04104	null
2022-07-14	Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss	Efthymios Georgiou et.al.	2204.13437	null
2022-04-25	SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech	Zhenhui Ye et.al.	2204.11792	null
2022-04-22	LibriS2S: A German-English Speech-to-Speech Translation Corpus	Pedro Jeuris et.al.	2204.10593	link
2022-07-05	Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation	Ryo Terashima et.al.	2204.10020	null
2022-04-21	FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis	Rongjie Huang et.al.	2204.09934	link
2022-04-20	Audio Deep Fake Detection System with Neural Stitching for ADD 2022	Rui Yan et.al.	2204.08720	null
2022-04-14	Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech	Cong Zhang et.al.	2204.07228	null
2022-12-09	Study of Indian English Pronunciation Variabilities relative to Received Pronunciation	Priyanshi Pal et.al.	2204.06502	null
2022-04-12	Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch	Hanbin Bae et.al.	2204.05753	null
2023-01-30	The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance	Lin Zhang et.al.	2204.05177	null
2022-10-27	Fine-grained Noise Control for Multispeaker Speech Synthesis	Karolos Nikitaras et.al.	2204.05070	null
2022-08-31	Karaoker: Alignment-free singing voice synthesis with speech training data	Panos Kakoulidis et.al.	2204.04127	null
2022-08-15	Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech	Jae-Sung Bae et.al.	2204.04004	null
2022-04-07	Arabic Text-To-Speech (TTS) Data Preparation	Hala Al Masri et.al.	2204.03255	null
2022-04-07	Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis	Yutian Wang et.al.	2204.03238	null
2022-08-24	SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis	Georgia Maniati et.al.	2204.03040	null
2022-09-13	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	Sravya Popuri et.al.	2204.02967	null
2022-07-02	Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification	Jin Woo Lee et.al.	2204.02639	null
2023-08-28	Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech	Hyungchan Yoon et.al.	2204.02172	null
2022-09-07	Deliberation Model for On-Device Spoken Language Understanding	Duc Le et.al.	2204.01893	null
2022-12-14	Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck	Youngsik Eom et.al.	2204.01387	null
2022-11-11	Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis	Yixuan Zhou et.al.	2204.00990	null
2022-06-30	VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Chenpeng Du et.al.	2204.00768	null
2022-04-01	AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios	Yihan Wu et.al.	2204.00436	null
2022-04-01	Text-To-Speech Data Augmentation for Low Resource Speech Recognition	Rodolfo Zevallos et.al.	2204.00291	null
2022-07-19	Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech	Guangyan Zhang et.al.	2203.17190	null
2022-03-31	An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer	Wenlin Dai et.al.	2203.16954	link
2022-07-11	WavThruVec: Latent speech representation as intermediate features for neural speech synthesis	Hubert Siuzdak et.al.	2203.16930	null
2022-03-31	A Character-level Span-based Model for Mandarin Prosodic Structure Prediction	Xueyuan Chen et.al.	2203.16922	link
2022-07-01	JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech	Dan Lim et.al.	2203.16852	link
2022-03-31	Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset	Zehui Yang et.al.	2203.16844	null
2022-03-31	NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism	Jingbei Li et.al.	2203.16838	link
2022-03-31	Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition	Anirudh Gupta et.al.	2203.16823	null
2022-04-21	Does Audio Deepfake Detection Generalize?	Nicolas M. Müller et.al.	2203.16263	null
2022-03-30	End to End Lip Synchronization with a Temporal AutoEncoder	Yoav Shalev et.al.	2203.16224	link
2022-08-15	Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition	Junrui Ni et.al.	2203.15796	link
2022-06-29	DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning	Takaaki Saeki et.al.	2203.15683	null
2022-11-05	Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation	Rendi Chevi et.al.	2203.15643	link
2022-10-06	Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus	Minchan Kim et.al.	2203.15447	null
2022-07-11	VoiceMe: Personalized voice generation in TTS	Pol van Rijn et.al.	2203.15379	link

(back to top)

