PSLM: Parallel Generation of Text and Speech with LLMs
for Low-Latency Spoken Dialogue Systems

Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

rinna Co., Ltd., Tokyo, Japan
{kemits,kohmi,towaka,yuhono,keisawada}@rinna.co.jp
Abstract

Multimodal language models that process both text and speech have potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.

1 Introduction

Spoken dialogue systems have been developed for many years to achieve natural human-computer interaction (McTear, 2002; Jokinen and McTear, 2009; Chen et al., 2017). Traditionally, these systems consist of several components: Automatic Speech Recognition (ASR), Response Generation (RG), and Text-to-Speech (TTS). Various methods for RG have been proposed with the advancements in Large Language Models (LLMs) (Wang et al., 2023a; Yi et al., 2024). More recently, the application of LLMs to ASR (e.g., Wang et al. 2023b; Hono et al. 2024; Fathullah et al. 2024) and TTS (Wang et al., 2023b; Hao et al., 2023) has attracted much attention, leading to the development of multimodal LLMs capable of end-to-end spoken language communication (Zhang et al., 2023; Nachmani et al., 2024).

Zhang et al. (2023) proposed SpeechGPT, an LLM that receives speech questions (SQ) as speech tokens, which are discrete representations extracted from raw waveforms, and sequentially generates text questions (TQ), text answers (TA), and speech answers (SA). Figure 1 (a) illustrates their approach called Chain-of-Modality (CoM) prompting. Spectron (Nachmani et al., 2024) follows this prompting style but directly handles speech spectrograms. Although these methods can generate high-quality responses, they face two major challenges in terms of response latency. First, generating SA requires the prior generation of TQ and TA. Second, speech sequences are much longer than text sequences (actual sequence lengths are provided in Appendix A).

Figure 1: (a) Chain-of-Modality prompting necessitates generating text questions (TQ) and text answers (TA) from speech questions (SQ) before producing speech answers (SA). (b) Our Parallel Speech Language Model (PSLM) enables the parallel decoding of TA and SA, reducing overall latency. (c) Introducing multiple speech streams further accelerates the generation of SA.

In this study, we propose the Parallel Speech Language Model (PSLM), an LLM with multiple input-output sequences to handle both text and speech tokens, enabling their parallel generation. To emphasize their parallel processing capabilities, we will refer to these sequences as “streams”. As illustrated in Figure 1 (b), PSLM begins to generate SA immediately after the end of SQ tokens, which can reduce overall latency. This leads to our first research question (RQ1): Can PSLM improve latency while maintaining the response quality achieved by CoM prompting? Additionally, we address the second challenge by introducing multiple speech streams to decode multiple speech tokens in a single step, as illustrated in Figure 1 (c). This brings us to the second research question (RQ2): Do multiple speech streams sacrifice response quality? Addressing these questions will pave the way for more advanced and responsive applications of spoken dialogue systems.

2 PSLM

2.1 Speech Discretization

Speech Tokenization

Extracting discrete speech tokens from raw waveforms enables language models to handle speech in the same manner as text tokens. Self-supervised learning has been widely used for speech tokenization due to its ability to extract spoken content from raw waveforms (e.g., Rubenstein et al. 2023; Chou et al. 2023; Hassid et al. 2023). Following Zhang et al. (2023), we employ Hidden-Unit BERT (HuBERT) (Hsu et al., 2021) for speech tokenization.

Speech Detokenization

In contrast to text tokenization, which is uniquely recoverable, speech tokenization largely discards the information of raw waveforms. Two major approaches have been proposed to solve this problem. The first approach uses a neural vocoder for directly reconstructing raw waveforms from speech tokens (e.g., Zhang et al. 2023; Chou et al. 2023; Hassid et al. 2023). The second approach uses a pretrained neural audio codec, which requires an additional module to predict the codec’s tokens (e.g., Rubenstein et al. 2023; Zhang et al. 2024). To reduce overall latency, we adopt the first approach and use HiFi-GAN (Kong et al., 2020), a non-autoregressive neural vocoder that efficiently generates high-fidelity waveforms.

2.2 Integrating LMs with a Speech Stream

PSLM is built on top of a pretrained decoder-only Transformer (Vaswani et al., 2017). An overview of the PSLM architecture is provided in Figure 2. We add new input embedding and output projection layers to process speech tokens, while the structure of the intermediate Transformer layers remains unchanged. The embeddings of text and speech tokens are summed before being fed to the Transformer layers. The hidden features from the final Transformer layer are passed to two output projection layers to calculate the logits of the next text and speech tokens. We randomly initialize the weights of new embedding and projection layers.
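
The input and output wiring described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the class name `TwoStreamHeads`, the helper functions, and the tiny dimensions are hypothetical; all matrices are randomly initialized here (in the paper only the new speech layers are), and the Transformer layers between `embed` and `logits` are omitted entirely.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TwoStreamHeads:
    """Hypothetical stand-in for the embedding and projection layers only."""

    def __init__(self, text_vocab, speech_vocab, hidden, seed=0):
        rng = random.Random(seed)
        self.text_emb = make_matrix(text_vocab, hidden, rng)
        self.speech_emb = make_matrix(speech_vocab, hidden, rng)  # new layer
        self.text_out = make_matrix(text_vocab, hidden, rng)
        self.speech_out = make_matrix(speech_vocab, hidden, rng)  # new layer

    def embed(self, text_id, speech_id):
        """Sum the text and speech embeddings for one time step."""
        return [t + s for t, s in
                zip(self.text_emb[text_id], self.speech_emb[speech_id])]

    def logits(self, hidden_state):
        """Project one final hidden state to text logits and speech logits."""
        return (matvec(self.text_out, hidden_state),
                matvec(self.speech_out, hidden_state))
```

In a full model, the output of `embed` would pass through the unchanged Transformer layers before reaching `logits`; the sketch only shows that one hidden state yields two separate next-token distributions.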

A challenge of joint text-speech modeling lies in the mismatch in their lengths. In this study, we simply right-pad TQ and TA sequences with special [TEXT-PAD] tokens to align their lengths with those of the SQ and SA sequences, respectively. In a preliminary experiment on the CoM-based architecture, we attempted to generate text tokens and their corresponding speech tokens alternately in a similar manner to ELLA-V (Song et al., 2024); however, this approach led to frequent mispronunciation. This is mainly because, in our case, the text is represented by tokens rather than phonemes; in some languages, the pronunciation of a character often changes according to subsequent characters, and a certain amount of lookahead is necessary to achieve accurate pronunciation. In contrast, our alignment strategy allows the model to focus on text token generation initially and then refer to the generated text when producing the majority of speech tokens, leading to more accurate pronunciation.
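
The right-padding step can be illustrated with a minimal sketch. The function name `pad_text_to_speech` is hypothetical, and raising an error when the text is longer than the speech is an assumption of this sketch; the paper does not describe that case.

```python
TEXT_PAD = "[TEXT-PAD]"


def pad_text_to_speech(text_tokens, speech_tokens, pad=TEXT_PAD):
    """Right-pad the text stream so it has the same length as the speech stream."""
    missing = len(speech_tokens) - len(text_tokens)
    if missing < 0:
        # Assumption: text is never longer than speech (see Appendix A statistics).
        raise ValueError("text stream is longer than speech stream")
    return list(text_tokens) + [pad] * missing
```

For example, two text tokens paired with five speech tokens produce a text stream of two real tokens followed by three `[TEXT-PAD]` tokens.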

Our PSLM is trained by minimizing the sum of cross entropy losses for each stream. We include prompt tokens, comprising TQ and SQ, in the loss calculation. During inference, PSLM receives these prompt tokens and generates TA and SA in parallel. Text and speech tokens are sampled independently from their respective distributions.

Figure 2: Architecture of PSLM.

2.3 Introducing Multiple Speech Streams

For further acceleration, we introduce multiple speech streams to PSLM. Assume that PSLM has $1+S$ streams, one for text tokens and $S$ for speech tokens. Given the original speech token sequence of length $N$, the $s$-th speech stream consists of the speech tokens with indices $s, s+S, s+2S, \ldots, s+MS$, where $s \in \{1, \ldots, S\}$ and $M = \lfloor N/S \rfloor - 1$. Compared to simply increasing the batch size, where the system’s throughput improves but the latency for each instance remains unchanged, our approach reduces the sequence length handled by the Transformer layers to $1/S$ of the original, leading to an approximate $S$-fold speedup even in the single-instance scenario.
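
The index pattern above amounts to de-interleaving the speech token sequence. The following sketch uses hypothetical function names and 0-based Python indexing. The index formula covers only the first $S \lfloor N/S \rfloor$ tokens, and how any remaining tokens are handled is not stated, so dropping them here is an assumption.

```python
def split_streams(speech_tokens, num_streams):
    """Split one speech token sequence into num_streams interleaved streams."""
    usable = (len(speech_tokens) // num_streams) * num_streams
    # Stream s (0-based) takes tokens s, s + S, s + 2S, ... below `usable`.
    # Assumption: the trailing len(speech_tokens) % num_streams tokens are dropped.
    return [speech_tokens[s:usable:num_streams] for s in range(num_streams)]


def merge_streams(streams):
    """Re-interleave the streams into a single sequence for the vocoder."""
    return [token for step in zip(*streams) for token in step]
```

Each decoding step then emits one token per stream, so a sequence of length $N$ is covered in about $N/S$ steps.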

During training, simply summing the cross entropy losses for each stream makes the loss of text tokens less dominant, leading to poor text generation quality. Therefore, we introduce a weighted loss, where we multiply the loss for speech streams by $1/S$ to balance the weight of losses for text and speech streams.
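
As a small illustration of the weighting, the sketch below combines per-stream losses that are assumed to have been computed already. The function name and the scalar-loss interface are hypothetical; the actual training code is not shown in the paper.

```python
def weighted_stream_loss(text_loss, speech_losses):
    """Text loss plus the speech-stream losses scaled by 1/S."""
    num_streams = len(speech_losses)
    if num_streams == 0:
        raise ValueError("at least one speech stream is required")
    return text_loss + sum(speech_losses) / num_streams
```

With a single speech stream this reduces to the plain sum of the two losses, which matches the unweighted objective of Section 2.2.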

2.4 Streaming Inference with HiFi-GAN

Following Chen et al. (2022), we use HiFi-GAN for streaming inference; specifically, we provide partial speech tokens to generate waveform fragments. In this study, we use non-causal convolution to maintain high speech quality. Therefore, the first speech fragment can be generated once $N_{\textrm{offset}} = \lfloor R/2 \rfloor + 1$ tokens are decoded, where $R$ denotes the receptive field of HiFi-GAN. Implementation details can be found in Appendix B.

2.5 Overall Latency

We define latency as the delay between the end of the user’s utterance and the system’s initial response. The latency of conventional CoM-based systems, $L_{\textrm{CoM}}$, can be represented as follows:

$$L_{\textrm{CoM}} = D_{\textrm{s2t}} + D_{\textrm{SQ}} + \frac{N_{\textrm{dec}}}{P} + D_{\textrm{t2s}} \tag{1}$$
$$N_{\textrm{dec}} = N_{\textrm{TQ}} + N_{\textrm{TA}} + N_{\textrm{offset}} \tag{2}$$

where $D_{\textrm{s2t}}$, $D_{\textrm{SQ}}$, and $D_{\textrm{t2s}}$ denote the delays of speech tokenization, the prefill phase in LMs, and speech detokenization, respectively; $N_{\textrm{TQ}}$ and $N_{\textrm{TA}}$ denote the number of tokens in TQ and TA, respectively; and $P$ denotes the tokens per second (TPS) during the decode phase in LMs.

Our PSLM eliminates the need for generating TQ and TA beforehand, although it requires running an external ASR module to obtain TQ. Hence, its latency $L_{\textrm{PSLM}}$ can be represented as follows:

$$L_{\textrm{PSLM}} = D_{\textrm{ASR}} + D_{\textrm{SQ}} + \frac{N_{\textrm{offset}}}{P \cdot S} + D_{\textrm{t2s}} \tag{3}$$

where $D_{\textrm{ASR}}$ denotes the ASR delay. Here $D_{\textrm{s2t}}$ is omitted because speech tokenization can be performed in parallel with ASR.
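
Equations 1 to 3 translate directly into two small functions. The sketch below is illustrative only: the function and argument names are hypothetical, delays are assumed to be in seconds, and `tps` corresponds to $P$.

```python
def latency_com(n_tq, n_ta, n_offset, tps, d_s2t, d_sq, d_t2s):
    """Equations 1 and 2: TQ, TA, and N_offset speech tokens are decoded first."""
    n_dec = n_tq + n_ta + n_offset
    return d_s2t + d_sq + n_dec / tps + d_t2s


def latency_pslm(n_offset, tps, num_streams, d_asr, d_sq, d_t2s):
    """Equation 3: only N_offset speech tokens are needed, S of them per step."""
    return d_asr + d_sq + n_offset / (tps * num_streams) + d_t2s
```

Using the settings reported in Section 3.4 and Appendix B ($D_{\textrm{ASR}}=0.2$, $D_{\textrm{SQ}}=0.05$, $D_{\textrm{t2s}}=0.01$, $P=50$, and $R=26$, hence $N_{\textrm{offset}}=14$), `latency_pslm(14, 50, 1, 0.2, 0.05, 0.01)` evaluates to about 0.54 s, which agrees with the PSLM-ASR entry in Table 1.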

3 Experimental Setup

3.1 Dataset

We used an internal dataset comprising 1.8M written QA pairs for training all models. Since some of these samples, which were primarily crawled from the internet, were deemed unsuitable for evaluation, we used a publicly available Japanese dataset (Hayashibe, 2023) for evaluation. This dataset was manually reviewed and consists of 669 diverse written QA pairs. We further filtered the evaluation set by excluding samples whose TQ or TA exceeded 140 characters, the maximum number of characters observed in the training set. The final evaluation set contained 396 samples. For both the training and evaluation sets, we constructed a spoken question answering (SQA) dataset by synthesizing SQ and SA using a well-trained single-speaker TTS system based on VITS (Kim et al., 2021).

3.2 Configuration

Tokenization and Detokenization

For text tokenization, we used the tokenizer with a vocabulary size of 151,936 from rinna/nekomata-7b (https://huggingface.co/rinna/nekomata-7b). For speech tokenization, we applied $k$-means clustering with $k=512$ to 12th-layer features from rinna/japanese-hubert-base (https://huggingface.co/rinna/japanese-hubert-base) (Sawada et al., 2024), obtaining 50 speech tokens per second. For speech detokenization, we trained discrete unit-based HiFi-GAN (Polyak et al., 2021) using pairs of synthesized speech waveforms of SQ and SA and their corresponding speech tokens. For ASR, Whisper large-v3 (Radford et al., 2023) with faster-whisper (https://github.com/SYSTRAN/faster-whisper) was used throughout our experiments.
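
Once the $k$-means centroids are fitted, speech tokenization reduces to a nearest-centroid lookup per feature frame. The following sketch assumes squared Euclidean distance and plain Python lists; the function names are hypothetical, and feature extraction with HuBERT and the clustering itself are outside its scope.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize_speech(frames, centroids):
    """Map a sequence of feature frames to discrete speech tokens."""
    return [nearest_centroid(frame, centroids) for frame in frames]
```

With $k=512$ centroids and features produced at 50 frames per second, this yields 50 tokens per second drawn from a vocabulary of 512 units.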

Language Modeling

We used rinna/nekomata-7b, a 32-layer 4096-hidden-size Transformer LM that was continuously pretrained from Qwen-7B (Bai et al., 2023) on Japanese text, as the backbone of our models. We implemented our models using the GPT-NeoX library (Andonian et al., 2023). Unless otherwise noted, models were trained for 50k steps with a batch size of 16 on 8 NVIDIA A100 GPUs using an Adam optimizer (Kingma and Ba, 2015) with a peak learning rate set to 1e-5. During inference, we set the temperature to 0.8 and applied top-$k$ and top-$p$ sampling with $k=60$ and $p=0.8$.
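
The decoding settings can be illustrated with a plain-Python sampler. This is a sketch under assumptions: the order of the filters (temperature, then top-$k$, then top-$p$) and the rule of keeping the smallest prefix whose cumulative probability reaches $p$ are common conventions, not details confirmed by the paper, and the function names are hypothetical.

```python
import math
import random


def filtered_distribution(logits, temperature=0.8, top_k=60, top_p=0.8):
    """Apply temperature, top-k, then top-p, and renormalize what is left."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]
    # Keep the top_k most probable token ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:  # smallest prefix reaching top_p
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Because text and speech tokens are sampled independently (Section 2.2), one decoding step would call `sample_token` once on the text logits and once on each set of speech logits, for instance with a shared `random.Random` instance.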

3.3 Baselines

We included three CoM-based baselines, which share the model weights but differ in their prompts during decoding: (1) CoM-SQ receives only SQ, (2) CoM-ASR receives SQ and transcribed TQ, and (3) CoM receives SQ and gold TQ. In our preliminary experiments, the three-stage training (Zhang et al., 2023) was not effective in our configuration; thus, we trained the model using the same configuration as described in Section 3.2.

3.4 Evaluation Metrics

ChatGPT Scores

We used OpenAI’s GPT-3.5 Turbo API to evaluate response quality on a 5-point scale from 1 (bad) to 5 (excellent). The prompt is described in Appendix C. We report the scores for TA and the transcription of SA as T-score and S-score, respectively.

Character Error Rate (CER)

We calculated the character error rate between the generated TA and the transcription of SA to assess their alignment.
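
For reference, a character error rate can be computed from the Levenshtein distance as in the sketch below. Treating the generated TA as the reference and the transcription of SA as the hypothesis is an assumption of this sketch, as is the absence of any text normalization; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, using one rolling row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A value of 0 means the transcribed speech matches the generated text exactly; values can exceed 1 when the hypothesis is much longer than the reference.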

Failure Rate (FR)

We counted failure cases such as (1) no [EOS] token was generated before the total sequence length reached 2048, or (2) tokens were generated in the wrong modality, i.e., speech tokens in TQ and TA, or text tokens in SA.

Latency

We simulated latency according to Equations 1 to 3 for each sample in the evaluation set, and reported the median values. We set $D_{\textrm{s2t}}=0.05$, $D_{\textrm{SQ}}=0.05$, $D_{\textrm{ASR}}=0.2$, and $D_{\textrm{t2s}}=0.01$ (all in seconds) based on measurements taken on a single NVIDIA A100 GPU. For the TPS value $P$, the actual TPS varies depending on computing resources and optimization; 70 TPS was achieved with vLLM (Kwon et al., 2023) optimization, and 25 TPS without it. Meanwhile, for streaming inference with HiFi-GAN, LMs need to generate 50 speech tokens per second. Therefore, we set $P$ to 50 in our simulations to match this requirement.

Table 1: Automatic evaluation results. T-score and S-score represent the ChatGPT-based score for TA and transcribed SA, respectively. FR denotes the failure rate. Latency values in parentheses represent inputs involving gold TQ.
| Method | Input modality | Output modality | T-score ↑ | S-score ↑ | FR ↓ | CER ↓ | Latency [s] ↓ |
|---|---|---|---|---|---|---|---|
| Ground Truth | - | - | 4.00±0.02 | 3.58±0.06 | - | 7.35 | - |
| CoM | SQ → TQ (Gold) | TA → SA | 3.50±0.09 | 3.27±0.09 | 12.12 | 6.28 | (0.67) |
| PSLM | SQ, TQ (Gold) | TA, SA | 3.50±0.08 | 3.22±0.09 | 5.05 | 5.25 | (0.34) |
| CoM-SQ | SQ | TQ → TA → SA | 3.12±0.11 | 2.94±0.10 | 15.91 | 7.83 | 1.03 |
| CoM-ASR | SQ → TQ (ASR) | TA → SA | 3.27±0.10 | 3.07±0.09 | 13.13 | 6.18 | 0.92 |
| PSLM-ASR | SQ, TQ (ASR) | TA, SA | 3.34±0.09 | 3.05±0.10 | 6.31 | 6.05 | 0.54 |
| PSLM-2x | SQ, TQ (Gold) | TA, SA | 3.50±0.08 | 3.20±0.09 | 4.29 | 6.39 | (0.20) |
| PSLM-3x | SQ, TQ (Gold) | TA, SA | 3.28±0.10 | 2.99±0.10 | 7.07 | 6.09 | (0.15) |

Human Rating

We also conducted two subjective evaluations: one for text and the other for speech. In the text evaluation, we presented pairs of gold TQ and generated TA, and raters evaluated the naturalness of TA based on the same criteria used in the ChatGPT-based evaluation (Text Naturalness). In the speech evaluation, we presented gold SQ and generated SA successively, along with their TQ and TA, and asked the raters to evaluate (1) how natural the SA is as the speech of the TA (Speech Naturalness), and (2) whether the response is fast enough (Speed Score). For better reproducibility, we provide the actual instruction used for speech evaluation in Appendix D. The duration of silence between SQ and SA was simulated in the manner described in Section 2.5, except for the Ground Truth where the silence duration was set to 200ms, the average turn-taking gap in human conversation (Levinson and Torreira, 2015). Scores were rated on a 5-point scale. Fifty samples were randomly chosen from the evaluation set, and twenty in-house workers rated twenty samples each.

4 Results and Discussion

4.1 Automatic Evaluation

Comparison with Baselines

To answer RQ1, we compared the proposed method in two conditions, PSLM and PSLM-ASR, with the baselines described in Section 3.3. PSLM receives SQ and gold TQ, while PSLM-ASR receives SQ and transcribed TQ. Table 1 summarizes the results. When gold TQ was given, PSLM achieved comparable scores to CoM and significantly improved latency. A similar trend was observed under more practical conditions where gold TQ was not available (PSLM-ASR vs. CoM-ASR). However, their scores were lower than those with gold TQ, and CoM-SQ faced greater degradation. These results suggest that ASR performance is crucial for response quality, and CoM-SQ seems to have produced more ASR errors than Whisper. Nevertheless, we conclude that PSLM maintains the response quality of CoM (RQ1). We also found that PSLM-based methods achieved lower FRs than CoM-based ones. Each stream of PSLM is dedicated to a single modality, which could have reduced the failures in generation. Furthermore, methods other than CoM-SQ marked lower CERs than Ground Truth. From this result, we confirmed that both CoM and PSLM can generate appropriate SA corresponding to TA.

Multiple Speech Streams

To answer RQ2, we trained PSLM variants with two (-2x) or three (-3x) speech streams (PSLM-3x was trained with a batch size of 4 due to the increased number of parameters). PSLM-2x achieved comparable scores to PSLM, whereas PSLM-3x demonstrated significant degradation. From these results, we conclude that speech tokens can be decoded in up to two streams without quality degradation (RQ2). An ablation study can be found in Appendix E.

4.2 Human Evaluation

Considering practical applicability to SQA, we manually evaluated three methods that do not rely on gold TQ (CoM-SQ, CoM-ASR, and PSLM-ASR), along with Ground Truth. Table 2 shows the results. The text naturalness of PSLM-ASR was comparable to that of CoM-ASR and higher than that of CoM-SQ, which is consistent with the automatic evaluation results. For speech naturalness, all methods achieved higher scores than Ground Truth. This result can be attributed to two reasons: (1) the SA of Ground Truth are synthetic speech, which may include errors in pronunciation, intonation, and pauses, and (2) the SA of Ground Truth are typically longer than those of other methods, so one or two unnatural parts could lower the score of the entire response. Nevertheless, we confirmed that our approach can generate natural and faithful speech responses. For response speed evaluation, PSLM-ASR achieved a significantly higher score than CoM-ASR and CoM-SQ. This finding verifies that the proposed method reduces latency both numerically and perceptibly. Detailed analysis can be found in the next subsection.

Table 2: Human evaluation results.
| Method | Text ↑ | Speech ↑ | Speed ↑ |
|---|---|---|---|
| Ground Truth | 4.08±0.26 | 3.74±0.19 | 4.73±0.11 |
| CoM-SQ | 2.44±0.29 | 4.04±0.20 | 4.07±0.23 |
| CoM-ASR | 2.90±0.30 | 3.94±0.20 | 4.17±0.22 |
| PSLM-ASR | 3.08±0.27 | 4.08±0.20 | 4.57±0.13 |

4.3 Detailed Latency Analysis

The sequence length of TA, or $N_{\textrm{TA}}$, is the most influential factor in overall latency of CoM-based systems, as TA must be generated before SA. Thus, we investigated the overall latency by varying $N_{\textrm{TA}}$. Figure 3 shows the results. Due to the need for prior generation of TA, the latency of CoM-SQ and CoM-ASR increases linearly as TA length increases. In contrast, the latency of PSLM-ASR is constant because Equation 3 does not include $N_{\textrm{TA}}$, and PSLM-2x-ASR further reduces the latency. The gap between CoM-based and PSLM-based systems is remarkable when generating long TA, highlighting the effectiveness of generating text and speech tokens in parallel.
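
The linear growth of the gap can be made explicit by subtracting Equation 3 from Equation 1 (with Equation 2 substituted). This is a derived restatement of the definitions in Section 2.5, not an additional result; it applies to the setting of Equation 1, where TQ is generated by the model.

```latex
\begin{align*}
L_{\textrm{CoM}} - L_{\textrm{PSLM}}
  &= \left(D_{\textrm{s2t}} + D_{\textrm{SQ}} + \frac{N_{\textrm{TQ}} + N_{\textrm{TA}} + N_{\textrm{offset}}}{P} + D_{\textrm{t2s}}\right)
   - \left(D_{\textrm{ASR}} + D_{\textrm{SQ}} + \frac{N_{\textrm{offset}}}{P \cdot S} + D_{\textrm{t2s}}\right) \\
  &= D_{\textrm{s2t}} - D_{\textrm{ASR}} + \frac{N_{\textrm{TQ}} + N_{\textrm{TA}}}{P}
   + \frac{N_{\textrm{offset}}}{P}\left(1 - \frac{1}{S}\right)
\end{align*}
```

The difference grows with slope $1/P$ in $N_{\textrm{TA}}$, while the last term is the additional saving from using $S > 1$ speech streams.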

Figure 3: Latency vs. TA length for different methods and tokens per second (TPS). PSLM-2x-ASR (50 TPS) is omitted because its latency is identical to PSLM-ASR (100 TPS).

5 Conclusion

In this study, we proposed the Parallel Speech Language Model (PSLM), an LLM capable of generating text and speech tokens in parallel with multiple input-output streams, and investigated its impact on response quality and overall latency. The experimental evaluations on spoken question answering demonstrated that the proposed method significantly reduces latency compared to existing methods while maintaining response quality. Future work includes verifying the effectiveness of the proposed method on larger datasets and real speech data. Additionally, extending the proposed method to multi-turn dialogues is an important research direction.

6 Limitations

We recognize several limitations of this study. First, PSLM sacrifices ASR capability for faster response, requiring an external ASR module to serve as a spoken dialogue system. Although this dependency can complicate the system structure, it does not degrade the system’s performance, provided that an appropriate ASR module is selected. This is supported by the fact that CoM-ASR outperformed CoM-SQ, as described in Section 4.1. Nevertheless, enabling ASR with the PSLM architecture can be an interesting research direction. Second, we used single-speaker synthetic speech for SQ and SA, which lacks diversity in several aspects of speech such as accent, rhythm, emotion, and timbre. Practical applications may need to accept the voices of arbitrary speakers, which we will address in future work. Finally, multi-turn dialogue settings were not investigated in our experiments. While SpeechGPT (Zhang et al., 2023) was not applied to multi-turn dialogue due to sequence length limitations, our models with multiple speech streams have the potential to perform multi-turn dialogue.

References

Appendix A Sequence Length Distributions

We calculated the sequence length distributions of SQ, TQ, TA, and SA in the training set. The results are listed in Table 3. On average, CoM prompting requires generating $36.5\ (\textrm{TQ}) + 33.8\ (\textrm{TA}) \approx 70$ text tokens before generating SA. Eliminating the need for generating these tokens can greatly reduce overall latency. In addition, speech sequences are on average more than 11 times longer than the corresponding text sequences, highlighting the need for efficient generation of speech tokens.

Appendix B Implementation Details of HiFi-GAN

The HiFi-GAN generator comprises convolution layers. Therefore, a waveform fragment corresponding to the $i$-th token depends only on tokens with indices $[i - \lfloor R/2 \rfloor, i + \lfloor R/2 \rfloor]$. This allows waveform generation to start before the entire SA is generated. As described in Figure 4, HiFi-GAN first generates a waveform fragment once the LM generates $N_{\textrm{offset}} = \lfloor R/2 \rfloor + 1$ tokens, then generates subsequent fragments by shifting input tokens one by one.
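
The dependency window can be turned into a simple schedule that states, for each fragment, which tokens it needs and how many tokens must have been decoded before it can be synthesized. The sketch uses 0-based indices and hypothetical function names; clipping the window at the sequence boundaries is an assumption of this sketch (the paper does not say how the edges are handled).

```python
def n_offset(receptive_field):
    """Number of tokens required before the first fragment can be generated."""
    return receptive_field // 2 + 1


def fragment_schedule(num_tokens, receptive_field):
    """Return (i, first, last, ready_after) for each token index i (0-based).

    first..last is the inclusive token window fragment i depends on, and
    ready_after is the number of decoded tokens needed to synthesize it.
    """
    half = receptive_field // 2
    schedule = []
    for i in range(num_tokens):
        first = max(0, i - half)               # assumption: clip at the start
        last = min(num_tokens - 1, i + half)   # assumption: clip at the end
        schedule.append((i, first, last, last + 1))
    return schedule
```

For the setting of Figure 4 ($R=5$, $N_{\textrm{SA}}=6$), the first fragment becomes available after 3 decoded tokens, and the last three fragments all become available once the sixth token is decoded.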

In our experiments, we trained HiFi-GAN to generate 24 kHz waveforms from 50 Hz tokens, which results in $R=26$. Following Polyak et al. (2021), we embedded input speech tokens into 256-dimensional features and fed them to HiFi-GAN. We set the upsampling rates to $[8, 6, 5, 2]$ and the total number of iterations to 300k, and kept the rest of the configuration the same as in the original work (Kong et al., 2020).

Appendix C ChatGPT Evaluation Prompt

We used the prompt in Figure 5 for ChatGPT-based evaluation. The original prompt was written in Japanese, but a translated version is presented here.

Appendix D Speech Evaluation Instruction

We used the instruction in Figure 6 for speech evaluation. The original instruction was written in Japanese, but a translated version is presented here.

Appendix E Ablation Study

We trained three PSLM variants, one from scratch (-no-pretrain), one without TQ (-no-TQ), and one without SQ (-no-SQ). In addition, we trained PSLM-2x and PSLM-3x without weighted loss (-no-WL). Table 4 shows the automatic evaluation results. PSLM-no-pretrain exhibited significant degradation in all metrics, indicating the necessity of pretrained LM’s text capability. PSLM-no-TQ also showed large degradation, highlighting the importance of TQ in response quality. In contrast, PSLM-no-SQ achieved comparable scores to PSLM. This result implies that the speech-specific information such as intonation, rhythm, and emotion is not essential in the current SQA task due to the use of synthetic speech. We also found that PSLM-2x-no-WL achieved almost comparable scores to PSLM, whereas PSLM-3x-no-WL showed significant degradation. From these results, we conclude that the weighted loss is especially effective as the number of speech streams increases.

Table 3: Sequence length distributions in the training set (in tokens).
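| | SQ | TQ | TA | SA |
|---|---|---|---|---|
| Mean | 406.6 | 36.5 | 33.8 | 386.5 |
| Min | 34 | 2 | 1 | 27 |
| 25% | 214 | 19 | 15 | 179 |
| 50% | 354 | 32 | 29 | 340 |
| 75% | 577 | 51 | 50 | 563 |
| Max | 1861 | 148 | 147 | 1697 |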
Figure 4: Streaming inference using HiFi-GAN with receptive field size $R=5$ and SA length $N_{\textrm{SA}}=6$. Waveform generation begins once $N_{\textrm{offset}} = \lfloor R/2 \rfloor + 1 = 3$ tokens are generated. Text tokens are omitted.
Figure 5: Prompt for ChatGPT evaluation.
Figure 6: Instruction for speech evaluation.
Table 4: Ablation study. The suffix no-WL denotes weighted loss was not applied.
| Method | Input modality | Output modality | T-score ↑ | S-score ↑ | FR ↓ | CER ↓ |
|---|---|---|---|---|---|---|
| PSLM | SQ, TQ (Gold) | TA, SA | 3.50±0.08 | 3.22±0.09 | 5.05 | 5.25 |
| PSLM-2x | SQ, TQ (Gold) | TA, SA | 3.50±0.08 | 3.20±0.09 | 4.29 | 6.39 |
| PSLM-3x | SQ, TQ (Gold) | TA, SA | 3.28±0.10 | 2.99±0.10 | 7.07 | 6.09 |
| PSLM-no-pretrain | SQ, TQ (Gold) | TA, SA | 2.22±0.07 | 2.12±0.07 | 18.18 | 10.13 |
| PSLM-no-TQ | SQ | TA, SA | 2.34±0.09 | 2.19±0.09 | 8.84 | 6.38 |
| PSLM-no-SQ | TQ (Gold) | TA, SA | 3.54±0.08 | 3.17±0.09 | 6.31 | 8.99 |
| PSLM-2x-no-WL | SQ, TQ (Gold) | TA, SA | 3.42±0.08 | 3.17±0.08 | 8.84 | 4.99 |
| PSLM-3x-no-WL | SQ, TQ (Gold) | TA, SA | 2.67±0.10 | 2.46±0.10 | 11.36 | 6.94 |