Multi-Head State Space Model for Speech Recognition

Fathullah, Yassir; Wu, Chunyang; Shangguan, Yuan; Jia, Junteng; Xiong, Wenhan; Mahadeokar, Jay; Liu, Chunxi; Shi, Yangyang; Kalinli, Ozlem; Seltzer, Mike; Gales, Mark J. F.

arXiv:2305.12498 (eess)

[Submitted on 21 May 2023 (v1), last revised 25 May 2023 (this version, v2)]

Abstract: State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development sets and 1.91%/4.36% on the test sets without using an external language model.

Comments:	Interspeech 2023
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2305.12498 [eess.AS]
	(or arXiv:2305.12498v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2305.12498

Submission history

From: Yassir Fathullah
[v1] Sun, 21 May 2023 16:28:57 UTC (1,608 KB)
[v2] Thu, 25 May 2023 21:55:58 UTC (1,609 KB)
