
Add stream= kwarg to Recognizer.listen #757

Open · wants to merge 1 commit into base: master

Conversation

@clusterfudge commented on Jun 1, 2024:

Support for receiving captured audio one chunk at a time, while continuing to use the wakeword and audio energy detection code.

Notably, Coqui.ai/DeepSpeech (the Python STT packages) support a streaming interface, which greatly improves interaction latency for continuous-listening applications. Even for non-streaming interfaces, this implementation allows for eager encoding (for example, converting to numpy buffers, or even precomputing transformer KVs), or simply an earlier start to transmission (when using websockets or other chunked-transfer mechanisms).
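For illustration, here is a minimal sketch of how a caller might consume the new kwarg (assuming the generator yields per-chunk audio buffers as they are captured; forward_to_stt is a hypothetical stand-in for whatever streaming transport or decoder you feed):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        # With stream=True, listen() returns a generator that yields
        # audio chunk by chunk instead of one final payload.
        for chunk in r.listen(source, stream=True):
            forward_to_stt(chunk)  # hypothetical: push to a websocket / STT engine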

Note: This is a minimal extraction from a larger edit in a side project. There, I ended up carving up huge chunks of Recognizer to make it more observable (e.g. triggering events on speech-detection start/stop in addition to yielding audio, as well as real-time events for the audio-energy threshold and detected value). This is a much smaller edit, but I have not vetted it as thoroughly. I am in the process of adopting this change directly into a new project leveraging self-hosted Whisper over HTTP.

@@ -447,10 +447,12 @@ def snowboy_wait_for_hot_word(self, snowboy_location, snowboy_hot_word_files, so

         return b"".join(frames), elapsed_time

-    def listen(self, source, timeout=None, phrase_time_limit=None, snowboy_configuration=None):
+    def listen(self, source, timeout=None, phrase_time_limit=None, snowboy_configuration=None, stream=False):
@clusterfudge (Author) commented on the diff:

The indirection here is notably strange, so I thought I'd explain:

In newer versions of Python, if a method contains the yield keyword, it always returns a generator. As such, the original implementation (now in _listen) always returns a generator, and this wrapper method either returns that generator unmodified, or iterates on the generator and returns the only result.
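As a rough sketch of that shape (a deliberately simplified illustration, not the PR's actual body; the real _listen carries the full wake-word and energy-detection logic):

    class Recognizer:
        def _listen(self, source, stream=False):
            # Because this body contains `yield`, calling it always
            # returns a generator object, in both modes.
            frames = []
            for buf in source:            # stand-in for the capture loop
                if stream:
                    yield buf             # emit each buffer as it arrives
                else:
                    frames.append(buf)
            if not stream:
                yield b"".join(frames)    # single merged payload

        def listen(self, source, stream=False):
            result = self._listen(source, stream)
            if stream:
                return result             # caller iterates chunk by chunk
            return next(result)           # unwrap the only yielded result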

There's an alternate path here, where _listen always yields buffers and listen is responsible for merging them into a single payload when stream=False; this would also mean moving the logic for truncating the final non-speech frames, which I found slightly less preferable.

In either case, users of stream=True are going to get a few extra frames of audio. Changing that behavior would require delaying stream emission by non_speaking_buffer_count frames, and it's a bit of a toss-up as to which is the lower-latency end-to-end solution.

@Guillermoreno commented:
Thanks, mate, you just made my day! Can I support this in any way? I'm new to these things, but I applied your modifications locally and now I can stream audio into AWS Transcribe and save precious seconds of response time on my AI voice agent!

@clusterfudge (Author) replied:

> Can I support this in any way?

Just looking for a maintainer's attention at this point, I think!

@clusterfudge (Author):

cc @Uberi 🔔

Please take a look, and let me know if you have any questions or feedback! I've been running this on Ubuntu, macOS, and Raspbian for the last couple of weeks and would love to get off the fork!
