
Add stream= kwarg to Recognizer.listen #757

Open · wants to merge 1 commit into base: master

Conversation

@clusterfudge commented on Jun 1, 2024:

Support for receiving captured audio one chunk at a time, while continuing to use the wakeword and audio energy detection code.

Notably, Coqui.ai/DeepSpeech (the Python STT packages) support a streaming interface, which greatly improves interaction latency for continuous-listening applications. Even for non-streaming interfaces, this implementation allows for eager encoding (for example, converting to numpy buffers, or even precomputing transformer KVs), or simply an earlier start to transmission (when using websockets or other chunked-transfer mechanisms).
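For illustration, here is a minimal sketch of how a caller might consume the new kwarg (assuming the generator yields per-chunk audio buffers as they are captured; forward_to_stt is a hypothetical stand-in for whatever streaming transport or decoder you feed):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        # With stream=True, listen() returns a generator that yields
        # audio chunk by chunk instead of one final payload.
        for chunk in r.listen(source, stream=True):
            forward_to_stt(chunk)  # hypothetical: push to a websocket / STT engine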

Note: This is a minimal extraction from a larger edit in a side project. There, I ended up carving up huge chunks of Recognizer to make it more observable (e.g. triggering events on speech-detection start/stop in addition to yielding audio, as well as real-time events for the audio-energy threshold and detected value). This is a much smaller edit, but I have not vetted it as thoroughly. I am in the process of adopting this change directly into a new project leveraging self-hosted Whisper over HTTP.

@@ -447,10 +447,12 @@ def snowboy_wait_for_hot_word(self, snowboy_location, snowboy_hot_word_files, so

         return b"".join(frames), elapsed_time

-    def listen(self, source, timeout=None, phrase_time_limit=None, snowboy_configuration=None):
+    def listen(self, source, timeout=None, phrase_time_limit=None, snowboy_configuration=None, stream=False):
@clusterfudge (Author) commented on the diff:

The indirection here is notably strange, so I thought I'd explain:

In newer versions of Python, if a method contains the yield keyword, it always returns a generator. As such, the original implementation (now in _listen) always returns a generator, and this wrapper method either returns that generator unmodified, or iterates on the generator and returns the only result.
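As a rough sketch of that shape (a deliberately simplified illustration, not the PR's actual body; the real _listen carries the full wake-word and energy-detection logic):

    class Recognizer:
        def _listen(self, source, stream=False):
            # Because this body contains `yield`, calling it always
            # returns a generator object, in both modes.
            frames = []
            for buf in source:            # stand-in for the capture loop
                if stream:
                    yield buf             # emit each buffer as it arrives
                else:
                    frames.append(buf)
            if not stream:
                yield b"".join(frames)    # single merged payload

        def listen(self, source, stream=False):
            result = self._listen(source, stream)
            if stream:
                return result             # caller iterates chunk by chunk
            return next(result)           # unwrap the only yielded result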

There's an alternate path here, where _listen always yields buffers and listen is responsible for merging them into a single payload when stream=False; this would also mean moving the logic for truncating the final non-speech frames, which I found slightly less preferable.

In either case, users of stream=True are going to get a few extra frames of audio. Changing that behavior would require delaying stream emission by non_speaking_buffer_count frames, and it's a bit of a toss-up as to which is the lower-latency end-to-end solution.

@Guillermoreno commented:
Thanks, mate, you just made my day! Can I support this in any way? I'm new to these things, but I applied your modifications locally and now I can stream audio into AWS Transcribe and save precious seconds of response time on my AI voice agent!

@clusterfudge (Author) replied:

> Can I support this in any way?

Just looking for a maintainer's attention at this point, I think!

@clusterfudge (Author):

cc @Uberi 🔔

Please take a look, and let me know if you have any questions or feedback! I've been running this on Ubuntu, macOS, and Raspbian for the last couple of weeks and would love to get off the fork!
