FeatureRequest: Streaming audio input #60

Praj-17 · 2024-04-12T22:00:27Z

I have an audio clip in which a person says a particular mantra once.
For example: Om Namah Shivay. This is the input (reference) voice.
The person then starts chanting the same mantra over and over without stopping:

Om Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah ShivayOm Namah Shivay
Note that there is no fixed silence between repetitions.

I need to show, in real time as he speaks, a running count of how many times he has chanted it correctly.

How can I achieve this using Python and Gemini?

Note that the mantra can vary and can also be very long.

Currently I have a system that uses a WebSocket to read the chant continuously; the input audio is sent at handshake time. The stream is collected and split at regular intervals (approximately equal to the length of the input audio), and each chunk is dumped into a temporary WAV file and sent to Gemini along with the input audio. But there is a catch: the user can change speed, so a chunk is not guaranteed to contain a whole number of chants.
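The splitting step described above could be sketched roughly as follows. This is only an illustration, not the code from the linked repo: the names `fixed_size_segments` and `segment_size` are hypothetical, and the audio is assumed to arrive as raw 16-bit mono PCM packets at 16 kHz.

```python
from typing import Iterable, Iterator


def segment_size(duration_s: float, sample_rate: int = 16000,
                 channels: int = 1, sample_width: int = 2) -> int:
    """Bytes of raw PCM covering duration_s seconds (assumed audio format)."""
    return int(duration_s * sample_rate) * channels * sample_width


def fixed_size_segments(packets: Iterable[bytes],
                        segment_bytes: int) -> Iterator[bytes]:
    """Regroup arbitrarily sized packets into segments of segment_bytes each.

    The final segment may be shorter if the stream ends mid-segment.
    """
    if segment_bytes <= 0:
        raise ValueError("segment_bytes must be positive")
    buffer = bytearray()
    for packet in packets:
        buffer.extend(packet)
        # Emit as many full segments as the buffer currently holds.
        while len(buffer) >= segment_bytes:
            yield bytes(buffer[:segment_bytes])
            del buffer[:segment_bytes]
    if buffer:
        yield bytes(buffer)  # leftover tail at end of stream
```

Because the cut points depend only on elapsed audio, not on what was said, a segment can end in the middle of a chant, which is exactly the catch mentioned above.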

For example, the audio chunks might look like this:

Chunk 1: Om Namah Shivay Om Namah ShivayOm Namah
Chunk 2: Shivay Om Namah ShivayOm Namah Shivay

Here I want Gemini to count the total chants as 5 (2 + 0.5 + 0.5 + 2).
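The carry-over arithmetic in the chunk example can be illustrated on text rather than audio. The sketch below assumes each chunk has already been turned into a transcript of whitespace-separated words (a step not shown here, and not something this thread confirms any particular model does reliably). `ChantCounter` is a hypothetical name. It remembers how far into the mantra the previous chunk ended, so two halves in adjacent chunks add up to one chant.

```python
class ChantCounter:
    """Counts completed mantras across chunk transcripts, keeping partial progress."""

    def __init__(self, mantra: str):
        self.words = mantra.lower().split()
        if not self.words:
            raise ValueError("mantra must contain at least one word")
        self.pos = 0    # index of the next expected mantra word
        self.count = 0  # completed chants so far

    def feed(self, transcript: str) -> int:
        """Consume one chunk transcript and return the running total."""
        for word in transcript.lower().split():
            if word == self.words[self.pos]:
                self.pos += 1
                if self.pos == len(self.words):
                    self.count += 1
                    self.pos = 0
            else:
                # Mismatch: drop the partial chant; this word may start a new one.
                self.pos = 1 if word == self.words[0] else 0
        return self.count


counter = ChantCounter("Om Namah Shivay")
print(counter.feed("Om Namah Shivay Om Namah Shivay Om Namah"))  # 2
print(counter.feed("Shivay Om Namah Shivay Om Namah Shivay"))    # 5
```

The exact-match rule is a simplification; real transcripts would likely need fuzzy matching of words.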

How can this be achieved using Gemini?

Below is the link to my repo!

https://github.com/Praj-17/Chant-Counter

I would really appreciate any resources that could help me solve this problem in real time!

MarkDaoust · 2024-04-17T18:16:40Z

We don't have support for streaming audio right now.

There's nothing stopping you from sending one request after another with all the audio chunks you've received so far, and each time ask how many complete chants have been seen.

Or in a more "chat" style, after each new chunk ask how many new ones have been completed.
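A minimal sketch of the first suggestion (resend everything received so far and ask for the number of complete chants) might look like this. The model call is hidden behind a hypothetical `ask_model` callable that takes WAV bytes and returns an integer, since the exact request code is not given in this thread. The 16 kHz, 16-bit mono PCM format is also an assumption.

```python
import io
import wave
from typing import Callable


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


class CumulativeCounter:
    """Resends all audio received so far after every new chunk."""

    def __init__(self, ask_model: Callable[[bytes], int]):
        self.ask_model = ask_model  # hypothetical: WAV bytes -> complete chants
        self.pcm = bytearray()
        self.count = 0

    def add_chunk(self, pcm_chunk: bytes) -> int:
        self.pcm.extend(pcm_chunk)
        answer = self.ask_model(pcm_to_wav(bytes(self.pcm)))
        # Never let the displayed count go backwards if an answer fluctuates.
        self.count = max(self.count, answer)
        return self.count
```

Since the whole recording is resent each time, a chant split across two chunks is seen whole on the next request. The cost is that every request grows with the length of the session. The "chat" variant would send only the new chunk and add the reported number of newly completed chants to a running total.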

daonb · 2024-06-13T14:01:44Z

We also need input audio streaming, but for a less spiritual app. We're working on a WebRTC-based terminal (https://github.com/tuzig/terminal7) and we want to let the user speak to the terminal and get assistance from Gemini. For example, when the user says "Print all my GCP instances", Gemini's response, gcloud compute instances list, is printed on the terminal, leaving the user to press Enter.
I've tested the current interface by sending an inline 10-second recording (~70K) and got a time to first byte (TTFB) of ~3.5 seconds, which is almost too slow. If Gemini accepted voice input over a WebRTC audio channel and processed the packets as they arrived, the TTFB would probably be cut in half.
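For anyone wanting to reproduce this kind of measurement, a rough sketch of timing the first piece of a streamed response is below. `time_to_first_item` and `start_stream` are hypothetical names. `start_stream` stands in for whatever function issues the request and returns an iterator over response chunks.

```python
import time
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_item(start_stream: Callable[[], Iterable[T]]) -> Tuple[float, T]:
    """Return (seconds until the first item arrives, that first item).

    The clock starts before the request is issued, so upload time is included.
    """
    start = time.perf_counter()
    for first in start_stream():
        return time.perf_counter() - start, first
    raise ValueError("stream produced no items")
```

Measured this way, the figure includes the time to upload the whole recording, which is the part that incremental audio input would be expected to overlap with speaking.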

MarkDaoust · 2024-08-27T23:39:37Z

@jlove29

markmcd · 2024-08-27T23:40:30Z

The API / product doesn't support this right now, so there's not much we can do in the cookbook unfortunately. You can file product feedback from within AI Studio, using the ... menu on the top-right.

ymodak added the component:other (Issues unrelated to examples/quickstarts) and type:help (Support-related issues) labels Apr 13, 2024

MarkDaoust changed the title ~~Streaming an audio for output~~ FeatureRequest: Streaming audio input Apr 17, 2024

markmcd closed this as completed Aug 27, 2024

github-actions bot mentioned this issue Sep 6, 2024

Monthly issue metrics report markmcd/gemini-api-cookbook#6

Open
