Add InteractMode Component with TTS and STT Functions #134

Merged

Conversation

o-stahl
Collaborator

@o-stahl o-stahl commented May 29, 2024

Summary


This pull request introduces a new InteractMode component and integrates text-to-speech (TTS) and speech-to-text (STT) functionality (the latter is not yet fully implemented in InteractMode). By default, the enhancement uses the Web Speech API together with OpenAI's Whisper API to improve speech transcription.

Screenshot 2024-05-30 020714

Key Changes


  1. InteractMode Component:

    • Implemented the InteractMode component to handle speech interactions within the chat application.
    • Added functionality to monitor and visualize audio input in real-time.
  2. fetchTTSResponse Function:

    • Added fetchTTSResponse function to convert text to speech using the OpenAI API.
    • Ensures high-quality audio playback of transcribed text.
  3. fetchSTTResponse Function:

    • Added fetchSTTResponse function to transcribe audio to text using the OpenAI Whisper API.
    • Utilizes the Web Speech API for initial speech detection and transcription.
    • Switches to Whisper API for more accurate transcription when enabled.
  4. Toggle for Enhanced Accuracy:

    • Introduced a toggle to switch between Web Speech API and Whisper API for transcription.
    • Ensures only relevant speech is transcribed, reducing noise and improving accuracy.
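As a rough illustration of items 2 to 4, the sketch below builds (without sending) a request for OpenAI's `/v1/audio/speech` endpoint and models the transcription toggle as a simple selector. The names `buildTtsRequest`, `selectSttBackend`, and `SpeechRequest`, as well as the default voice, are assumptions made for this example and are not taken from the PR's code.

```typescript
// Shape of a request that could later be passed to fetch(url, init).
interface SpeechRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds, but does not send, a text-to-speech request for OpenAI's
// /v1/audio/speech endpoint. The helper name and defaults are hypothetical.
function buildTtsRequest(
  text: string,
  apiKey: string,
  model: string = "tts-1",
  voice: string = "alloy",
): SpeechRequest {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, input: text, voice }),
    },
  };
}

// Which transcription backend the accuracy toggle would select (hypothetical type).
type SttBackend = "web-speech" | "whisper";

function selectSttBackend(useWhisper: boolean): SttBackend {
  return useWhisper ? "whisper" : "web-speech";
}
```

In a browser, the result could be used as `fetch(req.url, req.init)` and the response read as a blob for playback. A Whisper transcription would instead be a `multipart/form-data` POST to `/v1/audio/transcriptions` with a `file` field and `model` set to `whisper-1`.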

Benefits


  • Enhanced user experience by enabling multimodal interaction.
  • Improved usage of OpenAI's endpoints, now also including TTS and STT.
  • Provides users with accurate and reliable speech-to-text and text-to-speech capabilities.

Notes & future plans

This is the first revision and only implements transcription of user speech into messages, but it should be perfectly usable in its current state.

  • Speech to text on assistant messages when the interact mode is enabled. (40754)
  • Settings tab for TTS/STT related selections especially whether to use only Web Speech API.
  • Adding TTS/STT functionalities to the other providers.

Auto Generated Notes (Do Not Change)


@o-stahl o-stahl force-pushed the feature/interact-mode-enhancements branch from bac1d77 to aeb5ab6 Compare May 30, 2024 20:19
@fingerthief
Owner

Really excellent work on this!

I've done some testing and I think this is easily solid enough to go ahead and merge into the main branch.

I made one commit to tweak a few little things:

  • Added a dynamic check for the highest-quality audio format supported by the user's current device. It starts with the highest-quality format and falls back to the next one down until it finds a format the device supports.

  • Stopped showing the no-speech error while in interact mode. Otherwise it appears as an error after a short stretch of silence.

  • Increased audio playback speed by 5%

  • switched to tts-1-hd model as it seems to work fine

    • Soon enough this will be user configurable, along with speed etc.
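The format fallback described in the first bullet could look roughly like the sketch below. The candidate list, its ordering, and the name `pickSupportedAudioType` are assumptions for illustration; the actual commit may check different formats. The support check is passed in as a predicate so the logic can run outside a browser.

```typescript
// Candidate MIME types, ordered from most to least preferred.
// This list and its ordering are illustrative assumptions.
const CANDIDATE_AUDIO_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/ogg;codecs=opus",
  "audio/mp4",
  "audio/wav",
];

// Returns the first candidate accepted by the predicate,
// or undefined when none of the candidates is supported.
function pickSupportedAudioType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_AUDIO_TYPES,
): string | undefined {
  return candidates.find((type) => isSupported(type));
}
```

In a browser the predicate would typically wrap `MediaRecorder.isTypeSupported`, for example `pickSupportedAudioType((t) => MediaRecorder.isTypeSupported(t))`, and the result would be passed as the `mimeType` option when constructing the `MediaRecorder`.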

    Notes

    I know mobile support for interact mode has some wonkiness, on my phone at least, and I'll be creating an issue for that problem. I also have a rough idea for a dynamic noise floor calculation, so that our speech detection floor can vary with microphone sensitivity.
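One possible shape for such a dynamic noise floor, offered only as a sketch of the idea and not as the planned implementation: track the RMS level of non-speech frames with an exponential moving average and treat a frame as speech when it exceeds the floor by some margin. The class name, the default smoothing factor, and the margin are all assumptions.

```typescript
// Root-mean-square level of one frame of samples in the range [-1, 1].
function rms(frame: readonly number[]): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (const sample of frame) sumSquares += sample * sample;
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical detector: the floor follows quiet frames via an exponential
// moving average, so it adapts to the microphone's background level.
class AdaptiveNoiseFloor {
  private floor: number;
  private readonly smoothing: number;
  private readonly margin: number;

  constructor(initialFloor: number = 0.01, smoothing: number = 0.05, margin: number = 2.0) {
    this.floor = initialFloor;
    this.smoothing = smoothing;
    this.margin = margin;
  }

  // Returns true when the frame is loud enough to count as speech.
  // Frames below the threshold are folded into the floor estimate.
  isSpeech(frame: readonly number[]): boolean {
    const level = rms(frame);
    if (level > this.floor * this.margin) return true;
    this.floor += this.smoothing * (level - this.floor);
    return false;
  }

  get currentFloor(): number {
    return this.floor;
  }
}
```

A known limitation of this simple form is that a sudden, sustained rise in background noise is classified as speech and never lowers the threshold, so a real implementation would likely need a slow upward adaptation as well.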

@fingerthief fingerthief merged commit 229cc86 into fingerthief:main May 31, 2024
2 of 3 checks passed
@o-stahl o-stahl deleted the feature/interact-mode-enhancements branch June 1, 2024 08:28
@o-stahl
Collaborator Author

o-stahl commented Jun 1, 2024

  • switched to tts-1-hd model as it seems to work fine

OpenAI's regular "tts-1" model is faster and half the price, while according to user feedback the quality difference is (or at least was) barely noticeable, even with audiophile gear. However, as you mentioned, model selection will take care of different preferences.

2 participants