
ONNX streaming support #255

Merged
merged 3 commits into from
Apr 24, 2024

Conversation

mush42
Contributor

@mush42 mush42 commented Oct 27, 2023

Link to issue number:

Issue #25

Summary of the issue:

Piper uses sentence-level streaming.

For short sentences, Piper's output latency is low thanks to its good real-time factor (RTF). For longer sentences, however, latency is prohibitively high, which hinders real-time applications such as screen readers.

Description of how this pull request fixes the issue:

This PR implements streaming output by splitting the VITS model into two parts: encoder and decoder.

First, the encoder output is generated for the whole utterance at once; then it is split into chunks of frames using the given chunk size and fed to the decoder chunk by chunk.

To maintain speech quality, each chunk fed to the decoder is padded with some frames from the previous and next chunks, and the corresponding wave samples are then trimmed from the final audio output.
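The padding-and-trimming scheme can be sketched as follows. This is a simplified illustration, not the PR's actual code: `decode_fn` is a placeholder for the exported decoder, and the default chunk size, pad, and frames-to-samples upsample factor (256 samples per frame, VITS's usual hop length) are assumptions.

```python
import numpy as np

def decode_in_chunks(encoder_out, decode_fn, chunk_size=45, pad=5, upsample=256):
    """Decode encoder frames chunk by chunk with overlap padding.

    encoder_out: (n_frames, dim) array of encoder output frames
    decode_fn:   placeholder for the decoder; maps a frame slice to audio,
                 producing `upsample` samples per frame
    chunk_size:  frames per decoder call
    pad:         context frames borrowed from the neighbouring chunks
    """
    n = encoder_out.shape[0]
    pieces = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Borrow a few frames of context from the previous and next chunks.
        p_start = max(0, start - pad)
        p_end = min(n, end + pad)
        wav = decode_fn(encoder_out[p_start:p_end])
        # Trim the audio produced by the borrowed padding frames.
        left = (start - p_start) * upsample
        right = (p_end - end) * upsample
        stop = len(wav) - right if right > 0 else None
        pieces.append(wav[left:stop])
    return np.concatenate(pieces)
```

With a context-free `decode_fn` the chunked output is identical to decoding everything at once; with a real decoder the padding only approximates the missing context, which is why chunk size and pad are latency/quality knobs.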

To export a checkpoint, use the command:

python3 -m piper_train.export_onnx_streaming CHECKPOINT_PATH ONNX_OUTPUT_DIR

For inference, use the command:

cat input.json | python3 -m piper_train.infer_onnx_streaming --encoder ENCODER_ONNX_PATH --decoder DECODER_ONNX_PATH

This pipes raw wave bytes to stdout; you can then redirect the output to any wave-playing program.

Testing performed:

Tested export and inference using the hfc-male checkpoint.

Known issues with pull request:

The encoder has many components, some of which could be moved into the decoder to further reduce latency, but doing so impacts naturalness. There is a trade-off to be made between encoder inference speed (latency) and the naturalness of the generated speech.

For instance, the flow component can be included in either the encoder or the decoder. When included in the encoder, it adds significant latency to the encoder. On the other hand, chunking the input to the flow component (as part of the decoder) may impact speech quality (not verified).

We need to empirically determine which components can be made streamable, and which ones should generate their output at once.

@mush42 mush42 marked this pull request as ready for review November 12, 2023 08:56
@mush42
Contributor Author

mush42 commented Nov 12, 2023

@synesthesiam
There is a working implementation of this in the piper-rs repo.

Do you feel positive about merging this?

Best
Musharraf

@mush42
Contributor Author

mush42 commented Dec 1, 2023

@synesthesiam
I think this is ready for merging.

@marty1885
Contributor

Just dropping by to say I love this! I've written my own C++ inference server, and this is a major issue I ran into.

@marty1885
Contributor

@mush42 How do I get input.json? I've been trying to generate phoneme IDs manually, but I get no output (a zero-length stream). Can you provide an example?

@eeejay

eeejay commented Feb 21, 2024

I don't fully understand everything in this pull request, but I have a feeling this approach could be used to implement word tracking, since the sub-sentence phonemes can be synthesized in chunks. It would be cool if the stream API were available through a PiperVoice.

@mush42
Contributor Author

mush42 commented Apr 2, 2024

@eeejay
Phoneme duration is a better option for word tracking.
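To illustrate the duration-based idea: word start times can be derived by accumulating per-phoneme durations. This is a hypothetical sketch, assuming you have the model's predicted per-phoneme durations (in decoder frames) and a mapping from each phoneme to its word index; the default of 256 samples per frame at 22050 Hz is an assumption about the voice's hop size.

```python
def word_start_times(durations, word_ids, frame_ms=256 / 22050 * 1000):
    """Approximate word start times (ms) from per-phoneme durations.

    durations: per-phoneme duration in decoder frames
    word_ids:  the word index each phoneme belongs to
    frame_ms:  milliseconds of audio per frame (hop size / sample rate)
    """
    starts, elapsed, current = {}, 0.0, None
    for dur, wid in zip(durations, word_ids):
        # Record the elapsed time when a new word begins.
        if wid != current:
            starts[wid] = elapsed
            current = wid
        elapsed += dur * frame_ms
    return starts
```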

@jaredhagel2

The Python torch library used to stream the real-time-format Piper voices is large, and our device has limited storage available. Are there any plans to modify the main piper executable to support streaming these real-time-format Piper voices?

@marty1885
Contributor

I built paroli, and mush42 has his sonata. Both support streaming-mode Piper models.

@jaredhagel2

Thanks for this @marty1885! These look great!

@mush42
Contributor Author

mush42 commented Aug 15, 2024

@marty1885
BTW, I watched the video of your streaming implementation. It could be even better if you applied a window function to each chunk to smooth out abrupt changes at chunk boundaries.
A simple fade-in/fade-out effect on each chunk would be enough.
You can refer to Sonata's source if you want to learn more.
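As a rough sketch of the fade-in/fade-out idea (not Sonata's actual code): a linear crossfade over a small overlap keeps the sum of the two gain ramps at 1, so a continuous signal passes through the join unchanged while discontinuities are smoothed.

```python
import numpy as np

def crossfade_concat(chunks, overlap):
    """Concatenate audio chunks, crossfading `overlap` samples at each join."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in  # the two ramps sum to exactly 1 everywhere
    out = chunks[0].astype(float)
    for nxt in chunks[1:]:
        nxt = nxt.astype(float)
        # Blend the tail of the accumulated audio with the head of the next chunk.
        out[-overlap:] = out[-overlap:] * fade_out + nxt[:overlap] * fade_in
        out = np.concatenate([out, nxt[overlap:]])
    return out
```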

@marty1885
Contributor

marty1885 commented Aug 15, 2024

@mush42 That's already done. The gap you hear is from the WebSocket JS thread not being real-time.

Actually, I implemented a similarity-based search to find the optimal point to concatenate the audio. I think it works even better than a simple fade in and out.
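A similarity-based splice search of the kind described might look like this hypothetical sketch (not paroli's actual code): slide the next chunk's head against the previous chunk's tail and pick the offset with the highest normalized cross-correlation, then join the audio there.

```python
import numpy as np

def best_splice_offset(tail, head, search=64):
    """Find the shift of `head` (next chunk's start) that best matches
    `tail` (previous chunk's end) via normalized cross-correlation."""
    n = len(tail)
    best, best_score = 0, -np.inf
    for off in range(search):
        seg = head[off:off + n]
        if len(seg) < n:
            break
        denom = np.linalg.norm(tail) * np.linalg.norm(seg)
        score = float(np.dot(tail, seg) / denom) if denom else 0.0
        if score > best_score:
            best_score, best = score, off
    return best
```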

@mush42
Contributor Author

mush42 commented Aug 15, 2024

@marty1885 OK I understand.

@mush42
Contributor Author

mush42 commented Aug 15, 2024

@marty1885
That's actually very cool. I'll take a look and port your approach to Sonata.
I'm glad I brought it up.

@jaredhagel2

Would there be value in (or is it even feasible to) merging Paroli into Piper? I thought this would be easier than merging Sonata into Piper, since Paroli is written in C++. Just an idea from someone who would love to learn a lot more about Paroli, Piper, and Sonata (so take it with a grain of salt...)

@marty1885
Contributor

@synesthesiam What do you think?

The major changes I made to Piper are abstracting the ONNX inference code to allow RKNN (and potentially other accelerators) as a backend, plus some API changes to properly support low-latency streaming.

The main reason I forked is the additional dependencies (drogon, libopusenc, soxr) that Piper core doesn't need.

@jaredhagel2

jaredhagel2 commented Sep 17, 2024

I see a status that states 'This pull request was closed'. GitHub doesn't give me much information on when this was done. Was it done recently? Is there information on who closed the pull request, or why?

Now that I've posted this comment, I see 'This pull request was closed' always appears below my comment. My guess is that the close happened quite a while ago.

@mush42
Contributor Author

mush42 commented Sep 17, 2024

@jaredhagel2 this PR has been merged into Piper as an example of how to implement streaming support. It is not implemented in the C++ app, though.

@jaredhagel2

Oh I see. Thank you for the clarification.
