
FEAT: Complete implementation of GGML_OP_CONV_1D #523

Merged · 6 commits merged into ggerganov:master on Sep 28, 2023

Conversation

@PABannier (Contributor)

Currently, the 1d convolution is only implemented for half padding and strides 1 and 2. Yet the 1d convolution is a crucial operation, needed for instance in bark.cpp and encodec.cpp.

This PR completes the implementation of the 1d convolution (for f32 and f16 src types). It also updates the computation of the size needed for the work buffer.
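
With this change, the generalized call takes explicit stride, padding, and dilation. A minimal usage sketch (conv_w and inp are hypothetical tensor names; the output length follows the usual convolution formula):

// a (kernel): ne = [K, C_in, C_out], b (input): ne = [L, C_in]
// output length: L_out = (L + 2*p0 - d0*(K - 1) - 1)/s0 + 1
struct ggml_tensor * out = ggml_conv_1d(
        ctx,
        conv_w,        // f16 or f32 kernel tensor (hypothetical name)
        inp,           // f32 input tensor (hypothetical name)
        /*s0 =*/ 2,    // stride
        /*p0 =*/ 1,    // padding
        /*d0 =*/ 1);   // dilation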

@ggerganov (Owner)

Thanks!

I just tested this by merging master into this branch and running the whisper example.
The transcription results are now wrong.

Whisper uses the ggml_conv_1d operator:

// convolution + gelu
{
    cur = ggml_conv_1d_ph(ctx0, model.e_conv_1_w, mel, 1, 1);
    cur = ggml_add(ctx0,
            ggml_repeat(ctx0,
                model.e_conv_1_b,
                cur),
            cur);
    cur = ggml_gelu(ctx0, cur);

    cur = ggml_conv_1d_ph(ctx0, model.e_conv_2_w, cur, 2, 1);
    cur = ggml_add(ctx0,
            ggml_repeat(ctx0,
                model.e_conv_2_b,
                cur),
            cur);
    cur = ggml_gelu(ctx0, cur);
}
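
(For context: ggml_conv_1d_ph is the "half padding" convenience wrapper; as far as I recall it is equivalent to calling ggml_conv_1d with p0 = a->ne[0]/2:)

// ggml_conv_1d_ph(ctx, a, b, s, d) should be equivalent to:
ggml_conv_1d(ctx, a, b, s, /*p0 =*/ a->ne[0]/2, /*d0 =*/ d);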

Repro:

./bin/whisper -m ../../whisper.cpp/models/ggml-small.en.bin -f ../../whisper.cpp/samples/gb0.wav 

@ggerganov (Owner)

If you rebase onto the latest master, you can simply run the following command in the ggml root directory:

bash ./ci/run.sh ./tmp/results ./tmp/mnt

It will run the CI locally; the Whisper test comes at the end of the run.
If everything works correctly, you should see output like this:

https://github.com/ggml-org/ci/tree/results/ggml/a1/f6ca42699228b0b4223240a2cf507732a1e716/ggml-0-x86-cpu-low-perf#whisper

whisper_init_from_file_no_state: loading model from '../models-mnt/whisper//ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.10 MB
whisper_init_state: compute buffer (encode) =   81.85 MB
whisper_init_state: compute buffer (cross)  =    4.40 MB
whisper_init_state: compute buffer (decode) =   24.61 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing '../models-mnt/whisper//jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =    87.16 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    21.10 ms
whisper_print_timings:   sample time =    16.29 ms /    27 runs (    0.60 ms per run)
whisper_print_timings:   encode time =  1974.43 ms /     1 runs ( 1974.43 ms per run)
whisper_print_timings:   decode time =   126.99 ms /    27 runs (    4.70 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2262.84 ms

@PABannier (Contributor, Author)

PABannier commented Sep 15, 2023

@ggerganov Thanks for pushing a way to test conv_1d.

I took the code from ggml_conv_2d and essentially accounted for one fewer spatial dimension for the kernel and the input. The test is still not passing. Is there any documentation available on how the 2d convolution is implemented in ggml?
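
What I am trying to express is the direct form, i.e. conv_2d with one spatial dimension removed. Conceptually it looks like this (an illustrative sketch over plain float buffers, not the actual ggml code):

// Naive 1d convolution over row-major float buffers:
//   input : [C_in][L], kernel: [C_out][C_in][K], output: [C_out][L_out]
//   L_out = (L + 2*p - d*(K - 1) - 1)/s + 1
static void conv_1d_naive(
        const float * input, const float * kernel, float * output,
        int C_in, int C_out, int L, int K, int s, int p, int d) {
    const int L_out = (L + 2*p - d*(K - 1) - 1)/s + 1;
    for (int oc = 0; oc < C_out; ++oc) {
        for (int ol = 0; ol < L_out; ++ol) {
            float sum = 0.0f;
            for (int ic = 0; ic < C_in; ++ic) {
                for (int k = 0; k < K; ++k) {
                    const int il = ol*s + k*d - p; // may land in the zero padding
                    if (il >= 0 && il < L) {
                        sum += kernel[(oc*C_in + ic)*K + k] * input[ic*L + il];
                    }
                }
            }
            output[oc*L_out + ol] = sum;
        }
    }
}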

whisper_init_from_file_no_state: loading model from '../models-mnt/whisper//ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.10 MB
whisper_init_state: compute buffer (encode) =   81.85 MB
whisper_init_state: compute buffer (cross)  =    4.40 MB
whisper_init_state: compute buffer (decode) =   24.61 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

main: processing '../models-mnt/whisper//jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:17.000]   [Music]

whisper_print_timings:     load time =   161.08 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =    17.59 ms
whisper_print_timings:   sample time =   449.60 ms /   455 runs (    0.99 ms per run)
whisper_print_timings:   encode time =  3536.56 ms /     1 runs ( 3536.56 ms per run)
whisper_print_timings:   decode time =  1822.40 ms /   453 runs (    4.02 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6079.51 ms

@PABannier (Contributor, Author)

Works for me now! @ggerganov
Inspired by the fast Conv2D implementation in #483.
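
The idea borrowed from #483 is to lower the convolution to a matrix multiplication: an im2col pass unrolls the (padded, dilated) input windows into a matrix, and a single mat-mul against the flattened kernel then yields all output channels at once. A rough sketch of the 1d case (illustrative only, not the actual ggml kernels):

// im2col for 1d conv: turn input [C_in][L] into cols [L_out][C_in*K],
// so that output [C_out][L_out] = kernel [C_out][C_in*K] x cols^T.
static void im2col_1d(
        const float * input, float * cols,
        int C_in, int L, int K, int s, int p, int d) {
    const int L_out = (L + 2*p - d*(K - 1) - 1)/s + 1;
    for (int ol = 0; ol < L_out; ++ol) {
        for (int ic = 0; ic < C_in; ++ic) {
            for (int k = 0; k < K; ++k) {
                const int il = ol*s + k*d - p;
                cols[(ol*C_in + ic)*K + k] =
                    (il >= 0 && il < L) ? input[ic*L + il] : 0.0f; // zeros for padding
            }
        }
    }
}
// After im2col, each output element is a plain dot product:
//   output[oc][ol] = dot(kernel[oc][:], cols[ol][:])  -- one mat-mul overall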

whisper_init_from_file_no_state: loading model from '../models-mnt/whisper//ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   15.48 MB
whisper_init_state: compute buffer (encode) =   81.85 MB
whisper_init_state: compute buffer (cross)  =    4.40 MB
whisper_init_state: compute buffer (decode) =   24.61 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

main: processing '../models-mnt/whisper//jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   162.24 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    17.83 ms
whisper_print_timings:   sample time =    19.15 ms /    27 runs (    0.71 ms per run)
whisper_print_timings:   encode time =  1666.83 ms /     1 runs ( 1666.83 ms per run)
whisper_print_timings:   decode time =   117.93 ms /    27 runs (    4.37 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2076.42 ms

@ggerganov (Owner)

Awesome - will take a look in the next few days.

@PABannier (Contributor, Author)

@ggerganov Can somebody have a look please? I need it to complete bark.cpp and the implementation of other TTS models :) This would greatly help me. Thanks!

@ggerganov (Owner)

@PABannier Yes, sorry for the delay - was travelling for the past week. I'm back now and will catch up with everything today and tomorrow

ggerganov merged commit a706d68 into ggerganov:master on Sep 28, 2023
4 checks passed
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this pull request Dec 18, 2023