
feat: grouped conv1d #1749

Merged: 3 commits merged into OpenNMT:master on Aug 13, 2024

Conversation

ebraraktas
Contributor

Implements grouped conv1d. This was requested by @homink.
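For context, here is a minimal reference sketch of what a grouped 1D convolution computes (illustrative C++ only, not the CTranslate2 kernel; the function name and the plain std::vector layout are assumptions). With groups equal to the number of input channels it reduces to a depthwise convolution.

#include <cstddef>
#include <vector>

// Naive grouped conv1d: stride 1, no padding, no bias.
// input : [in_channels][time]
// weight: [out_channels][in_channels / groups][kernel_size]
// Each output channel only sees the input channels of its own group.
std::vector<std::vector<float>> grouped_conv1d(
    const std::vector<std::vector<float>>& input,
    const std::vector<std::vector<std::vector<float>>>& weight,
    std::size_t groups) {
  const std::size_t in_channels = input.size();
  const std::size_t out_channels = weight.size();
  const std::size_t kernel_size = weight[0][0].size();
  const std::size_t out_time = input[0].size() - kernel_size + 1;
  const std::size_t in_per_group = in_channels / groups;
  const std::size_t out_per_group = out_channels / groups;

  std::vector<std::vector<float>> output(out_channels,
                                         std::vector<float>(out_time, 0.f));
  for (std::size_t oc = 0; oc < out_channels; ++oc) {
    const std::size_t g = oc / out_per_group;  // group of this output channel
    for (std::size_t t = 0; t < out_time; ++t) {
      float acc = 0.f;
      for (std::size_t ic = 0; ic < in_per_group; ++ic)
        for (std::size_t k = 0; k < kernel_size; ++k)
          acc += input[g * in_per_group + ic][t + k] * weight[oc][ic][k];
      output[oc][t] = acc;
    }
  }
  return output;
}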

@homink
Contributor

homink commented Aug 7, 2024

@ebraraktas Thanks for your PR. Great job! I built this locally and tested it. I now see improved computation in the wav2vec2 model and will open a PR soon. But I found one thing to share here. On GPU with int8 quantization, Whisper works as it did before, but wav2vec2 with groups greater than 1 hits the thrown exception you anticipated in your previous PR. The workaround you suggested there, shown below, does work:

import os

from ctranslate2.specs import model_spec  # assumed import; the class lives in the converter's spec module

class Conv1DSpec(model_spec.LayerSpec):
    def __init__(self):
        self.weight = None
        # When CTRANSLATE2_CONV1D_NO_QUANTIZE=1, weight_scale is omitted so the conv weight stays unquantized.
        if os.getenv("CTRANSLATE2_CONV1D_NO_QUANTIZE", "0") != "1":
            self.weight_scale = model_spec.OPTIONAL
        self.bias = model_spec.OPTIONAL

I would prefer int8 quantization in Conv1D, but an unquantized Conv1D would be OK for wav2vec2. I am still curious why the Whisper model doesn't need such an unquantized Conv1D. I think both models use the same Conv1D, except that wav2vec2 uses groups greater than 1. Any comments?

@ebraraktas
Contributor Author

ebraraktas commented Aug 7, 2024

I don't expect any conversion errors on this branch. Can you share your conversion script? @homink

@homink
Contributor

homink commented Aug 7, 2024

@ebraraktas I didn't hit any conversion errors with int8 quantization for either ASR model; both conversions completed without errors, and both converted models ran recognition as expected on CPU. But wav2vec2 hit that thrown exception during recognition on GPU. With an unquantized Conv1D, wav2vec2 worked for both conversion and recognition. I have been using the following command for conversion:

ct2-transformers-converter --model $src_dir --output_dir $tgt_dir --quantization int8

@ebraraktas
Contributor Author

ebraraktas commented Aug 7, 2024

@homink I see. This commit must have fixed that case, too: it converts conv weights to float if the device is CUDA (see these lines). If you have a chance to debug, it would be great if you could check whether that code path is hit. Otherwise, can you check whether the conv weights have the conv substring in their names?
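To make the described fallback concrete, here is a hedged sketch (the enum and helper names are hypothetical; the real logic is in the linked lines of models.cc):

#include <string>

// Hypothetical helper mirroring the behavior described above: conv weights
// are 3D, so on CUDA they are kept in float instead of being quantized to int8.
enum class Device { CPU, CUDA };
enum class DataType { FLOAT32, INT8 };

DataType select_weight_dtype(const std::string& name,
                             DataType requested_dtype,
                             Device device) {
  const bool is_conv = name.find("conv") != std::string::npos;
  if (is_conv && device == Device::CUDA)
    return DataType::FLOAT32;  // fall back to float for conv weights on GPU
  return requested_dtype;
}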

@homink
Contributor

homink commented Aug 7, 2024

I printed out name by adding a std::cout statement:

          if (is_quantizable(name)) {
            auto variable_weight_dtype = weight_dtype;
            // For conv layer, we need to reshape to ensure dtype as its weights are 3D.
            std::cout << "Name: " << name << std::endl;
            auto is_conv = name.find("conv") != std::string::npos;

The wav2vec2 model:

Name: encoder/fp_projection/weight
Name: encoder/layer_0/ffn/linear_0/weight
Name: encoder/layer_0/ffn/linear_1/weight
Name: encoder/layer_0/self_attention/linear_0/weight
Name: encoder/layer_0/self_attention/linear_1/weight
Name: encoder/layer_10/ffn/linear_0/weight
Name: encoder/layer_10/ffn/linear_1/weight
Name: encoder/layer_10/self_attention/linear_0/weight
Name: encoder/layer_10/self_attention/linear_1/weight
Name: encoder/layer_11/ffn/linear_0/weight
Name: encoder/layer_11/ffn/linear_1/weight
Name: encoder/layer_11/self_attention/linear_0/weight
Name: encoder/layer_11/self_attention/linear_1/weight
Name: encoder/layer_12/ffn/linear_0/weight
Name: encoder/layer_12/ffn/linear_1/weight
Name: encoder/layer_12/self_attention/linear_0/weight
Name: encoder/layer_12/self_attention/linear_1/weight
Name: encoder/layer_13/ffn/linear_0/weight
Name: encoder/layer_13/ffn/linear_1/weight
Name: encoder/layer_13/self_attention/linear_0/weight
Name: encoder/layer_13/self_attention/linear_1/weight
Name: encoder/layer_14/ffn/linear_0/weight
Name: encoder/layer_14/ffn/linear_1/weight
Name: encoder/layer_14/self_attention/linear_0/weight
Name: encoder/layer_14/self_attention/linear_1/weight
Name: encoder/layer_15/ffn/linear_0/weight
Name: encoder/layer_15/ffn/linear_1/weight
Name: encoder/layer_15/self_attention/linear_0/weight
Name: encoder/layer_15/self_attention/linear_1/weight
Name: encoder/layer_16/ffn/linear_0/weight
Name: encoder/layer_16/ffn/linear_1/weight
Name: encoder/layer_16/self_attention/linear_0/weight
Name: encoder/layer_16/self_attention/linear_1/weight
Name: encoder/layer_17/ffn/linear_0/weight
Name: encoder/layer_17/ffn/linear_1/weight
Name: encoder/layer_17/self_attention/linear_0/weight
Name: encoder/layer_17/self_attention/linear_1/weight
Name: encoder/layer_18/ffn/linear_0/weight
Name: encoder/layer_18/ffn/linear_1/weight
Name: encoder/layer_18/self_attention/linear_0/weight
Name: encoder/layer_18/self_attention/linear_1/weight
Name: encoder/layer_19/ffn/linear_0/weight
Name: encoder/layer_19/ffn/linear_1/weight
Name: encoder/layer_19/self_attention/linear_0/weight
Name: encoder/layer_19/self_attention/linear_1/weight
Name: encoder/layer_1/ffn/linear_0/weight
Name: encoder/layer_1/ffn/linear_1/weight
Name: encoder/layer_1/self_attention/linear_0/weight
Name: encoder/layer_1/self_attention/linear_1/weight
Name: encoder/layer_20/ffn/linear_0/weight
Name: encoder/layer_20/ffn/linear_1/weight
Name: encoder/layer_20/self_attention/linear_0/weight
Name: encoder/layer_20/self_attention/linear_1/weight
Name: encoder/layer_21/ffn/linear_0/weight
Name: encoder/layer_21/ffn/linear_1/weight
Name: encoder/layer_21/self_attention/linear_0/weight
Name: encoder/layer_21/self_attention/linear_1/weight
Name: encoder/layer_22/ffn/linear_0/weight
Name: encoder/layer_22/ffn/linear_1/weight
Name: encoder/layer_22/self_attention/linear_0/weight
Name: encoder/layer_22/self_attention/linear_1/weight
Name: encoder/layer_23/ffn/linear_0/weight
Name: encoder/layer_23/ffn/linear_1/weight
Name: encoder/layer_23/self_attention/linear_0/weight
Name: encoder/layer_23/self_attention/linear_1/weight
Name: encoder/layer_2/ffn/linear_0/weight
Name: encoder/layer_2/ffn/linear_1/weight
Name: encoder/layer_2/self_attention/linear_0/weight
Name: encoder/layer_2/self_attention/linear_1/weight
Name: encoder/layer_3/ffn/linear_0/weight
Name: encoder/layer_3/ffn/linear_1/weight
Name: encoder/layer_3/self_attention/linear_0/weight
Name: encoder/layer_3/self_attention/linear_1/weight
Name: encoder/layer_4/ffn/linear_0/weight
Name: encoder/layer_4/ffn/linear_1/weight
Name: encoder/layer_4/self_attention/linear_0/weight
Name: encoder/layer_4/self_attention/linear_1/weight
Name: encoder/layer_5/ffn/linear_0/weight
Name: encoder/layer_5/ffn/linear_1/weight
Name: encoder/layer_5/self_attention/linear_0/weight
Name: encoder/layer_5/self_attention/linear_1/weight
Name: encoder/layer_6/ffn/linear_0/weight
Name: encoder/layer_6/ffn/linear_1/weight
Name: encoder/layer_6/self_attention/linear_0/weight
Name: encoder/layer_6/self_attention/linear_1/weight
Name: encoder/layer_7/ffn/linear_0/weight
Name: encoder/layer_7/ffn/linear_1/weight
Name: encoder/layer_7/self_attention/linear_0/weight
Name: encoder/layer_7/self_attention/linear_1/weight
Name: encoder/layer_8/ffn/linear_0/weight
Name: encoder/layer_8/ffn/linear_1/weight
Name: encoder/layer_8/self_attention/linear_0/weight
Name: encoder/layer_8/self_attention/linear_1/weight
Name: encoder/layer_9/ffn/linear_0/weight
Name: encoder/layer_9/ffn/linear_1/weight
Name: encoder/layer_9/self_attention/linear_0/weight
Name: encoder/layer_9/self_attention/linear_1/weight
Name: encoder/lm_head/weight

The whisper model:

Name: decoder/embeddings/weight
Name: decoder/layer_0/attention/linear_0/weight
Name: decoder/layer_0/attention/linear_1/weight
Name: decoder/layer_0/attention/linear_2/weight
Name: decoder/layer_0/ffn/linear_0/weight
Name: decoder/layer_0/ffn/linear_1/weight
Name: decoder/layer_0/self_attention/linear_0/weight
Name: decoder/layer_0/self_attention/linear_1/weight
Name: decoder/layer_1/attention/linear_0/weight
Name: decoder/layer_1/attention/linear_1/weight
Name: decoder/layer_1/attention/linear_2/weight
Name: decoder/layer_1/ffn/linear_0/weight
Name: decoder/layer_1/ffn/linear_1/weight
Name: decoder/layer_1/self_attention/linear_0/weight
Name: decoder/layer_1/self_attention/linear_1/weight
Name: decoder/layer_2/attention/linear_0/weight
Name: decoder/layer_2/attention/linear_1/weight
Name: decoder/layer_2/attention/linear_2/weight
Name: decoder/layer_2/ffn/linear_0/weight
Name: decoder/layer_2/ffn/linear_1/weight
Name: decoder/layer_2/self_attention/linear_0/weight
Name: decoder/layer_2/self_attention/linear_1/weight
Name: decoder/layer_3/attention/linear_0/weight
Name: decoder/layer_3/attention/linear_1/weight
Name: decoder/layer_3/attention/linear_2/weight
Name: decoder/layer_3/ffn/linear_0/weight
Name: decoder/layer_3/ffn/linear_1/weight
Name: decoder/layer_3/self_attention/linear_0/weight
Name: decoder/layer_3/self_attention/linear_1/weight
Name: encoder/conv1/weight
Name: encoder/conv2/weight
Name: encoder/layer_0/ffn/linear_0/weight
Name: encoder/layer_0/ffn/linear_1/weight
Name: encoder/layer_0/self_attention/linear_0/weight
Name: encoder/layer_0/self_attention/linear_1/weight
Name: encoder/layer_1/ffn/linear_0/weight
Name: encoder/layer_1/ffn/linear_1/weight
Name: encoder/layer_1/self_attention/linear_0/weight
Name: encoder/layer_1/self_attention/linear_1/weight
Name: encoder/layer_2/ffn/linear_0/weight
Name: encoder/layer_2/ffn/linear_1/weight
Name: encoder/layer_2/self_attention/linear_0/weight
Name: encoder/layer_2/self_attention/linear_1/weight
Name: encoder/layer_3/ffn/linear_0/weight
Name: encoder/layer_3/ffn/linear_1/weight
Name: encoder/layer_3/self_attention/linear_0/weight
Name: encoder/layer_3/self_attention/linear_1/weight

The wav2vec2 output doesn't show any name containing 'conv'. Here are the spec class variables used for the wav2vec2 conversion in Python. I am not sure why names such as feat_layer or pos_conv_embed are not printed. It's also unclear why the Whisper model only prints up to layer_3 rather than layer_23, where 0-23 is the number of transformer blocks in the model architecture. @ebraraktas, any comments?

spec.encoder.feat_layer0.conv.weight
spec.encoder.feat_layer[0].conv.weight
spec.encoder.feat_layer[5].conv.weight
spec.encoder.fp_projection.weight
spec.encoder.pos_conv_embed.conv.weight
spec.encoder.layer[0].ffn.linear_0.weight
spec.encoder.layer[0].ffn.linear_1.weight
spec.encoder.layer[0].self_attention.variables()
{'layer_norm/gamma': Parameter containing:
 tensor([ 0.2802,  0.2733, -0.0574,  ..., -0.0397,  0.1188, -0.1281],
        requires_grad=True),
 'layer_norm/beta': Parameter containing:
 tensor([ 0.0501,  0.0644, -0.0103,  ...,  0.0174, -0.0076, -0.0087],
        requires_grad=True),
 'linear_0/weight': tensor([[-0.0561, -0.0965,  0.0279,  ...,  0.0462, -0.0082,  0.0460],
         [-0.0152,  0.0842, -0.0056,  ..., -0.0384, -0.0604, -0.0390],
         [ 0.0260,  0.0472, -0.0101,  ..., -0.0438, -0.0861, -0.0391],
         ...,
         [ 0.0087, -0.0871, -0.0189,  ..., -0.0321, -0.1284, -0.0326],
         [ 0.1075,  0.2023,  0.0037,  ...,  0.0169, -0.1206, -0.0093],
         [-0.0560,  0.1533,  0.0057,  ...,  0.0132, -0.0553,  0.0189]]),
 'linear_0/bias': tensor([0.2306, 0.4396, 0.9173,  ..., 0.0251, 0.0365, 0.0026]),
 'linear_1/weight': Parameter containing:
 tensor([[-2.0867e-01, -9.0580e-03, -7.9064e-02,  ..., -3.8933e-01,
          -2.3149e-01,  1.2245e-04],
         [-6.8817e-02, -9.7114e-03, -2.7091e-01,  ...,  3.0984e-01,
           4.9859e-02, -1.0120e-01],
         [ 6.1015e-02, -1.3112e-01,  1.4859e-01,  ..., -1.1946e-01,
          -4.3422e-04, -3.5835e-02],
         ...,
         [-1.0424e-01, -4.5111e-02, -1.2698e-01,  ...,  1.7925e-03,
          -2.2818e-02, -1.3687e-02],
         [ 6.9515e-03,  5.5411e-02, -1.6486e-01,  ..., -2.4610e-02,
          -9.0637e-02, -5.9484e-02],
         [ 1.6759e-01, -5.2186e-02,  1.2228e-01,  ..., -8.2117e-02,
           2.3281e-01,  1.9040e-02]], requires_grad=True),
 'linear_1/bias': Parameter containing:
 tensor([ 0.1804,  0.0657, -0.1591,  ...,  0.1390, -0.1133, -0.1478],
        requires_grad=True)}
spec.encoder.layer[23].ffn.linear_0.weight
spec.encoder.layer[23].ffn.linear_1.weight
spec.encoder.layer[23].self_attention.variables()
{'layer_norm/gamma': Parameter containing:
 tensor([0.2409, 0.1891, 0.2404,  ..., 0.2633, 0.1928, 0.2037],
        requires_grad=True),
 'layer_norm/beta': Parameter containing:
 tensor([ 0.0092,  0.0236, -0.0083,  ...,  0.0375, -0.0285, -0.0491],
        requires_grad=True),
 'linear_0/weight': tensor([[-0.0907, -0.0764, -0.0753,  ..., -0.0006,  0.1084,  0.0259],
         [-0.0533,  0.0091, -0.1013,  ...,  0.0983, -0.0244, -0.0434],
         [ 0.0098, -0.0523,  0.0925,  ..., -0.0099, -0.1413,  0.0025],
         ...,
         [ 0.0308,  0.1389, -0.1291,  ...,  0.0480, -0.0029, -0.0894],
         [-0.0406, -0.0820, -0.0813,  ..., -0.1231, -0.0129, -0.0129],
         [ 0.0095, -0.0152, -0.0544,  ..., -0.0166,  0.0488,  0.1888]]),
 'linear_0/bias': tensor([ 0.2270, -0.2744, -0.0573,  ..., -0.0604,  0.0101,  0.1317]),
 'linear_1/weight': Parameter containing:
 tensor([[ 0.0974,  0.0543,  0.2027,  ..., -0.0518,  0.0414, -0.0470],
         [ 0.0141,  0.0283, -0.0423,  ..., -0.0170,  0.1034,  0.0482],
         [ 0.0790,  0.0086,  0.1517,  ..., -0.0540, -0.1393, -0.1048],
         ...,
         [-0.0663, -0.2147, -0.0363,  ...,  0.0936, -0.0872, -0.0266],
         [ 0.1390, -0.0073, -0.0192,  ..., -0.0420,  0.0405, -0.0361],
         [-0.1155, -0.0485, -0.0278,  ...,  0.0161,  0.0210, -0.0447]],
        requires_grad=True),
 'linear_1/bias': Parameter containing:
 tensor([-0.0210,  0.0981,  0.0365,  ...,  0.0051,  0.0293, -0.0277],
        requires_grad=True)}
spec.encoder.lm_head.weight

@ebraraktas
Contributor Author

@homink I think it is related to the Wav2Vec2Model::is_quantizable implementation. You need to remove the variable_name.find("conv") == std::string::npos condition.
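A sketch of the suggested change (free functions with an illustrative base check, not the exact CTranslate2 method):

#include <string>

// Before: conv weights were excluded from quantization, so they never got a
// weight_scale and the int8 GPU path threw for wav2vec2.
bool is_quantizable_before(const std::string& variable_name) {
  return variable_name.find("weight") != std::string::npos
         && variable_name.find("conv") == std::string::npos;  // condition to remove
}

// After: without the "conv" exclusion, conv weights are quantized like the
// other weights (and, per the commit above, converted back to float on CUDA).
bool is_quantizable_after(const std::string& variable_name) {
  return variable_name.find("weight") != std::string::npos;
}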

@homink
Contributor

homink commented Aug 8, 2024

Removing that condition did the trick. Thanks! @ebraraktas

@homink
Contributor

homink commented Aug 12, 2024

@minhthuc2502 @vince62s @nguyendc-systran Could you review this PR and merge it? The current Conv1D doesn't support groups, while PyTorch's does. I guess this was not considered because the Whisper model doesn't need it. However, groups are a key component of depthwise convolution, and other speech recognition models (including the Wav2Vec2 model families) use it. This PR will make more speech recognition models available in the CTranslate2 framework. I am going to open a PR that makes Wav2Vec2 more efficient in terms of both speed and memory, and Wav2Vec2-BERT is under review on my end.

@minhthuc2502
Collaborator

Thank you for your PR, @ebraraktas. It looks good to me.

minhthuc2502 merged commit 1000086 into OpenNMT:master on Aug 13, 2024
13 checks passed
ebraraktas deleted the feat/grouped-conv1d branch on August 23, 2024, 15:57