
feat: grouped conv1d #1749

Merged: 3 commits merged into OpenNMT:master on Aug 13, 2024

Conversation

ebraraktas
Contributor

Implements grouped conv1d. This was requested by @homink.
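For context, here is a minimal reference sketch of what a grouped 1D convolution computes (illustrative C++ only, not the CTranslate2 kernel; the function name and the plain std::vector layout are assumptions). With groups equal to the number of input channels it reduces to a depthwise convolution.

#include <cstddef>
#include <vector>

// Naive grouped conv1d: stride 1, no padding, no bias.
// input : [in_channels][time]
// weight: [out_channels][in_channels / groups][kernel_size]
// Each output channel only sees the input channels of its own group.
std::vector<std::vector<float>> grouped_conv1d(
    const std::vector<std::vector<float>>& input,
    const std::vector<std::vector<std::vector<float>>>& weight,
    std::size_t groups) {
  const std::size_t in_channels = input.size();
  const std::size_t out_channels = weight.size();
  const std::size_t kernel_size = weight[0][0].size();
  const std::size_t out_time = input[0].size() - kernel_size + 1;
  const std::size_t in_per_group = in_channels / groups;
  const std::size_t out_per_group = out_channels / groups;

  std::vector<std::vector<float>> output(out_channels,
                                         std::vector<float>(out_time, 0.f));
  for (std::size_t oc = 0; oc < out_channels; ++oc) {
    const std::size_t g = oc / out_per_group;  // group of this output channel
    for (std::size_t t = 0; t < out_time; ++t) {
      float acc = 0.f;
      for (std::size_t ic = 0; ic < in_per_group; ++ic)
        for (std::size_t k = 0; k < kernel_size; ++k)
          acc += input[g * in_per_group + ic][t + k] * weight[oc][ic][k];
      output[oc][t] = acc;
    }
  }
  return output;
}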

@homink
Contributor

homink commented Aug 7, 2024

@ebraraktas Thanks for your PR. Great job! I built this locally and tested it. I now see improved computation in the wav2vec2 model and will open a PR soon. But I found one thing to share here. On GPU with int8 quantization, Whisper works as it did before, but wav2vec2 with groups greater than 1 hits the thrown exception you anticipated in your previous PR. The workaround you suggested there, shown below, does work:

import os

from ctranslate2.specs import model_spec  # assumed import; the class lives in the converter's spec module

class Conv1DSpec(model_spec.LayerSpec):
    def __init__(self):
        self.weight = None
        # When CTRANSLATE2_CONV1D_NO_QUANTIZE=1, weight_scale is omitted so the conv weight stays unquantized.
        if os.getenv("CTRANSLATE2_CONV1D_NO_QUANTIZE", "0") != "1":
            self.weight_scale = model_spec.OPTIONAL
        self.bias = model_spec.OPTIONAL

I would prefer int8 quantization in Conv1D, but an unquantized Conv1D would be OK for wav2vec2. I am still curious why the Whisper model doesn't need such an unquantized Conv1D. I think both models use the same Conv1D, except that wav2vec2 uses groups greater than 1. Any comments?

@ebraraktas
Contributor Author

ebraraktas commented Aug 7, 2024

I don't expect any conversion errors on this branch. Can you share your conversion script? @homink

@homink
Contributor

homink commented Aug 7, 2024

@ebraraktas I didn't hit any conversion errors with int8 quantization for either ASR model; both conversions completed without errors, and both converted models ran recognition as expected on CPU. But wav2vec2 hit that thrown exception during recognition on GPU. With an unquantized Conv1D, wav2vec2 worked for both conversion and recognition. I have been using the following command for conversion:

ct2-transformers-converter --model $src_dir --output_dir $tgt_dir --quantization int8

@ebraraktas
Contributor Author

ebraraktas commented Aug 7, 2024

@homink I see. This commit must have fixed that case, too: it converts conv weights to float if the device is CUDA (see these lines). If you have a chance to debug, it would be great if you could check whether that code path is hit. Otherwise, can you check whether the conv weights have the conv substring in their names?
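To make the described fallback concrete, here is a hedged sketch (the enum and helper names are hypothetical; the real logic is in the linked lines of models.cc):

#include <string>

// Hypothetical helper mirroring the behavior described above: conv weights
// are 3D, so on CUDA they are kept in float instead of being quantized to int8.
enum class Device { CPU, CUDA };
enum class DataType { FLOAT32, INT8 };

DataType select_weight_dtype(const std::string& name,
                             DataType requested_dtype,
                             Device device) {
  const bool is_conv = name.find("conv") != std::string::npos;
  if (is_conv && device == Device::CUDA)
    return DataType::FLOAT32;  // fall back to float for conv weights on GPU
  return requested_dtype;
}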

@homink
Contributor

homink commented Aug 7, 2024

I printed out name by adding a std::cout statement:

          if (is_quantizable(name)) {
            auto variable_weight_dtype = weight_dtype;
            // For conv layer, we need to reshape to ensure dtype as its weights are 3D.
            std::cout << "Name: " << name << std::endl;
            auto is_conv = name.find("conv") != std::string::npos;

The wav2vec2 model:

Name: encoder/fp_projection/weight
Name: encoder/layer_0/ffn/linear_0/weight
Name: encoder/layer_0/ffn/linear_1/weight
Name: encoder/layer_0/self_attention/linear_0/weight
Name: encoder/layer_0/self_attention/linear_1/weight
Name: encoder/layer_10/ffn/linear_0/weight
Name: encoder/layer_10/ffn/linear_1/weight
Name: encoder/layer_10/self_attention/linear_0/weight
Name: encoder/layer_10/self_attention/linear_1/weight
Name: encoder/layer_11/ffn/linear_0/weight
Name: encoder/layer_11/ffn/linear_1/weight
Name: encoder/layer_11/self_attention/linear_0/weight
Name: encoder/layer_11/self_attention/linear_1/weight
Name: encoder/layer_12/ffn/linear_0/weight
Name: encoder/layer_12/ffn/linear_1/weight
Name: encoder/layer_12/self_attention/linear_0/weight
Name: encoder/layer_12/self_attention/linear_1/weight
Name: encoder/layer_13/ffn/linear_0/weight
Name: encoder/layer_13/ffn/linear_1/weight
Name: encoder/layer_13/self_attention/linear_0/weight
Name: encoder/layer_13/self_attention/linear_1/weight
Name: encoder/layer_14/ffn/linear_0/weight
Name: encoder/layer_14/ffn/linear_1/weight
Name: encoder/layer_14/self_attention/linear_0/weight
Name: encoder/layer_14/self_attention/linear_1/weight
Name: encoder/layer_15/ffn/linear_0/weight
Name: encoder/layer_15/ffn/linear_1/weight
Name: encoder/layer_15/self_attention/linear_0/weight
Name: encoder/layer_15/self_attention/linear_1/weight
Name: encoder/layer_16/ffn/linear_0/weight
Name: encoder/layer_16/ffn/linear_1/weight
Name: encoder/layer_16/self_attention/linear_0/weight
Name: encoder/layer_16/self_attention/linear_1/weight
Name: encoder/layer_17/ffn/linear_0/weight
Name: encoder/layer_17/ffn/linear_1/weight
Name: encoder/layer_17/self_attention/linear_0/weight
Name: encoder/layer_17/self_attention/linear_1/weight
Name: encoder/layer_18/ffn/linear_0/weight
Name: encoder/layer_18/ffn/linear_1/weight
Name: encoder/layer_18/self_attention/linear_0/weight
Name: encoder/layer_18/self_attention/linear_1/weight
Name: encoder/layer_19/ffn/linear_0/weight
Name: encoder/layer_19/ffn/linear_1/weight
Name: encoder/layer_19/self_attention/linear_0/weight
Name: encoder/layer_19/self_attention/linear_1/weight
Name: encoder/layer_1/ffn/linear_0/weight
Name: encoder/layer_1/ffn/linear_1/weight
Name: encoder/layer_1/self_attention/linear_0/weight
Name: encoder/layer_1/self_attention/linear_1/weight
Name: encoder/layer_20/ffn/linear_0/weight
Name: encoder/layer_20/ffn/linear_1/weight
Name: encoder/layer_20/self_attention/linear_0/weight
Name: encoder/layer_20/self_attention/linear_1/weight
Name: encoder/layer_21/ffn/linear_0/weight
Name: encoder/layer_21/ffn/linear_1/weight
Name: encoder/layer_21/self_attention/linear_0/weight
Name: encoder/layer_21/self_attention/linear_1/weight
Name: encoder/layer_22/ffn/linear_0/weight
Name: encoder/layer_22/ffn/linear_1/weight
Name: encoder/layer_22/self_attention/linear_0/weight
Name: encoder/layer_22/self_attention/linear_1/weight
Name: encoder/layer_23/ffn/linear_0/weight
Name: encoder/layer_23/ffn/linear_1/weight
Name: encoder/layer_23/self_attention/linear_0/weight
Name: encoder/layer_23/self_attention/linear_1/weight
Name: encoder/layer_2/ffn/linear_0/weight
Name: encoder/layer_2/ffn/linear_1/weight
Name: encoder/layer_2/self_attention/linear_0/weight
Name: encoder/layer_2/self_attention/linear_1/weight
Name: encoder/layer_3/ffn/linear_0/weight
Name: encoder/layer_3/ffn/linear_1/weight
Name: encoder/layer_3/self_attention/linear_0/weight
Name: encoder/layer_3/self_attention/linear_1/weight
Name: encoder/layer_4/ffn/linear_0/weight
Name: encoder/layer_4/ffn/linear_1/weight
Name: encoder/layer_4/self_attention/linear_0/weight
Name: encoder/layer_4/self_attention/linear_1/weight
Name: encoder/layer_5/ffn/linear_0/weight
Name: encoder/layer_5/ffn/linear_1/weight
Name: encoder/layer_5/self_attention/linear_0/weight
Name: encoder/layer_5/self_attention/linear_1/weight
Name: encoder/layer_6/ffn/linear_0/weight
Name: encoder/layer_6/ffn/linear_1/weight
Name: encoder/layer_6/self_attention/linear_0/weight
Name: encoder/layer_6/self_attention/linear_1/weight
Name: encoder/layer_7/ffn/linear_0/weight
Name: encoder/layer_7/ffn/linear_1/weight
Name: encoder/layer_7/self_attention/linear_0/weight
Name: encoder/layer_7/self_attention/linear_1/weight
Name: encoder/layer_8/ffn/linear_0/weight
Name: encoder/layer_8/ffn/linear_1/weight
Name: encoder/layer_8/self_attention/linear_0/weight
Name: encoder/layer_8/self_attention/linear_1/weight
Name: encoder/layer_9/ffn/linear_0/weight
Name: encoder/layer_9/ffn/linear_1/weight
Name: encoder/layer_9/self_attention/linear_0/weight
Name: encoder/layer_9/self_attention/linear_1/weight
Name: encoder/lm_head/weight

The whisper model:

Name: decoder/embeddings/weight
Name: decoder/layer_0/attention/linear_0/weight
Name: decoder/layer_0/attention/linear_1/weight
Name: decoder/layer_0/attention/linear_2/weight
Name: decoder/layer_0/ffn/linear_0/weight
Name: decoder/layer_0/ffn/linear_1/weight
Name: decoder/layer_0/self_attention/linear_0/weight
Name: decoder/layer_0/self_attention/linear_1/weight
Name: decoder/layer_1/attention/linear_0/weight
Name: decoder/layer_1/attention/linear_1/weight
Name: decoder/layer_1/attention/linear_2/weight
Name: decoder/layer_1/ffn/linear_0/weight
Name: decoder/layer_1/ffn/linear_1/weight
Name: decoder/layer_1/self_attention/linear_0/weight
Name: decoder/layer_1/self_attention/linear_1/weight
Name: decoder/layer_2/attention/linear_0/weight
Name: decoder/layer_2/attention/linear_1/weight
Name: decoder/layer_2/attention/linear_2/weight
Name: decoder/layer_2/ffn/linear_0/weight
Name: decoder/layer_2/ffn/linear_1/weight
Name: decoder/layer_2/self_attention/linear_0/weight
Name: decoder/layer_2/self_attention/linear_1/weight
Name: decoder/layer_3/attention/linear_0/weight
Name: decoder/layer_3/attention/linear_1/weight
Name: decoder/layer_3/attention/linear_2/weight
Name: decoder/layer_3/ffn/linear_0/weight
Name: decoder/layer_3/ffn/linear_1/weight
Name: decoder/layer_3/self_attention/linear_0/weight
Name: decoder/layer_3/self_attention/linear_1/weight
Name: encoder/conv1/weight
Name: encoder/conv2/weight
Name: encoder/layer_0/ffn/linear_0/weight
Name: encoder/layer_0/ffn/linear_1/weight
Name: encoder/layer_0/self_attention/linear_0/weight
Name: encoder/layer_0/self_attention/linear_1/weight
Name: encoder/layer_1/ffn/linear_0/weight
Name: encoder/layer_1/ffn/linear_1/weight
Name: encoder/layer_1/self_attention/linear_0/weight
Name: encoder/layer_1/self_attention/linear_1/weight
Name: encoder/layer_2/ffn/linear_0/weight
Name: encoder/layer_2/ffn/linear_1/weight
Name: encoder/layer_2/self_attention/linear_0/weight
Name: encoder/layer_2/self_attention/linear_1/weight
Name: encoder/layer_3/ffn/linear_0/weight
Name: encoder/layer_3/ffn/linear_1/weight
Name: encoder/layer_3/self_attention/linear_0/weight
Name: encoder/layer_3/self_attention/linear_1/weight

The wav2vec2 output doesn't show any name containing 'conv'. Here are the spec class variables used for the wav2vec2 conversion in Python. I am not sure why names such as feat_layer or pos_conv_embed are not printed. It's also unclear why the Whisper model only prints up to layer_3 rather than layer_23, where 0-23 is the number of transformer blocks in the model architecture. @ebraraktas, any comments?

spec.encoder.feat_layer0.conv.weight
spec.encoder.feat_layer[0].conv.weight
spec.encoder.feat_layer[5].conv.weight
spec.encoder.fp_projection.weight
spec.encoder.pos_conv_embed.conv.weight
spec.encoder.layer[0].ffn.linear_0.weight
spec.encoder.layer[0].ffn.linear_1.weight
spec.encoder.layer[0].self_attention.variables()
{'layer_norm/gamma': Parameter containing:
 tensor([ 0.2802,  0.2733, -0.0574,  ..., -0.0397,  0.1188, -0.1281],
        requires_grad=True),
 'layer_norm/beta': Parameter containing:
 tensor([ 0.0501,  0.0644, -0.0103,  ...,  0.0174, -0.0076, -0.0087],
        requires_grad=True),
 'linear_0/weight': tensor([[-0.0561, -0.0965,  0.0279,  ...,  0.0462, -0.0082,  0.0460],
         [-0.0152,  0.0842, -0.0056,  ..., -0.0384, -0.0604, -0.0390],
         [ 0.0260,  0.0472, -0.0101,  ..., -0.0438, -0.0861, -0.0391],
         ...,
         [ 0.0087, -0.0871, -0.0189,  ..., -0.0321, -0.1284, -0.0326],
         [ 0.1075,  0.2023,  0.0037,  ...,  0.0169, -0.1206, -0.0093],
         [-0.0560,  0.1533,  0.0057,  ...,  0.0132, -0.0553,  0.0189]]),
 'linear_0/bias': tensor([0.2306, 0.4396, 0.9173,  ..., 0.0251, 0.0365, 0.0026]),
 'linear_1/weight': Parameter containing:
 tensor([[-2.0867e-01, -9.0580e-03, -7.9064e-02,  ..., -3.8933e-01,
          -2.3149e-01,  1.2245e-04],
         [-6.8817e-02, -9.7114e-03, -2.7091e-01,  ...,  3.0984e-01,
           4.9859e-02, -1.0120e-01],
         [ 6.1015e-02, -1.3112e-01,  1.4859e-01,  ..., -1.1946e-01,
          -4.3422e-04, -3.5835e-02],
         ...,
         [-1.0424e-01, -4.5111e-02, -1.2698e-01,  ...,  1.7925e-03,
          -2.2818e-02, -1.3687e-02],
         [ 6.9515e-03,  5.5411e-02, -1.6486e-01,  ..., -2.4610e-02,
          -9.0637e-02, -5.9484e-02],
         [ 1.6759e-01, -5.2186e-02,  1.2228e-01,  ..., -8.2117e-02,
           2.3281e-01,  1.9040e-02]], requires_grad=True),
 'linear_1/bias': Parameter containing:
 tensor([ 0.1804,  0.0657, -0.1591,  ...,  0.1390, -0.1133, -0.1478],
        requires_grad=True)}
spec.encoder.layer[23].ffn.linear_0.weight
spec.encoder.layer[23].ffn.linear_1.weight
spec.encoder.layer[23].self_attention.variables()
{'layer_norm/gamma': Parameter containing:
 tensor([0.2409, 0.1891, 0.2404,  ..., 0.2633, 0.1928, 0.2037],
        requires_grad=True),
 'layer_norm/beta': Parameter containing:
 tensor([ 0.0092,  0.0236, -0.0083,  ...,  0.0375, -0.0285, -0.0491],
        requires_grad=True),
 'linear_0/weight': tensor([[-0.0907, -0.0764, -0.0753,  ..., -0.0006,  0.1084,  0.0259],
         [-0.0533,  0.0091, -0.1013,  ...,  0.0983, -0.0244, -0.0434],
         [ 0.0098, -0.0523,  0.0925,  ..., -0.0099, -0.1413,  0.0025],
         ...,
         [ 0.0308,  0.1389, -0.1291,  ...,  0.0480, -0.0029, -0.0894],
         [-0.0406, -0.0820, -0.0813,  ..., -0.1231, -0.0129, -0.0129],
         [ 0.0095, -0.0152, -0.0544,  ..., -0.0166,  0.0488,  0.1888]]),
 'linear_0/bias': tensor([ 0.2270, -0.2744, -0.0573,  ..., -0.0604,  0.0101,  0.1317]),
 'linear_1/weight': Parameter containing:
 tensor([[ 0.0974,  0.0543,  0.2027,  ..., -0.0518,  0.0414, -0.0470],
         [ 0.0141,  0.0283, -0.0423,  ..., -0.0170,  0.1034,  0.0482],
         [ 0.0790,  0.0086,  0.1517,  ..., -0.0540, -0.1393, -0.1048],
         ...,
         [-0.0663, -0.2147, -0.0363,  ...,  0.0936, -0.0872, -0.0266],
         [ 0.1390, -0.0073, -0.0192,  ..., -0.0420,  0.0405, -0.0361],
         [-0.1155, -0.0485, -0.0278,  ...,  0.0161,  0.0210, -0.0447]],
        requires_grad=True),
 'linear_1/bias': Parameter containing:
 tensor([-0.0210,  0.0981,  0.0365,  ...,  0.0051,  0.0293, -0.0277],
        requires_grad=True)}
spec.encoder.lm_head.weight

@ebraraktas
Contributor Author

@homink I think it is related to the Wav2Vec2Model::is_quantizable implementation. You need to remove the variable_name.find("conv") == std::string::npos condition.
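A sketch of the suggested change (free functions with an illustrative base check, not the exact CTranslate2 method):

#include <string>

// Before: conv weights were excluded from quantization, so they never got a
// weight_scale and the int8 GPU path threw for wav2vec2.
bool is_quantizable_before(const std::string& variable_name) {
  return variable_name.find("weight") != std::string::npos
         && variable_name.find("conv") == std::string::npos;  // condition to remove
}

// After: without the "conv" exclusion, conv weights are quantized like the
// other weights (and, per the commit above, converted back to float on CUDA).
bool is_quantizable_after(const std::string& variable_name) {
  return variable_name.find("weight") != std::string::npos;
}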

@homink
Contributor

homink commented Aug 8, 2024

Removing that condition did the trick. Thanks! @ebraraktas

@homink
Contributor

homink commented Aug 12, 2024

@minhthuc2502 @vince62s @nguyendc-systran Could you review this PR and merge it? The current Conv1D doesn't support groups, while PyTorch's does. I guess this was not considered because the Whisper model doesn't need it. However, groups are a key component of depthwise convolution, and other speech recognition models (including the Wav2Vec2 model families) use it. This PR will make more speech recognition models available in the CTranslate2 framework. I am going to open a PR that makes Wav2Vec2 more efficient in terms of both speed and memory, and Wav2Vec2-BERT is under review on my end.

@minhthuc2502
Collaborator

Thank you for your PR, @ebraraktas. It looks good to me.

minhthuc2502 merged commit 1000086 into OpenNMT:master on Aug 13, 2024
13 checks passed
ebraraktas deleted the feat/grouped-conv1d branch on August 23, 2024, 15:57