
Online quantization training gets acc 0, with warnings that model parameters do not exist #5801

Closed
marsbzp opened this issue Mar 28, 2022 · 29 comments


marsbzp commented Mar 28, 2022

[screenshots]

@littletomatodonkey (Collaborator)

Hi, could you please provide the specific training script or reproduction steps?


marsbzp commented Mar 28, 2022

You can reproduce it by loading the pretrained model with the configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec_enhanced_ctc_loss.yml config.

@littletomatodonkey (Collaborator)

Hi, thanks for the report. Please try this PR: #5806


marsbzp commented Mar 28, 2022

Don't you test before releasing? Has the accuracy of this online quantization ever been verified?


marsbzp commented Mar 28, 2022

Does it support multi-GPU training?


marsbzp commented Mar 28, 2022

The accuracy is back now, and multi-GPU training works.

@littletomatodonkey (Collaborator)

> Don't you test before releasing? Has the accuracy of this online quantization ever been verified?

The released quantized model was produced with this very quantization script, and the earlier logic was fine. PACT's internal logic was changed later, which broke loading by structured parameter names.


marsbzp commented Mar 29, 2022

[screenshot]

Looking at the earlier code: QAT first modifies the model structure (inserting quantization nodes, fusing BN, etc.) and only then loads the original model parameters — isn't that what causes the "parameter does not exist" errors? Loading a checkpoint produced by quantization training should be fine, so this shouldn't have much to do with PACT's internal logic.


marsbzp commented Mar 31, 2022

The Paddle-TensorRT inference results of the exported quantized model are incorrect.

@littletomatodonkey (Collaborator)

[screenshot]

> Looking at the earlier code: QAT first modifies the model structure (inserting quantization nodes, fusing BN, etc.) and only then loads the original model parameters — isn't that what causes the "parameter does not exist" errors?

The structured names used to be identical before and after node insertion, so loading worked.
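The mismatch being discussed can be illustrated with a minimal, hypothetical Python sketch (not PaddleOCR's actual loader, and the key names are modeled on the warning log later in this thread): a loader copies only the checkpoint keys that exist in the model and warns about the rest, so once QAT wrapping inserts extra sub-module levels into the structured names, nothing matches and training starts from random weights.

```python
# Hypothetical sketch of why structured-name loading breaks after QAT
# wraps layers: the wrapper inserts extra sub-module levels, so checkpoint
# keys like "conv1.weight" no longer exist in the wrapped model.

def load_pretrained(model_keys, ckpt):
    """Copy matching keys into the model, warn on the rest (ppocr-style)."""
    loaded, missing = {}, []
    for k, v in ckpt.items():
        if k in model_keys:
            loaded[k] = v
        else:
            missing.append(k)
            print(f"WARNING: The pretrained params {k} not in model")
    return loaded, missing

# Plain FP32 checkpoint: structured names before quantization.
fp32_ckpt = {"conv1.weight": 1, "conv1.bias": 2}

# After QAT inserts fake-quant nodes, the same conv is wrapped and its
# parameters are renamed (illustrative names only).
qat_model_keys = {
    "conv1._conv._layer.weight",
    "conv1._conv._layer.bias",
    "conv1._conv._layer._fake_quant_weight._scale",
}

loaded, missing = load_pretrained(qat_model_keys, fp32_ckpt)
print(loaded)   # {} -- nothing matched, so the model keeps random init
print(missing)  # every FP32 key is reported "not in model" -> acc stays 0
```

This is why the fix direction matters: either the quantizer must preserve the original structured names, or the pretrained weights must be loaded before the quantization nodes are inserted.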

@littletomatodonkey (Collaborator)

> The Paddle-TensorRT inference results of the exported quantized model are incorrect.

Are the native (non-TRT) inference results correct? Let's first confirm whether this is a TensorRT issue.


marsbzp commented Mar 31, 2022

You can reproduce it with this model: https://paddleocr.bj.bcebos.com/dygraph_v2.0/slim/ch_ppocr_mobile_v2.0_rec_slim_infer.tar — I tried both the detection model and the OCR recognition model, and neither can run inference in int8 mode. Paddle tensorrt_fp32 doesn't work either; only plain Paddle does. Please take a look.

[screenshot]


marsbzp commented Mar 31, 2022

> The Paddle-TensorRT inference results of the exported quantized model are incorrect.
>
> Are the native (non-TRT) inference results correct? Let's first confirm whether this is a TensorRT issue.

Native inference without TRT works, but the whole point of quantization is to use TRT for acceleration.

@littletomatodonkey (Collaborator)

OK, we'll follow up on this.


marsbzp commented Mar 31, 2022

I've filed several issues and you're the only one who replied. Given that offline quantization works, does a model from online quantization need a further conversion step before TRT inference?

[screenshot]

@littletomatodonkey (Collaborator)

> The Paddle-TensorRT inference results of the exported quantized model are incorrect.
>
> Are the native (non-TRT) inference results correct? Let's first confirm whether this is a TensorRT issue.
>
> Native inference without TRT works, but the whole point of quantization is to use TRT for acceleration.

The earlier quantized models were mainly intended for deployment with PaddleLite. On the server side we generally use the fp32 models directly, so this case indeed hadn't been tested before.

@littletomatodonkey (Collaborator)

> https://paddleocr.bj.bcebos.com/dygraph_v2.0/slim/ch_ppocr_mobile_v2.0_rec_slim_infer.tar

Could you provide both the native inference results and the TRT inference results?


marsbzp commented Mar 31, 2022

Just run the cpp demo with that model and you'll see the problem.


marsbzp commented Mar 31, 2022

I recently ran an online-quantized object detection model; I can show you the results.

paddle:
[screenshot]

trt_fp32: no results
[screenshot]

trt_int8: crashes outright
[screenshot]

@littletomatodonkey (Collaborator)

OK, I don't have a TRT environment at hand; I'll report this issue internally first.


marsbzp commented Apr 1, 2022

Found the cause: when exporting the QAT model, the softmax is dropped, which makes the results incorrect. Please help figure out what causes it.

[screenshot]

@littletomatodonkey (Collaborator)

Hi, I tried the ch_ppocr_mobile_v2.0_rec_slim_infer model above with TensorRT 7 and it works fine for me.

Specifically,

the trt+fp32 prediction script is:

d="ch_ppocr_mobile_v2.0_rec_slim_infer"
./build/ppocr rec \
    --rec_model_dir=${d} \
    --image_dir=../../doc/imgs_words/ch/word_1.jpg \
    --use_tensorrt="1" \
    --use_gpu="1" \
    --rec_batch_num="1" \
    --precision="fp32"

The trt+fp32 prediction result is:

韩国小馆	score: 0.990642

The trt+int8 prediction script is:

d="ch_ppocr_mobile_v2.0_rec_slim_infer"
./build/ppocr rec \
    --rec_model_dir=${d} \
    --image_dir=../../doc/imgs_words/ch/word_1.jpg \
    --use_tensorrt="1" \
    --use_gpu="1" \
    --rec_batch_num="1" \
    --precision="int8"

The trt+int8 prediction result is:

韩国小馆	score: 0.990642

Could you scan the QR code on the project homepage to join the WeChat group, so we can look at your model individually?


marsbzp commented Apr 1, 2022

Thanks for the reply. The official model also works for me with TensorRT 7, but my own exported model is missing the softmax op. Could you export a recognition model with the current code and check?

[screenshot]


marsbzp commented Apr 1, 2022

Found the cause: when exporting the quantized model, execution doesn't enter the if branch below, so the softmax is missing. After I commented out the if, the exported model produces results. Please track it down.

[screenshot]
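Why the missing softmax matters can be shown with a minimal pure-Python sketch (not PaddleOCR code; the logit values are made up): applying softmax to one timestep of CTC head output does not change the argmax character, but it is what turns raw logits into the probabilities that the decoder reports as confidence scores. If export drops the softmax, downstream post-processing that expects a probability distribution misbehaves.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one timestep of CTC logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical raw logits for one timestep (blank, 'a', 'b').
logits = [1.0, 4.0, 2.0]
probs = softmax(logits)

# The argmax is the same either way...
assert max(range(3), key=lambda i: logits[i]) == max(range(3), key=lambda i: probs[i])

# ...but only the softmax output sums to 1 and can serve as a confidence
# score; a raw logit is unbounded and breaks score thresholding.
print(sum(probs))  # 1.0 (up to float rounding)
print(probs[1])    # the usable confidence for the top class
```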

@littletomatodonkey (Collaborator)

You probably modified the code; running it directly on my side, the softmax is there.

[screenshot]


LDOUBLEV (Collaborator) commented Apr 8, 2022

> When exporting the quantized model, execution doesn't enter the if branch below, so the softmax is missing. After I commented out the if, the exported model produces results. Please track it down.

Fixed in: #5903

@yangy996

Won't this fix be merged into the release/v2.4 branch?

@justcodew

With the latest code, quantization-aware training of an SVTR recognition model runs into the same problem.

Eval script:

python tools/eval.py -c output/plate_rec_quant/config.yml -o Global.pretrained_model=output/plate_rec_quant/best_accuracy.pdparams Eval.dataset.data_dir=./plate_rec_data/convet_img_data Eval.dataset.label_file_list=[./plate_rec_data/convet_img_data/ppocr_test_list.txt] Eval.loader.num_workers=2

Error messages:

W0420 20:26:29.985214 213450 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2
W0420 20:26:29.990868 213450 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer.weight not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer._fake_quant_weight._scale not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer._fake_quant_input._scale not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer._fake_quant_input._state not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer._fake_quant_input._accum not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._layer._act_preprocess.alpha not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._ma_output_scale._scale not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._ma_output_scale._state not in model
[2023/04/20 20:26:36] ppocr WARNING: The pretrained params backbone.conv1._conv._ma_output_scale._accum not in model


[2023/04/20 20:27:05] ppocr INFO: metric eval ***************
[2023/04/20 20:27:05] ppocr INFO: acc:0.0
[2023/04/20 20:27:05] ppocr INFO: norm_edit_dis:0.000370501947975721
[2023/04/20 20:27:05] ppocr INFO: fps:861.3670144457965
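A way to diagnose warnings like the ones above is to map the QAT-wrapped checkpoint keys back to plain structured names and see which FP32 parameters they correspond to. The sketch below is purely hypothetical (the `._layer` infix is inferred from the warning log, not from any official API), and it is a diagnostic aid only; the real remedy is to apply the same quantizer to the model before loading the quantized checkpoint.

```python
# Hypothetical sketch: strip the sub-module level that QAT wrapping inserts
# (seen as "._layer" in the warning log) to recover the plain structured
# name of each real weight. Fake-quant scale/state/accum keys have no FP32
# counterpart and will still look foreign after stripping.

QAT_INFIXES = ("._layer",)  # wrapper level inserted around each conv

def strip_qat_infixes(key):
    """Remove known QAT wrapper infixes from a checkpoint key."""
    for infix in QAT_INFIXES:
        key = key.replace(infix, "")
    return key

ckpt_keys = [
    "backbone.conv1._conv._layer.weight",
    "backbone.conv1._conv._layer._fake_quant_weight._scale",
]

for k in ckpt_keys:
    print(f"{k} -> {strip_qat_infixes(k)}")
```

If the stripped names match the FP32 model's parameters, the checkpoint is a quantized one being loaded into an unquantized model — i.e., the eval script needs the quantization wrapper applied first.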

@justcodew

Exporting the quantized model raises no errors, and the eval accuracy is normal:
python deploy/slim/quantization/export_model.py -c output/plate_rec_quant/config.yml -o Global.checkpoints=./output/plate_rec_quant/best_accuracy Global.save_model_dir=./output/plate_rec_quant/infer Global.save_inference_dir=./output/plate_rec_quant/infer

[2023/04/20 20:40:16] ppocr INFO: metric eval ***************
[2023/04/20 20:40:16] ppocr INFO: acc:0.9594412819871275
[2023/04/20 20:40:16] ppocr INFO: norm_edit_dis:0.9928645333013004
[2023/04/20 20:40:16] ppocr INFO: fps:611.2247434852117
[2023/04/20 20:40:21] ppocr INFO: inference model is saved to ./output/plate_rec_quant/infer/inference
