
[inductor][cpu]transformers models static/dynamic quant performance/accuracy crash in 2024-06-17 nightly release #128933

Open
zxd1997066 opened this issue Jun 18, 2024 · 8 comments
Labels
module: dynamic shapes oncall: cpu inductor CPU Inductor issues for Intel team to triage oncall: pt2

Comments

@zxd1997066
Contributor

zxd1997066 commented Jun 18, 2024

🐛 Describe the bug

======================= export model ===============================
W0617 17:40:01.166444 140156037509568 torch/_export/__init__.py:95] +============================+
W0617 17:40:01.166611 140156037509568 torch/_export/__init__.py:96] |     !!!   WARNING   !!!    |
W0617 17:40:01.166671 140156037509568 torch/_export/__init__.py:97] +============================+
W0617 17:40:01.166718 140156037509568 torch/_export/__init__.py:98] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W0617 17:40:01.166769 140156037509568 torch/_export/__init__.py:99] Please switch to use torch.export instead.
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0] Error while creating guard:
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0] Name: ''
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Source: shape_env
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Create Function: SHAPE_ENV
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Guard Types: None
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Code List: None
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Object Weakref: None
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     Guarded Class Weakref: None
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0] Traceback (most recent call last):
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]   File "/workspace/pytorch/torch/_guards.py", line 259, in create
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     return self.create_fn(builder, self)
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]   File "/workspace/pytorch/torch/_dynamo/guards.py", line 1728, in SHAPE_ENV
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     guards = output_graph.shape_env.produce_guards(
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]   File "/workspace/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4167, in produce_guards
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]     raise ConstraintViolationError(
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0] torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
E0617 17:40:03.274544 140156037509568 torch/_guards.py:261] [0/0]   - Not all values of dim0 = L['input_ids'].size()[0] in the specified range satisfy the generated guard Ne(L['input_ids'].size()[0], 9223372036854775807).
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0] Created at:
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]   File "/workspace/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]     tracer = InstructionTranslator(
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]   File "/workspace/pytorch/torch/_dynamo/symbolic_convert.py", line 2371, in __init__
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]     output=OutputGraph(
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]   File "/workspace/pytorch/torch/_dynamo/output_graph.py", line 313, in __init__
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]     self.init_ambient_guards()
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]   File "/workspace/pytorch/torch/_dynamo/output_graph.py", line 452, in init_ambient_guards
E0617 17:40:03.275653 140156037509568 torch/_guards.py:263] [0/0]     self.guards.add(ShapeEnvSource().make_guard(GuardBuilder.SHAPE_ENV))
Traceback (most recent call last):
  File "./transformers/examples/pytorch/text-classification/run_glue.py", line 652, in <module>
    main()
  File "./transformers/examples/pytorch/text-classification/run_glue.py", line 590, in main
    metrics = trainer.evaluate(eval_dataset=eval_dataset)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3109, in evaluate
    start_time, output = eval_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3232, in evaluation_loop
    else self.accelerator.prepare_model(model, evaluation_mode=True)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1449, in prepare_model
    exported_model = capture_pre_autograd_graph(
  File "/workspace/pytorch/torch/_export/__init__.py", line 170, in capture_pre_autograd_graph
    m = torch._dynamo.export(
  File "/workspace/pytorch/torch/_dynamo/eval_frame.py", line 1425, in inner
    raise constraint_violation_error
  File "/workspace/pytorch/torch/_dynamo/eval_frame.py", line 1379, in inner
    result_traced = opt_f(*args, **kwargs)
  File "/workspace/pytorch/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/pytorch/torch/nn/modules/module.py", line 1575, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/pytorch/torch/_dynamo/eval_frame.py", line 433, in _fn
    return fn(*args, **kwargs)
  File "/workspace/pytorch/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/pytorch/torch/nn/modules/module.py", line 1575, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/pytorch/torch/_dynamo/convert_frame.py", line 1116, in __call__
    return self._torchdynamo_orig_callable(
  File "/workspace/pytorch/torch/_dynamo/convert_frame.py", line 472, in __call__
    return _compile(
  File "/workspace/pytorch/torch/_utils_internal.py", line 84, in wrapper_function
    return StrobelightCompileTimeProfiler.profile_compile_time(
  File "/workspace/pytorch/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/workspace/pytorch/torch/_dynamo/convert_frame.py", line 817, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/workspace/pytorch/torch/_dynamo/utils.py", line 231, in time_wrapper
    r = func(*args, **kwargs)
  File "/workspace/pytorch/torch/_dynamo/convert_frame.py", line 726, in compile_inner
    check_fn = CheckFunctionManager(
  File "/workspace/pytorch/torch/_dynamo/guards.py", line 2141, in __init__
    guard.create(builder)
  File "/workspace/pytorch/torch/_guards.py", line 259, in create
    return self.create_fn(builder, self)
  File "/workspace/pytorch/torch/_dynamo/guards.py", line 1728, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/workspace/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4167, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['input_ids'].size()[0] in the specified range satisfy the generated guard Ne(L['input_ids'].size()[0], 9223372036854775807).

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/pytorch/numa_launcher.py", line 805, in <module>
    main()
  File "/workspace/pytorch/numa_launcher.py", line 800, in main
    launcher.launch(args)
  File "/workspace/pytorch/numa_launcher.py", line 481, in launch
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd_s)
subprocess.CalledProcessError: Command 'numactl -C 0-31 -m 0 /opt/conda/bin/python -u ./transformers/examples/pytorch/text-classification/run_glue.py --model_name_or_path albert-base-v1 --task_name MRPC --do_eval --max_seq_length 16 --learning_rate 2e-5 --overwrite_output_dir --output_dir /tmp/tmp_huggingface/ --torch_compile --torch_compile_quant ptq_dynamic --report_to=none --per_device_eval_batch_size 64' returned non-zero exit status 1.

Versions

SW info

SW Branch Target commit Refer commit
Pytorch nightly 8410bf5 963d450
Torchbench chuanqiw/inductor_quant ee35d764 ee35d764
torchaudio nightly b829e93 1980f8a
torchtext nightly b0ebddc b0ebddc
torchvision nightly d23a6e1 d23a6e1
torchdata nightly 11bb5b8 11bb5b8
dynamo_benchmarks nightly fea73cb fea73cb

Repro:

git clone -b test https://github.com/chuanqi129/transformers && cd transformers && \
    python setup.py bdist_wheel && pip install --force-reinstall dist/*.whl && cd ..
git clone -b test https://github.com/zxd1997066/accelerate.git && cd accelerate && \
    python setup.py bdist_wheel && pip install --no-deps --force-reinstall dist/*.whl && cd ..
pip install -r transformers/examples/pytorch/text-classification/requirements.txt
wget https://github.com/chuanqi129/inductor-tools/raw/xiangdong/accuracy/scripts/modelbench/quant/numa_launcher.py
wget https://github.com/chuanqi129/inductor-tools/raw/xiangdong/accuracy/scripts/modelbench/quant/hf_quant_test.sh
#change model in https://github.com/chuanqi129/inductor-tools/blob/xiangdong/accuracy/scripts/modelbench/quant/hf_quant_test.sh#L88
#static quantization
bash hf_quant_test.sh key torch_compile_quant_static
#dynamic quantization
bash hf_quant_test.sh key torch_compile_quant

Suspected guilty commit: 2229884
text-classification+albert-base-v1-static-quant-accuracy-crash_guilty_commit.log

cc @ezyang @anijain2305 @chauhang @penguinwu @WeizhuoZhang-intel @chuanqi129

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 18, 2024

Hi @ezyang, could you kindly take a look? I have prepared a script to reproduce this issue: https://gist.github.com/leslie-fang-intel/696041fa7e7352ecb985b04a5e1188de. It starts to fail as of 2229884.

In case it's needed, here is the transformers version I used: pip install "git+https://github.com/huggingface/transformers@243e186efbf7fb93328dd6b34927a4e8c8f24395"

@leslie-fang-intel leslie-fang-intel added the oncall: cpu inductor CPU Inductor issues for Intel team to triage label Jun 18, 2024
@zxd1997066
Contributor Author

vision_maskrcnn and detectron2_fcos_r_50_fpn (AMP/float32, single/multiple thread, static/dynamic shape, default/cpp wrapper) hit TypeError: Invalid NaN comparison (https://gist.github.com/zxd1997066/5f1fc727ced62f4ae82df88ea232f863), and they share the same guilty commit, 2229884.
bisect log:
torchbench-vision_maskrcnn-inference-float32-static-default-multiple-accuracy-crash_guilty_commit.log

Repro:
inductor_single_run.sh

bash inductor_single_run.sh single/multiple inference accuracy/performance torchbench vision_maskrcnn/detectron2_fcos_r_50_fpn amp/float32 first dynamic/static default/cpp

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 23, 2024

Running this test with TORCH_LOGS="+dynamic", we can see the guard difference before and after this commit:

  • Previously, we could statically determine that s0 != 9223372036854775807.
  • After this commit, we have to add a guard for it, and that guard causes the failure.
[screenshot comparing the guards emitted before and after the commit]

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 23, 2024

Looking further into why we can't statically determine s0 != 9223372036854775807 after this commit:

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 23, 2024

I am not sure what the correct fix is. If b.lower is larger than sys.maxsize - 1 and a.upper is int_oo, can we say the two values are not equal in SymPyValueRangeAnalysis? cc @ezyang @lezcano
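To make the question concrete, here is a minimal sketch (the names and logic are illustrative, not PyTorch's actual SymPyValueRangeAnalysis) of when a range analysis can statically prove two values unequal:

```python
import sys

INT_OO = float("inf")  # illustrative stand-in for PyTorch's int_oo sentinel

def statically_not_equal(a_lower, a_upper, b_lower, b_upper):
    # Hypothetical helper: two symbolic values can only be proven unequal
    # without a runtime guard when their value ranges are disjoint.
    return a_upper < b_lower or b_upper < a_lower

# s0 in [2, int_oo] vs. the constant sys.maxsize: the ranges overlap at
# sys.maxsize, so s0 != sys.maxsize cannot be proven and a guard is emitted.
print(statically_not_equal(2, INT_OO, sys.maxsize, sys.maxsize))           # False

# If s0's upper bound were sys.maxsize - 1, the ranges would be disjoint
# and the inequality would hold statically, with no guard needed.
print(statically_not_equal(2, sys.maxsize - 1, sys.maxsize, sys.maxsize))  # True
```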

@lezcano
Collaborator

lezcano commented Jun 24, 2024

But that guard sounds reasonable to me, no? It's asking that s0 should be representable in int64.
I'm not sure how the points above are related to the failure, and it's difficult to know without more context.

Looking at the error in #128933 (comment), it might suggest that our safe_mul is not as safe as it should be. In particular, it might be doing something like 0 * sympy.oo and returning NaN. In that case, we should probably treat 0 * sympy.oo (and likewise 0 * -sympy.oo) as 0, since this is equivalent to the limit lim_{x->inf} 0 * x = 0.

@ezyang this shows a larger issue that's lurking with the inf treatment: Our bounds are inclusive... unless one of the ends is oo, in which case they are not...
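The 0 * oo behavior described above can be sketched with plain floats (this is an illustrative toy, not PyTorch's actual safe_mul):

```python
import math

def safe_mul(a, b):
    # Toy sketch of the proposed fix: treat 0 * inf and 0 * -inf as 0,
    # following the limit convention lim_{x->inf} 0 * x = 0, instead of
    # producing NaN as plain float multiplication does.
    if a == 0 or b == 0:
        return 0
    return a * b

print(0 * math.inf)           # nan under plain multiplication
print(safe_mul(0, math.inf))  # 0 under the limit convention
```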

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Jun 24, 2024

But that guard sounds reasonable to me, no? It's asking that s0 should be representable in int64.
I'm not sure how the points above are related to the failure, and it's difficult to know without more context.

Yeah, any suggestions for how to further debug why the guard failed? I am just listing the difference before and after this commit; maybe there is another potential issue that fails the guard :(

---------- Update on why the newly added guard fails ------------

@ezyang
Contributor

ezyang commented Jun 24, 2024

This is sort of expected, but what we can probably do is make the constraint violation error more tolerant of this case.

The big question I had to answer in #127693 was what to do if there legitimately was different behavior when s0 == sys.maxsize. Previously, I simply assumed this couldn't happen, because who makes sys.maxsize-sized tensors? But with int_oo modeling, "just assuming" it doesn't happen is not so convenient. It's also not a big deal: you just get a guard testing that the int is not maxsize.

Except for the constraint stuff. The constraint violation logic says "if there is ANY guard, error out". But we can probably make it softer; e.g., a guard that the value is not maxsize shouldn't trigger it.
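A minimal sketch of the softening proposed above (a hypothetical helper, not the actual PyTorch implementation): before raising ConstraintViolationError, filter out guards that merely exclude sys.maxsize:

```python
import sys

def is_benign_guard(excluded_value):
    # Hypothetical filter: a guard of the form s0 != sys.maxsize only
    # excludes a size no real tensor can have, so it should not count
    # as a constraint violation.
    return excluded_value == sys.maxsize

# Each pair is (symbol, value the guard asserts the symbol is not equal to).
guards = [("s0", sys.maxsize), ("s0", 128)]
violations = [g for g in guards if not is_benign_guard(g[1])]
print(violations)  # only the real constraint (s0 != 128) remains
```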
