Fast path detach()/alias() in FakeTensor #128281

Closed
ezyang opened this issue Jun 8, 2024 · 3 comments
Labels
actionable, high priority, module: dynamic shapes, module: fakeTensor, module: pt2-dispatcher, oncall: pt2, triaged

Comments


ezyang commented Jun 8, 2024

🐛 Describe the bug

We call detach()/alias() for a variety of administrative purposes, typically because we need to get a copy of the metadata of a tensor that won't be modified by subsequent metadata mutation. This is currently implemented quite inefficiently:

  File "/data/users/ezyang/b/pytorch/torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py", line 135, in <lambda>
    torch.Tensor, lambda t: t.detach(), updated_flat_args_subclasses_desugared
  File "/data/users/ezyang/b/pytorch/torch/utils/_stats.py", line 20, in wrapper           
    return fn(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1060, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)      
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1449, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1144, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1756, in _dispatch_impl
    r = func(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_ops.py", line 666, in __call__
    return self_._op(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_prims_common/wrappers.py", line 265, in _fn
    result = fn(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_decomp/decompositions.py", line 2109, in nop_decomposition
    return aten.alias(x)
  File "/data/users/ezyang/b/pytorch/torch/_ops.py", line 1060, in __call__
    return self_._op(*args, **(kwargs or {}))
  File "/data/users/ezyang/b/pytorch/torch/_meta_registrations.py", line 3658, in meta_alias
    return self.view(self.shape)

We should have a fast path for this that bypasses the view() call entirely.
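For a sense of what that could look like, here is a minimal sketch; the helper name `_fast_detach`, the use of `no_dispatch()`, and rebuilding the result from metadata are all illustrative assumptions, not the actual implementation:

```
import torch
from torch._subclasses.fake_tensor import FakeTensor
from torch.utils._mode_utils import no_dispatch

# Hypothetical fast path: build the detached FakeTensor directly from
# the input's metadata instead of dispatching detach -> alias -> view
# through FakeTensorMode.
def _fast_detach(fake_mode, t):
    with no_dispatch():
        # empty_strided on the meta device allocates no data; it only
        # reproduces the metadata (sizes/strides/dtype) detach() needs.
        meta_t = torch.empty_strided(
            t.size(), t.stride(), dtype=t.dtype, device="meta"
        )
    # NB: a real implementation would also need to preserve storage
    # aliasing and autograd metadata; this sketch ignores both.
    return FakeTensor(fake_mode, meta_t, t.fake_device)
```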

High priority for compile time improvements.

Versions

main

cc @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang @eellison

@zou3519 zou3519 added module: fakeTensor module: pt2-dispatcher PT2 dispatcher-related issues (e.g., aotdispatch, functionalization, faketensor, custom-op, labels Jun 10, 2024
@soulitzer soulitzer added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 11, 2024

zou3519 commented Jun 13, 2024

(screenshot: compile-time profile)

seems like this takes a long time


zou3519 commented Jul 8, 2024

Actionable: attempt the following approaches:

  1. The first step would be to stop going through so many hoops of dispatch (detach -> alias -> view) and see if that improves compilation time.
  2. The second step is to see if we can implement detach() by calling shallow_copy_and_detach directly on the FakeTensor, instead of going through the detach -> alias -> view dispatch (see the sketch below).
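As a rough Python-level illustration of step 2 (a sketch only: `shallow_copy_and_detach` itself lives on the C++ TensorImpl, and `no_dispatch()` stands in for however the dispatch bypass would actually be done):

```
import torch
from torch.utils._mode_utils import no_dispatch

def detach_via_shallow_copy(t):
    # With __torch_dispatch__ interception disabled, t.detach() goes
    # straight to the C++ kernel, which performs shallow_copy_and_detach
    # on the TensorImpl -- the detach -> alias -> view decomposition
    # never runs.
    with no_dispatch():
        return t.detach()
```

Whether the result keeps its FakeTensor identity when dispatch is bypassed like this is exactly the kind of detail a real change has to get right.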


bdhirsh commented Jul 26, 2024

A few findings. I used `TORCH_COMPILE_CPROFILE=1 python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM` as my benchmarking repro.

(1) I tried fast-pathing detach to avoid the decomps (only step 1 above, not step 2) by adding a fast path for FakeTensor.detach() that temporarily turns off the Python dispatcher (roughly as sketched below), and did not see much of an overall speedup.
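Roughly, that experiment looks like the following; that `torch._C._DisablePythonDispatcher` is the right guard here is my assumption about the internal context manager used, not something stated above:

```
import torch

def detach_no_py_dispatcher(t):
    # With the Python dispatcher off, the detach -> alias -> view
    # decomposition registered there is never consulted, so detach()
    # dispatches straight through the C++ paths.
    with torch._C._DisablePythonDispatcher():
        return t.detach()
```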

(2) Looking at the svg, you can see that the majority (~2/3) of the calls to `TensorBase::detach()` are flowing through `snapshot_fake()`.

(3) I updated `snapshot_fake()` to directly call the same `fast_detach()` (with no decomps), and I see a much larger speedup.
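Schematically, the change looks like this (not the actual diff; the definition and signature of `fast_detach` here are placeholders):

```
from torch.utils._mode_utils import no_dispatch

# Placeholder for the decomposition-free helper mentioned above.
def fast_detach(t):
    with no_dispatch():
        return t.detach()

def snapshot_fake(val):
    # before: return val.detach(), which dispatched
    # detach -> alias -> view under FakeTensorMode
    return fast_detach(val)
```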

BertForMaskedLM is a bit weird because it has several graph breaks, so there are a few tiny graphs, and two large graphs. Looking at the largest graph, I see:

compile time before: 23.820s
compile time after: 19.959s

detach before:
(screenshot: detach profile before the change)

detach after:
(screenshot: detach profile after the change)

bdhirsh added a commit that referenced this issue Jul 26, 2024
Fixes #128281, see investigation at #128281 (comment).

benchmark:
```
python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM
```

time before:
```
TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435
```

time after:
```
TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898
```

[ghstack-poisoned]