Releases: pytorch/pytorch
PyTorch 2.3.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
Torch.compile:
- Remove runtime dependency on JAX/XLA when importing `torch._dynamo` (#124634)
- Hide `Plan failed with a cudnnException` warning (#125790)
- Fix CUDA memory leak (#124238) (#120756)
Distributed:
- Fix `format_utils` executable, which was causing it to run as a no-op (#123407)
- Fix regression with `device_mesh` in 2.3.0 during initialization causing memory spikes (#124780)
- Fix crash of FSDP + DTensor with `ShardingStrategy.SHARD_GRAD_OP` (#123617)
- Fix failure with distributed checkpointing + FSDP if at least 1 forward/backward pass has not been run (#121544) (#127069)
- Fix error with distributed checkpointing + FSDP with `use_orig_params = False` and activation checkpointing (#124698) (#126935)
- Fix `set_model_state_dict` errors on compiled module with non-persistent buffer with distributed checkpointing (#125336) (#125337)
MPS:
- Fix data corruption when copying large (>4GiB) tensors (#124635)
- Fix `Tensor.abs()` for complex (#125662)
Packaging:
- Fix UTF-8 encoding of Windows `.pyi` files (#124932)
- Fix `import torch` failure when wheel is installed for a single user on Windows (#125684)
- Fix compatibility with torchdata 0.7.1 (#122616)
- Fix aarch64 docker publishing to https://ghcr.io (#125617)
- Fix performance regression on aarch64 Linux (pytorch/builder#1803)
Other:
- Fix DeepSpeed transformer extension build on ROCm (#121030)
- Fix kernel crash on `tensor.dtype.to_complex()` after ~100 calls in ipython kernel (#125154)
Release tracker #125425 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.3: User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed
PyTorch 2.3 Release notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing users to migrate their own Triton kernels from eager without experiencing performance regressions or graph breaks. In addition, Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions, which has been validated on training runs for 100B parameter models.
This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Stable | Beta | Prototype | Performance Improvements
---|---|---|---
 | User-defined Triton kernels in torch.compile | torch.export adds new API to specify dynamic_shapes | Weight-Only-Quantization introduced into Inductor CPU backend
 | Tensor parallelism within PyTorch Distributed | Asynchronous checkpoint generation |
 | Support for semi-structured sparsity | |
*To see a full list of public feature submissions click here.
Tracked Regressions
torch.compile on macOS is considered unstable for 2.3, as there are known cases where it will hang (#124497)
torch.compile imports many unrelated packages when it is invoked (#123954)
This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.
torch.compile is not supported on Python 3.12 (#120233)
PyTorch support for Python 3.12 in general is considered experimental. Please use a Python version between 3.8 and 3.11 instead. This issue has existed since PyTorch 2.2.
Backwards Incompatible Changes
Change default torch_function behavior to be disabled when torch_dispatch is defined (#120632)
Defining a subclass with a torch_dispatch entry will now automatically set torch_function to be disabled. This aligns better with all the use cases we've observed for subclasses. The main change in behavior is that the result of the torch_dispatch handler will no longer go through the default torch_function handler, which previously wrapped it into the current subclass. In particular, this allows your subclass to return a plain Tensor or another subclass from any op.
The original behavior can be recovered by adding the following to your Tensor subclass:
@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
return super().__torch_function__(func, types, args, kwargs)
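For illustration, here is a minimal sketch of an "is-a" subclass under the new default; the class name and the unwrapping step are our own, not part of the API:
import torch

class MyTensor(torch.Tensor):
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Unwrap subclass arguments to plain Tensors and run the op; under
        # the new default the result is not re-wrapped into MyTensor
        unwrapped = [a.as_subclass(torch.Tensor) if isinstance(a, cls) else a
                     for a in args]
        return func(*unwrapped, **(kwargs or {}))

t = torch.randn(2).as_subclass(MyTensor)
print(type(t + 1))  # <class 'torch.Tensor'>: no automatic re-wrapping in 2.3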
ProcessGroupNCCL removes multi-device-per-thread support from C++ level (#119099, #118674)
- Python level support was removed in 2.2.
- To simplify ProcessGroupNCCL’s code, we remove support for multiple CUDA devices per thread. To our knowledge, this is not an active use case, but it adds a large burden to our codebase. If you are relying on this, there is no workaround other than rewriting your PyTorch program to use one device per process or one device per thread (multiple threads per process are still supported).
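A common one-device-per-process setup looks like this sketch, assuming a torchrun launch (which sets LOCAL_RANK and the rendezvous environment variables):
import os
import torch
import torch.distributed as dist

# Bind this process to a single CUDA device before creating the process group
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")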
Removes no_dist and coordinator_rank from public DCP APIs (#121317)
As part of an overall effort to simplify our public-facing APIs for Distributed Checkpointing, we've decided to deprecate usage of the `coordinator_rank` and `no_dist` parameters under `torch.distributed.checkpoint`. In our opinion, these parameters can lead to confusion around the intended effect during API usage and have limited value to begin with. One concrete example is #118337, where there is ambiguity in which process group is referenced by the coordinator rank. In the case of the `no_dist` parameter, we consider this an implementation detail which should be hidden from the user. Starting in this release, `no_dist` is inferred from the initialized state of the process group: collectives are used if a process group is initialized, and skipped otherwise.
2.2 | 2.3 |
# Version 2.2.2
import torch.distributed.checkpoint as dcp
dcp.save(
state_dict={"model": model.state_dict()},
checkpoint_id="path_to_model_checkpoint",
no_dist=True,
coordinator_rank=0
)
# ...
dcp.load(
state_dict={"model": model.state_dict()},
checkpoint_id="path_to_model_checkpoint",
no_dist=True,
coordinator_rank=0
) |
# Version 2.3
# no_dist is assumed from pg state, and rank 0 is always the coordinator.
import torch.distributed.checkpoint as dcp
dcp.save(
state_dict={"model": model.state_dict()},
checkpoint_id="path_to_model_checkpoint"
)
# ...
dcp.load(
state_dict={"model": model.state_dict()},
checkpoint_id="path_to_model_checkpoint"
) |
Remove deprecated tp_mesh_dim arg (#121432)
Starting from PyTorch 2.3, the `parallelize_module` API only accepts a DeviceMesh (the `tp_mesh_dim` argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use `mesh_nd["tp"]` to obtain a 1-D DeviceMesh for tensor parallelism.
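For illustration, a minimal sketch of slicing a 2-D mesh for tensor parallelism; the mesh sizes, dim names, and toy model are our own, and we assume a launcher such as torchrun has set up the distributed environment across 8 ranks:
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

# 2-D mesh: 2-way data parallel x 4-way tensor parallel (sizes are illustrative)
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh_2d["tp"]  # 1-D sub-mesh, replacing the old tp_mesh_dim argument

model = torch.nn.Linear(16, 16)
parallelize_module(model, tp_mesh, ColwiseParallel())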
torch.export
- Users must pass in an nn.Module to torch.export.export. The reason is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function, such as how we guarantee that every call_function node has an nn_module_stack populated, and how we offer ways to access the state_dict/parameters/buffers of the exported program. We'd like torch.export to offer strong invariants; the value proposition of export is that you can trade flexibility for stronger guarantees about your model. (#117528)
- Removed constraints in favor of dynamic_shapes (#117573, #117917, #117916, #120981, #120979)
- ExportedProgram is no longer a callable. Instead users will need to use .module() to call the ExportedProgram. This is to prevent users from treating ExportedPrograms as torch.nn.Modules as we do not plan to support all features that torch.nn.Modules have, like hooks. Instead users can create a proper torch.nn.Module through exported_program.module() and use that as a callable. (#120019, #118425, #119105)
- Remove equality_constraints from ExportedProgram as it is not used or useful anymore. Dimensions with equal constraints will now have the same symbol. (#116979)
- Remove torch._export.export in favor of torch.export.export (#119095)
- Remove CallSpec (#117671)
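Putting these changes together, a minimal sketch of the 2.3 export flow (the toy module is our own):
import torch
from torch.export import export, Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

x = torch.randn(4, 3)
batch = Dim("batch")  # symbolic dimension, replacing the removed constraints API
ep = export(M(), (x,), dynamic_shapes={"x": {0: batch}})
# ExportedProgram is no longer callable; materialize an nn.Module first
out = ep.module()(x)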
Enable fold_quantize by default in PT2 Export Quantization (#118701, #118605, #119425, #117797)
Previously, the PT2 Export Quantization flow did not generate quantized weights by default, but instead used fp32 weights in the quantized model, in this pattern: `fp32 weight -> q -> dq -> linear`. `fold_quantize=True` is now the default after `convert_pt2e`, producing a graph with quantized weights in this pattern: `int8 weight -> dq -> linear`, and users will see a reduction in the model size.
2.2 | 2.3 |
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model) |
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False) |
Remove deprecated torch.jit.quantized APIs (#118406)
All functions and classes under `torch.jit.quantized` will now raise an error if called/instantiated. This API has long been deprecated in favor of `torch.ao.nn.quantized`.
2.2 | 2.3 |
# torch.jit.quantized APIs
torch.jit.quantized.quantize_rnn_cell_modules
torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules
torch.jit.quantized.QuantizedLinear
torch.jit.QuantizedLinearFP16
torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell |
# Corresponding torch.ao.quantization APIs
torch.ao.nn.quantized.dynamic.RNNCell
torch.ao.quantization.quantize_dynamic APIs
torch.ao.nn.quantized.dynamic.Linear
torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM |
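For reference, dynamic quantization now goes through the torch.ao path; a minimal sketch with a toy LSTM of our own:
import torch

model = torch.nn.LSTM(8, 8)
# torch.ao.quantization.quantize_dynamic replaces the removed
# torch.jit.quantized.* entry points
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.LSTM, torch.nn.Linear}, dtype=torch.qint8
)
out, _ = qmodel(torch.randn(5, 1, 8))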
...
PyTorch 2.2.2 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (#115969)
- Fix mkldnn performance issue on Windows platform (#121618)
- Fix `RuntimeError: cannot create std::vector larger than max_size()` in `torch.nn.functional.conv1d` on non-contiguous cpu inputs by patching OneDNN (pytorch/builder#1742) (pytorch/builder#1744)
- Add support for `torch.distributed.fsdp.StateDictType.FULL_STATE_DICT` when using `torch.distributed.fsdp.FullyShardedDataParallel` with the `device_mesh` argument (#120837)
- Fix `make triton` command on release branch for users building the release branch from source (#121169)
- Ensure gcc>=9.0 for build from source and cpp_extensions (#120126)
- Fix cxx11-abi build in release branch (pytorch/builder#1709)
- Fix building from source on Windows with MSVC 14.38 (VS 2022) (#122120)
Release tracker #120999 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.2.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
- Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
- Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
- Fix silent correctness in DTensor `_to_copy` operation (#116426)
- Fix properly assigning `param.grad_fn` for next forward (#116792)
- Ensure gradients clear out pending `AsyncCollectiveTensor` in FSDP Extension (#116122)
- Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
- Fix FSDP `AssertionError` on tensor subclass when setting `sync_module_states=True` (#117336)
- Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
- Fix OOM when returning an AsyncCollectiveTensor by forcing `_gather_state_dict()` to be synchronous with respect to the main stream (#118197) (#119716)
- Fix Windows runtime `torch.distributed.DistNetworkError`: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
- Update supported python versions in package description (#119743)
- Fix SIGILL crash during `import torch` on CPUs that do not support SSE4.1 (#116623)
- Fix DCP RuntimeError in `get_state_dict` and `set_state_dict` (#119573)
- Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
- Fix numerical error with `mixedmm` on NVIDIA V100 (#118591)
- Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
- Fix compile `DTensor.from_local` in trace_rule lookup (#119659)
- Improve torch.compile integration with CUDA 11.8 binaries (#119750)
Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.2: FlashAttention-v2, AOTInductor
PyTorch 2.2 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention
via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.
This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.
Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.
This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Summary:
- `scaled_dot_product_attention` (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
- PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-python server-side environments.
- `torch.distributed` supports a new abstraction for initializing and representing ProcessGroups, called device_mesh.
- PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
- A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
- Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
- `torch.ao.quantization` now offers a prototype `torch.export`-based flow.
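As a quick illustration, SDPA selects the fastest available backend automatically; a minimal sketch assuming a CUDA device with FlashAttention-2 support:
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), half precision on GPU
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)  # may pick the FlashAttention-2 kernel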
Stable | Beta | Prototype | Performance Improvements
---|---|---|---
 | FlashAttentionV2 backend for scaled dot product attention | PT 2 Quantization | Inductor optimizations
 | AOTInductor | Scaled dot product attention support for jagged layout NestedTensors | aarch64-linux optimizations (AWS Graviton)
 | TORCH_LOGS | |
 | torch.distributed.device_mesh | |
 | torch.compile + Optimizers | |
*To see a full list of public 2.2 - 1.12 feature submissions click here.
Tracked Regressions
Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)
We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.
Poor numeric stability of loss when training with FSDP + DTensor (#117471)
We have observed that the loss randomly flatlines while training with FSDP + DTensor in some instances.
Backwards Incompatible Changes
Building PyTorch from source now requires GCC 9.4 or newer (#112858)
GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.
Updated flash attention kernel in scaled_dot_product_attention to use Flash Attention v2 (#105602)
Previously, the v1 Flash Attention kernel had a Windows implementation, so a user on Windows could explicitly force the flash attention kernel via the `sdp_kernel` context manager with only flash attention enabled. In 2.2, if the `sdp_kernel` context manager must be used on Windows, enable the memory-efficient or math kernel instead.
# 2.1: forcing the flash attention backend worked on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
# 2.2: don't force flash attention if using sdp_kernel on Windows; enable a supported backend instead
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)
In PyTorch 2.1 and before, users could use ParallelStyles like `PairwiseParallel` and specify input/output layouts with functions like `make_input_replicate_1d` or `make_output_replicate_1d`, and there were default values for `_prepare_input` and `_prepare_output`. The UX of Tensor Parallel looked like:
from torch.distributed.tensor.parallel.style import (
ColwiseParallel,
make_input_replicate_1d,
make_input_reshard_replicate,
make_input_shard_1d,
make_input_shard_1d_last_dim,
make_sharded_output_tensor,
make_output_replicate_1d,
make_output_reshard_tensor,
make_output_shard_1d,
make_output_tensor,
PairwiseParallel,
parallelize_module,
)
from torch.distributed.tensor import DeviceMesh
module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...
Starting from PyTorch 2.2, we simplified parallel styles to only contain `ColwiseParallel` and `RowwiseParallel`, because the other ParallelStyles can be composed from these two. We also deleted the input/output functions and started using `input_layouts` and `output_layouts` as kwargs instead to specify the sharding layout of the input/output tensors. Finally, we added the `PrepareModuleInput`/`PrepareModuleOutput` styles, which have no default arguments for layouts, so users must specify them explicitly and think about the sharding layouts.
from torch.distributed.tensor.parallel.style import (
ColwiseParallel,
PrepareModuleInput,
RowwiseParallel,
parallelize_module,
)
from torch.distributed._tensor import init_device_mesh, Replicate, Shard
module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
module,
device_mesh,
{
"fqn": PrepareModuleInput(
input_layouts=Shard(0),
desired_input_layouts=Replicate()
),
"fqn.net1": ColwiseParallel(),
"fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
}
)
...
UntypedStorage.resize_ now uses the original device instead of the current device context (#113386)
Before this PR, `UntypedStorage.resize_` would move data to the current CUDA device index (given by `torch.cuda.current_device()`). Now, `UntypedStorage.resize_()` keeps the data on the same device index that it was on before, regardless of the current device index.
2.1 | 2.2 |
---|---|
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...: a = torch.zeros(0, device='cuda:1')
...: print(a.device)
...: a = a.untyped_storage().resize_(0)
...: print(a.device)
cuda:1
cuda:0 |
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...: a = torch.zeros(0, device='cuda:1')
...: print(a.device)
...: a = a.untyped_storage().resize_(0)
...: print(a.device)
cuda:1
cuda:1 |
Wrapping a function with set_grad_enabled will consume its global mutation (#113359)
This bc-breaking change fixes some unexpected behavior when set_grad_enabled
is used as a decorator.
2.1 | 2.2 |
---|---|
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
... def inner_func(x):
...     return x.sin()
>>> torch.is_grad_enabled()
False |
>>> import torch
>>> @torch.set_grad_enabled(False)  # the decorator no longer mutates the global grad mode
... def inner_func(x):
...     return x.sin()
>>> torch.is_grad_enabled()
True |
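A short sketch of the 2.2 decorator behavior (the toy function is our own):
import torch

@torch.set_grad_enabled(False)   # grad mode is disabled only inside inner_func
def inner_func(x):
    return x.sin()

assert torch.is_grad_enabled()   # the global grad mode is no longer mutated
y = inner_func(torch.ones(1, requires_grad=True))
assert not y.requires_grad       # gradients were disabled during the call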
Deprecated verbose parameter in LRScheduler constructors (#111302)
As part of our decision to move towards a consolidated logging system, we are deprecating the `verbose` flag in `LRScheduler`.
If you would like to print the learning rate during execution, please use `get_last_lr()`.
2.1 | 2.2 |
---|---|
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
train(...)
val_loss = validate(...)
# Note that step should be called after validate()
scheduler.step(val_loss) |
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
train(...)
val_loss = validate(...)
# Note that step should be called after validate()
scheduler.step(val_loss)
print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}") </td... |
PyTorch 2.1.2 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix crashes for float16 empty tensors (#115183)
- Fix MPS memory corruption when working with tensor slices (#114838)
- Fix crashes during Conv backward pass on MPS devices (#113398)
- Partially fix nn.Linear behavior on AArch64 platform (#110150)
- Fix cosine_similarity for tensors of different sizes (#109363)
- Package missing headers needed for extension development (#113055)
- Improve error handling of `torch.set_num_threads` (#113684)
- Fix profiling traces generation (#113763)
The cherry-pick tracker #113962 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.1.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Remove spurious warning in comparison ops (#112170)
- Fix segfault in foreach_* operations when input list length does not match (#112349)
- Fix cuda driver API to load the appropriate .so file (#112996)
- Fix missing CUDA initialization when calling FFT operations (#110326)
- Ignore beartype==0.16.0 within the onnx package as it is incompatible (#111861)
- Fix the behavior of torch.new_zeros in onnx due to TorchScript behavior change (#111694)
- Remove unnecessary slow code in `torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict` (#111687)
- Add `planner` argument to `torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict` (#111393)
- Continue if a param does not exist in sharded load in `torch.distributed.FSDP` (#109116)
- Fix handling of non-contiguous bias_mask in `torch.nn.functional.scaled_dot_product_attention` (#112673)
- Fix the meta device implementation for `nn.functional.scaled_dot_product_attention` (#110893)
- Fix copy from mps to cpu device when storage_offset is non-zero (#109557)
- Fix segfault in `torch.sparse.mm` for non-contiguous inputs (#111742)
- Fix circular import between Dynamo and einops (#110575)
- Verify flatbuffer module fields are initialized for mobile deserialization (#109794)
Release tracker #110961 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing
PyTorch 2.1 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.
In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.
Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.
This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Summary:
- `torch.compile` now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes.
- `torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
- `torch.compile` can now compile NumPy operations via translating them into PyTorch-equivalent operations.
- `torch.compile` now includes improved support for Python 3.11.
- New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
- `torch.export`, a sound full-graph capture mechanism, is introduced as a prototype feature, together with `torch.export`-based quantization.
- `torch.sparse` now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.
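For instance, a function written against NumPy can be compiled directly; a minimal sketch (the function is our own toy example):
import numpy as np
import torch

@torch.compile
def numpy_fn(x):
    # NumPy ops are translated into PyTorch-equivalent operations
    return np.sin(x) + np.cos(x)

out = numpy_fn(np.random.randn(8))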
Stable | Beta | Prototype | Performance Improvements
---|---|---|---
 | Automatic Dynamic Shapes | torch.export() | AVX512 kernel support
 | torch.distributed.checkpoint | torch.export-based Quantization | CPU optimizations for scaled-dot-product-attention (SDPA)
 | torch.compile + NumPy | semi-structured (2:4) sparsity | CPU optimizations for bfloat16
 | torch.compile + Python 3.11 | cpp_wrapper for torchinductor |
 | torch.compile + autograd.Function | |
 | third-party device integration: PrivateUse1 | |
*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Backwards Incompatible Changes
Building PyTorch from source now requires C++ 17 (#100557)
The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension.
Disable torch.autograd.{backward, grad} for complex scalar output (#92753)
Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call `.real` on your complex outputs before calling `.grad()` or `.backward()`.
Example
def fn(x):
return (x * 0.5j).sum()
x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)
2.0.1
o.backward()
2.1
o.real.backward()
Update non-reentrant checkpoint to allow nesting and support autograd.grad (#90105)
As part of a larger refactor to `torch.utils.checkpoint`, we changed the interaction between activation checkpointing and `retain_graph=True`. Previously in 2.0.1, recomputed activations were kept alive if `retain_graph=True`; in PyTorch 2.1, the non-reentrant implementation clears recomputed tensors on backward immediately upon unpack, even if `retain_graph=True`. This has the following additional implications: (1) Accessing `ctx.saved_tensors` twice in the same backward will now raise an error. (2) Accessing `_saved_tensors` multiple times will silently recompute forward multiple times.
2.1
class Func(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
out = x.exp()
ctx.save_for_backward(out)
return out
@staticmethod
def backward(ctx, x):
out, = ctx.saved_tensors
# Calling ctx.saved_tensors again will raise in 2.1
out, = ctx.saved_tensors
return out
a = torch.tensor(1., requires_grad=True)
def fn(x):
return Func.apply(x)
out = torch.utils.checkpoint.checkpoint(fn, a, use_reentrant=False)
def fn2(x):
return x.exp()
out = torch.utils.checkpoint.checkpoint(fn2, a, use_reentrant=False)
out.grad_fn._saved_result
# Calling _saved_result will trigger another unpack, and lead to forward being
# recomputed again
out.grad_fn._saved_result
Only sync buffers when broadcast_buffers is True (#100729)
- In PyTorch 2.0.1 and previous releases, when users use DistributedDataParallel (DDP), all buffers were synced automatically even if users set the flag `broadcast_buffers` to `False`:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is synchronized across all devices.
...
- Starting with PyTorch 2.1, if users specify the flag `broadcast_buffers` to be `False`, we don’t sync the buffer across devices:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices
...
Remove store barrier after PG init (#99937)
- In PyTorch 2.0.1 and previous releases, after we initialize the PG, we always call a store-based barrier:
from torch.distributed.distributed_c10d import init_process_group
init_process_group(...) # Will call _store_based_barrier in the end.
...
- Starting with PyTorch 2.1, after we initialize the PG, the environment variable `TORCH_DIST_INIT_BARRIER` controls whether we call the store-based barrier or not:
from torch.distributed.distributed_c10d import init_process_group
import os
os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior
init_process_group(...) # Will call _store_based_barrier in the end.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"
init_process_group(...) # Will not call _store_based_barrier in the end.
...
Disallow non-bool masks in torch.masked_{select, scatter, fill} (#96112, #97999, #96594)
This finishes the deprecation cycle for non-bool masks. These functions now require the `dtype` of the mask to be `torch.bool`.
>>> # 2.0.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.)
torch.masked_select(inp, mask)
>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine
>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine
>>> # 2.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
RuntimeError: masked_select: expected BoolTensor for mask
>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine
>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine
Fix the result of torch.unique to make it consistent with NumPy when dim is specified (#101693)
The `dim` argument was clarified and its behavior aligned to match NumPy's: it now signifies which sub-tensor to treat as a unit when determining uniqueness. See the documentation for more details: https://pytorch.org/docs/stable/generated/torch.unique.html
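For example, with dim=0 each row is treated as a single unit when determining uniqueness:
import torch

x = torch.tensor([[1, 2], [1, 2], [3, 4]])
# Rows are compared as whole sub-tensors, matching np.unique(x, axis=0)
print(torch.unique(x, dim=0))
# tensor([[1, 2],
#         [3, 4]])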
Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)
Prior to this change, for torch.nn.functional.grid_sample(mode='nearest')
the forward 2D kernel used std::nearbyint
whereas the forward 3D kernel used std::round
in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both ...
PyTorch 2.0.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix `_canonical_mask` throws warning when bool masks passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
- Fix Embedding bag max_norm=-1 causes leaf Variable that requires grad is being used in an in-place operation #95980
- Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None. #96804
- Can’t convert float to int when the input is a scalar np.ndarray. #97696
- Revisit torch._six.string_classes removal #97863
- Fix module backward pre-hooks to actually update gradient #97983
- Fix load_sharded_optimizer_state_dict error on multi node #98063
- Warn once for TypedStorage deprecation #98777
- cuDNN V8 API, Fix incorrect use of emplace in the benchmark cache #97838
Torch.compile:
- Add support for Modules with custom getitem method to torch.compile #97932
- Fix improper guards on list variables #97862
- Fix Sequential nn module with duplicated submodule #98880
Distributed:
- Fix distributed_c10d's handling of custom backends #95072
- Fix MPI backend not properly initialized #98545
NN_frontend:
- Update Multi-Head Attention's doc string #97046
- Fix incorrect behavior of `is_causal` parameter for torch.nn.TransformerEncoderLayer.forward #97214
- Fix error for SDPA on sm86 and sm89 hardware #99105
- Fix nn.MultiheadAttention mask handling #98375
DataLoader:
- Fix regression for pin_memory recursion when operating on bytes #97737
- Fix collation logic #97789
- Fix potentially backwards incompatible change with DataLoader and is_shardable Datapipes #97287
MPS:
- Fix LayerNorm crash when input is in float16 #96208
- Add support for cumsum on int64 input #96733
- Fix issue with setting BatchNorm to non-trainable #98794
Functorch:
- Fix Segmentation Fault for vmaped function accessing BatchedTensor.data #97237
- Fix index_select support when dim is negative #97916
- Improve docs for autograd.Function support #98020
- Fix Exception thrown when running Migration guide example for jacrev #97746
Releng:
- Fix Convolutions for CUDA-11.8 wheel builds #99451
- Fix Import torchaudio + torch.compile crashes on exit #96231
- Linux aarch64 wheels are missing the mkldnn+acl backend support - pytorch/builder@54931c2
- Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - pytorch/builder#1375
- Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
- PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
- Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb - pytorch/builder#1370
Torch.optim:
- Fix fused AdamW causes NaN loss #95847
- Fix Fused AdamW has worse loss than Apex and unfused AdamW for fp16/AMP #98620
The release tracker should contain all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever
PyTorch 2.0 Release notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.
This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inferences, performance and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.
Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.
This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.
Summary:
- torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
- As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized cuda libraries such as cublas.
- Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SPDA). The API is integrated with torch.compile() and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
- Metal Performance Shaders (MPS) backend provides GPU accelerated PyTorch training on Mac platforms with added support for Top 60 most used ops, bringing coverage to over 300 operators.
- Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to the previous releases, including improvements for Resnet50 and Bert.
- New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
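As a quick illustration of the main API, a minimal sketch with a toy model of our own:
import torch

model = torch.nn.Linear(4, 4)
# torch.compile wraps the model and returns a compiled callable; eager code
# continues to work unchanged since the feature is fully additive
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(2, 4))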
Stable | Beta | Prototype | Platform Changes
---|---|---|---
Accelerated PT 2 Transformers | torch.compile | DTensor | CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6)
 | PyTorch MPS Backend | TensorParallel | Python 3.8 (deprecating Python 3.7)
 | Scaled dot product attention | 2D Parallel | AWS Graviton3
 | Functorch | Torch.compile (dynamic=True) |
 | Dispatchable Collectives | |
 | torch.set_default_device and torch.device as context manager | |
 | X86 quantization backend | |
 | GNN inference and training performance | |
*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here
Backwards Incompatible Changes
Drop support for Python versions <= 3.7 (#93155)
Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.
Drop support for CUDA 10 (#89582)
This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.
Gradients are now set to None instead of zeros by default in torch.optim.*.zero_grad() and torch.nn.Module.zero_grad() (#92731)
This changes the default behavior of `zero_grad()` to set the grads to `None` instead of zero tensors. In other words, the `set_to_none` kwarg is now `True` by default instead of `False`. Setting grads to `None` reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling `zero_grad()`, as they will now be `None`. To revert to the old behavior, pass in `zero_grad(set_to_none=False)`.
1.13 | 2.0 |
---|---|
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
[0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
[1., 1.]]) |
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +:
'NoneType' and 'float' |
Update torch.tensor and nn.Parameter to serialize all their attributes (#88913)
Any attribute stored on `torch.Tensor` and `torch.nn.Parameter` will now be serialized. This aligns the serialization behavior of `torch.nn.Parameter`, `torch.Tensor`, and other tensor subclasses.
1.13 | 2.0 |
---|---|
# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'
# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'
# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
... pass
>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey |
# torch.Tensor behavior
a = torch.Tensor()
a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey
# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey
# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
... pass
>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey |
If you have an attribute that you don't want to be serialized, you should not store it as an attribute on the Tensor or Parameter; instead, it is recommended to use `torch.utils.weak.WeakTensorKeyDictionary`:
>>> from torch.utils import weak
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey
Algorithms {Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD}
default to faster foreach
implementation when on CUDA + differentiable=False
When applicable, this changes the default behavior of step()
and anything that ca...