Pulse · NVIDIA/TransformerEngine · GitHub

July 4, 2024 – July 11, 2024

Overview

17 Active pull requests

9 Active issues

10 Pull requests merged by 7 people

Add cuDNN sliding window and set_deterministic_algorithm
#992 merged Jul 10, 2024
Reduce CUDA driver calls when choosing transpose kernels
#1002 merged Jul 10, 2024
[PyTorch] Prototype for operation-based API
#707 merged Jul 9, 2024
[TE/JAX] Remove tuple wrapper of singleton in HLO lowering return
#1000 merged Jul 9, 2024
Add test for building without support for any DL frameworks
#974 merged Jul 9, 2024
Support individual framework builds for python<=3.7
#997 merged Jul 8, 2024
Parallel build with limited resource
#987 merged Jul 8, 2024
[Paddle] Fix forward and backward logic of te.Linear(parallel_mode='column') to adapt DiT of PaddleMIX
#963 merged Jul 8, 2024
[PyTorch] Remove implicit padding and unpadding in GroupedLinear
#984 merged Jul 8, 2024
[MoE][Pytorch]Fix size mismatch error in fp8 transpose.
#988 merged Jul 5, 2024

7 Pull requests opened by 6 people

Use 2hd layout for context parallelism
#993 opened Jul 7, 2024
Add efficient cross entropy by cuda kernel.
#995 opened Jul 8, 2024
Optimize multi-tensor cast-transpose kernel
#998 opened Jul 8, 2024
Simplify logic for launching CI
#1001 opened Jul 9, 2024
[JAX] Sharding Utils
#1003 opened Jul 9, 2024
DGRAD_RS UB overlap Bug fixes
#1004 opened Jul 10, 2024
[JAX] Allow enabling partial custom calls through the environment variable
#1007 opened Jul 10, 2024

5 Issues closed by 5 people

Attention mask type must be padding or padding_causal for qkv_format=thd!
#1005 closed Jul 11, 2024
Training core dump in megatron-lm with tp-comm-overlap.
#985 closed Jul 8, 2024
initialize_ub failed: transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported
#991 closed Jul 8, 2024
ERROR: Failed building wheel for transformer-engine
#700 closed Jul 5, 2024
Hang when training with MPI with --tp-comm-overlap turned on
#989 closed Jul 5, 2024

4 Issues opened by 4 people

Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.
#1008 opened Jul 10, 2024
can fp8 be used with pipeline parallel?
#1006 opened Jul 10, 2024
Why requires_grad attribute of weight from offloading will set to False ?
#996 opened Jul 8, 2024
tp_overlap init failed when tp_size != world_size
#994 opened Jul 8, 2024

12 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

[TE/JAX] Prototype for New XLA Custom Calls with FFI
#946 commented on Jul 9, 2024 • 8 new comments
[MoE][Common/PyTorch] Add permutation
#936 commented on Jul 10, 2024 • 6 new comments
[PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence
#986 commented on Jul 11, 2024 • 2 new comments
[C/PyTorch] Refactor and move userbuffers into TE/common
#760 commented on Jul 11, 2024 • 1 new comment
[Paddle] Add deterministic option in DotProductAttention
#956 commented on Jul 11, 2024 • 1 new comment
Calling backward(retain_graph=True) multiple times with TE Layer does not work
#990 commented on Jul 5, 2024 • 0 new comments
PyTorch 2.2.0 NVFuser deprecation is incompatible with TransformerEngine.
#666 commented on Jul 11, 2024 • 0 new comments
question for building wheel for transformer-engine
#516 commented on Jul 11, 2024 • 0 new comments
[PyTorch] How to restore fp8 amp training from checkpoint
#982 commented on Jul 11, 2024 • 0 new comments
[UB] Adding support for multinode nvlink
#815 commented on Jul 5, 2024 • 0 new comments
[Draft] Zero fwd and bwd results for THD+CP
#920 commented on Jul 11, 2024 • 0 new comments
[pre-commit.ci] pre-commit suggestions
#979 commented on Jul 8, 2024 • 0 new comments