Insights: NVIDIA/Megatron-LM
September 20, 2024 – September 27, 2024
Overview
- 0 Merged pull requests
- 3 Open pull requests
- 1 Closed issue
- 7 New issues
3 Pull requests opened by 3 people
- opt:opt ltor masks
  #1155 opened Sep 24, 2024
- Enabling UCC backend for PP communication
  #1157 opened Sep 24, 2024
- Expose cp_comm_type in ModelParallelConfig
  #1160 opened Sep 27, 2024
1 Issue closed by 1 person
- [BUG] Unable to Convert Mamba PP=1 TP=1 to PP>1 TP>1 Using convert.py
  #1153 closed Sep 23, 2024
7 Issues opened by 7 people
- [QUESTION] About all_reduce order while using CP
  #1162 opened Sep 27, 2024
- Why are not all SMs active when NCCL kernel and compute kernel overlap?[QUESTION]
  #1161 opened Sep 27, 2024
- [QUESTION] Do we really need to call np.arange every time we restart the task?
  #1159 opened Sep 26, 2024
- [BUG]TypeError: 'type' object is not subscriptable
  #1158 opened Sep 25, 2024
- [QUESTION] How to enable ZeRO 1/2/3 stages ?
  #1156 opened Sep 24, 2024
- [BUG] Some checkpoint shards don't save / hang on multi-node setups, since v0.7
  #1154 opened Sep 23, 2024
- [BUG] Loss difference when training with FP8 vs. BF16 MoE
  #1152 opened Sep 20, 2024
10 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [BUG] 'NoneType' object has no attribute 'shape' error raised when saving model state with the pretrain_gpt.py
  #1134 commented on Sep 23, 2024 • 0 new comments
- [BUG] Context parallel gives NCCL error
  #1151 commented on Sep 23, 2024 • 0 new comments
- [BUG] when use --use-mcore-models and --overlap-param-gather bug
  #950 commented on Sep 23, 2024 • 0 new comments
- [BUG]`examples/multimodal/combine_mistral_clip.sh` Vision model file mismatch.
  #949 commented on Sep 23, 2024 • 0 new comments
- [QUESTION] Validation loss & PPL keep going up
  #787 commented on Sep 23, 2024 • 0 new comments
- [BUG] GPTDataset._build_document_sample_shuffle_indices does not build the indices on non-root nodes when not using NFS
  #907 commented on Sep 24, 2024 • 0 new comments
- Distributed Mamba Training
  #944 commented on Sep 27, 2024 • 0 new comments
- [BUG]"Unexpected key(s) in state_dict" while loading Llama-megatron checkpoint.
  #1132 commented on Sep 27, 2024 • 0 new comments
- [BUG]Get an AtrributeError when trying to finetune llama3-8B model with multi nodes
  #937 commented on Sep 27, 2024 • 0 new comments
- [BUG] NCCL TIMEOUT ( maybe ALLREDUCE ? )
  #735 commented on Sep 27, 2024 • 0 new comments