NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3 #1147

Closed
mackmake opened this issue Feb 14, 2024 · 4 comments

Comments

@mackmake

Hi,
I started training on two nodes with the 125M.yml config file, changing only the directories for the data and tokenizer files, and added my own hostfile. During training I now get this error:

NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3

I ran the code again with NCCL_DEBUG=INFO and got this:

node2: Last error:
node2: Net : Call to recv from NODE2_IP<56843> failed : Connection refused                                                                                                         
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.                              
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)                                                                               
node1: node1:15874:17284 [4] NCCL INFO Channel 00 : 4[61000] -> 3[42000] via SHM/direct/direct                                                                                    
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.                              
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)                                                                               
node1: node1:15874:17284 [4] NCCL INFO Channel 01 : 4[61000] -> 3[42000] via SHM/direct/direct                                                                                    
node1: node1:15876:17282 [5] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15876:17282 [5] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15876:17282 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node1: node1:15874:17284 [4] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15874:17284 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15874:17284 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node1: node1:15872:17277 [3] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15872:17277 [3] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15872:17277 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0                                                                                 
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0                                                                                 
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0                                                                                    
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0     

What might be the problem, and how can I solve it?

@StellaAthena
Member

What happens when you run with NCCL_IGNORE_DISABLED_P2P=1 set? Does it crash, or does it run less efficiently than one would desire?

@mackmake
Author

If I remember correctly, it crashed when I tested with NCCL_IGNORE_DISABLED_P2P. I have stopped using the multi-node approach.

@Quentin-Anthony
Member

NCCL_IGNORE_DISABLED_P2P=1 just disables the warning message. I think NCCL_P2P_DISABLE=1 is what you'd need?
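For anyone trying this suggestion, here is a minimal sketch of preparing the environment before launching training. It assumes you start the launcher from Python or want `NAME=VALUE` lines to paste into an environment file. The helper names `nccl_env` and `as_env_lines` are hypothetical and not part of GPT-NeoX, DeepSpeed, or NCCL. Only the variable names `NCCL_DEBUG` and `NCCL_P2P_DISABLE` come from this thread.

```python
import os


def nccl_env(base=None, disable_p2p=True, debug_level="INFO"):
    """Return a copy of an environment mapping with NCCL settings applied.

    Hypothetical helper: `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = debug_level        # verbose NCCL logging, as used above
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"      # disable the GPU peer-to-peer transport
    return env


def as_env_lines(env, prefix="NCCL_"):
    """Render the NCCL-related variables as sorted NAME=VALUE lines."""
    return [f"{key}={value}" for key, value in sorted(env.items())
            if key.startswith(prefix)]


if __name__ == "__main__":
    # Print the lines so they can be exported or copied into an env file.
    for line in as_env_lines(nccl_env(base={})):
        print(line)
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting the launcher. On a multi-node run the variables must also reach the worker processes on every node, and how that happens depends on the launcher. DeepSpeed, for example, documents a `.deepspeed_env` file of `NAME=VALUE` lines for this purpose, so check which mechanism your setup uses.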

@Quentin-Anthony
Member

@mackmake -- Please reopen if this doesn't resolve your issue!
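Separately from the P2P setting, the `Connection refused` line in the log may mean that plain TCP connections between the two nodes are being blocked. That is an assumption, not something confirmed in this thread. A minimal sketch for checking basic reachability from one node to the other is below. `can_connect` is a hypothetical helper, and the host and port you pass are whatever you choose to test, not values NCCL will necessarily use.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

With some process listening on a port of your choice on node2, calling `can_connect("node2", port)` from node1 tells you whether that port is reachable. A `False` result for ports that should be open suggests a firewall or interface problem rather than anything specific to NCCL.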
