[ao] Reading notes #5

aokomoriuta · 2019-08-03T03:27:08Z

I'll trace allreduce from the top down.

aokomoriuta · 2019-08-03T03:29:39Z

Starting from the very basics.

Declaration of allreduce

nccl/src/nccl.h.in

Lines 185 to 186 in 7c72dee

    
    ncclResult_t  ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
        ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);

Definition

nccl/src/collectives/all_reduce.cc

Lines 12 to 18 in 7c72dee

    
    ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
        ncclDataType_t datatype, ncclRedOp_t op, ncclComm* comm, cudaStream_t stream) {
      struct ncclInfo info = { ncclCollAllReduce, "AllReduce",
        sendbuff, recvbuff, count, datatype, op, 0, comm, stream, /* Args */
        ALLREDUCE_CHUNKSTEPS, ALLREDUCE_SLICESTEPS };
      return ncclEnqueueCheck(&info);
    }

aokomoriuta · 2019-08-03T03:43:42Z

Taking things in order, first NCCL_API:

nccl/src/include/core.h

Lines 25 to 39 in 9db4b1d

    
    #ifdef PROFAPI
    #define NCCL_API(ret, func, args...)        \
        __attribute__ ((visibility("default"))) \
        __attribute__ ((alias(#func)))          \
        ret p##func (args);                     \
        extern "C"                              \
        __attribute__ ((visibility("default"))) \
        __attribute__ ((weak))                  \
        ret func(args)
    #else
    #define NCCL_API(ret, func, args...)        \
        extern "C"                              \
        __attribute__ ((visibility("default"))) \
        ret func(args)
    #endif // end PROFAPI

The PROFAPI compile switch is set here:

nccl/makefiles/common.mk

Line 13 in 9db4b1d

PROFAPI ?= 0

I don't really understand it, so I'll skip it (it doesn't seem to be used in normal builds).

The normal version:

nccl/src/include/core.h

Lines 36 to 38 in 9db4b1d

    
    extern "C"                              \
    __attribute__ ((visibility("default"))) \
    ret func(args)

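All this does is:

extern "C"
symbol export

The latter is a GCC extension. In fact, NCCL is built so that every function not explicitly marked with this attribute is invisible from outside the library: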

nccl/makefiles/common.mk

Line 46 in 9db4b1d

    
    CXXFLAGS   := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden
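
A minimal sketch of this export mechanism, assuming GCC or Clang. `SKETCH_API`, `sketchAdd`, `sketchHelper`, and `sketchSelfCheck` are hypothetical names for illustration, not part of NCCL; the macro only mirrors the non-PROFAPI branch of NCCL_API.

```cpp
#include <cassert>

// Hypothetical macro modelled on the non-PROFAPI branch of NCCL_API:
// C linkage plus default (exported) visibility.
#define SKETCH_API(ret, func, ...) \
  extern "C" __attribute__((visibility("default"))) ret func(__VA_ARGS__)

// Declared through the macro: stays exported even under -fvisibility=hidden.
SKETCH_API(int, sketchAdd, int a, int b);
int sketchAdd(int a, int b) { return a + b; }

// Not annotated: under -fvisibility=hidden this symbol is not exported
// from the shared library, although code inside the library can call it.
int sketchHelper(int x) { return 2 * x; }

// Inside the library itself, both functions remain callable.
void sketchSelfCheck() {
  assert(sketchAdd(1, 2) == 3);
  assert(sketchHelper(4) == 8);
}
```

Building this with `g++ -shared -fPIC -fvisibility=hidden` and listing the dynamic symbols with `nm -D` should show `sketchAdd` but not `sketchHelper`.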

aokomoriuta · 2019-08-03T03:49:52Z

ncclInfo is just a struct that bundles the arguments together.

nccl/src/include/info.h

Lines 23 to 43 in 9db4b1d

    
    struct ncclInfo {
      ncclColl_t coll;
      const char* opName;
      // NCCL Coll Args
      const void* sendbuff;
      void* recvbuff;
      size_t count;
      ncclDataType_t datatype;
      ncclRedOp_t op;
      int root;
      ncclComm_t comm;
      cudaStream_t stream;
      // Algorithm details
      int chunkSteps;
      int sliceSteps;
      // Computed later
      ncclPattern_t pattern;
      size_t nBytes;
      int nstepsPerLoop;
      int nchunksPerLoop;
    };

Most fields are the arguments of ncclAllReduce passed straight through. The exceptions are:

ncclCollAllReduce:

nccl/src/include/devcomm.h

Line 16 in 9db4b1d

typedef enum { ncclCollBroadcast, ncclCollReduce, ncclCollAllGather, ncclCollReduceScatter, ncclCollAllReduce, ncclCollCount } ncclColl_t;

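It is defined there, but I don't know what it is for yet (it will probably come up later).
opName: presumably used for something (error messages, maybe).
The 0 in `op, 0, comm` looks like the root rank (but allreduce doesn't have a root, does it?).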

aokomoriuta · 2019-08-03T03:59:39Z

ncclEnqueueCheck

nccl/src/enqueue.cc

Lines 409 to 442 in 9db4b1d

    
    ncclResult_t ncclEnqueueCheck(struct ncclInfo* info) {
      if (info->comm == NULL) return ncclInvalidArgument;
      INFO(NCCL_COLL,"%s: opCount %lx sendbuff %p recvbuff %p count %zi datatype %d op %d root %d comm %p [nranks=%d] stream %p",
           info->opName, info->comm->opCount, info->sendbuff, info->recvbuff, info->count,
           info->datatype, info->op, info->root, info->comm, info->comm->nRanks, info->stream);
      // Launch asynchronously if needed
      if (ncclAsyncMode()) {
        ncclResult_t ret = ncclSuccess;
        int savedDev = -1;
        if (info->comm->checkPointers) {
          CUDACHECKGOTO(cudaGetDevice(&savedDev), ret, end);
          CUDACHECKGOTO(cudaSetDevice(info->comm->cudaDev), ret, end);
        }
        // Check arguments
        NCCLCHECKGOTO(ArgsCheck(info), ret, end);
        // Always register comm even in case of error to make sure ncclGroupEnd
        // cleans it up.
        NCCLCHECKGOTO(ncclAsyncColl(info->comm), ret, end);
        NCCLCHECKGOTO(saveKernel(info), ret, end);
    end:
        if (savedDev != -1) CUDACHECK(cudaSetDevice(savedDev));
        ncclAsyncErrCheck(ret);
        return ret;
      } else {
        NCCLCHECK(ArgsCheck(info));
        NCCLCHECK(saveKernel(info));
        NCCLCHECK(ncclBarrierEnqueue(info->comm));
        NCCLCHECK(ncclBarrierEnqueueWait(info->comm));
        NCCLCHECK(ncclEnqueueEvents(info->comm));
        return ncclSuccess;
      }
    }

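The function branches on whether async mode is active. Async mode is for using multiple GPUs from a single process: it makes ncclAllReduce asynchronous so that it can be called for every GPU. It gets too complicated, so I'll skip it.

So the processing of interest here is: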

nccl/src/enqueue.cc

Lines 435 to 439 in 9db4b1d

    
    NCCLCHECK(ArgsCheck(info));
    NCCLCHECK(saveKernel(info));
    NCCLCHECK(ncclBarrierEnqueue(info->comm));
    NCCLCHECK(ncclBarrierEnqueueWait(info->comm));
    NCCLCHECK(ncclEnqueueEvents(info->comm));

aokomoriuta · 2019-08-03T04:09:21Z

ArgsCheck

nccl/src/misc/argcheck.cc

Lines 35 to 68 in 9db4b1d

    
    ncclResult_t ArgsCheck(struct ncclInfo* info) {
      NCCLCHECK(PtrCheck(info->comm, info->opName, "comm"));
      // First, the easy ones
      if (info->root < 0 || info->root >= info->comm->nRanks) {
        WARN("%s : invalid root %d (root should be in the 0..%d range)", info->opName, info->root, info->comm->nRanks);
        return ncclInvalidArgument;
      }
      if (info->datatype < 0 || info->datatype >= ncclNumTypes) {
        WARN("%s : invalid type %d", info->opName, info->datatype);
        return ncclInvalidArgument;
      }
      // Type is OK, compute nbytes. Convert Allgather/Broadcast calls to chars.
      info->nBytes = info->count * ncclTypeSize(info->datatype);
      if (info->coll == ncclCollAllGather || info->coll == ncclCollBroadcast) {
        info->count = info->nBytes;
        info->datatype = ncclInt8;
      }
      if (info->coll == ncclCollAllGather || info->coll == ncclCollReduceScatter) info->nBytes *= info->comm->nRanks; // count is per rank
      if (info->op < 0 || info->op >= ncclNumOps) {
        WARN("%s : invalid reduction operation %d", info->opName, info->op);
        return ncclInvalidArgument;
      }
      if (info->comm->checkPointers) {
        // Check CUDA device pointers
        if (info->coll != ncclCollBroadcast || info->comm->rank == info->root) {
          NCCLCHECK(CudaPtrCheck(info->sendbuff, info->comm, "sendbuff", info->opName));
        }
        if (info->coll != ncclCollReduce || info->comm->rank == info->root) {
          NCCLCHECK(CudaPtrCheck(info->recvbuff, info->comm, "recvbuff", info->opName));
        }
      }
      return ncclSuccess;

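Checks that the root rank is within 0 .. nRanks-1
Checks that datatype is one of the defined types
Determines the message length (nBytes)
Checks that the reduction operator is one of the defined ones
Checks that sendbuff/recvbuff are CUDA device pointers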
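
The root check and the message-length computation can be sketched as follows. This is a simplified stand-in and not NCCL code: `Coll`, `InfoSketch`, and `checkArgsSketch` are hypothetical names, the element size is passed in directly instead of calling ncclTypeSize(), and the datatype, operator, and pointer checks are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for ncclColl_t and the ncclInfo fields used here.
enum class Coll { Broadcast, Reduce, AllGather, ReduceScatter, AllReduce };

struct InfoSketch {
  Coll coll;
  std::size_t count;     // number of elements
  std::size_t typeSize;  // bytes per element (what ncclTypeSize() would return)
  int root;
  int nRanks;
  std::size_t nBytes;    // filled in by checkArgsSketch()
};

// Returns false for an invalid root, otherwise fills in nBytes.
bool checkArgsSketch(InfoSketch& info) {
  assert(info.nRanks > 0);
  // Valid roots are 0 .. nRanks-1.
  if (info.root < 0 || info.root >= info.nRanks) return false;
  info.nBytes = info.count * info.typeSize;
  // Mirrors "Convert Allgather/Broadcast calls to chars": count becomes bytes.
  if (info.coll == Coll::AllGather || info.coll == Coll::Broadcast) {
    info.count = info.nBytes;
    info.typeSize = 1;
  }
  // For these two collectives count is per rank, so the total is nRanks times larger.
  if (info.coll == Coll::AllGather || info.coll == Coll::ReduceScatter) {
    info.nBytes *= static_cast<std::size_t>(info.nRanks);
  }
  return true;
}
```

For an allreduce of 1024 four-byte elements on 8 ranks this gives nBytes = 4096, while an allgather with the same arguments gives 32768.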

aokomoriuta · 2019-08-03T04:44:49Z

The concept of a "channel" shows up here, but what does it actually refer to...?

aokomoriuta · 2019-08-03T04:50:14Z

Definition of ncclChannel

nccl/src/include/devcomm.h

Lines 147 to 171 in 9db4b1d

    
    struct ncclChannel {
      union {
        struct {
          struct ncclRing ring;
          struct ncclTree tree;
          int id;
          int nthreads;
          int buffSize;
          // Communication structures
          struct ncclPeer* peers;
          struct ncclPeer* devPeers;
          // Operation list for aggregation
          struct ncclColl* collectives;
          struct ncclColl* devCollectives;
          int collStart;
          int collCount;
          int collFifoHead; // Only used by GPU
          int collFifoTail; // Only used by CPU
        };
        int data[0x80];
      };
    };

It looks like something that defines how communication is done, but...?

Initialization is in:

nccl/src/channel.cc

Lines 12 to 34 in 9db4b1d

    
    ncclResult_t initChannel(struct ncclComm* comm, int channelid) {
      struct ncclChannel* channel = comm->channels+channelid;
      channel->id = channelid;
      // Setup intermediate buffering
      channel->buffSize = ncclParamBuffsize();
      // Ring index to user rank table.
      NCCLCHECK(ncclCudaCalloc(&channel->ring.devUserRanks, comm->nRanks));
      NCCLCHECK(ncclCalloc(&channel->ring.userRanks, comm->nRanks));
      // Communication structures with peers.
      NCCLCHECK(ncclCudaCalloc(&channel->devPeers, comm->nRanks));
      NCCLCHECK(ncclCalloc(&channel->peers, comm->nRanks));
      for (size_t i=0; i<comm->nRanks; ++i) {
        channel->peers[i].send.comm = comm;
        channel->peers[i].recv.comm = comm;
      }
      // Per-channel operation list.
      NCCLCHECK(ncclCudaHostAlloc((void**)&channel->collectives, (void**)&channel->devCollectives, sizeof(struct ncclColl)*NCCL_MAX_OPS));
      return ncclSuccess;
    }

aokomoriuta · 2019-08-03T04:52:42Z

nccl/src/init.cc

Lines 812 to 818 in 8e04d80

    
    for (int r=0; r<nrings; r++) {
      struct ncclChannel* channel = comm->channels+r;
      NCCLCHECK(setupChannel(comm, r, rank, nranks, rings+r*nranks, treeIn+r*nranks));
      NCCLCHECK(p2pSetup(comm, channel, 1, &channel->ring.prev, 1, &channel->ring.next));
      NCCLCHECK(p2pSetup(comm, channel, NCCL_MAX_TREE_ARITY, channel->tree.down, 1, &channel->tree.up));
      NCCLCHECK(p2pSetup(comm, channel, 1, &channel->tree.up, NCCL_MAX_TREE_ARITY, channel->tree.down));
    }

In this code I noticed that the ring index is what gets passed as the channel id. So maybe a channel is simply one ring?

aokomoriuta · 2019-08-03T04:58:00Z

By "ring" I mean the one described at https://research.preferred.jp/2018/07/prototype-allreduce-library/.

In that case, in what situation would there be more than one ring?

aokomoriuta · 2019-08-03T05:34:07Z

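I'll proceed on the assumption that channel = ring.

The number of channels is capped at 16, so apparently not many are meant to be created.

From what I gathered in discussions on Slack and elsewhere, multiple rings seem to be created in order to use up all the links when the GPUs are fully connected via NVLink or similar.

Is the cap 16 because NVLink only supports up to 16 GPUs? The actual nrings is decided by this function: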

nccl/src/misc/rings.cc

Line 204 in 0ceaec9

    
    ncclResult_t ncclGetRings(int* nrings, int* nthreads, int rank, int nranks, int* transports, ncclTvalue_t* values, int* prev, int* next, int* treeIn, int* treeOut) {

It looks fairly complicated, though. This is a bit of a detour, but I'm curious how the count is decided, so let's work through it.

aokomoriuta · 2019-08-03T05:42:58Z

nccl/src/misc/rings.cc

Lines 209 to 225 in 0ceaec9

    
    char* str = getenv("NCCL_RINGS");
    if (str && strlen(str)>0) {
      int ret = parseRings(str, nrings, nranks, prev, next);
      if (ret == ncclSuccess && *nrings > 0) {
        if (rank == 0) INFO(NCCL_INIT,"%d ring(s) set by environment", *nrings);
        NCCLCHECK(getEnvThreads(nthreads));
        for (int r = 0; r<*nrings; r++) {
          for (int i = 0; i<nranks; i++) {
            if (transports[i*nranks+prev[r*nranks+i]] == 2) treeIn[r*nranks+i] = 1;
            if (transports[i*nranks+next[r*nranks+i]] == 2) treeOut[r*nranks+i] = 1;
          }
        }
        return ncclSuccess;
      }
      if (rank == 0) INFO(NCCL_INIT,"No valid ring found in environment, ignoring");
      *nrings = 0;
    }

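This part reads the configuration from the NCCL_RINGS environment variable. If the string parses into valid rings, they are used; if not, that is not treated as an error and the code simply falls through to internal ring generation. The parsing itself doesn't look very interesting, so I'll skip it.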
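
The "use the environment if it parses, otherwise fall through" pattern can be sketched like this. Note the assumptions: the thread skips parseRings(), so the string format used here (rings separated by `|`, ranks separated by spaces, each ring a permutation of all ranks) is a guess for illustration only, and `parseRingsSketch` / `ringsFromEnvSketch` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parser. Returns an empty list when the text is not usable,
// which the caller treats as "no rings from the environment".
std::vector<std::vector<int>> parseRingsSketch(const std::string& text, int nranks) {
  assert(nranks > 0);
  std::vector<std::vector<int>> rings;
  std::stringstream all(text);
  std::string ringText;
  while (std::getline(all, ringText, '|')) {
    std::stringstream one(ringText);
    std::vector<int> ring;
    std::vector<bool> seen(nranks, false);
    int rank;
    while (one >> rank) {
      // Reject out-of-range or repeated ranks.
      if (rank < 0 || rank >= nranks || seen[rank]) return {};
      seen[rank] = true;
      ring.push_back(rank);
    }
    // Every ring must visit every rank exactly once.
    if (static_cast<int>(ring.size()) != nranks) return {};
    rings.push_back(ring);
  }
  return rings;
}

// An empty result means: ignore the environment and generate rings internally.
std::vector<std::vector<int>> ringsFromEnvSketch(int nranks) {
  const char* str = std::getenv("NCCL_RINGS");
  if (str == nullptr || std::strlen(str) == 0) return {};
  return parseRingsSketch(str, nranks);
}
```

The point of the design is that a malformed value degrades to the default behaviour instead of aborting initialization.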

aokomoriuta · 2019-08-03T05:45:03Z

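It looks like I need to understand the ring generation algorithm mentioned in #1. Someone probably knows it already, but I'll work it out myself, partly as practice (?).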

aokomoriuta · 2019-08-03T06:07:09Z

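A new concept, transports, has appeared, so let me sort it out. There are three kinds of transport:

P2P communication
Communication via shared memory (shm)
Network communication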

nccl/src/transport.cc

Lines 13 to 17 in 9db4b1d

    
    struct ncclTransport ncclTransports[NTRANSPORTS] = {
      p2pTransport,
      shmTransport,
      netTransport,
    };

aokomoriuta · 2019-08-03T06:13:51Z

Before building rings, the code first needs the connectivity between devices (or rather, ranks):

nccl/src/misc/rings.cc

Lines 131 to 171 in 0ceaec9

    
    static ncclResult_t fillCoords(int nranks, int* matrix, int* coords, int* rankToIdx, int* idxToRank) {
      int current[NTRANSPORTS];
      int* p2pConnected;
      NCCLCHECK(ncclCalloc(&p2pConnected, nranks));
      for (int i=0; i<NTRANSPORTS; i++) current[i] = 0;
      int curRank = 0, idx = 0;
      while (1) {
        // P2P is handled separately as there is no level below it and we need to
        // cover the case of being connected to another GPU indirectly.
        // So we detect all GPUs in the same P2P domain once and add them all at
        // once.
        isConnected(curRank, p2pConnected, nranks, matrix, 0);
        for (int r=0; r<nranks; r++) {
          if (p2pConnected[r]) {
            NEW_IDX(r);
            curRank = r;
            current[0]++;
          }
        }
        current[0] = 0;
        if (idx == nranks) {
          free(p2pConnected);
          return ncclSuccess;
        }
        // Find next group, either connected through SHM or NET.
        int rank;
        int transport = 1;
        while ((rank = findConnected(curRank, matrix, nranks, transport, coords)) == -1) {
          current[transport] = 0;
          transport++;
          if (transport == NTRANSPORTS) {
            WARN("Error : Could not find transport to connect next group\n");
            free(p2pConnected);
            return ncclInternalError; }
        }
        curRank = rank;
        current[transport]++;
      }
    }

That is what this function examines.

aokomoriuta · 2019-08-03T06:19:22Z

Whether two ranks are connected is actually not determined here, but in the following function:

nccl/src/init.cc

Lines 472 to 484 in 8e04d80

    
    static ncclResult_t fillConnect(struct ncclPeerInfo* peerInfo, int nranks, int rank, int* connectTransport, ncclTvalue_t* connectValue) {
      for (int r=0; r<nranks; r++) {
        connectTransport[r] = -1;
        for (int t=0; t<NTRANSPORTS; t++) {
          NCCLCHECK(ncclTransports[t].canConnect(connectValue+r, peerInfo+rank, peerInfo+r));
          if (connectValue[r] > 0) {
            connectTransport[r] = t;
            break;
          }
        }
      }
      return ncclSuccess;
    }

The result is computed there and apparently just passed in as the matrix argument.
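
A toy version of this first-match selection, to make the resulting row of the matrix concrete. Everything here is an assumption for illustration: the three score functions stand in for the real canConnect callbacks, and the topology (2 GPUs per P2P domain, 4 GPUs per host) is made up.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int kNumTransports = 3;  // 0 = P2P, 1 = SHM, 2 = NET

// Hypothetical canConnect callbacks: a score > 0 means "usable".
// Toy topology: 2 GPUs per P2P domain, 4 GPUs per host.
using CanConnectSketch = int (*)(int myRank, int peerRank);
int p2pScore(int a, int b) { return a / 2 == b / 2 ? 1 : 0; }
int shmScore(int a, int b) { return a / 4 == b / 4 ? 1 : 0; }
int netScore(int, int) { return 1; }

// For one rank, pick the first transport (in priority order) that reaches each peer.
std::vector<int> fillConnectSketch(int rank, int nranks) {
  assert(rank >= 0 && rank < nranks);
  const std::array<CanConnectSketch, kNumTransports> transports = {p2pScore, shmScore, netScore};
  std::vector<int> connectTransport(nranks, -1);
  for (int r = 0; r < nranks; ++r) {
    for (int t = 0; t < kNumTransports; ++t) {
      if (transports[t](rank, r) > 0) {
        connectTransport[r] = t;  // lowest-numbered usable transport wins
        break;
      }
    }
  }
  return connectTransport;
}
```

With 8 ranks, rank 0 gets the row {0, 0, 1, 1, 2, 2, 2, 2}: P2P to its domain, SHM to the rest of its host, NET to the other host.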

aokomoriuta · 2019-08-03T06:36:24Z

canConnect is a function pointer, which makes it hard to follow, but the actual implementations are left to each transport under src/transport/.

p2p

nccl/src/transport/p2p.cc

Lines 58 to 126 in 9db4b1d

    
    ncclResult_t p2pCanConnect(ncclTvalue_t* ret, struct ncclPeerInfo* myInfo, struct ncclPeerInfo* peerInfo) {
      // Do not use P2P across root complexes by default (provided CUDA permits it)
      int p2pLevel = PATH_NODE;
      if (ncclParamP2pDisable() == 1) p2pLevel = 0;
      if (ncclParamP2pLevel() != -2) p2pLevel = ncclParamP2pLevel();
      *ret = 0;
      if (p2pLevel == 0) return ncclSuccess;
      // Rule out different nodes
      if (myInfo->hostHash != peerInfo->hostHash) return ncclSuccess;
      // Convert the peer's busId into a local cudaDev index (cf. CUDA_VISIBLE_DEVICES)
      int peerCudaDev = busIdToCudaDev(peerInfo->busId);
      if (peerCudaDev == -1) {
        // Peer's CUDA device is not visible in this process
    #if CUDART_VERSION >= 10010
        // But in CUDA 10.1 we can still communicate with 'invisible' devices
        TRACE(NCCL_INIT|NCCL_P2P, "Checking P2P connection between %d(%s) and %d(%s)", myInfo->nvmlDev, myInfo->busId, peerInfo->nvmlDev, peerInfo->busId);
        // Check for NVLink/NVswitch including P2P access
        int nvlinkp2p = getNvlinkGpu(myInfo->busId, peerInfo->busId);
        if (nvlinkp2p > 0) {
          *ret = nvlinkp2p;
          return ncclSuccess;
        }
    #endif
        return ncclSuccess;
      }
      TRACE(NCCL_INIT|NCCL_P2P, "Checking P2P connection between [%d=%d] and [%d=%d]", myInfo->cudaDev, myInfo->nvmlDev, peerCudaDev, peerInfo->nvmlDev);
      // Do not detect topology if we're on the same GPU. Note this is not really supported.
      if (myInfo->cudaDev == peerCudaDev) {
        *ret = 1 + PATH_SYS;
        return ncclSuccess;
      }
      // See if CUDA can do P2P
      int p2p;
      if (cudaDeviceCanAccessPeer(&p2p, myInfo->cudaDev, peerCudaDev) != cudaSuccess) {
        INFO(NCCL_INIT|NCCL_P2P,"peer query failed between dev %d(=%d) and dev %d(=%d)",
             myInfo->cudaDev, myInfo->nvmlDev, peerCudaDev, peerInfo->nvmlDev);
        return ncclSuccess;
      }
      if (p2p == 0) return ncclSuccess;
      // Check for NVLink/NVswitch
      int nvlinkp2p = getNvlinkGpu(myInfo->busId, peerInfo->busId);
      if (nvlinkp2p > 0) {
        *ret = nvlinkp2p;
        return ncclSuccess;
      }
      // Finally compute the PCI distance and compare with the p2pLevel.
      char* myPath;
      char* peerPath;
      ncclResult_t err1 = getCudaPath(myInfo->cudaDev, &myPath);
      ncclResult_t err2 = getCudaPath(peerCudaDev, &peerPath);
      if (err1 == ncclSuccess && err2 == ncclSuccess) {
        int distance = pciDistance(myPath, peerPath);
        if (distance < p2pLevel) {
          *ret = 1 + PATH_SYS - distance;
        }
      }
      if (err1 == ncclSuccess) free(myPath);
      if (err2 == ncclSuccess) free(peerPath);
      return ncclSuccess;
    }

shm

nccl/src/transport/shm.cc

Lines 44 to 47 in 9db4b1d

    
    ncclResult_t shmCanConnect(ncclTvalue_t* ret, struct ncclPeerInfo* myInfo, struct ncclPeerInfo* peerInfo) {
      *ret = ((ncclParamShmDisable() == 1) || (myInfo->hostHash != peerInfo->hostHash)) ? 0 : 1;
      return ncclSuccess;
    }

net

nccl/src/transport/net.cc

Lines 116 to 133 in 9db4b1d

    
    ncclResult_t netCanConnect(ncclTvalue_t* ret, struct ncclPeerInfo* myInfo, struct ncclPeerInfo* peerInfo) {
      int cudaDev;
      CUDACHECK(cudaGetDevice(&cudaDev));
      ret[0] = ncclNetTvalues[cudaDev];
      if (ret[0] == NET_TVALUE_UNKNOWN) {
        if (cudaDev >= NET_MAX_GPUS) {
          WARN("CUDA device %d >= MAX %d\n", cudaDev, NET_MAX_GPUS);
          return ncclInternalError;
        }
        int nDev;
        short* distances;
        NCCLCHECK(netDevices(&nDev, &distances));
        ncclNetTvalues[cudaDev] = ret[0] = getTvalue(distances, nDev);
        ncclNetNDev = nDev;
        free(distances);
      }
      return ncclSuccess;
    }

How each of them actually checks looks difficult, and I'm not that interested right now, so I'll skip it.

aokomoriuta · 2019-08-03T06:45:31Z

nccl/src/misc/rings.cc

Lines 142 to 150 in 0ceaec9

    
    isConnected(curRank, p2pConnected, nranks, matrix, 0);
    for (int r=0; r<nranks; r++) {
      if (p2pConnected[r]) {
        NEW_IDX(r);
        curRank = r;
        current[0]++;
      }
    }
    current[0] = 0;

This is hard to read, so written out in full it becomes:

    isConnected(curRank, p2pConnected, nranks, matrix, 0);
    for (int r=0; r<nranks; r++) {
      if (p2pConnected[r]) {
        rankToIdx[r] = idx;
        idxToRank[idx] = r;
        for (int t=0; t<NTRANSPORTS; t++) coords[r*NTRANSPORTS+t] = current[t];
        idx++;
        curRank = r;
        current[0]++;
      }
    }
    current[0] = 0;

aokomoriuta · 2019-08-03T06:48:06Z

isConnected(curRank, p2pConnected, nranks, matrix, 0);

The 0 here refers to p2p.

aokomoriuta · 2019-08-03T06:50:58Z

If everything can communicate over P2P, that's the end of it #obviously

nccl/src/misc/rings.cc

Lines 152 to 155 in 0ceaec9

    
    if (idx == nranks) {
      free(p2pConnected);
      return ncclSuccess;
    }

aokomoriuta · 2019-08-03T06:57:49Z

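Ah, no, that's not quite it. Ranks that are P2P-connected to each other are called a group, and the code first collects every rank reachable from rank 0 and makes that group 0.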

aokomoriuta · 2019-08-03T07:18:23Z

This is where it looks for the next group:

nccl/src/misc/rings.cc

Lines 157 to 169 in 0ceaec9

    
    // Find next group, either connected through SHM or NET.
    int rank;
    int transport = 1;
    while ((rank = findConnected(curRank, matrix, nranks, transport, coords)) == -1) {
      current[transport] = 0;
      transport++;
      if (transport == NTRANSPORTS) {
        WARN("Error : Could not find transport to connect next group\n");
        free(p2pConnected);
        return ncclInternalError; }
    }
    curRank = rank;
    current[transport]++;

Again this is hard to read, so I wrote it out in full:

    // Find next group, either connected through SHM or NET.
    int rank;
    int transport = 1;
    do {
      rank = -1;
      for (int r=0; r<nranks; r++) {
        if (coords[r*NTRANSPORTS + 0] == -1 && matrix[curRank*nranks+r] == transport) {
          rank = r;
          break;
        }
      }

      if(rank == -1) {
        current[transport] = 0;
        transport++;
        if (transport == NTRANSPORTS) {
          WARN("Error : Could not find transport to connect next group\n");
          free(p2pConnected);
          return ncclInternalError;
        }
      }

    } while(rank == -1);
    curRank = rank;
    current[transport]++;

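Here it searches, among the ranks connected to curRank (i.e. the last rank of the P2P-interconnected group just found) via shm or net (shm preferred), for a rank that has not yet been included in any P2P group. That rank becomes the starting point for finding the next P2P group.

Perfectly ordinary.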

aokomoriuta · 2019-08-03T07:36:27Z

Having read this far, I finally understand what coords means. The comment in the source says:

nccl/src/misc/rings.cc

Lines 72 to 73 in 0ceaec9

    
     * {   node, pci_domain,   rank } corresponding to the three transports :
     * { 2[NET],     1[SHM], 0[P2P] }.

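The order in the array is reversed, though. For a given rank, the entries mean:

coords[rank][0]: index within the group that can communicate over P2P
coords[rank][1]: index of the P2P group among those that can communicate over shm (i.e. within the node)
coords[rank][2]: index of the group that can communicate over net. That said, the code fails with an error if anything is unreachable, so this should effectively only ever be 0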

aokomoriuta · 2019-08-03T07:41:23Z

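So all fillCoords() does is find, for each rank, the P2P group (roughly, everything under one PCIe root complex or NVLink switch) and the shm group (roughly, everything within one node) it belongs to, number them, and reassign IDs from the front so that ranks are clustered group by group in a tree-like order.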
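
To check this reading, here is a toy re-implementation of the walk. It is a sketch under assumptions, not NCCL code: the numbering step and the search for the next group follow the written-out versions earlier in this thread, but isConnected() is not shown anywhere here, so the transitive P2P walk below is a guess based on the "connected to another GPU indirectly" comment. `fillCoordsSketch` is a hypothetical name, and the rankToIdx/idxToRank bookkeeping is omitted.

```cpp
#include <cassert>
#include <vector>

constexpr int kNumTransports = 3;  // 0 = P2P, 1 = SHM, 2 = NET

// matrix[i*nranks+j] holds the transport connecting ranks i and j.
// Returns coords (nranks x kNumTransports), or an empty vector on failure.
std::vector<int> fillCoordsSketch(int nranks, const std::vector<int>& matrix) {
  assert(static_cast<int>(matrix.size()) == nranks * nranks);
  std::vector<int> coords(nranks * kNumTransports, -1);
  int current[kNumTransports] = {0, 0, 0};
  int curRank = 0, assigned = 0;
  while (true) {
    // Assumed behaviour of isConnected(): collect everything reachable over P2P links.
    std::vector<int> inGroup(nranks, 0), todo{curRank};
    inGroup[curRank] = 1;
    while (!todo.empty()) {
      int cur = todo.back();
      todo.pop_back();
      for (int r = 0; r < nranks; ++r) {
        if (!inGroup[r] && matrix[cur * nranks + r] == 0) {
          inGroup[r] = 1;
          todo.push_back(r);
        }
      }
    }
    // Number the members of this P2P group (the NEW_IDX step).
    for (int r = 0; r < nranks; ++r) {
      if (!inGroup[r]) continue;
      for (int t = 0; t < kNumTransports; ++t) coords[r * kNumTransports + t] = current[t];
      ++assigned;
      curRank = r;
      ++current[0];
    }
    current[0] = 0;
    if (assigned == nranks) return coords;
    // Find a not-yet-numbered rank reachable from curRank, trying SHM before NET.
    int transport = 1, next = -1;
    while (next == -1) {
      for (int r = 0; r < nranks && next == -1; ++r) {
        if (coords[r * kNumTransports] == -1 && matrix[curRank * nranks + r] == transport) next = r;
      }
      if (next == -1) {
        current[transport] = 0;
        if (++transport == kNumTransports) return {};  // nothing left to connect through
      }
    }
    curRank = next;
    ++current[transport];
  }
}
```

For a single host where ranks {0,1} and {2,3} each form a P2P pair and the pairs see each other over SHM, the coords come out as (0,0,0), (1,0,0), (0,1,0), (1,1,0) for ranks 0 to 3.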

aokomoriuta · 2019-08-03T07:49:20Z

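Also, this fillCoords() is run in full on every rank. Is that implementation really OK...? Well, I suppose having everyone do the same computation is better than adding awkward extra communication.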

aokomoriuta · 2019-08-03T07:56:32Z

Back to ncclGetRings():

nccl/src/misc/rings.cc

Lines 253 to 263 in 0ceaec9

    
    for (int t=NTRANSPORTS-1; t>=0; t--) {
      for (int i=0; i<nranks; i++) idxToRank[i] = rankToIdx[i] = -1;
      int nidx = 0;
      for (int i=0; i<nranks; i++) {
        // Extract only ranks in the same local area as rank
        // We need to extract them in the topological order, hence we iterate over indexes, not ranks
        int r = globalIdxToRank[i];
        int sameLocal = 1;
        for (int tr = NTRANSPORTS-1; tr > t; tr--) if (coords[r*NTRANSPORTS+tr] != coords[rank*NTRANSPORTS+tr]) sameLocal = 0;
        if (!sameLocal) continue;

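This code goes through net, shm, p2p in that order and, at each level, searches for the ranks that share the same higher-level (wider) groups as the rank argument.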

aokomoriuta · 2019-08-03T08:40:49Z

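About TValue, information from rmatsumiya:

For P2P, it is computed from the distance in terms of switches crossed
Outside the NIC (i.e. between nodes), everything simply ends up with the "worst value"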

aokomoriuta · 2019-08-03T08:50:43Z

To do next time and after:

Review
Investigate how TValue is turned into the cost value of a ring

nccl/src/misc/rings.cc

Lines 280 to 322 in 0ceaec9

    
    /* Extract subvalues */
    for (int i=0; i<nidx; i++) {
      for (int j=0; j<nidx; j++) {
        if (transports[idxToRank[i]*nranks+idxToRank[j]] == t)
          subvalues[i*nidx+j] = values[idxToRank[i]*nranks+idxToRank[j]];
        else
          subvalues[i*nidx+j] = 0;
      }
    }
    /* Extract subprev/subnext */
    for (int i=0; i<nidx*nringsTmp; i++) {
      subprev[i] = subnext[i] = -1;
    }
    for (int r=0; r<nringsTmp; r++) {
      int start = -1, end = -1;
      for (int i=0; i<nranks; i++) {
        if (rankToIdx[i] == -1) continue;
        if (prevTmp[r*nranks+i] != -1) start = i;
        if (nextTmp[r*nranks+i] != -1) end = i;
      }
      if (start != -1 && end != -1) {
        subprev[r*nidx+rankToIdx[start]] = rankToIdx[end];
        subnext[r*nidx+rankToIdx[end]] = rankToIdx[start];
      }
    }
    /* Get rings */
    NCCLCHECK(ncclTransports[t].getRings(nidx, groups, subgroups, subvalues, &nringsTmp, subprev, subnext, minScore, &nThreads));
    /* Merge subprev/subnext into prev/next */
    for (int r=0; r<nringsTmp; r++) {
      for (int i=0; i<nidx; i++) {
        if ((prevTmp[r*nranks+idxToRank[i]] == -1) && (subprev[r*nidx+i] != -1)) prevTmp[r*nranks+idxToRank[i]] = idxToRank[subprev[r*nidx+i]];
        if ((nextTmp[r*nranks+idxToRank[i]] == -1) && (subnext[r*nidx+i] != -1)) nextTmp[r*nranks+idxToRank[i]] = idxToRank[subnext[r*nidx+i]];
        if (t == NTRANSPORTS-1) {
          // Save node-level masters for trees
          treeIn[r*nranks+idxToRank[i]] = prevTmp[r*nranks+idxToRank[i]] == -1 ? 0 : 1;
          treeOut[r*nranks+idxToRank[i]] = nextTmp[r*nranks+idxToRank[i]] == -1 ? 0 : 1;
        }
      }
    }
    //for (int r=0; r<nringsTmp; r++) {
    //printf("[%d] [%d] [%d] [%d] Prev ", rank, minScore, t, r); for (int i=0; i<nranks; i++) printf("%d ", prevTmp[r*nranks+i]); printf("\n");
    //printf("[%d] [%d] [%d] [%d] Next ", rank, minScore, t, r); for (int i=0; i<nranks; i++) printf("%d ", nextTmp[r*nranks+i]); printf("\n");
    //}

aokomoriuta · 2019-09-01T03:11:52Z

Investigate how TValue is turned into the cost value of a ring

Doing this now.

aokomoriuta · 2019-09-01T03:37:38Z

nccl/src/misc/rings.cc

Lines 257 to 270 in 0ceaec9

    
    for (int i=0; i<nranks; i++) {
      // Extract only ranks in the same local area as rank
      // We need to extract them in the topological order, hence we iterate over indexes, not ranks
      int r = globalIdxToRank[i];
      int sameLocal = 1;
      for (int tr = NTRANSPORTS-1; tr > t; tr--) if (coords[r*NTRANSPORTS+tr] != coords[rank*NTRANSPORTS+tr]) sameLocal = 0;
      if (!sameLocal) continue;
      groups[nidx] = coords[r*NTRANSPORTS+t];
      subgroups[nidx] = t ? coords[r*NTRANSPORTS+t-1] : nidx;
      rankToIdx[r] = nidx;
      idxToRank[nidx] = r;
      nidx++;
    }

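This does what the comments say. It looks for the ranks that belong to ~~the same net and the same shm and the same P2P~~ the same higher-level transports as my own rank (shm and net when processing p2p, net when processing shm), which the code calls the local area, and for each of them it stores:

groups: the identifier at the level t currently being processed (one of net/shm/P2P)
subgroups: the identifier at the level below t (one of shm/P2P)
rankToIdx/idxToRank: the mapping between rank and index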
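
A small sketch of this extraction, assuming the coords layout described earlier in the thread (index 0 = P2P, 1 = shm, 2 = net). `LocalAreaSketch` and `extractLocalAreaSketch` are hypothetical names, and the rankToIdx table is left out because it is just the inverse of idxToRank.

```cpp
#include <cassert>
#include <vector>

constexpr int kNumTransports = 3;  // 0 = P2P, 1 = SHM, 2 = NET

struct LocalAreaSketch {
  std::vector<int> groups;     // coordinate at level t
  std::vector<int> subgroups;  // coordinate at level t-1, or the index itself when t == 0
  std::vector<int> idxToRank;  // local index -> rank
};

// Select the ranks that share every level above t with `rank`.
LocalAreaSketch extractLocalAreaSketch(int rank, int t, int nranks,
                                       const std::vector<int>& coords,
                                       const std::vector<int>& globalIdxToRank) {
  assert(t >= 0 && t < kNumTransports);
  assert(static_cast<int>(coords.size()) == nranks * kNumTransports);
  LocalAreaSketch area;
  for (int i = 0; i < nranks; ++i) {
    const int r = globalIdxToRank[i];  // iterate in topological order, not rank order
    bool sameLocal = true;
    for (int tr = kNumTransports - 1; tr > t; --tr) {
      if (coords[r * kNumTransports + tr] != coords[rank * kNumTransports + tr]) sameLocal = false;
    }
    if (!sameLocal) continue;
    const int nidx = static_cast<int>(area.idxToRank.size());
    area.groups.push_back(coords[r * kNumTransports + t]);
    area.subgroups.push_back(t ? coords[r * kNumTransports + t - 1] : nidx);
    area.idxToRank.push_back(r);
  }
  return area;
}
```

With coords (0,0,0), (1,0,0), (0,1,0), (1,1,0) for ranks 0 to 3, calling this for rank 2 at t = 0 keeps only ranks {2, 3}, while at t = 1 it keeps all four ranks with groups {0, 0, 1, 1}.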

aokomoriuta · 2019-09-01T04:11:52Z

nccl/src/misc/rings.cc

Line 283 in 0ceaec9

if (transports[idxToRank[i]*nranks+idxToRank[j]] == t)

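About the transports variable used in this line: it is built in #5 (comment), and it is a 2-D array transport[i][j] holding the smallest transport number through which ranks i and j are connected, i.e. the first match in the order p2p/shm/net.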

aokomoriuta · 2019-09-01T04:32:57Z

nccl/src/misc/rings.cc

Lines 281 to 288 in 0ceaec9

    
    for (int i=0; i<nidx; i++) {
      for (int j=0; j<nidx; j++) {
        if (transports[idxToRank[i]*nranks+idxToRank[j]] == t)
          subvalues[i*nidx+j] = values[idxToRank[i]*nranks+idxToRank[j]];
        else
          subvalues[i*nidx+j] = 0;
      }
    }

subvalues is simply values re-indexed from rank to idx, with every entry whose transport is not the one currently being processed set to 0.
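The same filtering can be written as a small standalone function. This is only a sketch: `buildSubvalues` is a made-up name, and `Tvalue` stands in for ncclTvalue_t, which is assumed here to be an unsigned 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Tvalue = std::uint64_t;  // assumption: stand-in for ncclTvalue_t

// transports and values are flat nranks x nranks matrices indexed by rank.
// The result is a flat nidx x nidx matrix indexed by local index.
std::vector<Tvalue> buildSubvalues(const std::vector<int>& transports,
                                   const std::vector<Tvalue>& values,
                                   const std::vector<int>& idxToRank,
                                   int nranks, int t) {
  assert(transports.size() == values.size());
  const int nidx = static_cast<int>(idxToRank.size());
  std::vector<Tvalue> subvalues(nidx * nidx, 0);
  for (int i = 0; i < nidx; i++) {
    for (int j = 0; j < nidx; j++) {
      const int src = idxToRank[i] * nranks + idxToRank[j];
      // Keep the value only when the pair is connected through transport t.
      if (transports[src] == t) subvalues[i * nidx + j] = values[src];
    }
  }
  return subvalues;
}
```

Pairs that are connected through a different transport end up as 0, so each transport's getRings only sees the links it is responsible for.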

aokomoriuta · 2019-09-01T04:40:28Z

For subprev/subnext, on the first pass the conditions in

nccl/src/misc/rings.cc

Lines 293 to 304 in 0ceaec9

    
           for (int r=0; r<nringsTmp; r++) { 
        
             int start = -1, end = -1; 
        
             for (int i=0; i<nranks; i++) { 
        
               if (rankToIdx[i] == -1) continue; 
        
               if (prevTmp[r*nranks+i] != -1) start = i; 
        
               if (nextTmp[r*nranks+i] != -1) end = i; 
        
             } 
        
             if (start != -1 && end != -1) { 
        
               subprev[r*nidx+rankToIdx[start]] = rankToIdx[end]; 
        
               subnext[r*nidx+rankToIdx[end]] = rankToIdx[start]; 
        
             } 
        
           }

all fail (prevTmp/nextTmp are still all -1), so they stay at -1.
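What the seeding loop does on later passes is easier to see in isolation. Below is a minimal sketch for one ring, copied in spirit from the quoted loop; `seedSubRing` is a made-up name and the input values in the note that follows are illustrative.

```cpp
#include <cassert>
#include <vector>

// Seeds subprev/subnext for ring r from the rings built by the previous
// (higher-level) transport. Ranks outside the local area have rankToIdx == -1.
void seedSubRing(const std::vector<int>& prevTmp, const std::vector<int>& nextTmp,
                 const std::vector<int>& rankToIdx, int r, int nranks, int nidx,
                 std::vector<int>& subprev, std::vector<int>& subnext) {
  assert(static_cast<int>(rankToIdx.size()) == nranks);
  int start = -1, end = -1;
  for (int i = 0; i < nranks; i++) {
    if (rankToIdx[i] == -1) continue;
    if (prevTmp[r * nranks + i] != -1) start = i;  // local rank that receives from outside
    if (nextTmp[r * nranks + i] != -1) end = i;    // local rank that sends to outside
  }
  if (start != -1 && end != -1) {
    // Close the local ring: the exit rank is treated as the predecessor of the entry rank.
    subprev[r * nidx + rankToIdx[start]] = rankToIdx[end];
    subnext[r * nidx + rankToIdx[end]] = rankToIdx[start];
  }
}
```

When prevTmp/nextTmp are all -1, nothing is written. With prevTmp = {3,-1,1,-1}, nextTmp = {-1,2,-1,0} and rankToIdx = {0,1,-1,-1}, the call sets subprev[0] = 1 and subnext[1] = 0, which leaves the lower transport to fill in the path from local index 0 to local index 1.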

aokomoriuta · 2019-09-01T04:42:31Z

Ring generation itself is delegated to the transport:

nccl/src/misc/rings.cc

Line 306 in 0ceaec9

    
           NCCLCHECK(ncclTransports[t].getRings(nidx, groups, subgroups, subvalues, &nringsTmp, subprev, subnext, minScore, &nThreads));

The implementation for each transport is listed below.

p2p

nccl/src/transport/p2p.cc

Line 379 in 9db4b1d

    
           ncclResult_t p2pGetRings(int nranks, int* groups, int* subgroups, ncclTvalue_t* values, int* nringsRet, int* prev, int* next, int minScore, int* nthreads) {

shm

nccl/src/transport/shm.cc

Line 65 in 9db4b1d

    
           ncclResult_t shmGetRings(int nranks, int* groups, int* subgroups, ncclTvalue_t* values, int* nringsRet, int* prev, int* next, int minScore, int* nthreads) {

net

nccl/src/transport/net.cc

Line 176 in 9db4b1d

    
           ncclResult_t netGetRings(int nranks, int* groups, int* subgroups, ncclTvalue_t* values, int* nringsRet, int* prev, int* next, int minScore, int* nthreads) {

aokomoriuta · 2019-09-01T05:15:14Z

While reading each getRings implementation, my definition of group started to fall apart... It does not seem to be just a stored identifier as described in #5 (comment).

aokomoriuta · 2019-09-01T05:30:15Z

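#5 (comment) was slightly wrong, so I corrected it. A group is therefore something assigned among the ranks that belong to the same higher-level transports as my own rank.

And, as loops over groups[gidx] suggest, it looks like a group number, and the value stored there is the identifier within the transport.

That would suggest that devices belonging to the same transport are assigned to separate groups.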

aokomoriuta · 2019-09-01T05:43:06Z

Or so I thought, but there is a variable called nranksInGroup

nccl/src/transport/shm.cc

Line 78 in 9db4b1d

nranksInGroup++;

and now I'm no longer sure.
As said above, devices within the same transport get different group numbers, so shouldn't the number of ranks (that is, devices) in a group always be 1?

aokomoriuta · 2019-09-01T06:00:50Z

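"Devices within the same transport get different group numbers"

That was wrong. With shm, ranks in the same P2P get the same group number, and with net, ranks belonging to the same shm are in the same group.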

aokomoriuta · 2019-09-01T06:04:52Z

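So, back to what getRings() actually does.

Following every detail would be a lot of work, so I'll set that aside and only work out what happens. Reading from the shm implementation (it looks like the simplest), the outputs appear to be just nringsTmp, subprev and subnext.

Inside, there are assignments like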

nccl/src/transport/net.cc

Lines 216 to 217 in 9db4b1d

    
           next[ring*nranks+ends[group]] = starts[nextGroup]; 
        
           prev[ring*nranks+starts[nextGroup]] = ends[group];

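where starts[group]/ends[group] are the first/last rank of that group.

In other words, for a rank that is the first of its group, subprev holds the last rank of the previous group, and for a rank that is the last of its group, subnext holds the first rank of the next group.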
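The linking step can be tried on its own. A minimal sketch, assuming starts/ends have already been chosen for every group (`linkGroups` is a made-up wrapper around the two quoted assignments):

```cpp
#include <cassert>
#include <vector>

// Connects the last rank of each group to the first rank of the next group,
// wrapping around so that the groups form one ring.
void linkGroups(const std::vector<int>& starts, const std::vector<int>& ends,
                int ring, int nranks, std::vector<int>& prev, std::vector<int>& next) {
  assert(starts.size() == ends.size());
  const int nGroups = static_cast<int>(starts.size());
  for (int group = 0; group < nGroups; group++) {
    const int nextGroup = (group + 1) % nGroups;
    next[ring * nranks + ends[group]] = starts[nextGroup];
    prev[ring * nranks + starts[nextGroup]] = ends[group];
  }
}
```

For two groups with starts = {0, 2} and ends = {1, 3} on four ranks, this yields prev = {3, -1, 1, -1} and next = {-1, 2, -1, 0}: only the group boundaries are filled in, and the entries inside each group stay at -1.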

aokomoriuta · 2019-09-01T06:22:45Z

nccl/src/misc/rings.cc

Lines 319 to 322 in 0ceaec9

    
           //for (int r=0; r<nringsTmp; r++) { 
        
           //printf("[%d] [%d] [%d] [%d] Prev ", rank, minScore, t, r); for (int i=0; i<nranks; i++) printf("%d ", prevTmp[r*nranks+i]); printf("\n"); 
        
           //printf("[%d] [%d] [%d] [%d] Next ", rank, minScore, t, r); for (int i=0; i<nranks; i++) printf("%d ", nextTmp[r*nranks+i]); printf("\n"); 
        
           //}

For the output of this, see #9 (comment). I'll try to reverse-engineer the behavior from it.

aokomoriuta · 2019-09-01T06:38:46Z

The order looks somewhat off, so I'll capture the output again myself.

aokomoriuta · 2019-09-01T06:40:41Z

Actually, it was fine.

[0] [3] [2] [0] Prev 7 -1 1 -1 3 -1 5 -1 
[0] [3] [2] [0] Next -1 2 -1 4 -1 6 -1 0 
[0] [3] [0] [0] Prev 7 0 1 -1 3 -1 5 -1 
[0] [3] [0] [0] Next 1 2 -1 4 -1 6 -1 0

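Looking at just these lines seems to be enough. It only looked strange because everything is connected via P2P and there is no shm.

(Also, the r in the printf is the ring, not the rank.)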

aokomoriuta · 2019-09-01T07:12:21Z

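With a single node or a single GPU it is simple: only t=0 and t=2 (P2P, net) are involved, each goes around once in order, and that is all.

Multiple nodes with multiple GPUs is a bit trickier. Here is an example with 2 nodes and 2 GPUs per node: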

[0] [3] [2] [0] Prev 3 -1 1 -1 
[0] [3] [2] [0] Next -1 2 -1 0 
[0] [3] [0] [0] Prev 3 0 1 -1 
[0] [3] [0] [0] Next 1 2 -1 0

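With only net (i.e. t=2) processed, the state is just

0<-3, 2<-1
1->2, 3->0

In other words, 0&1 and 2&3 each form one group, and the first/last rank of each group is connected to the last/first rank of the other. As expected.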

Once P2P (i.e. t=0) is included as well, this becomes

0<-3, 1<-0, 2<-1
0->1, 1->2, 3->0

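so the connections inside the node are now filled in as well. 2->3 is missing because this rank does not know the state of the other node. Conversely, on rank=2 the output is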

[2] [3] [0] [0] Prev 3 -1 1 2 
[2] [3] [0] [0] Next -1 2 3 0

so rank=2 does not know about the node that rank=0 is on, but 2->3 does appear.
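Reading these dumps by eye gets tedious, so here is a small helper sketch that turns one ring's Prev/Next line into the arrow notation used above. `edgesFromDump` is a made-up name, and the inputs are the numbers after "Prev"/"Next" in a dump line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Converts one ring's Prev/Next arrays into "a<-b" and "a->b" strings.
// Entries equal to -1 mean "unknown on this rank" and are skipped.
std::vector<std::string> edgesFromDump(const std::vector<int>& prev,
                                       const std::vector<int>& next) {
  assert(prev.size() == next.size());
  std::vector<std::string> edges;
  for (std::size_t i = 0; i < prev.size(); i++)
    if (prev[i] != -1) edges.push_back(std::to_string(i) + "<-" + std::to_string(prev[i]));
  for (std::size_t i = 0; i < next.size(); i++)
    if (next[i] != -1) edges.push_back(std::to_string(i) + "->" + std::to_string(next[i]));
  return edges;
}
```

Feeding it the rank 0 dump for t=0 (Prev 3 0 1 -1, Next 1 2 -1 0) gives 0<-3, 1<-0, 2<-1 and 0->1, 1->2, 3->0, matching the lists written out by hand.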

aokomoriuta · 2019-09-01T07:29:41Z

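In the example runs minScore stays at NCCL_MAX_SCORE, i.e. the do-while loop runs only once. Can it ever run more than once?

That happens when nringsTmp ends up as 0, which only occurs when getRings reports 0 rings.

So it looks like I really do have to read getRings #5 (comment) properly.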

aokomoriuta · 2019-09-01T07:41:16Z

getRings is called in the order net, shm, p2p, and the previous prevTmp/nextTmp are carried over between calls, so starting with net looks like the easiest way in.

nccl/src/transport/net.cc

Lines 183 to 219 in 9db4b1d

    
           for (int ring = 0; ring<*nringsRet; ring++) { 
        
             for (int group = 0; group<nGroups; group++) { 
        
               int nranksInGroup = 0; 
        
               int nsubGroups = 0; 
        
               for (int rank=0; rank<nranks; rank++) 
        
                 if (groups[rank] == group) { 
        
                   nranksInGroup++; 
        
                   nsubGroups = std::max(subgroups[rank], nsubGroups); 
        
                 } 
        
               starts[group] = ends[group] = -1; 
        
               // Receive on the rank closest to the NIC 
        
               for (int card=0; card<NET_MAX_IFS; card++) { 
        
                 if (cardUsed[group*NET_MAX_IFS+card] == 1) continue; 
        
                 int start = groupBestStart(nranks, groups, group, values, card, minScore); 
        
                 // Send from any rank, but best on a different subgroup and close to the NIC also. 
        
                 int end = (nranksInGroup == 1) ? start 
        
                     : groupBestEnd(nranks, groups, group, subgroups, nsubGroups ? subgroups[start] : -1, start, values, card, minScore); 
        
                 //printf("Ring %d, Minscore %d, Card %d, group %d, start = %d, end = %d\n", ring, minScore, card, group, start, end); 
        
                 if (start != -1 && end != -1) { 
        
                   cardUsed[group*NET_MAX_IFS+card] = 1; 
        
                   starts[group] = start; 
        
                   ends[group] = end; 
        
                   break; 
        
                 } 
        
               } 
        
               if (starts[group] == -1 || ends[group] == -1) { 
        
                 *nringsRet = ring; 
        
                 goto done; 
        
               } 
        
             } 
        
             // Link groups together 
        
             for (int group = 0; group<nGroups; group++) { 
        
               int nextGroup = (group+1)%nGroups; 
        
               next[ring*nranks+ends[group]] = starts[nextGroup]; 
        
               prev[ring*nranks+starts[nextGroup]] = ends[group]; 
        
             } 
        
           }

aokomoriuta · 2019-09-01T07:48:14Z

For net, i.e. when t=2,

nccl/src/misc/rings.cc

Line 262 in 0ceaec9

    
           for (int tr = NTRANSPORTS-1; tr > t; tr--) if (coords[r*NTRANSPORTS+tr] != coords[rank*NTRANSPORTS+tr]) sameLocal = 0;

never iterates, so

nccl/src/misc/rings.cc

Line 265 in 0ceaec9

groups[nidx] = coords[r*NTRANSPORTS+t];

is always executed, and since all ranks should be on the same network, every value should be 0 (cf. #5 (comment)).

So nGroups, computed as

nccl/src/transport/net.cc

Line 177 in 9db4b1d

int nGroups = groups[nranks-1] + 1;

becomes 1, which would make the loop over group

nccl/src/transport/net.cc

Line 184 in 9db4b1d

for (int group = 0; group<nGroups; group++) {

pointless, I think. So why does this loop exist? (Are there actually ranks on a different net?)

aokomoriuta · 2019-09-01T07:59:42Z

To check whether coords[net] can ever be anything other than 0, I'll read

nccl/src/transport/net.cc

Lines 116 to 133 in 9db4b1d

    
           ncclResult_t netCanConnect(ncclTvalue_t* ret, struct ncclPeerInfo* myInfo, struct ncclPeerInfo* peerInfo) { 
        
             int cudaDev; 
        
             CUDACHECK(cudaGetDevice(&cudaDev)); 
        
             ret[0] = ncclNetTvalues[cudaDev]; 
        
             if (ret[0] == NET_TVALUE_UNKNOWN) { 
        
               if (cudaDev >= NET_MAX_GPUS) { 
        
                 WARN("CUDA device %d >= MAX %d\n", cudaDev, NET_MAX_GPUS); 
        
                 return ncclInternalError; 
        
               } 
        
               int nDev; 
        
               short* distances; 
        
               NCCLCHECK(netDevices(&nDev, &distances)); 
        
               ncclNetTvalues[cudaDev] = ret[0] = getTvalue(distances, nDev); 
        
               ncclNetNDev = nDev; 
        
               free(distances); 
        
             } 
        
             return ncclSuccess; 
        
           }

carefully (because if canConnect fails, then

nccl/src/misc/rings.cc

Line 169 in 0ceaec9

current[transport]++;

makes the number 1).

The actual connectivity check is

nccl/src/init.cc

Line 477 in 8e04d80

if (connectValue[r] > 0) {

which tests whether the value written through canConnect's first argument is greater than 0.

nccl/src/transport/net.cc

Line 119 in 9db4b1d

ret[0] = ncclNetTvalues[cudaDev];

ncclNetTvalues here is just a cache of the computed result, so it can be ignored; the real work is

nccl/src/transport/net.cc

Lines 127 to 128 in 9db4b1d

    
           NCCLCHECK(netDevices(&nDev, &distances)); 
        
           ncclNetTvalues[cudaDev] = ret[0] = getTvalue(distances, nDev);

and the lines around it.

Looking at the body of getTvalue

nccl/src/transport/net.cc

Lines 28 to 36 in 9db4b1d

    
           static ncclTvalue_t getTvalue(short* distances, int ndev) { 
        
             ncclTvalue_t tvalue = 0; 
        
             for (int d=0; d<ndev; d++) { 
        
               int score = 1 + PATH_SYS - distances[d]; 
        
               // Keep 3 bits of score info per dev 
        
               tvalue |= ((score & NET_BITS_PER_IF_MASK)<<(NET_BITS_PER_IF*d)); 
        
             } 
        
             return tvalue; 
        
           }

it seems that distances carries fairly complex information, so netDevices

nccl/src/transport/net.cc

Lines 90 to 113 in 9db4b1d

    
           static ncclResult_t netDevices(int* ndev, short** distances) { 
        
             NCCLCHECK(ncclNetDevices(ndev)); 
        
             if (*ndev == 0) { 
        
               WARN("Error : Network returned 0 device"); 
        
               return ncclSystemError; 
        
             } 
        
             if (*ndev > NET_MAX_IFS) *ndev = NET_MAX_IFS; 
        
             *distances = (short*)malloc(*ndev*sizeof(short)); 
        
             if (*distances == NULL) return ncclSystemError; 
        
             // Find distance with current GPU 
        
             int cudaDev, nvmlDev; 
        
             CUDACHECK(cudaGetDevice(&cudaDev)); 
        
             NCCLCHECK(getNvmlDevice(cudaDev, &nvmlDev)) 
        
             char line[1024]; 
        
             sprintf(line, "CUDA Dev %d[%d], %s NIC distance : ", cudaDev, nvmlDev, ncclNetName()); 
        
             for (int d=0; d<*ndev; d++) { 
        
               NCCLCHECK(netDistance(cudaDev, d, *distances+d)); 
        
               sprintf(line+strlen(line), " %s", pathDists[(*distances)[d]]); 
        
             } 
        
             INFO(NCCL_INIT|NCCL_NET, "%s", line); 
        
             return ncclSuccess; 
        
           }

needs to be decoded as well.

aokomoriuta · 2019-09-01T08:05:22Z

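That reminded me that someone might have covered this last time, and indeed it was in rmatsumiya's notes.

It only determines whether the GPU is connected to the network
With socket communication it does quite a lot

I'll dig a bit deeper.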

aokomoriuta · 2019-09-01T08:37:35Z

nccl/src/transport/net.cc

Lines 105 to 111 in 9db4b1d

    
           char line[1024]; 
        
           sprintf(line, "CUDA Dev %d[%d], %s NIC distance : ", cudaDev, nvmlDev, ncclNetName()); 
        
           for (int d=0; d<*ndev; d++) { 
        
             NCCLCHECK(netDistance(cudaDev, d, *distances+d)); 
        
             sprintf(line+strlen(line), " %s", pathDists[(*distances)[d]]); 
        
           } 
        
           INFO(NCCL_INIT|NCCL_NET, "%s", line);

There are things like line and sprintf in here, but they are only for the debug log and can be ignored. The real work is in

nccl/src/transport/net.cc

Lines 78 to 88 in 9db4b1d

    
           static ncclResult_t netDistance(int cudaDev, int dev, short* distance) { 
        
             char* cudaPath = NULL; 
        
             char* nicPath = NULL; 
        
             ncclResult_t err; 
        
             NCCLCHECK(getCudaPath(cudaDev, &cudaPath)); 
        
             err = ncclNetPciPath(dev, &nicPath); 
        
             *distance = (err != ncclSuccess || nicPath == NULL || cudaPath == NULL) ? PATH_SYS : pciDistance(nicPath, cudaPath); 
        
             if (nicPath) free(nicPath); 
        
             if (cudaPath) free(cudaPath); 
        
             return ncclSuccess; 
        
           }

The distance is decided by

nccl/src/transport/net.cc

Line 84 in 9db4b1d

    
           *distance = (err != ncclSuccess || nicPath == NULL || cudaPath == NULL) ? PATH_SYS : pciDistance(nicPath, cudaPath);

First, if either getCudaPath or ncclNetPciPath yields NULL (or ncclNetPciPath returns an error), the result is PATH_SYS, that is

nccl/src/include/topo.h

Line 37 in 9db4b1d

PATH_SYS = 4,

so the distance is set to 4.

Of these, getCudaPath already returns an error whenever it would yield NULL

nccl/src/misc/topo.cc

Line 21 in 0ceaec9

if (*path == NULL) {

so the NULL check on cudaPath is pointless (in which case I wish it had not been left in).

nccl/src/include/net.h

Line 19 in 8e04d80

    
           static ncclResult_t ncclNetPciPath(int dev, char** path) { NCCLCHECK(ncclNet->pciPath(dev, path)); return ncclSuccess; }

ncclNetPciPath is yet another function pointer, which is confusing... ncclNet can only be IB or Socket

nccl/src/init.cc

Lines 131 to 137 in ccb1298

    
           if (ncclNet != NULL) return ncclSuccess; 
        
           if (initNet(&ncclNetIb) == ncclSuccess) { 
        
             ncclNet = &ncclNetIb; 
        
           } else { 
        
             NCCLCHECK(initNet(&ncclNetSocket)); 
        
             ncclNet = &ncclNetSocket; 
        
           }

so, in other words, it is one of

nccl/src/transport/net_socket.cc

Lines 62 to 71 in 8e04d80

    
           ncclResult_t ncclSocketPciPath(int dev, char** path) { 
        
             char devicepath[PATH_MAX]; 
        
             snprintf(devicepath, PATH_MAX, "/sys/class/net/%s/device", ncclNetIfNames+dev*MAX_IF_NAME_SIZE); 
        
             *path = realpath(devicepath, NULL); 
        
             if (*path == NULL) { 
        
               INFO(NCCL_NET|NCCL_INIT, "Could not find real path of %s", devicepath); 
        
               return ncclSystemError; 
        
             } 
        
             return ncclSuccess; 
        
           }

nccl/src/transport/net_ib.cc

Lines 180 to 189 in 8e04d80

    
           ncclResult_t ncclIbPciPath(int dev, char** path) { 
        
             char devicepath[PATH_MAX]; 
        
             snprintf(devicepath, PATH_MAX, "/sys/class/infiniband/%s/device", ncclIbDevs[dev].devName); 
        
             *path = realpath(devicepath, NULL); 
        
             if (*path == NULL) { 
        
               WARN("Could not find real path of %s", devicepath); 
        
               return ncclSystemError; 
        
             } 
        
             return ncclSuccess; 
        
           }

These return NULL when the Ethernet/IB network card specified via NCCL_SOCKET_IFNAME does not exist in the first place. In that case nothing is working at all.
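The sysfs lookup both functions rely on can be reproduced outside NCCL. Below is a minimal sketch of the socket variant, assuming a Linux/POSIX system; `resolveNicPciPath` is a made-up name and it returns an empty string instead of an error code.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <stdlib.h>  // realpath (POSIX)

// Resolves /sys/class/net/<ifName>/device to the real PCI device path.
// Returns an empty string when the interface (or its device link) does not exist.
std::string resolveNicPciPath(const std::string& ifName) {
  assert(!ifName.empty());
  const std::string devicePath = "/sys/class/net/" + ifName + "/device";
  // With a null second argument, realpath allocates the result with malloc.
  char* resolved = realpath(devicePath.c_str(), nullptr);
  if (resolved == nullptr) return std::string();
  std::string result(resolved);
  std::free(resolved);
  return result;
}
```

Calling it with an interface name that does not exist returns an empty string, which corresponds to the NULL case discussed above. Virtual interfaces without a device link would presumably fail the same way, though that is not checked here.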

aokomoriuta · 2019-09-01T08:49:21Z

As for pciDistance,

nccl/src/misc/topo.cc

Lines 45 to 56 in 0ceaec9

    
               return PATH_NODE; 
        
           #else 
        
               /* Split the former PATH_SOC distance into PATH_NODE and PATH_SYS based on numaId */ 
        
               int numaId1 = getNumaId(path1); 
        
               int numaId2 = getNumaId(path2); 
        
               TRACE(NCCL_INIT, "depth %d score %d path1 %s numaId %d path2 %s numaId %d", depth, score, path1, numaId1, path2, numaId2); 
        
               return ((numaId1 == numaId2) ? PATH_NODE : PATH_SYS); 
        
           #endif 
        
             } 
        
             if (score == 4) return PATH_PHB; 
        
             if (score == depth-1) return PATH_PIX; 
        
             return PATH_PXB;

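Whatever the details, it is guaranteed to return a value from 0 to 4.

So, barring anything unusual, distance always holds a number from 0 to 4. Not complicated after all.

Now, the score calculation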

nccl/src/transport/net.cc

Lines 28 to 36 in 9db4b1d

    
           static ncclTvalue_t getTvalue(short* distances, int ndev) { 
        
             ncclTvalue_t tvalue = 0; 
        
             for (int d=0; d<ndev; d++) { 
        
               int score = 1 + PATH_SYS - distances[d]; 
        
               // Keep 3 bits of score info per dev 
        
               tvalue |= ((score & NET_BITS_PER_IF_MASK)<<(NET_BITS_PER_IF*d)); 
        
             } 
        
             return tvalue; 
        
           }

Going back to this, the score is 1 + PATH_SYS - distances[d], so it ranges from 1 to 5. NET_BITS_PER_IF_MASK is (1<<3)-1 = 0b111, i.e. 3 bits, which is enough to represent 1 to 5, so the result can never be zero.
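The packing is easy to check by hand. A minimal sketch, assuming PATH_SYS = 4 and 3 bits per interface as quoted above; `packScores` mirrors getTvalue, `scoreOfCard` is a made-up helper for reading a field back, and std::uint64_t stands in for ncclTvalue_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int PATH_SYS = 4;         // from topo.h as quoted in these notes
constexpr int NET_BITS_PER_IF = 3;  // "Keep 3 bits of score info per dev"
constexpr std::uint64_t NET_BITS_PER_IF_MASK = (1u << NET_BITS_PER_IF) - 1;  // 0b111

// Packs one 3-bit score per network device; device d occupies bits [3d, 3d+2].
std::uint64_t packScores(const std::vector<short>& distances) {
  std::uint64_t tvalue = 0;
  for (std::size_t d = 0; d < distances.size(); d++) {
    assert(distances[d] >= 0 && distances[d] <= PATH_SYS);
    const int score = 1 + PATH_SYS - distances[d];  // 1 (farthest) to 5 (closest)
    tvalue |= (static_cast<std::uint64_t>(score) & NET_BITS_PER_IF_MASK)
              << (NET_BITS_PER_IF * d);
  }
  return tvalue;
}

// Hypothetical inverse: extracts the score stored for one device.
int scoreOfCard(std::uint64_t tvalue, int card) {
  return static_cast<int>((tvalue >> (NET_BITS_PER_IF * card)) & NET_BITS_PER_IF_MASK);
}
```

For distances {0, 4, 2} the scores are 5, 1 and 3, so the packed value is 5 | (1<<3) | (3<<6) = 205. Even a single device at distance PATH_SYS still gives 1, so the value is nonzero as long as there is at least one device.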

The conclusion is that netCanConnect never evaluates to false no matter what, so coords[net] in #5 (comment) can never be anything other than 0.

So, finally coming all the way back,

nccl/src/transport/net.cc

Line 184 in 9db4b1d

for (int group = 0; group<nGroups; group++) {

the group here is always 0, and nothing else needs to be considered.

aokomoriuta · 2019-09-01T09:03:47Z

netGetRings() returns nRings=0 when either starts or ends is -1, i.e. when no candidate was found

nccl/src/transport/net.cc

Lines 208 to 209 in 9db4b1d

    
           if (starts[group] == -1 || ends[group] == -1) { 
        
             *nringsRet = ring;

starts/ends are set around

nccl/src/transport/net.cc

Line 144 in 9db4b1d

if (score >= minScore && score > bestScore) {
nccl/src/transport/net.cc

Line 165 in 9db4b1d

if (score >= minScore) {

and, as noted above, the score is always 4 or 5 for every interface, so one of these is always true.

Hence nringsTmp=0 cannot happen, at least for net (transport=2).
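The body of groupBestStart is not quoted in these notes, so the following is only a guess at the shape of the selection, built around the quoted condition `score >= minScore && score > bestScore`. It assumes one packed 3-bit-per-interface value per rank (the real code indexes a rank-by-rank matrix), and `pickBestStart` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int NET_BITS_PER_IF = 3;
constexpr std::uint64_t NET_BITS_PER_IF_MASK = (1u << NET_BITS_PER_IF) - 1;

// Returns the rank in `group` with the highest score for `card`,
// or -1 when no rank in the group reaches minScore.
int pickBestStart(const std::vector<int>& groups, int group,
                  const std::vector<std::uint64_t>& tvalues, int card, int minScore) {
  assert(groups.size() == tvalues.size());
  int bestRank = -1;
  int bestScore = 0;
  for (std::size_t rank = 0; rank < groups.size(); rank++) {
    if (groups[rank] != group) continue;
    const int score = static_cast<int>(
        (tvalues[rank] >> (NET_BITS_PER_IF * card)) & NET_BITS_PER_IF_MASK);
    if (score >= minScore && score > bestScore) {
      bestScore = score;
      bestRank = static_cast<int>(rank);
    }
  }
  return bestRank;
}
```

Under this reading, a -1 result only happens when every rank in the group scores below minScore for every unused card, which is the case the surrounding do-while would handle by retrying with a lower minScore.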

aokomoriuta · 2019-09-01T09:06:45Z

That's it for today. Next time:

The shm/p2p part of [ao] 読み進めメモ #5 (comment)

The second half of ncclGetRings

nccl/src/misc/rings.cc

Lines 351 to 390 in 0ceaec9

    
           /* Duplicate the rings in case of multinode+NVLink */ 
        
           int nnodes = 0; 
        
           for (int r=0; r<nranks; r++) nnodes += treeIn[r]; 
        
           int nvlink; 
        
           NCCLCHECK(ncclNvlinkGpu(&nvlink)); 
        
           if (nnodes > 1 && nvlink) { 
        
             *nrings = copyRings(*nrings, *nrings*2, nranks, prev, next, treeIn, treeOut); 
        
           } 
        
           if (*nrings == 0) { 
        
             WARN("Could not create rings, falling back on simple ring"); 
        
             *nrings = 1; 
        
             prev[rank] = (rank-1+nranks) % nranks; 
        
             next[rank] = (rank+1)%nranks; 
        
           } 
        
           int maxNrings = ncclParamMaxNrings(); 
        
           int minNrings = ncclParamMinNrings(); 
        
           if (maxNrings > 0 && minNrings > maxNrings) { 
        
             if (rank == 0) WARN("NCCL_MIN_NRINGS set to a value greater than NCCL_MAX_NRINGS, ignoring NCCL_MIN_NRINGS"); 
        
             minNrings = 0; 
        
           } 
        
           if (minNrings > MAXCHANNELS) { 
        
             if (rank == 0) WARN("NCCL_MIN_NRINGS set to a value greater than the maximum number of rings supported (%d), limiting it to %d", MAXCHANNELS, MAXCHANNELS); 
        
             minNrings = MAXCHANNELS; 
        
           } 
        
           if (maxNrings > 0 && maxNrings <= *nrings) { 
        
             if (rank == 0) INFO(NCCL_INIT,"Limiting to %d rings per user request.", maxNrings); 
        
             *nrings = maxNrings; 
        
           } else { 
        
             int defaultMinNrings = ncclCudaCompCap() == 3 ? 2 : 1; 
        
             if (minNrings < defaultMinNrings) minNrings = defaultMinNrings; 
        
             if (minNrings > 0 && minNrings > *nrings) { 
        
               if (rank == 0 && minNrings > defaultMinNrings) INFO(NCCL_INIT,"Duplicating rings to %d per user request.", minNrings); 
        
               *nrings = copyRings(*nrings, minNrings, nranks, prev, next, treeIn, treeOut); 
        
             } 
        
           } 
        
           NCCLCHECK(getEnvThreads(nthreads)); 
        
           return ncclSuccess;

aokomoriuta self-assigned this Aug 3, 2019
