$ git clone https://github.com/RC4ML/Legion.git
Table 1 lists the platforms. All of them are bare-metal machines.
Platform | CPU-Info | #sockets | #NUMA nodes | CPU Memory | PCIe | GPUs | NVLinks |
---|---|---|---|---|---|---|---|
DGX-V100 | 96*Intel(R) Xeon(R) Platinum 8163 CPU @ 2.5GHz | 2 | 1 | 384GB | PCIe 3.0x16, 4*PCIe switches, each connecting 2 GPUs | 8x16GB-V100 | NVLink Bridges, Kc = 2, Kg = 4 |
Siton | 104*Intel(R) Xeon(R) Gold 5320 CPU @ 2.2GHz | 2 | 2 | 1TB | PCIe 4.0x16, 2*PCIe switches, each connecting 4 GPUs | 8x40GB-A100 | NVLink Bridges, Kc = 4, Kg = 2 |
DGX-A100 | 128*Intel(R) Xeon(R) Platinum 8369B CPU @ 2.9GHz | 2 | 1 | 1TB | PCIe 4.0x16, 4*PCIe switches, each connecting 2 GPUs | 8x80GB-A100 | NVSwitch, Kc = 1, Kg = 8 |
Kc is the number of groups within which GPUs are connected to each other over NVLink, and Kg is the number of GPUs in each group.
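To make the Kc and Kg columns concrete, here is a minimal Python sketch that splits GPU IDs into Kc groups of Kg GPUs each. The function name `nvlink_groups` is hypothetical, and the assumption that each group holds consecutive GPU IDs is for illustration only; the real assignment depends on how the machine's NVLinks are wired.

```python
def nvlink_groups(kc: int, kg: int) -> list[list[int]]:
    """Split kc * kg GPU IDs into kc groups of kg GPUs each.

    Assumption (not taken from Legion): GPUs in a group have consecutive IDs.
    """
    if kc <= 0 or kg <= 0:
        raise ValueError("kc and kg must be positive")
    return [list(range(g * kg, (g + 1) * kg)) for g in range(kc)]


# (Kc, Kg) values copied from Table 1; every platform has 8 GPUs.
platforms = {"DGX-V100": (2, 4), "Siton": (4, 2), "DGX-A100": (1, 8)}

for name, (kc, kg) in platforms.items():
    print(name, nvlink_groups(kc, kg))
```

For every platform in Table 1, Kc * Kg equals the total GPU count of 8.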
Legion's software is lightweight and portable. The following environment has been tested.
- Nvidia Driver Version: 515.43.04
- CUDA 11.7
- GCC/G++ 11.4.0
- OS: Ubuntu (other Linux systems are OK)
- Intel PCM (according to OS version)
$ wget https://download.opensuse.org/repositories/home:/opcm/xUbuntu_18.04/amd64/pcm_0-0+651.1_amd64.deb
- pytorch-cu117, torchmetrics
$ pip3 install torch-cu1xx
- dgl 1.1.0
$ pip3 install dgl -f https://data.dgl.ai/wheels/cu1xx/repo.html
- MPI-3.1
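Before building, it can help to confirm that the command-line tools behind the list above are on your PATH. The following is a minimal sketch using only the Python standard library. The executable names (`nvidia-smi`, `nvcc`, `gcc`, `g++`, `mpirun`) are assumptions about a typical installation, not names that Legion itself looks up, and the helper `check_tools` is hypothetical.

```python
import shutil


def check_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Assumed executable names for the driver, CUDA, GCC/G++, and MPI.
for tool, path in check_tools(["nvidia-smi", "nvcc", "gcc", "g++", "mpirun"]).items():
    print(f"{tool:12s} {path or 'NOT FOUND'}")
```

This only checks that the tools can be found. It does not compare their versions against the ones listed above.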
Datasets are from OGB (https://ogb.stanford.edu/), Stanford SNAP (https://snap.stanford.edu/), and WebGraph (https://webgraph.di.unimi.it/). Here is an example of preparing datasets for Legion.
Refer to the README in the dataset directory for more instructions.
$ bash prepare_datasets.sh
gpu_num is the number of GPUs you want to use. Legion partitions the graph according to the underlying NVLink topology. Note that this step consumes a large amount of CPU memory.
$ python graph_partitioning.py --dataset_name 'ukunion' --gpu_num 2
$ bash build.sh
There are three steps to train a GNN model in Legion. These steps must be run as the root user because of PCM. (Update 2024.3.11: to work around PCM bugs on general platforms, I have disabled PCM for now.)
$ modprobe msr
$ python legion_server.py --dataset_path 'dataset' --dataset_name ukunion --train_batch_size 8000 --fanout [25,10] --gpu_number 2 --epoch 2 --cache_memory 38000000
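As a rough guide to what --train_batch_size and --fanout imply, the sketch below computes an upper bound on the number of nodes touched per mini-batch under layer-wise neighbor sampling. It assumes that each fanout entry is the number of neighbors sampled per node at one hop, and it ignores deduplication, so real batches are smaller. The helper name `max_sampled_nodes` is hypothetical and this is not Legion's sampler.

```python
def max_sampled_nodes(batch_size: int, fanout: list[int]) -> int:
    """Upper bound on nodes per mini-batch: the seeds plus every sampled hop."""
    total = frontier = batch_size
    for k in fanout:
        frontier *= k  # each frontier node samples at most k neighbors
        total += frontier
    return total


# Values from the legion_server.py command above.
print(max_sampled_nodes(8000, [25, 10]))  # 8000 * (1 + 25 + 250) = 2208000
```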
After Legion outputs "System is ready for serving", start training with:
$ python training_backend/legion_graphsage.py --class_num 2 --features_num 128 --hidden_dim 256 --hops_num 2 --gpu_number 2 --epoch 2
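The two commands above can also be chained from one script. The following is a minimal sketch that starts the server, waits for the "System is ready for serving" line, and then launches training. It is an illustration, not part of Legion. It assumes the readiness message is written to the server's stdout or stderr and that the server can be terminated once training finishes. The helper names `wait_for_marker` and `run` are hypothetical.

```python
import subprocess
import sys
import threading


def wait_for_marker(lines, marker="System is ready for serving"):
    """Consume lines until one contains marker; return True if it was seen."""
    for line in lines:
        if marker in line:
            return True
    return False


def _drain(stream):
    """Keep forwarding server output so its pipe never fills up and blocks it."""
    for line in stream:
        sys.stdout.write(line)


def run(server_cmd, train_cmd):
    """Start the server, wait for readiness, then run training to completion."""
    server = subprocess.Popen(server_cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True)
    try:
        if not wait_for_marker(server.stdout):
            sys.exit("server exited before becoming ready")
        threading.Thread(target=_drain, args=(server.stdout,), daemon=True).start()
        return subprocess.run(train_cmd).returncode
    finally:
        # Assumption: the server is no longer needed once training has finished.
        server.terminate()
        server.wait()


# Example call, with arguments copied from the commands above:
# run(["python", "legion_server.py", "--dataset_path", "dataset",
#      "--dataset_name", "ukunion", "--train_batch_size", "8000",
#      "--fanout", "[25,10]", "--gpu_number", "2", "--epoch", "2",
#      "--cache_memory", "38000000"],
#     ["python", "training_backend/legion_graphsage.py", "--class_num", "2",
#      "--features_num", "128", "--hidden_dim", "256", "--hops_num", "2",
#      "--gpu_number", "2", "--epoch", "2"])
```

Python buffers stdout when it is attached to a pipe, so the readiness line may arrive late. Starting the server with `python -u` is one way to avoid that.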
I will continuously work on this to make the running process easier to use.
If you use Legion in your paper, please cite our work:
@inproceedings {sun2023legion,
author = {Jie Sun and Li Su and Zuocheng Shi and Wenting Shen and Zeke Wang and Lei Wang and Jie Zhang and Yong Li and Wenyuan Yu and Jingren Zhou and Fei Wu},
title = {Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
pages = {165--179}
}
We will open-source SSD support for Legion in the future.