
StellaTrain (SIGCOMM 2024)

This repository contains the public version of the StellaTrain implementation, accompanying the SIGCOMM '24 paper "Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs".

Project Structure

  • backend/src/optim: CPU-based sparse optimizers for SGD and Adam → 3.1 CPU-based Sparse Optimizer

  • backend/src/compress: Gradient sparsification methods (e.g., Top-k, threshold-v, and cache-aware thresholdv16; see the sketch after this list) → 3.2 CPU-based Gradient Sparsification

  • backend/src/engine/core.cpp and related code (backend/src/engine/comm_manager.cpp, core_module_api.cpp, shm_manager.cpp, task.cpp, threadpool.cpp): Main process and scheduling of StellaTrain → 3.3 Efficient Pipeline Management

  • backend/src/telemetry_*.cpp: Asynchronous data updates for optimization → 4.1 The telemetry server

  • backend/src/engine/batch_rate_alloc*.py: Adaptive optimization for variable bandwidth → 4.2, 4.3 The centralized controller

  • bayesian/profile: Offline Bayesian optimization
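The sparse optimizer (3.1) and gradient sparsification (3.2) are implemented in the C++ backend listed above. As a rough illustration of the idea only (not the repository's implementation), the following PyTorch sketch selects the top-k gradient entries by magnitude and applies an SGD update only at those indices on the CPU:

import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    # Keep only the k largest-magnitude gradient entries (Top-k sparsification).
    flat = grad.reshape(-1)
    _, idx = torch.topk(flat.abs(), k)
    return idx, flat[idx]

def sparse_sgd_step(param: torch.Tensor, idx: torch.Tensor, vals: torch.Tensor, lr: float = 0.1):
    # Apply a plain SGD update only at the selected indices, on the CPU.
    param.reshape(-1).index_add_(0, idx, -lr * vals)

# Toy usage on random data.
p, g = torch.randn(1024), torch.randn(1024)
idx, vals = topk_sparsify(g, k=64)
sparse_sgd_step(p, idx, vals)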

How to run experiments

Setup

Hardware requirements for running the experiments are as follows:

  • >= 2 nodes connected via a network
  • >= 2 NVIDIA GPUs per node (each with at least 16 GB of memory)

We recommend using AWS EC2 instances for the experiments. Launch two instances with the following configurations:

AWS Launch Configuration

Run container image

Execute the commands below to run the Docker image. They pull the image from ghcr.io/kaist-ina/stellatrain:main and launch bash inside the Docker container.

REPO=ghcr.io/kaist-ina/stellatrain:main
docker pull $REPO
docker run -it --rm --gpus all --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 $REPO
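Once inside the container, you can optionally confirm that all GPUs are visible (a generic sanity check, not a StellaTrain-specific step):

nvidia-smi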

Run test script

You can run the script below to test distributed training without a dataset. Note that the loss may diverge if the dataset is not downloaded.

On server 1 (the master server), run:

test_script.sh --master-ip-address <master server public IP address> --my-ip-address <server 1 public IP address> --world-size 2 --num-gpus 2 --rank 0

On server 2, run:

test_script.sh --master-ip-address <master server public IP address> --my-ip-address <server 2 public IP address> --world-size 2 --num-gpus 2 --rank 1
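For illustration only, with hypothetical public addresses 203.0.113.10 (server 1, the master) and 203.0.113.11 (server 2), the two invocations would look as follows. Both servers pass the same master IP address and differ only in --my-ip-address and --rank:

test_script.sh --master-ip-address 203.0.113.10 --my-ip-address 203.0.113.10 --world-size 2 --num-gpus 2 --rank 0
test_script.sh --master-ip-address 203.0.113.10 --my-ip-address 203.0.113.11 --world-size 2 --num-gpus 2 --rank 1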

Run test with ImageNet

If you wish to train with a real dataset, download the ImageNet dataset and follow the instructions below.

  • Download the ImageNet dataset from the ImageNet website.

  • Run Docker on each server with the following commands:

    REPO=ghcr.io/kaist-ina/stellatrain:main
    DATASET_PATH="{your dataset path}"
    docker pull $REPO
    docker run -it --rm --gpus all --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 -v "$DATASET_PATH":/datasets $REPO
  • On server 1 (the master server), run:

    test_script_imagenet.sh --master-ip-address <master server public IP address> --my-ip-address <server 1 public IP address> --world-size 2 --num-gpus 2 --rank 0
  • On server 2, run:

    test_script_imagenet.sh --master-ip-address <master server public IP address> --my-ip-address <server 2 public IP address> --world-size 2 --num-gpus 2 --rank 1
