This dissertation explores the potential of integrating Deep Reinforcement Learning (DRL) into High Performance Computing (HPC) job scheduling. We began by formulating the HPC job scheduling problem as a Markov Decision Process (MDP) within a specialized HPC job scheduling simulation setup. This simulation is built from modules such as workload, job, and simulator, allowing a detailed representation of the state space that captures aspects like the waiting job queue and the current state of the cluster. We designed an action space of 128 distinct actions to represent various job scheduling situations, including potential non-operative or invalid schedules. Central to our methodology is the sigmoid reward function, governed by a hyperparameter k, which standardizes performance metrics to fit a [0, 1] interval; this function is evaluated against the episodic reward function and direct performance-metric rewards. Leveraging the Proximal Policy Optimization (PPO) implementation from Stable Baselines3, we incorporated a kernel-driven policy network and an MLP-based value network for the agent's training process. Our strategies, especially when employing the PPO method, demonstrated strong outcomes on the HPC2N dataset, marked by a notable decrease in the average bounded slowdown. However, when transferring to black-box datasets such as SDSC-SP2, the agent's performance showed variability, suggesting a reliance on the initial training setup and reward function. In essence, our research explores DRL's capability to surpass conventional priority-based scheduling approaches in HPC job scheduling, although its success is closely linked to the nature of the training data and the design of the reward function.
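As an illustration of the idea behind the sigmoid reward, the sketch below maps a performance metric (here assumed to be a job's bounded slowdown) into the [0, 1] interval, with the hyperparameter k controlling how quickly the reward decays. This is an assumed form for illustration only; the exact function used in the dissertation may differ.

```python
import math

def sigmoid_reward(metric: float, k: float = 1.0) -> float:
    """Illustrative sigmoid-style reward (assumed form, not the project's exact one).

    A metric of 0 (an ideal schedule) yields a reward of 1, and the reward
    decays smoothly toward 0 as the metric (e.g. bounded slowdown) grows.
    `k` controls the steepness of the decay.
    """
    return 2.0 / (1.0 + math.exp(k * metric))
```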
To install the required dependencies, run the following command:
pip install -r requirements.txt
- `./Documentation`: This directory contains project meeting minutes and legacy source code.
- `./data`: This folder contains training logs and trained models.
- `./dataset`: Contains all the workload datasets from the Parallel Workload Archive used in this project.
- `./ppo_configs`: This folder contains the configurations used in our experiments.
- `./hpc_rl_simulator`: This package contains the implementation of our HPC job scheduling simulator and its OpenAI Gym wrappers. It also includes the customized actor policy networks, callbacks, and additional evaluation methods for training the Stable Baselines3 PPO agent.
- `./result_analysis`: This folder contains the data visualization and result analysis of our experiments.
To train the agent, use the following command:
python train_ppo.py
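For orientation, a minimal sketch of what a Stable Baselines3 PPO training run on the gym-wrapped simulator could look like is shown below. The environment ID, model path, and hyperparameters are placeholders, not the actual values used by `train_ppo.py` (which presumably takes its settings from a configuration in `./ppo_configs`).

```python
import gym
from stable_baselines3 import PPO

# Placeholder environment ID; the actual gym wrapper lives in ./hpc_rl_simulator.
env = gym.make("HpcSchedulerEnv-v0")

model = PPO(
    policy="MlpPolicy",             # the project uses customized policy/value networks
    env=env,
    tensorboard_log="./data/logs",  # logs can later be inspected with TensorBoard
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("./data/models/ppo_hpc_scheduler")
```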
To evaluate and export the performance of the non-RL baseline schedulers, use:
python non-RL-score.py
To evaluate and export the trained agent's performance:
python evaluate_agent.py
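A minimal sketch of what such an evaluation could look like with Stable Baselines3 is given below; the environment ID and model path are placeholders rather than the ones used by `evaluate_agent.py`.

```python
import gym
from stable_baselines3 import PPO

# Placeholder environment ID and model path.
env = gym.make("HpcSchedulerEnv-v0")
model = PPO.load("./data/models/ppo_hpc_scheduler")

# Roll out one episode with the trained policy and report the cumulative reward.
obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print(f"Episode reward: {episode_reward:.3f}")
```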
To access the training logs via Tensorboard:
tensorboard --logdir ./data/logs/*****
Note: Replace ***** with the appropriate log directory name.