NKI-AI/kosmos-cluster-legacy

Cosmos cluster management

GPU infrastructure and automation tools

Overview

The DeepOps project encapsulates best practices for deploying GPU server clusters and for sharing single powerful nodes among users. DeepOps can also be adapted, or used in a modular fashion, to match site-specific cluster needs. For example:

  • An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm, Kubernetes, or a hybrid of both
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime
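For the single-machine case, a quick sanity check is to confirm that the three components named above are on the `PATH`. The following is a minimal sketch and not part of DeepOps: the function name `missing_components` is hypothetical, and it assumes the components expose the commands `nvidia-smi`, `docker`, and `nvidia-container-runtime`, which may differ on your installation.

```python
import shutil
from typing import Callable, List, Optional

# Assumed mapping from component to the command it usually installs.
REQUIRED_COMMANDS = {
    "NVIDIA driver": "nvidia-smi",
    "Docker": "docker",
    "NVIDIA Container Runtime": "nvidia-container-runtime",
}


def missing_components(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of components whose command is not found on PATH.

    `which` is injectable so the check can be exercised without the
    actual tools installed.
    """
    return [
        name
        for name, command in REQUIRED_COMMANDS.items()
        if which(command) is None
    ]


if __name__ == "__main__":
    missing = missing_components()
    print("missing:", ", ".join(missing) if missing else "none")
```

Finding a command on the `PATH` only shows that it is installed; it does not prove that the driver is loaded or that containers can reach the GPU.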

Getting Started

For detailed help or guidance, read through our Getting Started Guide or pick one of the deployment options documented below.

Supported Ansible versions

DeepOps supports Ansible 2.9.x. Ansible 2.10.x and newer are not currently supported.
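The version constraint can be checked before running any playbooks. The sketch below is illustrative only: `is_supported_ansible` is a hypothetical helper, it takes a plain version string such as `2.9.27` (for example, copied from the output of `ansible --version`), and it assumes that releases older than 2.9 are also unsupported, which this README does not state explicitly.

```python
import re


def is_supported_ansible(version: str) -> bool:
    """Return True if `version` belongs to the 2.9.x series."""
    # Capture the major and minor components; anything after them is ignored.
    match = re.match(r"^(\d+)\.(\d+)(?:\.|$)", version.strip())
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    # Assumption: only 2.9.x is accepted; older and newer series are rejected.
    return (major, minor) == (2, 9)
```

Comparing the numeric components rather than the raw string avoids mistaking `2.90.0` or `2.10.5` for a 2.9.x release.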

Supported distributions

DeepOps currently supports the following Linux distributions:

  • Ubuntu 18.04 LTS, 20.04 LTS
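A host can be checked against this list by reading the `ID` and `VERSION_ID` fields of `/etc/os-release`. This is a minimal sketch, not DeepOps code: the helper names `parse_os_release` and `is_supported_distro` are hypothetical, and the parser handles only simple `KEY=value` lines with optional quotes.

```python
# Distributions listed as supported in this README.
SUPPORTED = {("ubuntu", "18.04"), ("ubuntu", "20.04")}


def parse_os_release(text: str) -> dict:
    """Parse the contents of an os-release file into a dict of fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_supported_distro(os_release_text: str) -> bool:
    """Return True if the os-release text describes a supported distribution."""
    fields = parse_os_release(os_release_text)
    distro = (fields.get("ID", "").lower(), fields.get("VERSION_ID", ""))
    return distro in SUPPORTED
```

On a target host the input would typically come from `open("/etc/os-release").read()`.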

Slurm

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Consult the DeepOps Slurm Deployment Guide for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the official Slurm docs.

Updating DeepOps

To update from a previous version of DeepOps to a newer release, please consult the DeepOps Update Guide.

Copyright and License

This project is released under the BSD 3-clause license.

Issues

NVIDIA DGX customers should file an NVES ticket via NVIDIA Enterprise Services.

Otherwise, bugs and feature requests can be made by filing a GitHub Issue.

Contributing

To contribute, please open a signed pull request against the master branch from a local fork. See the contribution document for more information.
