Merge branch 'skypilot-org:master' into kueue-multipod
asaiacai committed Jul 8, 2024
2 parents cd4c4ab + 6acaa75 commit c19df88
Showing 55 changed files with 1,203 additions and 551 deletions.
4 changes: 3 additions & 1 deletion docs/source/docs/index.rst
@@ -126,7 +126,7 @@ Contents
../reference/job-queue
../examples/auto-failover
../reference/kubernetes/index
../running-jobs/index
../running-jobs/distributed-jobs

.. toctree::
:maxdepth: 1
@@ -155,12 +155,14 @@ Contents
:maxdepth: 1
:caption: User Guides

../running-jobs/environment-variables
../examples/docker-containers
../examples/ports
../reference/tpu
../reference/logging
../reference/faq


.. toctree::
:maxdepth: 1
:caption: Developer Guides
31 changes: 16 additions & 15 deletions docs/source/getting-started/installation.rst
@@ -311,25 +311,26 @@ Fluidstack
Cudo Compute
~~~~~~~~~~~~~~~~~~

`Cudo Compute <https://www.cudocompute.com/>`__ GPU cloud provides low cost GPUs powered with green energy.
1. Create a billing account by following `this guide <https://www.cudocompute.com/docs/guide/billing/>`__.
2. Create a project `<https://www.cudocompute.com/docs/guide/projects/>`__.
3. Create an API Key by following `this guide <https://www.cudocompute.com/docs/guide/api-keys/>`__.
3. Download and install the `cudoctl <https://www.cudocompute.com/docs/cli-tool/>`__ command line tool
3. Run :code:`cudoctl init`:
`Cudo Compute <https://www.cudocompute.com/>`__ provides low cost GPUs powered by green energy.

.. code-block:: shell
1. Create a `billing account <https://www.cudocompute.com/docs/guide/billing/>`__.
2. Create a `project <https://www.cudocompute.com/docs/guide/projects/>`__.
3. Create an `API Key <https://www.cudocompute.com/docs/guide/api-keys/>`__.
4. Download and install the `cudoctl <https://www.cudocompute.com/docs/cli-tool/>`__ command line tool
5. Run :code:`cudoctl init`:

.. code-block:: shell
cudoctl init
✔ api key: my-api-key
✔ project: my-project
✔ billing account: my-billing-account
✔ context: default
config file saved ~/.config/cudo/cudo.yml
pip install "cudo-compute>=0.1.10"
If you want to want to use skypilot with a different Cudo Compute account or project, just run :code:`cudoctl init`: again.
If you want to use SkyPilot with a different Cudo Compute account or project, run :code:`cudoctl init` again.



4 changes: 3 additions & 1 deletion docs/source/getting-started/quickstart.rst
@@ -72,6 +72,8 @@ To launch a cluster and run a task, use :code:`sky launch`:

You can use the ``-c`` flag to give the cluster an easy-to-remember name. If not specified, a name is autogenerated.

If the cluster name matches an existing cluster shown in ``sky status``, that cluster will be reused.

The ``sky launch`` command does much of the heavy lifting:

- selects an appropriate cloud and VM based on the specified resource constraints;
@@ -208,7 +210,7 @@ Managed spot jobs run on much cheaper spot instances, with automatic preemption

.. code-block:: console
$ sky spot launch hello_sky.yaml
$ sky jobs launch --use-spot hello_sky.yaml
Next steps
-----------
25 changes: 25 additions & 0 deletions docs/source/reference/config.rst
@@ -40,6 +40,31 @@ Available fields and semantics:
- gcp
- kubernetes
docker:
# Additional Docker run options (optional).
#
# When image_id: docker:<docker_image> is used in a task YAML, additional
# run options for starting the Docker container can be specified here.
# These options will be passed directly as command line args to `docker run`,
# see: https://docs.docker.com/reference/cli/docker/container/run/
#
# The following run options are applied by default and cannot be overridden:
# --net=host
# --cap-add=SYS_ADMIN
# --device=/dev/fuse
# --security-opt=apparmor:unconfined
# --runtime=nvidia # Applied if nvidia GPUs are detected on the host
#
# This field can be useful for mounting volumes and other advanced Docker
# configurations. You can specify a list of arguments or a string, where the
# former will be combined into a single string with spaces. The following
# example allows running Docker inside Docker and increases the size of
# /dev/shm:
# sky launch --cloud aws --image-id docker:continuumio/miniconda3 "apt update; apt install -y docker.io; docker run hello-world"
run_options:
- -v /var/run/docker.sock:/var/run/docker.sock
- --shm-size=2g
nvidia_gpus:
# Disable ECC for NVIDIA GPUs (optional).
#
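The ``run_options`` comment above notes that a list of arguments is combined into a single space-separated string, while a plain string is used as-is. A minimal Python sketch of that normalization (the helper name is hypothetical, not SkyPilot's actual code):

```python
from typing import List, Union


def normalize_run_options(run_options: Union[str, List[str]]) -> str:
    """Collapse docker run options into one string, as the config comment describes.

    A list of arguments is joined with spaces; a plain string passes through
    unchanged. Illustrative sketch only.
    """
    if isinstance(run_options, str):
        return run_options
    return ' '.join(run_options)


# Example: the list form from the config snippet above.
opts = normalize_run_options([
    '-v /var/run/docker.sock:/var/run/docker.sock',
    '--shm-size=2g',
])
print(opts)  # -v /var/run/docker.sock:/var/run/docker.sock --shm-size=2g
```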
14 changes: 13 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-troubleshooting.rst
@@ -68,7 +68,19 @@ Run :code:`sky check` to verify that SkyPilot can access your cluster.
If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.


Step A3 - Can you launch a SkyPilot task?
Step A3 - Do your nodes have enough disk space?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If your nodes are out of disk space, pulling the SkyPilot images may fail during provisioning with an :code:`rpc error: code = Canceled desc = failed to pull and unpack image: context canceled` error in the terminal.
Make sure your nodes are not under disk pressure by checking :code:`Conditions` in :code:`kubectl describe nodes`, or by running:

.. code-block:: bash
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[?(@.type=="DiskPressure")]}{.type}={.status}{"\n"}{end}{"\n"}{end}'
# Should not show DiskPressure=True for any node
Step A4 - Can you launch a SkyPilot task?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple hello world task to verify that SkyPilot can launch tasks on your cluster.
44 changes: 18 additions & 26 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -1,15 +1,15 @@
.. _dist-jobs:

Distributed Jobs on Many VMs
Distributed Jobs on Many Nodes
================================================

SkyPilot supports multi-node cluster
provisioning and distributed execution on many VMs.
provisioning and distributed execution on many nodes.

For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6,21-22,24-25
:emphasize-lines: 6-6,21-21,23-26
name: resnet-distributed-app
@@ -31,14 +31,13 @@ For example, here is a simple PyTorch Distributed training example:
run: |
cd pytorch-distributed-resnet
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
python3 -m torch.distributed.launch \
--nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
--node_rank=${SKYPILOT_NODE_RANK} \
--nnodes=$num_nodes \
--master_addr=$master_addr \
--master_port=8008 \
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_port=12375 \
resnet_ddp.py --num_epochs 20
In the above,
@@ -66,16 +65,11 @@ SkyPilot exposes these environment variables that can be accessed in a task's ``
the node executing the task.
- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
the task, where each line contains one IP address.

- You can retrieve the number of nodes by :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`
and the IP address of the third node by :code:`echo "$SKYPILOT_NODE_IPS" | sed -n
3p`.

- To manipulate these IP addresses, you can also store them to a file in the
:code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_NODES`: number of nodes reserved for the task, which can be specified by ``num_nodes: <n>``. Same value as :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).

See :ref:`sky-env-vars` for more details.
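The variables above can be exercised with plain shell tools. A small sketch, using a hypothetical three-node IP list (SkyPilot populates :code:`SKYPILOT_NODE_IPS` only at runtime):

```shell
# Hypothetical value for illustration; SkyPilot sets this at runtime,
# one IP address per line.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2
10.0.0.3"

# Number of nodes (same value as SKYPILOT_NUM_NODES).
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

# Head node address (first line) and the third node's address.
master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
third_ip=$(echo "$SKYPILOT_NODE_IPS" | sed -n 3p)

echo "nodes=$num_nodes master=$master_addr third=$third_ip"
```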

Launching a multi-node task (new cluster)
-------------------------------------------------
@@ -106,7 +100,7 @@ The following happens in sequence:
and step 4).

Executing a task on the head node only
-----------------------------------------
--------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:
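The example itself is collapsed in this diff, but the rank-gating pattern it describes can be sketched as follows (the sample value and ``role`` variable are for illustration only; SkyPilot sets :code:`SKYPILOT_NODE_RANK` at runtime):

```shell
# SKYPILOT_NODE_RANK is set by SkyPilot at runtime; default to 0 here
# so the sketch runs standalone.
SKYPILOT_NODE_RANK="${SKYPILOT_NODE_RANK:-0}"

if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    # Runs only on the head node (an mpirun launch would go here).
    role="head"
else
    # Worker nodes fall through and do nothing.
    role="worker"
fi
echo "$role"
```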

@@ -141,7 +135,7 @@ This allows you to directly SSH into the worker nodes, if required.
Executing a Distributed Ray Program
------------------------------------
To execute a distributed Ray program on many VMs, you can download the `training script <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/train.py>`_ and launch the `task yaml <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/ray_train.yaml>`_:
To execute a distributed Ray program on many nodes, you can download the `training script <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/train.py>`_ and launch the `task yaml <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/ray_train.yaml>`_:

.. code-block:: console
@@ -171,19 +165,17 @@ To execute a distributed Ray program on many VMs, you can download the `training
run: |
sudo chmod 777 -R /var/tmp
head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
sleep 5
python train.py --num-workers $num_nodes
python train.py --num-workers $SKYPILOT_NUM_NODES
else
sleep 5
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
fi
.. warning::
**Avoid Installing Ray in Base Environment**: Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the *base* environment. Installing a different version of Ray in the base environment can lead to abnormal cluster status.

It is highly recommended to **create a dedicated virtual environment** (as above) for Ray and its dependencies, and avoid calling `ray stop` as that will also cause issue with the cluster.
When using Ray, avoid calling ``ray stop`` as that will also cause the SkyPilot runtime to be stopped.
