Merge branch 'skypilot-org:master' into kueue-multipod
asaiacai committed Jul 8, 2024
2 parents cd4c4ab + 6acaa75 commit c19df88
Showing 55 changed files with 1,203 additions and 551 deletions.
4 changes: 3 additions & 1 deletion docs/source/docs/index.rst
@@ -126,7 +126,7 @@ Contents
../reference/job-queue
../examples/auto-failover
../reference/kubernetes/index
../running-jobs/index
../running-jobs/distributed-jobs

.. toctree::
:maxdepth: 1
@@ -155,12 +155,14 @@ Contents
:maxdepth: 1
:caption: User Guides

../running-jobs/environment-variables
../examples/docker-containers
../examples/ports
../reference/tpu
../reference/logging
../reference/faq


.. toctree::
:maxdepth: 1
:caption: Developer Guides
31 changes: 16 additions & 15 deletions docs/source/getting-started/installation.rst
@@ -311,25 +311,26 @@ Fluidstack
Cudo Compute
~~~~~~~~~~~~~~~~~~

`Cudo Compute <https://www.cudocompute.com/>`__ GPU cloud provides low cost GPUs powered with green energy.
1. Create a billing account by following `this guide <https://www.cudocompute.com/docs/guide/billing/>`__.
2. Create a project `<https://www.cudocompute.com/docs/guide/projects/>`__.
3. Create an API Key by following `this guide <https://www.cudocompute.com/docs/guide/api-keys/>`__.
3. Download and install the `cudoctl <https://www.cudocompute.com/docs/cli-tool/>`__ command line tool
3. Run :code:`cudoctl init`:
`Cudo Compute <https://www.cudocompute.com/>`__ provides low cost GPUs powered by green energy.

.. code-block:: shell
1. Create a `billing account <https://www.cudocompute.com/docs/guide/billing/>`__.
2. Create a `project <https://www.cudocompute.com/docs/guide/projects/>`__.
3. Create an `API Key <https://www.cudocompute.com/docs/guide/api-keys/>`__.
4. Download and install the `cudoctl <https://www.cudocompute.com/docs/cli-tool/>`__ command line tool
5. Run :code:`cudoctl init`:

.. code-block:: shell
cudoctl init
✔ api key: my-api-key
✔ project: my-project
✔ billing account: my-billing-account
✔ context: default
config file saved ~/.config/cudo/cudo.yml
pip install "cudo-compute>=0.1.10"
If you want to want to use skypilot with a different Cudo Compute account or project, just run :code:`cudoctl init`: again.
If you want to use SkyPilot with a different Cudo Compute account or project, run :code:`cudoctl init` again.



4 changes: 3 additions & 1 deletion docs/source/getting-started/quickstart.rst
@@ -72,6 +72,8 @@ To launch a cluster and run a task, use :code:`sky launch`:

You can use the ``-c`` flag to give the cluster an easy-to-remember name. If not specified, a name is autogenerated.

If the cluster name matches an existing cluster shown in ``sky status``, that cluster will be reused.

The ``sky launch`` command does much of the heavy lifting:

- selects an appropriate cloud and VM based on the specified resource constraints;
@@ -208,7 +210,7 @@ Managed spot jobs run on much cheaper spot instances, with automatic preemption

.. code-block:: console
$ sky spot launch hello_sky.yaml
$ sky jobs launch --use-spot hello_sky.yaml
Next steps
-----------
25 changes: 25 additions & 0 deletions docs/source/reference/config.rst
@@ -40,6 +40,31 @@ Available fields and semantics:
- gcp
- kubernetes
docker:
# Additional Docker run options (optional).
#
# When image_id: docker:<docker_image> is used in a task YAML, additional
# run options for starting the Docker container can be specified here.
# These options will be passed directly as command line args to `docker run`,
# see: https://docs.docker.com/reference/cli/docker/container/run/
#
# The following run options are applied by default and cannot be overridden:
# --net=host
# --cap-add=SYS_ADMIN
# --device=/dev/fuse
# --security-opt=apparmor:unconfined
# --runtime=nvidia # Applied if nvidia GPUs are detected on the host
#
# This field can be useful for mounting volumes and other advanced Docker
# configurations. You can specify a list of arguments or a string, where the
# former will be combined into a single string with spaces. The following
# example allows running Docker inside Docker and increases the size of
# /dev/shm:
# sky launch --cloud aws --image-id docker:continuumio/miniconda3 "apt update; apt install -y docker.io; docker run hello-world"
run_options:
- -v /var/run/docker.sock:/var/run/docker.sock
- --shm-size=2g
nvidia_gpus:
# Disable ECC for NVIDIA GPUs (optional).
#
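The ``run_options`` comment above notes that a list of arguments is combined into a single space-separated string, while a plain string is used as-is. A minimal Python sketch of that normalization (the helper name is hypothetical, not SkyPilot's actual code):

```python
from typing import List, Union


def normalize_run_options(run_options: Union[str, List[str]]) -> str:
    """Collapse docker run options into one string, as the config comment describes.

    A list of arguments is joined with spaces; a plain string passes through
    unchanged. Illustrative sketch only.
    """
    if isinstance(run_options, str):
        return run_options
    return ' '.join(run_options)


# Example: the list form from the config snippet above.
opts = normalize_run_options([
    '-v /var/run/docker.sock:/var/run/docker.sock',
    '--shm-size=2g',
])
print(opts)  # -v /var/run/docker.sock:/var/run/docker.sock --shm-size=2g
```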
14 changes: 13 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-troubleshooting.rst
@@ -68,7 +68,19 @@ Run :code:`sky check` to verify that SkyPilot can access your cluster.
If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.


Step A3 - Can you launch a SkyPilot task?
Step A3 - Do your nodes have enough disk space?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If your nodes are out of disk space, pulling the SkyPilot images may fail during provisioning with an :code:`rpc error: code = Canceled desc = failed to pull and unpack image: context canceled` error in the terminal.
Make sure your nodes are not under disk pressure by checking :code:`Conditions` in :code:`kubectl describe nodes`, or by running:

.. code-block:: bash
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[?(@.type=="DiskPressure")]}{.type}={.status}{"\n"}{end}{"\n"}{end}'
# Should not show DiskPressure=True for any node
Step A4 - Can you launch a SkyPilot task?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple hello world task to verify that SkyPilot can launch tasks on your cluster.
44 changes: 18 additions & 26 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -1,15 +1,15 @@
.. _dist-jobs:

Distributed Jobs on Many VMs
Distributed Jobs on Many Nodes
================================================

SkyPilot supports multi-node cluster
provisioning and distributed execution on many VMs.
provisioning and distributed execution on many nodes.

For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6,21-22,24-25
:emphasize-lines: 6-6,21-21,23-26
name: resnet-distributed-app
@@ -31,14 +31,13 @@ For example, here is a simple PyTorch Distributed training example:
run: |
cd pytorch-distributed-resnet
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
python3 -m torch.distributed.launch \
--nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
--node_rank=${SKYPILOT_NODE_RANK} \
--nnodes=$num_nodes \
--master_addr=$master_addr \
--master_port=8008 \
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_port=12375 \
resnet_ddp.py --num_epochs 20
In the above,
@@ -66,16 +65,11 @@ SkyPilot exposes these environment variables that can be accessed in a task's ``
the node executing the task.
- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
the task, where each line contains one IP address.

- You can retrieve the number of nodes by :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`
and the IP address of the third node by :code:`echo "$SKYPILOT_NODE_IPS" | sed -n
3p`.

- To manipulate these IP addresses, you can also store them to a file in the
:code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_NODES`: number of nodes reserved for the task, which can be specified by ``num_nodes: <n>``. Same value as :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).

See :ref:`sky-env-vars` for more details.
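The variables above can be exercised with plain shell tools. A small sketch, using a hypothetical three-node IP list (SkyPilot populates :code:`SKYPILOT_NODE_IPS` only at runtime):

```shell
# Hypothetical value for illustration; SkyPilot sets this at runtime,
# one IP address per line.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2
10.0.0.3"

# Number of nodes (same value as SKYPILOT_NUM_NODES).
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

# Head node address (first line) and the third node's address.
master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
third_ip=$(echo "$SKYPILOT_NODE_IPS" | sed -n 3p)

echo "nodes=$num_nodes master=$master_addr third=$third_ip"
```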

Launching a multi-node task (new cluster)
-------------------------------------------------
@@ -106,7 +100,7 @@ The following happens in sequence:
and step 4).

Executing a task on the head node only
-----------------------------------------
--------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:
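The example itself is collapsed in this diff, but the rank-gating pattern it describes can be sketched as follows (the sample value and ``role`` variable are for illustration only; SkyPilot sets :code:`SKYPILOT_NODE_RANK` at runtime):

```shell
# SKYPILOT_NODE_RANK is set by SkyPilot at runtime; default to 0 here
# so the sketch runs standalone.
SKYPILOT_NODE_RANK="${SKYPILOT_NODE_RANK:-0}"

if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    # Runs only on the head node (an mpirun launch would go here).
    role="head"
else
    # Worker nodes fall through and do nothing.
    role="worker"
fi
echo "$role"
```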

@@ -141,7 +135,7 @@ This allows you to directly SSH into the worker nodes, if required.
Executing a Distributed Ray Program
------------------------------------
To execute a distributed Ray program on many VMs, you can download the `training script <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/train.py>`_ and launch the `task yaml <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/ray_train.yaml>`_:
To execute a distributed Ray program on many nodes, you can download the `training script <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/train.py>`_ and launch the `task yaml <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/ray_train.yaml>`_:

.. code-block:: console
@@ -171,19 +165,17 @@ To execute a distributed Ray program on many VMs, you can download the `training
run: |
sudo chmod 777 -R /var/tmp
head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
sleep 5
python train.py --num-workers $num_nodes
python train.py --num-workers $SKYPILOT_NUM_NODES
else
sleep 5
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
fi
.. warning::
**Avoid Installing Ray in Base Environment**: Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the *base* environment. Installing a different version of Ray in the base environment can lead to abnormal cluster status.

It is highly recommended to **create a dedicated virtual environment** (as above) for Ray and its dependencies, and avoid calling `ray stop` as that will also cause issue with the cluster.
When using Ray, avoid calling ``ray stop`` as that will also cause the SkyPilot runtime to be stopped.
