0.0.5 release
s-m-e committed Feb 6, 2022
2 parents 1c0ebda + 3028e66 commit 5e3033d
Showing 29 changed files with 585 additions and 212 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ dist/
notebooks/
dask-worker-space/
_*.ipynb
notes.md
15 changes: 15 additions & 0 deletions CHANGES.md
@@ -1,5 +1,20 @@
# Changes

## 0.0.5 (2022-02-06)

- FEATURE: Dask scheduler and worker processes run as systemd services, allowing them to be restarted and the nodes to be rebooted, see #1.
- FEATURE: Raise proper exceptions when trying to connect to a broken or non-existent cluster.
- FEATURE: CLI shows proper messages when trying to connect to a broken or non-existent cluster.
- FEATURE: Workers and scheduler run the same major & minor version of Python as the client does, see #2.
- FEATURE: `scherbelberg ssh` can run commands directly on the remote host when they are passed as an optional string on the command line.
- FEATURE: Added `scherbelberg scp` command to complement the already existing API.
- FEATURE: Added log level option, `-l` or `--log_level`, to all CLI commands, see #5.
- FEATURE: Run-time type checks become an optional debugging feature and can be activated via an environment variable, i.e. `SCHERBELBERG_DEBUG=1`.
- FIX: Remove remaining dependencies on Python wheels.
- FIX: Point the Python language server dependency to an up-to-date package.
- FIX: Inconsistent CLI output behavior depending on platform.
- FIX: All error messages go to stderr.

## 0.0.4 (2022-02-02)

- FIX: Remove old and empty `scripts` parameter from `setup.py`.
5 changes: 3 additions & 2 deletions README.md
@@ -51,9 +51,10 @@ Options:
Commands:
create create cluster
destroy destroy cluster
ls list cluster members
ls list cluster nodes
nuke nuke cluster
ssh ssh into cluster member
scp scp from/to cluster node
ssh ssh into cluster node
```

At the moment, the ssh sub-command is broken on Windows.
Expand Down
2 changes: 1 addition & 1 deletion docs/source/about.rst
@@ -49,7 +49,7 @@ Motivation

While Dask is wonderful for automating large, parallel, distributed computations, it cannot solve the problem of its own deployment onto computer clusters. Instead, Dask plays nicely with established tools in the arena such as `slurm`_. Deploying Dask onto a custom cluster therefore requires a fair bit of time, background knowledge and technical skills in computer & network administration.

One of the really appealing features of Dask is that it enables users to exploit huge quantities of cloud compute resources really efficiently. Cloud compute instances can usually be rented on a per-hour basis, making them an interesting target for sizable, short-lived, on-demand clusters. For cloud deployments like this, there is the Dask-related `cloud provider package`_, which surprisingly does not solve the entire problem of deployment. At the time of *scherbelberg*'s creation, it was both rather inflexible and lacking support for the Hetzner cloud. Companies like `Coiled`_, which is also the primary developer of Dask, have filled this niche with polished, proprietary web-front-end services (and equally polished APIs) for creating clusters on clouds, which effectively makes them resellers of cloud resources. In the good spirit of open source and tight R&D budgets, *scherbelberg* aims at eliminating the resellers from the equation.
One of the really appealing features of Dask is that it enables users to exploit huge quantities of cloud compute resources really efficiently. Cloud compute instances can usually be rented on a per-hour basis, making them an interesting target for sizable, short-lived, on-demand clusters. For cloud deployments like this, there is the Dask-related `cloud provider package`_, which surprisingly does not solve the entire problem of deployment. At the time of *scherbelberg*'s creation, it was both rather inflexible and lacking support for the Hetzner cloud. Companies like `Coiled`_, which is also the primary developer of Dask, have filled this niche with polished, web-front-end services (and equally polished APIs) for creating clusters on clouds, which effectively makes them resellers of cloud resources. *scherbelberg* aims at eliminating the resellers from the equation while trying to provide a minimal, independent, self-contained, yet fully operational solution.

.. note::

28 changes: 28 additions & 0 deletions docs/source/debugging.rst
@@ -0,0 +1,28 @@
:github_url:

.. _debugging:

Debugging
=========

Every :ref:`CLI <cli>` command supports the ``-l`` or ``--log_level`` option, which adjusts the `log level`_ of the application. Set it to ``INFO`` (i.e. ``20``) for general information on what is happening. Set it to ``DEBUG`` (i.e. ``10``) for full debugging output, e.g. ``scherbelberg create -l 10``.

If *scherbelberg* is used via its :ref:`API <api>`, the log level can be adjusted via the ``logging`` module from Python's standard library, for instance as follows:

.. code:: python

    from logging import basicConfig, INFO

    basicConfig(
        format="%(name)s %(levelname)s %(asctime)-15s: %(message)s",
        level=INFO,
    )

.. note::

    By default, the logger used is named after the cluster, i.e. its ``prefix``.
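
Assuming the default ``prefix``, i.e. ``cluster``, this logger can therefore be retrieved by name, for instance to change only its threshold:

.. code:: python

    from logging import DEBUG, getLogger

    # Hypothetical sketch, assuming the cluster was created with the
    # default prefix "cluster"; substitute the name for custom prefixes.
    getLogger("cluster").setLevel(DEBUG)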

For additional insights and debugging output, run-time type checks based on `typeguard`_ can be activated by setting the ``SCHERBELBERG_DEBUG`` environment variable to ``1`` prior to running a CLI command or prior to importing *scherbelberg* in Python.
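
A minimal sketch of the import-time variant, assuming nothing beyond the standard library:

.. code:: python

    import os

    # The variable must be set before scherbelberg is imported,
    # otherwise the run-time type checks remain inactive.
    os.environ["SCHERBELBERG_DEBUG"] = "1"

    import scherbelberg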

.. _log level: https://docs.python.org/3/library/logging.html#levels
.. _typeguard: https://typeguard.readthedocs.io/
111 changes: 4 additions & 107 deletions docs/source/gettingstarted.rst
@@ -13,91 +13,18 @@ Cluster Management via CLI
.. code:: bash
(env) user@computer:~> scherbelberg create
cluster INFO 2022-01-28 14:24:33,141: Creating cloud client ...
cluster INFO 2022-01-28 14:24:33,142: Creating ssl certificates ...
cluster INFO 2022-01-28 14:24:35,778: Creating ssh key ...
cluster INFO 2022-01-28 14:24:37,786: Uploading ssh key ...
cluster INFO 2022-01-28 14:24:38,098: Getting handle on ssh key ...
cluster INFO 2022-01-28 14:24:38,153: Creating network ...
cluster INFO 2022-01-28 14:24:38,328: Getting handle on network ...
cluster INFO 2022-01-28 14:24:38,408: Creating firewall ...
cluster INFO 2022-01-28 14:24:38,508: Getting handle on firewall ...
cluster INFO 2022-01-28 14:24:38,608: Creating nodes ...
cluster INFO 2022-01-28 14:24:38,608: Creating node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:40,560: Waiting for node cluster-node-scheduler to become available ...
cluster INFO 2022-01-28 14:24:40,739: Creating node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:41,709: Waiting for node cluster-node-worker000 to become available ...
cluster INFO 2022-01-28 14:24:48,465: Attaching network to node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:49,034: Bootstrapping node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:49,034: [scheduler] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:24:49,184: Attaching network to node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:49,864: Bootstrapping node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:49,865: [worker000] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:24:54,046: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:54,882: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:59,056: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:59,895: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:01,064: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:01,905: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:03,074: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:05,082: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:05,920: [worker000] [root] SSH up.
cluster INFO 2022-01-28 14:25:05,920: [worker000] Copying root files to node ...
cluster INFO 2022-01-28 14:25:06,927: [worker000] Running first bootstrap script ...
cluster INFO 2022-01-28 14:25:08,091: [scheduler] [root] SSH up.
cluster INFO 2022-01-28 14:25:08,091: [scheduler] Copying root files to node ...
cluster INFO 2022-01-28 14:25:10,098: [scheduler] Running first bootstrap script ...
cluster INFO 2022-01-28 14:25:49,004: [worker000] Rebooting ...
cluster INFO 2022-01-28 14:25:49,317: [worker000] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:25:53,328: [scheduler] Rebooting ...
cluster INFO 2022-01-28 14:25:53,670: [scheduler] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:25:55,431: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:59,784: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:01,447: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:03,456: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:05,465: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:05,801: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:06,473: [worker000] [root] SSH up.
cluster INFO 2022-01-28 14:26:06,473: [worker000] Running second bootstrap script ...
cluster INFO 2022-01-28 14:26:07,808: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:09,815: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:11,824: [scheduler] [root] SSH up.
cluster INFO 2022-01-28 14:26:11,824: [scheduler] Running second bootstrap script ...
cluster INFO 2022-01-28 14:27:00,573: [worker000] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:27:01,581: [worker000] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:27:01,581: [worker000] Copying user files to node ...
cluster INFO 2022-01-28 14:27:03,590: [worker000] Running third (user) bootstrap script ...
cluster INFO 2022-01-28 14:27:06,883: [scheduler] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:27:07,891: [scheduler] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:27:07,891: [scheduler] Copying user files to node ...
cluster INFO 2022-01-28 14:27:09,900: [scheduler] Running third (user) bootstrap script ...
cluster INFO 2022-01-28 14:29:11,100: [scheduler] Bootstrapping done.
cluster INFO 2022-01-28 14:29:11,101: [scheduler] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:29:11,812: [worker000] Bootstrapping done.
cluster INFO 2022-01-28 14:29:12,107: [scheduler] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:29:12,108: [scheduler] Starting dask scheduler ...
cluster INFO 2022-01-28 14:29:13,114: [scheduler] Dask scheduler started.
cluster INFO 2022-01-28 14:29:13,115: [worker000] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:29:14,122: [worker000] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:29:14,123: [worker000] Starting dask worker ...
cluster INFO 2022-01-28 14:29:15,130: [worker000] Dask worker started.
cluster INFO 2022-01-28 14:29:15,130: Successfully created new cluster.
.. note::

Creating a cluster requires around 3 to 10 minutes.
Creating a cluster requires around 3 to 10 minutes. To get a better idea of what is going on, adjust the `log level`_ via the ``-l`` flag, for instance to ``INFO``: ``scherbelberg create -l 20``.

.. _log level: https://docs.python.org/3/library/logging.html#levels

Once the cluster has been created, it can be inspected at any time using the ``scherbelberg ls`` command:

.. code:: bash
(env) user@computer:~> scherbelberg ls
cluster INFO 2022-01-28 14:34:53,789: Creating cloud client ...
cluster INFO 2022-01-28 14:34:53,790: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:34:54,099: Getting handles on workers ...
cluster INFO 2022-01-28 14:34:54,273: Getting handle on firewall ...
cluster INFO 2022-01-28 14:34:54,346: Getting handle on network ...
cluster INFO 2022-01-28 14:34:54,418: Successfully attached to existing cluster.
<Cluster prefix="cluster" alive=True workers=1 ipc=9753 dash=9756 nanny=9759>
<node name=cluster-node-worker000 public=188.34.155.13 private=10.0.1.100>
<node name=cluster-node-scheduler public=78.47.76.87 private=10.0.1.200>
@@ -111,18 +38,11 @@ Sometimes, it is necessary to log into worker nodes or the scheduler. *scherbelberg*
.. code:: bash
(env) user@computer:~> scherbelberg ssh worker000
cluster INFO 2022-01-28 14:35:49,774: Creating cloud client ...
cluster INFO 2022-01-28 14:35:49,775: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:35:49,979: Getting handles on workers ...
cluster INFO 2022-01-28 14:35:50,157: Getting handle on firewall ...
cluster INFO 2022-01-28 14:35:50,235: Getting handle on network ...
cluster INFO 2022-01-28 14:35:50,319: Successfully attached to existing cluster.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
(clusterenv) clusteruser@cluster-node-worker000:~$ exit
logout
(env) user@computer:~>
.. note::

@@ -133,46 +53,23 @@ The scheduler node is accessible as follows:
.. code:: bash
(env) user@computer:~> scherbelberg ssh scheduler
cluster INFO 2022-01-28 14:36:23,019: Creating cloud client ...
cluster INFO 2022-01-28 14:36:23,019: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:36:23,243: Getting handles on workers ...
cluster INFO 2022-01-28 14:36:23,477: Getting handle on firewall ...
cluster INFO 2022-01-28 14:36:23,543: Getting handle on network ...
cluster INFO 2022-01-28 14:36:23,618: Successfully attached to existing cluster.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
(clusterenv) clusteruser@cluster-node-scheduler:~$ exit
logout
(env) user@computer:~>
Once a cluster is no longer required, it can be destroyed using the ``scherbelberg destroy`` command:

.. code:: bash
(env) user@computer:~> scherbelberg destroy
cluster INFO 2022-01-28 14:37:17,612: Creating cloud client ...
cluster INFO 2022-01-28 14:37:17,612: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:37:18,377: Getting handles on workers ...
cluster INFO 2022-01-28 14:37:18,564: Getting handle on firewall ...
cluster INFO 2022-01-28 14:37:18,638: Getting handle on network ...
cluster INFO 2022-01-28 14:37:18,706: Successfully attached to existing cluster.
cluster INFO 2022-01-28 14:37:18,868: Deleting cluster-node-scheduler ...
cluster INFO 2022-01-28 14:37:19,221: Deleting cluster-node-worker000 ...
cluster INFO 2022-01-28 14:37:20,334: Deleting cluster-network ...
cluster INFO 2022-01-28 14:37:20,647: Deleting cluster-key ...
cluster INFO 2022-01-28 14:37:20,792: Deleting cluster-firewall ...
cluster INFO 2022-01-28 14:37:20,913: Cluster cluster destroyed.
(env) user@computer:~>
Under certain circumstances, the creation or destruction of a cluster may fail or leave an unclean state behind, for instance due to connectivity issues. In such cases, it might be necessary to "nuke" the remains of the cluster before it can be recreated:

.. code:: bash
(env) user@computer:~> scherbelberg nuke
cluster INFO 2022-01-28 15:43:19,549: Creating cloud client ...
cluster INFO 2022-01-28 15:43:20,285: Cluster cluster nuked.
(env) user@computer:~>
Cluster Management via API
--------------------------
@@ -282,7 +179,7 @@ So far, only minimal clusters have been shown for demonstration purposes. In rea

.. note::

Hetzner cloud serves tend to achieve a `network bandwidth`_ of around 300 to 500 Mbit/s. Larger instances might end up with more bandwidth because the underlying host has to deal with fewer instances sharing bandwidth. This has to be kept in mind when designing a cluster and ideally measured as well as monitored afterwards.
Hetzner cloud servers tend to achieve a `network bandwidth`_ of around 300 to 500 Mbit/s. Larger instances might end up with more bandwidth because the underlying host has to deal with fewer instances sharing bandwidth. This has to be kept in mind when designing a cluster and ideally measured as well as monitored afterwards.

.. warning::

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -69,6 +69,7 @@ User's guide
:caption: Advanced

security
debugging
changes
faq
contributing
15 changes: 12 additions & 3 deletions src/scherbelberg/__init__.py
@@ -27,14 +27,23 @@
# VERSION
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

__version__ = "0.0.4"
__version__ = "0.0.5"

# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# EXPORT
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from ._core.cluster import Cluster
from ._core.cluster import (
    Cluster,
    ClusterSchedulerNotFound,
    ClusterWorkerNotFound,
    ClusterFirewallNotFound,
    ClusterNetworkNotFound,
)
from ._core.command import Command
from ._core.node import Node
from ._core.node import (
    Node,
    NodeNotFound,
)
from ._core.process import Process
from ._core.sshconfig import SSHConfig
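
The newly exported exception classes make broken or missing clusters detectable from user code. A minimal, hedged sketch follows; that `Cluster.from_existing` can be called with `prefix` alone, relying on defaults for `tokenvar` and `wait`, is an assumption (cf. the CLI modules below):

    from asyncio import run

    from scherbelberg import (
        Cluster,
        ClusterSchedulerNotFound,
        ClusterWorkerNotFound,
        ClusterFirewallNotFound,
        ClusterNetworkNotFound,
    )

    async def attach(prefix: str = "cluster") -> Cluster:
        # Mirrors the error handling introduced in _cli/destroy.py below.
        try:
            return await Cluster.from_existing(prefix=prefix)
        except ClusterSchedulerNotFound:
            raise SystemExit("Cluster scheduler not found; cluster likely does not exist.")
        except (
            ClusterWorkerNotFound,
            ClusterFirewallNotFound,
            ClusterNetworkNotFound,
        ) as e:
            raise SystemExit(f"Cluster component missing ({type(e).__name__:s}); consider nuking.")

    cluster = run(attach())
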
5 changes: 4 additions & 1 deletion src/scherbelberg/_cli/create.py
@@ -29,6 +29,7 @@
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from asyncio import run
from logging import ERROR

import click

@@ -72,6 +73,7 @@
@click.option("-c", "--dask_ipc", default=DASK_IPC, type=int, show_default=True)
@click.option("-d", "--dask_dash", default=DASK_DASH, type=int, show_default=True)
@click.option("-e", "--dask_nanny", default=DASK_NANNY, type=int, show_default=True)
@click.option("-l", "--log_level", default=ERROR, type=int, show_default=True)
def create(
    prefix,
    tokenvar,
@@ -84,9 +86,10 @@ def create(
    dask_ipc,
    dask_dash,
    dask_nanny,
    log_level,
):

    configure_log()
    configure_log(log_level)

run(
Cluster.from_new(
43 changes: 35 additions & 8 deletions src/scherbelberg/_cli/destroy.py
@@ -29,10 +29,18 @@
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from asyncio import run
from logging import ERROR
import sys

import click

from .._core.cluster import Cluster
from .._core.cluster import (
    Cluster,
    ClusterSchedulerNotFound,
    ClusterWorkerNotFound,
    ClusterFirewallNotFound,
    ClusterNetworkNotFound,
)
from .._core.const import PREFIX, TOKENVAR, WAIT
from .._core.log import configure_log

@@ -43,20 +51,39 @@

async def _main(prefix, tokenvar, wait):

    cluster = await Cluster.from_existing(
        prefix=prefix,
        tokenvar=tokenvar,
        wait=wait,
    )
    try:
        cluster = await Cluster.from_existing(
            prefix=prefix,
            tokenvar=tokenvar,
            wait=wait,
        )
    except ClusterSchedulerNotFound:
        click.echo(
            "Cluster scheduler could not be found. Cluster likely does not exist.",
            err=True,
        )
        sys.exit(1)
    except (
        ClusterWorkerNotFound,
        ClusterFirewallNotFound,
        ClusterNetworkNotFound,
    ) as e:
        click.echo(
            f"Cluster component missing ({type(e).__name__:s}). Cluster likely needs to be nuked.",
            err=True,
        )
        sys.exit(1)

    await cluster.destroy()


@click.command(short_help="destroy cluster")
@click.option("-p", "--prefix", default=PREFIX, type=str, show_default=True)
@click.option("-t", "--tokenvar", default=TOKENVAR, type=str, show_default=True)
@click.option("-a", "--wait", default=WAIT, type=float, show_default=True)
def destroy(prefix, tokenvar, wait):
@click.option("-l", "--log_level", default=ERROR, type=int, show_default=True)
def destroy(prefix, tokenvar, wait, log_level):

    configure_log()
    configure_log(log_level)

    run(_main(prefix, tokenvar, wait))