0.0.5 release
s-m-e committed Feb 6, 2022
2 parents 1c0ebda + 3028e66 commit 5e3033d
Showing 29 changed files with 585 additions and 212 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ dist/
notebooks/
dask-worker-space/
_*.ipynb
notes.md
15 changes: 15 additions & 0 deletions CHANGES.md
@@ -1,5 +1,20 @@
# Changes

## 0.0.5 (2022-02-06)

- FEATURE: Dask scheduler and worker processes run as systemd services, allowing them to be restarted and the nodes to be rebooted, see #1.
- FEATURE: Raise proper exceptions when trying to connect to a broken or non-existent cluster.
- FEATURE: CLI shows proper messages when trying to connect to a broken or non-existent cluster.
- FEATURE: Workers and scheduler run the same major & minor version of Python as the client does, see #2.
- FEATURE: `scherbelberg ssh` can run commands directly on the remote host when they are passed as an optional string on the command line.
- FEATURE: Added `scherbelberg scp` command to complement the already existing API.
- FEATURE: Added log level option, `-l` or `--log_level`, to all CLI commands, see #5.
- FEATURE: Run-time type checks become an optional debugging feature and can be activated via an environment variable, i.e. `SCHERBELBERG_DEBUG=1`.
- FIX: Remove remaining dependencies on Python wheels.
- FIX: Point the Python language server dependency to an up-to-date package.
- FIX: Inconsistent CLI output behavior depending on platform.
- FIX: All error messages go to stderr.

## 0.0.4 (2022-02-02)

- FIX: Remove old and empty `scripts` parameter from `setup.py`.
5 changes: 3 additions & 2 deletions README.md
@@ -51,9 +51,10 @@ Options:
Commands:
create create cluster
destroy destroy cluster
ls list cluster members
ls list cluster nodes
nuke nuke cluster
ssh ssh into cluster member
scp scp from/to cluster node
ssh ssh into cluster node
```

At the moment, the ssh sub-command is broken on Windows.
Expand Down
2 changes: 1 addition & 1 deletion docs/source/about.rst
@@ -49,7 +49,7 @@ Motivation

While Dask is wonderful for automating large, parallel, distributed computations, it cannot solve the problem of its own deployment onto computer clusters. Instead, Dask plays nicely with established tools in the arena such as `slurm`_. Deploying Dask onto a custom cluster therefore requires a fair bit of time, background knowledge and technical skills in computer & network administration.

One of the really appealing features of Dask is that it enables users to exploit huge quantities of cloud compute resources really efficiently. Cloud compute instances can usually be rented on a per-hour basis, making them an interesting target for sizable, short-lived, on-demand clusters. For cloud deployments like this, there is the Dask-related `cloud provider package`_, which surprisingly does not solve the entire problem of deployment. At the time of *scherbelberg*'s creation, it was both rather inflexible and lacking support for the Hetzner cloud. Companies like `Coiled`_, which is also the primary developer of Dask, have filled this niche with polished, proprietary web-front-end services (and equally polished APIs) for creating clusters on clouds, which effectively makes them resellers of cloud resources. In the good spirit of open source and tight R&D budgets, *scherbelberg* aims at eliminating the resellers from the equation.
One of the really appealing features of Dask is that it enables users to exploit huge quantities of cloud compute resources really efficiently. Cloud compute instances can usually be rented on a per-hour basis, making them an interesting target for sizable, short-lived, on-demand clusters. For cloud deployments like this, there is the Dask-related `cloud provider package`_, which surprisingly does not solve the entire problem of deployment. At the time of *scherbelberg*'s creation, it was both rather inflexible and lacking support for the Hetzner cloud. Companies like `Coiled`_, which is also the primary developer of Dask, have filled this niche with polished, web-front-end services (and equally polished APIs) for creating clusters on clouds, which effectively makes them resellers of cloud resources. *scherbelberg* aims at eliminating the resellers from the equation while trying to provide a minimal, independent, self-contained, yet fully operational solution.

.. note::

28 changes: 28 additions & 0 deletions docs/source/debugging.rst
@@ -0,0 +1,28 @@
:github_url:

.. _debugging:

Debugging
=========

Every :ref:`CLI <cli>` command supports the ``-l`` or ``--log_level`` option, which adjusts the `log level`_ of the application. Set it to ``INFO`` (i.e. ``20``) for general information on what is happening. Set it to ``DEBUG`` (i.e. ``10``) for full debugging output, e.g. ``scherbelberg create -l 10``.

If *scherbelberg* is used via its :ref:`API <api>`, the log level can be adjusted via the ``logging`` module from Python's standard library, for instance as follows:

.. code:: python

    from logging import basicConfig, INFO

    basicConfig(
        format="%(name)s %(levelname)s %(asctime)-15s: %(message)s",
        level=INFO,
    )

.. note::

    By default, the logger used is named after the cluster, i.e. its ``prefix``.
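
Assuming the default ``prefix``, i.e. ``cluster``, this logger can therefore be retrieved by name, for instance to change only its threshold:

.. code:: python

    from logging import DEBUG, getLogger

    # Hypothetical sketch, assuming the cluster was created with the
    # default prefix "cluster"; substitute the name for custom prefixes.
    getLogger("cluster").setLevel(DEBUG)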

For additional insights and debugging output, run-time type checks based on `typeguard`_ can be activated by setting the ``SCHERBELBERG_DEBUG`` environment variable to ``1`` prior to running a CLI command or prior to importing *scherbelberg* in Python.
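
A minimal sketch of the import-time variant, assuming nothing beyond the standard library:

.. code:: python

    import os

    # The variable must be set before scherbelberg is imported,
    # otherwise the run-time type checks remain inactive.
    os.environ["SCHERBELBERG_DEBUG"] = "1"

    import scherbelberg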

.. _log level: https://docs.python.org/3/library/logging.html#levels
.. _typeguard: https://typeguard.readthedocs.io/
111 changes: 4 additions & 107 deletions docs/source/gettingstarted.rst
@@ -13,91 +13,18 @@ Cluster Management via CLI
.. code:: bash
(env) user@computer:~> scherbelberg create
cluster INFO 2022-01-28 14:24:33,141: Creating cloud client ...
cluster INFO 2022-01-28 14:24:33,142: Creating ssl certificates ...
cluster INFO 2022-01-28 14:24:35,778: Creating ssh key ...
cluster INFO 2022-01-28 14:24:37,786: Uploading ssh key ...
cluster INFO 2022-01-28 14:24:38,098: Getting handle on ssh key ...
cluster INFO 2022-01-28 14:24:38,153: Creating network ...
cluster INFO 2022-01-28 14:24:38,328: Getting handle on network ...
cluster INFO 2022-01-28 14:24:38,408: Creating firewall ...
cluster INFO 2022-01-28 14:24:38,508: Getting handle on firewall ...
cluster INFO 2022-01-28 14:24:38,608: Creating nodes ...
cluster INFO 2022-01-28 14:24:38,608: Creating node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:40,560: Waiting for node cluster-node-scheduler to become available ...
cluster INFO 2022-01-28 14:24:40,739: Creating node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:41,709: Waiting for node cluster-node-worker000 to become available ...
cluster INFO 2022-01-28 14:24:48,465: Attaching network to node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:49,034: Bootstrapping node cluster-node-scheduler ...
cluster INFO 2022-01-28 14:24:49,034: [scheduler] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:24:49,184: Attaching network to node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:49,864: Bootstrapping node cluster-node-worker000 ...
cluster INFO 2022-01-28 14:24:49,865: [worker000] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:24:54,046: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:54,882: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:59,056: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:24:59,895: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:01,064: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:01,905: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:03,074: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:05,082: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:05,920: [worker000] [root] SSH up.
cluster INFO 2022-01-28 14:25:05,920: [worker000] Copying root files to node ...
cluster INFO 2022-01-28 14:25:06,927: [worker000] Running first bootstrap script ...
cluster INFO 2022-01-28 14:25:08,091: [scheduler] [root] SSH up.
cluster INFO 2022-01-28 14:25:08,091: [scheduler] Copying root files to node ...
cluster INFO 2022-01-28 14:25:10,098: [scheduler] Running first bootstrap script ...
cluster INFO 2022-01-28 14:25:49,004: [worker000] Rebooting ...
cluster INFO 2022-01-28 14:25:49,317: [worker000] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:25:53,328: [scheduler] Rebooting ...
cluster INFO 2022-01-28 14:25:53,670: [scheduler] [root] Waiting for SSH ...
cluster INFO 2022-01-28 14:25:55,431: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:25:59,784: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:01,447: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:03,456: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:05,465: [worker000] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:05,801: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:06,473: [worker000] [root] SSH up.
cluster INFO 2022-01-28 14:26:06,473: [worker000] Running second bootstrap script ...
cluster INFO 2022-01-28 14:26:07,808: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:09,815: [scheduler] [root] Continuing to wait for SSH ...
cluster INFO 2022-01-28 14:26:11,824: [scheduler] [root] SSH up.
cluster INFO 2022-01-28 14:26:11,824: [scheduler] Running second bootstrap script ...
cluster INFO 2022-01-28 14:27:00,573: [worker000] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:27:01,581: [worker000] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:27:01,581: [worker000] Copying user files to node ...
cluster INFO 2022-01-28 14:27:03,590: [worker000] Running third (user) bootstrap script ...
cluster INFO 2022-01-28 14:27:06,883: [scheduler] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:27:07,891: [scheduler] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:27:07,891: [scheduler] Copying user files to node ...
cluster INFO 2022-01-28 14:27:09,900: [scheduler] Running third (user) bootstrap script ...
cluster INFO 2022-01-28 14:29:11,100: [scheduler] Bootstrapping done.
cluster INFO 2022-01-28 14:29:11,101: [scheduler] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:29:11,812: [worker000] Bootstrapping done.
cluster INFO 2022-01-28 14:29:12,107: [scheduler] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:29:12,108: [scheduler] Starting dask scheduler ...
cluster INFO 2022-01-28 14:29:13,114: [scheduler] Dask scheduler started.
cluster INFO 2022-01-28 14:29:13,115: [worker000] [clusteruser] Waiting for SSH ...
cluster INFO 2022-01-28 14:29:14,122: [worker000] [clusteruser] SSH up.
cluster INFO 2022-01-28 14:29:14,123: [worker000] Starting dask worker ...
cluster INFO 2022-01-28 14:29:15,130: [worker000] Dask worker started.
cluster INFO 2022-01-28 14:29:15,130: Successfully created new cluster.
.. note::

Creating a cluster requires around 3 to 10 minutes.
Creating a cluster requires around 3 to 10 minutes. To get a better idea of what is going on, adjust the `log level`_ via the ``-l`` flag, for instance to ``INFO``: ``scherbelberg create -l 20``.

.. _log level: https://docs.python.org/3/library/logging.html#levels

Once the cluster has been created, it can be inspected at any time using the ``scherbelberg ls`` command:

.. code:: bash
(env) user@computer:~> scherbelberg ls
cluster INFO 2022-01-28 14:34:53,789: Creating cloud client ...
cluster INFO 2022-01-28 14:34:53,790: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:34:54,099: Getting handles on workers ...
cluster INFO 2022-01-28 14:34:54,273: Getting handle on firewall ...
cluster INFO 2022-01-28 14:34:54,346: Getting handle on network ...
cluster INFO 2022-01-28 14:34:54,418: Successfully attached to existing cluster.
<Cluster prefix="cluster" alive=True workers=1 ipc=9753 dash=9756 nanny=9759>
<node name=cluster-node-worker000 public=188.34.155.13 private=10.0.1.100>
<node name=cluster-node-scheduler public=78.47.76.87 private=10.0.1.200>
@@ -111,18 +38,11 @@ Sometimes, it is necessary to log into worker nodes or the scheduler. *scherbelberg*
.. code:: bash
(env) user@computer:~> scherbelberg ssh worker000
cluster INFO 2022-01-28 14:35:49,774: Creating cloud client ...
cluster INFO 2022-01-28 14:35:49,775: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:35:49,979: Getting handles on workers ...
cluster INFO 2022-01-28 14:35:50,157: Getting handle on firewall ...
cluster INFO 2022-01-28 14:35:50,235: Getting handle on network ...
cluster INFO 2022-01-28 14:35:50,319: Successfully attached to existing cluster.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
(clusterenv) clusteruser@cluster-node-worker000:~$ exit
logout
(env) user@computer:~>
.. note::

@@ -133,46 +53,23 @@ The scheduler node is accessible as follows:
.. code:: bash
(env) user@computer:~> scherbelberg ssh scheduler
cluster INFO 2022-01-28 14:36:23,019: Creating cloud client ...
cluster INFO 2022-01-28 14:36:23,019: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:36:23,243: Getting handles on workers ...
cluster INFO 2022-01-28 14:36:23,477: Getting handle on firewall ...
cluster INFO 2022-01-28 14:36:23,543: Getting handle on network ...
cluster INFO 2022-01-28 14:36:23,618: Successfully attached to existing cluster.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
(clusterenv) clusteruser@cluster-node-scheduler:~$ exit
logout
(env) user@computer:~>
Once a cluster is no longer required, it can be destroyed using the ``scherbelberg destroy`` command:

.. code:: bash
(env) user@computer:~> scherbelberg destroy
cluster INFO 2022-01-28 14:37:17,612: Creating cloud client ...
cluster INFO 2022-01-28 14:37:17,612: Getting handle on scheduler ...
cluster INFO 2022-01-28 14:37:18,377: Getting handles on workers ...
cluster INFO 2022-01-28 14:37:18,564: Getting handle on firewall ...
cluster INFO 2022-01-28 14:37:18,638: Getting handle on network ...
cluster INFO 2022-01-28 14:37:18,706: Successfully attached to existing cluster.
cluster INFO 2022-01-28 14:37:18,868: Deleting cluster-node-scheduler ...
cluster INFO 2022-01-28 14:37:19,221: Deleting cluster-node-worker000 ...
cluster INFO 2022-01-28 14:37:20,334: Deleting cluster-network ...
cluster INFO 2022-01-28 14:37:20,647: Deleting cluster-key ...
cluster INFO 2022-01-28 14:37:20,792: Deleting cluster-firewall ...
cluster INFO 2022-01-28 14:37:20,913: Cluster cluster destroyed.
(env) user@computer:~>
Under certain circumstances, the creation or destruction of a cluster may fail or leave an unclean state behind, for instance due to connectivity issues. In such cases, it might be necessary to "nuke" the remains of the cluster before it can be recreated:

.. code:: bash
(env) user@computer:~> scherbelberg nuke
cluster INFO 2022-01-28 15:43:19,549: Creating cloud client ...
cluster INFO 2022-01-28 15:43:20,285: Cluster cluster nuked.
(env) user@computer:~>
Cluster Management via API
--------------------------
@@ -282,7 +179,7 @@ So far, only minimal clusters have been shown for demonstration purposes. In rea

.. note::

Hetzner cloud serves tend to achieve a `network bandwidth`_ of around 300 to 500 Mbit/s. Larger instances might end up with more bandwidth because the underlying host has to deal with fewer instances sharing bandwidth. This has to be kept in mind when designing a cluster and ideally measured as well as monitored afterwards.
Hetzner cloud servers tend to achieve a `network bandwidth`_ of around 300 to 500 Mbit/s. Larger instances might end up with more bandwidth because the underlying host has to deal with fewer instances sharing bandwidth. This has to be kept in mind when designing a cluster and ideally measured as well as monitored afterwards.

.. warning::

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -69,6 +69,7 @@ User's guide
:caption: Advanced

security
debugging
changes
faq
contributing
15 changes: 12 additions & 3 deletions src/scherbelberg/__init__.py
@@ -27,14 +27,23 @@
# VERSION
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

__version__ = "0.0.4"
__version__ = "0.0.5"

# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# EXPORT
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from ._core.cluster import Cluster
from ._core.cluster import (
    Cluster,
    ClusterSchedulerNotFound,
    ClusterWorkerNotFound,
    ClusterFirewallNotFound,
    ClusterNetworkNotFound,
)
from ._core.command import Command
from ._core.node import Node
from ._core.node import (
    Node,
    NodeNotFound,
)
from ._core.process import Process
from ._core.sshconfig import SSHConfig
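
The newly exported exception classes make broken or missing clusters detectable from user code. A minimal, hedged sketch follows; that `Cluster.from_existing` can be called with `prefix` alone, relying on defaults for `tokenvar` and `wait`, is an assumption (cf. the CLI modules below):

    from asyncio import run

    from scherbelberg import (
        Cluster,
        ClusterSchedulerNotFound,
        ClusterWorkerNotFound,
        ClusterFirewallNotFound,
        ClusterNetworkNotFound,
    )

    async def attach(prefix: str = "cluster") -> Cluster:
        # Mirrors the error handling introduced in _cli/destroy.py below.
        try:
            return await Cluster.from_existing(prefix=prefix)
        except ClusterSchedulerNotFound:
            raise SystemExit("Cluster scheduler not found; cluster likely does not exist.")
        except (
            ClusterWorkerNotFound,
            ClusterFirewallNotFound,
            ClusterNetworkNotFound,
        ) as e:
            raise SystemExit(f"Cluster component missing ({type(e).__name__:s}); consider nuking.")

    cluster = run(attach())
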
5 changes: 4 additions & 1 deletion src/scherbelberg/_cli/create.py
@@ -29,6 +29,7 @@
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from asyncio import run
from logging import ERROR

import click

@@ -72,6 +73,7 @@
@click.option("-c", "--dask_ipc", default=DASK_IPC, type=int, show_default=True)
@click.option("-d", "--dask_dash", default=DASK_DASH, type=int, show_default=True)
@click.option("-e", "--dask_nanny", default=DASK_NANNY, type=int, show_default=True)
@click.option("-l", "--log_level", default=ERROR, type=int, show_default=True)
def create(
    prefix,
    tokenvar,
@@ -84,9 +86,10 @@ def create(
    dask_ipc,
    dask_dash,
    dask_nanny,
    log_level,
):

    configure_log()
    configure_log(log_level)

run(
Cluster.from_new(
43 changes: 35 additions & 8 deletions src/scherbelberg/_cli/destroy.py
@@ -29,10 +29,18 @@
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from asyncio import run
from logging import ERROR
import sys

import click

from .._core.cluster import Cluster
from .._core.cluster import (
    Cluster,
    ClusterSchedulerNotFound,
    ClusterWorkerNotFound,
    ClusterFirewallNotFound,
    ClusterNetworkNotFound,
)
from .._core.const import PREFIX, TOKENVAR, WAIT
from .._core.log import configure_log

@@ -43,20 +51,39 @@

async def _main(prefix, tokenvar, wait):

    cluster = await Cluster.from_existing(
        prefix=prefix,
        tokenvar=tokenvar,
        wait=wait,
    )
    try:
        cluster = await Cluster.from_existing(
            prefix=prefix,
            tokenvar=tokenvar,
            wait=wait,
        )
    except ClusterSchedulerNotFound:
        click.echo(
            "Cluster scheduler could not be found. Cluster likely does not exist.",
            err=True,
        )
        sys.exit(1)
    except (
        ClusterWorkerNotFound,
        ClusterFirewallNotFound,
        ClusterNetworkNotFound,
    ) as e:
        click.echo(
            f"Cluster component missing ({type(e).__name__:s}). Cluster likely needs to be nuked.",
            err=True,
        )
        sys.exit(1)

    await cluster.destroy()


@click.command(short_help="destroy cluster")
@click.option("-p", "--prefix", default=PREFIX, type=str, show_default=True)
@click.option("-t", "--tokenvar", default=TOKENVAR, type=str, show_default=True)
@click.option("-a", "--wait", default=WAIT, type=float, show_default=True)
def destroy(prefix, tokenvar, wait):
@click.option("-l", "--log_level", default=ERROR, type=int, show_default=True)
def destroy(prefix, tokenvar, wait, log_level):

    configure_log()
    configure_log(log_level)

    run(_main(prefix, tokenvar, wait))