
Simulate network failures in a datacenter topology


kpetremann/claw-network


Overview

ClawNetwork is a tool to simulate a network and evaluate the impact of failures on Top of Rack (ToR) switches.

It has been specifically designed for Clos networks. For now, only tree topologies are supported; cyclic graphs are not.

ClawNetwork is in active development.
The main features are implemented, but endpoints, structures and functions may change.

It is nonetheless usable already, and you are free to play with it :)

Usecases

Operations

The main usecase is to evaluate whether an operation on a device in your core network will impact a Top of Rack.

Such operations include upgrades, reboots, risky maintenance, etc.

Detect anomalies / SPOF

ClawNetwork can be used to detect single points of failure (SPOF) and anomalies such as a spine without downlinks.

Quickstart

From source

Simply run the ClawNetwork app using go run .

Alternatively, build the binary with go build and run it.

Using Docker compose

Default backend

Run ClawNetwork with default backend (FileRepository):

docker-compose -f compose/docker-compose.yml up -d

FileRepository stores the topologies in dedicated JSON files on the disk.

By default, it uses the examples/ directory provided in this repository.

At the moment this is not customizable, but it will be very soon.

Run with the Backend of your choice

docker-compose -f compose/docker-compose.yml -f <backend>.yml up -d

RedisJSON

Recommended backend for production if you need to store topologies.

At the moment, RedisJSON is the only alternative backend:

docker-compose -f compose/docker-compose.yml -f redisjson.yml up -d

This backend leverages the RedisJSON module to store plain JSON in Redis. Persistence is enabled, and ClawNetwork forces it on each change (ADD/DELETE).

Configuration

ClawNetwork can be configured either via environment variables or via a YAML file (settings.yaml).

List of available parameters (environment variable | YAML key):

  • CLAW_LISTENADDRESS | ListenAddress: ClawNetwork API listen address (default: "0.0.0.0")
  • CLAW_LISTENPORT | ListenPort: ClawNetwork API listen port (default: "8080")
  • CLAW_TOPDEVICEROLE | TopDeviceRole: Role of device at the top of the topology graph (default: "edge")
  • CLAW_BOTTOMDEVICEROLE | BottomDeviceRole: Role of device at the Bottom of the topology graph (default: "tor")
  • CLAW_BACKEND | Backend: Choose backend to store topologies (choices: "file", "redis", default: "file")
  • CLAW_BACKENDS_FILE_PATH | Backends.File.Path: Directory where the file backend stores topologies (default: "./topologies/")
  • CLAW_BACKENDS_REDIS_HOST | Backends.Redis.Host: Redis server address (default: "localhost")
  • CLAW_BACKENDS_REDIS_PORT | Backends.Redis.Port: Redis server port (default: "6379")
  • CLAW_BACKENDS_REDIS_PASSWORD | Backends.Redis.Password: Redis password (default: "")
  • CLAW_BACKENDS_REDIS_DB | Backends.Redis.DB: Redis DB to use (default: 0)
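
The two naming formats in the list above appear to follow a simple pattern: the YAML key is upper-cased, dots become underscores, and the result is prefixed with CLAW_. The helper below is only a sketch of that observed pattern; env_var_name is a hypothetical name and is not part of ClawNetwork.

```python
def env_var_name(yaml_key: str, prefix: str = "CLAW") -> str:
    """Derive the environment variable name from a dotted YAML key.

    This mirrors the pattern visible in the parameter list; it is an
    assumption inferred from that list, not ClawNetwork's own code.
    """
    return prefix + "_" + yaml_key.replace(".", "_").upper()


print(env_var_name("ListenPort"))
print(env_var_name("Backends.Redis.Host"))
```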

Usage

Manage stored topologies

  • GET /topology: list stored topologies
  • GET /topology/:topology_name: get topology definition
  • POST /topology/:topology_name: create a new topology
  • DELETE /topology/:topology_name: delete a topology
  • GET /topology/details: list stored topologies with some stats
  • GET /topology/:topology_name/details: get topology stats

Simulation on a stored topology

  • GET /topology/:topology_name/device/:device/down/impact: run simulations on a stored topology
  • POST /topology/custom/device/:device/down/impact: run simulations on a topology provided in the request body

The first endpoint runs the simulation on a stored topology. The second one runs it on the topology sent in the request body, which is not stored.

If :device is set to each, the failure of each device is simulated in turn, excluding Top of Racks.
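
As an illustration, the two simulation endpoints could be called from a small Python client using only the standard library. This is a minimal sketch: the function names are hypothetical, and sending the topology as a JSON body with a Content-Type: application/json header is an assumption about what the POST endpoint expects.

```python
import json
import urllib.request
from urllib.parse import quote


def impact_url(base_url, topology_name, device="each"):
    """Build the URL of the down-impact endpoint for a topology and device."""
    return "{}/topology/{}/device/{}/down/impact".format(
        base_url.rstrip("/"), quote(topology_name, safe=""), quote(device, safe="")
    )


def get_impact(base_url, topology_name, device="each", timeout=10.0):
    """Run a simulation on a stored topology (GET endpoint)."""
    with urllib.request.urlopen(impact_url(base_url, topology_name, device), timeout=timeout) as resp:
        return json.load(resp)


def post_custom_impact(base_url, topology, device="each", timeout=10.0):
    """Run a simulation on a topology sent in the request body (POST endpoint).

    Assumption: the body is the topology encoded as JSON.
    """
    request = urllib.request.Request(
        impact_url(base_url, "custom", device),
        data=json.dumps(topology).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, get_impact("http://127.0.0.1:8080", "full_topology_with_issues") would query the each variant on a locally running instance.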

Anomaly detection

  • GET /topology/:topology_name/anomalies: get topology anomalies

It lists all anomalies in the topology graph.

Link anomalies

A node is not connected properly to the graph.

For example:

  • a ToR does not have any uplinks
  • a spine does not have any downlinks or any uplinks
  • an edge does not have any downlinks

This does not consider the status of the link, it only checks if there is a link.
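
The link checks described above can be sketched in a few lines of Python. This is not ClawNetwork's implementation, only an illustration of the idea. It assumes the JSON topology structure documented in this README (nodes with hostname and role, links with south_node and north_node) and the default top and bottom roles, "edge" and "tor". The function name find_link_anomalies is hypothetical.

```python
def find_link_anomalies(topology, top_role="edge", bottom_role="tor"):
    """Return {hostname: [issues]} for nodes missing uplinks or downlinks.

    Link status is deliberately ignored: only the existence of a link counts.
    """
    has_uplink = set()
    has_downlink = set()
    for link in topology.get("links", []):
        has_uplink.add(link["south_node"])    # the south end has an uplink
        has_downlink.add(link["north_node"])  # the north end has a downlink

    anomalies = {}
    for node in topology.get("nodes", []):
        name, role = node["hostname"], node["role"]
        issues = []
        # Every node except the top layer is expected to have an uplink.
        if role != top_role and name not in has_uplink:
            issues.append("no uplink")
        # Every node except the bottom layer is expected to have a downlink.
        if role != bottom_role and name not in has_downlink:
            issues.append("no downlink")
        if issues:
            anomalies[name] = issues
    return anomalies


example = {
    "nodes": [
        {"hostname": "tor-01-01", "role": "tor"},
        {"hostname": "tor-01-02", "role": "tor"},
        {"hostname": "fabric-1-01", "role": "fabric"},
        {"hostname": "edge-0", "role": "edge"},
    ],
    "links": [
        {"south_node": "tor-01-01", "north_node": "fabric-1-01"},
        {"south_node": "fabric-1-01", "north_node": "edge-0"},
    ],
}
print(find_link_anomalies(example))
```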

Topology structure

The topology to provide looks like this in JSON:

{
  "nodes": [
    {
      "hostname": "tor-01-01",
      "role": "tor",
      "status": true,
      "layer": 1
    },
    {
      "hostname": "fabric-1-01",
      "role": "fabric",
      "status": true,
      "layer": 2
    }
  ],
  "links": [
    {
      "south_node": "tor-01-01",
      "north_node": "fabric-1-01",
      "status": true,
      "uid": "10.0.0.0->10.0.0.1"
    }
  ]
}

This structure is subject to change, as the API is not considered stable at the moment.
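
Before pushing a topology, a client may want to check that it matches the structure above. The sketch below does so in Python. It assumes that every field shown in the example is required and that links must reference declared nodes; neither rule is confirmed by ClawNetwork itself, and validate_topology is a hypothetical helper.

```python
import json

# Fields taken from the JSON example above; treating them all as required is an assumption.
NODE_FIELDS = {"hostname", "role", "status", "layer"}
LINK_FIELDS = {"south_node", "north_node", "status", "uid"}


def validate_topology(topology):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    hostnames = set()
    for node in topology.get("nodes", []):
        missing = NODE_FIELDS - node.keys()
        if missing:
            problems.append(f"node {node.get('hostname', '?')}: missing {sorted(missing)}")
        hostnames.add(node.get("hostname"))
    for link in topology.get("links", []):
        missing = LINK_FIELDS - link.keys()
        if missing:
            problems.append(f"link {link.get('uid', '?')}: missing {sorted(missing)}")
        # Both ends of a link should point at a declared node.
        for end in ("south_node", "north_node"):
            if end in link and link[end] not in hostnames:
                problems.append(f"link {link.get('uid', '?')}: unknown {end} {link[end]}")
    return problems


topology = {
    "nodes": [
        {"hostname": "tor-01-01", "role": "tor", "status": True, "layer": 1},
        {"hostname": "fabric-1-01", "role": "fabric", "status": True, "layer": 2},
    ],
    "links": [
        {"south_node": "tor-01-01", "north_node": "fabric-1-01",
         "status": True, "uid": "10.0.0.0->10.0.0.1"},
    ],
}
print(validate_topology(topology))
payload = json.dumps(topology)  # JSON document ready to be sent to the API
```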

Example

Topology = 4 healthy fabric nodes + 4 healthy ToR

Simulations:

  • first simulation considering first fabric node as down
  • second simulation considering second fabric node as down but with the first up
  • ...
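
The simulation loop described above can be sketched in Python as follows. This is not ClawNetwork's actual algorithm. It assumes that a ToR counts as impacted when it could reach a healthy top-level device before the failure but cannot afterwards, following only healthy nodes and links. All function names are hypothetical, and because the toy topology stops at the fabric layer, top_role is set to "fabric" for the demonstration.

```python
from collections import defaultdict


def reachable_tors(topology, extra_down=(), top_role="edge", bottom_role="tor"):
    """Return the set of ToRs that still have a healthy path to a top-level device."""
    nodes = {n["hostname"]: n for n in topology["nodes"]}
    down = set(extra_down)
    healthy = {h for h, n in nodes.items() if n["status"] and h not in down}
    uplinks = defaultdict(list)
    for link in topology["links"]:
        if link["status"]:
            uplinks[link["south_node"]].append(link["north_node"])

    def reaches_top(name, seen):
        if name not in healthy or name in seen:
            return False
        if nodes[name]["role"] == top_role:
            return True
        seen.add(name)
        return any(reaches_top(north, seen) for north in uplinks[name])

    return {h for h, n in nodes.items()
            if n["role"] == bottom_role and reaches_top(h, set())}


def simulate_each(topology, top_role="edge", bottom_role="tor"):
    """Simulate the failure of each non-ToR device, one at a time."""
    baseline = reachable_tors(topology, (), top_role, bottom_role)
    results = {}
    for node in topology["nodes"]:
        if node["role"] == bottom_role:
            continue
        after = reachable_tors(topology, [node["hostname"]], top_role, bottom_role)
        results[node["hostname"]] = sorted(baseline - after)
    return results


def example_topology(healthy_fabrics=4):
    """Build 4 fabric nodes and 4 ToRs, each ToR linked to every fabric node."""
    fabrics = [f"fabric-1-0{i}" for i in range(1, 5)]
    tors = [f"tor-01-0{i}" for i in range(1, 5)]
    nodes = [{"hostname": f, "role": "fabric", "status": i < healthy_fabrics, "layer": 2}
             for i, f in enumerate(fabrics)]
    nodes += [{"hostname": t, "role": "tor", "status": True, "layer": 1} for t in tors]
    links = [{"south_node": t, "north_node": f, "status": True, "uid": f"{t}->{f}"}
             for t in tors for f in fabrics]
    return {"nodes": nodes, "links": links}


print(simulate_each(example_topology(), top_role="fabric"))
```

With four healthy fabric nodes, no single failure impacts any ToR. Calling example_topology(healthy_fabrics=1) leaves a single healthy fabric node, whose failure then impacts all four ToRs.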

Example usecase

You can query the following endpoint to simulate the down impact of each device. It uses the example topology from example/full_topology_with_issues.json.

$ curl http://127.0.0.1:8080/topology/full_topology_with_issues/device/each/down/impact | jq
{
  "scenarios_result": {
    "edge-0": {
      "impacts": null,
      "parameters": {
        "devices_down": [
          "edge-0"
        ],
        "links_down": null
      }
    },
    "edge-1": {
      "impacts": null,
      "parameters": {
        "devices_down": [
          "edge-1"
        ],
        "links_down": null
      }
    },
    "fabric-1-01": {
      "impacts": [
        "tor-01-01"
      ],
      "parameters": {
        "devices_down": [
          "fabric-1-01"
        ],
        "links_down": null
      }
    },
    ...,
    "compute_time": "89 ms"
}

As you can see, tor-01-01 would be down if we shut down fabric-1-01.

The topology defined in example/full_topology_with_issues.json also marks some devices as down. Here, all the fabric nodes of pod 01 have been set to down except for fabric-1-01. This is why a failure of this device impacts tor-01-01: this ToR has only one healthy uplink left.
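
To pick out only the scenarios that impact a ToR from such a response, a client could filter the decoded JSON as sketched below. The helper name devices_with_impact is hypothetical. The sketch only relies on the scenarios_result, impacts and parameters keys visible in the output above, and it skips any non-object entry such as compute_time.

```python
def devices_with_impact(response):
    """Return {device: impacted ToRs} for scenarios whose impacts are not empty."""
    scenarios = response.get("scenarios_result", {})
    return {
        device: scenario["impacts"]
        for device, scenario in scenarios.items()
        # Skip non-scenario entries and scenarios with null or empty impacts.
        if isinstance(scenario, dict) and scenario.get("impacts")
    }


# Shortened copy of the response shown above (None stands for JSON null).
sample = {
    "scenarios_result": {
        "edge-0": {
            "impacts": None,
            "parameters": {"devices_down": ["edge-0"], "links_down": None},
        },
        "fabric-1-01": {
            "impacts": ["tor-01-01"],
            "parameters": {"devices_down": ["fabric-1-01"], "links_down": None},
        },
    }
}
print(devices_with_impact(sample))
```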

Note: more advanced examples will be provided soon, with more complex scenarios.

Integrations

Below are some ideas of possible integrations:

  • the client pushes the topology along with the simulation request. The topology is not stored.
+-------------------------+
|  Observability metrics  |
|   example: Prometheus   |
+-------------------------+
             ^
             |
             | get metrics
             |
             |
             |
 +-----------------------+
 |                       |           get impact
 |        Client         |        on custom topology        +---------------+
 |   => convert metrics  |--------------------------------->|  ClawNetwork  |
 |      to topology      |                                  +---------------+
 +-----------------------+
  • the client provides the topologies and they are stored
+-------------------------+
|  Observability metrics  |
|   example: Prometheus   |
+-------------------------+
             ^
             |
             | get metrics
             |
             |
             |
 +-----------------------+
 |        Client         |       push topology      +---------------+      save topology       +-------------------------+
 |   => convert metrics  |------------------------->|  ClawNetwork  |<------------------------>| Storage (FS, redis,...) |
 |      to topology      |        get impact        +---------------+       get topology       +-------------------------+
 +-----------------------+
  • dedicated topology provider
                                                 +---------------------+
+-------------------------+                      |  Topology provider  |
|  Observability metrics  | <------------------- | => convert metrics  |
+-------------------------+                      |    to topology      |
                                                 +---------------------+
                                                            |
                                                            |
                                                            | push topology
                                                            |
                                                            |
                                                            |
                                                            v
 +-----------------------+        get impact        +---------------+      save topology       +-------------------------+
 |        Client         |------------------------->|  ClawNetwork  |<------------------------>| Storage (FS, redis,...) |
 +-----------------------+                          +---------------+       get topology       +-------------------------+