# Resilience

- [Basic: Load Balance](#basic-load-balance)
- [More Livingness: Resilience of Service](#more-livingness-resilience-of-service)
  - [CircuitBreaker](#circuitbreaker)
  - [RateLimiter](#ratelimiter)
  - [Retry](#retry)
  - [TimeLimiter](#timelimiter)
- [References](#references)
  - [CircuitBreaker](#circuitbreaker-1)
  - [RateLimiter](#ratelimiter-1)
  - [Retry](#retry-1)
  - [Concepts](#concepts)

As a Cloud Native traffic orchestrator, Easegress supports built-in resilience features. Resilience is the ability of your system to react to failure and still remain functional. It's not about avoiding failure, but accepting failure and constructing your cloud-native services to respond to it, so that you return to a fully functioning state as quickly as possible.[1]

## Basic: Load Balance

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
```

## More Livingness: Resilience of Service

### CircuitBreaker

CircuitBreaker leverages a finite state machine to implement its processing logic. The state machine has three states: `CLOSED`, `OPEN`, and `HALF_OPEN`. When the state is `CLOSED`, requests pass through normally; the state transits to `OPEN` if the request failure rate or slow request rate reaches a configured threshold, and requests are short-circuited in this state. After a configured duration, the state transits from `OPEN` to `HALF_OPEN`, in which a limited number of requests are permitted to pass through while other requests are still short-circuited, and the state transits to `CLOSED` or `OPEN` based on the results of the permitted requests.

When `CLOSED`, the circuit breaker uses a sliding window to store and aggregate the results of recent requests. The window can either be `COUNT_BASED` or `TIME_BASED`: the `COUNT_BASED` window aggregates the last N requests and the `TIME_BASED` window aggregates requests in the last N seconds, where N is the window size.

Below is an example configuration with a `COUNT_BASED` policy. `GET` requests to paths beginning with `/books/` use this policy, which short-circuits requests if more than half of the last 100 requests failed with status code 500, 503, or 504.

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
    circuitBreakerPolicy: countBased
    failureCodes: [500, 503, 504]
resilience:
- name: countBased
  kind: CircuitBreaker
  slidingWindowType: COUNT_BASED
  failureRateThreshold: 50
  slidingWindowSize: 100
```

We can also use a `TIME_BASED` policy, which short-circuits requests if more than 60% of the requests within the last 200 seconds failed.

```yaml
resilience:
- name: time-based-policy
  kind: CircuitBreaker
  slidingWindowType: TIME_BASED
  failureRateThreshold: 60
  slidingWindowSize: 200
```

In addition to failures, we can also short-circuit slow requests. The configuration below regards requests which cost more than 30 seconds as slow requests and short-circuits requests if 60% of recent requests are slow.

```yaml
resilience:
- name: countBased
  kind: CircuitBreaker
  slowCallRateThreshold: 60
  slowCallDurationThreshold: 30s
```
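The failure-based and slow-call criteria are not mutually exclusive: as the full reference YAML at the end of this document also shows, a single policy can declare both, evaluated over the same sliding window. Below is a minimal sketch combining the values from the two examples above (the numbers are only illustrative).

```yaml
resilience:
- name: countBased
  kind: CircuitBreaker
  slidingWindowType: COUNT_BASED
  slidingWindowSize: 100            # aggregate the last 100 requests
  failureRateThreshold: 50          # trip when more than half of them failed
  slowCallRateThreshold: 60         # also trip when 60% of them were slow
  slowCallDurationThreshold: 30s    # a request slower than 30s counts as slow
```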
For a policy, if the first request fails, the failure rate could be 100% because there is only one request. This is not the desired behavior in most cases; we can avoid it by specifying `minimumNumberOfCalls`.

```yaml
resilience:
- name: countBased
  kind: CircuitBreaker
  minimumNumberOfCalls: 10
```

We can also configure the wait duration in the `open` state and the max wait duration in the `half-open` state:

```yaml
resilience:
- name: countBased
  kind: CircuitBreaker
  waitDurationInOpenState: 2m
  maxWaitDurationInHalfOpenState: 1m
```

In the `half-open` state, we can limit the number of permitted requests:

```yaml
resilience:
- name: countBased
  kind: CircuitBreaker
  permittedNumberOfCallsInHalfOpenState: 10
```

For the full YAML, see [here](#circuitbreaker-1), and please refer to [CircuitBreaker Policy](../07.Reference/7.01.Controllers.md#circuitbreaker-policy) for more information.

### RateLimiter

> NOTE: When there are multiple instances of Easegress, the configuration
> is applied to every instance equally. For example, if the TPS of a RateLimiter
> is configured as 100 in a 3-instance cluster, the total TPS will be 300.

The below configuration limits the request rate for requests to `/admin` and requests that match the regular expression `^/pets/\d+$`.

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: rate-limiter
- filter: proxy
filters:
- name: rate-limiter
  kind: RateLimiter
  policies:
  - name: policy-example
    timeoutDuration: 100ms
    limitRefreshPeriod: 10ms
    limitForPeriod: 50
  defaultPolicyRef: policy-example
  urls:
  - methods: [GET, POST, PUT, DELETE]
    url:
      exact: /admin
      regex: ^/pets/\d+$
    policyRef: policy-example
- name: proxy
  kind: Proxy
```

For the full YAML, see [here](#ratelimiter-1).

### Retry

If we want to retry a failed request, for example, retry on HTTP status codes 500, 503, and 504, we can create a `Retry` policy with the below configuration. It makes at most 3 attempts on failure.

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
    retryPolicy: retry3Times
    failureCodes: [500, 503, 504]
resilience:
- name: retry3Times
  kind: Retry
  maxAttempts: 3
  waitDuration: 500ms
```

By default, the wait duration between two attempts is `waitDuration`, but this can be changed by specifying `backOffPolicy` and `randomizationFactor`.

```yaml
resilience:
- name: retry3Times
  kind: Retry
  backOffPolicy: Exponential
  randomizationFactor: 0.5
```

For the full YAML, see [here](#retry-1), and please refer to [Retry Policy](../07.Reference/7.01.Controllers.md#retry-policy) for more information.
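Resilience policies of different kinds can also be referenced from the same pool. The sketch below simply places the `retryPolicy` and `circuitBreakerPolicy` references from the earlier examples side by side; whether this combination fits depends on your scenario, and the field placement follows the individual examples in this document.

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    loadBalance:
      policy: roundRobin
    retryPolicy: retry3Times          # references the Retry policy defined below
    circuitBreakerPolicy: countBased  # references the CircuitBreaker policy defined below
    failureCodes: [500, 503, 504]
resilience:
- name: retry3Times
  kind: Retry
  maxAttempts: 3
  waitDuration: 500ms
- name: countBased
  kind: CircuitBreaker
  slidingWindowType: COUNT_BASED
  failureRateThreshold: 50
  slidingWindowSize: 100
```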
### TimeLimiter

TimeLimiter limits the time of requests: a request is canceled if it cannot get a response within the configured duration. As this resilience type only requires configuring a timeout duration, it is implemented directly on filters like `Proxy`.

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
    timeout: 500ms
```

## References

### CircuitBreaker

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
    circuitBreakerPolicy: countBasedPolicy
    failureCodes: [500, 503, 504]
resilience:
- name: countBasedPolicy
  kind: CircuitBreaker
  slidingWindowType: COUNT_BASED
  failureRateThreshold: 50
  slidingWindowSize: 100
  slowCallRateThreshold: 60
  slowCallDurationThreshold: 30s
  minimumNumberOfCalls: 10
  waitDurationInOpenState: 2m
  maxWaitDurationInHalfOpenState: 1m
  permittedNumberOfCallsInHalfOpenState: 10
- name: timeBasedPolicy
  kind: CircuitBreaker
  slidingWindowType: TIME_BASED
  failureRateThreshold: 60
  slidingWindowSize: 200
```

### RateLimiter

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: rate-limiter
- filter: proxy
filters:
- name: rate-limiter
  kind: RateLimiter
  policies:
  - name: policy-example
    timeoutDuration: 100ms
    limitRefreshPeriod: 10ms
    limitForPeriod: 50
  defaultPolicyRef: policy-example
  urls:
  - methods: [GET, POST, PUT, DELETE]
    url:
      exact: /admin
      regex: ^/pets/\d+$
    policyRef: policy-example
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
```

### Retry

```yaml
name: pipeline-reverse-proxy
kind: Pipeline
flow:
- filter: proxy
filters:
- name: proxy
  kind: Proxy
  pools:
  - servers:
    - url: http://127.0.0.1:9095
    - url: http://127.0.0.1:9096
    - url: http://127.0.0.1:9097
    loadBalance:
      policy: roundRobin
    retryPolicy: retry3Times
    failureCodes: [500, 503, 504]
resilience:
- name: retry3Times
  kind: Retry
  backOffPolicy: Exponential
  randomizationFactor: 0.5
  maxAttempts: 3
  waitDuration: 500ms
```

### Concepts

1. https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/resiliency