
FaQueue

This repo contains code to experiment with different background jobs fairness strategies.

About

The problem of fair background job processing usually occurs in multi-tenant applications that use shared queues to process tasks (e.g., with Sidekiq). Imagine you have a named queue to process some tasks, say, outgoing email or SMS notifications. This single queue serves all the tenants, big and small. Whenever a large tenant triggers a massive notification delivery (User.find_each { UserMailer.notify_smth.deliver_later }), it enqueues thousands of jobs. Since delivering a notification takes some time (hundreds of milliseconds or even seconds), the job processor becomes occupied by this tenant; others experience noticeable delays.

There are multiple ways to mitigate this problem, and we research them in this repo.

Since this is pure academic research, we avoid using any specific executor. Instead, we use Ruby 3 Ractors for concurrent job execution. However, we assume that a background job executor has a fixed, predefined set of queues (like Sidekiq); thus, we cannot use a queue-per-tenant approach and need to introduce fairness on the client side.

The method

We perform the following experiment for each strategy.

Given N numbers, each representing the number of jobs to enqueue per tenant, we enqueue N background (bulk) jobs. Each bulk job enqueues N[i] jobs.

Then we wait for all jobs to complete.

We measure the latency for each executed job and calculate the following statistical data:

  • Mean and p90 latency per tenant.
  • Mean and p90 latency of the first K jobs (head) per tenant, where K = min(N[i]).
  • Standard deviation for the calculated means and percentiles.

The most important metric is the standard deviation of the heads' means/percentiles. Why so? We're interested in minimizing the delays caused by large tenants. In other words, we want jobs from all tenants to be executed at the same speed. On the other hand, it's OK for latency to grow if we enqueue thousands of jobs, but that should not affect those who enqueue dozens.
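The statistics above can be sketched as follows (a hypothetical helper, not the repo's code; `latencies` maps each tenant to its per-job latencies in execution order):

```ruby
# Standard deviation of a list of values
def stddev(values)
  mean = values.sum / values.size.to_f
  Math.sqrt(values.sum { |v| (v - mean)**2 } / values.size)
end

# Compute the spread of the per-tenant head metrics.
# latencies: { tenant_id => [latency, latency, ...] }
def head_stats(latencies)
  k = latencies.values.map(&:size).min # K = min(N[i])
  heads = latencies.values.map { |ls| ls.first(k) }

  means = heads.map { |ls| ls.sum / ls.size.to_f }
  p90s  = heads.map { |ls| ls.sort[(0.9 * (ls.size - 1)).round] }

  {mean_stddev: stddev(means), p90_stddev: stddev(p90s)}
end
```

A zero standard deviation means all tenants' heads were served at the same speed, regardless of how many jobs each tenant enqueued in total.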

Usage

Run a specific strategy emulation like this:

ruby baseline.rb -c 16 -n 300,20,500,200,1000,120

You will see a visualization of executed jobs (each color represents a particular tenant) and the final statistics information (see below).

To learn about the available CLI options, use the -h switch:

$ ruby baseline.rb -h
Baseline: just a queue, totally unfair
    -n, --number=SCALES              The total number of jobs to enqueue per tenant (comma-separated)
    -c, --concurrency=CONCURRENCY    The concurrency factor (depends on implementation)

Strategies

Baseline

This is the default behavior: no fairness at all.

Baseline profile

Shuffle shards

This strategy is described here.
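The core idea of shuffle sharding can be sketched like this (a minimal illustration, not the repo's implementation; shard counts are made up): each tenant deterministically maps to a small subset of shards, so two tenants rarely collide on all of their shards at once.

```ruby
require "digest"

TOTAL_SHARDS = 4
SHARDS_PER_TENANT = 2

# A stable pseudo-random shard subset for a tenant
def tenant_shards(tenant_id)
  # Seed a PRNG with a hash of the tenant id for determinism
  seed = Digest::SHA256.hexdigest(tenant_id.to_s).to_i(16)
  (0...TOTAL_SHARDS).to_a.shuffle(random: Random.new(seed)).first(SHARDS_PER_TENANT)
end

# Enqueue each job to a random shard from the tenant's subset
def pick_shard(tenant_id)
  tenant_shards(tenant_id).sample
end
```

Even if one tenant floods both of its shards, most other tenants still have at least one uncontended shard in their subset.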

With two shards:

Shuffle shards (2) profile

With three shards:

Shuffle shards (3) profile

With four shards (unfair):

Shuffle shards (4) profile

With four shards (fair):

Shuffle shards (4) profile 2

With four shards total and each batch using two shards:

Shuffle shards (4, 2 per batch) profile

Defined Shards

This approach assumes assigning a shard to each tenant. If we knew how to distribute tenants across the shards so that they do not block each other, that would be a good solution. However, that task by itself is not easy (and a totally different story).
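In its simplest form, this is just a static lookup table (a hypothetical sketch; the tenant names and shard labels are made up):

```ruby
# Manually curated tenant-to-shard assignment
SHARD_MAP = {
  "big_tenant"    => :shard_1,
  "medium_tenant" => :shard_2
}.freeze

DEFAULT_SHARD = :shard_3

def shard_for(tenant_id)
  SHARD_MAP.fetch(tenant_id, DEFAULT_SHARD)
end
```

The hard (and unsolved) part is choosing this mapping so that co-located tenants do not block each other.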

Predefined shards

Throttling + Rescheduling

This approach was implemented in one of the Evil Martians projects back in the day.

The idea is the following: we define a cooldown period for each tenant, i.e., a period during which only a single job is allowed to be performed (actually, enqueued). Every time a job is executed, we store a deadline (now + cooldown) in a distributed cache. If the next job arrives earlier than the deadline, we increase the deadline and re-schedule this job to be executed later.
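The deadline logic can be sketched as follows (an illustrative snippet, using an in-memory Hash in place of the distributed cache; the cooldown value is arbitrary):

```ruby
COOLDOWN = 1.0 # seconds; a made-up value for illustration
DEADLINES = {} # tenant_id => deadline timestamp (stand-in for a distributed cache)

# Returns nil if the job may run now, or the number of seconds
# to re-schedule it with otherwise.
def reschedule_delay(tenant_id, now = Time.now.to_f)
  deadline = DEADLINES[tenant_id] || now

  if deadline <= now
    # No contention: run now and set the next deadline
    DEADLINES[tenant_id] = now + COOLDOWN
    nil
  else
    # Arrived too early: push the deadline further and retry at the old one
    DEADLINES[tenant_id] = deadline + COOLDOWN
    deadline - now
  end
end
```

Each early arrival pushes the tenant's deadline further out, so a burst of jobs gets spread over time instead of monopolizing the workers.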

Throttling/Rescheduling profile

See the example implementation for Sidekiq.

Interruptible iteration

This approach is inspired by the job-iteration technique used at Shopify: instead of enqueuing atomic jobs for each batch, we perform them synchronously in a loop and pause the iteration if it has taken more than the specified amount of time. "Pause" means re-enqueuing the current job with a cursor indicating the starting point for the next iteration.
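The pause-and-resume loop can be sketched like this (a hypothetical helper; the budget value and function name are made up):

```ruby
MAX_RUNTIME = 2.0 # seconds; an illustrative time budget

# Processes records starting from cursor, yielding each one.
# Returns nil when finished, or the cursor to re-enqueue with
# once the time budget is exceeded.
def perform_batch(records, cursor: 0, budget: MAX_RUNTIME)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  while cursor < records.size
    yield records[cursor] # perform a single unit of work
    cursor += 1

    # Budget exceeded: pause by handing the cursor back to the caller
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
    return cursor if elapsed > budget
  end

  nil # finished; nothing to re-enqueue
end
```

The caller re-enqueues the job with the returned cursor, so other tenants' jobs get a chance to run between iterations.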

Iteration profile

You can achieve similar behaviour with Sidekiq via job-iteration by configuring an appropriate max runtime:

JobIteration.max_job_runtime = 2.minutes

Balanced shards (or fast-medium-slow)

This strategy is somewhat similar to sharding, but a shard is chosen for each job depending on the current tenant's usage so far: as a tenant enqueues more jobs, they are switched from the fast queue to the medium queue first and to the slow queue after that. Virtual credits are used to evaluate the usage; they are refilled periodically.
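The credit-based routing can be sketched as follows (an illustrative snippet; the thresholds, credit amounts, and queue names are made up):

```ruby
# Each tenant starts with a full credit balance
CREDITS = Hash.new(100)

# Spend one credit per enqueue; fall through to slower
# queues as the balance drains
def queue_for(tenant_id)
  credits = CREDITS[tenant_id]
  CREDITS[tenant_id] = credits - 1

  if credits > 50 then :fast
  elsif credits > 0 then :medium
  else :slow
  end
end

# Refill periodically (e.g., from a scheduled job), capped at the maximum
def refill!(amount = 100)
  CREDITS.transform_values! { |c| [c + amount, 100].min }
end
```

Heavy tenants drain their credits quickly and end up in the slow queue, while light tenants keep using the fast one.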

Balanced shards profile

As we can see, routing jobs to queues based simply on the tenant's usage can cause an underutilization problem when the load is not constant.

Continuous batches

In real life, we do not enqueue all the jobs at the same time and wait for them to complete: tenants enqueue jobs at different moments in time and at different rates. You can emulate such scenarios by passing batches of jobs with delays between them using the following format:

ruby baseline.rb -c 8 -n "100|5s|100|5s|100,10|5s|10,500,50|5s|100|5s|50,200|4s|400|10s|200|5s|200"

# or
ruby throttle_schedule.rb -c 8 -n "200|5s|100|20s|200|6s|200|14s|100,10|15s|10|14s|10|2s|10"

Here -n accepts a comma-separated list of tenant configs, where each value has the form <batch_size>|<delay>|<batch_size>|<delay>|.... Note that we use an s suffix for delays, but it's added only for readability and can be omitted.
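A parser for a single tenant config could look like this (a hypothetical sketch, not the repo's code):

```ruby
# Parse "<batch_size>|<delay>|<batch_size>|..." into batch/delay pairs;
# the trailing batch has no delay after it.
def parse_tenant(config)
  config.split("|").each_slice(2).map do |batch, delay|
    {batch: Integer(batch), delay: delay&.delete_suffix("s")&.to_i}
  end
end

parse_tenant("100|5s|100")
# => [{batch: 100, delay: 5}, {batch: 100, delay: nil}]
```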

In this case, each tenant could have multiple heads separated by gaps in enqueueing jobs. The default gap threshold is 10s; you can change it via --stats-reset-interval.
