The transport layer of the Raft protocol should deliver RPC requests and snapshots. Rafter employs the Seastar's RPC framework and refers to the design of the ScyllaDB's messaging service.
To fully utilize the hardware resources and honor the Seastar's shared-nothing architecture, transport layer is also sharded and maintains a shard-to-node connection map.
- Each shard will connect to a remote node (the shard with overlapping raft clusters) independently
- the target shard is specified with the help of Seastar's port policy
- all shards use a special meta gateway to exchange the knowledge of hosting nodes (e.g. shard count, raft cluster info, etc.)
- For a cross-shard message, e.g. the connection delivers a
cluster#2
's message fromnode#1.shard#2
tonode#3.shard#0
, shared-nothing is honored andsmp::submit_to
is invoked to forward this message tonode#3.shard#2
- Once a message from a new node is found, the knowledge will propagate to all shards to notify the join/rejoin of the node
- Once a broken pipe is found in one shard, the exception will propagate to all shards to notify the lost of the node
The implementation is still evolving to achieve the above goals.
Once received a send_snapshot
request, transport layer will instantiate a new express
object for snapshot loading
and sending. The snapshot files (snapshot itself and corresponding disk files specified by the replicated state machine)
will be loaded and sent concurrently as they belong to two different IO stack:
- load snapshot chunks (disk IO) and push them to the worker queue
- fetch chunks from worker queue and send them out (network IO)
TODO: introduce the receiving-side management of ongoing/out-dated snapshots.
TODO: introduce the registry.