
Configuring node "flavors" in a cluster #1338

Open
kocolosk opened this issue May 23, 2018 · 10 comments

@kocolosk
Member

For a while now we’ve had this capability to control the placement of database shards on various nodes throughout a cluster by using the “zone” attribute of documents in the /nodes DB:

https://docs.couchdb.org/en/2.1.1/cluster/databases.html?highlight=zone#placing-a-database-on-specific-nodes

Recently I’ve been exploring whether we might benefit from introducing an ability for nodes in a cluster to further specialize in their roles; e.g., a pool of nodes in a cluster could be reserved just for HTTP processing and request coordination (but not data storage), another pool could mediate replication jobs, etc. I think there are a lot of benefits to this design in busy clusters, particularly improved observability: the specific nature of the workload consuming resources on each node becomes much clearer.

I think any work in this direction would benefit from a richer interface for labeling nodes with specific attributes. The current "zone" design has a couple of limitations that I see:

  1. zone is a single tag, whereas we may want to assign multiple labels / tags / roles to a node
  2. zone can only be set once the node is up and running, which complicates cluster setup

I'd like to solicit input on an alternative design that would be more flexible and user-friendly. I would certainly recommend that we support setting these properties in our normal config files (although we may still find it important internally to copy them into /nodes so that every node in the cluster learns about the capabilities of its peers). One thought I have is introducing a new top-level configuration section like so:

[node]
coordinator = true ; listens on HTTPS, issues fabric requests
replicator = false ; will not mediate any replication jobs
storage = false ; will not store any user data
zone = us-metro-1a

Of course, coordinator, replicator, and storage would all default to true, which would recover the current configuration where all cluster nodes share in all workloads equally.
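As a concrete illustration of those defaults, the two configs below would be equivalent, since an omitted role key is treated as true (these keys are the ones sketched above, not an implemented feature):

; explicit form
[node]
coordinator = true
replicator = true
storage = true
zone = us-metro-1a

; equivalent shorthand, relying on the defaults
[node]
zone = us-metro-1a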

What do you all think?

@iilyak
Contributor

iilyak commented May 24, 2018

I think that we need to specialize node types for the following reasons:

  1. we want to modernize the CouchDB API and deprecate interfaces that don't scale
  2. we want to experiment with the latest research on distributed databases
  3. we want to scale the different parts independently
  4. it makes economic sense to run some types of nodes in a containerized environment
  5. we might want to consider a rich client architecture

In order to achieve #1 we need to introduce API versioning. Since it would take time to update all clients that use CouchDB, we would have to support multiple versions of the API for a while. How can we do it? Option 1) restructure the codebase to have a dispatcher module plus separate versions of the modules implementing the HTTP endpoints and fabric logic, or option 2) split the node into api and storage roles and connect them using a stable (non Erlang-distribution-based) protocol. I believe the second approach is better because the api node for an old version of the API could then be implemented using the adapter pattern, which makes it possible to run two versions of api nodes at the same time.
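To make option 2 concrete, here is a rough sketch of two api nodes serving different API versions in front of a shared storage pool, reusing the [node] keys proposed above; each block is a different node's local ini, and the api_version key is purely hypothetical:

; api node serving the legacy API through an adapter layer
[node]
coordinator = true
replicator = false
storage = false
api_version = 1 ; hypothetical setting, not part of the proposal above

; api node serving the new API
[node]
coordinator = true
replicator = false
storage = false
api_version = 2 ; hypothetical setting, not part of the proposal above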

This architecture makes it possible to instantiate dedicated api nodes per user. In that case we could:

  • throttle requests on api nodes based on the current load of the storage layer
  • place api nodes in a VPC available to the customer and tunnel traffic to the storage backend
  • implement an application firewall on the api node
  • over time, move the authentication and authorization modules to the api node so unauthorized requests do not reach the storage layer at all

The same goes for the replicator. We might want to support a new version of the replication protocol based on modern, fast set-reconciliation algorithms, and we need a way to run two versions of the protocol side by side.

asfgit pushed a commit that referenced this issue May 24, 2018
This patch allows administrators to configure settings in the [node]
block of the server configuration and have those settings propagated to
the internal nodes database. A current use case for this functionality
is to configure the `zone` of the node for use in cluster placement
topologies. Additional attributes can be written directly into the
document in the nodes DB, but in case of a conflict between the entry
in the database and the entry in the server configuration the latter
will be preferred (and the attribute in the database overwritten).

Related to #1338.
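As an illustrative example of what that patch enables (node name and revision are placeholders): a setting like

[node]
zone = us-metro-1a

in a node's local config would be propagated into that node's document in the internal nodes database, roughly:

{
  "_id": "couchdb@node1.example.com",
  "_rev": "1-abc",
  "zone": "us-metro-1a"
}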
kocolosk added a commit to apache/couchdb-documentation that referenced this issue May 24, 2018
@wohali
Member

wohali commented May 25, 2018

It occurs to me: We have some of these things already in the GET / response; the idea was to expose different features there. It'd be good if we surfaced these ideas via the same mechanism.
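For reference, the GET / response on a 2.1.x node already looks roughly like the sketch below, with a features array; the roles field here is purely hypothetical, just to show where such information could be surfaced:

{
  "couchdb": "Welcome",
  "version": "2.1.1",
  "features": ["scheduler"],
  "roles": ["coordinator", "replicator", "storage"]
}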

Still munching over the idea in general. Other than proposing the general infrastructure for this, @kocolosk, are you willing to share what other roles you're considering, or alternative deployment architectures that differ from the current configuration? That would help me reason about the proposal more concretely.

@kocolosk
Member Author

Good point on the capabilities thing in the GET / response @wohali.

The gist of my thinking is to make it possible to have the different capabilities inside a CouchDB cluster delivered by processes that have a slightly higher degree of isolation between them (say, separate containers). I think the best ROI comes from these different groupings:

  1. replicator - replication jobs can be fairly resource-intensive on the cluster node where they are mediated. Technically one could just spin up a separate cluster for running replications (and that's not a bad idea) but this would enable some isolation of the replication resources without any changes to the API.
  2. coordinator / api (I like @iilyak 's suggestion on the name here) - these containers would have just enough ephemeral storage to hold the internal databases like /_nodes and /_dbs that are replicated to every node. They would handle client TCP connections, run the chttpd logic, and submit fabric_rpc requests to the storage nodes hosting the actual user data. One could auto-scale this group up and down, although we'd likely want to make the /_up endpoint smarter so a load balancer in front of this tier could automatically determine when a new api container is actually ready to start handling requests (i.e., it has replicated the full content of /_nodes and /_dbs).
  3. storage - these containers would not do any clustered coordination or HTTP traffic; they would simply receive fabric_rpc requests coming in on rexi_server and respond to them. (Of course we could leave the fabric and chttpd apps running for debugging purposes here, but the point is that the workload running on this container is simpler.)
  4. compute / functions - this would be a future item, but if we could iterate on the view server protocol and come up with something that made sense over the wire, I could see running an autoscaled pool of JS processes separated from the storage containers. This could give us some additional sandboxing tools and also allow for an easier and more efficient response to variable compute demands.

In this design steps 1-3 don't change anything about the communication protocols and interfaces used to communicate between different parts of the stack inside CouchDB. Each of those containers is a full Erlang VM and a full member of the distributed Erlang cluster. In the future it may well make sense to investigate some alternatives there but no changes would be required initially.
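Purely for illustration, the per-tier configuration for groups 1-3 could look something like this using the [node] keys from the original proposal (one block per node's local ini):

; replicator tier
[node]
coordinator = false
replicator = true
storage = false

; coordinator / api tier
[node]
coordinator = true
replicator = false
storage = false

; storage tier
[node]
coordinator = false
replicator = false
storage = true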

Does that help?

@nickva
Contributor

nickva commented May 31, 2018

I like the idea. It could allow for interesting configurations. Here are some thoughts I had:

  1. replicator - This would work trivially for transient _replicate jobs because they are not tied to a particular replicator document; placement is based on hashing the source, target and a few other parameters, so just getting the replicator node set right would be good enough. However, things become more complicated for replications backed by a _replicator db. Currently, ownership (placement) of a replication job is determined by where the replicator doc shards live, such that state updates (completed, failed, etc.) happen as local doc writes. If some nodes can accept document writes but cannot be replicator nodes, some replications would never be able to run. We could fix it in a couple of ways:
  • Allow non-local doc IO: let replicator doc updates / reads happen remotely and distribute replication jobs based on the replication ID instead of the document ID. Then only the replicator nodes would run replications, but remote doc IO would introduce new failure modes.

  • _replicator db shard placement is customized such that _replicator docs end up only on replicator nodes (see the sketch after this list). This might work, but it's not clear what would happen when the replicator node set changes (a node stops being a replicator node); we might then need to auto-migrate the replicator shards.

  2. coordinator vs storage. This would be great to have. Maybe the coordinator ones could have more CPU but less memory, or other such hardware customizations.
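A rough sketch of the second _replicator option above, reusing the existing zone/placement machinery from the docs linked at the top of the issue (the "repl" zone name is illustrative, and scoping placement to just the _replicator db would need additional work):

; local config of each replicator node (mirrored into its /_nodes document)
[node]
zone = repl

; cluster-wide placement rule: keep all three copies of newly created shards
; on nodes in the "repl" zone
[cluster]
placement = repl:3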

It is also interesting to think about what happens during membership changes. That is, the period during and immediately after a node changes its role (a bit like the replicator case above). Say a node becomes a storage node and then stops being one: does all the data it stored auto-migrate to other storage nodes, and what happens during that transition? That would be great to have, and it would open the door to auto-resharding (but that's for a different discussion).

@mikerhodes
Contributor

Thoughts on config: I'd likely go for something that puts the node's "role" under a single config concept in some manner, either:

[node]
roles = coordinator,replicator,storage

or:

[node-roles]
coordinator = true ; listens on HTTPS, issues fabric requests
replicator = false ; will not mediate any replication jobs
storage = false ; will not store any user data

What would happen to existing clusters if we expand the role set? I can see the true/false approach handling this quite simply by defaulting to true.

For @iilyak:

over time, move the authentication and authorization modules to the api node so unauthorized requests do not reach the storage layer at all

I am definitely for this approach. I think consolidating access checks makes it easier to be confident in the security properties of the system and makes unit testing many combinations simpler.

For @nickva:

That would be great to have, and it would open the door to auto-resharding (but that's for a different discussion).

I think it opens the door to automatic rebalancing of shards rather than resharding (isn't that about increasing or decreasing the number of shards for a given database?).

A key thing for me about this is allowing a shard replica itself (including primary data and indexes) to flag its own "readiness state".

It is also interesting to think about what happens during membership changes.

I think dealing with this scenario is likely very important, even if it's along the lines of a refusal to change one's role within a cluster. The semantics otherwise end up quite subtle (e.g., storage -> api is hard to detect because the replication factor of a shard decreases but you still likely have copies in service).

_replicator db shard placement is customized such that _replicator docs end up only on replicator nodes.

I would likely put them into the storage nodes, as they'd be what you'd be backing up in this scenario.

@kocolosk
Member Author

I think defaulting each role to true is important, and that's harder to represent using a comma-delimited list of the roles a node assumes; using a dedicated section like [node-roles] instead of [node] is OK by me. No strong preference there.
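With true as the default, a node's local ini would only need to list the roles it opts out of; something like this (illustrative):

[node-roles]
replicator = false ; coordinator and storage are unspecified and default to true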

@wohali
Member

wohali commented Jun 25, 2018

@kocolosk Yes, that does help, thank you!

While we're bike-shedding it, [roles] seems good to me, with a full list of type = {true|false} clauses. It's obvious that the scope of a .ini file right now is a single node. (Changing it to be a cluster-wide ini file is out of scope for this issue.)

Another thing that comes up often is nodes acting as ephemeral (in-memory-only) replicas for performance. Many CouchDB users at scale run Redis alongside CouchDB (with nutty client libraries that know how to speak both), treating it as a write-through cache. I'd love to see room for a native CouchDB implementation within your proposal; @chewbranca's already prototyped this. Given we're talking about node roles, how would you separate the storage role between persistent and ephemeral storage? Would it be storage = {ephemeral|persistent} or similar?
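Sketching that suggestion against the [node-roles] idea above (the ephemeral value is hypothetical, pending the discussion below):

[node-roles]
coordinator = false
replicator = false
storage = ephemeral ; hypothetical: role value carries a storage type instead of true/false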

@mikerhodes
Contributor

@wohali I wonder about that proposal - whether we're overloading the database by giving it different storage layers in that sense vs. using a well-known caching layer like Redis?

I'd also like to understand what the semantics of these nodes are, and whether, to be effective at decreasing load on the cluster, you'd first have to consider things like preferentially asking certain nodes (i.e., the in-mem ones) for answers to queries/writes/reads. It also feels like a potentially complicated design for what should be more like, as you say, a write-through cache.

@wohali
Member

wohali commented Jun 26, 2018

@mikerhodes Perhaps, but it's an extremely prevalent design choice at scale for CouchDB. I leave the hard thinking to others; I'm only thinking along the lines of: we either play with it or we play against it. And we've already prototyped the idea of an in-memory-only Couch PSE backend.

@kocolosk
Member Author

Definitely open to more hard thinking on the caching topic. Typically I'd say that a caching system should be separate from the database cluster so it can be closer to the app; i.e. I'd not want to run a mixed cluster with storage and cache together.
