
Add couchdb clustering #2810

Merged: 6 commits, Nov 29, 2017

Conversation

@style95 (Member) commented Sep 28, 2017

This PR deploys CouchDB as a cluster. As of version 2.0, CouchDB officially supports clustering.

Notable changes are as follows:

The CouchDB image is changed to the Apache one.

klaemo/couchdb is superseded by apache/couchdb, and the version is updated to 2.1 accordingly.

Coordinator node.

Since we need one coordinator node to set up a CouchDB cluster, the first host in the db group becomes the coordinator node.

New environment variables for the cluster.

The important factors for a CouchDB cluster are the number of shards and the quorum sizes, so corresponding environment variables are introduced.

Note: the current apache/couchdb:2.1 image does not support these values yet, so I created apache/couchdb-docker#32 for it; once that PR is merged, those environment variables will be available.

This PR can also address the problem reported in issue #2808.
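
For context, here is a rough sketch (not the exact tasks in this PR) of the CouchDB 2.x cluster setup flow that the coordinator node drives through the _cluster_setup endpoint, written as Ansible uri tasks; the variable names (db_protocol, coordinator, db_port, db_username, db_password) are placeholders:

  - name: enable cluster mode via the coordinator
    uri:
      url: "{{ db_protocol }}://{{ coordinator }}:{{ db_port }}/_cluster_setup"
      method: POST
      body_format: json
      body: >
        {"action": "enable_cluster", "bind_address": "0.0.0.0", "username": "{{ db_username }}", "password": "{{ db_password }}", "port": {{ db_port }}, "node_count": "{{ groups['db'] | length }}"}
      user: "{{ db_username }}"
      password: "{{ db_password }}"
      force_basic_auth: true
      status_code: 201

  - name: finish the cluster once all nodes have been added and enabled
    uri:
      url: "{{ db_protocol }}://{{ coordinator }}:{{ db_port }}/_cluster_setup"
      method: POST
      body_format: json
      body: >
        {"action": "finish_cluster"}
      user: "{{ db_username }}"
      password: "{{ db_password }}"
      force_basic_auth: true
      status_code: 201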

@style95 (Member Author) commented Oct 4, 2017

Is there anything more to do for this PR?

@rabbah (Member) commented Oct 18, 2017

@style95 As you shared on Slack, I think the use of db_num_read_quorum and db_num_write_quorum is discouraged. Can you link the related Ansible issue here, and is there an update?

@style95 (Member Author) commented Oct 20, 2017

@rabbah Let me explain the issues here.

This PR uses the apache/couchdb image. That image does not support environment variables for n, q, r, and w, which is why I initially created apache/couchdb-docker#32.

The feedback I got from the CouchDB community is:

  1. n is configured automatically while setting up the cluster.
  2. q only matters when a database is created and can be specified via the ?q= query parameter (e.g. ?q=1).
  3. Explicit r and w configurations are deprecated; instead, CouchDB uses half of the number of replicas for those two values. (They can also be specified as request parameters.)
  4. After the cluster is set up, the cluster configuration is stored in /opt/couchdb/etc/local.d, so the CouchDB community highly recommends volume-mapping that directory.

In short, we don't need those configurations any more; instead, we need to pass the q parameter when creating a database (see the sketch below). I will update the code accordingly.
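
For illustration, a minimal sketch (using Ansible's uri module; db_host and the database name whisk_test_db are placeholders) of creating a database with an explicit shard count via the ?q= parameter:

  - name: create a database with an explicit shard count
    uri:
      url: "{{ db_protocol }}://{{ db_host }}:{{ db_port }}/whisk_test_db?q=8"
      method: PUT
      user: "{{ db_username }}"
      password: "{{ db_password }}"
      force_basic_auth: true
      status_code: [201, 202]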

One remaining question: would it be required to add a volume mapping for the configuration directory as well? For that, we would need to keep the configuration directory and files (e.g. etc/default.ini, etc/default.d/*.ini, etc/local.d) in the OW repo, because there will be no default configuration once the volume is mapped.

@style95 (Member Author) commented Oct 20, 2017

@rabbah
The code is updated accordingly.
Regarding the q value, the default is 8. I think that is a good starting point; what do you think?

@style95 style95 force-pushed the feature/add-couchdb-cluster branch 2 times, most recently from 1666f2e to 08f9594 Compare October 26, 2017 09:23
@style95 (Member Author) commented Oct 27, 2017

It seems the Travis error is not related to this PR.
Could anyone trigger the Travis build again?

set_fact:
nodeName: "couchdb{{ groups['db'].index(inventory_hostname) }}"
coordinator: "{{ groups['db'][0] }}"
Contributor:

You should use the ansible_host of groups['db'][0] here. Otherwise we are not able to deploy two dbs on one machine (which is very useful for testing)
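
A hypothetical sketch of what that suggestion could look like, resolving the coordinator through ansible_host:

  set_fact:
    nodeName: "couchdb{{ groups['db'].index(inventory_hostname) }}"
    coordinator: "{{ hostvars[groups['db'][0]].ansible_host }}"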

Member Author (style95):

As you know, we only need one coordinator node to set up the cluster; this is how we pick the first host among many. Is it possible to do the same thing with ansible_host?

For example, if there are two nodes running on the same ansible_host, we still need only one of them as the coordinator.

state: started
recreate: true
restart_policy: "{{ docker.restart.policy }}"
volumes: "{{volume_dir | default([])}}"
ports:
- "{{ db_port }}:5984"
- "4369:4369"
- "9100:9100"
Contributor:

Can we make these ports (outside of the container) configurable and increment them for each container? Again, to be able to deploy two dbs on the same host.
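
A hypothetical sketch of that suggestion for the HTTP port, offsetting the published host port by the node's index in the db group (the Erlang ports 4369 and 9100 turn out to be harder, as discussed below):

  ports:
    - "{{ (db_port | int) + groups['db'].index(inventory_hostname) }}:5984"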

@style95 (Member Author) commented Nov 2, 2017

I would like to discuss this further.
Please refer to this: https://docs.couchdb.org/en/latest/cluster/setup.html#firewall

Those two ports are used when the Erlang VM discovers other nodes. AFAIK, 4369 is reserved for finding other nodes, and 9100 is reserved for communication among nodes.

Normally the Erlang VM uses random ports between 9100 and 9200, but apache/couchdb forces it to use 9100 only.

Regarding 9100, I think we could change it, but we would need to update the apache/couchdb image, and the value would have to differ on every node; I am not sure that works in that situation.
You can refer to this: https://github.com/apache/couchdb-docker/blob/master/2.1.1/vm.args#L14

And with regard to 4369, since it is reserved at the Erlang VM layer, I am not sure we can change it.

@cbickel Do you have any idea?

@style95 (Member Author) commented Nov 6, 2017

@cbickel any comments on this?

url: "{{ db_protocol }}:https://{{ ansible_host }}:{{ db_port }}/_cluster_setup"
method: POST
body: >
{"action": "enable_cluster", "bind_address":"0.0.0.0", "username": "{{ db_username }}", "password":"{{ db_password }}", "port": {{ db_port }}, "node_count": "{{ num_instances }}", "remote_node": "{{ ansible_host }}", "remote_current_user": "{{ db_username }}", "remote_current_password": "{{ db_password }}"}
Contributor:

What about using {{ groups['db'] | length }} instead of defining num_instances at the top?
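
In other words, a sketch of the same request body with the node count derived from the inventory:

  body: >
    {"action": "enable_cluster", "bind_address":"0.0.0.0", "username": "{{ db_username }}", "password":"{{ db_password }}", "port": {{ db_port }}, "node_count": "{{ groups['db'] | length }}", "remote_node": "{{ ansible_host }}", "remote_current_user": "{{ db_username }}", "remote_current_password": "{{ db_password }}"}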

@cbickel (Contributor) commented Nov 8, 2017

This is a great pull request. The only thing that should still be there (if it is technically possible) is the ability to deploy two instances on the same machine; with that, testing would be much easier. We have it for the controllers and invokers as well.

And I have an additional question: is the intention of this PR only to share data across multiple instances? Wouldn't it be great if requests against the database were also distributed across the instances? Or do you have that in mind for later?

@style95 (Member Author) commented Nov 10, 2017

Agree. It would be great if we could deploy more than one instance on a single machine for testability.

Regarding data sharing across multiple instances: do we need any further configuration to distribute the data?

AFAIK, since CouchDB 2.0 the nodes responsible for a given shard are chosen based on n (the commonly recommended value is 3). For example, with 5 nodes and n=3, say nodes 1, 2, and 3 hold some piece of data; if a client sends a request to node 4, the request is automatically forwarded to one of nodes 1-3, which handle it properly. Based on the values of w and r, the cluster then responds to the client, so no matter which node receives the request, the client gets a proper response. Did I understand your question correctly? If I have misunderstood CouchDB, please correct me.

If you mean load balancing (a single endpoint address) across the nodes for HA, I currently have no concrete plan. We could put a load balancer in front of them, or have the other components use a list of endpoints.

@cbickel (Contributor) commented Nov 10, 2017

OK, I didn't know that a request sent to instance 1 could be handled by instance 2.

I think for HA we need one of your proposed ideas. I just wanted to ask whether you want to address that problem already in this PR, or later. For this PR I'm totally fine without load balancing across the different nodes.

@style95 (Member Author) commented Nov 12, 2017

Yes, I think it would be better to address the load-balancing problem in a separate PR.

Regarding the epmd port, 4369: initially I thought we could change it using ERL_EPMD_PORT (https://erlang.org/doc/man/epmd.html). But if, for example, we configure the port as 4369 on the first node and 4370 on the second, the first node will look for the other node on 4369 while only 4370 is open on the second node, so the lookup will fail.

I am looking for a configuration that separates the binding port (4369) from the communication port (4370), but I am not sure it is feasible.

@style95 (Member Author) commented Nov 13, 2017

It seems all Erlang nodes must use the same epmd port:
https://stackoverflow.com/questions/40189097/erlang-epmd-connects-to-other-host-with-non-default-epmd-port

That means we cannot run a CouchDB cluster on a single machine as-is.

For now I am trying the following scenario:

  1. Use the same epmd port (4369) for all CouchDB nodes.
  2. Do not map that port to the host (remove -p 4369:4369).
  3. Try to form a cluster.

Since both nodes run on a single machine, they are in the same subnet, so they might be able to communicate using their container IPs. In other words, even though the port is not opened at the host layer, a node may still reach the other node via 172.17.0.2:4369, for example. For this, we would need to register the CouchDB nodes by their container IPs (currently we use the host IPs).

Since this assumes the nodes are on the same subnet, it won't work in distributed environments, so we would need to differentiate local deployment from distributed deployment.

Is it worth taking this approach?
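
A rough sketch of this single-host scenario, assuming both containers sit on the default Docker bridge network and can reach each other on their container IPs (names and values are placeholders):

  - name: start a couchdb node
    docker_container:
      name: "couchdb{{ groups['db'].index(inventory_hostname) }}"
      image: apache/couchdb:2.1
      state: started
      ports:
        # only the HTTP port is published to the host; 4369 and 9100 stay unpublished,
        # so the nodes would talk to each other over container IPs such as 172.17.0.2
        - "{{ (db_port | int) + groups['db'].index(inventory_hostname) }}:5984"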

@style95 (Member Author) commented Nov 15, 2017

Let me share the discussion with @cbickel. It would be good to have test cases for HA/failover of CouchDB, but that is not simple, and running multiple CouchDB nodes on a single machine seems to require many dirty changes.

So one option is to run such a test case only when more than one CouchDB node is deployed (i.e. in a distributed environment). The same approach is already taken for the controllers.

I will look into ShootComponentsTests.scala and apply a similar approach for CouchDB.

@cbickel (Contributor) commented Nov 15, 2017

One idea for one of the tests could be:

  • Write documents to the first DB
  • Shoot the first DB
  • Check that the documents (and maybe also some view results) are available on the second DB
  • Write some documents to the second DB
  • Start the first DB again and wait until it is up
  • Check that the new documents are replicated back to the first DB

Do you have some more ideas?

@chetanmeh (Member) commented:

@style95 What would consistency support look like in the clustered setup? Would it be possible to read your own writes?

Reading here indicates that reads and writes use a quorum, which should allow one to read one's own write:

  The number of copies of a document with the same revision that have to be read before CouchDB returns with a 200 is equal to a half of total copies of the document plus one. It is the same for the number of nodes that need to save a document before a write is returned with 201.

However, it's not clear what the semantics are for queries or view access. Per here, it looks like queries would not use a quorum:

  The coordinator waits for a given number of responses to arrive from the nodes it contacted. The number of responses it waits for differ based on the type of request. Reads and writes wait for two responses. A view read only waits for one.

So in a clustered setup it can happen that an action added in one request is not visible to a query made in a subsequent request?

@rabbah (Member) commented Nov 20, 2017

Correct: views are subject to eventual consistency and may exclude a just added/modified action (for example) until all nodes have replicated and indexed.

@chetanmeh (Member) commented:

Thanks @rabbah for confirming the eventually consistent nature of queries.

Next question: does CouchDB support specifying the sharding key, or does it always shard on the _id key only? In the absence of a sharding key, all queries would need to hit all shards, which may have some impact on performance.

@style95 (Member Author) commented Nov 20, 2017

@chetanmeh
AFAIK, there is no support for specifying a custom sharding key. What kind of performance improvement do you expect from a custom sharding key?

For a view query, we have to access all relevant shards to build a combined result. That is why it is generally recommended to use a small q value for a read-intensive DB and a large value for a write-intensive DB.

If I misunderstood, kindly correct me.

@chetanmeh (Member) commented:

  What kind of performance improvement do you expect from a custom sharding key?

I am exploring support for running OW on MongoDB and CosmosDB, which support specifying a custom partition key. Looking at the views used by CouchDB, I see the following types of queries:

  • Activations
    • byDateWithLogs - [doc.start]
    • cleanUpActivations - [doc.start]
    • activations - [doc.namespace, doc.start]
  • Whisks
    • [root namespace, date] - for each of rules, packages, actions, triggers

From my understanding of the flow, the "whisks" are mostly read. In a sharded setup this may benefit from using the root namespace as the sharding (partition) key: a query would then only need to be evaluated on a single shard instead of hitting all shards for whisks access.

This would be especially useful for databases where sharded queries are issued from the client, which may add higher latency. By comparison, CouchDB proxies the queries and talks to all the shards on behalf of the client; since such calls happen within the same data center, they may incur less latency.

So I am interested in this PR to see how to model OW entities on other sharded storage systems.

@style95 (Member Author) commented Nov 20, 2017

So you are thinking of assigning a shard per namespace, right? Great idea; it would be good for small scale and latency.

But if there is a huge amount of data, one shard may not be enough. Since a shard is the unit of distribution and replication, an overly large shard is not good for performance. As data grows, at some point we need to rebalance the shards; but if we use the namespace as the partition key, how can we split them into smaller shards?

Also, shard accesses proceed in parallel. The generally recommended value for n is 3, so w and r will be 2; that means the number of shards to access is 2. Even if we reduce that to 1, there might not be a huge latency improvement.

And in my experience, storing activations has a bigger impact on performance than reading whisk entities, because there is a cache layer in the controller.

@chetanmeh (Member) commented Nov 21, 2017

  But if there is a huge amount of data, one shard may not be enough. Since a shard is the unit of distribution and replication, an overly large shard is not good for performance. As data grows, at some point we need to rebalance the shards; but if we use the namespace as the partition key, how can we split them into smaller shards?

Ack. But it depends on the kind of data being stored. This scheme may be OK for whisks, where one namespace may contain, say, 1M entities (which can be stored in a single shard). It may not work as well for activations, where the count may be on the order of 10-100M. So the use of a sharding key would depend on the number of objects being stored.

  Also, shard accesses proceed in parallel. The generally recommended value for n is 3, so w and r will be 2; that means the number of shards to access is 2. Even if we reduce that to 1, there might not be a huge latency improvement.

This is fine for reads and writes of a single document. However, for queries all shards would need to be queried:

  The Query request handler must ask every node that hosts a replica of a shard in the database for their answer to the query. Depending on how the shard replicas are distributed across the cluster, this might be all nodes in the cluster, or only a subset.

  And in my experience, storing activations has a bigger impact on performance than reading whisk entities, because there is a cache layer in the controller.

Is the cache layer also used for queries, or only for individual entity access, i.e. lookups by docId?

@style95 (Member Author) commented Nov 21, 2017

@chetanmeh
Yes, there might not be such a huge amount of data in one namespace. But what if the data does grow for some reason; how would we handle that? And if there is not much data, do you expect such a small data set to cause any performance trouble?

Second, queries are only used to list entities. All other APIs, such as get/update/delete, and even invoking an action, use the normal document get/put API (so they can take advantage of the cache).

And I don't expect such a huge number of list API calls. As you mentioned, your approach can improve the latency of the list API. But normally the list API is called by a user rather than by any systematic call, because list is for seeing what kinds of entities exist, so there may not be 10K or 100K TPS of such calls. That's why I think action invocation (which includes storing activations) affects performance more, and I lean toward the scalable approach.

And I could not find any support for a custom partition key.

@chetanmeh (Member) commented:

  But normally the list API is called by a user rather than by any systematic call, because list is for seeing what kinds of entities exist, so there may not be 10K or 100K TPS of such calls. That's why I think action invocation (which includes storing activations) affects performance more, and I lean toward the scalable approach.

Makes sense. I will keep that in mind when modeling for other DBs.

  And I could not find any support for a custom partition key.

Thanks for confirming. So for CouchDB the partition-key decision is moot!

@rabbah (Member) commented Nov 22, 2017

The queries (list operations) are not cached today. They could be, since we allow stale results and queue indexing operations in Couch as configured. In general, as already noted, the bulk of the load isn't on CRUD or query operations.

@style95 style95 changed the title Add couchdb clustering [WIP] Add couchdb clustering Nov 22, 2017
@style95 style95 force-pushed the feature/add-couchdb-cluster branch 2 times, most recently from c31ecad to 6e30f92 Compare November 26, 2017 12:39
@style95 style95 changed the title [WIP] Add couchdb clustering Add couchdb clustering Nov 26, 2017
@style95 (Member Author) commented Nov 26, 2017

I have tested that this PR works both with a CouchDB cluster and with a single-CouchDB setup.

@style95 (Member Author) commented Nov 26, 2017

Travis complains that there is no db_local.ini; is that because of this PR?

@@ -94,6 +94,8 @@ db.whisk.activations={{ db.whisk.activations }}
db.whisk.actions.ddoc={{ db.whisk.actions_ddoc }}
db.whisk.activations.ddoc={{ db.whisk.activations_ddoc }}
db.whisk.activations.filter.ddoc={{ db.whisk.activations_filter_ddoc }}
db.hosts={{ groups["db"] | map('extract', hostvars, 'ansible_host') | list | join(",") }}
Contributor:

What about renaming this value to db.hostsList as well? In the following code you always use dbHostsList.

}
}

def pingDB(host: String, port: Int) = {
Contributor:

What do you think about using one ping method and passing the path as an argument?

@cbickel (Contributor) left a review:

LGTM. I'll take care of merging.
Thanks a lot for your contribution to OpenWhisk :)

@cbickel cbickel merged commit e587bf8 into apache:master Nov 29, 2017
NaohiroTamura added a commit to NaohiroTamura/incubator-openwhisk that referenced this pull request Dec 22, 2017
This is a follow-up patch to the below:
- Add couchdb clustering. (apache#2810)
  commit e587bf8
NaohiroTamura added a commit to NaohiroTamura/incubator-openwhisk that referenced this pull request Dec 25, 2017
This is a follow-up patch to the below:
- Add couchdb clustering. (apache#2810)
  commit e587bf8
NaohiroTamura added a commit to NaohiroTamura/incubator-openwhisk that referenced this pull request Jan 9, 2018
This is a follow-up patch to the below:
- Add couchdb clustering. (apache#2810)
  commit e587bf8
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019