Add support for collections #4044

Open
nkosi23 opened this issue May 31, 2022 · 1 comment

Summary

Map/reduce indexes may take a long time to build on very large databases. In addition to improving the Query Server protocol (which is the topic of another ticket), another way to mitigate this problem would be to reduce the number of documents a view needs to process. At least in my case, each of my views begins with an "if" statement checking the document type, so it would probably make sense to give CouchDB a way to know that only specific documents need to be passed to a given view.
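
For illustration, a typical map function of that kind looks roughly like this (a minimal sketch; the `type`, `sensor_id` and `value` field names are just examples):

```js
// Only documents of one type contribute to the index, yet CouchDB still has to
// serialize and ship every document in the database to the view server.
function (doc) {
  if (doc.type === "sensor") {
    emit(doc.sensor_id, doc.value);
  }
}
```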

Desired Behaviour

Exactly the same API as the one offered by the _partition feature.
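
For reference, the existing partition-scoped endpoints look like this; the `_collection` paths below are purely hypothetical placeholders for what an equivalent collection-scoped API could look like:

```
# Existing partitioned-database endpoints
GET /{db}/_partition/{partition_id}/_all_docs
GET /{db}/_partition/{partition_id}/_design/{ddoc}/_view/{view}

# Hypothetical collection-scoped equivalents
GET /{db}/_collection/{collection_id}/_all_docs
GET /{db}/_collection/{collection_id}/_design/{ddoc}/_view/{view}
```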

Possible Solution

I believe we can take inspiration from the implementation of partitions. In particular, the fact that partitions can have their own design documents applying only to a specific partition is exactly the spirit of this proposal. Unfortunately, there is a major caveat making partitions unsuitable for this use case, as per the documentation:

A good partition has two basic properties. First, it should have a high cardinality. That is, a large partitioned database should have many more partitions than documents in any single partition. A database that has a single partition would be an anti-pattern for this feature. Secondly, the amount of data per partition should be “small”. The general recommendation is to limit individual partitions to less than ten gigabytes (10 GB) of data. Which, for the example of sensor documents, equates to roughly 60,000 years of data.

In the use case I describe, there would instead be a small number of partitions, each holding a large number of documents. This would be the main difference between partitions and collections: partitions are meant to hold the data of a specific sensor, while collections would be meant to store all documents of type "sensor". This would still be a major improvement over passing every document in the database to the view server to build an index, since document serialization is a major cost center of the indexing process.

Partitioned Collections

An interesting nice-to-have would be the ability to partition collections, along with conveniences making it easy for users to start with a non-partitioned collection and then transition to a partitioned collection where relevant. For this purpose, a collection could have an optional "partition_key_properties" configuration option, definable only at collection creation time and accepting either a single property or a composite key. A document lacking these properties would not be allowed into the partitioned collection; a compliant document would automatically be added to the partition of the collection named after the document's partition key.
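
A purely hypothetical sketch of what creating such a collection could look like (the endpoint and option name are invented here, loosely mirroring the existing `PUT /{db}?partitioned=true` call):

```
# Hypothetical: create a partitioned collection with a composite partition key
PUT /{db}/_collection/sensors?partitioned=true
{
  "partition_key_properties": ["region", "sensor_id"]
}
```

A document would then only be accepted into the collection if it carries both properties, and its partition would be derived from their values.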

From there, enabling users to replicate documents from one collection to another collection of the same database would make it easy to transition from a non-partitioned collection to a partitioned one. This is important since it can be very difficult to anticipate whether partitions will be needed; in practice, the decision will often only be made after the need manifests itself in production.

Migrating existing databases

Finally, to ease the transition for existing databases (extremely important), an option could be added to filtered replication allowing users to indicate that the filtered documents should be added to a specific collection of the destination database.
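
As a hedged sketch, such a replication request could look like the following; "source", "target" and "selector" are existing replication options, while the "target_collection" field is invented for illustration and does not exist today:

```
POST /_replicate
{
  "source": "http://localhost:5984/olddb",
  "target": "http://localhost:5984/newdb",
  "selector": { "type": "sensor" },
  "target_collection": "sensors"
}
```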

Alternatively, similarly to "partition_key_properties", every collection could have a "collection_key_properties" so that new documents added to the database and matching the rule are automatically added to the collection. For existing documents, however, such an option would require a full scan of the database every time collection_key_properties is modified. There is also the question of what to do if a document matches the collection_key_properties of multiple collections. Also, this would need to affect only documents not currently belonging to any collection.

Finally, an endpoint (potentially the _all_docs endpoint) should still allow users to replicate all documents without replicating collections. This would allow users to start from scratch and reorganize the collection schema of their database (by first replicating all documents to a database having no collections, and then replicating that database to a database with the new collection schema).

Retaining the ability to define global views

Just as the partition feature does not prevent users from defining global views, being able to define global views would also be essential for databases having collections. The collections feature could additionally allow a "collections" property on global views, enabling users to specify that only the documents of the listed collections must be sent to the view.
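
A hypothetical design document sketch; the top-level "collections" field is invented to illustrate the idea, while the rest is a standard design document:

```
{
  "_id": "_design/global_stats",
  "collections": ["sensors", "alerts"],
  "views": {
    "by_region": {
      "map": "function (doc) { emit(doc.region, 1); }",
      "reduce": "_count"
    }
  }
}
```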


nickva (Contributor) commented Jun 8, 2022

Thank you for your proposal and for reaching out, @nkosi23!

Yeah, I think something like that would be nice to have. Or, more generally, it would be nice to remove the partitioned shard size caveat and have it all work as expected. Then partitions and collections would just work the same.

However, given how partitions are implemented currently, that would be a bit hard to achieve. The partition prefix is used to decide which shard a document belongs to. In the normal, non-partitioned case, the scheme is hash(doc1) -> 1.couch, hash(doc2) -> 2.couch, etc. In the partitioned case, document x:a does hash(x) and ends up on 1.couch; document y:b does hash(y) and the hash result puts it on shard 2.couch. So, if partition x has 1B documents, it would get quite large, and essentially each partition is a bit like a q=1 database. That's why there is that warning in the docs. Of course, there could also be multiple partitions mapping to the same shard, so z:c with hash(z) might map to 1.couch as well.
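
A rough JavaScript sketch of the placement logic described above (illustrative only; the real implementation hashes into contiguous shard ranges rather than taking a modulo):

```js
// Toy string hash, just for the illustration.
function hash(s) {
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

// In a non-partitioned database the whole doc id is hashed; in a partitioned
// one only the "x" in "x:a" is hashed, so every doc of partition "x" lands on
// the same shard file.
function shardFor(docId, numShards, partitioned) {
  const key = partitioned ? docId.split(":")[0] : docId;
  return `${(hash(key) % numShards) + 1}.couch`;
}

// shardFor("x:a", 4, true) === shardFor("x:b", 4, true)     // same shard
// shardFor("doc1", 4, false) !== shardFor("doc2", 4, false) // usually spread out
```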

Now, the idea might be to have a multi-level shard mapping scheme: first based on the collection/partition, then another level based on the rest of the doc id. So hash(x) would be followed by hash(a), which would determine whether the document should live on the 1-1.couch, 1-2.couch, ... 1-N.couch shard files. And that gets a bit tricky, as you can imagine.
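
Continuing the toy sketch above (reusing its hash() helper), a two-level mapping could look roughly like this, purely as an illustration:

```js
// Hypothetical two-level placement for a doc id of the form "partition:rest":
// the partition picks a shard group, the rest of the id picks a sub-shard
// within that group, e.g. "1-2.couch".
function twoLevelShardFor(docId, numGroups, subShardsPerGroup) {
  const [partition, rest] = docId.split(":");
  const group = (hash(partition) % numGroups) + 1;  // first level: hash(x)
  const sub = (hash(rest) % subShardsPerGroup) + 1; // second level: hash(a)
  return `${group}-${sub}.couch`;
}
```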

At a higher level, this comes down to the fundamental sharding strategies and their associated trade-offs:

  1. Random sharding (using hashing), which is what CouchDB uses. This favors fast writes with uniformly distributed load across shards. However, it makes reads cumbersome, as they have to spread the query to all the shard ranges and then wait for all of them to respond.
  2. Range-based sharding, which is what other databases use (FoundationDB, for instance), where the sorted key range is split. So it might end up with ranges a..r on file 1.couch and s..z on 2.couch. This scheme makes it easy to end up with a single hot shard during incrementing key updates (inserting 1M docs starting with "user15..."). And of course, as shown on purpose in the example, the shards easily become imbalanced (a lot more a..r docs than s..z ones). However, reads become really efficient, because now start/end key ranges can pick out just the relevant shards (see the sketch right after this list).
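
A minimal sketch of that read-path difference, in the same illustrative spirit as the snippets above:

```js
// Which shards does a start/end key range query have to touch?
function shardsForRangeQuery(startKey, endKey, scheme, shards) {
  if (scheme === "hash") {
    // Hash sharding: key order says nothing about placement,
    // so the query fans out to every shard and waits for all of them.
    return shards;
  }
  // Range sharding: each shard owns a sorted key range (e.g. "a".."r"),
  // so only shards overlapping [startKey, endKey] need to be queried.
  return shards.filter(s => s.start <= endKey && startKey <= s.end);
}

// const shards = [{ name: "1.couch", start: "a", end: "r" },
//                 { name: "2.couch", start: "s", end: "z" }];
// shardsForRangeQuery("user1", "user2", "range", shards) -> only 2.couch
// shardsForRangeQuery("user1", "user2", "hash", shards)  -> both shards
```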

Of course, you can have multi-level schemes: hash-based sharding followed by range-based sharding, or vice versa.

Another interesting idea we had been kicking around is to have the main db shards use the random hashing scheme, but have the indices use a range-based one to allow more efficient start/end key queries on views. That would be nice, but it would have the same hot-shard issue when building the views, since that work won't be parallelized as easily.
