Couchdb 3287 pluggable storage engines #496
Conversation
Also, for documentation, I'm not sure what we want to do there. Should I open a PR for the docs now, or should we wait and let this soak on master and just let the more adventurous folk take cracks at running and writing new storage engines, to make sure there's nothing major that needs reworking?
Also, I forgot to link out to the first complete alternative storage engines.

https://github.com/cloudant/couch_ngen

The couch_ngen engine (it's a super terrible name, I know) was one that I wrote fairly quickly just to try to prove that the API wasn't artificially restrictive. The biggest thing here is that I'm using multiple files for a single database to try to make things like compaction faster. This is the same general shape as the existing legacy engine, with a few tweaks that we've considered over the years but never actually tried to implement due to upgrade concerns. This also uses a NIF for file I/O with dirty schedulers (thus your Erlang VM needs to have dirty schedulers enabled to run this engine). Initial benchmarks show this engine to be significantly faster than the legacy engine at the local node interface. Unfortunately those gains don't show up nearly as well at the clustered interface (so we've got more work to do there :)

https://github.com/cloudant-labs/lethe/tree/initial-branch

Lethe is the first engine written by !me. This work is by @chewbranca, investigating the storage engine API by actually using it to write a storage engine. Lethe is an ephemeral/in-memory storage engine; think ramdisk database. It's not been benchmarked to the best of my knowledge, but last I heard it was passing the storage engine test suite and could run most of the existing test suite (it doesn't support attachments, which are optional in the API).

There are some other obvious candidates for storage engines as well that I've not tried (on purpose, to let others play with something new and interesting). LevelDB/RocksDB, LMDB, PostgreSQL, SQLite, Bitcask, etc. all come to mind as obvious things to play with.
Also, something I've been contemplating is writing a custom storage engine in C that is behaviorally similar to the existing code but hopefully significantly faster (though remember, if speed is your goal, we've got more work to do in the clustering layer :)
@davisp IMO write the docs and have that PR land at the same time as this one, or they might get left behind.
@@ -14,10 +14,11 @@

-include_lib("couch/include/couch_eunit.hrl").

-define(ENGINE, {couch_bt_engine_stream, {Fd, []}}).
The variable name is hard-coded in a define. Could we do something about it? For example:
-define(ENGINE(Var), {couch_bt_engine_stream, {Var, []}}).
Yeap, that's better.
src/couch/src/couch_att.erl
Outdated
@@ -233,6 +234,14 @@ transform(Field, Fun, Att) ->
    store(Field, NewValue, Att).

copy(Att, DstStream) ->
I couldn't find any calls to this function.
Strange. Looks like I wrote it and then never used it. Will remove it.
@@ -62,6 +63,37 @@ normparts(["." | RestParts], Acc) ->
normparts([Part | RestParts], Acc) ->
    normparts(RestParts, [Part | Acc]).
This looks very similar to https://github.com/erlang/otp/blob/master/lib/stdlib/src/filelib.erl#L103. Is there a reason to write our own implementation? filelib's implementation is considered slow (based on info from the Erlang mailing list) due to its use of regexps; however, this implementation uses them as well.
I'll read through the implementation some more tomorrow (maybe there is a reason).
Answering my own question: this implementation doesn't make a useless call to file:read_file_info(File).
Hah. So, it looks like I'm adding a comment to explain this, because it took me at least ten minutes of staring at this and the original to remember why I did it. While it does avoid the extra read_file_info call, that's ancillary to the main change, which is to accept both directories and files when matching the regular expression. This way a pluggable storage engine can be a directory or a single file (or a group of files not utilizing a directory structure, for some odd reason).
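A rough Python sketch of that matching behavior (illustrative only; the real implementation is Erlang and lives in the couch utility code): any directory entry whose name is the database name plus an extension counts as a candidate, whether the entry is a file or a directory. The function and file names here are hypothetical.

```python
import os
import re
import tempfile

def find_db_candidates(root_dir, db_name):
    # Accept any entry (file OR directory) named "<db_name>.<extension>".
    # A file-only check would reject directory-based engines.
    pattern = re.compile("^" + re.escape(db_name) + r"\.[0-9a-z]+$")
    return sorted(e for e in os.listdir(root_dir) if pattern.match(e))

root = tempfile.mkdtemp()
open(os.path.join(root, "mydb.couch"), "w").close()  # legacy single-file engine
os.mkdir(os.path.join(root, "mydb.ngen"))            # directory-based engine
print(find_db_candidates(root, "mydb"))              # ['mydb.couch', 'mydb.ngen']
```

The key point is that the match happens on names alone, with no file:read_file_info-style stat call needed to decide candidacy.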
Here are the steps to use this branch with couch_ngen (this requires an Erlang VM with dirty schedulers).

Setup dependencies
In the CouchDB repo add these two lines to the

Then run

Edit local.ini
Edit `rel/overlay/etc/local.ini` and add these settings in:

This sets up the list of supported engines and which is the default engine. If you change
Important to note: don't add these settings to the bottom of the file if you are using

Build CouchDB

Run CouchDB

Create database with couch_ngen
If

Boom! You now have multiple storage engines running.
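The settings themselves aren't shown above; as a hypothetical illustration of the shape being described elsewhere in this thread (each file extension mapped to an engine module, plus a default engine), such a local.ini fragment might look like:

```ini
; Illustrative only -- the actual settings from the comment above were not
; preserved. Shape inferred from the engine/extension discussion in this PR.
[couchdb_engines]
couch = couch_bt_engine
ngen = couch_ngen

[couchdb]
default_engine = couch
```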
Awesome, thanks for the notes @garrensmith!
DbFile = filename:join([DbDir, ?b2l(DbName) ++ ".couch"]),
{ok, _} = file:copy(DbFile, DbFile ++ ".backup"),
ok.
{ok, Db} = couch_db:open_int(DbName, []),
Should we abstract it away into get_filepath(DbName)? In this case it could be used from both backup_db_file and restore_backup_db_files.
I'm not seeing what you're suggesting to abstract away here.
@@ -414,37 +390,34 @@ get_before_doc_update_fun(#db{before_doc_update = Fun}) ->
get_committed_update_seq(#db{committed_update_seq=Seq}) ->
    Seq.

get_update_seq(#db{update_seq=Seq})->
    Seq.
get_update_seq(#db{} = Db)->
Minor nitpick: some of the get_ functions return {ok, Result} and some just Result.
This is a nit for me too, but it's historical, and if I were to change it here the diff would spider out to all the places that call these functions, so I left them all as they were.
@@ -531,17 +541,25 @@ map_fold({{Key, Id}, Val}, _Offset, Acc) ->
        user_acc=UAcc1,
        last_go=Go
    }};
map_fold({<<"_local/",_/binary>> = DocId, {Rev0, Body}}, _Offset, #mracc{} = Acc) ->
map_fold(#doc{id = <<"_local/", _/binary>>} = Doc, _Offset, #mracc{} = Acc) ->
I had to stop here for quite some time, so I'm leaving a comment for future reviewers. This change works because we now open the local btree differently: we pass split and join callbacks to couch_btree:open for the local_tree.
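For readers unfamiliar with the split/join idea, here is a loose Python analogy (illustrative only; the real callbacks are Erlang funs passed to couch_btree:open, and the field names below are hypothetical): split flattens a doc into the key/value pair the btree actually stores, and join rebuilds the doc shape on read, which is why map_fold can now receive #doc{} terms from the local tree.

```python
def local_split(doc):
    # On write: a doc-like structure is flattened into the {Key, Value}
    # pair that the local btree stores.
    return doc["id"], (doc["rev"], doc["body"])

def local_join(doc_id, value):
    # On read: the stored {Key, Value} pair is rebuilt into the doc-like
    # structure handed to callers such as map_fold.
    rev, body = value
    return {"id": doc_id, "rev": rev, "body": body}

doc = {"id": "_local/peer", "rev": "0-1", "body": {"seq": 42}}
assert local_join(*local_split(doc)) == doc  # round-trips losslessly
```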
@@ -236,7 +237,9 @@ init_state(Db, Fd, State, Header) ->
        view_states=ViewStates
    } = Header,

IdBtOpts = [{compression, couch_db:compression(Db)}],
IdBtOpts = [
    {compression, couch_compress:get_compression_method()}
This would work, since the old implementation calls couch_compress:get_compression_method() from couch_db_updater:init_db. However, it means we wouldn't be able to support different compression methods for different databases. It also wouldn't be possible to delegate the decision on a compression algorithm to the storage engine (if we are not dispatching through couch_engine).
I'm a bit confused by this comment but I think you have it right. To make sure we agree this was the intent:
The pluggable storage engine API avoids dictating a compression algorithm so that each storage engine can choose its own approach to compression. For instance, a LevelDB engine could just rely on LevelDB applying Snappy under the covers.
However, since we don't have a notion of "database compression", we now have to change the scheme for view compression. Given that the existing approach is to just set it to the result of couch_compress:get_compression_method() (via what is returned when a database is opened, and, more to the point, not when it was created), this is effectively not a behavior change.
@@ -840,6 +843,8 @@ maybe_convert_reductions({KVs0, UserReductions}) ->

maybe_convert_kv({<<"_local/", _/binary>> = DocId, _}) ->
Shouldn't we convert this clause to maybe_convert_kv(#doc{id = <<"_local/", _/binary>>} = Doc) -> ?
It's an upgrade clause for the time being. The #doc version will already be caught on line 846.
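The "upgrade clause" pattern can be sketched in Python for illustration (the real code is Erlang function-clause dispatch; the shapes below are hypothetical stand-ins): older callers may still pass the raw key/value tuple, while newer callers pass the doc form, and both must be accepted during a rolling upgrade.

```python
def maybe_convert_kv(item):
    # Upgrade clause: older code may still hand us the raw {DocId, Value}
    # tuple shape; newer code passes a doc mapping, which falls through
    # to the final branch (mirroring the later #doc{} clause in Erlang).
    if isinstance(item, tuple):
        doc_id, _value = item
        return {"id": doc_id}  # normalize the legacy shape
    return item                # already the new doc shape

print(maybe_convert_kv(("_local/peer", "ignored")))
print(maybe_convert_kv({"id": "_local/peer"}))
```

Once no pre-upgrade nodes remain in the cluster, the tuple branch can be deleted, which is what "for the time being" refers to.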
src/couch/src/couch_server.erl
Outdated
exists(DbName) ->
    RootDir = config:get("couchdb", "database_dir", "."),
    Engines = get_configured_engines(),
    Possible = lists:foldl(fun({Extension, Engine}, Acc) ->
We have exactly the same lines in the get_engine(Server, DbName) function. We could create a common function such as select_candidate_engines(DbName, RootDir, Engines) or select_possible_engines(DbName, RootDir, Engines).
Done.
Let's rebase it on master when you have a chance.
Rebased. Forgot to push earlier. There's just that one function I'm not sure about, namely where it went; I'm guessing the couch_server revert, but I will have to look harder on Monday. This branch is rebased on master, but only loosely: I haven't done a full make check yet, so there might be a lingering issue or two.
I ran CouchDyno's replication tests against PSE and they all passed.
Ninja edit from @davisp: CouchDyno can be found here: https://github.com/cloudant-labs/couchdyno
@davisp: I followed @garrensmith's setup and created a db using the ngen engine. It's probably documented somewhere, but how do I verify that the db created is using another engine, or are we hiding that abstraction? For instance, I can do
@tonysun83 each storage engine should have a different file extension. So if you do
From what I understand of this code (couchdb/src/couch/src/couch_server.erl, lines 629 to 642 in 6375cd4), if you do ?engine=blahgen and that engine is not supported, it will fall back to the configured default engine.
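A rough sketch of that fallback behavior (hypothetical names and module strings; the real dispatch lives in couch_server and is clarified by @davisp further down):

```python
def choose_engine(requested, configured_engines, default="couch"):
    # configured_engines maps file extension -> engine module, e.g.
    # {"couch": "couch_bt_engine", "ngen": "couch_ngen"} (illustrative).
    # An unknown ?engine= value falls back to the configured default.
    if requested is not None and requested in configured_engines:
        return configured_engines[requested]
    return configured_engines[default]

engines = {"couch": "couch_bt_engine", "ngen": "couch_ngen"}
print(choose_engine("ngen", engines))     # couch_ngen
print(choose_engine("blahgen", engines))  # unsupported: couch_bt_engine
print(choose_engine(None, engines))       # no param: couch_bt_engine
```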
@tonysun83 The shard's db info blob should include the engine; however, this wasn't included in the clustered db info because I originally had plans to support upgrading the storage engine for a database. That's not been written yet, but if it were, a clustered db could have multiple engines involved for different shards, and we don't currently surface shard-specific information.

@garrensmith That's correct. It currently just defaults to the configured default engine. And if nothing is configured, the default is the legacy couch engine. Good call on the features list. That was added after the majority of the PSE dev, but that'd be a good thing to support, and as I recall it was fairly simple. I'll take a look at adding that today.
@davisp: given the negative scenario above with
@davisp @nickva: I'm not sure if there has been work to rebase the latest scheduling replicator on PSE, but I'm seeing some function clause errors and weird behavior in my (admittedly weird) scenario.
So after I did this, the odd behavior was that my original mydb had 7 documents, but repngen keeps increasing its document count; currently it's at 20, and the original mydb has not changed at all. It must be something to do with the update_seq. A log is provided below: https://gist.github.com/tonysun83/b600e855ea0d3ae0e6ef92e388cf05c0
@tonysun83 Hrm, I'll add a test for that case and try to track down what's causing the 500s with the bad engine. For that, and for the replicator function clause, it's probably a bad rebase or refactor somewhere.

For the replication you're trying, I'd expect that not to work. Adding query strings to replication URLs seems like it'd not work so hot. Can you check what documents have been added? My guess is that the URL string is screwing things up and the replicator ends up posting random data that gets a UUID, since the target database would interpret everything after the ? as a query parameter. Once it's created, any POST would probably be interpreted as a request to create a new document. And that's not PSE specific: a similar test on not-PSE would be to try to specify a Q or N parameter in the target URL. I'd expect similar results there.
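The point about everything after the ? being a query parameter, not part of the database name, can be seen with any standard URL parser; for example, in Python (the URL here is illustrative, matching the repngen scenario above):

```python
from urllib.parse import urlsplit, parse_qs

# The target URL from a replication doc with an engine query string.
url = "http://localhost:5984/repngen?engine=ngen"
parts = urlsplit(url)

# The database name comes only from the path; the engine parameter is
# query data that only matters on the initial PUT that creates the db.
print(parts.path)             # /repngen
print(parse_qs(parts.query))  # {'engine': ['ngen']}
```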
src/couch/src/couch_db_updater.erl
Outdated
MSecSinceLastRead = couch_file:msec_since_last_read(Fd),
case MSecSinceLastRead > IdleLimitMSec of
LastActivity = couch_db_engine:last_activity(Db),
DtMSec = time:now_diff(os:timestamp(), LastActivity) div 1000,
Should we use timer:now_diff/2 instead of time:now_diff/2?
src/couch/src/couch_bt_engine.erl
Outdated
last_activity(#st{fd = Fd}) ->
    couch_file:get_last_read(Fd).
There is no couch_file:get_last_read/1. Should we change it to couch_file:last_read/1?
Has anyone got any reason for me not to merge this in 72h, or whenever? I've rebased it against master and it's passing CI checks. As far as I remember, everyone was happy with the current state for the V1 pluggable storage engines.
Passes for me locally as well.
+1 to merge from me.
+1 from my side. All tests passed locally.
This relied on having introspection into the state of a #doc{} record that is no longer valid with PSE. This is a harmless change, as it was only necessary when performing a rolling reboot to a version that included the new get_meta_body_size function. We can remove it now because anyone wanting to do a rolling reboot with PSE will have to step through a version that includes commit aee57bf, which includes the previous upgrade logic as well.
This is the primary API for pluggable storage engines. This module serves as both a behavior and a call dispatch module for handling the engine state updates. COUCHDB-3287
This is the legacy storage engine code. I've kept it as part of the core couch application because we'll always need to have at least one storage engine available. COUCHDB-3287
This change moves the main work of storage engines to run through the new couch_db_engine behavior. This allows us to replace the storage engine with different implementations that can be tailored to specific workloads and environments. COUCHDB-3287
This allows other storage engine implementations to reuse the same exact test suite without having to resort to shenanigans like keeping vendored copies up to date. COUCHDB-3287
This re-fixes a corner case when recreating a document with an attachment in a single multipart request. Since we don't detect that we need a new revision until after the document has been serialized, we need to be able to deserialize the body so that we can generate the same revisions regardless of the contents of the database. If we don't do this, then we end up including information from the position of the attachment on disk in the revision calculation, which can introduce branches in the revision tree. I've left this as a separate commit from the pluggable storage engine work so that it's called out clearly for us to revisit. COUCHDB-3255
Last rebase. Will merge once CI comes back green.
Overview
Pluggable Storage Engines (Oh my!)
I've finally covered enough bases to start asking for reviews on the pluggable storage engine work. I know this is a fairly large change, so I don't expect it to actually be merged for a number of weeks. Everyone who wants to review any part of this should feel free to take their time and be thorough.
I highly suggest reviewing this work a commit at a time as there are a few fairly large commits. A few signposts for the intrepid reader:
This is a single-file addition that just introduces the new pluggable storage engine API. It's fairly thoroughly documented and hopefully makes sense as to the level that this API is geared at. When I initially started this work I spent a lot of time trying to figure out what level of the API would be the best place to separate the storage engine from database logic. Go too high and every engine reimplements core bits of logic. Go too low and there's not enough room for interesting changes to the actual storage engine algorithms. I believe this is a happy middle ground that gives storage engines the room to play and invent alternative implementations, while also not requiring an extremely large amount of reimplementation of various required behaviors.
This commit is quite big, but it's important to note that all it's doing is copying the existing parts of couch_db.erl and couch_db_updater.erl into a new set of modules named couch_bt_engine_*.erl. Nothing uses this code yet. I have this in its own commit since it's fairly large. However, if you mostly just want to check it, you should be able to see that the implementation is just pulled from existing functions. The behavior of this engine is identical to the "pre-PSE" engine because that's what it is; it's just been reformatted a bit and had its name changed.
Also, I've kept this and couch_db_engine.erl in the main couch application, since we'll always want to have at least one storage engine. That decision was pre-monorepo, when adding another git repo to our builds seemed awfully heavy; now that we're a monorepo, we're just creating a folder, which is easy enough. I'd like reviewers to keep this in mind, as one thing I expect to discuss is whether we should split this out into its own application.
This is the doozy. There are two main bits of work going on in this commit: first, the removal of all the code that was in the previous commit, and second, the addition of all the code to have couch_db and couch_db_updater start using the couch_db_engine APIs. It's long and big, but really, if you take your time, there's nothing magical going on here. For the core bits to study, I'd recommend spending some extra time on couch_server to see how engines are configured and chosen at runtime.
This is a fairly complete test suite for the entire storage engine API. I don't remember what the coverage of couch_bt_engine was, but I know it's fairly high. The useful bit is that when devs are creating new storage engines, they can just pull this in with a single eunit case to test their implementations. I'll show a few examples of this below.
I may end up squashing this into the implementation commit. This was a fix while I was developing PSE that I ended up re-fixing after the fact, slightly differently. The original goal was to make sure everything compiled and all test suites ran for each commit. I think I've just convinced myself to squash this. So if it's missing, it's because I did that and then forgot to edit this description to remove this paragraph...
Testing recommendations
$ make check
JIRA issue number
https://issues.apache.org/jira/browse/COUCHDB-3287
Related Pull Requests
This PR is based on the previous mixed cluster upgrade PR. However when merge time comes around we'll merge #495 first and then this will be merged into master so that users can deploy the first changes before these changes. (Uptime is fun time).
#495
Checklist
(We should probably remove the todo item for updating rebar.config since we're doing almost all work on the monorepo now)