
Feature/user partitioned databases #1605

Closed · wants to merge 6 commits from feature/user-partitioned-databases into master

Conversation

@rnewson (Member) commented Sep 18, 2018

Overview

This PR introduces a new feature, user-defined partitioned databases.

A new kind of database can be created with the ?partitioned=true option. All documents within the database must have document ids of the following format:

partition_name:doc_id

Both partition_name and doc_id must follow the CouchDB document id format (they can't begin with _, etc.).

All documents with the same partition_name are guaranteed to be mapped to the same shard range. When querying an index, the new /db/_partition/$partition/_view endpoint can query the view more efficiently by consulting only the single shard range that holds the partition. This scales the same way that a primary-key lookup (GET /dbname/docid) does: approximately linearly.
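As a quick illustration of the id contract, here is a minimal Erlang sketch (not code from this PR; split_docid is a hypothetical helper) of splitting a partitioned doc id on its first colon:

%% Hypothetical helper: split a partitioned doc id on the first colon.
split_docid(Id) when is_binary(Id) ->
    case binary:split(Id, <<":">>) of
        [Partition, DocId] when Partition =/= <<>>, DocId =/= <<>> ->
            {Partition, DocId};
        _ ->
            throw({illegal_docid, <<"doc id must be of form partition:id">>})
    end.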

Testing recommendations

The PR contains multiple tests for basic functionality and all existing tests still pass. When testing the PR, it is important to try the feature yourself, interactively, with docs and views of your choosing, to give us confidence in this new feature.

Related Issues or Pull Requests

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

@garrensmith (Member):

The majority of the tests for partitions currently live at https://github.com/cloudant/partition_tests. I wanted to write the tests in Elixir, so I've kept them in a separate repo until the elixir branch is merged in. Once that is merged, I can move the partition tests into the CouchDB repo.

@wohali (Member) commented Sep 18, 2018:

@garrensmith would you consider merging the tests into our elixir branch?

@garrensmith (Member):

@wohali Sure, I can do that once this PR is a little further down the line.

@jaydoane (Contributor):

@garrensmith I notice that the tests in https://github.com/cloudant/partition_tests/tree/master/test don't have a common prefix, which, although fine for now, I think will become less cohesive once they are merged with the main elixir suite. Would you consider renaming them so they all start with partition_?

This props list is recorded in each database shard as well as the
shard document in the special _dbs database.

Co-authored-by: Garren Smith <[email protected]>
Co-authored-by: Robert Newson <[email protected]>
@rnewson force-pushed the feature/user-partitioned-databases branch from 8b7f146 to fcb1795 on October 1, 2018
@rnewson force-pushed the feature/user-partitioned-databases branch 4 times, most recently from 3e14644 to 07bacf0 on October 8, 2018
% cannot partition a system database
validate_partition_database_create(DbName, Partitioned) ->
SystemId = DbName =:= ?l2b(config:get("mem3", "shards_db", "_dbs")) orelse
lists:member(DbName, ?SYSTEM_DATABASES),
Contributor:

This will not work for cases where the database name contains /. For example, the following would be broken (the example is taken from #1647 (comment), where one of CouchDB's users mentions this use case):

account%2F85%2Fea%2F6075c6c1e266f8512e2233541bdb-201807

I think this should use couch_db:is_systemdb(DbName), which abstracts that logic (keep in mind that there is an open PR, #1647, which fixes it a little).

Contributor:

We would need to make the is_systemdb function public, and possibly rename it, since we already have a public is_system_db.


validate_docid(DocId, DbName, Options) ->
SystemId = DbName =:= ?l2b(config:get("mem3", "shards_db", "_dbs")) andalso
lists:member(DocId, ?SYSTEM_DATABASES),
@iilyak (Contributor) commented Oct 10, 2018:

Should we use couch_db:is_systemdb(DbName) here? We would need to make that function public.

-define(LOWEST_KEY, null).
-define(HIGHEST_KEY, {<<255, 255, 255, 255>>}).
-define(PARTITION_START(P), <<P/binary, $:>>).
-define(PARTITION_END(P), <<P/binary, $;>>).
Contributor:

You use a semicolon here; is it intentional or a typo?

Member Author:

Intentional. It is the next character in sort order after : and therefore marks the end of the partition key range.
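A small sketch of the consequence (in_partition is a hypothetical helper; it assumes doc ids compare as raw binaries):

%% Because $; is the byte immediately after $:, every id in partition P
%% sorts at or after <<P/binary, $:>> and strictly before <<P/binary, $;>>.
in_partition(Partition, DocId) ->
    DocId >= <<Partition/binary, $:>> andalso
        DocId < <<Partition/binary, $;>>.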

@@ -43,3 +44,6 @@ design_handler(<<"_update">>) -> fun chttpd_show:handle_doc_update_req/3;
design_handler(<<"_info">>) -> fun chttpd_db:handle_design_info_req/3;
design_handler(<<"_rewrite">>) -> fun chttpd_rewrite:handle_rewrite_req/3;
design_handler(_) -> no_match.

partition_design_handler(<<"_view">>) -> fun chttpd_view:handle_partition_view_req/4;
partition_design_handler(_) -> no_match.
Contributor:

I haven't tested it yet, but I am pretty sure we need partition_handler(_) -> no_match here (and to export it as well).

Member Author:

We've run our test suite and done manual testing without needing that; what fails without it?

{<<"stable">>, Args#mrargs.stable, true},
{<<"conflicts">>, Args#mrargs.conflicts, true}
],
lists:foreach(fun ({Param, Field, Value}) ->
Contributor:

The Field and Value naming is confusing in this context.

Member:

I'm going to rename them to ArgValue and RestrictedValue.

set_prop(#st{header = Header} = St, Key, Value) ->
OldProps = get_props(St),
NewProps = lists:ukeymerge(1, [{Key, Value}], OldProps),
Options = [{compression, St#st.compression}],
Contributor:

We can call copy_props/2 here, since the lines are identical.

@@ -752,6 +848,16 @@ copy_security(#st{header = Header} = St, SecProps) ->
needs_commit = true
}}.

copy_props(#st{header = Header} = St, Props) ->
Contributor:

We are not really copying props here, so the name of the function is not accurate. Should we call it write_props instead?


get_partition_info(#st{} = St, Partition) ->
StartKey = <<Partition/binary, ":">>,
EndKey = <<Partition/binary, ";">>,
Contributor:

I guess ; is intentional, since this is the second time I see it.

@iilyak (Contributor) commented Oct 10, 2018:

How can we hide this feature behind a feature flag, so we can stabilize it?
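One possible shape for such a flag, sketched with a hypothetical config section and key (nothing in this PR defines them):

%% Hypothetical feature flag, checked before honoring ?partitioned=true.
partitioned_dbs_enabled() ->
    config:get_boolean("couchdb", "partitioned_dbs_enabled", false).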

get_prop(St, Key, DefaultValue) ->
case get_prop(St, Key) of
{error, no_value} -> DefaultValue;
Value -> Value
Contributor:

There is a discrepancy in the return values here: get_prop/2 returns {ok, Value}. So we should either change the return value of get_prop/2, or change the {error, no_value} clause to return {ok, DefaultValue}.

Member Author:

This was intentional but I see your point. get_prop/3 will always return a value whereas get_prop/2 might not, which is why the result shape is different. We could change get_prop/2 to return Value or undefined instead, which matches proplists:get_value behaviour.
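A minimal sketch of that alternative, reusing the module's get_props/1 (the surrounding details are simplified):

%% Hypothetical reshape: both arities return a bare value, mirroring
%% proplists:get_value/2,3; absent keys yield undefined or the default.
get_prop(St, Key) ->
    get_prop(St, Key, undefined).

get_prop(St, Key, DefaultValue) ->
    case lists:keyfind(Key, 1, get_props(St)) of
        {Key, Value} -> Value;
        false -> DefaultValue
    end.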

@@ -133,7 +133,9 @@ make_group_fun(Bt, exact) ->
end;
make_group_fun(Bt, GroupLevel) when is_integer(GroupLevel), GroupLevel > 0 ->
fun
({[_|_] = Key1, _}, {[_|_] = Key2, _}) ->
GF({{p, _Partition, Key1}, Val1}, {{p, _Partition, Key2}, Val2}) ->
Contributor:

Is there any possibility that we would have a case where the Partition values are different? I think we would have such cases. That would mean there would be no match and we would crash.

Member Author:

If the partition for Key1 and Key2 were different, that's a bug and we should crash.
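For readers following along: _Partition is underscore-prefixed but still a named variable, so repeating it in the clause head is exactly what enforces the equality. A minimal sketch of the match semantics:

%% _Same binds on its first occurrence; the second occurrence must be
%% equal, otherwise the fun clause fails to match.
F = fun({_Same, A}, {_Same, B}) -> {A, B} end.
%% F({x, 1}, {x, 2}) returns {1, 2};
%% F({x, 1}, {y, 2}) raises function_clause.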

@@ -88,6 +88,14 @@ handle_call({set_security, NewSec}, _From, #db{} = Db) ->
ok = gen_server:call(couch_server, {db_updated, NewSecDb}, infinity),
{reply, ok, NewSecDb, idle_limit()};

handle_call({set_prop, Key, Value}, _From, #db{} = Db) ->
Contributor:

I wonder if we should detect the case where we set the same property multiple times, i.e. when the result of couch_db_engine:get_props/1 is the same before and after calling set_prop. In that case we might want to avoid doing a commit and sending db_updated.
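A rough sketch of that guard; do_set_prop/3 here stands in for the PR's existing write-and-notify path and is hypothetical:

handle_call({set_prop, Key, Value}, _From, #db{} = Db) ->
    OldProps = couch_db_engine:get_props(Db),
    case lists:ukeymerge(1, [{Key, Value}], OldProps) of
        OldProps ->
            %% Nothing changed: skip the commit and db_updated broadcast.
            {reply, ok, Db, idle_limit()};
        _NewProps ->
            %% Value differs: fall through to the existing write path.
            do_set_prop(Db, Key, Value)
    end;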

@@ -60,6 +60,7 @@ setup_legacy() ->
lists:foreach(fun(File) -> file:delete(File) end, Files),

% copy old db file into db dir
?debugFmt("copy ~p ~p", [OldDbFilePath, NewDbFilePath]),
Contributor:

might need to remove prior to merge

Indexes;
filter_indexes_by_partitioned(DbPartitioned, Indexes, PQ) ->
FilterFun = fun (Idx)->
PartitionedIdx = case lists:keyfind(partitioned, 1, Idx#idx.design_opts) of
Contributor:

I think this is over-complicated. DbPartitioned is always true since we have a dedicated function clause. Therefore this part can be written as:

PartitionedIdx = couch_util:get_value(partitioned, Idx#idx.design_opts, true),

@@ -119,6 +147,16 @@ validate_new(Idx, Db) ->
Mod:validate_new(Idx, Db).


validate_design_opts(Props) ->
case lists:keyfind(<<"options">>, 1, Props) of
{<<"options">>, {[{<<"partitioned">>, P}]}}
Contributor:

We are making a dangerous assumption here: that options will contain a single element and that it will be {<<"partitioned">>, P}.
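A sketch of a more defensive lookup, assuming the usual EJSON {Props} shape for the options object (the return convention here is illustrative):

%% Hypothetical rewrite: find "partitioned" inside whatever else the
%% "options" object contains, rather than matching a one-element list.
validate_design_opts(Props) ->
    case couch_util:get_value(<<"options">>, Props) of
        {Opts} when is_list(Opts) ->
            couch_util:get_value(<<"partitioned">>, Opts, false);
        _ ->
            false
    end.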

ref = undefined,
order = 1,
opts = [{partitioned,true}]
}
Contributor:

Should we add non-matching shards to the mock result as well?

Partitioned = couch_util:get_value(partitioned, Idx#idx.design_opts, DbPartitioned),
Args2 = couch_mrview_util:set_extra(Args1, partitioned, Partitioned),
Args3 = case couch_util:get_value(partition, Opts) of
<<>> ->
Contributor:

I think I am missing something. I cannot figure out how we can get <<>> here. I'll get back to it later.

Contributor:

How do we handle the case when partition is not in Opts? Is that possible? If so, we would call couch_mrview_util:set_extra(Args2, partition, undefined), which is not correct.

Member Author:

<<>> is the default_value in mango_opts:validate_idx_create
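Given that default, one way to sketch the handling so undefined is never passed along (illustrative only; it adds an explicit <<>> default to the lookup shown above):

Args3 = case couch_util:get_value(partition, Opts, <<>>) of
    <<>> ->
        %% No partition supplied: leave the args untouched.
        Args2;
    Partition ->
        couch_mrview_util:set_extra(Args2, partition, Partition)
end,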

@jaydoane (Contributor) left a comment:

All unit tests passed, but I got a significant number of failures with the elixir partition tests. I also built a fresh couchdb clone just to ensure I didn't have any junk lying around, and still got the same failures.

For now, I'm going to focus on the first failure:

MangoPartitionTest
  * test partitioned database partitioned query does not use global index (326.4ms)

  1) test partitioned database partitioned query does not use global index (MangoPartitionTest)
     test/mango_partition_test.exs:474
     Assertion with == failed
     code:  assert length(partitions) == 50
     left:  25
     right: 50
     stacktrace:
       test/mango_partition_test.exs:496: (test)

This failure results in the following in the logs, where the first line is the last entry from test cleanup; then, 5 minutes later, we see lots of rexi timeouts like this:

[error] 2018-10-11T19:26:39.221358Z [email protected] <0.1502.0> -------- rexi_server: from: [email protected](<0.1499.0>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,431}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,735}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]

UPDATE

The rexi timeouts appear to be a red herring; they have been seen in production logs without the partitioning work present. However, the intermittent test failures are real, and I've seen them in both bare-metal and containerized dev environments. They appear to come from races between indexing newly inserted documents and HTTP queries against those indexes. Failures can be mitigated individually, on a per-test basis, by wrapping the request and assertions inside a retry_until function. For example, this:

    create_docs(db_name)

    url = "/#{db_name}/_partition/foo/_find"
    resp = Couch.post(url, body: %{
      selector: %{
        some: "field"
      },
      limit: 20
    })

    assert resp.status_code == 200
    partitions = get_partitions(resp)
    assert length(partitions) == 20, Kernel.inspect(resp)
    assert_correct_partition(partitions, "foo")

becomes

    create_docs(db_name)

    url = "/#{db_name}/_partition/foo/_find"
    retry_until fn ->
      resp = Couch.post(url, body: %{
        selector: %{
          some: "field"
        },
        limit: 20
      })

      assert resp.status_code == 200
      partitions = get_partitions(resp)
      assert length(partitions) == 20, Kernel.inspect(resp)
      assert_correct_partition(partitions, "foo")
    end

retry_until also needs to be modified from its current form to handle assertions (and to fix a bug in the current epoch-ms calculation).

@@ -60,6 +60,7 @@ setup_legacy() ->
lists:foreach(fun(File) -> file:delete(File) end, Files),

% copy old db file into db dir
?debugFmt("copy ~p ~p", [OldDbFilePath, NewDbFilePath]),
Contributor:

remove debugging?

Docs = [couch_doc:from_json_obj_validate(JsonObj) || JsonObj <- DocsArray],
Docs = [begin
couch_doc:from_json_obj_validate(JsonObj)
end
Contributor:

I don't understand why begin and end were inserted here.

@@ -66,8 +66,22 @@ for_db(DbName, Options) ->
for_docid(DbName, DocId) ->
for_docid(DbName, DocId, []).

%% This function performs one or two lookups now as it is not known
%% ahead of time if the database is partitioned We first ask for the
Contributor:

Missing . after "partitioned".

is_partitioned_false_shards_db_test() ->
meck:expect(config, get, fun (_, _, Default) -> Default end),
?assertEqual(is_partitioned(<<"_dbs">>), false),
meck:unload().
Contributor:

This should be in a teardown function because leaving meck loaded in the case of test failure will cause later tests to fail in mysterious ways.
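For example, an eunit fixture along these lines guarantees the unload even when an assertion fails (a sketch; it assumes the module's existing is_partitioned/1 and mirrors the meck:expect already used here):

is_partitioned_system_dbs_test_() ->
    {setup,
        fun() ->
            meck:expect(config, get, fun(_, _, Default) -> Default end)
        end,
        fun(_) -> meck:unload() end,
        [
            %% Teardown runs regardless of test outcome, so meck never
            %% leaks into later tests.
            ?_assertEqual(false, is_partitioned(<<"_dbs">>)),
            ?_assertEqual(false, is_partitioned(<<"_nodes">>))
        ]}.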

is_partitioned_false_nodes_db_test() ->
meck:expect(config, get, fun (_, _, Default) -> Default end),
?assertEqual(is_partitioned(<<"_nodes">>), false),
meck:unload().
Contributor:

Another dangerous use of meck.

meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
Contributor:

ditto

meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
Contributor:

ditto

meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
Contributor:

ditto

meck:expect(mem3, shards, fun(<<"db">>) -> [] end),
shards(DbName, Args),
meck:validate(mem3),
meck:unload(mem3).
Contributor:

ditto

ok;
{<<"r">>, _Value} ->
?MANGO_ERROR(invalid_partition_read)

Contributor:

Extra space?

@jaydoane (Contributor):

@garrensmith I updated to the latest partition_tests commit, but am still getting the same 3 failures from make couch and the bevy of rexi timeouts which follow 5 minutes later. Next, I will try to repro on a different dev env.

docid_hash(DocId, []) when is_binary(DocId) ->
docid_hash(DocId, [{partitioned, false}]);

docid_hash(DocId, [{partitioned, false}]) when is_binary(DocId) ->
Contributor:

I know that up the call stack this parameter is referred to as HashOptions, but the current structure prevents this from ever actually being an options list, given that it matches on a single-element list. Adding another option to HashOptions would cause a function_clause error here, so why have it at all? This seems like unnecessary future-proofing, especially since using this field as an actual options list would require restructuring the function definitions: you can't write a function head that matches on an element of a proplist.

I suggest either making this an actual proplist and using a variation of couch_util:get_value(partitioned, Options), or eliminating the list entirely and making this a boolean parameter. You could rewrite this series of functions as something like:

docid_hash(DocId) ->
    docid_hash(DocId, false).

docid_hash(DocId, false) ->
    erlang:crc32(DocId);
docid_hash(<<"_design/", _/binary>> = DocId, _) ->
    erlang:crc32(DocId);
docid_hash(DocId, true) ->
    case binary:split(DocId, <<":">>) of
        [Partition, _Rest] ->
            erlang:crc32(Partition);
        _ ->
            throw({illegal_docid, <<"doc id must be of form partition:id">>})
    end.

docid_hash(DocId, []).

docid_hash(<<"_design/", _/binary>> = DocId, _Options) ->
erlang:crc32(DocId); % design docs are never placed by partition
Contributor:

Assuming the motivation for not using the hash/1 function is to avoid calling term_to_binary on the DocId, I think it would be wise to avoid redefining the use of erlang:crc32/1; we should instead call out to a default hash function that is used ubiquitously. Perhaps something like:

hash_raw(Item) ->
  erlang:crc32(Item).

hash(Item) ->
  hash_raw(term_to_binary(Item)).

%% ...

%% example usage
docid_hash(DocId, ...) ->
  hash_raw(DocId).

@wohali (Member) commented Nov 15, 2018:

Please note, we are moving swiftly now into the 2.3.0 release cycle. Until 2.3.x has branched, do not merge this PR into master.

I will update this PR when that has happened. Thanks!

@rnewson force-pushed the feature/user-partitioned-databases branch from ca603af to 1faa57a on November 23, 2018
@wohali (Member) commented Nov 29, 2018:

The 2.3.x branch has now forked.

You are free to merge this to master at your convenience.

@davisp mentioned this pull request Dec 3, 2018
@davisp closed this Dec 3, 2018
@davisp (Member) commented Dec 3, 2018:

Closed in favor of #1789

@davisp deleted the feature/user-partitioned-databases branch April 1, 2020 17:10