Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose index based on fields match #469

Merged
merged 3 commits into from
May 9, 2017
Merged

Choose index based on fields match #469

merged 3 commits into from
May 9, 2017

Conversation

garrensmith
Copy link
Member

@garrensmith garrensmith commented Apr 4, 2017

Overview

If two indexes can be used, then instead of choosing the index based on
alphabet order of index names, rather choose based on a score. This
score is calculated by determining which index has the least number of
fields

Testing recommendations

There is a test that proves the issue and passes based on this fix.

JIRA issue number

https://issues.apache.org/jira/browse/COUCHDB-3357

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;
  • I will not forget to update rebar.config.script
    with the correct commit hash once this PR get merged.

@garrensmith
Copy link
Member Author

I ran the pouchdb-find tests locally against this fix and all tests passed.

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part this looks pretty good. Although it doesn't solve the case where we have to choose an index with the same number of columns. Ie, if you have [a, b, c] and [a, d, e] and you have a selector for a.

While I don't want to tack it on as part of this PR, we do have a _count reduce for every view query. In the future we could get fancy and cache selectors and a reduce query with the prefix range and prefer the view that has more documents. Granted that gets complicated and does strange things to latencies when hitting cached vs. not. So basically ignore this paragraph for now and we'll want to pick some other arbitrary metric. The docid thing I think is wrong in hindsight but the only slightly less terrible thing I can think of is using the field names left in the index.

@@ -12,16 +12,16 @@

DreyfusAppFile = filename:join(filename:dirname(SCRIPT),
"../dreyfus/src/dreyfus.app.src"),
RenameFile = filename:join(filename:dirname(SCRIPT),
"src/mango_cursor_text.erl"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget a rebase? This looks like its reverting an unrelated change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there's a weird breakage down below with a file rename that looks not right to me.

Cmp = fun({A1, A2}, {B1, B2}) ->
case length(A2) - length(B2) of
Cmp = fun({IdxA, PrefixA, ScoreA}, {IdxB, PrefixB, ScoreB}) ->
case length(PrefixA) - length(PrefixB) of
N when N < 0 -> true;
N when N == 0 ->
% This is a really bad sort and will end
% up preferring indices based on the
% (dbname, ddocid, view_name) triple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to remove this comment since its no longer relevant.

A1 =< B1;
%IdxA =< IdxB;
%prefer using the index with the lower score
ScoreA =< ScoreB;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Score is a bit opaque. I'd just call it length and say that we prefer using the index with the least number of fields. Also I wouldn't bother calculating it in composite indices just to use it here and then have to ignore the return value. You've got all the info here so I'd just calclulate it on the fly.

@@ -270,4 +276,4 @@ is_design_doc(RowProps) ->
case couch_util:get_value(id, RowProps) of
<<"_design/", _/binary>> -> true;
_ -> false
end.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another unrelated change we should try to avoid.

[{Idx, Prefix} | Acc]
% create a score based on how close the number of fields
% the index has to the number of fields in the selector
Score = length(Cols) - length(FieldKeys),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should be looking at the length of the prefix, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on this, there are three things we're contemplating here:

  1. Number of {Field, Range} pairs in the selector.
  2. Number of columns in the index
  3. The shared prefix between those two.

The important case to remember here is that the selector may have more fields than are being satisfied by the index. Ie, a selector may only use the first one or two out of a three column index.

Which means that if you have a selector that has more fields than the index has columns it ends up having a negative score which seems.... odd. However, what we're really caring about here is that we're being the least specific once we have some part of an index satisfied. Its bit odd given that we also want the longest prefix. So the sort in words is basically something like:

I want to use the index with the most matching fields to my selector, but with the least extra fields not used by my selector (ie, the smallest length(Cols) - length(Prefix)).

Though that leaves us with the same current connundrum when we have two indexes with the same number of columns but sharing a common prefix with the selector. (the [a, b, c], [a, d, e] example I mentioned).

For that case I'd still change it away from the docid sort since that's rather ambiguous and generally not supplied by the user and as such is essentially random (ie, its a hash which is hard to predict, its not uuid random though (if memory serves)).

Anyway, so I'd use the length preference you've got and then break ties with the field name past the end of the sort and then I guess fall back to docid if we have two indices that have the same set of columns cause it shouldn't matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made an attempt at fixing this. I'm not 100% I fully understood this. In the situation of [a, b, c] and [a, d, e] I've defaulted to the indices names. The only problem with that is there is a possibility of a doc being in one index but not in another and this could lead to unexpected results for a user.

@tonysun83
Copy link
Contributor

tonysun83 commented Apr 5, 2017

Paul hit the nail for the gist of this, so the only thing I'm going to add is that for the tests, we should probably use the _explain endpoint and make sure we're choosing the right index. So

  1. Create a bunch of indexes with various columns lengths.

  2. Create selectors that use various combinations of fields

  3. Execute those queries but with _explain, and ensure that the correct index with the right prefix was used by checking the index name.

@garrensmith
Copy link
Member Author

garrensmith commented Apr 5, 2017

This got lost:
I've made an attempt at fixing this. I'm not 100% I fully understood @davisp explanation. In the situation of [a, b, c] and [a, d, e] I've defaulted to the indices names. The only problem with that is there is a possibility of a doc being in one index but not in another and this could lead to unexpected results for a user.

@tonysun83
Copy link
Contributor

tonysun83 commented Apr 6, 2017

@garrensmith : @davisp was just saying that if you have the following indexes:

Idx1 - [a, b, c, d]
Idx2 - [a, b, c, e]
Idx3 - [a, b, c]

and you issued a query with selectors using a,b,c, then we would choose Idx3.

If Idx3 didn't exist, then you need some sort of tiebreaker. The old way used a combination of
[dbname, ddocid, viewname]. He was just saying you could still use this crude method as a tiebreaker, and use the ddocid (not all three elements).

You are right that this won't solve the issue when Idx1 is always chosen, and then a new Idx4 [a, b, c, f] which is alphabetically ahead of Idx1is added, then the user could possibly get different results.

Paul did mention:

"In the future we could get fancy and cache selectors and a reduce query with the prefix range and prefer the view that has more documents."

so that we could possibly choose the index with more documents and be a little more consistent. However, it's still probable that some indexes will contain different docs. I'll think about it some more, but do you think that's more of index design on their part?

@garrensmith
Copy link
Member Author

@tonysun83 great. That makes sense. I think I have it to the point now where it will cover the above situations. Then in terms of choosing indexes with different docs, I think for now the best we can do is add documentation to explain that part.

[{Idx, Prefix} | Acc]
% create a ColPrefixLength based on how close the number of fields
% the index has to the number of Prefixes the index can use
ColPrefixLength = abs(length(Cols) - length(FieldKeys)),
Copy link
Contributor

@tonysun83 tonysun83 Apr 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This should be length(Prefix) instead of length(FieldKeys)?

  2. Not sure abs would work here because of the following scenario:
    IdxA has columns [a,b,c]
    IdxB has columns [a,b,c,d]
    Now our selector queries on a,b,c,f,g
    ColPrefixLength for IdxA becomes 2, ColPrefLength for IdxB becomes 1, then that would lead to IdxB being preferred, whereas IdxA should be used.

A1 =< B1;
% We have no other way to choose, so at this point
% select index based on (dbname, ddocid, view_name) triple
IdxA =< IdxB;
Copy link
Contributor

@tonysun83 tonysun83 Apr 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So with the old code (the same thing you have here), I ran a simple test, and here's what happens:

IdxA: [a,b,c]
IdxB: [a,d,e]

I insert two docs, {"a":1, "b":1, "c":1} and {"a":1000, "d" : 1000, "e": 1000}

then I run the query with _explain:

{
  "selector": {
    "a": {
      "$gt": 0
    },
    "b": {
      "$gt": 0
    },
    "c": {
      "$gt": 0
    }
  }
}

and I get back IdxB

"index": {
        "ddoc": "_design/ccf468609d8ca7ef1744797c6b685d82cb552335",
        "name": "IdxB",
        "type": "json",
        "def": {
            "fields": [
                {
                    "a": "asc"
                },
                {
                    "d": "asc"
                },
                {
                    "e": "asc"
                }
            ]
        }
    }

I think we've been wrong all this time, and it should be IdxA right?
Can you try out this scenario in the tests? We might have to change this comparison.

@tonysun83
Copy link
Contributor

tonysun83 commented Apr 25, 2017

Good work @garrensmith, +1 for me. @davisp way want to double check

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks pretty good with some minor style issues. We should also clear up the comments a bit. Took me a good fifteen minutes of contemplating to realize what was going on.

ColsLenA = length(mango_idx:columns(IdxA)),
ColsLenB = length(mango_idx:columns(IdxB)),
case ColsLenA - ColsLenB of
M when M < 0 -> true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true needs to be on a new line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the general formatting rule, when should it be on a new line?
Does the true on line 144 also need to be on a new line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, missed that one. The general rule is to not mix inline and multiline. If they any of the clauses can't be inlined then make all clauses start on a new line.

% We have no other way to choose, so at this point
% select index based on (dbname, ddocid, view_name) triple
IdxA =< IdxB;
_ -> false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false needs to be on a new line

end.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete these last two empty lines

@@ -124,25 +128,31 @@ composite_prefix([Col | Rest], Ranges) ->
[]
end.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this empty line back

lists:foldl(fun(Idx, Acc) ->
Cols = mango_idx:columns(Idx),
Prefix = composite_prefix(Cols, FieldRanges),
[{Idx, Prefix} | Acc]
% create a KeysPrefixLength based on how close the prefix of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize "Create" and remove "a".

[{Idx, Prefix} | Acc]
% create a KeysPrefixLength based on how close the prefix of
% the index is to the number of keys in the selector
KeysPrefixLength = length(FieldKeys) - length(Prefix),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating FieldKeys is unecessary. Just use length(FieldRanges) here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following the logic on subtracting the prefix from the total length. At least, the name seems to be misleading. Shouldn't it be something like "ColumnsAfterPrefix"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"NumExtraColumns" seems like it'd be better. Also the comment should be updated to specify that we're looking to use the index with the fewest extra columns. And we should add a comment below in choose_best_index.

Cmp = fun({A1, A2}, {B1, B2}) ->
case length(A2) - length(B2) of
Cmp = fun({IdxA, PrefixA, KeysPrefixLengthA}, {IdxB, PrefixB, KeysPrefixLengthB}) ->
case KeysPrefixLengthA - KeysPrefixLengthB of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to stare at the above code but I'm pretty sure this logic boils down to "prefer the index with the fewest extra columns" which while legit is hard to follow based on the variable names.

% pass: Sort the IndexRanges by (num_columns, idx_name)
% and return the first element. Yes. Its going to
% be that dumb for now.
% Low and behold our query planner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why'd you remove ". Or something." Seems an odd change to make. Removing the snark is fine but I'd just remove the whole line. Also I think it should actually be "Lo and behold" but my Google is too far away to check for certain.

% and return the first element. Yes. Its going to
% be that dumb for now.
% Low and behold our query planner
% First pass: sort the IndexRanges by KeysPrefixLength,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"First pass:" is kind of odd since we're not labeling anything else. I would rephrase this sentence to avoid the colon there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively, create a numbered list of the steps of the algorithm.

If two indexes can be used, then instead of choosing the index based on
alphabet order of index names, rather choose based on a query prefix and
number of columns.
Choose the index for the query based on Prefix and number of fields in
the index
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to squerge

@garrensmith garrensmith merged commit 9568bb7 into apache:master May 9, 2017
@garrensmith garrensmith deleted the improve-choose-best-index branch December 12, 2019 17:20
nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022
Document Mango configuration options
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants