Choose index based on fields match #469

garrensmith · 2017-04-04T15:13:03Z

Overview

If two indexes can be used, then instead of choosing the index based on
alphabet order of index names, rather choose based on a score. This
score is calculated by determining which index has the least number of
fields

Testing recommendations

There is a test that proves the issue and passes based on this fix.

JIRA issue number

https://issues.apache.org/jira/browse/COUCHDB-3357

Checklist

Code is written and works correctly;
Changes are covered by tests;
Documentation reflects the changes;
I will not forget to update rebar.config.script
with the correct commit hash once this PR get merged.

garrensmith · 2017-04-04T15:32:08Z

I ran the pouchdb-find tests locally against this fix and all tests passed.

davisp

For the most part this looks pretty good. Although it doesn't solve the case where we have to choose an index with the same number of columns. Ie, if you have [a, b, c] and [a, d, e] and you have a selector for a.

While I don't want to tack it on as part of this PR, we do have a _count reduce for every view query. In the future we could get fancy and cache selectors and a reduce query with the prefix range and prefer the view that has more documents. Granted that gets complicated and does strange things to latencies when hitting cached vs. not. So basically ignore this paragraph for now and we'll want to pick some other arbitrary metric. The docid thing I think is wrong in hindsight but the only slightly less terrible thing I can think of is using the field names left in the index.

davisp · 2017-04-04T15:58:28Z

src/mango/rebar.config.script

@@ -12,16 +12,16 @@

 DreyfusAppFile = filename:join(filename:dirname(SCRIPT),
 "../dreyfus/src/dreyfus.app.src"),
-RenameFile = filename:join(filename:dirname(SCRIPT),
- "src/mango_cursor_text.erl"),


Did you forget a rebase? This looks like its reverting an unrelated change.

And there's a weird breakage down below with a file rename that looks not right to me.

davisp · 2017-04-04T16:03:39Z

src/mango/src/mango_cursor_view.erl

- Cmp = fun({A1, A2}, {B1, B2}) ->
- case length(A2) - length(B2) of
+ Cmp = fun({IdxA, PrefixA, ScoreA}, {IdxB, PrefixB, ScoreB}) ->
+ case length(PrefixA) - length(PrefixB) of
 N when N < 0 -> true;
 N when N == 0 ->
 % This is a really bad sort and will end
 % up preferring indices based on the
 % (dbname, ddocid, view_name) triple


You'll want to remove this comment since its no longer relevant.

davisp · 2017-04-04T16:14:17Z

src/mango/src/mango_cursor_view.erl

- A1 =< B1;
+ %IdxA =< IdxB;
+ %prefer using the index with the lower score
+ ScoreA =< ScoreB;


Score is a bit opaque. I'd just call it length and say that we prefer using the index with the least number of fields. Also I wouldn't bother calculating it in composite indices just to use it here and then have to ignore the return value. You've got all the info here so I'd just calclulate it on the fly.

davisp · 2017-04-04T16:14:31Z

src/mango/src/mango_cursor_view.erl

@@ -270,4 +276,4 @@ is_design_doc(RowProps) ->
 case couch_util:get_value(id, RowProps) of
 <<"_design/", _/binary>> -> true;
 _ -> false
- end.


Another unrelated change we should try to avoid.

davisp · 2017-04-04T16:21:59Z

src/mango/src/mango_cursor_view.erl

- [{Idx, Prefix} | Acc]
+ % create a score based on how close the number of fields
+ % the index has to the number of fields in the selector
+ Score = length(Cols) - length(FieldKeys),


Seems like we should be looking at the length of the prefix, no?

To expand on this, there are three things we're contemplating here:

Number of {Field, Range} pairs in the selector.

Number of columns in the index

The shared prefix between those two.

The important case to remember here is that the selector may have more fields than are being satisfied by the index. Ie, a selector may only use the first one or two out of a three column index.

Which means that if you have a selector that has more fields than the index has columns it ends up having a negative score which seems.... odd. However, what we're really caring about here is that we're being the least specific once we have some part of an index satisfied. Its bit odd given that we also want the longest prefix. So the sort in words is basically something like:

I want to use the index with the most matching fields to my selector, but with the least extra fields not used by my selector (ie, the smallest length(Cols) - length(Prefix)).

Though that leaves us with the same current connundrum when we have two indexes with the same number of columns but sharing a common prefix with the selector. (the [a, b, c], [a, d, e] example I mentioned).

For that case I'd still change it away from the docid sort since that's rather ambiguous and generally not supplied by the user and as such is essentially random (ie, its a hash which is hard to predict, its not uuid random though (if memory serves)).

Anyway, so I'd use the length preference you've got and then break ties with the field name past the end of the sort and then I guess fall back to docid if we have two indices that have the same set of columns cause it shouldn't matter?

I've made an attempt at fixing this. I'm not 100% I fully understood this. In the situation of [a, b, c] and [a, d, e] I've defaulted to the indices names. The only problem with that is there is a possibility of a doc being in one index but not in another and this could lead to unexpected results for a user.

tonysun83 · 2017-04-05T07:49:06Z

Paul hit the nail for the gist of this, so the only thing I'm going to add is that for the tests, we should probably use the _explain endpoint and make sure we're choosing the right index. So

Create a bunch of indexes with various columns lengths.
Create selectors that use various combinations of fields
Execute those queries but with _explain, and ensure that the correct index with the right prefix was used by checking the index name.

garrensmith · 2017-04-05T18:11:50Z

This got lost:
I've made an attempt at fixing this. I'm not 100% I fully understood @davisp explanation. In the situation of [a, b, c] and [a, d, e] I've defaulted to the indices names. The only problem with that is there is a possibility of a doc being in one index but not in another and this could lead to unexpected results for a user.

tonysun83 · 2017-04-06T07:22:32Z

@garrensmith : @davisp was just saying that if you have the following indexes:

Idx1 - [a, b, c, d]
Idx2 - [a, b, c, e]
Idx3 - [a, b, c]

and you issued a query with selectors using a,b,c, then we would choose Idx3.

If Idx3 didn't exist, then you need some sort of tiebreaker. The old way used a combination of
[dbname, ddocid, viewname]. He was just saying you could still use this crude method as a tiebreaker, and use the ddocid (not all three elements).

You are right that this won't solve the issue when Idx1 is always chosen, and then a new Idx4 [a, b, c, f] which is alphabetically ahead of Idx1is added, then the user could possibly get different results.

Paul did mention:

"In the future we could get fancy and cache selectors and a reduce query with the prefix range and prefer the view that has more documents."

so that we could possibly choose the index with more documents and be a little more consistent. However, it's still probable that some indexes will contain different docs. I'll think about it some more, but do you think that's more of index design on their part?

garrensmith · 2017-04-06T12:22:28Z

@tonysun83 great. That makes sense. I think I have it to the point now where it will cover the above situations. Then in terms of choosing indexes with different docs, I think for now the best we can do is add documentation to explain that part.

tonysun83 · 2017-04-07T00:43:15Z

src/mango/src/mango_cursor_view.erl

- [{Idx, Prefix} | Acc]
+ % create a ColPrefixLength based on how close the number of fields
+ % the index has to the number of Prefixes the index can use
+ ColPrefixLength = abs(length(Cols) - length(FieldKeys)),


This should be length(Prefix) instead of length(FieldKeys)?

Not sure abs would work here because of the following scenario:
IdxA has columns [a,b,c]
IdxB has columns [a,b,c,d]
Now our selector queries on a,b,c,f,g
ColPrefixLength for IdxA becomes 2, ColPrefLength for IdxB becomes 1, then that would lead to IdxB being preferred, whereas IdxA should be used.

tonysun83 · 2017-04-07T00:51:54Z

src/mango/src/mango_cursor_view.erl

- A1 =< B1;
+ % We have no other way to choose, so at this point 
+ % select index based on (dbname, ddocid, view_name) triple
+ IdxA =< IdxB;


So with the old code (the same thing you have here), I ran a simple test, and here's what happens:

IdxA: [a,b,c]
IdxB: [a,d,e]

I insert two docs, {"a":1, "b":1, "c":1} and {"a":1000, "d" : 1000, "e": 1000}

then I run the query with _explain:

{ "selector": { "a": { "$gt": 0 }, "b": { "$gt": 0 }, "c": { "$gt": 0 } } }

and I get back IdxB

"index": { "ddoc": "_design/ccf468609d8ca7ef1744797c6b685d82cb552335", "name": "IdxB", "type": "json", "def": { "fields": [ { "a": "asc" }, { "d": "asc" }, { "e": "asc" } ] } }

I think we've been wrong all this time, and it should be IdxA right?
Can you try out this scenario in the tests? We might have to change this comparison.

tonysun83 · 2017-04-25T06:34:30Z

Good work @garrensmith, +1 for me. @davisp way want to double check

davisp

Generally looks pretty good with some minor style issues. We should also clear up the comments a bit. Took me a good fifteen minutes of contemplating to realize what was going on.

davisp · 2017-04-27T22:08:32Z

src/mango/src/mango_cursor_view.erl

+ ColsLenA = length(mango_idx:columns(IdxA)),
+ ColsLenB = length(mango_idx:columns(IdxB)),
+ case ColsLenA - ColsLenB of
+ M when M < 0 -> true;


true needs to be on a new line

What is the general formatting rule, when should it be on a new line?
Does the true on line 144 also need to be on a new line?

Yes, missed that one. The general rule is to not mix inline and multiline. If they any of the clauses can't be inlined then make all clauses start on a new line.

davisp · 2017-04-27T22:08:40Z

src/mango/src/mango_cursor_view.erl

+ % We have no other way to choose, so at this point 
+ % select index based on (dbname, ddocid, view_name) triple
+ IdxA =< IdxB;
+ _ -> false


false needs to be on a new line

davisp · 2017-04-27T22:09:08Z

src/mango/src/mango_cursor_view.erl

+ end.
+
+


Delete these last two empty lines

davisp · 2017-04-27T22:09:30Z

src/mango/src/mango_cursor_view.erl

@@ -124,25 +128,31 @@ composite_prefix([Col | Rest], Ranges) ->
 []
 end.

-


Add this empty line back

davisp · 2017-04-27T22:10:15Z

src/mango/src/mango_cursor_view.erl

 lists:foldl(fun(Idx, Acc) ->
 Cols = mango_idx:columns(Idx),
 Prefix = composite_prefix(Cols, FieldRanges),
- [{Idx, Prefix} | Acc]
+ % create a KeysPrefixLength based on how close the prefix of


Capitalize "Create" and remove "a".

davisp · 2017-04-27T22:10:55Z

src/mango/src/mango_cursor_view.erl

- [{Idx, Prefix} | Acc]
+ % create a KeysPrefixLength based on how close the prefix of
+ % the index is to the number of keys in the selector
+ KeysPrefixLength = length(FieldKeys) - length(Prefix),


Creating FieldKeys is unecessary. Just use length(FieldRanges) here.

I'm not following the logic on subtracting the prefix from the total length. At least, the name seems to be misleading. Shouldn't it be something like "ColumnsAfterPrefix"?

"NumExtraColumns" seems like it'd be better. Also the comment should be updated to specify that we're looking to use the index with the fewest extra columns. And we should add a comment below in choose_best_index.

davisp · 2017-04-27T22:13:05Z

src/mango/src/mango_cursor_view.erl

- Cmp = fun({A1, A2}, {B1, B2}) ->
- case length(A2) - length(B2) of
+ Cmp = fun({IdxA, PrefixA, KeysPrefixLengthA}, {IdxB, PrefixB, KeysPrefixLengthB}) ->
+ case KeysPrefixLengthA - KeysPrefixLengthB of


I have to stare at the above code but I'm pretty sure this logic boils down to "prefer the index with the fewest extra columns" which while legit is hard to follow based on the variable names.

davisp · 2017-04-27T22:14:59Z

src/mango/src/mango_cursor_view.erl

-% pass: Sort the IndexRanges by (num_columns, idx_name)
-% and return the first element. Yes. Its going to
-% be that dumb for now.
+% Low and behold our query planner


Why'd you remove ". Or something." Seems an odd change to make. Removing the snark is fine but I'd just remove the whole line. Also I think it should actually be "Lo and behold" but my Google is too far away to check for certain.

davisp · 2017-04-27T22:15:49Z

src/mango/src/mango_cursor_view.erl

-% and return the first element. Yes. Its going to
-% be that dumb for now.
+% Low and behold our query planner
+% First pass: sort the IndexRanges by KeysPrefixLength,


"First pass:" is kind of odd since we're not labeling anything else. I would rephrase this sentence to avoid the colon there.

Or alternatively, create a numbered list of the steps of the algorithm.

If two indexes can be used, then instead of choosing the index based on alphabet order of index names, rather choose based on a query prefix and number of columns.

Choose the index for the query based on Prefix and number of fields in the index

davisp

+1 to squerge

Document Mango configuration options

davisp requested changes Apr 4, 2017

View reviewed changes

tonysun83 reviewed Apr 7, 2017

View reviewed changes

davisp requested changes Apr 27, 2017

View reviewed changes

wohali added the enhancement label Apr 30, 2017

garrensmith added 3 commits May 2, 2017 11:07

Choose index based on Prefix and fields

bd71cb3

If two indexes can be used, then instead of choosing the index based on alphabet order of index names, rather choose based on a query prefix and number of columns.

Improve index selector

04de610

Choose the index for the query based on Prefix and number of fields in the index

fixes from reviews

c6ca869

davisp approved these changes May 9, 2017

View reviewed changes

garrensmith merged commit 9568bb7 into apache:master May 9, 2017

garrensmith deleted the improve-choose-best-index branch December 12, 2019 17:20

nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022

Merge pull request apache#469 from apache/mango_config

9dc5e04

Document Mango configuration options

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose index based on fields match #469

Choose index based on fields match #469

garrensmith commented Apr 4, 2017 •

edited

Loading

garrensmith commented Apr 4, 2017

davisp left a comment

davisp Apr 4, 2017

davisp Apr 4, 2017

davisp Apr 4, 2017

davisp Apr 4, 2017

davisp Apr 4, 2017

davisp Apr 4, 2017

davisp Apr 4, 2017

garrensmith Apr 5, 2017

tonysun83 commented Apr 5, 2017 •

edited

Loading

garrensmith commented Apr 5, 2017 •

edited

Loading

tonysun83 commented Apr 6, 2017 •

edited

Loading

garrensmith commented Apr 6, 2017

tonysun83 Apr 7, 2017 •

edited

Loading

tonysun83 Apr 7, 2017 •

edited

Loading

tonysun83 commented Apr 25, 2017 •

edited

Loading

davisp left a comment

davisp Apr 27, 2017

garrensmith May 2, 2017

davisp May 2, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp Apr 27, 2017

davisp left a comment

		@@ -124,25 +128,31 @@ composite_prefix([Col \| Rest], Ranges) ->
		[]
		end.

Choose index based on fields match #469

Choose index based on fields match #469

Conversation

garrensmith commented Apr 4, 2017 • edited Loading

Overview

Testing recommendations

JIRA issue number

Checklist

garrensmith commented Apr 4, 2017

davisp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tonysun83 commented Apr 5, 2017 • edited Loading

garrensmith commented Apr 5, 2017 • edited Loading

tonysun83 commented Apr 6, 2017 • edited Loading

garrensmith commented Apr 6, 2017

tonysun83 Apr 7, 2017 • edited Loading

Choose a reason for hiding this comment

tonysun83 Apr 7, 2017 • edited Loading

Choose a reason for hiding this comment

tonysun83 commented Apr 25, 2017 • edited Loading

davisp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davisp left a comment

Choose a reason for hiding this comment

garrensmith commented Apr 4, 2017 •

edited

Loading

tonysun83 commented Apr 5, 2017 •

edited

Loading

garrensmith commented Apr 5, 2017 •

edited

Loading

tonysun83 commented Apr 6, 2017 •

edited

Loading

tonysun83 Apr 7, 2017 •

edited

Loading

tonysun83 Apr 7, 2017 •

edited

Loading

tonysun83 commented Apr 25, 2017 •

edited

Loading