
Use efficient set storage for field names #471

Merged — 1 commit merged into master on Jun 21, 2017
Conversation

@tonysun83 (Contributor) commented Apr 5, 2017

Overview

When indexing a set of fields for text search, we also create a special
field called $fieldnames. It contains values for all the fields that
need to be indexed. In order to do that, we need a unique list of the
form [[<<"$fieldnames">>, Name, []] | Rest]. The old code would add an
element to the list, and then check for membership via lists:member/2.
This is inefficient. Some documents can contain a large number of
fields, so we will use gb_sets to create a unique set of fields, and
then extract out the field names.
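The idea above can be sketched as follows. This is a minimal illustration, not the actual mango module: the module and function names are made up here, and `Fields` is assumed to be a list of `{FieldName, Type, Value}` tuples as described in the review discussion.

```erlang
%% field_names_sketch.erl — a sketch of the old vs. new dedup strategy.
-module(field_names_sketch).
-export([old_get_field_names/1, new_get_field_names/1]).

%% Old approach: effectively O(n^2) — every insertion scans the whole
%% accumulator with lists:member/2 before prepending.
old_get_field_names(Fields) ->
    lists:foldl(
        fun({Name, _Type, _Value}, FAcc) ->
            Entry = [<<"$fieldnames">>, Name, []],
            case lists:member(Entry, FAcc) of
                true -> FAcc;
                false -> [Entry | FAcc]
            end
        end, [], Fields).

%% New approach: O(n log n) — gb_sets deduplicates on insertion.
%% Value is dropped before adding, so duplicate field names with
%% different values still collapse to a single entry.
new_get_field_names(Fields) ->
    Set = lists:foldl(
        fun({Name, _Type, _Value}, Acc) ->
            gb_sets:add([<<"$fieldnames">>, Name, []], Acc)
        end, gb_sets:new(), Fields),
    gb_sets:to_list(Set).
```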

An alternative is to just use lists:usort/1. For small sets it would be more
efficient, but if a document is large enough to contain thousands of fields,
gb_sets seems to perform better. The tradeoff is that while gb_sets could take
longer overall for indexing, it could reduce the probability of timing out when
datasets are extremely large.
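For comparison, the lists:usort/1 alternative could look like this. This is a sketch only; the module name is invented and the `{FieldName, Type, Value}` tuple shape is assumed from the review discussion.

```erlang
-module(usort_fields).
-export([get_field_names/1]).

%% Map first so the Value component cannot defeat deduplication,
%% then let usort sort and drop duplicates in one O(n log n) pass.
get_field_names(Fields) ->
    lists:usort([[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- Fields]).
```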

Testing recommendations

There are no user-facing changes, as this is a performance enhancement. A full regression should be sufficient. Testing requires https://github.com/cloudant-labs/dreyfus to be installed.

JIRA issue number

COUCHDB-3358

Related Pull Requests

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;
  • I will not forget to update rebar.config.script
    with the correct commit hash once this PR gets merged.

@sagelywizard (Member)

Does this just push the failure from CouchDB to Clouseau? I think that this code originally assumed that the number of fields per doc is small enough to be considered constant, which is why O(n^2) with a fast operation like lists:member didn't matter. I think Clouseau makes the same assumption, so this change alone won't fix anything and will simply consume more CPU in the simple case. I might be wrong. @rnewson Have thoughts?

@tonysun83 (Contributor, Author)

@sagelywizard: yes, it does just push the failure over to Clouseau. We've already seen this in production when users index everything and the number of unique fields blows up the heap memory. In this particular case, there were so many fields that the request never even made it to Clouseau and got stuck in this process.

We have an option for users to disable the indexing of everything in the document, but if they still decide to index a heavily nested field with objects and arrays, then this issue could still occur. A while back, we discussed the possibility of adding another limit (the number of levels deep we go into the object, or some total number of fields). Unfortunately, that would mean we'd also lose some querying ability, because those fields won't be indexed. That's for a bigger PR.

This is more of a cosmetic change, so we don't necessarily have to merge it.

```erlang
        get_field_names(Rest, [[<<"$fieldnames">>, Name, []] | FAcc])
    end.

get_field_names(Fields) ->
    GBFieldSet = gb_sets:from_list(Fields),
```
@iilyak (Contributor)
Fields is a list of 3-tuples of the structure {FieldName, Type, Value}. We want to ignore Value. Something like this would do it:

```erlang
Set = lists:foldl(
    fun({Name, _, _}, Set) ->
        gb_sets:add([<<"$fieldnames">>, Name, []], Set)
    end, gb_sets:new(), Fields),
gb_sets:to_list(Set).
```

@tonysun83 (Contributor, Author)

@iilyak: we're ignoring it below when we use the list comprehension to build the return value. Do you think it's more efficient to build the gb_set without the value rather than just converting the whole thing?

@iilyak (Contributor)

I'll try to explain the problem by example.

```erlang
14> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
15> GBFieldSet = gb_sets:from_list(Fields).
{2,{{foo,type,v2},{{foo,type,v1},nil,nil},nil}}
16> UFields = gb_sets:to_list(GBFieldSet).
[{foo,type,v1},{foo,type,v2}]
17> [[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- UFields].
[[<<"$fieldnames">>,foo,[]],[<<"$fieldnames">>,foo,[]]]
```

The same field is present multiple times ^^^.

```erlang
18> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
19> Set = lists:foldl(
19>     fun({Name, _, _}, Set) ->
19>          gb_sets:add([<<"$fieldnames">>, Name, []], Set)
19>     end, gb_sets:new(), Fields),
19> gb_sets:to_list(Set).
[[<<"$fieldnames">>,foo,[]]]
```

@tonysun83 (Contributor, Author)

@iilyak : I've updated with your suggestion

@rnewson (Member) commented Apr 7, 2017

I'm fine with improving mango performance. We should look at restricting the field count on the Clouseau side, though.

@iilyak (Contributor) left a review comment:

+1 after squash

@tonysun83 tonysun83 merged commit 4e5e84e into master Jun 21, 2017
@tonysun83 tonysun83 deleted the 3358-use-efficient-set branch July 17, 2017 15:43
nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022
Current up through 2020.01.15, any changes to couchdb since then still need documenting.