
Use efficient set storage for field names #471

Merged — 1 commit merged into master on Jun 21, 2017
Conversation

@tonysun83 (Contributor) commented Apr 5, 2017

Overview

When indexing a set of fields for text search, we also create a special
field called $fieldnames. It contains values for all the fields that
need to be indexed. In order to do that, we need a unique list of the
form [[<<"$fieldnames">>, Name, []] | Rest]. The old code would add an
element to the list, and then check for membership via lists:member/2.
This is inefficient. Some documents can contain a large number of
fields, so we will use gb_sets to create a unique set of fields, and
then extract out the field names.
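The idea above can be sketched as follows. This is a minimal illustration, not the actual mango module: the module and function names are made up here, and `Fields` is assumed to be a list of `{FieldName, Type, Value}` tuples as described in the review discussion.

```erlang
%% field_names_sketch.erl — a sketch of the old vs. new dedup strategy.
-module(field_names_sketch).
-export([old_get_field_names/1, new_get_field_names/1]).

%% Old approach: effectively O(n^2) — every insertion scans the whole
%% accumulator with lists:member/2 before prepending.
old_get_field_names(Fields) ->
    lists:foldl(
        fun({Name, _Type, _Value}, FAcc) ->
            Entry = [<<"$fieldnames">>, Name, []],
            case lists:member(Entry, FAcc) of
                true -> FAcc;
                false -> [Entry | FAcc]
            end
        end, [], Fields).

%% New approach: O(n log n) — gb_sets deduplicates on insertion.
%% Value is dropped before adding, so duplicate field names with
%% different values still collapse to a single entry.
new_get_field_names(Fields) ->
    Set = lists:foldl(
        fun({Name, _Type, _Value}, Acc) ->
            gb_sets:add([<<"$fieldnames">>, Name, []], Acc)
        end, gb_sets:new(), Fields),
    gb_sets:to_list(Set).
```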

An alternative is to just use lists:usort/1. For small sets it would be more
efficient, but if a document is large enough to contain thousands of fields,
gb_sets seems to perform better. The tradeoff is that while gb_sets could take
longer overall for indexing, it could reduce the probability of timing out when
datasets are extremely large.
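For comparison, the lists:usort/1 alternative could look like this. This is a sketch only; the module name is invented and the `{FieldName, Type, Value}` tuple shape is assumed from the review discussion.

```erlang
-module(usort_fields).
-export([get_field_names/1]).

%% Map first so the Value component cannot defeat deduplication,
%% then let usort sort and drop duplicates in one O(n log n) pass.
get_field_names(Fields) ->
    lists:usort([[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- Fields]).
```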

Testing recommendations

There are no user-facing changes, as this is a performance enhancement. A full regression should be sufficient. Testing requires https://github.com/cloudant-labs/dreyfus to be installed.

JIRA issue number

COUCHDB-3358

Related Pull Requests

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;
  • I will not forget to update rebar.config.script
    with the correct commit hash once this PR gets merged.

@sagelywizard (Member)

Does this just push the failure from CouchDB to Clouseau? I think that this code originally assumed that the number of fields per doc is small enough to be considered constant, which is why O(n^2) with a fast operation like lists:member didn't matter. I think Clouseau makes the same assumption, so this change alone won't fix anything and will simply consume more CPU in the simple case. I might be wrong. @rnewson Have thoughts?

@tonysun83 (Contributor, Author)

@sagelywizard: yes, it does just push the failure over to Clouseau. We've already seen this in production when users index everything and the number of unique fields blows up the heap memory. In this particular case, there were so many fields that the request never even made it to Clouseau and got stuck in this process.

We have an option for users to disable the indexing of everything in the document, but if they still decide to index a heavily nested field with objects and arrays, then this issue could still occur. A while back, we discussed the possibility of adding another limit (the number of levels deep we go into the object, or some total number of fields). Unfortunately, that would mean we'd also lose some querying ability, because those fields won't be indexed. That's for a bigger PR.

This is more of a cosmetic change, so we don't necessarily have to merge it.

```erlang
        get_field_names(Rest, [[<<"$fieldnames">>, Name, []] | FAcc])
    end.

get_field_names(Fields) ->
    GBFieldSet = gb_sets:from_list(Fields),
```
@iilyak (Contributor)
Fields is a list of 3-tuples of the structure {FieldName, Type, Value}. We want to ignore Value. Something like this would do it:

```erlang
Set = lists:foldl(
    fun({Name, _, _}, Set) ->
        gb_sets:add([<<"$fieldnames">>, Name, []], Set)
    end, gb_sets:new(), Fields),
gb_sets:to_list(Set).
```

@tonysun83 (Contributor, Author)

@iilyak: we're ignoring it below when we use the list comprehension to build the return value. Do you think it's more efficient to build the gb_set without the value rather than just converting the whole thing?

@iilyak (Contributor)

I'll try to explain the problem by example.

```erlang
14> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
15> GBFieldSet = gb_sets:from_list(Fields).
{2,{{foo,type,v2},{{foo,type,v1},nil,nil},nil}}
16> UFields = gb_sets:to_list(GBFieldSet).
[{foo,type,v1},{foo,type,v2}]
17> [[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- UFields].
[[<<"$fieldnames">>,foo,[]],[<<"$fieldnames">>,foo,[]]]
```

The same field is present multiple times ^^^.

```erlang
18> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
19> Set = lists:foldl(
19>     fun({Name, _, _}, Set) ->
19>          gb_sets:add([<<"$fieldnames">>, Name, []], Set)
19>     end, gb_sets:new(), Fields),
19> gb_sets:to_list(Set).
[[<<"$fieldnames">>,foo,[]]]
```

@tonysun83 (Contributor, Author)

@iilyak : I've updated with your suggestion

@rnewson (Member) commented Apr 7, 2017

I'm fine with improving mango performance. We should look at restricting the field count on the Clouseau side, though.

@iilyak (Contributor) left a review comment:

+1 after squash

@tonysun83 tonysun83 merged commit 4e5e84e into master Jun 21, 2017
@tonysun83 tonysun83 deleted the 3358-use-efficient-set branch July 17, 2017 15:43
nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022
Current up through 2020.01.15, any changes to couchdb since then still need documenting.