Use efficient set storage for field names #471
Conversation
Does this just push the failure from CouchDB to Clouseau? I think that this code originally assumed that the number of fields per doc is small enough to be considered constant, which is why O(n^2) with a fast operation like lists:member/2 was considered acceptable.
@sagelywizard: yes, it does just push the failure over to Clouseau. We've already seen in production that when users index everything, the number of unique fields blows up the heap memory. In this particular case, there were so many fields that it never even made it to Clouseau and got stuck in this process. We have an option for users to disable the indexing of everything in the document, but if they still decide to index a heavily nested field with objects and arrays, then this issue could still occur. A while back, we discussed the possibility of adding another limit (# of levels deep we go in the object, or some total number of fields). Unfortunately, that would mean we'd also lose some ability for querying because those fields won't be indexed. That's for a bigger PR. This is more of a cosmetic change, so we don't necessarily have to merge it.
src/mango/src/mango_native_proc.erl
Outdated
get_field_names(Rest, [[<<"$fieldnames">>, Name, []] | FAcc])
end.
get_field_names(Fields) ->
    GBFieldSet = gb_sets:from_list(Fields),
Fields is a list of size-3 tuples of the following structure: {FieldName, Type, Value}. We would want to ignore Value. Something like this would do it:
Set = lists:foldl(
    fun({Name, _, _}, Set) ->
        gb_sets:add([<<"$fieldnames">>, Name, []], Set)
    end, gb_sets:new(), Fields),
gb_sets:to_list(Set).
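For context, the suggestion above can be fleshed out into a complete, compilable sketch. The module name `fieldnames_sketch` is hypothetical, and the `{FieldName, Type, Value}` tuple shape follows the description in this thread; this is not the verbatim `mango_native_proc.erl` code.

```erlang
%% Hypothetical standalone sketch of the suggested approach: build the
%% gb_set keyed only on the field name, ignoring Type and Value, so
%% duplicate field names collapse to a single entry.
-module(fieldnames_sketch).
-export([get_field_names/1]).

get_field_names(Fields) ->
    Set = lists:foldl(
        fun({Name, _Type, _Value}, Acc) ->
            %% Add the [<<"$fieldnames">>, Name, []] triple; gb_sets
            %% silently ignores elements that are already present.
            gb_sets:add([<<"$fieldnames">>, Name, []], Acc)
        end, gb_sets:new(), Fields),
    gb_sets:to_list(Set).
```

With input `[{foo, type, v1}, {foo, type, v2}]` this returns a single `[<<"$fieldnames">>, foo, []]` entry, matching the shell session shown further down in the thread.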
@iilyak: we're ignoring it below when we use the list comprehension to build the return value. Do you think it's more efficient to build the gb_set without the value rather than just converting the whole thing?
I'll try to explain the problem by example.
14> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
15> GBFieldSet = gb_sets:from_list(Fields).
{2,{{foo,type,v2},{{foo,type,v1},nil,nil},nil}}
16> UFields = gb_sets:to_list(GBFieldSet).
[{foo,type,v1},{foo,type,v2}]
17> [[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- UFields].
[[<<"$fieldnames">>,foo,[]],[<<"$fieldnames">>,foo,[]]]
The same field is present multiple times ^^^.
18> Fields = [{foo, type, v1}, {foo, type, v2}].
[{foo,type,v1},{foo,type,v2}]
19> Set = lists:foldl(
19> fun({Name, _, _}, Set) ->
19> gb_sets:add([<<"$fieldnames">>, Name, []], Set)
19> end, gb_sets:new(), Fields),
19> gb_sets:to_list(Set).
[[<<"$fieldnames">>,foo,[]]]
@iilyak : I've updated with your suggestion
I'm fine with improving mango performance. We should look at restricting the field count on the Clouseau side, though.
+1 after squash
When indexing a set of fields for text search, we also create a special field called $fieldnames. It contains values for all the fields that need to be indexed. In order to do that, we need a unique list of the form [[<<"$fieldnames">>, Name, []] | Rest]. The old code would add an element to the list, and then check for membership via lists:member/2. This is inefficient. Some documents can contain a large number of fields, so we will use gb_sets to create a unique set of fields, and then extract out the field names. COUCHDB-3358
Force-pushed from 13f6bc3 to 4e5e84e
Current up through 2020.01.15; any changes to CouchDB since then still need documenting.
Overview
When indexing a set of fields for text search, we also create a special
field called $fieldnames. It contains values for all the fields that
need to be indexed. In order to do that, we need a unique list of the
form [[<<"$fieldnames">>, Name, []] | Rest]. The old code would add an
element to the list, and then check for membership via lists:member/2.
This is inefficient. Some documents can contain a large number of
fields, so we will use gb_sets to create a unique set of fields, and
then extract out the field names.
An alternative is to just use lists:usort/1. For small lists, it would be more efficient, but if a document is large enough to contain thousands of fields, gb_sets seems to perform better. The tradeoff is that while gb_sets could take longer overall for indexing, it could reduce the probability of timing out when datasets are extremely large.
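For comparison, the lists:usort/1 alternative mentioned above can be sketched as follows. The module name `fieldnames_usort` is hypothetical; the output is the same sorted, deduplicated list as the gb_sets version, since both gb_sets:to_list/1 and lists:usort/1 return elements in term order.

```erlang
%% Hypothetical sketch of the usort-based alternative: materialize the
%% full list of [<<"$fieldnames">>, Name, []] triples first, then sort
%% and deduplicate in one pass with lists:usort/1.
-module(fieldnames_usort).
-export([get_field_names/1]).

get_field_names(Fields) ->
    lists:usort([[<<"$fieldnames">>, Name, []] || {Name, _Type, _Value} <- Fields]).
```

The design difference is that this variant builds the whole intermediate list, duplicates included, before deduplicating, whereas the gb_sets fold discards duplicates incrementally as it goes.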
Testing recommendations
There are no usability changes for the user as this is a performance enhancement. A full regression should be sufficient. It requires https://github.com/cloudant-labs/dreyfus to be installed.
JIRA issue number
COUCHDB-3358
Related Pull Requests
Checklist
with the correct commit hash once this PR gets merged.