Inconsistent execution statistics for Mango queries #4560

pgj · 2023-05-01T23:22:11Z

When requested, the Mango execution statistics do not always reflect the actual values: most of the time they are lower than expected or simply zero. This happens on the latest version of CouchDB (82aa1625 at the time of writing), and the issue can be reproduced as follows. The commands below create a simple database called test with 25 documents that each have a single field, a, and define an index on that field.

curl -sS -X PUT "$COUCHDB_URL"/test
for i in $(jot - 1 25); do \
  curl -sS -X POST "$COUCHDB_URL"/test -H "Content-Type: application/json" -d '{"a": '"$i"'}'; done
curl -sS -X POST -H "Content-Type: application/json" "$COUCHDB_URL"/test/_index \
  -d '{"index": {"fields": ["a"]}, "name": "a", "type": "json"}'

Then, using a selector like {"a": {"$lt": 20}} with a limit of 1 and execution statistics enabled, total_docs_examined is reported as zero. The other _examined fields are either not implemented or not affected by queries like this. Note that results_returned is still counted correctly.

$ curl -sS -X POST -H "Content-Type: application/json" "$COUCHDB_URL"/test/_find \
  -d '{"execution_stats": true, "limit": 1, "selector": {"a": {"$lt": 20}}}' \
  | jq '.execution_stats'
{
  "total_keys_examined": 0,
  "total_docs_examined": 0,
  "total_quorum_docs_examined": 0,
  "results_returned": 1,
  "execution_time_ms": 1.067
}

Note that if either the limit is increased or the search criteria on the a field are changed to match fewer documents, more realistic data is returned.

$ curl -sS -X POST -H "Content-Type: application/json" "$COUCHDB_URL"/test/_find \
  -d '{"execution_stats": true, "limit": 20, "selector": {"a": {"$lt": 20}}}' \
  | jq '.execution_stats'
{
  "total_keys_examined": 0,
  "total_docs_examined": 20,
  "total_quorum_docs_examined": 0,
  "results_returned": 19,
  "execution_time_ms": 2.378
}

or

$ curl -sS -X POST -H "Content-Type: application/json" "$COUCHDB_URL"/test/_find \
  -d '{"execution_stats": true, "limit": 1, "selector": {"a": {"$lt": 2}}}' \
  | jq '.execution_stats'
{
  "total_keys_examined": 0,
  "total_docs_examined": 2,
  "total_quorum_docs_examined": 0,
  "results_returned": 1,
  "execution_time_ms": 0.845
}

After some debugging, the source of the issue has been identified as the emission of stop in mango_cursor_view:handle_doc/2 when the limit reaches zero (last clause).

couchdb/src/mango/src/mango_cursor_view.erl

Lines 483 to 497 in 82aa162

-spec handle_doc(#cursor{}, doc()) -> Response when
    Response :: {ok, #cursor{}} | {stop, #cursor{}}.
handle_doc(#cursor{skip = S} = C, _) when S > 0 ->
    {ok, C#cursor{skip = S - 1}};
handle_doc(#cursor{limit = L, execution_stats = Stats} = C, Doc) when L > 0 ->
    UserFun = C#cursor.user_fun,
    UserAcc = C#cursor.user_acc,
    {Go, NewAcc} = UserFun({row, Doc}, UserAcc),
    {Go, C#cursor{
        user_acc = NewAcc,
        limit = L - 1,
        execution_stats = mango_execution_stats:incr_results_returned(Stats)
    }};
handle_doc(C, _Doc) ->
    {stop, C}.

The stop action immediately stops the processing of messages from the shards, including the shard-level statistics that are submitted in response to the complete message.

couchdb/src/mango/src/mango_cursor_view.erl

Lines 385 to 390 in 82aa162

view_cb(complete, Acc) ->
    % Send shard-level execution stats
    ok = rexi:stream2({execution_stats, {docs_examined, get(mango_docs_examined)}}),
    % Finish view output
    ok = rexi:stream_last(complete),
    {ok, Acc};

That is why, when the limit is too low, the coordinator stops before it has received and processed the related messages. As a result, the execution statistics are either not handled at all or handled only partially, that is, only for the subset of shards that managed to report in time.
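As a rough illustration of the behaviour described above, here is a toy Python model. It is not CouchDB code: the names shard_stream and coordinate_with_stop are hypothetical, and the round-robin interleaving of shard messages is an assumption made only to keep the run deterministic. Each shard sends its rows first and its statistics only at the end, and the coordinator stops on the first row that arrives after the limit is used up.

```python
from itertools import zip_longest


def shard_stream(docs):
    """Messages one hypothetical shard emits: rows, then stats, then complete."""
    for doc in docs:
        yield ("row", doc)
    # Sent only once the shard has finished, like view_cb(complete, Acc).
    yield ("execution_stats", len(docs))
    yield ("complete", None)


def coordinate_with_stop(shards, limit):
    """Merge shard messages round-robin; stop on the first row past the limit."""
    docs_examined = 0
    results = []
    streams = [shard_stream(docs) for docs in shards]
    for batch in zip_longest(*streams):
        for message in batch:
            if message is None:  # this shard has nothing left to send
                continue
            kind, payload = message
            if kind == "execution_stats":
                docs_examined += payload
            elif kind == "row":
                if limit == 0:
                    # Comparable to the last handle_doc/2 clause: stop now,
                    # so any stats still in flight are never added.
                    return {
                        "total_docs_examined": docs_examined,
                        "results_returned": len(results),
                    }
                results.append(payload)
                limit -= 1
    return {"total_docs_examined": docs_examined, "results_returned": len(results)}
```

In this model, 19 matching documents spread over three shards with limit=1 yield 0 docs examined and 1 result, while limit=20 lets every shard finish and yields 19 and 19. With uneven shards such as [[1], [2, 3, 4]] and limit=2, only the short shard reports in time, so 1 of the 4 examined documents is counted. The exact figures are those of the model, not of a real cluster; only the pattern is the point.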

I have not yet come up with a satisfying solution, but I am creating this ticket to raise awareness of the bug. Changing stop to ok in mango_cursor_view:handle_doc/2 helps with consistency, but then the shards keep sending results that are already known to be discarded. This potentially has an impact on performance, although it has not been measured how much.
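The trade-off can be sketched with a simplified Python analogue of the three handle_doc/2 clauses. This is an assumption-laden model, not the real cursor: the cursor is a plain dict, the user callback is left out, a single flat message list stands in for all shards, and the at_limit parameter and the rows_discarded counter are hypothetical additions used only to compare the two behaviours.

```python
def handle_doc(cursor, doc, at_limit="stop"):
    """Simplified analogue of handle_doc/2; returns (action, new_cursor)."""
    if cursor["skip"] > 0:
        return "ok", {**cursor, "skip": cursor["skip"] - 1}
    if cursor["limit"] > 0:
        return "ok", {
            **cursor,
            "limit": cursor["limit"] - 1,
            "results": cursor["results"] + [doc],
        }
    # Last clause: the limit is exhausted, so this row is thrown away.
    return at_limit, {**cursor, "discarded": cursor["discarded"] + 1}


def run_query(messages, limit, at_limit="stop"):
    """Feed (kind, payload) messages to the cursor until done or stopped."""
    cursor = {"skip": 0, "limit": limit, "results": [], "discarded": 0}
    docs_examined = 0
    for kind, payload in messages:
        if kind == "row":
            action, cursor = handle_doc(cursor, payload, at_limit)
            if action == "stop":
                break
        elif kind == "execution_stats":
            docs_examined += payload
    return {
        "total_docs_examined": docs_examined,
        "results_returned": len(cursor["results"]),
        "rows_discarded": cursor["discarded"],
    }
```

For 19 row messages followed by one statistics message and limit=1, the "stop" variant reports 0 docs examined and discards 1 row, whereas the "ok" variant reports 19 docs examined but has to receive and discard 18 rows to get there.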

In case of map-reduce views, the arrival of the `complete` message is not guaranteed for the view callback (at the shard) when a `stop` is issued during the aggregation (at the coordinator). Due to that, internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss and hence inaccuracy in the overall execution statistics. Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if it is requested through the map-reduce arguments and the given shard supports it. Fixes apache#4560
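The "rolling" idea can be sketched in Python as follows. This is a minimal model under stated assumptions, not the merged implementation: the message layout, the names rolling_shard_messages and coordinate_rolling, and the requested and shard_supports flags are all hypothetical stand-ins for the map-reduce argument and the shard capability check mentioned in the commit message.

```python
def rolling_shard_messages(docs, rolling):
    """Yield (kind, doc, stats) messages from one hypothetical shard."""
    for doc in docs:
        # Rolling mode attaches the statistics to every row right away.
        yield ("row", doc, {"docs_examined": 1} if rolling else None)
    if not rolling:
        # Legacy mode reports everything in one message at the very end.
        yield ("execution_stats", None, {"docs_examined": len(docs)})
    yield ("complete", None, None)


def coordinate_rolling(docs, limit, requested=True, shard_supports=True):
    """Aggregate statistics; rolling is used only if both sides agree."""
    rolling = requested and shard_supports
    docs_examined = 0
    returned = 0
    for kind, _doc, stats in rolling_shard_messages(docs, rolling):
        if stats is not None:
            docs_examined += stats["docs_examined"]
        if kind == "row":
            if limit == 0:
                break  # stopping early no longer loses earlier statistics
            returned += 1
            limit -= 1
    return {"total_docs_examined": docs_examined, "results_returned": returned}
```

With 19 documents and limit=1, the rolling path in this model counts the rows seen before the stop, while a shard that does not support it falls back to the old behaviour and reports 0. When the limit is high enough for the shard to finish, both paths agree.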

pgj added bug needs-triage labels May 1, 2023

pgj mentioned this issue Aug 22, 2023

feat(mango): rolling execution statistics (exploration) #4735

Closed

2 tasks

pgj self-assigned this Aug 24, 2023

pgj added api mango and removed needs-triage labels Aug 24, 2023

pgj mentioned this issue Jan 10, 2024

feat(mango): rolling execution statistics #4958

Merged

2 tasks

pgj closed this as completed in #4958 Mar 27, 2024
