
feat(fabric): switch to maps for view rows #4984

Merged 1 commit into apache:main on Mar 15, 2024

Conversation

@pgj (Contributor) commented Feb 12, 2024

The #view_row{} record that is used for capturing messages with row data is not flexible enough to be extended easily. If one wanted to introduce a fresh field, the change would have to be propagated through many functions and modules. Especially if support for mixed-version clusters is a concern, this would come with some degree of duplication.

Leverage Erlang/OTP's built-in maps to mitigate this issue and offer the view callbacks the view_row_map Boolean key in #mrargs.extra to request this communication format. This way the old record-based format would still be in use unless requested otherwise, which facilitates smooth interoperability between old coordinators and new workers. In parallel, a new coordinator can still receive view rows from old workers.
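
A minimal sketch of the idea, assuming simplified record fields and a hypothetical helper (this is not the PR's actual code): a worker could build a row in either format depending on whether the coordinator requested the map-based one.

-module(view_row_sketch).
-export([build_row/4]).

%% Simplified stand-in for the real #view_row{} record.
-record(view_row, {id, key, value}).

%% Options would ultimately come from #mrargs.extra on the coordinator side.
build_row(Id, Key, Value, Options) ->
    case lists:keyfind(view_row_map, 1, Options) of
        {view_row_map, true} ->
            %% New, easily extensible map-based format.
            #{id => Id, key => Key, value => Value};
        _ ->
            %% Old record-based format, kept for mixed-version clusters.
            #view_row{id = Id, key = Key, value = Value}
    end.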

Testing recommendations

The unit and integration tests should cover all the changes. There are no user-facing changes; everything is internal to the communication between the coordinator and the workers over fabric. Backwards compatibility is maintained.

make eunit apps=fabric,couch_replicator,mango
make elixir
make elixir-search
make mango-test

Unit test coverage improvements:

  • couch_replicator: 85% ⟶ 86%
  • fabric: 72% ⟶ 79%

Related Issues or Pull Requests

This is to set the stage for #4958.

Checklist

  • Code is written and works correctly
  • Changes are covered by tests

@nickva (Contributor) left a comment

That looked a lot smaller than expected! Well done. Just a few minor suggestions and cleanup as review comments.

I noticed you opted not to use the {view_row, #{}} shape, like in the original PR, and just went straight for a map as a row. That's probably simpler. I guess I could see it work either way. If we opt to send execution stats with a complete call, that could look like {complete, #{stats => Stats}} or something of that nature. Let's keep it as a simple map for now.

Resolved review comments on: src/fabric/src/fabric_rpc.erl, src/couch_replicator/src/couch_replicator_fabric_rpc.erl, src/fabric/src/fabric_view.erl (×2)
{row, [{id, Id}, {key, Key}, {value, Value}, {doc, Doc}]}.
{row, [{id, Id}, {key, Key}, {value, Value}, {doc, Doc}]};
transform_row(#{} = Row) ->
transform_row(fabric_util:to_view_row_record(Row)).
Contributor:

This makes the PR smaller, but eventually when we remove the view_row it will be a bit more work, which is fine.

Contributor Author:

Sure. The rolling execution stats will already need to expand this body further to inject the stats field.

Resolved review comments on: src/fabric/src/fabric_view_map.erl (×3), src/fabric/src/fabric_view_reduce.erl, src/fabric/src/fabric_util.erl
@rnewson (Member) commented Feb 16, 2024

> That looked a lot smaller than expected! Well done. Just a few minor suggestions and cleanup as review comments.
>
> I noticed you opted not to use the {view_row, #{}} shape, like in the original PR, and just went straight for a map as a row. That's probably simpler. I guess I could see it work either way. If we opt to send execution stats with a complete call, that could look like {complete, #{stats => Stats}} or something of that nature. Let's keep it as a simple map for now.

I'd prefer the {view_row, #{}} approach myself; it will help us clearly lay out clauses in receive code in the future, unless we also add some guaranteed key/value in the bare map?

@nickva (Contributor) commented Feb 16, 2024

> I'd prefer the {view_row, #{}} approach myself; it will help us clearly lay out clauses in receive code in the future, unless we also add some guaranteed key/value in the bare map?

Thinking about it some more, I agree as well. It will make the receive clauses a bit clearer.
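
For illustration, a rough sketch of why the tag helps (the module, function, and return values are made up for this example; these are not the PR's actual handlers):

-module(row_handler_sketch).
-export([handle_message/3]).

%% With the {view_row, #{}} tag, a bare map used elsewhere in the protocol
%% can never be mistaken for a view row, and the clauses stay easy to scan.
handle_message({view_row, #{} = Row}, Worker, State) ->
    {merge_row, Row, Worker, State};
handle_message({execution_stats, Stats}, _Worker, State) ->
    {account_stats, Stats, State};
handle_message(complete, Worker, State) ->
    {complete, Worker, State}.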

Another suggestion to investigate: would it make sense to make a separate fabric_row accessor module? Row get/set and record/proplist/map transforms could be handled there, maybe even some merge/compare functions.

@pgj (Contributor Author) commented Feb 19, 2024

I have no strong feelings against using the view_row tag for the maps. I removed it because I felt it made the code simpler, but I would not say it is future-proof that way. At the same time, I could explore the fabric_row accessor module; it seems like a good vehicle for encapsulation.
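
A rough sketch of what such an accessor module could look like (field names and clauses are simplified assumptions; the accessor module that appears later in the PR is fabric_view_row):

-module(view_row_access_sketch).
-export([get_id/1, get_key/1, set_key/2]).

-record(view_row, {id, key, value}).

%% Each accessor handles both the legacy record and the new map shape,
%% so callers never need to know which format is in use.
get_id(#view_row{id = Id}) -> Id;
get_id(#{id := Id}) -> Id.

get_key(#view_row{key = Key}) -> Key;
get_key(#{key := Key}) -> Key.

set_key(#view_row{} = Row, Key) -> Row#view_row{key = Key};
set_key(#{} = Row, Key) -> Row#{key => Key}.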

@pgj (Contributor Author) commented Feb 20, 2024

For the record, the current version works fine in a mixed-version environment, in both old coordinator with new worker and new coordinator with old worker configurations. This was tested with a database of 100K documents, 3 nodes, with 1 coordinator and 2 workers of the respective versions, for 7 minutes on 2 threads. I used _find queries; I am not sure how to exercise the replicator, for example.

@pgj pgj force-pushed the feat/fabric/view_row_map branch 2 times, most recently from 82cbd9e to 6d33573 Compare March 1, 2024 10:58
@pgj pgj marked this pull request as ready for review March 1, 2024 11:41
@nickva (Contributor) left a comment

Looks very nice! Lots of tests, and the row wrapper module turned out decently, I thought.

I made a few notes inline: a few style nits and a few notes about the merge function.

Since we're eventually trying to report the work done by a worker up until the row it emitted, would it make sense to have the complete messages return the extra metadata/stats object as well, if the view_row_map flag is set? Then we'd have both a {view_row, #{}} and a {complete, #{}}, and both can report all the work or stats done on that worker node in one response.

Resolved review comment on: src/fabric/include/fabric.hrl
detach_partition(Row) ->
case fabric_view_row:get_key(Row) of
{p, _Partition, Key} -> fabric_view_row:set_key(Row, Key);
_PrimitiveKey -> Row
Contributor:

Primitive is a bit of an odd name; to avoid introducing new terminology, maybe just use _?

Contributor Author:

Sure. Or we could just use _Key.

Comment on lines 111 to 123
merge(Dir, Row, Rows) ->
lists:merge(
fun(RowA, RowB) ->
IdA = get_id(RowA),
IdB = get_id(RowB),
case Dir of
fwd -> IdA =< IdB;
rev -> IdA > IdB
end
end,
[Row],
Rows
).
Contributor:

I think this is specifically merging rows by ID. We also merge view rows by key and ID, with and without a keydict, and with or without a unicode collator. I wonder if it would make sense to move all merging to fabric_view_row then, or move this back to the all_docs handler, mainly so it looks consistent.

@pgj (Contributor Author), Mar 8, 2024:

Yes, that is a good idea. In the light of this, I think it makes sense to move this version of merge rows over to fabric_view_row as an alternative to merge, as merge/5.

Contributor Author:

Ah, scratch that. Although couch_replicator_fabric and fabric_view use exactly the same merge function, it looks odd to include this in fabric_view_row, as it does not do anything special related to view rows. The merge functions already use fabric_view_row to access the required properties, so they could be considered abstract.

Comment on lines 117 to 118
fwd -> IdA =< IdB;
rev -> IdA > IdB
@nickva (Contributor), Mar 5, 2024:

Hmm, I wonder if it should be A<B and A>B?

The Erlang lists module expects the comparison operator to basically be =< or >=. However, our btrees use less (strictly <). I don't know if that's an issue here, since we don't expect two equal document IDs to come through once we start streaming for the all_docs case. But it might be good to stick to < and > just to emphasize that our sorting function on the btrees is a strictly-less-than function and we want to make sure we use the same sorting function in the workers as we do in the coordinator.

Contributor Author:

I have neither a deeper understanding of this part nor strong preferences, so I am fine with the suggestion.

@pgj pgj force-pushed the feat/fabric/view_row_map branch from 6d33573 to 6b97660 Compare March 8, 2024 23:09
@pgj (Contributor Author) commented Mar 8, 2024

> Since we're eventually trying to report the work done by a worker up until the row it emitted, would it make sense to have the complete messages return the extra metadata/stats object as well, if the view_row_map flag is set?

I am not sure about that.

Originally, the problem this PR wants to solve is the case when streaming of view rows is "abruptly" terminated in response to the stop in the callback (as a short circuit). The proposed {complete, #{}} reply would be sent when the callback receives complete. But this means that everything the worker wanted to stream has reached the destination by that time, assuming that no messages were lost in transit.

On the other hand, the change focuses on making the structure of the view rows more flexible, and it does not know anything about how this is utilized. It is not necessarily the statistics that could be sent along the streamed rows but anything that the view callback may consider meaningful.

If statistics are sent along with the standard view row data, I think there is no need to return a closing message with metadata. Statistical information could be sent as either deltas or absolute values; this is subject to the implementation of the given view. Both are sufficient to keep track of statistics about the view rows. I have tested this in the Mango rolling execution statistics PR (one of the future consumers of this change) and it worked without the need to extend the complete messages.

That said, I do not see any explicit need for this. But it could be added later on, if any related use case surfaces.

@nickva (Contributor) commented Mar 9, 2024

> Originally, the problem this PR wants to solve is the case when streaming of view rows is "abruptly" terminated in response to the stop in the callback (as a short circuit). The proposed {complete, #{}} reply would be sent when the callback receives complete. But this means that everything the worker wanted to stream has reached the destination by that time, assuming that no messages were lost in transit.

The stop callback would normally be called when the limit number of rows has been emitted by the coordinator to the client. With the new code we'd get the extra stats map to account for the work the worker did prior to emitting that row. However, the streaming may also end if we reach the end of the btree. Then, instead of a row, we'd emit a complete message. The idea is to account for the work done since we emitted the last row, until the complete message. Think of a selective filter that picks maybe every 200th row, with a total of 500 rows: if we emit row 400, we'd keep scanning until we reach 500 and emit a complete message, but then we wouldn't account for the work done while scanning rows 400->500 if we never emitted any other view row during that time.

Now we emit a separate execution stats message and a callback message:

% Send shard-level execution stats
ok = rexi:stream2({execution_stats, Stats}),
% Finish view output
ok = rexi:stream_last(complete),
This is fine, but since we're making these messages flexible and able to carry any metadata, it might make sense to let the complete message also carry the extra metadata. That way we wouldn't have to send a separate execution_stats message at all; complete and the view rows' extra field would both have a place to send back stats.
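
As a rough sketch of that idea (the helper name and flag argument are assumptions, not code from this PR), the worker-side finish could fold the stats into the final message when the map format is in use and fall back to the separate execution_stats message otherwise:

%% Hypothetical helper; the first argument reflects the view_row_map option.
finish_stream(true = _ViewRowMap, Stats) ->
    ok = rexi:stream_last({complete, #{stats => Stats}});
finish_stream(false = _ViewRowMap, Stats) ->
    ok = rexi:stream2({execution_stats, Stats}),
    ok = rexi:stream_last(complete).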

However, updating all the message handlers for complete to check the option and also handle {complete, #{}} might be kind of cumbersome, since there are quite a few of them. However, we already have a place where we need it, and it already returns an extra Props object:

handle_message({complete, Props}, Worker, #collector{limit = 0} = State) ->
O0 = State#collector.offset,
O1 =
case fabric_dict:lookup_element(Worker, O0) of

@pgj (Contributor Author) commented Mar 9, 2024

> the streaming may also end if we reach the end of the btree. Then, instead of a row, we'd emit a complete message.

I would not call those cases "abruptly terminated" because there is a chance to gracefully send out all the required messages. The easy extension of view rows is a future-proof alternative to creating another view_row record structure with the missing field (which is the execution stats in the Mango PR) each time. This level of atomicity is required to protect the worker from failing to communicate its metadata before the stream is closed. When no view row was streamed, the metadata reaches the coordinator; tight coupling is not a concern.

Yes, this leaves us with separate execution_stats messages, but that is just a refactor; it does not add much to either correctness or performance. Ideally, I would keep this separation because it is more compositional, even if it is deemed less performant. Stitching a map onto the complete message does not do much more, in my understanding.

> However, updating all the message handlers for complete to check the option and also handle {complete, #{}} might be kind of cumbersome, since there are quite a few of them.

That is one of the reasons why I am rather hesitant to take this advice.

> However, we already have a place where we need it, and it already returns an extra Props object:

handle_message({complete, Props}, Worker, #collector{limit = 0} = State) ->
O0 = State#collector.offset,
O1 =
case fabric_dict:lookup_element(Worker, O0) of

I would say this is out of scope for this PR. I assume this part of the code does not have any consistency or performance issues; introducing extended complete messages would only simplify the code. Which is nice, but I would leave it for another PR.

@nickva (Contributor) commented Mar 9, 2024

> I would not call those cases "abruptly terminated" because there is a chance to gracefully send out all the required messages.

Yeah, definitely. In fact, I am not sure if "abruptly terminated" exactly applies here in general. That part works like that by design. In the majority of cases, the complete message won't be sent unless scanning is towards the end of the shard for all_docs or it's a short index key range. So the "abrupt" case might be more common in non-test-sized databases.

> Yes, this leaves us with separate execution_stats messages, but that is just a refactor; it does not add much to either correctness or performance.

Exactly. That would be more of a nice-to-have to make things more consistent. And since when we do that refactor later we'd need to introduce another extras option, I thought we could piggyback on this PR, but that does look like a large change, so let's not bother with it currently.

Also, in theory I think it could affect correctness. We are handling and processing execution stats messages out of band, without going through the merge row semantics. Notice that we call the Callback right away:

{Go, Acc} = Callback(Msg, AccIn),
unlike a view row, which goes through the merge_row + maybe_send_row mechanism
Rows = merge_row(Dir, Row#view_row{worker = {Worker, From}}, Rows0),
Counters1 = fabric_dict:update_counter(Worker, 1, Counters0),
State1 = State#collector{rows = Rows, counters = Counters1},
fabric_view:maybe_send_row(State1);
so we may be accounting the stats for row or complete messages which never participated in the generation of the response row. [1] The chances there are small, so it's probably not worth bothering about.

> That is one of the reasons why I am rather hesitant to take this advice.

Yup, agree, let's not worry about it in this PR.

[1] This is according to our new semantics as discussed in the previous PR: we want to account exactly for the work which happened on each worker prior to its emitted row, if that row was included as part of the response, as opposed to having a probabilistic kind of thing where we account for the approximate work, as I suggested initially.

@pgj (Contributor Author) commented Mar 11, 2024

> In fact, I am not sure if "abruptly terminated" exactly applies here in general. That part works like that by design. In the majority of cases, the complete message won't be sent unless scanning is towards the end of the shard for all_docs or it's a short index key range. So the "abrupt" case might be more common in non-test-sized databases.

Probably "abrupt" is a strong word here. That is just my false expectation that complete should always be explicit 😛 Either way it goes, it complicates statistics calculation, as illustrated by the motivation for the rolling execution statistics.

> That would be more of a nice-to-have to make things more consistent. And since when we do that refactor later we'd need to introduce another extras option, I thought we could piggyback on this PR, but that does look like a large change, so let's not bother with it currently.

Yeah, I have noticed the piggybacking. I am a fan of consistent code changes, but this is slightly more than that to me. As I mentioned, I consider this "view row map" operation mode oblivious to sending any metadata; we could even say that it is only a coincidence that the next field we may want to add is about that. It can be combined with "extensible complete", which may have its own scope of application. I see these as two orthogonal building blocks for hosting other features. (Compositionality.) Tackling this problem in a separate PR can help with limiting the scope of this one, mitigating the associated (deployment) risks, and shipping the dependent feature quicker.

> in theory I think it could affect correctness. We are handling and processing execution stats messages out of band, without going through the merge row semantics. [..] we may be accounting the stats for row or complete messages which never participated in the generation of the response row.

Mind that we are switching to the discussion of the Mango rolling execution statistics here, which is only one possible application. Note that the execution_stats message is only used by Mango, but for some reason, it had to be driven through (woven into, pun intended) Fabric. I do not think we are accounting for the complete messages themselves there. This type of message is used to trigger the worker to submit its (shard-level) statistics to the coordinator.

In the Mango execution statistics, we currently have "keys examined", "docs examined", and "results returned". Views are created from key ranges of (the selected) indexes. Walking the whole view, independently of whether the corresponding row matches the selector (filter), should always implicitly bump the count of keys examined. (This is the broadest.) In contrast, docs examined are counted only if a document was accessed. (This might be narrower.) And then, if the document matches the selector, it is counted as a result returned. (The narrowest set.)

The associated calculations are always performed in response to the view rows. The execution_stats message is responsible for coordinating the sum ("reduce") of the shard-level (worker-local) statistics only.
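
To make the broadest-to-narrowest relationship concrete, here is an illustrative sketch of worker-local accumulation and the coordinator-side sum (the counter names follow the description above; the module and function names, and everything else, are assumptions rather than Mango's actual code):

-module(mango_stats_sketch).
-export([bump/3, combine/1]).

%% Worker-local: bump the counters for one view row. Doc is 'undefined'
%% when the document was not accessed; Matched says whether the document
%% satisfied the selector.
bump(Doc, Matched, Stats) ->
    Inc = fun(K, S) -> maps:update_with(K, fun(N) -> N + 1 end, 1, S) end,
    Stats1 = Inc(keys_examined, Stats),
    Stats2 = case Doc of undefined -> Stats1; _ -> Inc(docs_examined, Stats1) end,
    case Matched of true -> Inc(results_returned, Stats2); false -> Stats2 end.

%% Coordinator-side: the "reduce" over shard-level maps is plain per-key
%% addition, so the order in which shards report does not matter.
combine(ShardStatsList) ->
    lists:foldl(
        fun(Shard, Acc) ->
            maps:fold(
                fun(K, V, A) -> maps:update_with(K, fun(N) -> N + V end, V, A) end,
                Acc,
                Shard
            )
        end,
        #{},
        ShardStatsList
    ).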

@nickva (Contributor) commented Mar 12, 2024

> Mind that we are switching to the discussion of the Mango rolling execution statistics here, which is only one possible application. Note that the execution_stats message is only used by Mango, but for some reason, it had to be driven through (woven into, pun intended) Fabric.

It is related to correctness; that's why I mentioned the other PR. I was just pointing out that, based on the discussion in the previous PR, the new semantics are that on the coordinator we want to account only for rows which participate in emitting the response to the user. With that view, having separate messages vs. one message can lead to unexpected behavior, because it may be possible for another worker to insert its message in between the two messages when streaming is about to conclude. In general, there could be a good amount of difference between the work done prior to emitting each row sent to the client and all the work done by each worker.

> but for some reason, it had to be driven through (woven into, pun intended) Fabric.

This is a side effect of Mango using the fabric all_docs and view query as a layer. That extra handling of execution stats in all_docs and view query was another impetus for having generic completion metadata. Though that would also require a bit more work to make sure it passes through the merge sort mechanism.

@pgj (Contributor Author) commented Mar 12, 2024

> It is related to correctness; that's why I mentioned the other PR.

I am sure you are right about that. But I think we should carry on with the related discussion in #4958 then. Regarding this PR, could you please make some other case that is not the Mango rolling execution statistics? The "view row map capability" only makes it possible to get rid of the rigidity of the record-based communication. We could use it to send non-metadata or anything else; I just do not have any other case in mind. But in theory, that is entirely possible!

> it may be possible for another worker to insert its message in between the two messages when streaming is about to conclude. In general, there could be a good amount of difference between the work done prior to emitting each row sent to the client and all the work done by each worker.

Right. But is this a real problem here? The folding of Mango execution statistics hinges upon addition, which is a commutative and associative operation, which even lends itself to parallelization. The collection of the pieces happens locally, without any interaction between the workers. Probably I am missing something here.

@nickva (Contributor) commented Mar 12, 2024

> Right. But is this a real problem here? The folding of Mango execution statistics hinges upon addition, which is a commutative and associative operation, which even lends itself to parallelization. The collection of the pieces happens locally, without any interaction between the workers. Probably I am missing something here.

It is if we want to make any correctness guarantees. It doesn't have as much to do with the commutativity and associativity of the operations as with which statistics should contribute to the result. We agreed to two things, as far as I recall:

  1. We liked Bob's idea of counting only contributing workers.
  2. Based on your comment, not allowing uncertainty:

> @nickva I am afraid we cannot allow any uncertainty in the statistics if downstream consumers wanted to build use cases on top of this mechanism that involve billing.

And I am pointing out that having separate, interleaved stats messages based on random arrival order from the workers seems to conflict with the correctness goal. Towards the conclusion of a request, we could end up accounting, out of band so to speak, for stats messages from workers which did work that didn't end up contributing to the final response. However, as we discussed, the chance of this is probably low and maybe not worth worrying about. I am merely clarifying that it does impact correctness.

@pgj (Contributor Author) commented Mar 12, 2024

> But is this a real problem here?
>
> It is if we want to make any correctness guarantees. It doesn't have as much to do with the commutativity and associativity of the operations as with which statistics should contribute to the result.

Ah, now I got it! Thanks for your patience in explaining this to me. I was sure that I was missing something, and I was confused that the comments about #4958 got included in this discussion. Please let me think about those in the scope of #4958 and leave them out of this PR.

@pgj pgj force-pushed the feat/fabric/view_row_map branch from 6b97660 to 1046c4b Compare March 14, 2024 17:40
@nickva (Contributor) left a comment

+1, nicely done @pgj!

I like all the tests and I think the accessor module turned out pretty neat.

@pgj pgj force-pushed the feat/fabric/view_row_map branch from 1046c4b to eae19bd Compare March 14, 2024 23:07
The `#view_row{}` record that is used for capturing messages with
row data is not flexible enough to be extended easily.  If one
wanted to introduce a fresh field, the change would have to be
propagated through many functions and modules.  Especially if
support for mixed-version clusters is a concern, this would come
with some degree of duplication.

Leverage Erlang/OTP's built-in maps to mitigate this issue and
offer the view callbacks the `view_row_map` Boolean key in
`#mrargs.extra` to request this communication format.  This way
the old record-based format would still be in use unless requested
otherwise.  This facilitates the smooth interoperability of old
coordinators and new workers.  In parallel, the new coordinator
could still receive view rows from old workers.
@pgj pgj force-pushed the feat/fabric/view_row_map branch from eae19bd to 0fd8c1b Compare March 14, 2024 23:39
@pgj pgj merged commit b964a84 into apache:main Mar 15, 2024
14 checks passed
@pgj pgj deleted the feat/fabric/view_row_map branch March 15, 2024 00:16
@nickva (Contributor) commented Mar 15, 2024

I noticed this CI failure in a full CI run. It seems to be in the test code in the replicator:

[2024-03-15T00:36:32.126Z]   couch_replicator_scheduler_docs_tests:70: scheduler_docs_test_prefixed_db_test_ (t_scheduler_docs_total_rows)...test/eunit/couch_replicator_scheduler_docs_tests.erl:137:<0.27855.0>: Error = <<"unknown_error">>
[2024-03-15T00:36:32.126Z] test/eunit/couch_replicator_scheduler_docs_tests.erl:138:<0.27855.0>: binary_to_list ( Reason ) = "function_clause"
[2024-03-15T00:36:32.126Z] *failed*
[2024-03-15T00:36:32.126Z] in function erlang:map_get/2
[2024-03-15T00:36:32.126Z]   called as map_get(<<"docs">>,"function_clause")
[2024-03-15T00:36:32.126Z] in call from couch_replicator_scheduler_docs_tests:t_scheduler_docs_total_rows/1 (test/eunit/couch_replicator_scheduler_docs_tests.erl, line 147)
[2024-03-15T00:36:32.126Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2024-03-15T00:36:32.126Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 531)
[2024-03-15T00:36:32.126Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 356)
[2024-03-15T00:36:32.126Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 514)
[2024-03-15T00:36:32.126Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 456)
[2024-03-15T00:36:32.126Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 346)
[2024-03-15T00:36:32.126Z] **error:{badmap,"function_clause"}
[2024-03-15T00:36:32.126Z]   output:<<"">>

Comment on lines +137 to +138
?debugVal(Error, 100),
?debugVal(binary_to_list(Reason), 100);
Contributor:

Did you intend to leave this debugging here?

Contributor Author:

Yes. Sorry, I did not put a comment there about it, but I mentioned it to @nickva so he was aware of the intention. The reason is that, basically, I had trouble finding out the reason for the failure. Initially I thought the timeout should be increased, but, with the help of these extensions, it became clear that something else needed to be done. So I left the debugging calls there to facilitate similar investigations in the future. Note that this is only visible in case of failures.

@pgj (Contributor Author) commented Mar 15, 2024

> I noticed this CI failure in a full CI run. It seems to be in the test code in the replicator

Oh, the good old CentOS 8. I will check on it later.

@nickva (Contributor) commented Mar 15, 2024

It might be just exposing a flaky test. I'll re-run the CI and we'll see if we get the same error on the same os/arch combination.

The difference from the previous case was that we used to wait longer if we didn't get the expected docs, so it could hit the error and then take longer to possibly fix itself during the 10-15 second wait. Currently, however, we exit on the first error, but then the rest of the code expects the docs to be there.

@pgj (Contributor Author) commented Mar 15, 2024

I was trying to reproduce this locally (with the CouchDB CI CentOS 8 container) but I could not, hence it seems to be a flaky case. But I think there is an error in the code that tries to provide debug information: it is not prepared for one (or more) of the possible outcomes and breaks when that happens.
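
For illustration, a guard along these lines would avoid the {badmap, _} crash from the log above when the body turns out to be an error string rather than a map (the function name and the failure handling are assumptions, not the test's actual code):

%% Hypothetical helper for the test: only index into the body when it
%% really is a map; otherwise dump whatever arrived and fail readably.
extract_docs(#{<<"docs">> := Docs}) ->
    Docs;
extract_docs(Other) ->
    ?debugVal(Other, 100),
    erlang:error({unexpected_response_body, Other}).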

@nickva (Contributor) commented Mar 15, 2024

Indeed it seems to be a flaky test. After re-running the CI I got the same failure on ubuntu 22.04 https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/main/1042/pipeline/233

I wonder if it's worth reverting the behavior to let it keep retrying until it gets an actual doc, or did it always time out and crash when testing the PR even with the current timeout?

@pgj (Contributor Author) commented Mar 15, 2024

I am now trying to add more debugging around this unknown_error to see if it can reveal a scenario where the function actually crashes. Then we could ideally fix the root cause.

@nickva (Contributor) commented Mar 15, 2024

Based on its shape, there is a chance it's coming from a remote worker node, so it's an error that was serialized and sent back as a string.

@pgj (Contributor Author) commented Mar 15, 2024

Opened #5006 for investigation.
