Prevent delayed opener error from crashing index servers #4811

nickva · 2023-10-17T19:00:56Z

Previously, an index and opener process dying could have resulted in the index gen_server crashing. This was observed in the CI test as described in: #4809

The process in more detail was as follows:

When an async opener result is handled in the index server, there is a period of time when the index server is linked to both the index and the opener process.
After we reply early to the waiting clients, a client may do something to cause the indexer to crash, which would crash the opener along with it. That would generate two {'EXIT', Pid, _} messages queued in the indexer process' mailbox.
The index gen_server, is still processing the async opener result callback, where it would remove the opener from the openers table, then it returns ok to the async opener.
Index gen_server continues processing queued EXIT messages in handle_info:
- The one for the indexer Pid is handled appropriately
- The one for the opener leads to an exit(...) clause since we ended with an unknown process exiting.

To avoid the race condition, and the extra opener EXIT message, unlink and reply early to the opener, as soon we linked to the indexer or had received the error. To avoid the small chance of still getting an EXIT message from the opener, in case it crashed right before we unlinked, flush any exit messages from it. We do a similar flushing in two other places so create small utility function to avoid duplicating the code too much.

Fix: #4809

Previously, an index and opener process dying could have resulted in the index gen_server crashing. This was observed in the CI test as described in: #4809 The process in more detail was as follows: * When an async opener result is handled in the index server, there is a period of time when the index server is linked to both the index and the opener process. * After we reply early to the waiting clients, a client may do something to cause the indexer to crash, which would crash the opener along with it. That would generate two `{'EXIT', Pid, _}` messages queued in the indexer process' mailbox. * The index gen_server, is still processing the async opener result callback, where it would remove the openener from the `openers` table, then it returns `ok` to the async opener. * Index gen_server continues processing queued `EXIT` messages in `handle_info`: - The one for the indexer Pid is handled appropriately - The one for the opener leads to an eexit(...)` clause since we ended with an unknown process exiting. To avoid the race condition, and the extra opener `EXIT` message, unlink and reply early to the opener, as soon we linked to the indexer or had received the error. To avoid the small chance of still getting an `EXIT` message from the opener, in case it crashed right before we unlinked, flush any exit messages from it. We do a similar flushing in two other places so create small utility function to avoid duplicating the code too much. Fix: #4809

big-r81

+1

jaydoane · 2023-10-17T22:25:24Z

src/couch_index/src/couch_index_server.erl

+ [gen_server:reply(From, {ok, Pid}) || From <- Waiters],
+ % Flush opener exit messages in case it died before we unlinked it
+ ok = flush_exit_messages_from(OpenerPid),
+ {noreply, State};


I noticed that both here and in below you changed from {reply, ok, State} to {noreply, State}. Can you explain why?

Good question, @jaydoane. That's because we already replied early to the opener with gen_server:reply(FromOpener, ok). The "opener" call into the index_server then gets to end early and it doesn't have to wait for us to finish the rest of the call.

These two calls can now end a bit quicker and we should be able to garbage collect them sooner.

new_index({Mod, IdxState, DbName, Sig}) -> DDocId = Mod:get(idx_name, IdxState), case couch_index:start_link({Mod, IdxState}) of {ok, Pid} -> ok = gen_server:call( server_name(DbName), {async_open, {DbName, DDocId, Sig}, {ok, Pid}} ), unlink(Pid); Error -> ok = gen_server:call( server_name(DbName), {async_error, {DbName, DDocId, Sig}, Error} ) end.

The most importantly, however is we get the opener out of the way once we already linked to the main indexer process, or received the error.

Ah, nice, I missed those early replies.

big-r81 approved these changes Oct 17, 2023

View reviewed changes

jaydoane reviewed Oct 17, 2023

View reviewed changes

jaydoane approved these changes Oct 17, 2023

View reviewed changes

nickva merged commit 2310555 into main Oct 17, 2023
14 checks passed

nickva deleted the fix-index-server-premature-return branch October 17, 2023 22:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent delayed opener error from crashing index servers #4811

Prevent delayed opener error from crashing index servers #4811

nickva commented Oct 17, 2023 •

edited

Loading

big-r81 left a comment

jaydoane Oct 17, 2023

nickva Oct 17, 2023

jaydoane Oct 17, 2023

Prevent delayed opener error from crashing index servers #4811

Prevent delayed opener error from crashing index servers #4811

Conversation

nickva commented Oct 17, 2023 • edited Loading

big-r81 left a comment

Choose a reason for hiding this comment

jaydoane Oct 17, 2023

Choose a reason for hiding this comment

nickva Oct 17, 2023

Choose a reason for hiding this comment

jaydoane Oct 17, 2023

Choose a reason for hiding this comment

nickva commented Oct 17, 2023 •

edited

Loading