Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent delayed opener error from crashing index servers #4811

Merged
merged 1 commit into from
Oct 17, 2023

Conversation

nickva
Copy link
Contributor

@nickva nickva commented Oct 17, 2023

Previously, an index and opener process dying could have resulted in the index gen_server crashing. This was observed in the CI test as described in: #4809

The process in more detail was as follows:

  • When an async opener result is handled in the index server, there is a period of time when the index server is linked to both the index and the opener process.

  • After we reply early to the waiting clients, a client may do something to cause the indexer to crash, which would crash the opener along with it. That would generate two {'EXIT', Pid, _} messages queued in the indexer process' mailbox.

  • The index gen_server, is still processing the async opener result callback, where it would remove the opener from the openers table, then it returns ok to the async opener.

  • Index gen_server continues processing queued EXIT messages in handle_info:

    • The one for the indexer Pid is handled appropriately
    • The one for the opener leads to an exit(...) clause since we ended with an unknown process exiting.

To avoid the race condition, and the extra opener EXIT message, unlink and reply early to the opener, as soon we linked to the indexer or had received the error. To avoid the small chance of still getting an EXIT message from the opener, in case it crashed right before we unlinked, flush any exit messages from it. We do a similar flushing in two other places so create small utility function to avoid duplicating the code too much.

Fix: #4809

Previously, an index and opener process dying could have resulted in the index
gen_server crashing. This was observed in the CI test as described in:
#4809

The process in more detail was as follows:

  * When an async opener result is handled in the index server, there is a
    period of time when the index server is linked to both the index and the
    opener process.

  * After we reply early to the waiting clients, a client may do something to
    cause the indexer to crash, which would crash the opener along with it.
    That would generate two `{'EXIT', Pid, _}` messages queued in the indexer
    process' mailbox.

  * The index gen_server, is still processing the async opener result callback,
    where it would remove the openener from the `openers` table, then it
    returns `ok` to the async opener.

  * Index gen_server continues processing queued `EXIT` messages in `handle_info`:
     - The one for the indexer Pid is handled appropriately
     - The one for the opener leads to an eexit(...)` clause since we ended
       with an unknown process exiting.

To avoid the race condition, and the extra opener `EXIT` message, unlink and
reply early to the opener, as soon we linked to the indexer or had received the
error. To avoid the small chance of still getting an `EXIT` message from the
opener, in case it crashed right before we unlinked, flush any exit messages
from it. We do a similar flushing in two other places so create small utility
function to avoid duplicating the code too much.

Fix: #4809
Copy link
Contributor

@big-r81 big-r81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

[gen_server:reply(From, {ok, Pid}) || From <- Waiters],
% Flush opener exit messages in case it died before we unlinked it
ok = flush_exit_messages_from(OpenerPid),
{noreply, State};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that both here and in below you changed from {reply, ok, State} to {noreply, State}. Can you explain why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, @jaydoane. That's because we already replied early to the opener with gen_server:reply(FromOpener, ok). The "opener" call into the index_server then gets to end early and it doesn't have to wait for us to finish the rest of the call.

These two calls can now end a bit quicker and we should be able to garbage collect them sooner.

new_index({Mod, IdxState, DbName, Sig}) ->
    DDocId = Mod:get(idx_name, IdxState),
    case couch_index:start_link({Mod, IdxState}) of
        {ok, Pid} ->
            ok = gen_server:call(
                server_name(DbName), {async_open, {DbName, DDocId, Sig}, {ok, Pid}}
            ),
            unlink(Pid);
        Error ->
            ok = gen_server:call(
                server_name(DbName), {async_error, {DbName, DDocId, Sig}, Error}
            )
    end.

The most importantly, however is we get the opener out of the way once we already linked to the main indexer process, or received the error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice, I missed those early replies.

@nickva nickva merged commit 2310555 into main Oct 17, 2023
14 checks passed
@nickva nickva deleted the fix-index-server-premature-return branch October 17, 2023 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_index call crashes couch_index_server instance
3 participants