The known_applied_index is not increasing but applying_index is #277

Closed
kishorenc opened this issue Apr 20, 2021 · 6 comments · Fixed by #278

Comments

@kishorenc
Contributor

We have a single braft node (no clustering) under high write load. nodeStatus.applying_index and nodeStatus.committed_index keep increasing, but nodeStatus.known_applied_index is stuck and seems to increase only after each snapshot. During a snapshot nodeStatus.known_applied_index increases; after that it stays stuck even as the other two indices increase.

Can someone help explain when this can happen? Help needed @PFZheng @Edward-xk @ipconfigme

@PFZheng
Collaborator

PFZheng commented Apr 20, 2021

This seems to be a problem in FSMCaller::do_committed. Each time, this function pops a batch of logs to apply, but the applied index is only updated after the entire batch has been processed.

@PFZheng
Collaborator

PFZheng commented Apr 20, 2021

One solution is to update the applied index in smaller batches.
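
To make the idea concrete, here is a minimal sketch of publishing the applied index after each small chunk rather than after the whole batch. It is not braft's actual FSMCaller code: `ApplierSketch`, `apply_committed`, `chunk_size`, and `apply_one` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the state machine caller; NOT braft's FSMCaller.
struct ApplierSketch {
    // Index that status queries would report as applied.
    std::atomic<int64_t> known_applied_index{0};

    // Applies entries (known_applied_index, committed_index] in chunks of at
    // most chunk_size entries and publishes the applied index after each chunk.
    // apply_one is the (assumed) user callback for a single log index.
    void apply_committed(int64_t committed_index, int64_t chunk_size,
                         const std::function<void(int64_t)>& apply_one) {
        chunk_size = std::max<int64_t>(chunk_size, 1);  // guard against <= 0
        int64_t next = known_applied_index.load(std::memory_order_relaxed) + 1;
        while (next <= committed_index) {
            const int64_t chunk_end =
                std::min(committed_index, next + chunk_size - 1);
            for (int64_t i = next; i <= chunk_end; ++i) {
                apply_one(i);
            }
            // Publish progress once per chunk instead of once per batch.
            known_applied_index.store(chunk_end, std::memory_order_release);
            next = chunk_end + 1;
        }
    }
};
```

With a chunk size of 4 and a committed index of 10, an observer in this model would see the applied index move 0, 4, 8, 10 instead of jumping from 0 to 10. A smaller chunk gives fresher status at the cost of more frequent updates.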

@kishorenc
Contributor Author

kishorenc commented Apr 20, 2021

Thank you for replying. Locally I was able to reproduce this problem with the following sequence:

a) Start a node and write 1,000 records (no snapshotting is done)
b) Stop node and modify the on_apply() function by adding a 5-second sleep
c) Restart node

This time, as the entries from the WAL are replayed, I see that known_applied_index remains at 0 even as applying_index progresses. known_applied_index is updated only when all entries in the WAL have been processed.

> Each time this function pops a batch of logs to apply, but the applied index is only updated after the entire batch processed.

So it seems like braft does not update applied index until the WAL is entirely caught up? This can cause issues when a node is under heavy load after a restart, since the WAL will continue growing and so applied_index will never progress.

@PFZheng
Collaborator

PFZheng commented Apr 20, 2021

> So it seems like braft does not update applied index until the WAL is entirely caught up? This can cause issues when a node is under heavy load after a restart, since the WAL will continue growing and so applied_index will never progress.

It will increase; the delay depends on how long it takes to apply a batch of logs. The workflow of the state machine can be described as follows:

  1. the state machine sees the current maximum committed index A;
  2. the state machine applies the logs whose index <= A. While this is in progress, the committed index keeps increasing, but the state machine will process those new entries in the next cycle;
  3. the state machine updates |known_applied_index| to A;
  4. repeat steps 1-3.

If the node is under heavy load, step 2 may take a long time, and the gap between the committed index and |known_applied_index| will be large.
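
The cycle above can be sketched as a small simulation. This is a simplified model, not braft code: `simulate_cycles`, `CycleRecord`, and the assumption that a fixed number of new entries (`commits_per_apply`) is committed while each entry is being applied are all hypothetical.

```cpp
#include <cstdint>
#include <vector>

// What an observer would see at the end of step 2 in one cycle.
struct CycleRecord {
    int64_t batch_end;            // A: committed index seen in step 1
    int64_t stale_known_applied;  // known_applied_index reported during step 2
    int64_t committed_at_end;     // committed index when step 2 finishes
};

// Hypothetical model: while one entry is applied, `commits_per_apply` new
// entries are committed by ongoing writes.
std::vector<CycleRecord> simulate_cycles(int64_t initial_committed,
                                         int64_t commits_per_apply,
                                         int cycles) {
    std::vector<CycleRecord> out;
    int64_t committed = initial_committed;
    int64_t known_applied = 0;
    for (int c = 0; c < cycles; ++c) {
        const int64_t a = committed;  // step 1: snapshot the committed index
        for (int64_t i = known_applied + 1; i <= a; ++i) {
            // step 2: apply entry i; writes keep arriving in the meantime
            committed += commits_per_apply;
        }
        out.push_back({a, known_applied, committed});
        known_applied = a;  // step 3: publish only after the whole batch
    }
    return out;
}
```

In this model, when entries are committed faster than they are applied (for example `commits_per_apply = 2`), each batch is larger than the previous one, so the reported index lags further behind with every cycle even though it does eventually advance.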

However, this may confuse users, so we will fix it.

@PFZheng
Collaborator

PFZheng commented Apr 20, 2021

#278 @kishorenc

@kishorenc
Contributor Author

Thank you, that looks good 👍

2 participants