Improve Ra server resilience when log infrastructure encounters faults #428

Merged
merged 5 commits into from
Apr 25, 2024


@kjnilsson kjnilsson commented Apr 11, 2024

Various improvements to data safety when log infrastructure processes encounter faults.

In particular there are many improvements and fixes relating to the server -> wal resend protocol including:

  • Fixed a bug in ra_log_cache that caused most triggered resends to result in a Ra process crash.
  • Fewer messages are dropped, thanks to the gen_statem postpone feature.
  • Ra leaders previously just exited with wal_down. Now they enter the same await_condition state, although with a shorter timeout, after which they begin a leader transfer process.
  • Improved detection and availability when a command is lost on the way to the wal and no further commands are sent.
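
To illustrate the postpone idea mentioned above: while a server waits for a condition (for example, the WAL becoming available again), incoming events are set aside and retried after the state changes, instead of being dropped. The following is only a toy model in Python, not Ra code. The class and method names are hypothetical; only the await_condition state name comes from the description above.

```python
from collections import deque


class PostponingServer:
    """Toy model of a state machine that postpones events instead of dropping them."""

    def __init__(self):
        self.state = "await_condition"  # waiting, e.g. for the WAL to come back
        self.postponed = deque()        # events deferred until the state changes
        self.handled = []               # events processed, in order

    def handle(self, event):
        if self.state == "await_condition":
            # Cannot process the event now: keep it for later rather than drop it.
            self.postponed.append(event)
        else:
            self.handled.append(event)

    def condition_met(self, next_state="leader"):
        # After the state change, postponed events are retried in arrival order,
        # ahead of any events that arrive later.
        self.state = next_state
        while self.postponed:
            self.handle(self.postponed.popleft())


server = PostponingServer()
server.handle("cmd1")     # postponed
server.handle("cmd2")     # postponed
server.condition_met()    # cmd1 and cmd2 are replayed
server.handle("cmd3")     # handled immediately
print(server.handled)     # ['cmd1', 'cmd2', 'cmd3']
```

In this model no command is lost while the server waits, and the original ordering is preserved once processing resumes.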

There is also a new feature that configures, on a per-system basis, what kind of server recovery should take place when a Ra system starts or restarts. There are 3 options:

  1. undefined : do not restart any ra server
  2. registered: restart all locally registered servers for the system
  3. mfa: call a custom function that performs the restart.

This feature will allow dynamically started Ra servers to be restarted should the Ra system crash and restart.
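
The three recovery options can be pictured as a simple dispatch that runs when the system starts. The sketch below is a Python model under stated assumptions, not Ra's Erlang API: the function name `recover_servers`, its parameters, and the use of `None` and a Python callable to stand in for `undefined` and `mfa` are all hypothetical, and the actual configuration key is not named in this description.

```python
def recover_servers(strategy, registered_servers, restart):
    """Model of the per-system recovery choice made at system (re)start.

    strategy: None (models `undefined`), "registered", or a callable
              (models `mfa`, a custom function that performs the restart).
    registered_servers: names of the locally registered servers (hypothetical).
    restart: function that restarts one server by name (hypothetical).
    """
    if strategy is None:
        # undefined: do not restart any server.
        return
    if strategy == "registered":
        # registered: restart every locally registered server for the system.
        for name in registered_servers:
            restart(name)
        return
    if callable(strategy):
        # mfa: delegate entirely to the custom function.
        strategy()
        return
    raise ValueError(f"unknown recovery strategy: {strategy!r}")


restarted = []
recover_servers(None, ["a", "b"], restarted.append)
after_undefined = list(restarted)          # nothing restarted

recover_servers("registered", ["a", "b"], restarted.append)
after_registered = list(restarted)         # both registered servers restarted

recover_servers(lambda: restarted.append("custom"), ["a", "b"], restarted.append)
print(after_undefined, after_registered, restarted)
```

With the custom option, the caller decides which servers to restart and how, which is why the model passes it no arguments and leaves the policy to the supplied function.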

Also improvements to code coverage and refactoring.

Fixes: #416

pjk25 commented Apr 22, 2024

@kjnilsson you can rebase now that #431 is merged

@kjnilsson kjnilsson changed the title from "Log improvements" to "ra_log fault resilience and other fixes" Apr 24, 2024
improvements to data safety when log infra crashes.
@kjnilsson kjnilsson marked this pull request as ready for review April 24, 2024 14:46
@kjnilsson kjnilsson added this to the 2.10.0 milestone Apr 24, 2024
 * Clarify some comments
 * Log less where it is arguably not very important

@michaelklishin michaelklishin left a comment

I have pushed some mostly cosmetic changes:

  • Less logging
  • Clarified a few comments

@michaelklishin michaelklishin changed the title from "ra_log fault resilience and other fixes" to "Improves log write failure resilience and other fixes" Apr 24, 2024
@kjnilsson kjnilsson changed the title from "Improves log write failure resilience and other fixes" to "Improve ra server resilience when log infrastructure experiences faults" Apr 25, 2024
@kjnilsson kjnilsson changed the title from "Improve ra server resilience when log infrastructure experiences faults" to "Improve Ra server resilience when log infrastructure experiences faults" Apr 25, 2024
@kjnilsson kjnilsson changed the title from "Improve Ra server resilience when log infrastructure experiences faults" to "Improve Ra server resilience when log infrastructure encounters faults" Apr 25, 2024
@kjnilsson kjnilsson merged commit d4eb462 into main Apr 25, 2024
10 checks passed
@michaelklishin michaelklishin deleted the log-improvements branch April 25, 2024 18:57
Development

Successfully merging this pull request may close these issues.

ra_log_cache_key_not_found exception exit occured