Improve Ra server resilience when log infrastructure encounters faults #428

kjnilsson · 2024-04-11T13:53:41Z

Various improvements to data safety when log infrastructure processes encounter faults.

In particular there are many improvements and fixes relating to the server -> wal resend protocol including:

Bug fix to ra_log_cache that would cause most triggered resends to result in a ra process crash.
Dropping fewer messages by using the gen_statem postpone feature.
Ra leaders would previously just exit with wal_down. Now they enter the same await_condition state, although with a shorter timeout, after which they begin a leader transfer process.
Improved detection and availability when a command is lost on the way to the wal and no further commands are sent.
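
To illustrate the leader behaviour and the use of postponing described above, here is a minimal toy model in Python. It is not Ra's code: the class, the `wal_up` and `tick` events, the `transferring_leadership` state name and the timeout value are all assumptions made for the sketch. Only `wal_down`, `await_condition` and the idea of postponing events come from the description.

```python
from collections import deque

# Hypothetical value: the description only says the leader's timeout is shorter.
LEADER_AWAIT_TIMEOUT_MS = 5_000


class LeaderSketch:
    """Toy model of a leader that waits for the WAL instead of exiting."""

    def __init__(self):
        self.state = "leader"
        self.postponed = deque()  # commands held back while waiting
        self.written = []         # commands handed to the (simulated) WAL
        self.waited_ms = 0

    def handle(self, event, payload=None):
        if self.state == "leader":
            if event == "command":
                self.written.append(payload)
            elif event == "wal_down":
                # Previously the leader exited here; now it waits.
                self.state = "await_condition"
                self.waited_ms = 0
        elif self.state == "await_condition":
            if event == "command":
                # Postpone rather than drop, in the spirit of gen_statem's
                # postpone action.
                self.postponed.append(payload)
            elif event == "wal_up":
                self.state = "leader"
                self._replay_postponed()
            elif event == "tick":
                # "tick" is a modelling device that advances the wait clock.
                self.waited_ms += payload
                if self.waited_ms >= LEADER_AWAIT_TIMEOUT_MS:
                    self.state = "transferring_leadership"

    def _replay_postponed(self):
        # Postponed commands are retried in arrival order after the state change.
        while self.postponed:
            self.handle("command", self.postponed.popleft())
```

In this model a command that arrives while the WAL is down is kept and written once the WAL is back, and a leader that waits past the timeout moves on to hand over leadership rather than crashing.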

Also there is a new feature to configure, on a per-system basis, what kind of server recovery should take place when a ra system starts or restarts. There are 3 options:

undefined: do not restart any ra server
registered: restart all locally registered servers for the system
mfa: call a custom function that performs the restart

This feature allows dynamically started ra servers to be restarted should the ra system crash and restart.
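
The three options can be pictured as a small dispatch, sketched here in Python. This is only an illustration of the decision, not Ra's implementation: the function name, its parameters and the way the custom function receives its arguments are assumptions, since the description does not say how the mfa is invoked.

```python
def recover_servers(strategy, registered, restart):
    """Return whatever was restarted for the given recovery strategy.

    strategy   -- None (stands in for `undefined`), "registered", or a
                  (function, args) tuple standing in for the `mfa` option
    registered -- names of the locally registered servers (hypothetical input)
    restart    -- callable that restarts one server by name (hypothetical)
    """
    if strategy is None:
        # undefined: do not restart anything
        return []
    if strategy == "registered":
        # registered: restart every locally registered server
        return [restart(name) for name in registered]
    if isinstance(strategy, tuple):
        # mfa: delegate entirely to the custom function
        func, args = strategy
        return func(*args)
    raise ValueError(f"unknown recovery strategy: {strategy!r}")
```

For example, `recover_servers("registered", ["a", "b"], restart)` calls `restart` once per name, while passing `None` leaves every server stopped.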

Also includes improvements to code coverage and some refactoring.

Fixes: #416

pjk25 · 2024-04-22T14:16:15Z

@kjnilsson you can rebase now that #431 is merged

michaelklishin

I have pushed some mostly cosmetic changes:

Less logging
Clarified a few comments

kjnilsson force-pushed the log-improvements branch from 55d6bba to 5beb361 Compare April 11, 2024 13:57

kjnilsson force-pushed the log-improvements branch from 75f40d4 to 41d86a8 Compare April 23, 2024 18:52

kjnilsson changed the title ~~Log improvements~~ ra_log fault resilience and other fixes Apr 24, 2024

kjnilsson force-pushed the log-improvements branch from 2880733 to 80d041c Compare April 24, 2024 14:45

kjnilsson marked this pull request as ready for review April 24, 2024 14:46

kjnilsson added this to the 2.10.0 milestone Apr 24, 2024

kjnilsson mentioned this pull request Apr 24, 2024

ra_log_cache_key_not_found exception exit occured #416

Closed

michaelklishin added 2 commits April 24, 2024 16:05

Remove commented out exports of functions that were not added to the new module after all

76a933c

Log EAGAIN retries at debug level

46cf4cb

michaelklishin reviewed Apr 24, 2024

src/ra_server.erl

Cosmetics

4f19473

* Clarify some comments
* Log less where it is arguably not very important

michaelklishin approved these changes Apr 24, 2024

michaelklishin changed the title ~~ra_log fault resilience and other fixes~~ Improves log write failure resilience and other fixes Apr 24, 2024

test reliability

6b54def

kjnilsson changed the title ~~Improves log write failure resilience and other fixes~~ Improve ra server resilience when log infrastructure experiences faults Apr 25, 2024

kjnilsson changed the title ~~Improve ra server resilience when log infrastructure experiences faults~~ Improve Ra server resilience when log infrastructure experiences faults Apr 25, 2024

kjnilsson changed the title ~~Improve Ra server resilience when log infrastructure experiences faults~~ Improve Ra server resilience when log infrastructure encounters faults Apr 25, 2024

kjnilsson merged commit d4eb462 into main Apr 25, 2024
10 checks passed

michaelklishin deleted the log-improvements branch April 25, 2024 18:57
