-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build on EL9 #126
Closed
Closed
Build on EL9 #126
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add SConscript files to integrate into the DAOS build system. Signed-off-by: Li Wei <[email protected]>
log_append_entry() passes caller's raft_entry_t to log_offer. If the log_offer implementation sets the data buffer to different value, caller's raft_entry_t gets modified as a side effect. This is error-prone as callers are usually responsible for freeing the data buffer with such a log_offer implementation. This patch passes the raft_entry_t in the log to log_offer instead. Also, ety in raft_recv_entry() is no longer required. Signed-off-by: Li Wei <[email protected]>
When a leader learns of a newer term and steps down, it should discard current_leader. Otherwise, a client request may be rejected with RAFT_ERR_NOT_LEADER and a hint pointing to this replica itself! For simplicity, this patche follows the Raft paper to discard current_leader whenever current_term changes. And, when receiving an AE with an up-to-date term, a server always updates current_leader, as this information is absolute, even if the previous entry doesn't match. This patch also fixes a test that has previously been failing due to incorrect node IDs. Signed-off-by: Li Wei <[email protected]>
If raft_become_candidate() assigns a negative value e (i.e., -election_timeout < e < 0) to timeout_elapsed, and if this replica successfully becomes the leader, then this leader won't send heartbeats for request_timeout - e milliseconds. This may be longer than election_timeout and cause an unnecessary leadership change. This patch zeroes timeout_elapsed in raft_become_leader(), randomizes election timeouts in [election_timeout, 2 * election_timeout), and fixes raft_recv_appendentries() to avoid zeroing timeout_elapsed if the AE request has an older term. Signed-off-by: Li Wei <[email protected]>
This reverts commit c3443aa. Test changes are left intact.
Check and handle memory allocation errors by reporting them back to users via NULL or RAFT_ERR_NOMEM return values. Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries_response() should check r->term before checking r->current_idx. Signed-off-by: Li Wei <[email protected]>
If an AE request's previous entry matches the follower's, then the follower doesn't need to truncate its log at prev_log_idx. When the follower checks if the other entries in the AE request are consistent with its local ones, it already takes care of truncating at any inconsistency. If any later entries not covered by this AE request are inconsistent, future AE requests shall truncate them. Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries() doesn't need to check prev_log_idx against currentIndex, as this is already done implicitly by getting the entry at prev_log_idx. Signed-off-by: Li Wei <[email protected]>
When the calling replica isn't in the leader state, raft_recv_appendentries_response() should return RAFT_ERR_NOT_LEADER instead of -1. Signed-off-by: Li Wei <[email protected]>
Remind users' persist_term implementations to persist currentTerm and nil votedFor atomically. Signed-off-by: Li Wei <[email protected]>
log_offer and log_poll should get log entry index rather than internal array index. That would match the callback comment in raft.h and would also match the log_pop case. Signed-off-by: Li Wei <[email protected]>
User storage callbacks may return negative errors smaller than RAFT_ERR_LAST. These will be returned all the way to the API methods, which report the errors back to users, so that appropriate handlings may be performed. Signed-off-by: Li Wei <[email protected]>
Move the declaration of raft_become_follower() from raft_private.h to raft.h, so that users may voluntarily give up leadership. Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries() should avoid increasing commitIndex beyond the highest matching index. This patch restores the Raft paper behavior which sets commitIndex to min(leaderCommit, index of last new entry). Although the resulting commitIndex shall not decrease with the current leader code (or it may?), adding a check does not complicate the code too much. The check should also help if the leader code ever uses binary searches to determine matchIndexs. Making sure commitIndex does not become zero is then unnecessary. Signed-off-by: Li Wei <[email protected]>
An AE request's prevLogTerm and entries should never conflict with the follower's committed entries, according to the Raft protocol. If this happens, we should print a warning and inform callers to shut down. Signed-off-by: Li Wei <[email protected]>
raft_get_last_applied_entry() appears twice in raft.h. Signed-off-by: Li Wei <[email protected]>
raft_periodic() should check the return value from raft_apply_entry() against 0 instead of -1. Signed-off-by: Li Wei <[email protected]>
Since mod() has a short name and is only in raft_log.c, it should be static. Signed-off-by: Li Wei <[email protected]>
Support polling a prefix of a log up to a certain index. Avoid returning the polled entries, as that's nontrivial to implement safely and well in practice. Remove raft_poll_entry() as raft_end_snapshot() already polls the log. Also, extract the common code for checking indices, calculating array subscripts, and computing batch sizes from an index up. (When adding batch popping support, we will need a matching batch_down().) Signed-off-by: Li Wei <[email protected]>
Support appending multiple entries to the end of the log by adding two new APIs: raft_append and log_append which accept an array of entries as arguments. Remove all references to the single entry versions of the APIs: log_append_entry and raft_append_entry. Increase capacity of the log buffer exponentially by a factor of two at a time; enough to ensure that there is enough to append the given number of entries. This is slightly different from when appending only a single entry since the buffer size may have to be increased more than once to be sufficient. Eliminate the memcpy loop and replace with at the most two calls for copying existing entries to the newly allocated buffer. Signed-off-by: Omkar Kulkarni <[email protected]>
Adding support for deleting multiple entries from the tail of the log starting from the youngest entry. This order is important in order to backtrack transitions to the state machine for correctly restoring the state right before the oldest entry that is deleted. Signed-off-by: Omkar Kulkarni <[email protected]>
To avoid leaving the daos submodule dirty when running raft tests Signed-off-by: Jeff Olivier <[email protected]>
This patch fixes a few confusions around raft_node_t pointers versus integer node IDs. Signed-off-by: Li Wei <[email protected]>
Although RAFT_LOGTYPE_SNAPSHOT entries make getting terms easy, they require extra cares: - After a log is polled, it no longer has a RAFT_LOGTYPE_SNAPSHOT entry at its head. We could require log_poll to convert an existing entry into a RAFT_LOGTYPE_SNAPSHOT one, but that would either be hard to implement without a transactional storage or require a scan for RAFT_LOGTYPE_SNAPSHOT entries in the middle of a log when a replica starts. - When transferring entries or getting entry counts, we need to consider if the log head is a RAFT_LOGTYPE_SNAPSHOT entry or not. It becomes simpler to think about if we get rid of RAFT_LOGTYPE_SNAPSHOT and store the snapshot term separately. Getting a term will indeed need to check if the index points at the snapshot, but that can easily be hidden inside a common helper. - Allow compacting a single-entry log. Useful for testing purposes. Signed-off-by: Li Wei <[email protected]>
For the consistency with raft_begin_snapshot(), raft_end_snapshot() should poll to snapshot_last_idx instead of raft_get_commit_idx(). Signed-off-by: Li Wei <[email protected]>
A snapshot should be accepted even if raft_get_commit_idx() < last_included_index <= raft_get_current_idx(), as the leader might be attempting to overwrite the follower's uncommitted entries. Signed-off-by: Li Wei <[email protected]>
Allow users to specify an index to raft_begin_snapshot(). For users whose state machine implementations allow this, snapshotting an index < the last applied index could reduce the necessities of sending snapshots to slower followers. Signed-off-by: Li Wei <[email protected]>
Description: Fix the warning/errors raised by python bandit check (similar to pylint with security-check) modified: raft/tests/log_fuzzer.py Signed-off-by: Ding Ho [email protected]
Assert the returned pointer from raft_get_entry_from_idx() is non-NULL. Also update packaging files based on latest daos-stack/packaging repo. Signed-off-by: Kenneth Cain <[email protected]>
This behavior is described in the original Raft paper and is required to prevent removed nodes from disrupting the cluster. This causes other problems though, as a split node may trigger election which is rejected but leaves it with a future term. The generally accepted solution for this appears to be the use of PreVote RPC to avoid election that can never succeed. Signed-off-by: Li Wei <[email protected]>
Update the packaging for #42 (Ignore RequestVote when leader is established), so that daos can pick up the fix. Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
When a server receives a RequestVote request, if it has already voted for this candidate in the same term, the server should grant this repeated RequestVote. Signed-off-by: Li Wei <[email protected]>
Add the Pre-Vote phase (§4.2.3, §9.6) before a candidate increments the term, in order to prevent an isolated server from disrupting an existing leader when the isolation ends. A prevote field is added to both msg_requestvote_t and msg_requestvote_response_t. Users may need to make code changes, depending on how they transfer these messages. The Pre-Vote phase is implemented as a substate, determined by raft_server_private_t.prevote, in RAFT_STATE_CANDIDATE. A candidate first enters the Pre-Vote phase (raft_server_private_t.prevote == 1), requesting "prevotes". It's called a prevoting candidate in a few places. If this candidate wins the Pre-Vote phase, it becomes a "prevoted" candidate (raft_server_private_t.prevote == 0), requesting (normal) votes. A prevote request is evaluated against the term and log states, but not the vote state, for simplicity. See the comment in __should_grant_vote for the case being sacrificed. Granted prevotes are tracked using RAFT_NODE_VOTED_FOR_ME flags. When a candidate becomes a prevoted candidate, it clears all those flags and uses them to track (normal) votes. Prevotes and votes from a server itself are now tracked in the same way as those from other servers are, because using voted_for and prevote fields in raft_server_private_t seems to make things trickier. This change also leads to the fixes for a few tests where a server becomes a candidate with an empty membership. Signed-off-by: Li Wei <[email protected]>
* Bump version to 0.8.0 * Update packaging * Set version-release for daos-raft* packages Co-authored-by: Brian J. Murrell <[email protected]> Signed-off-by: Li Wei <[email protected]>
35d1029 clears leader_id after the Pre-Vote phase. That is too late, because incoming RequestVote requests will be rejected during a minimum election timeout from the beginning of the election, by the leader aliveness check in raft_recv_requestvote. The leader_id should be cleared as soon as a candidate resets timeout_elapsed. Signed-off-by: Li Wei <[email protected]>
raft_recv_installsnapshot_response returns immediately if the whole snapshot has not yet been transferred completely. This causes next chunk to be sent after an unnecessary request_timeout. The function should instead continue the execution, skip the node state update, and see if there is a new chunk to send immediately. Signed-off-by: Li Wei <[email protected]>
That tracks master. Skip-PR-comments: true Signed-off-by: Brian J. Murrell <[email protected]>
Mark issues in bandit as #nosec after inspection and fix ones that seem relevant still. Signed-off-by: David Quigley <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
CLinkedListQueue is cloned on demand at build time. This causes a few inconveniences: - If the clone (over the Internet) fails, the CLinkedListQueue directory is not deleted. Future builds will not clone again, but at the same time can not find CLinkedListQueue/linked_list_queue.c. (The issue alone could be fixed by changing Makefile.) - If the raft source is taken into a build environment without Internet access, the clone will fail. This has happened at least once in the field. Since CLinkedListQueue is only used by the tests, and is unlikely to require frequent updates, this patch adds its source into raft.git. Signed-off-by: Li Wei <[email protected]>
An optimization in raft_periodic lets a node who is the only voting node in its local configuration become a leader without any election. No persistent term or vote update is involved. This behavior violates some Raft rules. Consider a two-node configuration, C = {S1, S2}, where S1 is a follower in term 1 ("F1") and S2 is a leader in term 1 ("L1"): S1 (F1): [C] S2 (L1): [C] Both S1 and S2 have voted for S2 in term 1. Then, S2 is removed, resulting in a new configuration, D = {S1}. Before S2 receives a reply from S1 indicating that S1 has D: S1 (F1): [C, D] S2 (L1): [C, D] S1 may become a leader in term 1 through the optimization, because it is the only voting node in D, it doesn't increment the term, and it doesn't go through any election. Before S2 commits D and steps down, the two leaders in term 1 may even exist simultaneously. Another concern is that if S1 above restarts later, it will become the leader of term 1 again. One may argue that this should be a different leadership in a different term, for certain volatile states of the user or raft itself (e.g., nextIndex[] and matchIndex[]) are reset. After all, the performance benefit from the optimization does not seem to justify the correctness concerns (or the logical complexity, at least, if one regards two-node configurations as unimportant). It is with these considerations, the patch removes the optimization. Signed-off-by: Li Wei <[email protected]>
(The core change here may be of no interest to the upstream.) Unfortunately, daos ends up needing to support re-adding a node with its previous ID, which does not work properly with our algorithm and is not supported by the upstream algorithm too. While daos will avoid reusing node IDs eventually, it needs to provide compatiblity for existing persistent raft state. This patch therefore reverts to the upstream algorithm but drops the troublemaking logic that tracks the last applied membership, which is not used by daos. The resulting algorithm is based on the following principles: - raft only tracks the active membership; the user shall track the last applied membership for snapshotting purposes. (Tracking both using the same set of raft_node_t objects in raft proves to be complex. The user does get a burden, but a very clear one.) * The INACTIVE flag allows a node to be absent in the active membership but still present in the last applied one. * The ADDITION_COMMITTED and VOTING_COMMITTED flags track the last applied membership. The former corresponds to the INACTIVE flag for the active membership; the latter to the VOTING flag. - Each membership change entry is a clear transition from a fixed pre-state to a fixed post-state. For instance, ADD_NODE changes a node from absent to voting. This enables clear undo logic in raft_pop_log. * PROMOTE_NODE and REMOVE_NON_VOTING_NODE are introduced to enable upgrading from existing persistent raft state that doesn't use non-voting nodes. This patch also makes the following minor changes: - Fix an upstream bug that does not consider REMOVE_NODE as a voting configuration change. - Add a check to raft_recv_entry for invalid membership change entries. * Removing a leader is currently not safe and now prohibited. * Enforce the pre-state of each membership change. - Remove a redundant setting of voting_cfg_change_log_idx in raft_clear. - Change raft_add_node to disallow "promotion" semantics, which shall be unnecessary to the user. - Fix an upstream log_delete bug that causes log_get_node_id to operate with an entry that has already been popped. - Fix a log_delete bug that causes incorrect indices to be passed to log_pop. Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
Update packaging to 466b45bd0a7f1b714075afbe2a6558d4cef3dfec. Signed-off-by: Li Wei <[email protected]>
Let C = {S1, S2 (nonvoting)}, D = {S1, S2 (voting)}. Consider a promotion of S2 in the following state. S1 (L1): [C, D] S2 (F1): [C] S1 restarts. Now since S2 considers itself a nonvoting member acccording to its local membership C, it won't campaign or grant any vote, whereas S1 needs a vote from S2 to become a leader. The fix is for a node to decide its vote despite its membership status. Signed-off-by: Li Wei <[email protected]>
The node parameter of notify_membership_events shall never be NULL. The assertion for this should be placed before the notify_membership_events call. The API document should also be fixed. Signed-off-by: Li Wei <[email protected]>
An index into the node array shouldn't be raft_index_t. Also, this patch adds a boundary check for the index parameter of raft_get_node_from_idx. Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Li Wei <[email protected]>
The node_id field of raft_server_private_t is initialized to 0 by raft_new. Since 0 may be a valid node ID, raft_new should initialize node_id to -1 instead, like leader_id and what raft_clear does to node_id. This patch fixes the node_id initialization and adds a new API function, raft_set_nodeid, for users to set the node ID. Users shall call raft_set_nodeid right after raft_new/raft_clear. (Adding a node ID parameter to raft_new/raft_clear would be a better solution, though it would require a lot of test changes.) Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
A failure of the following assertion in raft_recv_requestvote was observed: assert(!raft_is_leader(me_) && (vr->prevote || !raft_is_candidate(me_))); At the time, this server was very unlikely a leader. So it must be a candidate granting a non-prevote RV request. Since a candidate votes for itself only when becoming a _prevoted_ candidate, the server must be a _prevoting_ candidate. This patch fixes the assertion to allow such a valid case. Related, raft_become_prevoted_candidate clears me->prevote too early. If the raft_set_current_term call or the raft_vote_for_nodeid call returns an error, the server is left as a prevoted candidate, but hasn't voted for itself. If a non-prevote RV request comes in, the server may grant the vote and fail the assertion above. This patch changes raft_become_prevoted_candidate to exit the prevoting phase after the self vote succeeds. Signed-off-by: Li Wei <[email protected]>
Add BR: make, gcc. Update packaging. Clean up whitespace inconsistencies. #Pragmas from previous commit message: Skip-checkpatch: true Skip-build: true Quick-build: true Skip-unit-tests: true Skip-test: true Signed-off-by: Brian J. Murrell <[email protected]>
Wrong repo. Apologies for the noise. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add BR: make, gcc.
Update packaging.
Clean up whitespace inconsistencies.