Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build on EL9 #126

Closed
wants to merge 98 commits into from
Closed

Build on EL9 #126

wants to merge 98 commits into from

Conversation

brianjmurrell
Copy link

Add BR: make, gcc.

Update packaging.

Clean up whitespace inconsistencies.

liw and others added 30 commits April 1, 2017 15:50
Add SConscript files to integrate into the DAOS build system.

Signed-off-by: Li Wei <[email protected]>
log_append_entry() passes caller's raft_entry_t to log_offer. If the
log_offer implementation sets the data buffer to different value,
caller's raft_entry_t gets modified as a side effect. This is
error-prone as callers are usually responsible for freeing the data
buffer with such a log_offer implementation. This patch passes the
raft_entry_t in the log to log_offer instead. Also, ety in
raft_recv_entry() is no longer required.

Signed-off-by: Li Wei <[email protected]>
When a leader learns of a newer term and steps down, it should discard
current_leader. Otherwise, a client request may be rejected with
RAFT_ERR_NOT_LEADER and a hint pointing to this replica itself!

For simplicity, this patche follows the Raft paper to discard
current_leader whenever current_term changes. And, when receiving an AE
with an up-to-date term, a server always updates current_leader, as this
information is absolute, even if the previous entry doesn't match.

This patch also fixes a test that has previously been failing due to
incorrect node IDs.

Signed-off-by: Li Wei <[email protected]>
If raft_become_candidate() assigns a negative value e (i.e.,
-election_timeout < e < 0) to timeout_elapsed, and if this replica
successfully becomes the leader, then this leader won't send heartbeats
for request_timeout - e milliseconds. This may be longer than
election_timeout and cause an unnecessary leadership change.

This patch zeroes timeout_elapsed in raft_become_leader(), randomizes
election timeouts in [election_timeout, 2 * election_timeout), and fixes
raft_recv_appendentries() to avoid zeroing timeout_elapsed if the AE
request has an older term.

Signed-off-by: Li Wei <[email protected]>
This reverts commit c3443aa. Test
changes are left intact.
Check and handle memory allocation errors by reporting them back to
users via NULL or RAFT_ERR_NOMEM return values.

Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries_response() should check r->term before checking
r->current_idx.

Signed-off-by: Li Wei <[email protected]>
If an AE request's previous entry matches the follower's, then the
follower doesn't need to truncate its log at prev_log_idx. When the
follower checks if the other entries in the AE request are consistent
with its local ones, it already takes care of truncating at any
inconsistency. If any later entries not covered by this AE request are
inconsistent, future AE requests shall truncate them.

Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries() doesn't need to check prev_log_idx against
currentIndex, as this is already done implicitly by getting the entry at
prev_log_idx.

Signed-off-by: Li Wei <[email protected]>
When the calling replica isn't in the leader state,
raft_recv_appendentries_response() should return RAFT_ERR_NOT_LEADER
instead of -1.

Signed-off-by: Li Wei <[email protected]>
Remind users' persist_term implementations to persist currentTerm and
nil votedFor atomically.

Signed-off-by: Li Wei <[email protected]>
log_offer and log_poll should get log entry index rather than internal
array index. That would match the callback comment in raft.h and would
also match the log_pop case.

Signed-off-by: Li Wei <[email protected]>
User storage callbacks may return negative errors smaller than
RAFT_ERR_LAST. These will be returned all the way to the API methods,
which report the errors back to users, so that appropriate handlings may
be performed.

Signed-off-by: Li Wei <[email protected]>
Move the declaration of raft_become_follower() from raft_private.h to
raft.h, so that users may voluntarily give up leadership.

Signed-off-by: Li Wei <[email protected]>
raft_recv_appendentries() should avoid increasing commitIndex beyond the
highest matching index. This patch restores the Raft paper behavior
which sets commitIndex to min(leaderCommit, index of last new entry).

Although the resulting commitIndex shall not decrease with the current
leader code (or it may?), adding a check does not complicate the code
too much. The check should also help if the leader code ever uses binary
searches to determine matchIndexs. Making sure commitIndex does not
become zero is then unnecessary.

Signed-off-by: Li Wei <[email protected]>
An AE request's prevLogTerm and entries should never conflict with the
follower's committed entries, according to the Raft protocol. If this
happens, we should print a warning and inform callers to shut down.

Signed-off-by: Li Wei <[email protected]>
raft_get_last_applied_entry() appears twice in raft.h.

Signed-off-by: Li Wei <[email protected]>
raft_periodic() should check the return value from raft_apply_entry()
against 0 instead of -1.

Signed-off-by: Li Wei <[email protected]>
Since mod() has a short name and is only in raft_log.c, it should be
static.

Signed-off-by: Li Wei <[email protected]>
Support polling a prefix of a log up to a certain index. Avoid returning
the polled entries, as that's nontrivial to implement safely and well in
practice. Remove raft_poll_entry() as raft_end_snapshot() already polls
the log.

Also, extract the common code for checking indices, calculating array
subscripts, and computing batch sizes from an index up. (When adding
batch popping support, we will need a matching batch_down().)

Signed-off-by: Li Wei <[email protected]>
Support appending multiple entries to the end of the log by adding
two new APIs: raft_append and log_append which accept an array of
entries as arguments. Remove all references to the single entry
versions of the APIs: log_append_entry and raft_append_entry.

Increase capacity of the log buffer exponentially by a factor
of two at a time; enough to ensure that there is enough to
append the given number of entries. This is slightly different
from when appending only a single entry since the buffer size
may have to be increased more than once to be sufficient.

Eliminate the memcpy loop and replace with at the most two calls
for copying existing entries to the newly allocated buffer.

Signed-off-by: Omkar Kulkarni <[email protected]>
Adding support for deleting multiple entries from the tail of the
log starting from the youngest entry. This order is important in
order to backtrack transitions to the state machine for correctly
restoring the state right before the oldest entry that is deleted.

Signed-off-by: Omkar Kulkarni <[email protected]>
To avoid leaving the daos submodule dirty when running raft
tests

Signed-off-by: Jeff Olivier <[email protected]>
This patch fixes a few confusions around raft_node_t pointers versus
integer node IDs.

Signed-off-by: Li Wei <[email protected]>
Although RAFT_LOGTYPE_SNAPSHOT entries make getting terms easy, they
require extra cares:

  - After a log is polled, it no longer has a RAFT_LOGTYPE_SNAPSHOT
    entry at its head. We could require log_poll to convert an existing
    entry into a RAFT_LOGTYPE_SNAPSHOT one, but that would either be
    hard to implement without a transactional storage or require a scan
    for RAFT_LOGTYPE_SNAPSHOT entries in the middle of a log when a
    replica starts.

  - When transferring entries or getting entry counts, we need to
    consider if the log head is a RAFT_LOGTYPE_SNAPSHOT entry or not.

It becomes simpler to think about if we get rid of RAFT_LOGTYPE_SNAPSHOT
and store the snapshot term separately. Getting a term will indeed need
to check if the index points at the snapshot, but that can easily be
hidden inside a common helper.

  - Allow compacting a single-entry log. Useful for testing purposes.

Signed-off-by: Li Wei <[email protected]>
For the consistency with raft_begin_snapshot(), raft_end_snapshot()
should poll to snapshot_last_idx instead of raft_get_commit_idx().

Signed-off-by: Li Wei <[email protected]>
A snapshot should be accepted even if raft_get_commit_idx() <
last_included_index <= raft_get_current_idx(), as the leader might be
attempting to overwrite the follower's uncommitted entries.

Signed-off-by: Li Wei <[email protected]>
Allow users to specify an index to raft_begin_snapshot(). For users
whose state machine implementations allow this, snapshotting an index <
the last applied index could reduce the necessities of sending snapshots
to slower followers.

Signed-off-by: Li Wei <[email protected]>
dinghwah and others added 26 commits December 18, 2020 16:57
Description: Fix the warning/errors raised by python bandit check (similar to pylint with security-check)

modified: raft/tests/log_fuzzer.py

Signed-off-by: Ding Ho [email protected]
Assert the returned pointer from raft_get_entry_from_idx() is non-NULL.

Also update packaging files based on latest daos-stack/packaging repo.

Signed-off-by: Kenneth Cain <[email protected]>
This behavior is described in the original Raft paper and is required to
prevent removed nodes from disrupting the cluster.

This causes other problems though, as a split node may trigger election
which is rejected but leaves it with a future term.  The generally
accepted solution for this appears to be the use of PreVote RPC to avoid
election that can never succeed.

Signed-off-by: Li Wei <[email protected]>
Update the packaging for #42 (Ignore RequestVote when leader is
established), so that daos can pick up the fix.

Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
When a server receives a RequestVote request, if it has already voted
for this candidate in the same term, the server should grant this
repeated RequestVote.

Signed-off-by: Li Wei <[email protected]>
Add the Pre-Vote phase (§4.2.3, §9.6) before a candidate increments the
term, in order to prevent an isolated server from disrupting an existing
leader when the isolation ends.

A prevote field is added to both msg_requestvote_t and
msg_requestvote_response_t. Users may need to make code changes,
depending on how they transfer these messages.

The Pre-Vote phase is implemented as a substate, determined by
raft_server_private_t.prevote, in RAFT_STATE_CANDIDATE. A candidate
first enters the Pre-Vote phase (raft_server_private_t.prevote == 1),
requesting "prevotes". It's called a prevoting candidate in a few
places. If this candidate wins the Pre-Vote phase, it becomes a
"prevoted" candidate (raft_server_private_t.prevote == 0), requesting
(normal) votes.

A prevote request is evaluated against the term and log states, but not
the vote state, for simplicity. See the comment in __should_grant_vote
for the case being sacrificed.

Granted prevotes are tracked using RAFT_NODE_VOTED_FOR_ME flags. When a
candidate becomes a prevoted candidate, it clears all those flags and
uses them to track (normal) votes. Prevotes and votes from a server
itself are now tracked in the same way as those from other servers are,
because using voted_for and prevote fields in raft_server_private_t
seems to make things trickier. This change also leads to the fixes for a
few tests where a server becomes a candidate with an empty membership.

Signed-off-by: Li Wei <[email protected]>
* Bump version to 0.8.0
* Update packaging
* Set version-release for daos-raft* packages

Co-authored-by: Brian J. Murrell <[email protected]>
Signed-off-by: Li Wei <[email protected]>
35d1029 clears leader_id after the Pre-Vote phase. That is too late,
because incoming RequestVote requests will be rejected during a minimum
election timeout from the beginning of the election, by the leader
aliveness check in raft_recv_requestvote. The leader_id should be
cleared as soon as a candidate resets timeout_elapsed.

Signed-off-by: Li Wei <[email protected]>
raft_recv_installsnapshot_response returns immediately if the whole
snapshot has not yet been transferred completely. This causes next chunk
to be sent after an unnecessary request_timeout. The function should
instead continue the execution, skip the node state update, and see if
there is a new chunk to send immediately.

Signed-off-by: Li Wei <[email protected]>
That tracks master.

Skip-PR-comments: true

Signed-off-by: Brian J. Murrell <[email protected]>
Mark issues in bandit as #nosec after inspection and fix ones that seem relevant still.

Signed-off-by: David Quigley <[email protected]>
CLinkedListQueue is cloned on demand at build time. This causes a few
inconveniences:

  - If the clone (over the Internet) fails, the CLinkedListQueue
    directory is not deleted. Future builds will not clone again, but at
    the same time can not find CLinkedListQueue/linked_list_queue.c.
    (The issue alone could be fixed by changing Makefile.)

  - If the raft source is taken into a build environment without
    Internet access, the clone will fail. This has happened at least
    once in the field.

Since CLinkedListQueue is only used by the tests, and is unlikely to
require frequent updates, this patch adds its source into raft.git.

Signed-off-by: Li Wei <[email protected]>
An optimization in raft_periodic lets a node who is the only voting node
in its local configuration become a leader without any election. No
persistent term or vote update is involved.

This behavior violates some Raft rules. Consider a two-node
configuration, C = {S1, S2}, where S1 is a follower in term 1 ("F1") and
S2 is a leader in term 1 ("L1"):

  S1 (F1): [C]
  S2 (L1): [C]

Both S1 and S2 have voted for S2 in term 1. Then, S2 is removed,
resulting in a new configuration, D = {S1}. Before S2 receives a reply
from S1 indicating that S1 has D:

  S1 (F1): [C, D]
  S2 (L1): [C, D]

S1 may become a leader in term 1 through the optimization, because it is
the only voting node in D, it doesn't increment the term, and it doesn't
go through any election. Before S2 commits D and steps down, the two
leaders in term 1 may even exist simultaneously.

Another concern is that if S1 above restarts later, it will become the
leader of term 1 again. One may argue that this should be a different
leadership in a different term, for certain volatile states of the user
or raft itself (e.g., nextIndex[] and matchIndex[]) are reset.

After all, the performance benefit from the optimization does not seem
to justify the correctness concerns (or the logical complexity, at
least, if one regards two-node configurations as unimportant). It is
with these considerations, the patch removes the optimization.

Signed-off-by: Li Wei <[email protected]>
(The core change here may be of no interest to the upstream.)

Unfortunately, daos ends up needing to support re-adding a node with its
previous ID, which does not work properly with our algorithm and is not
supported by the upstream algorithm too. While daos will avoid reusing
node IDs eventually, it needs to provide compatiblity for existing
persistent raft state. This patch therefore reverts to the upstream
algorithm but drops the troublemaking logic that tracks the last applied
membership, which is not used by daos. The resulting algorithm is based
on the following principles:

  - raft only tracks the active membership; the user shall track the
    last applied membership for snapshotting purposes. (Tracking both
    using the same set of raft_node_t objects in raft proves to be
    complex. The user does get a burden, but a very clear one.)
      * The INACTIVE flag allows a node to be absent in the active
        membership but still present in the last applied one.
      * The ADDITION_COMMITTED and VOTING_COMMITTED flags track the last
        applied membership. The former corresponds to the INACTIVE flag
        for the active membership; the latter to the VOTING flag.

  - Each membership change entry is a clear transition from a fixed
    pre-state to a fixed post-state. For instance, ADD_NODE changes a
    node from absent to voting. This enables clear undo logic in
    raft_pop_log.
      * PROMOTE_NODE and REMOVE_NON_VOTING_NODE are introduced to enable
        upgrading from existing persistent raft state that doesn't use
        non-voting nodes.

This patch also makes the following minor changes:

  - Fix an upstream bug that does not consider REMOVE_NODE as a voting
    configuration change.

  - Add a check to raft_recv_entry for invalid membership change
    entries.
      * Removing a leader is currently not safe and now prohibited.
      * Enforce the pre-state of each membership change.

  - Remove a redundant setting of voting_cfg_change_log_idx in
    raft_clear.

  - Change raft_add_node to disallow "promotion" semantics, which shall
    be unnecessary to the user.

  - Fix an upstream log_delete bug that causes log_get_node_id to
    operate with an entry that has already been popped.

  - Fix a log_delete bug that causes incorrect indices to be passed to
    log_pop.

Signed-off-by: Li Wei <[email protected]>
Update packaging to 466b45bd0a7f1b714075afbe2a6558d4cef3dfec.

Signed-off-by: Li Wei <[email protected]>
Let C = {S1, S2 (nonvoting)}, D = {S1, S2 (voting)}. Consider a
promotion of S2 in the following state.

  S1 (L1): [C, D]
  S2 (F1): [C]

S1 restarts. Now since S2 considers itself a nonvoting member acccording
to its local membership C, it won't campaign or grant any vote, whereas
S1 needs a vote from S2 to become a leader. The fix is for a node to
decide its vote despite its membership status.

Signed-off-by: Li Wei <[email protected]>
The node parameter of notify_membership_events shall never be NULL. The
assertion for this should be placed before the notify_membership_events
call. The API document should also be fixed.

Signed-off-by: Li Wei <[email protected]>
An index into the node array shouldn't be raft_index_t. Also, this patch
adds a boundary check for the index parameter of raft_get_node_from_idx.

Signed-off-by: Li Wei <[email protected]>
The node_id field of raft_server_private_t is initialized to 0 by
raft_new. Since 0 may be a valid node ID, raft_new should initialize
node_id to -1 instead, like leader_id and what raft_clear does to
node_id. This patch fixes the node_id initialization and adds a new API
function, raft_set_nodeid, for users to set the node ID. Users shall
call raft_set_nodeid right after raft_new/raft_clear. (Adding a node ID
parameter to raft_new/raft_clear would be a better solution, though it
would require a lot of test changes.)

Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Brian J. Murrell <[email protected]>
A failure of the following assertion in raft_recv_requestvote was
observed:

  assert(!raft_is_leader(me_) &&
         (vr->prevote || !raft_is_candidate(me_)));

At the time, this server was very unlikely a leader. So it must be a
candidate granting a non-prevote RV request. Since a candidate votes for
itself only when becoming a _prevoted_ candidate, the server must be a
_prevoting_ candidate. This patch fixes the assertion to allow such a
valid case.

Related, raft_become_prevoted_candidate clears me->prevote too early. If
the raft_set_current_term call or the raft_vote_for_nodeid call returns
an error, the server is left as a prevoted candidate, but hasn't voted
for itself. If a non-prevote RV request comes in, the server may grant
the vote and fail the assertion above. This patch changes
raft_become_prevoted_candidate to exit the prevoting phase after the
self vote succeeds.

Signed-off-by: Li Wei <[email protected]>
Add BR: make, gcc.

Update packaging.

Clean up whitespace inconsistencies.





#Pragmas from previous commit message:
Skip-checkpatch: true
Skip-build: true
Quick-build: true
Skip-unit-tests: true
Skip-test: true
Signed-off-by: Brian J. Murrell <[email protected]>
@brianjmurrell
Copy link
Author

Wrong repo. Apologies for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.