Leader: One node in a Raft cluster acts as the leader, and all log entries flow from the leader to the other servers
Leader Election: Raft uses randomized timers to elect leaders when the system boots up or a leader is unreachable
Membership Changes: The protocol for adding or removing servers from a Raft cluster
Replicated State Machine: The foundation of distributed systems/consensus algorithms. A set of distributed servers operates on the same state and can continue to operate even if some of the servers are down. All servers agree on the same values.
Log: The replicated state machine log contains the state machine commands submitted by clients
The state machine executes these commands in order – the consensus algorithm keeps the logs consistent and identically ordered across servers
Safety: If any server has applied a particular log entry to its state machine, no other server can apply a different log entry at the same index
Log Entry: An entry in the log, composed of an index, a command, and the term number in which the leader received the entry
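As a rough sketch of how these terms map to state, here is what a log entry and a server's bookkeeping might look like in Go (names and field layout are illustrative, not from any particular implementation):

```go
package raft

// LogEntry is one slot in the replicated log: the client's command, the
// term in which the leader received it, and its position in the log.
type LogEntry struct {
	Index   int    // position in the log, starting at 1
	Term    int    // term when the leader received the entry
	Command []byte // opaque state machine command from the client
}

// Server holds the core state every Raft node tracks.
type Server struct {
	currentTerm int        // latest term this server has seen
	votedFor    int        // candidate voted for in currentTerm (-1 if none)
	log         []LogEntry // the replicated log
	commitIndex int        // highest log index known to be committed
	lastApplied int        // highest log index applied to the state machine
}
```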
To begin an election, a follower becomes a candidate: it increments its term, votes for itself, and sends out RequestVote RPCs
Each server can vote for at most one candidate in a given term, on a first-come, first-served basis
If a candidate receives a heartbeat from another leader whose term is at least as high as its own, it reverts to follower
Election timeouts are conventionally randomized in the 150-300ms range, but should be an order of magnitude longer than the time it takes a server to respond to an RPC
Depending on the node's storage medium, that RPC response time can be anywhere from ~0.5ms to 20ms
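A minimal sketch of the randomized election timer in Go, using the paper's conventional 150-300ms range (the channel and function names are illustrative):

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a fresh randomized timeout in [150ms, 300ms).
// Randomization spreads candidates out in time, so split votes are rare.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(150))*time.Millisecond
}

// awaitLeaderFailure blocks while heartbeats keep arriving; once a full
// randomized timeout passes with no heartbeat, it returns and the caller
// should become a candidate (increment term, vote for itself, send
// RequestVote RPCs).
func awaitLeaderFailure(heartbeat <-chan struct{}) {
	timer := time.NewTimer(electionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-heartbeat:
			if !timer.Stop() { // drain before Reset, per time.Timer rules
				<-timer.C
			}
			timer.Reset(electionTimeout())
		case <-timer.C:
			return // election timeout elapsed: start an election
		}
	}
}
```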
When a leader receives a client request, it appends the entry to its log, then sends AppendEntries RPCs in parallel to the other servers to replicate the entry
After the entry is safely replicated, the leader applies it to its state machine and returns the result to the client
An entry is deemed committed once a majority of servers have stored it; committed entries are then applied in log order
So the leader commits the entry and replies to the client only after confirming that a majority of followers have replicated it
If followers crash or are slow, the leader retries AppendEntries RPCs indefinitely until every follower eventually stores all log entries
Any log inconsistencies (and there are many edge cases) are resolved by the leader forcing the followers' logs to match its own
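A sketch of the follower-side consistency check that drives this repair, reusing the types sketched above (a real implementation truncates only on an actual conflict and persists state before replying; this is simplified):

```go
// AppendEntriesArgs carries the fields the leader sends for replication
// and for the log consistency check.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	PrevLogIndex int        // index of the entry immediately preceding Entries
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // entries to replicate (empty for a heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// handleAppendEntries rejects the RPC if the follower's log lacks an entry
// matching (PrevLogIndex, PrevLogTerm); the leader reacts by backing up and
// retrying with an earlier prefix. On success, the leader's entries replace
// the follower's suffix, forcing the logs to match.
func (s *Server) handleAppendEntries(a AppendEntriesArgs) bool {
	if a.Term < s.currentTerm {
		return false // stale leader; ignore
	}
	if a.PrevLogIndex > 0 {
		if a.PrevLogIndex > len(s.log) || s.log[a.PrevLogIndex-1].Term != a.PrevLogTerm {
			return false // consistency check failed: leader backs up and retries
		}
	}
	// Simplified: drop everything after the agreed prefix, then append.
	s.log = append(s.log[:a.PrevLogIndex], a.Entries...)
	if a.LeaderCommit > s.commitIndex {
		s.commitIndex = min(a.LeaderCommit, len(s.log))
	}
	return true
}
```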
How does this ensure safety? What if a leader commits a log entry, crashes, and a follower that has not replicated the entry becomes the new leader?
Raft adds an election restriction for this: a candidate can only become leader if its log contains all entries committed in previous terms
This is done in the voting process
A candidate must win votes from a majority of the cluster, and every committed entry is present on at least one server in any majority
The RequestVote RPC includes information about the candidate's log; if a voter sees that the candidate's log is less up-to-date than its own, it refuses to vote for the candidate
Because committing an entry requires a majority of servers to store it, and winning an election requires votes from a majority, the two majorities must overlap – so a candidate missing a committed entry can never become leader
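The voting check itself reduces to a small comparison. A sketch, assuming each side summarizes its log by the term and index of its last entry:

```go
// candidateUpToDate implements the election restriction: a voter grants its
// vote only if the candidate's log is at least as up-to-date as its own,
// comparing the term of the last entry first, then the log length.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```

A server also withholds its vote if it has already voted for another candidate in the same term.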
Membership changes – adding or removing servers from the cluster – are built directly into Raft (and are one of the parts often excluded from non-production implementations)
Configuration changes require a two-phase approach – switching the cluster over is not atomic, and without two phases the transition period is unsafe: the old and new configurations could each form their own disjoint majority
Configuration changes are replicated and committed by the leader as entries in the log
The transition essentially runs two overlapping configurations of the cluster in tandem
Joint consensus
Log entries are replicated to all servers in both configurations
Agreement requires separate majorities from both old and new configurations
Because of joint consensus, there is no point in time at which either the old or the new configuration can make decisions unilaterally
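To make "separate majorities" concrete, here is a sketch of the joint quorum rule (types and names are illustrative); during the joint phase the same check gates both elections and commitment, which is what closes any window for unilateral decisions:

```go
// Config lists the voting members of one cluster configuration.
type Config struct {
	Servers map[int]bool // server IDs that hold a vote
}

// majority reports whether the acknowledging servers form a majority of c.
func majority(c Config, acks map[int]bool) bool {
	n := 0
	for id := range c.Servers {
		if acks[id] {
			n++
		}
	}
	return 2*n > len(c.Servers)
}

// jointQuorum is the agreement rule during joint consensus: an entry commits
// (and a candidate wins) only with separate majorities from BOTH the old and
// new configurations, so neither side can ever decide anything on its own.
func jointQuorum(oldCfg, newCfg Config, acks map[int]bool) bool {
	return majority(oldCfg, acks) && majority(newCfg, acks)
}
```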
When the leader receives a request for a configuration change, it replicates the new configuration entry to the servers as usual, committing it once a majority of servers have replicated it
The first configuration entry committed is the joint old-and-new configuration (C_old,new in the paper)
Each server always uses the most recent configuration in its log, whether or not that entry is committed
Once the joint configuration is in use, committing entries (including the final new-configuration entry) requires agreement from both configurations
Once the new configuration is committed, the old configuration is irrelevant and servers not in the new configuration can be shut down
Because new servers join without any log entries, it may take some time before they can usefully participate, while all the existing log entries are replicated to them
Because of this, new servers may first join the cluster in a non-voting phase, where the leader replicates log entries to them but they do not count toward majorities
If the current leader is not part of the new configuration, it steps down after it commits the new configuration
When a client starts up, it connects to a random server and is forwarded to the leader
Current leader information is returned to the client and all future requests go to that leader
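A sketch of the client side under these rules; the Reply shape and the transport hook are assumptions, not a real API:

```go
import "math/rand"

// Reply is an assumed response shape: whether the command was applied, and
// a hint pointing at the current leader when the contacted server isn't it.
type Reply struct {
	OK         bool
	LeaderHint int // index into the server list, or -1 if unknown
}

// call is the transport hook; supplied by the RPC layer in a real client.
var call func(addr string, cmd []byte) (Reply, error)

// submit picks a random server, then follows leader hints (or round-robins
// on failure) until the leader accepts and commits the command.
func submit(servers []string, cmd []byte) error {
	i := rand.Intn(len(servers))
	for {
		reply, err := call(servers[i], cmd)
		switch {
		case err == nil && reply.OK:
			return nil // the leader committed and applied the command
		case err == nil && reply.LeaderHint >= 0:
			i = reply.LeaderHint // redirected: talk to the leader directly
		default:
			i = (i + 1) % len(servers) // server down or clueless: try the next
		}
	}
}
```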
Raft wants to implement linearizable semantics – each operation should appear to execute exactly once
Without extra care, if a leader commits an entry but crashes before responding to the client, the client's retry can execute the request twice
The solution to this is for clients to attach idempotency keys (unique request IDs) to requests, so the state machine can detect and skip duplicates
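A sketch of duplicate detection in the state machine, assuming each request carries a unique key (how keys are generated and garbage-collected is out of scope here):

```go
// StateMachine remembers the result of each request by idempotency key, so
// a retried command returns the cached result instead of executing twice.
type StateMachine struct {
	state   map[string]string // the actual application state
	applied map[string][]byte // idempotency key -> cached result
}

// Apply runs a committed command exactly once per idempotency key.
func (m *StateMachine) Apply(key string, cmd []byte) []byte {
	if res, ok := m.applied[key]; ok {
		return res // duplicate request: reply without re-executing
	}
	res := m.execute(cmd)
	m.applied[key] = res
	return res
}

// execute interprets cmd against m.state; elided in this sketch.
func (m *StateMachine) execute(cmd []byte) []byte { return cmd }
```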
But how does Raft protect against stale reads?
Raft leaders check that they haven't been deposed by another leader by exchanging heartbeats with a majority of the cluster before responding to read-only requests
If it has not been deposed, the leader has the most up-to-date committed information, so the read cannot be stale
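A sketch of that pre-read leadership check, with the heartbeat transport passed in as a callback since the RPC layer is out of scope:

```go
// confirmLeadership is the pre-read check: the leader exchanges a round of
// heartbeats and serves the read only if a majority (counting itself) still
// acknowledges its term. sendHeartbeat is an assumed transport callback that
// reports whether the peer accepted an empty AppendEntries for this term.
func confirmLeadership(peers []int, term int, sendHeartbeat func(peer, term int) bool) bool {
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if sendHeartbeat(p, term) {
			acks++
		}
	}
	return 2*acks > len(peers)+1 // majority of the full cluster
}
```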