Raft Configuration Change with Single Log Entry
Article link: https://blog.openacid.com/algo/single-log-joint/
Preface
TL;DR
Standard Raft configuration changes use two log entries with multi-phase commits and careful state management. Can we complete a configuration change with just one log entry? We’ll introduce effective-config, prove its correctness, then discover why the simple approach isn’t so simple after all. The standard Joint Consensus method wins for good reasons.
What We’ll Cover
- How Raft’s Joint Consensus works (the two-phase approach)
- The single-log-entry idea and its mechanics
- Why it’s theoretically correct
- Why it’s practically problematic (and the patches we’d need)
- Why we should stick with Joint Consensus
Introduction to Raft Joint Consensus: 2 Config Log Entries
Changing cluster membership in Raft is tricky. Switching from the old configuration {a,b,c} to a new one {x,y,z} in one step is dangerous.
Nodes can’t all switch configurations at the exact same moment. During the transition, some nodes (say a,b) might still be using C_old while others (x,y,z) have moved to C_new. If these two groups don’t overlap, meaning a quorum from C_old (like {a,b}) and a quorum from C_new (like {x,y}) share no common nodes, we could elect two leaders in the same term, violating Raft’s fundamental safety guarantee.
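To make the danger concrete, here is a minimal sketch that enumerates the majority quorums of the two example configurations and counts the pairs that share no node. The helper name majorities is made up for illustration and is not taken from any Raft library.

```python
from itertools import combinations

def majorities(members):
    """Return every majority subset (quorum) of a membership set."""
    members = sorted(members)
    need = len(members) // 2 + 1
    return [set(c) for k in range(need, len(members) + 1)
            for c in combinations(members, k)]

c_old = {"a", "b", "c"}
c_new = {"x", "y", "z"}

# Pairs of quorums, one per configuration, that have no node in common.
disjoint = [(q_old, q_new)
            for q_old in majorities(c_old)
            for q_new in majorities(c_new)
            if not (q_old & q_new)]

print(len(disjoint))  # 16: every pairing is disjoint, e.g. {a,b} and {x,y}
```

Because the two member sets are disjoint, no C_old quorum can ever intersect a C_new quorum, so nothing stops each side from electing its own leader.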
The Raft paper solves this with a two-phase protocol called Joint Consensus:
- Phase 1: Enter the Joint phase (C_old_new). When the leader receives a configuration change request, it writes a log entry containing C_old_new, a joint configuration that includes both old and new members. In this state, any decision (like committing a log entry) needs approval from a quorum of C_old and a quorum of C_new. The leader starts using C_old_new as soon as it writes this entry to its own log.
- Phase 2: Move to the new configuration (C_new). Once C_old_new commits, the leader writes a second log entry containing just C_new. From this point forward, the leader uses only C_new, and all subsequent log entries need only commit on a C_new quorum. When this second entry commits, the configuration change is complete.
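The intermediate joint phase ensures that any two quorums that can be in use at the same time must overlap: C_old with C_old_new, or C_old_new with C_new. A C_old quorum and a C_new quorum are never both able to decide on their own, which prevents split brain. This requires two log entries for each configuration change.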
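The joint commit rule can be sketched in a few lines. This is an illustrative model only: is_majority and is_joint_quorum are hypothetical helper names, and acks stands for the set of nodes that have acknowledged an entry.

```python
def is_majority(acks, members):
    """True when acks contains more than half of members."""
    return len(acks & members) > len(members) // 2

def is_joint_quorum(acks, c_old, c_new):
    """Under C_old_new, a decision needs a majority of C_old AND of C_new."""
    return is_majority(acks, c_old) and is_majority(acks, c_new)

c_old = {"a", "b", "c"}
c_new = {"x", "y", "z"}

print(is_joint_quorum({"a", "b", "x"}, c_old, c_new))       # False
print(is_joint_quorum({"a", "b", "x", "y"}, c_old, c_new))  # True
```

The first call fails because only one of the three new members has acknowledged. A majority of the old set alone is not enough while the joint configuration is in effect.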
Can We Do It With Just One Log Entry?
Can we do this safely with just one log entry?
We need a new concept: effective-config. This is the configuration the leader actually uses to determine if log entries are committed. It might not match any specific configuration stored in a log entry—it’s a runtime state that changes as the configuration change progresses.
Terminology
- effective-config: The runtime configuration the leader uses to determine if entries are committed
- Joint config: A configuration containing both old and new members, like C_old_new = [{a,b,c}, {x,y,z}]
- Uniform config: A configuration with just one set of members, like C_new = {x,y,z}
- Barrier entry: A marker log entry that signals the joint phase has safely ended
How It Works
- Starting point: The cluster is running with C_old = {a,b,c}, and that configuration has been committed. The effective-config is C_old.
- Propose the change: To change to C_new = {x,y,z}, the leader writes a single log entry entry-i containing just C_new.
- Enter joint mode immediately: The moment the leader appends entry-i to its own log (before it commits, before it replicates), the leader switches its effective-config to the joint configuration C_old_new = [{a,b,c}, {x,y,z}]. Now entry-i and all subsequent entries must commit on a quorum from both {a,b,c} and {x,y,z}.
- Normal operation continues: The cluster keeps processing requests. Every entry commits using the joint quorum rules.
- Exit joint mode: Once entry-i commits under C_old_new, the leader switches effective-config to C_new = {x,y,z}. All subsequent entries need only a C_new quorum.

With one log entry, the system transitions through three states: C_old → C_old_new → C_new.
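The steps above can be sketched as a tiny leader-side state machine. This is a sketch under stated assumptions, not an implementation from any Raft library: the Leader class, its fields, and the methods propose_config and on_acks are all hypothetical, and acks is assumed to be the full set of nodes (leader included) that hold the entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Leader:
    # effective-config: one member set = uniform, two member sets = joint.
    effective: List[Set[str]]
    log: List[dict] = field(default_factory=list)
    config_index: Optional[int] = None  # index of the uncommitted entry-i

    def propose_config(self, c_new):
        # Append entry-i holding only C_new, then go joint immediately.
        self.log.append({"config": set(c_new)})
        self.config_index = len(self.log) - 1
        self.effective = [self.effective[-1], set(c_new)]

    def is_committed(self, acks):
        # A majority is required in every member set of effective-config.
        return all(len(acks & m) > len(m) // 2 for m in self.effective)

    def on_acks(self, index, acks):
        committed = self.is_committed(acks)
        if committed and index == self.config_index:
            # entry-i committed under C_old_new: leave joint mode.
            self.effective = [self.effective[-1]]
            self.config_index = None
        return committed

leader = Leader(effective=[{"a", "b", "c"}])
leader.propose_config({"x", "y", "z"})
print(len(leader.effective))                  # 2: joint mode
leader.on_acks(0, {"a", "b", "x"})            # not enough of C_new yet
leader.on_acks(0, {"a", "b", "x", "y"})       # commits under C_old_new
print(len(leader.effective))                  # 1: uniform mode, C_new
```

Note that the switch back to a uniform config happens purely in memory, which is exactly the property the later sections examine.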
Correctness Proof
We need to show that we can’t elect two leaders—neither during the configuration change nor afterward.
Assume leader t is doing the configuration change (writing entry-i). Later, some candidate u tries to get elected in term u > t. We prove t and u can’t both be leaders.
Analyzing candidate u’s election
Candidate u either has entry-i in its log or it doesn’t.
- Case 1: u has entry-i
  Then u’s effective-config includes {x,y,z}. Leader t’s effective-config is either C_old_new = [{a,b,c}, {x,y,z}] (still in joint mode) or C_new = {x,y,z} (finished). Either way, it includes {x,y,z}.
  Since u needs a quorum from {x,y,z} to get elected, and t needs a quorum from {x,y,z} to stay leader, these quorums must overlap. No split brain.
- Case 2: u doesn’t have entry-i
  Then u’s effective-config is C_old = {a,b,c}. Now we consider where leader t is:
  - If t’s effective-config is C_old_new, then t needs a quorum from {a,b,c} and u needs a quorum from {a,b,c}. These must overlap. No split brain.
  - If t’s effective-config is C_new = {x,y,z}, that means entry-i committed under C_old_new. So entry-i must exist on a quorum of {a,b,c}. Those nodes have logs at least as long as index i.
    But u doesn’t have entry-i, so its log is shorter than i. When u requests votes from nodes in {a,b,c}, they’ll reject it because their logs are more up-to-date. The election fails.
In every case, we can’t have both t and u as leaders. The algorithm is safe.
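The two overlap arguments in the proof (Case 1 and the first sub-case of Case 2) can be checked by brute force for the example memberships. The sketch below is only a sanity check for {a,b,c} and {x,y,z}, not a general proof, and the helper name majorities is hypothetical. The remaining sub-case does not rely on quorum overlap at all. It relies on the vote-time log comparison, so it is not covered here.

```python
from itertools import combinations

def majorities(members):
    """Return every majority subset (quorum) of a membership set."""
    members = sorted(members)
    need = len(members) // 2 + 1
    return [set(c) for k in range(need, len(members) + 1)
            for c in combinations(members, k)]

c_old, c_new = {"a", "b", "c"}, {"x", "y", "z"}

# A joint quorum is the union of one C_old majority and one C_new majority.
joint = [q_old | q_new
         for q_old in majorities(c_old)
         for q_new in majorities(c_new)]

# Case 1: t and u each use C_old_new or C_new, so both need a C_new majority.
with_new = majorities(c_new) + joint
case1 = all(qt & qu for qt in with_new for qu in with_new)

# Case 2, first sub-case: t uses C_old_new while u uses C_old.
case2a = all(qt & qu for qt in joint for qu in majorities(c_old))

print(case1, case2a)  # True True
```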
Although the algorithm is theoretically correct, it runs into problems in a real implementation:
Problem 1: The Memory-Only Transition
When we move from C_old_new to C_new, we only change the in-memory effective-config. Nothing hits disk. This creates trouble.
Nodes from C_old can still initiate elections and compete with C_new nodes, because C_old logs are as long as C_new logs. Even after the configuration change completes, C_old nodes can steal leadership from C_new nodes. The root cause is that the state change is never recorded in persistent storage. This is problematic because nodes intended for removal can still become leaders.
Compare this to standard Joint Consensus: it writes a second log entry containing C_new. That entry acts as a barrier. Nodes from C_old have shorter logs and lose elections. The single-entry approach has no such barrier: the transition from C_old_new to C_new is invisible on disk.
Look at the diagram below. The cluster transitions from C_old_new to C_new, but no logs change. Leadership moves to node x in {x,y,z}. But nodes from C_old can still start elections and steal leadership from x.
Patch-1: After entering C_new, immediately append a no-op entry. This lengthens the logs of C_new nodes, blocking elections from C_old nodes.
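Patch-1 works through Raft's standard vote rule: a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the last entry's term first and the last index second. A minimal sketch follows. The function name grants_vote and the concrete terms and indexes are illustrative assumptions.

```python
def grants_vote(voter_last, candidate_last):
    """Raft's up-to-date rule on (last_term, last_index) pairs.

    Tuple comparison is lexicographic: higher last term wins, and with
    equal terms the longer log wins.
    """
    return candidate_last >= voter_last

# Before Patch-1: b (from C_old) and x (from C_new) both end at entry-3.
b_last = (1, 3)
x_last = (1, 3)
print(grants_vote(x_last, b_last))  # True: x would vote for b

# Patch-1: a no-op at index 4 is replicated within C_new only.
x_last = (1, 4)
print(grants_vote(x_last, b_last))  # False: b's log is now behind
```

Once a majority of C_new holds the no-op, a C_old candidate that lacks it can no longer collect a C_new majority.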
Problem 2: The Restart Ambiguity
When a node restarts, it can’t tell if the cluster is in joint mode or has finished the change.
- The restarting node reads its log. It sees entry-i containing C_old and entry-j containing C_new.
- We know entry-i is committed (Raft requires it before starting a new change).
- But what about entry-j? The node can’t tell just from its local log:
  - If entry-j isn’t committed yet, the cluster is in joint mode with effective-config C_old_new
  - If entry-j is committed, the cluster is using C_new

Without talking to other nodes, there’s no way to know.
In the diagram above, even if entry-3 has committed, the restarting nodes b, c, x, y can’t tell whether the cluster is in joint mode or using the new configuration. (Nodes a and z never received entry-3 and are still using {a,b,c}.)
Patch-2: Always start in joint mode after a restart.
- When a node starts up, it sets effective-config to the joint configuration formed from the last two config entries in its log
- It uses this joint config for elections and normal operation
- Only after confirming that the latest config entry has committed under the joint configuration can it switch to the new configuration
Example: A node sees configs {a,b,c} and {u,v,w} in its log. It starts with effective-config [{a,b,c}, {u,v,w}]. To become leader, it needs quorums from both groups. Only after it confirms the new config committed under the joint rules can it switch to just {u,v,w}.
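Patch-2's startup rule can be sketched as a single function over the local log. The log layout (dicts with a "kind" field and a "members" set) and the name restart_effective_config are assumptions made for this illustration.

```python
def restart_effective_config(log):
    """Patch-2: build effective-config from the last two config entries.

    Returns a list of member sets: two sets mean joint mode, one set means
    the log holds only a single config entry.
    """
    configs = [e["members"] for e in log if e.get("kind") == "config"]
    return configs[-2:]

log = [
    {"kind": "config", "members": {"a", "b", "c"}},
    {"kind": "normal"},
    {"kind": "config", "members": {"u", "v", "w"}},
]

effective = restart_effective_config(log)
print(len(effective))  # 2: the node starts in joint mode
```

The function deliberately ignores whether the latest config entry has committed, because that is exactly what the restarting node cannot know from its log alone.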
Problem 3: Calling Home to Dead Nodes
Patch-2 solves the ambiguity problem but creates a worse one: nodes might try to contact old cluster members that no longer exist, making elections impossible.
Example:
- The cluster changes from {a,b,c} to {x,y,z}
- The config entry commits under C_old_new
- The cluster transitions to C_new = {x,y,z}
- Nodes a, b, c are no longer members. They get shut down, their data gets wiped, and they’re gone
- Then something happens and all remaining nodes restart
- Node x restarts and follows Patch-2: it sees configs {a,b,c} and {x,y,z} in its log, so it sets effective-config to [{a,b,c}, {x,y,z}]
- Node x tries to run an election, but b and c don’t exist anymore! It can’t get a quorum from both groups. The election fails. The cluster is stuck.
This is state regression. The transition from C_old_new to C_new wasn’t persisted, so after a restart, the system rolls back to needing C_old.
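The stuck election can be reproduced with a best-case check: assume every node that is still alive votes yes, and ask whether that is enough under a given effective-config. The function name can_elect is hypothetical.

```python
def can_elect(alive, effective):
    """Best case: all live nodes vote yes. Is every required majority met?"""
    return all(len(alive & m) > len(m) // 2 for m in effective)

alive = {"x", "y", "z"}  # a, b, c have been decommissioned

joint = [{"a", "b", "c"}, {"x", "y", "z"}]
uniform = [{"x", "y", "z"}]

print(can_elect(alive, joint))    # False: no majority of {a,b,c} exists
print(can_elect(alive, uniform))  # True
```

Even with all of C_new healthy, the regressed joint configuration makes leadership unreachable.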
Adding a Barrier to Prevent Regression
Restarting nodes need to know for certain that the joint phase has ended: proof that it’s safe to use C_new without calling back to C_old.
Patch-3: Add a barrier entry
After entry-j (containing C_new) commits under C_old_new, append a special barrier entry to mark that entry-j has committed.
Important: The barrier must come after entry-j commits. Otherwise it can’t serve as proof of the commit.
When a restarting node sees this barrier, it knows the joint phase ended successfully. It can safely use C_new for elections without trying to contact old nodes that might not exist anymore.
In the diagram below, when entry-3 commits under C_old_new, we add barrier entry-4.
Now when all nodes restart, there’s no regression. Nodes x and y see the barrier, so they use C_new = {x,y,z} directly. Even though b and c are gone, x or y can still get elected.
Alternative: Persisting commit-index
Instead of a barrier entry, we could persist the commit-index—an idea from Ma Jianjiang.
The rule: joint consensus ends when commit-index reaches a quorum of C_new. To make this work, we’d need to persist commit-index (standard Raft doesn’t require this).
When a node restarts, it checks: if the persisted commit-index covers the config change entry, it knows C_old_new finished and can safely use C_new. No need to contact old nodes.
But this still has Problem 1: C_old and C_new nodes competing for leadership. Here’s why: C_new nodes don’t have extra log entries, and committing commit-index to just C_new doesn’t guarantee C_old nodes see it. This is the classic distributed systems dilemma of at-least-once vs at-most-once delivery:
- At-least-once (commit on C_old_new): commit-index might succeed, then C_old nodes get decommissioned, then we can’t commit it again to reach them. We’re stuck.
- At-most-once (commit on C_new only): commit-index reaches C_new but might not reach C_old. Those nodes don’t know the cluster moved on, so they keep trying to run elections.

Either way, we can still end up with C_old and C_new nodes competing for leadership.
So here’s what the patched single-log approach looks like:
- Start with effective-config = C_old = {a,b,c}
- Leader writes entry-j containing C_new = {x,y,z} and immediately switches effective-config to C_old_new = [{a,b,c}, {x,y,z}]
- All entries from index j onward replicate and commit under C_old_new
- Critical step: Once entry-j commits under C_old_new, the leader writes a special barrier entry. This entry has no configuration data; it just marks “the joint phase is done.” The leader can switch to effective-config = C_new and use C_new to replicate the barrier.
- When the barrier entry commits, the configuration change is complete
Restart behavior:
When a node restarts, it reads its log. It sees entry-i (C_old) and entry-j (C_new). It checks: is there a barrier after entry-j?
- Barrier present: Joint phase ended. Set effective-config = C_new. No need to contact old nodes.
- No barrier: Joint phase might still be active. Set effective-config = C_old_new.
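The restart rule with a barrier can be sketched as follows. The log layout (dicts with a "kind" of "config", "barrier", or "normal") and the function name effective_config_on_restart are assumptions made for this illustration, not the format of any real implementation.

```python
def effective_config_on_restart(log):
    """Pick effective-config from the local log, using the barrier as proof.

    Returns a list of member sets: one set = uniform, two sets = joint.
    """
    config_idx = [i for i, e in enumerate(log) if e.get("kind") == "config"]
    last = config_idx[-1]
    has_barrier = any(e.get("kind") == "barrier" for e in log[last + 1:])

    if has_barrier or len(config_idx) == 1:
        # Barrier present (or only one config ever existed): use it alone.
        return [log[last]["members"]]
    # No barrier: the joint phase might still be active.
    return [log[config_idx[-2]]["members"], log[last]["members"]]

c_old, c_new = {"a", "b", "c"}, {"x", "y", "z"}
log = [
    {"kind": "config", "members": c_old},   # entry-i
    {"kind": "normal"},
    {"kind": "config", "members": c_new},   # entry-j
]

print(len(effective_config_on_restart(log)))                         # 2
print(len(effective_config_on_restart(log + [{"kind": "barrier"}]))) # 1
```

A barrier that appears before entry-j is ignored because only the entries after the latest config entry are scanned, matching the requirement that the barrier come after entry-j commits.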
Patch-3 adds a second log entry. We’re no longer doing “one log entry” configuration changes. We need “one config entry + one barrier entry.”
Conclusion
Configuration changes must pass through three states: C_old → C_old_new → C_new. One log entry gives us one bit of persistent information: C_old or C_new. That’s only two states. We can’t represent three states with two values.
To safely handle all three states, we need at least two log entries. That gives us two bits of information and up to four possible states, which is enough to encode the three states we actually need.
The “single-log-entry” approach, after all the patches, ends up needing two entries anyway—one for the configuration and one for the barrier. And it’s more complex than standard Joint Consensus, with trickier edge cases around restarts and state transitions.
Stick with Joint Consensus. It’s cleaner, simpler, and solves the problem directly without patches.
References
- Diego Ongaro & John Ousterhout. In Search of an Understandable Consensus Algorithm (Raft paper): https://raft.github.io/raft.pdf
- OpenRaft (Rust): https://github.com/databendlabs/openraft
- etcd/raft source code: https://github.com/etcd-io/raft
- HashiCorp Raft implementation: https://github.com/hashicorp/raft