<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.openacid.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.openacid.com/" rel="alternate" type="text/html" /><updated>2026-05-13T02:26:54+00:00</updated><id>https://blog.openacid.com/feed.xml</id><title type="html">OpenACID Blog</title><subtitle>分布式研究小院.</subtitle><author><name>OpenACID</name></author><entry><title type="html">Do Not Read: Raf Is a Useless and Valueless Failed Experiment</title><link href="https://blog.openacid.com/algo/raf-without-term/" rel="alternate" type="text/html" title="Do Not Read: Raf Is a Useless and Valueless Failed Experiment" /><published>2026-05-11T00:00:00+00:00</published><updated>2026-05-11T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/raf-without-term</id><content type="html" xml:base="https://blog.openacid.com/algo/raf-without-term/"><![CDATA[<p><img src="/post-res/raf-without-term/a9a9ca7fa5a1b329-raf-banner-small.png" alt="" /></p>

<blockquote>
  <p>Summary: <code class="language-plaintext highlighter-rouge">raf: Raft without [T]erm</code> is an experimental Raft variant. It does not persist <code class="language-plaintext highlighter-rouge">currentTerm</code> as a separate piece of state. Instead, a candidate reserves a log index when it starts an election, and that index becomes the leader term. This does not remove Raft’s logical time model. It only changes where the term is derived from in storage.</p>
</blockquote>

<blockquote>
  <p>Declaration: This approach is effectively the same as saving the term in a separate extra slot of the terms array. So it still stores the term and adds no real value. Please do not read this as a useful design; it is only a failed personal experiment.</p>
</blockquote>

<blockquote>
  <p>Note: The idea in this article came from Zhang Yanpo. The code was implemented by Zhang Yanpo by hand. This article was drafted and refined with Codex.</p>
</blockquote>

<p>Repository: <a href="https://github.com/drmingdrmer/raf/tree/v0.1.1">raf</a> (<code class="language-plaintext highlighter-rouge">v0.1.1</code>).</p>

<h2 id="introduction">Introduction</h2>

<p>I have seen a few interesting proposals that try to remove the term from Raft. The idea is appealing: if a consensus protocol can maintain one less piece of persistent state, perhaps both the model and the implementation become simpler. But Raft’s term is not just a counter. It represents logical time, lets nodes distinguish old leaders from new ones, and participates in Raft’s commit safety rule. So the real question is not whether we can simply delete the term. The useful question is whether we can express it in a different way.</p>

<p>The name <code class="language-plaintext highlighter-rouge">raf</code> comes from <code class="language-plaintext highlighter-rouge">Raft without [T]erm</code>. Here, “without term” does not mean the protocol has no term at all. It means <code class="language-plaintext highlighter-rouge">currentTerm</code> is no longer persisted as an independent field. The project turns this idea into a small but serious implementation: avoid storing <code class="language-plaintext highlighter-rouge">currentTerm</code> separately, while still giving every part of Raft that needs a term a reliable source of logical time.</p>

<p>This article explains that core idea. The term still exists as a concept. Logs are still compared by <code class="language-plaintext highlighter-rouge">(term, index)</code>. What changes is the source of the term: it no longer comes from a separately incremented persistent counter; it comes from a log index reserved by an election. The goal of this implementation is not to prove that it is a drop-in replacement for standard Raft in every engineering detail. The goal is to see whether this storage representation preserves the most important safety intuition behind Raft.</p>

<p>We will walk through the storage model, election, replication, commit, and the three-node example in the repository. I assume the reader is already familiar with the basic Raft flow: leader election, AppendEntries, quorum commit, and log ids in the form <code class="language-plaintext highlighter-rouge">(term, index)</code>.</p>

<h2 id="why-term-cannot-disappear">Why Term Cannot Disappear</h2>

<p>In a consensus algorithm, the log records events that have been chosen or are being proposed. The term tells us which logical time those events belong to.</p>

<p>Standard Raft uses the term for several jobs:</p>

<ul>
  <li>Leader election advances the term before choosing a leader. The term gives leaders an ordering.</li>
  <li>Because of that, log freshness is compared by term first, then by index.</li>
  <li>A leader may directly commit only entries from its own term.</li>
</ul>

<p>This role is similar to the ballot number in Paxos. It lets the system decide which history is newer and which candidate is eligible to become leader, even when a node does not know every other node’s complete log.</p>

<p>In short: <strong>a log index is a local event position; a term is the logical time used to compare histories across nodes.</strong></p>
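<p>As a minimal sketch of "term first, then index": Rust compares tuples lexicographically, so a log id written as <code class="language-plaintext highlighter-rouge">(term, index)</code> already has the ordering Raft needs. The type aliases and the helper name <code class="language-plaintext highlighter-rouge">is_up_to_date</code> are assumptions made for illustration, not names from the repository.</p>

```rust
// Assumption for this sketch: terms and indexes are plain integers.
type Term = u64;
type LogIndex = u64;
type LogId = (Term, LogIndex);

/// Returns true if the candidate's last log id is at least as up to date as
/// the voter's. Tuple comparison checks the term first, then the index.
fn is_up_to_date(candidate: LogId, voter: LogId) -> bool {
    candidate >= voter
}
```

<p>With this helper, <code class="language-plaintext highlighter-rouge">(3, 1)</code> counts as newer than <code class="language-plaintext highlighter-rouge">(2, 9)</code>: a higher term wins even when the index is smaller.</p>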

<p><code class="language-plaintext highlighter-rouge">raf</code> keeps the concept of term, but removes its separate storage.</p>

<h2 id="core-idea">Core Idea</h2>

<p>Standard Raft usually persists state shaped roughly like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">StandardRaftStorage</span> <span class="p">{</span>
    <span class="n">current_term</span><span class="p">:</span> <span class="n">Term</span><span class="p">,</span>
    <span class="n">voted_for</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">NodeId</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="k">log</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">LogEntry</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">LogEntry</span> <span class="p">{</span>
    <span class="n">term</span><span class="p">:</span> <span class="n">Term</span><span class="p">,</span>
    <span class="n">cmd</span><span class="p">:</span> <span class="n">Cmd</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">raf</code> represents the persistent state as two <code class="language-plaintext highlighter-rouge">Vec</code>s aligned by log index:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">RafStorage</span> <span class="p">{</span>
    <span class="n">terms</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Term</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">cmds</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Cmd</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">terms[i]</code> is the leader term of the log entry at index <code class="language-plaintext highlighter-rouge">i</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">cmds[i]</code> is the application command of the log entry at index <code class="language-plaintext highlighter-rouge">i</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">log_id(i)</code> is still <code class="language-plaintext highlighter-rouge">(terms[i], i)</code>.</li>
</ul>

<p>The following is one possible storage state. <code class="language-plaintext highlighter-rouge">ø</code> means empty command. <code class="language-plaintext highlighter-rouge">cmds</code> only reaches index <code class="language-plaintext highlighter-rouge">7</code>, so indexes <code class="language-plaintext highlighter-rouge">8</code> and <code class="language-plaintext highlighter-rouge">9</code> are not complete log entries yet.</p>

<!-- [ASCII source](assets/storage-layout.txt) -->

<p><img src="/post-res/raf-without-term/10e4df1ad2741f72-storage-layout.png" alt="Storage layout" /></p>

<p>Index by index, this state means:</p>

<ul>
  <li>Index <code class="language-plaintext highlighter-rouge">0</code>: the fixed default entry. <code class="language-plaintext highlighter-rouge">terms[0] = 0</code>, and <code class="language-plaintext highlighter-rouge">cmds[0]</code> is an empty command.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">1</code>: a successful election for term <code class="language-plaintext highlighter-rouge">1</code>. The leader reserved index <code class="language-plaintext highlighter-rouge">1</code> when it was elected and wrote its first empty command there.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">2</code>: a successful election for term <code class="language-plaintext highlighter-rouge">2</code>. The new leader reserved index <code class="language-plaintext highlighter-rouge">2</code>; its first log entry is also an empty command.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">3</code>: a user log entry <code class="language-plaintext highlighter-rouge">C3</code> written by the leader of term <code class="language-plaintext highlighter-rouge">2</code>, so <code class="language-plaintext highlighter-rouge">terms[3] = 2</code> and <code class="language-plaintext highlighter-rouge">cmds[3] = C3</code>.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">4</code>: this position used to be an election attempt for term <code class="language-plaintext highlighter-rouge">4</code>, but it did not produce an established leader. Later, when the leader of term <code class="language-plaintext highlighter-rouge">6</code> was established, this position was filled as an empty command owned by term <code class="language-plaintext highlighter-rouge">6</code>, so now <code class="language-plaintext highlighter-rouge">terms[4] = 6</code>.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">5</code>: this position used to be another failed election attempt for term <code class="language-plaintext highlighter-rouge">5</code>. It was also later filled as an empty command owned by term <code class="language-plaintext highlighter-rouge">6</code>, so now <code class="language-plaintext highlighter-rouge">terms[5] = 6</code>.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">6</code>: a successful election for term <code class="language-plaintext highlighter-rouge">6</code>. The leader reserved index <code class="language-plaintext highlighter-rouge">6</code> when it was elected, and wrote its first empty command there.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">7</code>: a user log entry <code class="language-plaintext highlighter-rouge">C7</code> written by the leader of term <code class="language-plaintext highlighter-rouge">6</code>, so <code class="language-plaintext highlighter-rouge">terms[7] = 6</code> and <code class="language-plaintext highlighter-rouge">cmds[7] = C7</code>.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">8</code>: a new election attempt for term <code class="language-plaintext highlighter-rouge">8</code>. So far we have only seen <code class="language-plaintext highlighter-rouge">terms[8] = 8</code>; there is no command for this index yet, so it is not a complete log entry.</li>
  <li>Index <code class="language-plaintext highlighter-rouge">9</code>: another election attempt for term <code class="language-plaintext highlighter-rouge">9</code>. Like index <code class="language-plaintext highlighter-rouge">8</code>, it currently has only a term record and no command. From this state alone, we cannot tell whether it will eventually become a leader.</li>
</ul>

<p><em>Standard Raft persists a separate <code class="language-plaintext highlighter-rouge">current_term</code> and an array of <code class="language-plaintext highlighter-rouge">(term, command)</code> log entries. <code class="language-plaintext highlighter-rouge">raf</code> is similar, but splits the term and command at each index into two aligned <code class="language-plaintext highlighter-rouge">Vec</code>s.</em></p>

<h2 id="storage-model">Storage Model</h2>

<p>Index <code class="language-plaintext highlighter-rouge">0</code> is a fixed default entry. This keeps the types simple and avoids using <code class="language-plaintext highlighter-rouge">Option</code> for the initial position:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terms[0] = 0
cmds[0]  = empty
</code></pre></div></div>

<p>When both <code class="language-plaintext highlighter-rouge">Vec</code>s have a value at the same index, that index is a complete log entry. Otherwise, the index has a term but no command. That state represents an election in progress: the term has been observed, but no log entry has been written at that position yet.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log[index] = (terms[index], cmds[index])
log_id     = (terms[index], index)
</code></pre></div></div>

<p><em><code class="language-plaintext highlighter-rouge">terms</code> and <code class="language-plaintext highlighter-rouge">cmds</code> share the same log index. During election, <code class="language-plaintext highlighter-rouge">terms</code> may be ahead of <code class="language-plaintext highlighter-rouge">cmds</code>.</em></p>

<p>During an election, <code class="language-plaintext highlighter-rouge">terms</code> can temporarily be longer than <code class="language-plaintext highlighter-rouge">cmds</code>. A candidate first reserves an index as its term. Multiple failed elections may reserve multiple indexes. Only after some candidate becomes an established leader are the positions with missing commands filled with empty commands.</p>

<p>So this implementation maintains a few basic facts:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">cmds</code> should never be longer than <code class="language-plaintext highlighter-rouge">terms</code>.</li>
  <li>When an election first reserves a term slot, <code class="language-plaintext highlighter-rouge">terms[term] = term</code>.</li>
  <li>When an established leader fills positions that still miss commands, those positions are rewritten to the current <code class="language-plaintext highlighter-rouge">leader.term</code>.</li>
</ul>

<p>Therefore <code class="language-plaintext highlighter-rouge">terms[i] &lt;= i</code> is not a protocol invariant. A backfilled empty entry may have <code class="language-plaintext highlighter-rouge">terms[i] &gt; i</code>. That is not a problem by itself; standard Raft terms can also be greater than log indexes. The important property is different: every complete log entry must carry the term of an established leader, except for the fixed default entry at index <code class="language-plaintext highlighter-rouge">0</code>.</p>
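<p>The storage model above can be sketched as a small type with a few helpers. This is an illustration under stated assumptions, not the repository's code: <code class="language-plaintext highlighter-rouge">Term</code> is taken to be a plain integer, <code class="language-plaintext highlighter-rouge">None</code> stands for the empty command <code class="language-plaintext highlighter-rouge">ø</code>, and the method names are hypothetical.</p>

```rust
// Assumptions for this sketch: a term is a plain integer, and `None` stands
// for the empty command ø. Neither type is taken from the raf repository.
type Term = usize;
type Cmd = Option<String>;

struct RafStorage {
    terms: Vec<Term>,
    cmds: Vec<Cmd>,
}

impl RafStorage {
    /// Starts with the fixed default entry at index 0.
    fn new() -> Self {
        RafStorage { terms: vec![0], cmds: vec![None] }
    }

    /// An index is a complete log entry only if both vectors reach it.
    fn is_complete(&self, index: usize) -> bool {
        index < self.terms.len() && index < self.cmds.len()
    }

    /// Log id `(terms[i], i)` of a complete entry, or `None` for a reserved
    /// term slot that has no command yet.
    fn log_id(&self, index: usize) -> Option<(Term, usize)> {
        if self.is_complete(index) {
            Some((self.terms[index], index))
        } else {
            None
        }
    }

    /// Basic fact from the text: `cmds` is never longer than `terms`.
    fn cmds_within_terms(&self) -> bool {
        self.cmds.len() <= self.terms.len()
    }
}
```

<p>Loaded with the example state from the previous section, <code class="language-plaintext highlighter-rouge">log_id(4)</code> yields <code class="language-plaintext highlighter-rouge">(6, 4)</code>, a complete entry whose term is greater than its index, while <code class="language-plaintext highlighter-rouge">log_id(8)</code> yields nothing because index <code class="language-plaintext highlighter-rouge">8</code> is only a reserved term slot.</p>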

<h2 id="why-split">Why Split <code class="language-plaintext highlighter-rouge">terms</code> and <code class="language-plaintext highlighter-rouge">cmds</code></h2>

<p>This implementation separates leader terms and application commands into two <code class="language-plaintext highlighter-rouge">Vec</code>s. Raft’s protocol semantics do not require this layout. It is mainly a storage design choice that lets the two streams be optimized independently.</p>

<p>For example, a leader may write many consecutive log entries during one term, and all of those entries have the same term. A storage engine could compress long runs of identical terms into a compact representation, while storing commands according to application needs. Once the two streams are separated, term compression, command persistence, and payload encoding can evolve independently.</p>
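<p>One way such term compression could look is plain run-length encoding. The sketch below is only an illustration of the idea; the function names are hypothetical and nothing here is claimed to exist in the repository.</p>

```rust
// Assumption for this sketch: a term is a plain integer.
type Term = usize;

/// Compresses a term sequence into `(term, run_length)` pairs.
fn compress_terms(terms: &[Term]) -> Vec<(Term, usize)> {
    let mut runs: Vec<(Term, usize)> = Vec::new();
    for &term in terms {
        if let Some((last, count)) = runs.last_mut() {
            if *last == term {
                // Same term as the previous index: extend the current run.
                *count += 1;
                continue;
            }
        }
        runs.push((term, 1));
    }
    runs
}

/// Looks up the term at a log index in the compressed form.
fn term_at(runs: &[(Term, usize)], index: usize) -> Option<Term> {
    let mut remaining = index;
    for &(term, count) in runs {
        if remaining < count {
            return Some(term);
        }
        remaining -= count;
    }
    None
}
```

<p>For the complete entries of the earlier example, <code class="language-plaintext highlighter-rouge">[0, 1, 2, 2, 6, 6, 6, 6]</code>, this produces four runs, and the run for term <code class="language-plaintext highlighter-rouge">6</code> covers indexes <code class="language-plaintext highlighter-rouge">4</code> through <code class="language-plaintext highlighter-rouge">7</code>.</p>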

<p>This is the experimental value of the project. It expresses the relationship between “logical time” and “log position” in Raft more directly, then asks whether that expression can simplify persistent state.</p>

<h2 id="starting-an-election">Starting an Election</h2>

<p>When a candidate starts an election, it uses the next index of the <code class="language-plaintext highlighter-rouge">terms</code> array as the new term:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">term</span> <span class="o">=</span> <span class="n">terms</span><span class="nf">.len</span><span class="p">();</span>
<span class="n">terms</span><span class="nf">.push</span><span class="p">(</span><span class="n">term</span><span class="p">);</span>
</code></pre></div></div>

<p>In this code:</p>

<ul>
  <li>The candidate declares that it wants to use <code class="language-plaintext highlighter-rouge">term</code> as its leader term.</li>
  <li>The local persistent state records that this index has been reserved by an election.</li>
</ul>

<p>The candidate then sends <code class="language-plaintext highlighter-rouge">RequestVote</code>. This part is the same as standard Raft. The request carries two important pieces of information:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">term</code>: the leader term the candidate wants to use.</li>
  <li><code class="language-plaintext highlighter-rouge">last_log_id</code>: the <code class="language-plaintext highlighter-rouge">(term, index)</code> of the candidate’s last complete log entry.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">last_log_id</code> is computed from the last index in <code class="language-plaintext highlighter-rouge">cmds</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">last_log_index</span> <span class="o">=</span> <span class="n">cmds</span><span class="nf">.len</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">let</span> <span class="n">last_log_id</span> <span class="o">=</span> <span class="p">(</span><span class="n">terms</span><span class="p">[</span><span class="n">last_log_index</span><span class="p">],</span> <span class="n">last_log_index</span><span class="p">);</span>
</code></pre></div></div>

<p>The meaning of <code class="language-plaintext highlighter-rouge">last_log_id</code> is the same as in standard Raft: the voter uses it to check whether the candidate’s log is at least as up to date as its own. Notice the separation here. The new election term comes from <code class="language-plaintext highlighter-rouge">terms.len()</code>, while <code class="language-plaintext highlighter-rouge">last_log_id</code> comes from <code class="language-plaintext highlighter-rouge">cmds.len() - 1</code>. These can point to different indexes.</p>

<p>In the following example, the complete log currently reaches only index <code class="language-plaintext highlighter-rouge">3</code>, so <code class="language-plaintext highlighter-rouge">last_log_id = (2, 3)</code>. When the candidate starts a new election, it uses <code class="language-plaintext highlighter-rouge">terms.len() = 4</code> as the new term and first writes index <code class="language-plaintext highlighter-rouge">4</code> into <code class="language-plaintext highlighter-rouge">terms</code>. At this point, <code class="language-plaintext highlighter-rouge">cmds</code> still reaches only index <code class="language-plaintext highlighter-rouge">3</code>, because this candidate has not become an established leader yet.</p>

<!-- [ASCII source](assets/leader-election-term4.txt) -->

<p><img src="/post-res/raf-without-term/289a45a69ad1dd8d-leader-election-term4.png" alt="Leader election term=4" /></p>

<p>If the election for term <code class="language-plaintext highlighter-rouge">4</code> does not reach a quorum, it leaves only an observed term index in <code class="language-plaintext highlighter-rouge">terms</code>; it does not create a new command. The next election will use <code class="language-plaintext highlighter-rouge">terms.len()</code> again, which is now term <code class="language-plaintext highlighter-rouge">5</code>.</p>

<!-- [ASCII source](assets/leader-election-term5.txt) -->

<p><img src="/post-res/raf-without-term/3863b4a533d8cdf5-leader-election-term5.png" alt="Leader election retry term=5" /></p>

<p>At this point <code class="language-plaintext highlighter-rouge">last_log_id</code> is still <code class="language-plaintext highlighter-rouge">(2, 3)</code>, because <code class="language-plaintext highlighter-rouge">cmds</code> has not moved beyond index <code class="language-plaintext highlighter-rouge">3</code>. What changed is the candidate term: it advanced from <code class="language-plaintext highlighter-rouge">4</code> to <code class="language-plaintext highlighter-rouge">5</code>. Only after an election succeeds and a leader is established does the system rewrite the positions with missing commands to this leader’s term and fill <code class="language-plaintext highlighter-rouge">cmds</code> with empty commands.</p>
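<p>The two election attempts above can be replayed with a small helper. This is a sketch under assumptions: <code class="language-plaintext highlighter-rouge">start_election</code> is a hypothetical function name, <code class="language-plaintext highlighter-rouge">Term</code> is a plain integer, and <code class="language-plaintext highlighter-rouge">None</code> stands for the empty command.</p>

```rust
// Assumptions for this sketch: plain integer terms, `None` = empty command ø.
type Term = usize;
type Cmd = Option<String>;
type LogId = (Term, usize);

/// Reserves the next index in `terms` as the election term and returns the
/// two values a RequestVote would carry: the new term and `last_log_id`.
fn start_election(terms: &mut Vec<Term>, cmds: &[Cmd]) -> (Term, LogId) {
    // The new term is the next free index of `terms`.
    let term = terms.len();
    terms.push(term);

    // `cmds` always contains the default entry at index 0, so it is never empty.
    let last_log_index = cmds.len() - 1;
    let last_log_id = (terms[last_log_index], last_log_index);

    (term, last_log_id)
}
```

<p>Starting from <code class="language-plaintext highlighter-rouge">terms = [0, 1, 2, 2]</code> with four commands, the first call returns term <code class="language-plaintext highlighter-rouge">4</code> and the second returns term <code class="language-plaintext highlighter-rouge">5</code>. Both calls report <code class="language-plaintext highlighter-rouge">last_log_id = (2, 3)</code>, because <code class="language-plaintext highlighter-rouge">cmds</code> is never touched.</p>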

<h2 id="how-a-voter-handles-requestvote">How a Voter Handles RequestVote</h2>

<p>When a voter receives <code class="language-plaintext highlighter-rouge">RequestVote</code>, it checks three things:</p>

<ol>
  <li>Whether the requested term is greater than the last term observed locally.</li>
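  <li>Whether the requested term slot is still absent locally, that is, local <code class="language-plaintext highlighter-rouge">terms</code> has not yet reached that index.</li>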
  <li>Whether the candidate’s log, represented by <code class="language-plaintext highlighter-rouge">last_log_id</code>, is fresh enough. It must not be older than the voter’s own log.</li>
</ol>

<p>Aside from the term check, the rest of the logic is standard Raft. The subtle difference is that the term is no longer stored independently, so term freshness is expressed through <code class="language-plaintext highlighter-rouge">terms</code>.</p>

<p>The core condition looks like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">local_last_log_index</span> <span class="o">=</span> <span class="n">cmds</span><span class="nf">.len</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">let</span> <span class="n">local_last_log_id</span> <span class="o">=</span> <span class="p">(</span><span class="n">terms</span><span class="p">[</span><span class="n">local_last_log_index</span><span class="p">],</span> <span class="n">local_last_log_index</span><span class="p">);</span>
<span class="k">let</span> <span class="n">local_last_term</span> <span class="o">=</span> <span class="n">terms</span><span class="p">[</span><span class="n">terms</span><span class="nf">.len</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>

<span class="k">let</span> <span class="n">can_vote</span> <span class="o">=</span>
    <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="n">local_last_term</span>
        <span class="o">&amp;&amp;</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;=</span> <span class="n">terms</span><span class="nf">.len</span><span class="p">()</span>
        <span class="o">&amp;&amp;</span> <span class="n">req</span><span class="py">.last_log_id</span> <span class="o">&gt;=</span> <span class="n">local_last_log_id</span><span class="p">;</span>
</code></pre></div></div>

<p>The next diagram sends the same <code class="language-plaintext highlighter-rouge">RequestVote { term: 5, last_log_id: (2, 3) }</code> to voters with three different local states. The candidate’s own state is at the top. The three branches below it show how each voter makes its decision.</p>

<!-- [ASCII source](assets/request-vote.txt) -->

<p><img src="/post-res/raf-without-term/d310c7c93bee8cbb-request-vote.png" alt="RequestVote scenarios" /></p>

<p>The three outcomes are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">granted</code>: the voter’s <code class="language-plaintext highlighter-rouge">terms</code> reaches only index <code class="language-plaintext highlighter-rouge">3</code>, and <code class="language-plaintext highlighter-rouge">cmds</code> also reaches only index <code class="language-plaintext highlighter-rouge">3</code>. Therefore <code class="language-plaintext highlighter-rouge">req.term = 5</code> is greater than the last locally observed term, names a term slot that has not appeared locally, and <code class="language-plaintext highlighter-rouge">req.last_log_id = (2, 3)</code> is not behind the voter. The vote can be granted.</li>
  <li><code class="language-plaintext highlighter-rouge">rejected: term=7</code>: the voter has already observed a later term <code class="language-plaintext highlighter-rouge">7</code>. Since <code class="language-plaintext highlighter-rouge">req.term = 5</code> is not greater than the last locally observed term, the candidate’s requested term is stale from this voter’s perspective, so the vote is rejected.</li>
  <li><code class="language-plaintext highlighter-rouge">rejected: last log id = (4,4)</code>: the voter’s last complete log entry is <code class="language-plaintext highlighter-rouge">(4, 4)</code>, which is newer than the candidate’s <code class="language-plaintext highlighter-rouge">(2, 3)</code>. Even if the requested term could be recorded, the log freshness check still fails, so the vote is rejected.</li>
</ul>
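<p>The three outcomes can be reproduced by wrapping the vote condition in a function and running it against three voter states. The concrete arrays used here are assumptions chosen to match the descriptions above; the exact states in the diagram may differ. <code class="language-plaintext highlighter-rouge">VoteRequest</code> and the function signature are also hypothetical.</p>

```rust
// Assumption for this sketch: plain integer terms.
type Term = usize;
type LogId = (Term, usize);

struct VoteRequest {
    term: Term,
    last_log_id: LogId,
}

/// The vote condition from the text. `cmds_len` is the number of complete
/// log entries on the voter; it is at least 1 because of the default entry.
fn can_vote(terms: &[Term], cmds_len: usize, req: &VoteRequest) -> bool {
    let local_last_log_index = cmds_len - 1;
    let local_last_log_id = (terms[local_last_log_index], local_last_log_index);
    let local_last_term = terms[terms.len() - 1];

    req.term > local_last_term          // newer than any term observed locally
        && req.term >= terms.len()      // the term slot does not exist yet
        && req.last_log_id >= local_last_log_id // candidate log is fresh enough
}
```

<p>For <code class="language-plaintext highlighter-rouge">RequestVote { term: 5, last_log_id: (2, 3) }</code>, a voter with <code class="language-plaintext highlighter-rouge">terms = [0, 1, 2, 2]</code> and four commands grants the vote. A voter whose <code class="language-plaintext highlighter-rouge">terms</code> already ends in <code class="language-plaintext highlighter-rouge">7</code> fails the first check, and a voter with <code class="language-plaintext highlighter-rouge">terms = [0, 1, 2, 2, 4]</code> and five commands fails the log freshness check.</p>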

<p>If the request is valid, the voter records the term in local <code class="language-plaintext highlighter-rouge">terms</code>. If local <code class="language-plaintext highlighter-rouge">terms</code> does not yet reach index <code class="language-plaintext highlighter-rouge">req.term</code>, the voter fills the missing positions with default values, each equal to its own index, until local state contains index <code class="language-plaintext highlighter-rouge">req.term</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">can_vote</span> <span class="p">{</span>
    <span class="k">while</span> <span class="n">terms</span><span class="nf">.len</span><span class="p">()</span> <span class="o">&lt;=</span> <span class="n">req</span><span class="py">.term</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">index</span> <span class="o">=</span> <span class="n">terms</span><span class="nf">.len</span><span class="p">();</span>
        <span class="n">terms</span><span class="nf">.push</span><span class="p">(</span><span class="n">index</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These default items have term values equal to their own indexes. They mean the node has observed the corresponding term indexes. They do not mean those indexes are complete log entries, because the matching <code class="language-plaintext highlighter-rouge">cmds</code> may not exist yet. If a later leader fills those positions as complete empty entries, it rewrites their terms to its own <code class="language-plaintext highlighter-rouge">leader.term</code>. On the final iteration, <code class="language-plaintext highlighter-rouge">index == req.term</code>, so the voter has observed and accepted that term. Later, it will not accept an older term or a term index that already exists locally.</p>
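<p>A short worked example may help show why recording the term blocks later requests. The sketch below isolates the two term checks and the padding loop; the function names <code class="language-plaintext highlighter-rouge">term_is_fresh</code> and <code class="language-plaintext highlighter-rouge">record_granted_term</code> are hypothetical, and the log freshness check is left out on purpose.</p>

```rust
// Assumption for this sketch: plain integer terms.
type Term = usize;

/// The two term-related checks from the vote condition.
fn term_is_fresh(terms: &[Term], req_term: Term) -> bool {
    // `terms` always contains the default entry at index 0.
    let local_last_term = terms[terms.len() - 1];
    req_term > local_last_term && req_term >= terms.len()
}

/// Pads `terms` with default slots (value == index) up to and including
/// index `req_term`, as the voter does when it grants a vote.
fn record_granted_term(terms: &mut Vec<Term>, req_term: Term) {
    while terms.len() <= req_term {
        let index = terms.len();
        terms.push(index);
    }
}
```

<p>Starting from <code class="language-plaintext highlighter-rouge">terms = [0, 1, 2, 2]</code>, granting term <code class="language-plaintext highlighter-rouge">5</code> leaves <code class="language-plaintext highlighter-rouge">[0, 1, 2, 2, 4, 5]</code>. After that, a second request for term <code class="language-plaintext highlighter-rouge">5</code> or a late request for term <code class="language-plaintext highlighter-rouge">4</code> no longer passes the term checks, while term <code class="language-plaintext highlighter-rouge">6</code> still does.</p>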

<p>This replaces the role of standard Raft’s persisted <code class="language-plaintext highlighter-rouge">currentTerm</code>, but it is not fully equivalent to standard Raft’s <code class="language-plaintext highlighter-rouge">votedFor</code>. The current implementation does not persist “which candidate this term was granted to.” As a result, RequestVote retries and restart handling are more conservative. We will return to this trade-off in “Current Boundaries.”</p>

<p><em>The candidate chooses <code class="language-plaintext highlighter-rouge">terms.len()</code> as its term. Other nodes record that term in their local <code class="language-plaintext highlighter-rouge">terms</code> when they grant the vote.</em></p>

<p>After the candidate receives granted replies from a quorum, it becomes an established leader. It first rewrites the local <code class="language-plaintext highlighter-rouge">cmds.len()..terms.len()</code> range to its own <code class="language-plaintext highlighter-rouge">leader.term</code>, then appends empty commands so that <code class="language-plaintext highlighter-rouge">cmds.len()</code> catches up with <code class="language-plaintext highlighter-rouge">terms.len()</code>. The index reserved by the leader’s election becomes this leader’s first complete log entry.</p>

<h2 id="establishing-leader-state">Establishing Leader State</h2>

<p>After a candidate becomes an established leader, it keeps the core state of this leadership in memory. You can think of it as the following structure:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">LeaderState</span> <span class="p">{</span>
    <span class="n">term</span><span class="p">:</span> <span class="n">Term</span><span class="p">,</span>
    <span class="n">granted_nodes</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">NodeId</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">replications</span><span class="p">:</span> <span class="n">BTreeMap</span><span class="o">&lt;</span><span class="n">NodeId</span><span class="p">,</span> <span class="n">ReplicationState</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">ReplicationState</span> <span class="p">{</span>
    <span class="n">matched</span><span class="p">:</span> <span class="n">LogIndex</span><span class="p">,</span>
    <span class="n">end</span><span class="p">:</span> <span class="n">LogIndex</span><span class="p">,</span>
    <span class="n">inflight</span><span class="p">:</span> <span class="nb">bool</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The important fields are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">term</code>: the log index reserved when this leader was elected. All later log entries produced by this leader write this term into the corresponding positions of the <code class="language-plaintext highlighter-rouge">terms</code> array.</li>
  <li><code class="language-plaintext highlighter-rouge">granted_nodes</code>: the node ids that granted this leadership. This proves the leader was chosen by a quorum.</li>
  <li><code class="language-plaintext highlighter-rouge">replications</code>: the replication progress of each node from the leader’s point of view. <code class="language-plaintext highlighter-rouge">matched</code> is the largest index known to match on that node; <code class="language-plaintext highlighter-rouge">end</code> is the upper bound used for further probing or replication; <code class="language-plaintext highlighter-rouge">inflight</code> prevents sending multiple Append requests to the same node at the same time.</li>
</ul>

<blockquote>
  <p>The leader also has its own replication state.
This makes commit calculation uniform:
look at which nodes have <code class="language-plaintext highlighter-rouge">matched</code> covering an index, then check whether those nodes form a quorum.</p>
</blockquote>

<p>After the leader is established, every position that already exists in local <code class="language-plaintext highlighter-rouge">terms</code> but is still missing from <code class="language-plaintext highlighter-rouge">cmds</code> is taken over by the current leader: the term at that position is rewritten to <code class="language-plaintext highlighter-rouge">leader.term</code>, and the command is filled with an empty command. After that, every local index on the leader has a corresponding command, and new application writes can start at the next index.</p>

<!-- [ASCII source](assets/establish-leader.txt) -->

<p><img src="/post-res/raf-without-term/3494a2b8ffc3e936-establish-leader.png" alt="Establish leader" /></p>

<p>In this example, index <code class="language-plaintext highlighter-rouge">4</code> used to be a term slot left behind by a failed election, and term <code class="language-plaintext highlighter-rouge">5</code> is the index reserved by the current leader. When the candidate for term <code class="language-plaintext highlighter-rouge">5</code> becomes an established leader, indexes <code class="language-plaintext highlighter-rouge">4</code> and <code class="language-plaintext highlighter-rouge">5</code> are both rewritten as term <code class="language-plaintext highlighter-rouge">5</code> empty log entries. The <code class="language-plaintext highlighter-rouge">ø</code> at index <code class="language-plaintext highlighter-rouge">5</code> is the entry reserved by this leader’s election; the <code class="language-plaintext highlighter-rouge">ø</code> at index <code class="language-plaintext highlighter-rouge">4</code> is the backfilled entry that keeps the log prefix contiguous.</p>
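
<p>The takeover step can be sketched as follows. This is a minimal illustration, not the repository’s code: the <code class="language-plaintext highlighter-rouge">Cmd</code> enum and the <code class="language-plaintext highlighter-rouge">establish_leader</code> function are hypothetical names, and the sketch assumes <code class="language-plaintext highlighter-rouge">cmds</code> is never longer than <code class="language-plaintext highlighter-rouge">terms</code>.</p>

```rust
type Term = u64;

/// A command slot; `Empty` stands for the empty command written during takeover.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, Clone, PartialEq)]
enum Cmd {
    Empty,
    User(String),
}

/// Take over every slot that exists in `terms` but has no command yet.
/// Assumes `cmds.len() <= terms.len()`.
fn establish_leader(terms: &mut Vec<Term>, cmds: &mut Vec<Cmd>, leader_term: Term) {
    // Slots in `cmds.len()..terms.len()` are term-only slots left by elections.
    for index in cmds.len()..terms.len() {
        // Rewrite the term at that position to the leader's term ...
        terms[index] = leader_term;
        // ... and backfill an empty command so both arrays line up again.
        cmds.push(Cmd::Empty);
    }
}
```

<p>With the state from the example above (term slots at indexes 4 and 5, commands only up to index 3), calling <code class="language-plaintext highlighter-rouge">establish_leader(&amp;mut terms, &amp;mut cmds, 5)</code> leaves both arrays with six entries, and the last two terms equal to 5.</p>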

<h2 id="appending-a-log-entry">Appending a Log Entry</h2>

<p>After a node becomes leader, each new application write appends a log entry. The term does not change. It is still the term chosen when the leader was elected:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">terms</span><span class="nf">.push</span><span class="p">(</span><span class="n">leader</span><span class="py">.term</span><span class="p">);</span>
<span class="n">cmds</span><span class="nf">.push</span><span class="p">(</span><span class="n">user_cmd</span><span class="p">);</span>
</code></pre></div></div>

<p>So within one leader term, all later log entries have the same <code class="language-plaintext highlighter-rouge">terms[i]</code>. This matches standard Raft behavior. Only the source of the term is different.</p>

<h2 id="log-replication">Log Replication</h2>

<p>The leader sends Append requests to the other nodes. Conceptually, each request first names a previous log id that is already known to match, then carries a contiguous segment of log entries after that position:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Append</span> <span class="p">{</span>
    <span class="n">term</span><span class="p">:</span> <span class="n">Term</span><span class="p">,</span>
    <span class="n">prev_log_id</span><span class="p">:</span> <span class="n">LogId</span><span class="p">,</span>
    <span class="n">terms</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Term</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">cmds</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Cmd</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">prev_log_id</code> is the matching point. <code class="language-plaintext highlighter-rouge">terms</code> and <code class="language-plaintext highlighter-rouge">cmds</code> are the real entries after that point. They must have the same length, and their first item corresponds to <code class="language-plaintext highlighter-rouge">prev_log_id.index + 1</code>. The follower first checks whether its local log has the same <code class="language-plaintext highlighter-rouge">LogId</code> at <code class="language-plaintext highlighter-rouge">prev_log_id.index</code>. Only if that previous position matches does it accept the entries that follow.</p>

<p>This design is close to standard Raft’s AppendEntries. Standard Raft carries <code class="language-plaintext highlighter-rouge">prevLogIndex</code> and <code class="language-plaintext highlighter-rouge">prevLogTerm</code> separately; <code class="language-plaintext highlighter-rouge">raf</code> combines them into a single <code class="language-plaintext highlighter-rouge">prev_log_id</code>. That is clearer than using the first entry in the request as the matching point, because the request’s <code class="language-plaintext highlighter-rouge">terms</code> and <code class="language-plaintext highlighter-rouge">cmds</code> represent only the real entries to replicate.</p>

<p>The following diagram shows one Append request applied to several follower states. The request has <code class="language-plaintext highlighter-rouge">term = 5</code>, <code class="language-plaintext highlighter-rouge">prev_log_id = (2, 3)</code>, and carries the consecutive log entries for indexes <code class="language-plaintext highlighter-rouge">4..=5</code>.</p>

<!-- [ASCII source](assets/append-replication.txt) -->

<p><img src="/post-res/raf-without-term/a0e8c4e1f7671c25-append-replication.png" alt="Append replication scenarios" /></p>

<p>The three outcomes are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">accepted</code>: the follower matches the leader at <code class="language-plaintext highlighter-rouge">prev_log_id = (2, 3)</code>, so it can accept this Append. <code class="language-plaintext highlighter-rouge">{...}</code> marks the <code class="language-plaintext highlighter-rouge">terms</code> range overwritten by this request, as well as the command range appended because it was missing locally. Here the term at index <code class="language-plaintext highlighter-rouge">4</code> is updated from an old value to <code class="language-plaintext highlighter-rouge">5</code>, and command <code class="language-plaintext highlighter-rouge">c5</code> is appended at index <code class="language-plaintext highlighter-rouge">5</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">conflict at prev_log_id</code>: <code class="language-plaintext highlighter-rouge">*</code> marks the conflict position. The follower’s term at index <code class="language-plaintext highlighter-rouge">3</code> is <code class="language-plaintext highlighter-rouge">3</code>, while the request’s <code class="language-plaintext highlighter-rouge">prev_log_id</code> is <code class="language-plaintext highlighter-rouge">(2, 3)</code>. Since the previous log id does not match, the follower returns a conflict index immediately. The leader must try again with an earlier <code class="language-plaintext highlighter-rouge">prev_log_id</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">rejected: follower has newer term</code>: the follower has already observed term <code class="language-plaintext highlighter-rouge">6</code> at index <code class="language-plaintext highlighter-rouge">6</code>, while the Append request is from term <code class="language-plaintext highlighter-rouge">5</code>. This request comes from a stale leader, so the follower rejects it without modifying the log.</li>
</ul>

<p>The handling logic is:</p>

<ol>
  <li>If the request term is older than the last observed local term, reject it.</li>
  <li>Check the follower’s local log at <code class="language-plaintext highlighter-rouge">prev_log_id</code>. If it does not match, return the conflict index.</li>
  <li>If <code class="language-plaintext highlighter-rouge">prev_log_id</code> matches, process the real entries starting at <code class="language-plaintext highlighter-rouge">prev_log_id.index + 1</code>.</li>
  <li>If later local commands diverge from the leader, truncate the local commands.</li>
  <li>Overwrite the local <code class="language-plaintext highlighter-rouge">terms</code> range covered by this request.</li>
  <li>Append only the commands that are missing locally.</li>
</ol>

<p><em>Append first finds the common prefix with <code class="language-plaintext highlighter-rouge">prev_log_id</code>, then truncates the follower’s conflicting suffix, and finally copies the leader’s entries that the follower is missing.</em></p>

<p>This is still Raft’s core replication model: the leader finds a shared log prefix, then replaces the follower’s divergent suffix with its own.</p>
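
<p>The first three steps of the handling logic, which decide whether an Append is admitted at all, can be sketched like this. The <code class="language-plaintext highlighter-rouge">Precheck</code> enum and <code class="language-plaintext highlighter-rouge">precheck</code> function are hypothetical names. The sketch assumes that the last slot of the local <code class="language-plaintext highlighter-rouge">terms</code> array is the newest term the node has observed, and that the conflict index returned is simply <code class="language-plaintext highlighter-rouge">prev_log_id.index</code>. Truncation and copying (steps 4 to 6) are left out.</p>

```rust
type Term = u64;
type LogIndex = usize;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LogId {
    term: Term,
    index: LogIndex,
}

/// Outcome of the admission checks (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum Precheck {
    /// The request comes from a stale leader; the log is left untouched.
    Stale,
    /// The local log does not hold `prev_log_id`; carries the conflict index.
    Conflict(LogIndex),
    /// The prefix matches; the request's entries apply starting at this index.
    ApplyFrom(LogIndex),
}

fn precheck(local_terms: &[Term], req_term: Term, prev: LogId) -> Precheck {
    // Step 1: assumed here that the last local term slot is the newest term seen.
    if let Some(&last_seen) = local_terms.last() {
        if req_term < last_seen {
            return Precheck::Stale;
        }
    }
    // Step 2: the local term at `prev.index` must equal `prev.term`.
    // A log that is too short to contain `prev.index` also counts as a conflict.
    if local_terms.get(prev.index) != Some(&prev.term) {
        return Precheck::Conflict(prev.index);
    }
    // Step 3: the real entries start right after the matching point.
    Precheck::ApplyFrom(prev.index + 1)
}
```

<p>For the request in the diagram (<code class="language-plaintext highlighter-rouge">term = 5</code>, <code class="language-plaintext highlighter-rouge">prev_log_id = (2, 3)</code>), a follower whose term at index 3 is 2 gets <code class="language-plaintext highlighter-rouge">ApplyFrom(4)</code>, one whose term at index 3 is 3 gets <code class="language-plaintext highlighter-rouge">Conflict(3)</code>, and one that has already seen term 6 gets <code class="language-plaintext highlighter-rouge">Stale</code>.</p>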

<h2 id="advancing-commit">Advancing Commit</h2>

<p>Replication to a quorum does not mean every historical entry can be committed immediately. Standard Raft has an important rule: a leader may only directly commit log entries from its own current term. Entries from older terms become committed only as a consequence of committing an entry from the current term.</p>

<p><code class="language-plaintext highlighter-rouge">raf</code> keeps this rule. Although an established leader rewrites earlier gap slots to its own term, the leader still only directly commits matched indexes that are not smaller than its election index. Empty entries backfilled before <code class="language-plaintext highlighter-rouge">leader.term</code> are committed only indirectly, together with the leader’s election index or a later entry.</p>

<p>Intuitively:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="nf">quorum_has_matched</span><span class="p">(</span><span class="n">index</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">index</span> <span class="o">&gt;=</span> <span class="n">leader</span><span class="py">.term</span> <span class="p">{</span>
    <span class="nf">commit</span><span class="p">(</span><span class="n">index</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The reason is the same as in standard Raft: once an index is committed, every future valid leader must contain it and must not overwrite it.</p>

<p>The following state shows why “replicated to a quorum” is not enough. The leader of term <code class="language-plaintext highlighter-rouge">6</code> has replicated the old log entries at indexes <code class="language-plaintext highlighter-rouge">4</code> and <code class="language-plaintext highlighter-rouge">5</code> to quorum <code class="language-plaintext highlighter-rouge">A+B</code>, but that quorum has not yet matched the leader’s own term index <code class="language-plaintext highlighter-rouge">6</code>. Therefore indexes <code class="language-plaintext highlighter-rouge">4</code> and <code class="language-plaintext highlighter-rouge">5</code> still cannot be committed:</p>

<!-- [ASCII source](assets/not-committed.txt) -->

<p><img src="/post-res/raf-without-term/d98978c7f1b7fab2-not-committed.png" alt="Not committed yet" /></p>

<p>If a new leader for term <code class="language-plaintext highlighter-rouge">7</code> appears later, and its <code class="language-plaintext highlighter-rouge">last_log_id=(5,6)</code> is newer, it can overwrite those uncommitted log entries. In the diagram, <code class="language-plaintext highlighter-rouge">{x}</code> marks the range replaced by the new leader.</p>

<p><em>A leader only directly commits an index that is both covered by a quorum and inside the current leader’s term range.</em></p>
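
<p>A rough sketch of the commit calculation, assuming a majority quorum and a map from node id to <code class="language-plaintext highlighter-rouge">matched</code> that includes the leader itself. The function names <code class="language-plaintext highlighter-rouge">quorum_matched</code> and <code class="language-plaintext highlighter-rouge">committable</code> are illustrative and not taken from the repository.</p>

```rust
use std::collections::BTreeMap;

type NodeId = u64;
type LogIndex = u64;

/// Largest index that a majority of the tracked nodes have matched.
/// Returns `None` when no node is tracked.
fn quorum_matched(matched: &BTreeMap<NodeId, LogIndex>) -> Option<LogIndex> {
    let mut indexes: Vec<LogIndex> = matched.values().copied().collect();
    // Sort descending: the value at position `majority - 1` is matched by
    // at least `majority` nodes.
    indexes.sort_unstable_by(|a, b| b.cmp(a));
    let majority = indexes.len() / 2 + 1;
    indexes.get(majority - 1).copied()
}

/// The index the leader may commit directly, if any: it must be covered by a
/// quorum and must not be smaller than the leader's election index.
fn committable(matched: &BTreeMap<NodeId, LogIndex>, leader_term: LogIndex) -> Option<LogIndex> {
    quorum_matched(matched).filter(|&index| index >= leader_term)
}
```

<p>In the “not committed yet” state above, the leader of term 6 has matched index 6 itself while the second node has matched only index 5, so the quorum-matched index is 5 and <code class="language-plaintext highlighter-rouge">committable</code> returns <code class="language-plaintext highlighter-rouge">None</code>. Once the second node matches index 6, indexes 4, 5, and 6 become committed together.</p>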

<h2 id="example">Example</h2>

<p>The repository includes a three-node in-process example that demonstrates the basic flow described in this article. It creates three <code class="language-plaintext highlighter-rouge">Raf</code> nodes, connects them with <code class="language-plaintext highlighter-rouge">InProcessNetwork</code>, explicitly triggers an election on node 1, and then writes a few log entries through the leader. Metrics show the role, term, commit index, and replication progress.</p>

<p>The example source is here:</p>

<p><a href="https://github.com/drmingdrmer/raf/blob/v0.1.1/examples/three_node.rs">https://github.com/drmingdrmer/raf/blob/v0.1.1/examples/three_node.rs</a></p>

<p>Run it from the repository root:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo run <span class="nt">--example</span> three_node
</code></pre></div></div>

<p>This example is not a production deployment template. It is a minimal demonstration for observing the core protocol state transitions. It logs to stderr and does not include an election timer, heartbeat, snapshots, or membership changes.</p>

<h2 id="current-boundaries">Current Boundaries</h2>

<p>The current implementation is intentionally small. It leaves out several features that a complete production system would usually need:</p>

<ul>
  <li>Automatic election triggering.</li>
  <li>Snapshots and log compaction.</li>
  <li>Membership changes.</li>
  <li>Heartbeats.</li>
  <li>RequestVote retry logic.</li>
  <li>Persistence semantics for application payloads.</li>
</ul>

<p>Automatic election triggering corresponds to the election timer in standard Raft. A node periodically checks whether it has gone too long without seeing a valid leader. If it times out, it starts a new election. This can be implemented by an external timer that calls <code class="language-plaintext highlighter-rouge">Raf::elect()</code>. It does not need to live inside the core <code class="language-plaintext highlighter-rouge">raf</code> state machine, so the current implementation leaves it out.</p>

<p>RequestVote retry has a subtler boundary. Suppose a target node successfully handles a <code class="language-plaintext highlighter-rouge">RequestVote</code>, but the reply is lost in the network. If the candidate retries the same request, the target node has already recorded this term in <code class="language-plaintext highlighter-rouge">terms[req.term]</code>. Under the current rule, it rejects the retry because that term index already exists.</p>

<p>One optional fix is to add an in-memory <code class="language-plaintext highlighter-rouge">voted_for</code> field that records which candidate owns a term. Then a retry from the same candidate for the same term can be recognized and granted again. This field does not necessarily need to be persisted. If a node restarts and loses <code class="language-plaintext highlighter-rouge">voted_for</code>, it can conservatively reject every <code class="language-plaintext highlighter-rouge">RequestVote</code> that uses a term already present locally. That creates a small availability issue, but only after restart; it does not change the persisted relationship between logs and terms.</p>

<p><em>If a RequestVote reply is lost, the retry sees an already existing term. An optional in-memory <code class="language-plaintext highlighter-rouge">voted_for</code> field can improve availability in that case.</em></p>
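
<p>The optional <code class="language-plaintext highlighter-rouge">voted_for</code> idea might look like the sketch below. It is deliberately narrow: the <code class="language-plaintext highlighter-rouge">Voter</code> type and its methods are hypothetical, only the “term slot already exists” rule is modeled, and the log comparison and the case where the requested term lies beyond the end of the local <code class="language-plaintext highlighter-rouge">terms</code> array are not modeled and are rejected here.</p>

```rust
type Term = u64;
type NodeId = u64;

/// Hypothetical voter state for this sketch.
struct Voter {
    /// Persisted term slots, indexed by log index.
    terms: Vec<Term>,
    /// In-memory only: which candidate was granted which term.
    voted_for: Option<(Term, NodeId)>,
}

impl Voter {
    fn handle_request_vote(&mut self, term: Term, candidate: NodeId) -> bool {
        let len = self.terms.len() as Term;
        if term < len {
            // The term slot already exists. Without `voted_for` this is always
            // a rejection; with it, a retry by the same candidate is granted again.
            return self.voted_for == Some((term, candidate));
        }
        if term > len {
            // Not modeled in this sketch.
            return false;
        }
        // First request for this term: reserve the slot and remember the owner.
        self.terms.push(term);
        self.voted_for = Some((term, candidate));
        true
    }

    /// A restart keeps the persisted `terms` but loses `voted_for`.
    fn restart(&mut self) {
        self.voted_for = None;
    }
}
```

<p>After <code class="language-plaintext highlighter-rouge">restart()</code>, the same retry is rejected again, which is the conservative behavior described above.</p>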

<p>These features can all be added around the core model. This article focuses on the central question: if the term comes from the log index, can Raft election, replication, and commit still be expressed in the familiar way?</p>

<h2 id="summary">Summary</h2>

<p><code class="language-plaintext highlighter-rouge">raf</code> is not “Raft without any term.” It still has terms, and it still compares logs by <code class="language-plaintext highlighter-rouge">(term, index)</code>. What it removes is the independently persisted <code class="language-plaintext highlighter-rouge">currentTerm</code>; the leader term is bound to the log index reserved by an election.</p>

<p>This change turns the storage state into two <code class="language-plaintext highlighter-rouge">Vec</code>s aligned by index:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">RafStorage</span> <span class="p">{</span>
    <span class="n">terms</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Term</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">cmds</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Cmd</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>During election, a candidate chooses <code class="language-plaintext highlighter-rouge">terms.len()</code> as its term. After it becomes leader, both the missing-command gap slots and later log entries use this leader term. Replication and commitment still follow the basic Raft rules.</p>

<p>That is the core of this experimental implementation: keep Raft’s logical time model, but change where that logical time comes from in persistent state.</p>

<p>Repository: <a href="https://github.com/drmingdrmer/raf/tree/v0.1.1">raf</a> (<code class="language-plaintext highlighter-rouge">v0.1.1</code>).</p>

<p>Reference:</p>

<ul>
  <li>raf: <a href="https://github.com/drmingdrmer/raf/tree/v0.1.1">https://github.com/drmingdrmer/raf/tree/v0.1.1</a></li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="raft" /><category term="consensus" /><category term="storage" /><summary type="html"><![CDATA[raf is an experimental Raft variant that does not persist currentTerm as a separate field. Instead, each election reserves a log index, and that index becomes the leader term.]]></summary></entry><entry><title type="html">Histogram Done Right: 2KB Memory, 0.2% Error</title><link href="https://blog.openacid.com/algo/histogram/" rel="alternate" type="text/html" title="Histogram Done Right: 2KB Memory, 0.2% Error" /><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/histogram</id><content type="html" xml:base="https://blog.openacid.com/algo/histogram/"><![CDATA[<p><img src="/post-res/histogram/7b72af58792aed59-histagram-banner.png" alt="" /></p>

<h2 id="the-problem-tracking-request-latency-without-slowing-things-down">The Problem: Tracking Request Latency Without Slowing Things Down</h2>

<p>When we were building <a href="https://github.com/databendlabs/databend">Databend</a> and <a href="https://github.com/databendlabs/openraft">OpenRaft</a>,
we ran into a familiar need: we wanted to see how request latency was distributed across the system, in real time, without burning CPU or memory to do it.
This article explains the design behind <a href="https://github.com/drmingdrmer/base2histogram">base2histogram</a>, the library we built to solve it.</p>

<p>Consider the life of a single Raft log entry. It passes through several stages, and each one has its own latency profile:</p>

<ul>
  <li>Received → written to storage</li>
  <li>Persisted to local disk</li>
  <li>Replicated to remote nodes</li>
  <li>Acknowledged by a majority quorum</li>
  <li>Committed → applied to the state machine</li>
</ul>

<p>A <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> is the natural tool here — plot latency on the x-axis, request count on the y-axis, and you get an immediate picture of where time is being spent.</p>

<p><img src="/post-res/histogram/d0883e49611bab5c-001-article-latency.png" alt="001-latency histogram" /></p>

<p>This kind of visibility is what lets you find bottlenecks and fix the right thing.</p>

<p>But there’s a catch: collecting metrics can’t get in the way of doing actual work. So the histogram needs to be:</p>

<ul>
  <li><strong>O(1) to record</strong> — no sorting, no rebalancing, nothing that can stall a hot path</li>
  <li><strong>Tiny in memory</strong> — the system may run hundreds or thousands of these at once</li>
  <li><strong>Queryable for <a href="https://en.wikipedia.org/wiki/Percentile">percentiles</a></strong> — P50, P95, P99</li>
</ul>

<p>Let’s walk through how we designed one that hits all three.</p>

<h2 id="recording-getting-samples-into-buckets">Recording: Getting Samples Into Buckets</h2>

<h3 id="why-log-scale-buckets">Why Log-Scale Buckets</h3>

<p>Most requests cluster around some typical latency, with a few outliers on both ends. This shape is commonly modeled as a <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">log-normal distribution</a>: take the log of the latency values, and the shape becomes a classic <a href="https://en.wikipedia.org/wiki/Normal_distribution">bell curve</a>.</p>

<p>The signature look: a peak at lower values, then a gradual <a href="https://en.wikipedia.org/wiki/Long_tail">long tail</a> stretching to the right.</p>

<p><img src="/post-res/histogram/1e4a9be7908c5de8-002-lognormal-distribution.png" alt="" /></p>

<p>To build a histogram, we divide the x-axis into buckets and count how many samples land in each one.</p>

<p>The key question is how to size those buckets.
Equal-width buckets work well for a normal distribution, but latency is log-normal: the data only looks symmetric and evenly spread on a logarithmic scale.
So the buckets need to grow on a <strong><a href="https://en.wikipedia.org/wiki/Logarithmic_scale">log scale</a></strong>, not a <strong>linear</strong> one.</p>

<p>The simplest version of this: each bucket is twice as wide as the one before it.</p>

<p><code class="language-plaintext highlighter-rouge">[0,1), [1,2), [2,4), [4,8), [8,16), ...</code></p>

<p>Why powers of 2? Because multiplying by 2 is free on a CPU, and mapping a value to its bucket takes a single <a href="https://en.wikipedia.org/wiki/Find_first_set#CLZ">leading zero count</a> instruction.</p>
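
<p>As a small sketch of that mapping (the function name <code class="language-plaintext highlighter-rouge">log2_bucket</code> is made up for illustration), the bucket index for this doubling layout is just the bit length of the value:</p>

```rust
/// Bucket index for the layout [0,1), [1,2), [2,4), [4,8), ...
fn log2_bucket(value: u64) -> usize {
    // `u64::BITS - leading_zeros` is the bit length of `value`:
    // 0 for 0, 1 for 1, 2 for 2..=3, 3 for 4..=7, and so on up to 64.
    (u64::BITS - value.leading_zeros()) as usize
}
```

<p>Zero needs no special case, because <code class="language-plaintext highlighter-rouge">0u64.leading_zeros()</code> is 64.</p>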

<p>Simulate a log-normal workload, plot the bucket counts with the bucket index on the x-axis (effectively a log transform), and the result is a clean bell curve:</p>

<p><img src="/post-res/histogram/0587ffc4d78c5e01-003-log2-bucketing.png" alt="" /></p>

<p>Storage-wise, this is great — 65 buckets cover the entire u64 range.
Resolution-wise, not so much. The last bucket spans half of all possible values. Everything that lands there is a blur.</p>

<p><img src="/post-res/histogram/823fba54ce3daa43-004-log2-coarse.png" alt="" /></p>

<h3 id="a-tempting-fix-we-passed-on">A Tempting Fix We Passed On</h3>

<p>An obvious improvement: use a smaller growth factor, like 1.1× instead of 2×.
More buckets, finer resolution:</p>

<p><img src="/post-res/histogram/4431414f6473152b-005-1.1x-buckets.png" alt="" /></p>

<p>The problem is cost. Finding the right bucket for a value <code class="language-plaintext highlighter-rouge">l</code> means solving for the smallest <code class="language-plaintext highlighter-rouge">x</code> where <code class="language-plaintext highlighter-rouge">1 + 1.1 + 1.1^2 + ... + 1.1^x &gt;= l</code> — and that requires floating-point logarithms. That’s real overhead on a hot path.</p>

<p>We wanted to stay in the world of integers and bit operations.</p>

<h3 id="the-trick-float-like-encoding">The Trick: Float-Like Encoding</h3>

<p>Here’s the idea that makes everything work. We keep the roughly exponential bucket sizes, but we encode each bucket using a fixed number of bits — a parameter we call WIDTH.</p>

<p>Think of a bucket’s lower bound as a tiny <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">floating-point number</a>.
The <a href="https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit">MSB</a> position gives you the exponent (which group of buckets you’re in),
and the next few bits give you the offset within that group.</p>

<p>With WIDTH=3 (the default), a bucket boundary looks like this in binary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00..00 1 xx 00..00
       |
       MSB
&lt;- significant
</code></pre></div></div>

<p>The position of the leading <code class="language-plaintext highlighter-rouge">1</code> picks the group. The two bits that follow pick the bucket within the group.</p>

<p>Here’s what the first few groups look like — each bucket is fully described by just 3 bits:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WIDTH = 3:

range  index  lower bound      size
[0, 1)     0  0b0 ..... 000    1
[1, 2)     1  0b0 ..... 001    1
[2, 3)     2  0b0 ..... 010    1
[3, 4)     3  0b0 ..... 011    1

[4, 5)     4  0b0 ..... 100    1
[5, 6)     5  0b0 ..... 101    1
[6, 7)     6  0b0 ..... 110    1
[7, 8)     7  0b0 ..... 111    1

[8, 10)    8  0b0 .... 1000    2
[10, 12)   9  0b0 .... 1010    2
[12, 14)  10  0b0 .... 1100    2
[14, 16)  11  0b0 .... 1110    2

[16, 20)  12  0b0 ... 10000    4
[20, 24)  13  0b0 ... 10100    4
[24, 28)  14  0b0 ... 11000    4
[28, 32)  15  0b0 ... 11100    4

[32, 40)  16  0b0 .. 100000    8
[40, 48)  17  0b0 .. 101000    8
[48, 56)  18  0b0 .. 110000    8
[56, 64)  19  0b0 .. 111000    8
</code></pre></div></div>

<p>The pattern is clean:</p>

<ul>
  <li>Each group contains <code class="language-plaintext highlighter-rouge">2^(WIDTH-1) = 4</code> buckets</li>
  <li>The 2 bits after the MSB select the bucket within the group</li>
  <li>It’s a 3-bit float: 1 implicit leading bit + 2 fractional bits</li>
</ul>

<p><img src="/post-res/histogram/077aac24f7b55056-006-bit-decomposition.png" alt="" /></p>

<p>Bucket sizes grow roughly logarithmically,
and computing the bucket index is just a matter of extracting the top WIDTH bits — a handful of integer and bit ops. Recording a sample is <strong>O(1)</strong>.</p>

<p>Walk-through with latency = 42:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>value = 42 (binary: 0b101010)
  MSB position: 5
  group: 5 - 2 = 3
  2 bits after MSB: 01 (from 1[01]010)
  offset in group: 1
  Bucket index: 4 + (3 × 4) + 1 = 17
</code></pre></div></div>
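
<p>The walk-through generalizes to the following sketch. It is written from the table and the worked example above rather than copied from base2histogram, so the function name and signature are assumptions; <code class="language-plaintext highlighter-rouge">width</code> is expected to be at least 1.</p>

```rust
/// Map a sample to its bucket index for a given WIDTH (assumed >= 1).
fn bucket_index(value: u64, width: u32) -> usize {
    // Buckets per group: 2^(WIDTH-1), which is 4 for WIDTH = 3.
    let per_group = 1u64 << (width - 1);
    // Values below 2^(WIDTH-1) map to themselves.
    if value < per_group {
        return value as usize;
    }
    // Position of the most significant set bit (value is non-zero here).
    let msb = 63 - value.leading_zeros();
    // Group number: 0 for the first group whose MSB is at position WIDTH-1.
    let group = msb - (width - 1);
    // The WIDTH-1 bits right after the MSB select the bucket inside the group.
    let offset = (value >> group) & (per_group - 1);
    (per_group + u64::from(group) * per_group + offset) as usize
}
```

<p>For <code class="language-plaintext highlighter-rouge">bucket_index(42, 3)</code> this evaluates to <code class="language-plaintext highlighter-rouge">4 + 3 × 4 + 1 = 17</code>, matching the walk-through.</p>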

<p><img src="/post-res/histogram/126a7a9d50366663-007-bucket-layout.png" alt="" /></p>

<h3 id="tuning-width-the-precisionmemory-knob">Tuning WIDTH: The Precision–Memory Knob</h3>

<p>WIDTH sets how many buckets each group gets (<code class="language-plaintext highlighter-rouge">2^(WIDTH-1)</code>).
The number of groups is capped by the 64 bit positions of a u64; each group still spans twice the range of the one before it, so together they cover the full u64 range.</p>

<p>Here’s how the trade-off plays out:</p>

<table>
<tr class="header">
<th>WIDTH</th>
<th>Buckets</th>
<th>Mem/slot</th>
<th>Buckets per group</th>
</tr>
<tr class="odd">
<td>1</td>
<td>65</td>
<td>520 B</td>
<td>1</td>
</tr>
<tr class="even">
<td>2</td>
<td>128</td>
<td>1.0 KB</td>
<td>2</td>
</tr>
<tr class="odd">
<td>3</td>
<td>252</td>
<td>2.0 KB</td>
<td>4 (default)</td>
</tr>
<tr class="even">
<td>4</td>
<td>496</td>
<td>3.9 KB</td>
<td>8</td>
</tr>
<tr class="odd">
<td>5</td>
<td>976</td>
<td>7.6 KB</td>
<td>16</td>
</tr>
<tr class="even">
<td>6</td>
<td>1920</td>
<td>15.0 KB</td>
<td>32</td>
</tr>
</table>

<p>At the default WIDTH=3, one histogram costs 2 KB and records every sample in O(1).</p>
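
<p>The bucket counts in the table can be reproduced with a short calculation. The sketch below assumes one 8-byte counter per bucket, which is what the table’s memory column implies (520 B for 65 buckets); the function names are illustrative.</p>

```rust
/// Total number of buckets needed to cover the u64 range for a given WIDTH.
fn bucket_count(width: u32) -> u64 {
    // Values below 2^WIDTH each get a bucket of size 1.
    let exact = 1u64 << width;
    // Every remaining MSB position (WIDTH..=63) contributes one group
    // of 2^(WIDTH-1) buckets.
    let per_group = 1u64 << (width - 1);
    let groups = u64::from(64 - width);
    exact + groups * per_group
}

/// Memory for the counters alone, assuming one u64 (8 bytes) per bucket.
fn memory_bytes(width: u32) -> u64 {
    bucket_count(width) * 8
}
```

<p>For WIDTH=3 this is <code class="language-plaintext highlighter-rouge">8 + 61 × 4 = 252</code> buckets, or 2016 bytes.</p>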

<p>That’s the write side sorted. Now for the read side.</p>

<h2 id="percentile-estimation-getting-answers-out">Percentile Estimation: Getting Answers Out</h2>

<p>Once we’ve collected the counts, we want percentiles: at what latency have 50% of requests finished (P50)? 90% (P90)? 99% (P99)?</p>

<h3 id="locating-the-right-bucket">Locating the Right Bucket</h3>

<p>The basic idea is simple.
For P50: count the total samples, take 50% of that to get a target rank <code class="language-plaintext highlighter-rouge">p</code>, then walk through the buckets from the start, accumulating counts until the running total reaches <code class="language-plaintext highlighter-rouge">p</code>. That’s your bucket.</p>

<p>But a bucket spans a range, not a point. We still need to estimate where inside the bucket the percentile actually falls.</p>
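
<p>A minimal sketch of the bucket walk, assuming the counts are held in a plain slice and the percentile is given as a fraction in <code class="language-plaintext highlighter-rouge">0.0..=1.0</code>. The name <code class="language-plaintext highlighter-rouge">locate</code> and the return shape are illustrative choices, not the library’s API.</p>

```rust
/// Find the bucket that holds the `q`-quantile (`q` in 0.0..=1.0).
/// Returns the bucket index and the rank inside that bucket, or `None`
/// for an empty histogram (or a `q` above 1.0).
fn locate(counts: &[u64], q: f64) -> Option<(usize, f64)> {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    // Target rank: how many samples lie at or below the percentile.
    let target = q * total as f64;
    let mut seen = 0u64;
    for (index, &count) in counts.iter().enumerate() {
        // Skip empty buckets; stop at the first one whose running total
        // reaches the target rank.
        if count > 0 && (seen + count) as f64 >= target {
            return Some((index, target - seen as f64));
        }
        seen += count;
    }
    None
}
```

<p>The second element of the result is the <code class="language-plaintext highlighter-rouge">rank</code> within the bucket, which is what the interpolation methods below take as input.</p>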

<p>Here are a few ways to do that, from rough to precise.
All error numbers below come from a log-normal distribution (API latency scenario), WIDTH=3, 1,000,000 samples.</p>

<p><strong>Midpoint</strong>: just return <code class="language-plaintext highlighter-rouge">(min + max) / 2</code>.
Many histogram libraries do this (e.g., <a href="https://github.com/iopsystems/histogram">iopsystems/histogram</a>).
It’s a blind guess — it ignores everything about how samples are distributed within the bucket.</p>

<table>
<tr class="header">
<th></th>
<th>P50</th>
<th>P95</th>
<th>P99</th>
</tr>
<tr class="odd">
<td>midpoint</td>
<td>5.018%</td>
<td>7.732%</td>
<td>4.861%</td>
</tr>
</table>

<p><strong>Uniform interpolation</strong>: assume samples are spread evenly across the bucket (a flat rectangle), then interpolate linearly: <code class="language-plaintext highlighter-rouge">estimate = min + (max - min) × rank / count</code>.</p>

<p>Better than midpoint — at least it uses where in the bucket the target rank falls. But “evenly spread” is a rough assumption. Log-normal data is skewed even within a single bucket.</p>
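
<p>For comparison, the two simple estimators can be written as one-liners. This is only a sketch of the formulas quoted above; <code class="language-plaintext highlighter-rouge">count</code> is assumed to be greater than zero.</p>

```rust
/// Midpoint estimate: ignores where in the bucket the target rank falls.
fn midpoint(min: f64, max: f64) -> f64 {
    (min + max) / 2.0
}

/// Uniform interpolation: assumes the `count` samples are spread evenly
/// over [min, max) and returns the position of the `rank`-th sample.
fn uniform(min: f64, max: f64, rank: f64, count: f64) -> f64 {
    min + (max - min) * rank / count
}
```

<p>For the bucket <code class="language-plaintext highlighter-rouge">[40, 48)</code> holding 40 samples, the midpoint estimate is always 44, while a target rank of 10 gives a uniform estimate of 42.</p>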

<h3 id="trapezoid-interpolation-our-approach">Trapezoid Interpolation (Our Approach)</h3>

<p>Uniform interpolation treats the density inside a bucket as flat. In reality, it’s sloped — denser on the side closer to the peak of the distribution.</p>

<p>If we know which way the density tilts, we can swap the rectangle for a trapezoid and land much closer to the true value.</p>

<p><img src="/post-res/histogram/851f5482f43e8c4c-008-trapezoid.png" alt="" /></p>

<p>Each bucket stores only a count — we don’t want to add extra fields. So where does the slope information come from? The neighbors.</p>

<p><strong>The densities of the left and right buckets tell us how the density slopes through the current bucket.</strong></p>

<p>Here’s the recipe.
Compute the average density of the left bucket: <code class="language-plaintext highlighter-rouge">d0 = c0/(x1-x0)</code>, and treat it as the density at that bucket’s midpoint <code class="language-plaintext highlighter-rouge">m0</code>.
Do the same for the right bucket: <code class="language-plaintext highlighter-rouge">d2 = c2/(x3-x2)</code> at midpoint <code class="language-plaintext highlighter-rouge">m2</code>.
Assume density varies linearly from <code class="language-plaintext highlighter-rouge">m0</code> to <code class="language-plaintext highlighter-rouge">m2</code> — over this short range, that’s a reasonable approximation.
This gives us the slope <code class="language-plaintext highlighter-rouge">k</code>.</p>

<p>Inside the target bucket, the density now forms a trapezoid: a sloped line with slope <code class="language-plaintext highlighter-rouge">k</code>, pinned so that the density at the bucket’s midpoint <code class="language-plaintext highlighter-rouge">(x1+x2)/2</code> equals the bucket’s own average density <code class="language-plaintext highlighter-rouge">d1 = c1/(x2-x1)</code> (a linear function’s value at the midpoint of an interval always equals its average over that interval).</p>

<p>To find the percentile, we solve for the x-position where the trapezoid’s area from <code class="language-plaintext highlighter-rouge">x1</code> equals the target rank.</p>

<p>Same distribution, same buckets — here’s how it stacks up:</p>

<table>
<tr class="header">
<th></th>
<th>P50</th>
<th>P95</th>
<th>P99</th>
</tr>
<tr class="odd">
<td>midpoint</td>
<td>5.018%</td>
<td>7.732%</td>
<td>4.861%</td>
</tr>
<tr class="even">
<td>trapezoid</td>
<td>0.000%</td>
<td>0.080%</td>
<td>0.086%</td>
</tr>
</table>

<p>Roughly two orders of magnitude better at P95 and P99, with zero additional storage.</p>

<p>The three-bucket layout:</p>

<p><img src="/post-res/histogram/bd85430d86e35843-009-slope-estimation.png" alt="" /></p>

<table>
<tr class="header">
<th>Variable</th>
<th>Meaning</th>
</tr>
<tr class="odd">
<td><code>x0, x1, x2, x3</code></td>
<td>Boundaries of the three adjacent buckets</td>
</tr>
<tr class="even">
<td><code>w0, w1, w2</code></td>
<td>Bucket widths: <code>w0 = x1-x0</code>, <code>w1 = x2-x1</code>, <code>w2 = x3-x2</code></td>
</tr>
<tr class="odd">
<td><code>c0, c1, c2</code></td>
<td>Sample counts in each bucket</td>
</tr>
<tr class="even">
<td><code>rank</code></td>
<td>How many samples into the target bucket the percentile falls</td>
</tr>
</table>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d0 = c0 / w0       -- left bucket density
d1 = c1 / w1       -- target bucket density
d2 = c2 / w2       -- right bucket density
</code></pre></div></div>

<p>Midpoints of the left and right buckets: <code class="language-plaintext highlighter-rouge">m0 = (x0+x1)/2</code>, <code class="language-plaintext highlighter-rouge">m2 = (x2+x3)/2</code>.
Slope:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>k = (d2 - d0) / (m2 - m0)
</code></pre></div></div>

<p>Then solve for the x-position where the trapezoid’s cumulative area from <code class="language-plaintext highlighter-rouge">x1</code> equals the rank.</p>

<p>The whole thing runs on three counts and their bucket boundaries. Nothing else stored, nothing else needed.</p>
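<p>The recipe above can be sketched in Rust as follows. This is a minimal illustration, not the base2histogram implementation: the names <code>ThreeBuckets</code> and <code>trapezoid_estimate</code> are hypothetical, it uses <code>f64</code> for readability, and the slope clamp (which keeps the density non-negative at both bucket edges) is an assumption added here. The density line is <code>d(x) = d1 + k·(x − (x1+x2)/2)</code>, so the area from <code>x1</code> to <code>x1 + t</code> is <code>a·t + k·t²/2</code> with <code>a = d1 − k·w1/2</code>; setting that equal to <code>rank</code> gives a quadratic in <code>t</code>.</p>

```rust
/// Three adjacent buckets: boundaries x[0..=3] and sample counts c[0..=2].
/// The middle bucket [x[1], x[2]) is the one that contains the percentile.
struct ThreeBuckets {
    x: [f64; 4],
    c: [f64; 3],
}

/// Estimate the x-position at which `rank` samples of the middle bucket
/// have been accumulated, assuming density varies linearly across it.
fn trapezoid_estimate(b: &ThreeBuckets, rank: f64) -> f64 {
    let [x0, x1, x2, x3] = b.x;
    let [c0, c1, c2] = b.c;
    let (w0, w1, w2) = (x1 - x0, x2 - x1, x3 - x2);

    // Empty target bucket or non-positive rank: nothing to interpolate.
    if c1 <= 0.0 || rank <= 0.0 {
        return x1;
    }
    let rank = rank.min(c1);

    // Average densities and the midpoints of the two neighbours.
    let (d0, d1, d2) = (c0 / w0, c1 / w1, c2 / w2);
    let (m0, m2) = ((x0 + x1) / 2.0, (x2 + x3) / 2.0);
    let k = (d2 - d0) / (m2 - m0);

    // Assumption (not from the article): clamp the slope so the density
    // stays non-negative at both edges of the target bucket.
    let max_k = 2.0 * d1 / w1;
    let k = k.clamp(-max_k, max_k);

    // Density at the left edge x1.
    let a = d1 - k * w1 / 2.0;

    // Solve a*t + k*t^2/2 = rank for t in [0, w1]. This form of the
    // quadratic root also covers k == 0, where it reduces to rank / d1.
    let disc = (a * a + 2.0 * k * rank).max(0.0);
    let t = 2.0 * rank / (a + disc.sqrt());
    x1 + t.min(w1)
}
```

<p>For buckets <code>[0,10)</code>, <code>[10,20)</code>, <code>[20,30)</code> with counts 10, 20, 30, the slope is <code>k = 0.1</code>, and a rank of 10 (half of the target bucket) lands at about 15.62. A flat-density assumption would place it at 15.0.</p>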

<h2 id="benchmarks-seven-distributions-six-width-settings">Benchmarks: Seven Distributions, Six WIDTH Settings</h2>

<p>We tested across 7 representative distributions with 1,000,000 samples each, using trapezoid interpolation.</p>

<p>The rows to watch are <strong>LN-API</strong> and <strong>LN-DB</strong> at <strong>W=3</strong> — these are the real-world latency cases, running on the default 2 KB configuration:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|                   W=1      W=2      W=3      W=4      W=5      W=6
| ------------------------------------------------------------------
| Uniform P50    0.108%   0.028%   0.012%   0.018%   0.019%   0.002%
|         P95    2.317%   1.988%   1.035%   0.475%   0.005%   0.005%
|         P99    4.290%   4.129%   3.706%   1.486%   0.298%   0.162%
| 
| LN-API  P50    2.281%   0.182%   0.000%   0.000%   0.000%   0.000%
|         P95   20.256%   3.963%   0.080%   0.040%   0.040%   0.000%
|         P99   11.951%   3.594%   0.086%   0.000%   0.029%   0.000%
| 
| Bimodal P50    1.381%   0.394%   0.394%   0.197%   0.197%   0.197%
|         P95    3.918%   0.172%   0.012%   0.028%   0.038%   0.008%
|         P99    1.521%   1.344%   0.543%   0.078%   0.016%   0.014%
| 
| Expon   P50    1.012%   0.000%   0.145%   0.145%   0.145%   0.000%
|         P95   10.989%   0.200%   0.000%   0.000%   0.033%   0.033%
|         P99   18.665%   4.574%   0.824%   0.022%   0.022%   0.022%
| 
| LN-DB   P50    2.018%   0.034%   0.000%   0.000%   0.000%   0.034%
|         P95    2.027%   0.368%   0.039%   0.006%   0.019%   0.026%
|         P99    3.764%   1.066%   0.187%   0.007%   0.003%   0.062%
| 
| Sequent P50    0.095%   0.000%   0.000%   0.000%   0.000%   0.000%
|         P95    2.271%   1.967%   1.011%   0.496%   0.000%   0.000%
|         P99    4.272%   4.118%   3.696%   1.521%   0.305%   0.169%
| 
| Pareto  P50   10.127%   1.899%   0.633%   0.633%   0.633%   0.000%
|         P95    9.239%   0.272%   0.000%   0.136%   0.000%   0.000%
|         P99    3.517%   0.879%   0.231%   0.093%   0.046%   0.046%
| 
| ------------------------------------------------------------------
| Buckets            65      128      252      496      976     1920
| Mem/slot        520 B   1.0 KB   2.0 KB   3.9 KB   7.6 KB  15.0 KB
| Mem total      1.0 KB   2.0 KB   3.9 KB   7.8 KB  15.2 KB  30.0 KB
</code></pre></div></div>

<p>What each distribution models:</p>

<ul>
  <li><strong>Uniform (<a href="https://en.wikipedia.org/wiki/Continuous_uniform_distribution">uniform distribution</a>)</strong>: synthetic benchmarks</li>
  <li><strong>LN-API (<a href="https://en.wikipedia.org/wiki/Log-normal_distribution">log-normal</a> σ=0.5)</strong>: API and microservice latency</li>
  <li><strong>Bimodal (<a href="https://en.wikipedia.org/wiki/Multimodal_distribution">bimodal distribution</a>)</strong>: cache hit/miss — 90% fast path ~500μs, 10% slow path ~50ms</li>
  <li><strong>Expon (<a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>)</strong>: network and I/O waits</li>
  <li><strong>LN-DB (log-normal σ=1.0)</strong>: database query latency, with a wider tail</li>
  <li><strong>Sequent (sequential)</strong>: adversarial worst case</li>
  <li><strong>Pareto (<a href="https://en.wikipedia.org/wiki/Pareto_distribution">Pareto distribution</a> α=1.5)</strong>: heavy-tailed workloads like request sizes</li>
</ul>

<p>For the latency distributions we care about most — LN-API and LN-DB — WIDTH=3 delivers sub-0.2% error on 2 KB of memory.</p>
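<p>The <code>Mem/slot</code> row of the table is consistent with one 8-byte <code>u64</code> counter per bucket. The sketch below only checks that arithmetic; <code>slot_bytes</code> and <code>BUCKETS_BY_WIDTH</code> are hypothetical names, and the bucket counts are copied from the table rather than derived from WIDTH.</p>

```rust
use std::mem::size_of;

/// (WIDTH, bucket count) pairs copied from the benchmark table.
const BUCKETS_BY_WIDTH: [(u32, usize); 6] =
    [(1, 65), (2, 128), (3, 252), (4, 496), (5, 976), (6, 1920)];

/// Bytes needed for one slot, assuming one u64 counter per bucket.
fn slot_bytes(buckets: usize) -> usize {
    buckets * size_of::<u64>()
}

/// Memory per slot in KiB for every WIDTH in the table.
fn slot_kib_by_width() -> Vec<(u32, f64)> {
    BUCKETS_BY_WIDTH
        .iter()
        .map(|&(width, buckets)| (width, slot_bytes(buckets) as f64 / 1024.0))
        .collect()
}
```

<p>Under that assumption WIDTH=3 needs 252 × 8 = 2016 bytes, which rounds to the 2.0 KB shown in the table.</p>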

<h2 id="summary">Summary</h2>

<ul>
  <li><strong>2 KB memory</strong> (WIDTH=3, 252 buckets of u64), P50/P95/P99 error under 0.2% for log-normal latency</li>
  <li><strong>O(1) recording</strong>, O(buckets) querying</li>
  <li><strong>Trapezoid interpolation</strong> is what makes it work — over 10× more accurate than midpoint, with zero extra storage</li>
  <li><strong>WIDTH is tunable</strong>: 520 B for bare-minimum tracking, up to 15 KB for maximum precision</li>
  <li>Code: <a href="https://github.com/drmingdrmer/base2histogram">base2histogram</a></li>
</ul>

<hr />

<p>Reference:</p>

<ul>
  <li>
    <p>Databend : <a href="https://github.com/databendlabs/databend">https://github.com/databendlabs/databend</a></p>
  </li>
  <li>
    <p>MSB : <a href="https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit">https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit</a></p>
  </li>
  <li>
    <p>OpenRaft : <a href="https://github.com/databendlabs/openraft">https://github.com/databendlabs/openraft</a></p>
  </li>
  <li>
    <p>Pareto distribution : <a href="https://en.wikipedia.org/wiki/Pareto_distribution">https://en.wikipedia.org/wiki/Pareto_distribution</a></p>
  </li>
  <li>
    <p>base2histogram : <a href="https://github.com/drmingdrmer/base2histogram">https://github.com/drmingdrmer/base2histogram</a></p>
  </li>
  <li>
    <p>bimodal distribution : <a href="https://en.wikipedia.org/wiki/Multimodal_distribution">https://en.wikipedia.org/wiki/Multimodal_distribution</a></p>
  </li>
  <li>
    <p>exponential distribution : <a href="https://en.wikipedia.org/wiki/Exponential_distribution">https://en.wikipedia.org/wiki/Exponential_distribution</a></p>
  </li>
  <li>
    <p>floating-point number : <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">https://en.wikipedia.org/wiki/Floating-point_arithmetic</a></p>
  </li>
  <li>
    <p>histogram : <a href="https://en.wikipedia.org/wiki/Histogram">https://en.wikipedia.org/wiki/Histogram</a></p>
  </li>
  <li>
    <p>iopsystems/histogram : <a href="https://github.com/iopsystems/histogram">https://github.com/iopsystems/histogram</a></p>
  </li>
  <li>
    <p>leading zero counting : <a href="https://en.wikipedia.org/wiki/Find_first_set#CLZ">https://en.wikipedia.org/wiki/Find_first_set#CLZ</a></p>
  </li>
  <li>
    <p>log scale : <a href="https://en.wikipedia.org/wiki/Logarithmic_scale">https://en.wikipedia.org/wiki/Logarithmic_scale</a></p>
  </li>
  <li>
    <p>log-normal distribution : <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">https://en.wikipedia.org/wiki/Log-normal_distribution</a></p>
  </li>
  <li>
    <p>long tail : <a href="https://en.wikipedia.org/wiki/Long_tail">https://en.wikipedia.org/wiki/Long_tail</a></p>
  </li>
  <li>
    <p>normal distribution : <a href="https://en.wikipedia.org/wiki/Normal_distribution">https://en.wikipedia.org/wiki/Normal_distribution</a></p>
  </li>
  <li>
    <p>percentile : <a href="https://en.wikipedia.org/wiki/Percentile">https://en.wikipedia.org/wiki/Percentile</a></p>
  </li>
  <li>
    <p>uniform distribution : <a href="https://en.wikipedia.org/wiki/Continuous_uniform_distribution">https://en.wikipedia.org/wiki/Continuous_uniform_distribution</a></p>
  </li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="histogram" /><category term="latency" /><category term="percentile" /><category term="metrics" /><category term="performance" /><summary type="html"><![CDATA[A lightweight histogram that tracks latency distributions in 2KB of memory with sub-0.2% error. Uses a float-like encoding for O(1) bucket indexing and trapezoid interpolation for accurate percentile estimation, with no floating-point math needed.]]></summary></entry><entry><title type="html">xp's AI Workflow</title><link href="https://blog.openacid.com/life/xp-vibe-coding/" rel="alternate" type="text/html" title="xp's AI Workflow" /><published>2026-01-04T00:00:00+00:00</published><updated>2026-01-04T00:00:00+00:00</updated><id>https://blog.openacid.com/life/xp-vibe-coding</id><content type="html" xml:base="https://blog.openacid.com/life/xp-vibe-coding/"><![CDATA[<p>These days I can do almost all of my daily development just by talking. This post shares the tool stack I use.</p>

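<p><strong>Bias disclaimer</strong>: I prefer to spend my energy on analyzing problems, so I have not tried every popular tool on the market. As long as a tool meets my needs I do not replace it, unless I hit an obvious efficiency bottleneck.</p>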

<hr />

<h2 id="工作流概览">Workflow Overview</h2>

<ol>
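  <li><strong>Analyze and decide</strong>: browse the code in an IDE, understand the project, and decide what to do</li>
  <li><strong>Describe the requirement</strong>: dictate the idea with voice input</li>
  <li><strong>Generate code</strong>: hand the description to an AI command-line tool to generate the code</li>
  <li><strong>Review and commit</strong>: review line by line with tig and commit incrementally to Git</li>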
</ol>

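<p>Note that these steps are usually interleaved. If we call one such change a session, I usually run 2 or 3 workflow sessions at the same time. The model produces code more slowly than I can review it, so I often sit in IO wait; running 2 or 3 sessions keeps my own CPU fully busy. Switching tasks, however, also forces frequent context switches in my head and lowers efficiency. So I usually run 1 hard task that needs careful thought (for example, adding a new feature) alongside 1 or 2 simple tasks that need little deep thinking (for example, refactoring), which avoids heavy mental context switching. Staying busy also seems to help dopamine secretion.</p>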

<p><img src="/post-res/xp-vibe-coding/ef3c8085ef56b7b8-workflow.webp" alt="workflow" /></p>

<hr />

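<h2 id="第一步分析与决策--rust-rover">Step 1: Analyze and Decide with Rust Rover</h2>

<p><a href="https://www.jetbrains.com/rust/">Rust Rover</a> is the Rust IDE developed by JetBrains. Before the AI era, the JetBrains IDE family (IntelliJ IDEA, PyCharm, WebStorm, and others) had long been a first choice for developers, known for code navigation, refactoring, and debugging.</p>

<p>Although Rust Rover is relatively conservative about AI features, its traditional code navigation and global search remain solid, and they are enough to quickly build an overall understanding of a project.</p>

<p>At this stage I mainly revisit the code structure in light of the task at hand and settle on a concrete implementation approach. When there is no specific task, I browse the project looking for things to improve.</p>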

<p><img src="/post-res/xp-vibe-coding/5499429f7c2ff0a1-rust-rover.webp" alt="Rust Rover" /></p>

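<h2 id="第二步描述需求--语音输入">Step 2: Describe the Requirement with Voice Input</h2>

<p>Speaking is faster than typing, so a voice input method is an important supplement.</p>

<p>The model’s comprehension covers for voice-input errors: as long as the meaning comes across roughly, the task gets done. So I do not demand high accuracy from voice input, and a few transcription errors do not matter. As long as the description is detailed enough and carries enough redundant information, the model understands the requirement accurately.</p>

<p>Here are the voice input methods I have tried; all of them meet my needs:</p>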

<h3 id="闪电说"><a href="https://shandianshuo.cn/">闪电说</a></h3>
<p>(local model)</p>

<p><img src="/post-res/xp-vibe-coding/3676e2058bc85027-shandianshuo.webp" alt="闪电说" /></p>

<ul>
  <li>Medium recognition accuracy</li>
  <li>Works offline, fast response</li>
</ul>

<h3 id="豆包桌面版"><a href="https://www.doubao.com/">豆包 desktop edition</a></h3>

<p><img src="/post-res/xp-vibe-coding/53677e8f2cbbc865-doubao.webp" alt="豆包" /></p>

<ul>
  <li>Online recognition is more accurate</li>
  <li>Slightly slower because of network latency</li>
</ul>

<h3 id="智谱语音输入法"><a href="https://autoglm.zhipuai.cn/autotyper/">智谱语音输入法</a></h3>

<p><img src="/post-res/xp-vibe-coding/264208e2cb986c5f-zhipu.webp" alt="智谱语音输入法" /></p>

<ul>
  <li>Strong AI correction, high accuracy</li>
  <li>Slightly slower because of network latency</li>
</ul>

<hr />

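<h2 id="第三步代码生成--claude-code--glm-47">Step 3: Generate Code with Claude Code + GLM 4.7</h2>

<p><a href="https://github.com/anthropics/claude-code">Claude Code</a> is my main tool. Its CLI framework is mature, and even without the built-in model it is an efficient vehicle for command-line interaction.</p>

<p>In daily development I dictate requirements straight to Claude Code, without typing on a keyboard at all.</p>

<p>To simplify the workflow (and for fun), I bought a tiny keyboard with only three keys: one triggers voice input, one is delete, and one is Enter. I am currently trying to do my daily development with just these three keys.</p>

<p><img src="/post-res/xp-vibe-coding/ba4b118935fc9e66-3-key.webp" alt="Three-key keypad" /></p>

<p>The model I use is <a href="https://open.bigmodel.cn/">GLM 4.7</a> (released by <a href="https://www.zhipuai.cn/">智谱 AI</a>). If the requirement is described clearly, the “text to code” conversion is no problem. Compared with top-tier models it falls short on abstract requirements, but it is cost-effective and responds quickly from networks inside China.</p>

<p>Even when my dictation is a mess, the model knows that the <code class="language-plaintext highlighter-rouge">IOarrow</code> I said means <code class="language-plaintext highlighter-rouge">io::Error</code>:
<img src="/post-res/xp-vibe-coding/4757c5dfa48f6dc4-claude.webp" alt="Claude Code" /></p>

<p>I have completely abandoned the interactive assistant mode in favor of “full requirement description + independent AI implementation”.</p>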

<hr />

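<h2 id="第四步审核与提交--tig">Step 4: Review and Commit with tig</h2>

<p>I have found that the most efficient way to work is not to wait for the model to finish and then review, but to start reviewing the changes while the model is still generating code, adding them to version control piece by piece. Once I confirm that a change is correct and meets the requirement, I stage it immediately (git add). On later modifications, reviewing the working tree can then ignore the parts already confirmed (git add), which is very efficient.</p>

<p>So the tool this stage needs most is an interactive one that can efficiently add code to Git hunk by hunk. <code class="language-plaintext highlighter-rouge">git add -p</code> is the most basic option; I use tig.</p>

<blockquote>
  <p><a href="https://github.com/jonas/tig">tig</a> is an ncurses-based text-mode interface for Git.</p>
</blockquote>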

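<h3 id="为什么用-tig-而非">Why tig Instead of <code class="language-plaintext highlighter-rouge">git add -p</code></h3>

<p><code class="language-plaintext highlighter-rouge">git add -p</code> only shows a fixed 2-3 lines of context, which is sometimes not enough to understand a change (my own context window is too small 🤔).</p>

<p>Running <code class="language-plaintext highlighter-rouge">tig status</code> opens an interactive view where you can:</p>

<ul>
  <li>View diffs line by line, navigating like Vim (<code class="language-plaintext highlighter-rouge">j/k</code>)</li>
  <li>Adjust the diff context size at any time with the <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code> keys</li>
  <li>Stage changes line by line (<code class="language-plaintext highlighter-rouge">1</code>) or hunk by hunk (<code class="language-plaintext highlighter-rouge">u</code>); the stage is also called the cache or the index</li>
  <li>Press <code class="language-plaintext highlighter-rouge">R</code> (shift-r) to refresh the view and show the latest changes</li>
</ul>

<p>I usually split tmux into left and right panes: on the right I push the model to do the work, and on the left I confirm changes line by line. Sometimes the model’s edits scroll by too fast to see what it actually did. The line-by-line interactive review in <code class="language-plaintext highlighter-rouge">tig</code> lets me inspect every changed line asynchronously. As soon as I spot a problematic change, I can switch to the right pane, stop the hard-working model, and redirect it:</p>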

<p><img src="/post-res/xp-vibe-coding/05dc7b8862daff46-tig.webp" alt="tig" /></p>

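<p>As the screenshot shows, tig’s context can be expanded to any width with <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code>.</p>

<p>Vim’s fugitive Git plugin offers similar functionality, but it requires starting Vim first, and the operations are slightly more cumbersome.</p>

<h3 id="增量-review">Incremental Review</h3>

<p>Parts confirmed OK are staged in Git immediately. After asking the model for further changes, I review only the not-staged part and do not re-examine code that was already staged and confirmed in the previous step. When the AI makes incremental changes, I only need to look at what changed.</p>

<hr />

<h2 id="vibe-coding-的核心理念">The Core Idea of Vibe Coding</h2>

<p>AI can boost efficiency enormously, provided that you understand the domain. Otherwise you cannot tell right from wrong, and the errors eat up every efficiency gain. The precondition for developing with AI is that I know the project better than the AI does.</p>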

<hr />

<h2 id="工具清单">Tool List</h2>

<table>
<tr class="header">
<th>Tool</th>
<th>Purpose</th>
</tr>
<tr class="odd">
<td><a href="https://www.jetbrains.com/rust/">Rust Rover</a></td>
<td>Code analysis</td>
</tr>
<tr class="even">
<td><a href="https://shandianshuo.cn/">闪电说</a> / <a href="https://www.doubao.com/">豆包</a> / <a href="https://autoglm.zhipuai.cn/autotyper/">智谱语音输入法</a></td>
<td>Voice input</td>
</tr>
<tr class="odd">
<td><a href="https://github.com/anthropics/claude-code">Claude Code</a></td>
<td>AI coding</td>
</tr>
<tr class="even">
<td><a href="https://open.bigmodel.cn/">GLM 4.7</a></td>
<td>Backend model</td>
</tr>
<tr class="odd">
<td><a href="https://github.com/jonas/tig">tig</a></td>
<td>Code review, incremental commits</td>
</tr>
</table>

<hr />

<p><em>This post was written entirely by dictation.</em></p>

<!-- Link definitions -->

<p>Reference:</p>

<ul>
  <li>
    <p>Claude Code : <a href="https://github.com/anthropics/claude-code">https://github.com/anthropics/claude-code</a></p>
  </li>
  <li>
    <p>GLM 4.7 : <a href="https://open.bigmodel.cn/">https://open.bigmodel.cn/</a></p>
  </li>
  <li>
    <p>Rust Rover : <a href="https://www.jetbrains.com/rust/">https://www.jetbrains.com/rust/</a></p>
  </li>
  <li>
    <p>tig : <a href="https://github.com/jonas/tig">https://github.com/jonas/tig</a></p>
  </li>
  <li>
    <p>智谱 AI : <a href="https://www.zhipuai.cn/">https://www.zhipuai.cn/</a></p>
  </li>
  <li>
    <p>智谱语音输入法 : <a href="https://autoglm.zhipuai.cn/autotyper/">https://autoglm.zhipuai.cn/autotyper/</a></p>
  </li>
  <li>
    <p>豆包 : <a href="https://www.doubao.com/">https://www.doubao.com/</a></p>
  </li>
  <li>
    <p>闪电说 : <a href="https://shandianshuo.cn/">https://shandianshuo.cn/</a></p>
  </li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="life" /><category term="vide-coding" /><category term="ai" /><category term="voice-input" /><summary type="html"><![CDATA[xp's AI development workflow]]></summary></entry><entry><title type="html">Raft Node Rejoin Bug</title><link href="https://blog.openacid.com/algo/raft-rejoin-bug/" rel="alternate" type="text/html" title="Raft Node Rejoin Bug" /><published>2025-11-20T00:00:00+00:00</published><updated>2025-11-20T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/raft-rejoin-bug</id><content type="html" xml:base="https://blog.openacid.com/algo/raft-rejoin-bug/"><![CDATA[<p><img src="/post-res/raft-rejoin-bug/1d34a13b1c0631e8-raft-rejoin-bug-banner.webp" alt="" /></p>

<p>In Raft cluster operations, there’s an easily overlooked bug: when a node is removed and then re-added to the cluster within the same term, delayed AppendEntries responses from the old membership configuration can corrupt the leader’s replication progress tracking for that node, causing the leader to enter an infinite retry loop.</p>

<p>The root cause is the lack of a <strong>replication session isolation mechanism</strong>. When the same node joins the cluster at different times, these should be treated as different replication sessions. However, without explicit session identifiers, the leader cannot distinguish which session a response belongs to. The result is that delayed responses from old sessions incorrectly update the progress records of new sessions.</p>

<p>While this creates operational challenges—continuous resource consumption and nodes unable to catch up with the cluster—the good news is that Raft’s commit protocol ensures data safety remains intact.</p>

<p>Raft libraries analyzed:</p>

<table>
<tr class="header">
<th>Implementation</th>
<th style="text-align: right;">Stars</th>
<th>Language</th>
<th>Status</th>
<th>Analysis</th>
</tr>
<tr class="odd">
<td>Apache Ratis</td>
<td style="text-align: right;">1,418</td>
<td>Java</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/apache-ratis.md">Report</a></td>
</tr>
<tr class="even">
<td>NuRaft</td>
<td style="text-align: right;">1,140</td>
<td>C++</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/nuraft.md">Report</a></td>
</tr>
<tr class="odd">
<td>OpenRaft</td>
<td style="text-align: right;">1,700</td>
<td>Rust</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/openraft.md">Report</a></td>
</tr>
<tr class="even">
<td>RabbitMQ Ra</td>
<td style="text-align: right;">908</td>
<td>Erlang</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/rabbitmq-ra.md">Report</a></td>
</tr>
<tr class="odd">
<td>braft</td>
<td style="text-align: right;">4,174</td>
<td>C++</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/braft.md">Report</a></td>
</tr>
<tr class="even">
<td>canonical/raft</td>
<td style="text-align: right;">954</td>
<td>C</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/canonical-raft.md">Report</a></td>
</tr>
<tr class="odd">
<td>sofa-jraft</td>
<td style="text-align: right;">3,762</td>
<td>Java</td>
<td>✓ PROTECTED</td>
<td><a href="analysis/sofa-jraft-analysis.md">Report</a></td>
</tr>
<tr class="even">
<td><strong>LogCabin</strong></td>
<td style="text-align: right;"><strong>1,945</strong></td>
<td><strong>C++</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/logcabin.md">Report</a></td>
</tr>
<tr class="odd">
<td><strong>PySyncObj</strong></td>
<td style="text-align: right;"><strong>738</strong></td>
<td><strong>Python</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/pysyncobj.md">Report</a></td>
</tr>
<tr class="even">
<td><strong>dragonboat</strong></td>
<td style="text-align: right;"><strong>5,262</strong></td>
<td><strong>Go</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/dragonboat.md">Report</a></td>
</tr>
<tr class="odd">
<td><strong>etcd-io/raft</strong></td>
<td style="text-align: right;"><strong>943</strong></td>
<td><strong>Go</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/etcd-raft.md">Report</a></td>
</tr>
<tr class="even">
<td><strong>hashicorp/raft</strong></td>
<td style="text-align: right;"><strong>8,826</strong></td>
<td><strong>Go</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/hashicorp-raft-analysis.md">Report</a></td>
</tr>
<tr class="odd">
<td><strong>raft-java</strong></td>
<td style="text-align: right;"><strong>1,234</strong></td>
<td><strong>Java</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/raft-java.md">Report</a></td>
</tr>
<tr class="even">
<td><strong>raft-rs (TiKV)</strong></td>
<td style="text-align: right;"><strong>3,224</strong></td>
<td><strong>Rust</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/raft-rs.md">Report</a></td>
</tr>
<tr class="odd">
<td><strong>redisraft</strong></td>
<td style="text-align: right;"><strong>841</strong></td>
<td><strong>C</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/redisraft.md">Report</a></td>
</tr>
<tr class="even">
<td><strong>willemt/raft</strong></td>
<td style="text-align: right;"><strong>1,160</strong></td>
<td><strong>C</strong></td>
<td><strong>✗ VULNERABLE</strong></td>
<td><a href="analysis/willemt-raft.md">Report</a></td>
</tr>
<tr class="odd">
<td>eliben/raft</td>
<td style="text-align: right;">1,232</td>
<td>Go</td>
<td>N/A</td>
<td><a href="analysis/eliben-raft.md">Report</a></td>
</tr>
</table>

<p>This article uses raft-rs, the Raft implementation used by TiKV, as a case study to analyze this bug’s trigger conditions, impact, and potential solutions.</p>

<p>The complete analysis and a survey of other Raft implementations are available in the <a href="https://github.com/drmingdrmer/raft-rejoin-bug">Raft Rejoin Bug Survey</a>.</p>

<h2 id="raft-log-replication-basics">Raft Log Replication Basics</h2>

<p>In Raft, the leader replicates log entries to followers through AppendEntries RPC calls, while maintaining a replication state machine for each follower to track replication progress.</p>

<h3 id="appendentries-request-response-flow">AppendEntries Request-Response Flow</h3>

<p>Here’s how it works: The leader sends AppendEntries requests with the current <code class="language-plaintext highlighter-rouge">term</code>, the <code class="language-plaintext highlighter-rouge">prev_log_index</code> and <code class="language-plaintext highlighter-rouge">prev_log_term</code> pointing to the position just before the new entries, the <code class="language-plaintext highlighter-rouge">entries[]</code> array to replicate, and the leader’s <code class="language-plaintext highlighter-rouge">leader_commit</code> index. The follower responds with its own <code class="language-plaintext highlighter-rouge">term</code>, the highest log <code class="language-plaintext highlighter-rouge">index</code> it replicated, and whether the operation succeeded.</p>
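<p>The request and response described above can be sketched as plain Rust types with a follower-side handler. This is an illustrative model only, not the raft-rs API: the type names, the in-memory <code>Follower</code>, and the choice to report the follower’s last log index on rejection are assumptions made for this example. It also assumes entries are contiguous and start at <code>prev_log_index + 1</code>.</p>

```rust
/// One log entry; only the fields needed for the consistency check.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    index: u64,
    term: u64,
}

/// Request fields as listed in the paragraph above.
struct AppendEntriesRequest {
    term: u64,
    prev_log_index: u64,
    prev_log_term: u64,
    entries: Vec<Entry>,
    leader_commit: u64,
}

/// Response fields as listed in the paragraph above.
#[derive(Debug, PartialEq)]
struct AppendEntriesResponse {
    term: u64,
    index: u64,
    success: bool,
}

/// Hypothetical in-memory follower; `log[i]` holds the entry with index `i + 1`.
struct Follower {
    term: u64,
    log: Vec<Entry>,
    commit: u64,
}

impl Follower {
    fn handle_append(&mut self, req: &AppendEntriesRequest) -> AppendEntriesResponse {
        let last = self.log.len() as u64;
        // Reject requests from a stale leader.
        if req.term < self.term {
            return AppendEntriesResponse { term: self.term, index: last, success: false };
        }
        self.term = req.term;
        // Consistency check: the entry just before the new ones must match.
        let prev_ok = req.prev_log_index == 0
            || self
                .log
                .get(req.prev_log_index as usize - 1)
                .map_or(false, |e| e.term == req.prev_log_term);
        if !prev_ok {
            return AppendEntriesResponse { term: self.term, index: last, success: false };
        }
        // Append, truncating only where an existing entry conflicts.
        for e in &req.entries {
            let pos = (e.index - 1) as usize;
            if pos < self.log.len() {
                if self.log[pos].term == e.term {
                    continue;
                }
                self.log.truncate(pos);
            }
            self.log.push(e.clone());
        }
        let replicated = req.prev_log_index + req.entries.len() as u64;
        self.commit = self.commit.max(req.leader_commit.min(replicated));
        AppendEntriesResponse { term: self.term, index: replicated, success: true }
    }
}
```

<p>Note that the response carries only <code>term</code>, <code>index</code>, and <code>success</code>. Nothing in it says which request, or which membership configuration, it answers.</p>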

<h3 id="progress-tracking">Progress Tracking</h3>

<p>The leader relies on these responses to track each follower’s replication status. It uses <code class="language-plaintext highlighter-rouge">matched</code> to record the highest log index confirmed to be replicated on that follower, and <code class="language-plaintext highlighter-rouge">next_idx</code> to mark where to send next. When a successful response comes back with <code class="language-plaintext highlighter-rouge">index=N</code>, the leader updates <code class="language-plaintext highlighter-rouge">matched=N</code> and calculates <code class="language-plaintext highlighter-rouge">next_idx=N+1</code> for the next round.</p>

<p>This tracking mechanism has an implicit assumption: responses correspond to the current replication session.</p>

<p>If this assumption is violated when a node rejoins the cluster, the leader can get stuck in an infinite retry loop. It keeps sending AppendEntries requests, the node keeps rejecting them, and the cycle repeats endlessly while that node never manages to catch up with the cluster.</p>

<h2 id="raft-rs-progress-tracking">raft-rs Progress Tracking</h2>

<p>raft-rs tracks replication progress using a Progress structure for each follower node:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// From raft-rs/src/tracker/progress.rs</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">Progress</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">matched</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>      <span class="c1">// Highest log index known to be replicated</span>
    <span class="k">pub</span> <span class="n">next_idx</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>     <span class="c1">// Next log index to send</span>
    <span class="k">pub</span> <span class="n">state</span><span class="p">:</span> <span class="n">ProgressState</span><span class="p">,</span>
    <span class="c1">// ... other fields</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">matched</code> field records the highest log index successfully replicated to this follower. Whenever the leader receives a successful AppendEntries response, it updates this field:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// From raft-rs/src/tracker/progress.rs</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">maybe_update</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span> <span class="nb">u64</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">need_update</span> <span class="o">=</span> <span class="k">self</span><span class="py">.matched</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span>  <span class="c1">// Only check monotonicity</span>
    <span class="k">if</span> <span class="n">need_update</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.matched</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>  <span class="c1">// Accept the update!</span>
        <span class="k">self</span><span class="nf">.resume</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="n">need_update</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice the update logic is quite simple: as long as the new index is higher than the current <code class="language-plaintext highlighter-rouge">matched</code>, it accepts the update. When a node gets removed from the cluster, its Progress record is deleted. When it rejoins, a brand new Progress record is created with <code class="language-plaintext highlighter-rouge">matched = 0</code>.</p>

<h2 id="bug-reproduction-sequence">Bug Reproduction Sequence</h2>

<p>Let’s walk through a concrete timeline to see how this bug unfolds. Pay special attention to the fact that all events happen within a single term (term=5)—this is key to understanding why term-based validation fails.</p>

<h3 id="event-timeline">Event Timeline</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>| Time | Event                                         | Progress State
|------|-----------------------------------------------|----------------
| T1   | log=1, members={a,b,c}                        | C: matched=0
|      | Leader sends AppendEntries(index=1) to C      |
|      | (Network delay causes slow delivery)          |
|      |                                               |
| T2   | log=5, members={a,b}                          | C: [deleted]
|      | Node C removed from cluster                   |
|      | Progress[C] deleted from leader's tracker     |
|      |                                               |
| T3   | log=100, members={a,b,c}                      | C: matched=0 (new)
|      | Node C rejoins the cluster                    |
|      | New Progress[C] created with matched=0        |
|      |                                               |
| T4   | Delayed response arrives from T1:             |
|      | {from: C, index: 1, success: true}            |
|      | Leader finds Progress[C] (the new one!)       |
|      | maybe_update(1) called: 0 &lt; 1, so update!     | C: matched=1 ❌
|      |                                               |
| T5   | Leader calculates next_idx = matched + 1 = 2  |
|      | Sends AppendEntries(prev_index=1)             |
|      | Node C rejects (doesn't have index 1!)        |
|      | Leader can't decrement (matched == rejected)  |
|      | Infinite loop begins...                       |
</code></pre></div></div>
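<p>The timeline can be replayed with a stripped-down model of the leader’s tracker. This is a simplified sketch, not raft-rs code: <code>Leader</code>, <code>AppendResponse</code>, and <code>replay_timeline</code> are hypothetical names, the node ID 3 stands in for node C, and <code>maybe_update</code> keeps only the monotonicity check shown earlier.</p>

```rust
use std::collections::HashMap;

/// Simplified per-follower progress record.
#[derive(Debug)]
struct Progress {
    matched: u64,
    next_idx: u64,
}

impl Progress {
    fn new() -> Self {
        Progress { matched: 0, next_idx: 1 }
    }

    /// Accept any index higher than the current `matched`.
    fn maybe_update(&mut self, n: u64) -> bool {
        let need_update = self.matched < n;
        if need_update {
            self.matched = n;
            self.next_idx = n + 1;
        }
        need_update
    }
}

/// A successful AppendEntries response as seen by the leader.
struct AppendResponse {
    from: u64,
    term: u64,
    index: u64,
}

struct Leader {
    term: u64,
    prs: HashMap<u64, Progress>,
}

impl Leader {
    fn handle_append_response(&mut self, m: &AppendResponse) -> bool {
        // The term check passes because everything happens in term 5.
        if m.term != self.term {
            return false;
        }
        match self.prs.get_mut(&m.from) {
            Some(pr) => pr.maybe_update(m.index),
            None => false,
        }
    }
}

/// Replays T1..T4 and returns (matched, next_idx) for node C.
fn replay_timeline() -> (u64, u64) {
    const C: u64 = 3;
    let mut leader = Leader { term: 5, prs: HashMap::new() };

    // T1: C is a member; its response to AppendEntries(index=1) is delayed.
    leader.prs.insert(C, Progress::new());
    let delayed = AppendResponse { from: C, term: 5, index: 1 };

    // T2: C is removed, so its progress record is deleted.
    leader.prs.remove(&C);

    // T3: C rejoins with a brand-new progress record (matched = 0).
    leader.prs.insert(C, Progress::new());

    // T4: the delayed response arrives and is applied to the new record.
    leader.handle_append_response(&delayed);

    let pr = &leader.prs[&C];
    (pr.matched, pr.next_idx)
}
```

<p>After the replay the new record holds <code>matched = 1</code> and <code>next_idx = 2</code>, which is the corrupted state that T5 starts from.</p>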

<h3 id="response-handling-at-t4">Response Handling at T4</h3>

<p>At time T4, the response to the request sent at T1, delayed in the network, finally arrives. Here’s how the leader handles it:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// From raft-rs/src/raft.rs</span>
<span class="k">fn</span> <span class="nf">handle_append_response</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">m</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Message</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Find the progress record</span>
    <span class="k">let</span> <span class="n">pr</span> <span class="o">=</span> <span class="k">match</span> <span class="k">self</span><span class="py">.prs</span><span class="nf">.get_mut</span><span class="p">(</span><span class="n">m</span><span class="py">.from</span><span class="p">)</span> <span class="p">{</span>
        <span class="nf">Some</span><span class="p">(</span><span class="n">pr</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="n">pr</span><span class="p">,</span>
        <span class="nb">None</span> <span class="k">=&gt;</span> <span class="p">{</span>
            <span class="nd">debug!</span><span class="p">(</span><span class="k">self</span><span class="py">.logger</span><span class="p">,</span> <span class="s">"no progress available for {}"</span><span class="p">,</span> <span class="n">m</span><span class="py">.from</span><span class="p">);</span>
            <span class="k">return</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">};</span>

    <span class="c1">// Update progress if the index is higher</span>
    <span class="k">if</span> <span class="o">!</span><span class="n">pr</span><span class="nf">.maybe_update</span><span class="p">(</span><span class="n">m</span><span class="py">.index</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s where things go wrong: The leader does find a Progress record for node C, but it’s the new one created at T3. Since the message’s term matches the current term, it passes the term check in the <a href="https://github.com/tikv/raft-rs/blob/master/src/raft.rs#L1346-L1478"><code class="language-plaintext highlighter-rouge">step()</code> function</a>, and the leader updates progress with this stale index value.</p>

<h2 id="root-cause-analysis">Root Cause Analysis</h2>

<p>The root of this bug is that <strong>request-response messages lack replication session identification</strong>. When node C gets removed at T2 and rejoins at T3, these should be two distinct replication sessions—but the leader has no way to distinguish between responses from requests sent at T1 versus responses from requests sent after T3.</p>

<p>Look at raft-rs’s Message structure:</p>

<p>File: <a href="https://github.com/tikv/raft-rs/blob/master/proto/proto/eraftpb.proto#L71-L98"><code class="language-plaintext highlighter-rouge">proto/proto/eraftpb.proto:71-98</code></a></p>

<div class="language-protobuf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">message</span> <span class="nc">Message</span> <span class="p">{</span>
    <span class="n">MessageType</span> <span class="na">msg_type</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint64</span> <span class="k">to</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="kt">uint64</span> <span class="na">from</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
    <span class="kt">uint64</span> <span class="na">term</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>        <span class="c1">// Only term, no session identifier!</span>
    <span class="kt">uint64</span> <span class="na">log_term</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
    <span class="kt">uint64</span> <span class="na">index</span> <span class="o">=</span> <span class="mi">6</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The Message only has a <code class="language-plaintext highlighter-rouge">from</code> field identifying the sending node, but the same node ID joining the cluster at different times should be treated as different replication sessions. The leader needs to distinguish: is this response from node C’s first session or its second session? But the current Message structure provides no way to tell.</p>

<h2 id="impact-analysis">Impact Analysis</h2>

<h3 id="infinite-retry-loop">Infinite Retry Loop</h3>

<p>Once the leader incorrectly sets <code class="language-plaintext highlighter-rouge">matched=1</code>, trouble begins. Here’s what happens:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// From raft-rs/src/tracker/progress.rs</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">maybe_decr_to</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">rejected</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span> <span class="n">match_hint</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
    <span class="k">if</span> <span class="k">self</span><span class="py">.state</span> <span class="o">==</span> <span class="nn">ProgressState</span><span class="p">::</span><span class="n">Replicate</span> <span class="p">{</span>
        <span class="c1">// Can't decrement if rejected &lt;= matched</span>
        <span class="k">if</span> <span class="n">rejected</span> <span class="o">&lt;</span> <span class="k">self</span><span class="py">.matched</span>
            <span class="p">||</span> <span class="p">(</span><span class="n">rejected</span> <span class="o">==</span> <span class="k">self</span><span class="py">.matched</span> <span class="o">&amp;&amp;</span> <span class="n">request_snapshot</span> <span class="o">==</span> <span class="n">INVALID_INDEX</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="k">false</span><span class="p">;</span>  <span class="c1">// Ignore the rejection!</span>
        <span class="p">}</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The leader sends AppendEntries with <code class="language-plaintext highlighter-rouge">prev_log_index=1</code>, but node C’s log is empty—it doesn’t have index 1. Node C rejects the request. The leader wants to decrement <code class="language-plaintext highlighter-rouge">next_idx</code> to retry an earlier position, but here’s the problem: because <code class="language-plaintext highlighter-rouge">rejected (1) == matched (1)</code>, the decrement logic returns false and refuses to decrement. So the leader just sends the same request again, node C rejects it again, and this cycle continues forever.</p>
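<p>The stuck state can be reproduced with a minimal sketch. <code class="language-plaintext highlighter-rouge">ToyProgress</code> and <code class="language-plaintext highlighter-rouge">count_effective_rejections</code> are hypothetical names introduced here, and the guard is simplified from the excerpt above (it assumes no snapshot request is pending), so treat this as an illustration rather than raft-rs code:</p>

```rust
/// Toy stand-in for the leader's view of one follower in Replicate state.
/// Hypothetical type for illustration; not the raft-rs `Progress`.
#[derive(Debug)]
pub struct ToyProgress {
    pub matched: u64,
    pub next_idx: u64,
}

impl ToyProgress {
    /// Simplified guard: a rejection at or below `matched` is considered
    /// stale and ignored (assumes no snapshot request is pending).
    pub fn maybe_decr_to(&mut self, rejected: u64) -> bool {
        if rejected <= self.matched {
            return false; // rejection ignored, `next_idx` stays put
        }
        self.next_idx = self.matched + 1;
        true
    }
}

/// Feeds the same rejection `rounds` times and counts how many were honored.
pub fn count_effective_rejections(pr: &mut ToyProgress, rejected: u64, rounds: usize) -> usize {
    let mut honored = 0;
    for _ in 0..rounds {
        if pr.maybe_decr_to(rejected) {
            honored += 1;
        }
    }
    honored
}
```

<p>With <code class="language-plaintext highlighter-rouge">matched = 1</code> and <code class="language-plaintext highlighter-rouge">next_idx = 2</code>, feeding the rejection of index 1 any number of times honors none of them and leaves <code class="language-plaintext highlighter-rouge">next_idx</code> at 2, which is the retry loop described above.</p>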

<h3 id="operational-impact">Operational Impact</h3>

<p>This bug creates operational problems. The most direct one is resource exhaustion: the continuous AppendEntries-rejection cycle keeps consuming CPU and network bandwidth.</p>

<h2 id="why-data-remains-safe">Why Data Remains Safe</h2>

<p>Despite all the operational chaos, there’s good news: data integrity remains intact. Raft’s safety properties ensure that even with corrupted progress tracking, the cluster won’t lose any committed data.</p>

<p>The reason is that commit index calculation still works correctly. Even if the leader mistakenly thinks node C has <code class="language-plaintext highlighter-rouge">matched=1</code>, it calculates the commit index based on the actual majority. For example, node A has matched=100, node B has matched=100, and node C has matched=1 (which is wrong, but doesn’t matter). The majority looks at A and B with matched=100, so the commit index is correctly calculated as 100. Combined with Raft’s overlapping majorities property, any newly elected leader will necessarily have all committed entries, keeping data safe.</p>
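<p>The majority argument can be made concrete with a small sketch. <code class="language-plaintext highlighter-rouge">commit_index</code> is a hypothetical helper written for this example, assuming a single (non-joint) configuration with one <code class="language-plaintext highlighter-rouge">matched</code> value per voter; it is not the raft-rs implementation:</p>

```rust
/// Returns the highest log index that a majority of voters have matched.
/// Hypothetical helper for illustration; assumes a single (non-joint)
/// configuration where every element of `matched` belongs to one voter.
pub fn commit_index(matched: &[u64]) -> u64 {
    if matched.is_empty() {
        return 0;
    }
    let mut sorted = matched.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending order
    let quorum = sorted.len() / 2 + 1;
    // The quorum-th largest value is matched by at least `quorum` voters.
    sorted[quorum - 1]
}
```

<p>Calling it with <code class="language-plaintext highlighter-rouge">[100, 100, 1]</code> or with <code class="language-plaintext highlighter-rouge">[100, 100, 0]</code> gives 100 in both cases: node C’s value never decides the result while A and B already form a majority.</p>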

<h2 id="solutions">Solutions</h2>

<h3 id="solution-1-add-membership-version-recommended">Solution 1: Add Membership Version (Recommended)</h3>

<p>The most straightforward fix is to add a membership configuration version to messages:</p>

<div class="language-protobuf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">message</span> <span class="nc">Message</span> <span class="p">{</span>
    <span class="c1">// ... existing fields</span>
    <span class="kt">uint64</span> <span class="na">membership_log_id</span> <span class="o">=</span> <span class="mi">17</span><span class="p">;</span>  <span class="c1">// New field</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then validate it when processing responses:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_response</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">m</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Message</span><span class="p">)</span> <span class="p">{</span>
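    <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">pr</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="py">.prs</span><span class="nf">.get_mut</span><span class="p">(</span><span class="n">m</span><span class="py">.from</span><span class="p">)</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">};</span>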

    <span class="c1">// Check membership version</span>
    <span class="k">if</span> <span class="n">m</span><span class="py">.membership_log_id</span> <span class="o">!=</span> <span class="k">self</span><span class="py">.current_membership_log_id</span> <span class="p">{</span>
        <span class="nd">debug!</span><span class="p">(</span><span class="s">"stale message from different membership"</span><span class="p">);</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">pr</span><span class="nf">.maybe_update</span><span class="p">(</span><span class="n">m</span><span class="py">.index</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This directly fixes the root cause—the leader can now tell which membership configuration a message comes from.</p>

<h3 id="solution-2-generation-counters">Solution 2: Generation Counters</h3>

<p>Another approach is to add a generation counter to Progress that increments each time a node rejoins:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Progress</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">matched</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">next_idx</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">generation</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>  <span class="c1">// Incremented on each rejoin</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Include the generation in messages and validate it when responses arrive. This is lighter weight than solution 1, but you need to carefully manage the generation lifecycle.</p>
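<p>A minimal sketch of the generation check might look like the following. All names (<code class="language-plaintext highlighter-rouge">SessionProgress</code>, <code class="language-plaintext highlighter-rouge">AppendResponse</code>, <code class="language-plaintext highlighter-rouge">rejoin</code>, <code class="language-plaintext highlighter-rouge">handle_response</code>) are hypothetical, and it assumes the leader keeps the counter alive across the removal and re-addition of the node, which is exactly the lifecycle concern mentioned above:</p>

```rust
/// Leader-side replication state for one follower (hypothetical, simplified).
#[derive(Debug)]
pub struct SessionProgress {
    pub matched: u64,
    pub generation: u64,
}

/// Response carrying the generation copied from the request it answers.
#[derive(Debug)]
pub struct AppendResponse {
    pub index: u64,
    pub generation: u64,
}

impl SessionProgress {
    /// Called when the node is removed and added back: start a new session.
    pub fn rejoin(&mut self) {
        self.generation += 1;
        self.matched = 0;
    }

    /// Applies a response only if it belongs to the current session.
    /// Returns true when `matched` advanced.
    pub fn handle_response(&mut self, resp: &AppendResponse) -> bool {
        if resp.generation != self.generation {
            return false; // delayed response from an earlier session
        }
        if resp.index > self.matched {
            self.matched = resp.index;
            return true;
        }
        false
    }
}
```

<p>A response stamped with generation 0 that arrives after <code class="language-plaintext highlighter-rouge">rejoin()</code> is dropped, so <code class="language-plaintext highlighter-rouge">matched</code> stays at 0 until a response from the new session arrives.</p>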

<h2 id="summary">Summary</h2>

<p>This bug shows us that when membership changes happen within the same term, relying on term-based validation alone isn’t enough to ensure message freshness. Without explicit session isolation, delayed responses from old membership configurations can corrupt progress tracking.</p>

<p>Fortunately, because Raft’s commit index calculation and overlapping quorum mechanisms provide strong guarantees, this bug doesn’t compromise data safety. The main impact is operational—the symptoms look like data corruption, which can send operations teams down the rabbit hole investigating a data loss problem that doesn’t actually exist.</p>

<p>Production Raft implementations should introduce an explicit session management mechanism, such as membership versioning or generation counters. The preferred approach is to add a <code class="language-plaintext highlighter-rouge">membership_log_id</code> field to messages, which lets the leader clearly distinguish which membership configuration a response comes from.</p>

<p>A complete analysis and a survey of other Raft implementations can be found in the <a href="https://github.com/drmingdrmer/raft-rejoin-bug">Raft Rejoin Bug Survey</a>.</p>

<p>Reference:</p>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="分布式" /><category term="raft" /><category term="en" /><summary type="html"><![CDATA[Analyzes a replication session isolation bug in Raft implementations. When a node rejoins the cluster within the same term, delayed AppendEntries responses can corrupt progress tracking, causing infinite retry loops. While data safety remains intact, it creates operational issues like resource exhaustion. Uses raft-rs as a case study to examine trigger conditions and solutions.]]></summary></entry><entry><title type="html">Raft 中的 IO 执行顺序：内存状态与持久化状态的陷阱</title><link href="https://blog.openacid.com/algo/raft-io-order-complete-cn/" rel="alternate" type="text/html" title="Raft 中的 IO 执行顺序：内存状态与持久化状态的陷阱" /><published>2025-10-09T00:00:00+00:00</published><updated>2025-10-09T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/raft-io-order-complete-cn</id><content type="html" xml:base="https://blog.openacid.com/algo/raft-io-order-complete-cn/"><![CDATA[<p><img src="/post-res/raft-io-order-complete-cn/62b7bb390d222f2e-raft-io-order-fix-banner.webp" alt="" /></p>

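<h2 id="前言">Preface</h2>

<p>When a Raft implementation handles an appendEntries request, it has to persist two kinds of data: the term and the log entries. The Raft paper requires that persistent state be updated before responding to RPCs, but it does not specify the order in which these two kinds of data are persisted. This seemingly minor detail can cause committed data to be lost.</p>

<p>The root of the problem is that the Raft paper describes a simple abstract model with only on-disk state, while real implementations separate in-memory state from persisted state for performance. This separation introduces behavior that the paper does not define, and once IO operations are allowed to be reordered, it can break Raft’s safety guarantees.</p>

<p>This article analyzes how the problem arises and how mainstream implementations (TiKV, HashiCorp Raft, SOFAJRaft) avoid the trap.</p>

<h2 id="内存状态与持久化状态的陷阱">The Trap of In-Memory State vs. Persisted State</h2>

<p>To improve performance, real Raft implementations usually separate the in-memory state (<code class="language-plaintext highlighter-rouge">current_term</code>) from the on-disk state (<code class="language-plaintext highlighter-rouge">persisted_term</code>). A typical flow for handling an appendEntries request is:</p>

<ol>
  <li>On receiving appendEntries, if <code class="language-plaintext highlighter-rouge">req.term &gt; current_term</code>, update <code class="language-plaintext highlighter-rouge">current_term</code> immediately</li>
  <li>Submit the save-term IO asynchronously</li>
  <li>Update <code class="language-plaintext highlighter-rouge">persisted_term</code> after the IO completes (some implementations have no explicit <code class="language-plaintext highlighter-rouge">persisted_term</code>)</li>
</ol>

<p>This separation of state introduces behavior that the Raft paper does not define (the paper only considers on-disk state):</p>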

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">RaftState</span> <span class="p">{</span>
    <span class="c1">// In-memory term, updated immediately when receiving higher term</span>
    <span class="n">current_term</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>

    <span class="c1">// Persisted term on disk, updated only after IO completes</span>
    <span class="n">persisted_term</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

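<p>The flow described above is what common Raft implementations do, and it is correct as long as there is no IO reordering. Once IO operations can be reordered, however, a serious safety problem appears.</p>

<h2 id="问题场景">Problem Scenario</h2>

<p>A concrete timeline shows how IO reordering leads to data loss:</p>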

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Legend:
Ni:   Node i
Vi:   RequestVote, term=i
Li:   Establish Leader, term=i
Ei-j: Log entry, term=i, index=j

N5 |          V5  L5     E5-1     E5-2
N4 |          V5         E5-1     E5-2
N3 |  V1              V5,E5-1  V5,E5-2  E1-1
N2 |  V1      V5                        E1-1
N1 |  V1  L1                            E1-1
------+---+---+---+------+--------+-----+------&gt; time
      t1  t2  t3  t4     t5       t6    t7
</code></pre></div></div>

<ul>
  <li>t1-t4: Two elections take place; N1 (term=1) and N5 (term=5) become leader one after the other</li>
  <li><strong>t5</strong>: L5 replicates E5-1 to N3 (N3 has <code class="language-plaintext highlighter-rouge">current_term=1 &lt; req.term=5</code>)
    <ul>
      <li>N3 has to perform two IOs: persist term=5 and persist E5-1</li>
      <li>It returns success only after both IOs complete</li>
    </ul>
  </li>
  <li><strong>t6</strong>: L5 replicates E5-2 to N3 (the critical moment)
    <ul>
      <li>N3 may still be processing the IOs from t5</li>
      <li>Whether IO reordering is possible at this point is decisive</li>
    </ul>
  </li>
  <li>t7: L1 tries to replicate E1-1 (term=1, index=1)</li>
</ul>

<p><strong>The key is the second AppendEntries request at t6</strong>. Let’s look at how N3’s internal state changes.</p>

<h3 id="t5-时刻第一个-appendentries">t5: The First AppendEntries</h3>

<p>N3 receives <code class="language-plaintext highlighter-rouge">appendEntries(term=5, entries=[E5-1])</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check: RPC term &gt; in-memory term?</span>
    <span class="k">if</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.current_term</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.current_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span><span class="p">;</span>           <span class="c1">// Update memory immediately: 5</span>
        <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_term</span><span class="p">(</span><span class="n">req</span><span class="py">.term</span><span class="p">));</span>    <span class="c1">// Submit IO request</span>
    <span class="p">}</span>

    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>  <span class="c1">// Submit IO request</span>

    <span class="c1">// Wait for both IOs to complete</span>
    <span class="nf">wait_for_both_ios</span><span class="p">();</span>
    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>N3’s state:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">current_term = 5</code> (memory already updated)</li>
  <li><code class="language-plaintext highlighter-rouge">persisted_term = 1</code> (disk not yet updated; the IO is in flight)</li>
  <li>IO queue: <code class="language-plaintext highlighter-rouge">save_term(5)</code>, <code class="language-plaintext highlighter-rouge">save_entries(E5-1)</code></li>
</ul>

<p>This request by itself is handled correctly. The problem shows up at the next step.</p>

<h3 id="t6-时刻第二个-appendentries">t6: The Second AppendEntries</h3>

<p>Before N3 has finished the IOs from t5, it receives <code class="language-plaintext highlighter-rouge">appendEntries(term=5, entries=[E5-2])</code>.</p>

<p>Suppose the code checks only the in-memory <code class="language-plaintext highlighter-rouge">current_term</code> (which is what most implementations do) and submits just the save-entries IO:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check: 5 &gt; 5? No</span>
    <span class="k">if</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.current_term</span> <span class="p">{</span>
        <span class="c1">// Won't enter this branch</span>
    <span class="p">}</span>

    <span class="c1">// Only submit save_entries(E5-2)</span>
    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>

    <span class="c1">// Only wait for save_entries to complete</span>
    <span class="nf">wait_for_io</span><span class="p">(</span><span class="n">save_entries</span><span class="p">);</span>
    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>  <span class="c1">// Return success!</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Here is the problem</strong>: when IO reordering is allowed,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code> completes</li>
  <li>but <code class="language-plaintext highlighter-rouge">save_term(5)</code> may not have completed yet (if the IOs were reordered)</li>
  <li>N3 returns success to the leader</li>
</ul>

<p>If N3 crashes and restarts at this point, the on-disk state may be:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">persisted_term = 1</code> (save_term(5) did not complete)</li>
  <li><code class="language-plaintext highlighter-rouge">entries = [E5-1, E5-2]</code> (both completed)</li>
  <li>Leader L5 considers E5-2 committed</li>
</ul>

<h3 id="t7-时刻数据丢失">t7: Data Loss</h3>

<p>N3’s on-disk state after the restart: <code class="language-plaintext highlighter-rouge">term=1, entries=[E5-1, E5-2]</code></p>

<p>When L1 sends <code class="language-plaintext highlighter-rouge">appendEntries(term=1, entries=[E1-1])</code>:</p>

<ul>
  <li>N3 checks: RPC term (1) == local term (1), so it accepts</li>
  <li>E1-1 overwrites index=1</li>
  <li><strong>E5-1 and E5-2, which N3 had already acknowledged to L5 as persisted, are overwritten</strong></li>
</ul>

<p>Note that if IO reordering is not allowed, the completion of <code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code> at t6 implies that
<code class="language-plaintext highlighter-rouge">save_term(5)</code> has also completed. The condition for a successful appendEntries is then satisfied and no problem occurs.</p>
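<p>The crash scenario can be replayed with a small model. The types <code class="language-plaintext highlighter-rouge">Io</code> and <code class="language-plaintext highlighter-rouge">Disk</code> and the function <code class="language-plaintext highlighter-rouge">disk_after_crash</code> are hypothetical and exist only for this sketch; it assumes that only IOs that completed before the crash survive, and it checks that no entry on disk carries a term above the persisted term:</p>

```rust
/// One persistence operation submitted by a follower (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Io {
    SaveTerm(u64),
    SaveEntry { term: u64, index: u64 },
}

/// What survives a crash: only the effects of IOs that completed.
#[derive(Debug, PartialEq)]
pub struct Disk {
    pub persisted_term: u64,
    pub entries: Vec<(u64, u64)>, // (term, index)
}

impl Disk {
    /// True if no entry on disk has a term above the persisted term.
    pub fn is_consistent(&self) -> bool {
        self.entries.iter().all(|&(term, _)| term <= self.persisted_term)
    }
}

/// Rebuilds the on-disk state from the IOs that completed before the crash,
/// in the order in which they completed.
pub fn disk_after_crash(initial_term: u64, completed: &[Io]) -> Disk {
    let mut disk = Disk {
        persisted_term: initial_term,
        entries: Vec::new(),
    };
    for io in completed {
        match *io {
            Io::SaveTerm(t) => disk.persisted_term = disk.persisted_term.max(t),
            Io::SaveEntry { term, index } => disk.entries.push((term, index)),
        }
    }
    disk
}
```

<p>If only the two <code class="language-plaintext highlighter-rouge">SaveEntry</code> IOs completed before the crash, the rebuilt disk has <code class="language-plaintext highlighter-rouge">persisted_term = 1</code> and <code class="language-plaintext highlighter-rouge">is_consistent()</code> is false. If <code class="language-plaintext highlighter-rouge">SaveTerm(5)</code> completed first, the check passes.</p>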

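<h2 id="问题的本质">The Essence of the Problem</h2>

<p>If IO reordering is allowed, the implementation must check <code class="language-plaintext highlighter-rouge">persisted_term</code> to decide whether to issue a save-term IO. If IO reordering is not allowed, checking <code class="language-plaintext highlighter-rouge">current_term</code> is enough.</p>

<p>The Raft paper does not distinguish in-memory state from persisted state, so this is an implementation-level trap. The paper requires “Before responding to RPCs, a server must update its persistent state”, and an implementation needs a more precise statement: <strong>success may be returned only after every IO needed to make <code class="language-plaintext highlighter-rouge">persisted_term &gt;= req.term</code> has completed</strong>.</p>

<h2 id="正确的做法">The Correct Approach</h2>

<p>Check the persisted term instead of the in-memory term:</p>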

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check persisted term, not in-memory term!</span>
    <span class="k">let</span> <span class="n">need_save_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.persisted_term</span><span class="p">;</span>

    <span class="k">if</span> <span class="n">need_save_term</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.current_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span><span class="p">;</span>
        <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_term</span><span class="p">(</span><span class="n">req</span><span class="py">.term</span><span class="p">));</span>
    <span class="p">}</span>

    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>

    <span class="k">if</span> <span class="n">need_save_term</span> <span class="p">{</span>
        <span class="nf">wait_for_both_ios</span><span class="p">();</span>  <span class="c1">// Must wait for save_term to complete</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="nf">wait_for_io</span><span class="p">(</span><span class="n">save_entries</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: this implementation may submit the save-term IO more than once, so a real implementation needs to optimize this carefully.</p>
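<p>One possible way to avoid the duplicate submissions is to track the highest term already handed to the IO layer separately from the highest term known to be durable. The sketch below is an assumption of this rewrite, not something taken from the implementations discussed here; <code class="language-plaintext highlighter-rouge">TermTracker</code>, <code class="language-plaintext highlighter-rouge">IoPlan</code>, and <code class="language-plaintext highlighter-rouge">submitted_term</code> are hypothetical names, and it presumes the caller can wait on a save-term IO that is already in flight:</p>

```rust
/// Tracks which term has been requested for persistence and which term is
/// known to be durable. Hypothetical helper for illustration only.
#[derive(Debug)]
pub struct TermTracker {
    pub submitted_term: u64, // highest term handed to the IO layer
    pub persisted_term: u64, // highest term whose IO has completed
}

/// Decision for one appendEntries request.
#[derive(Debug, PartialEq)]
pub struct IoPlan {
    pub submit_save_term: bool,
    pub wait_for_save_term: bool,
}

impl TermTracker {
    /// Decides whether to submit a new save-term IO and whether the reply
    /// must wait until the term is durable.
    pub fn plan(&mut self, req_term: u64) -> IoPlan {
        let submit = req_term > self.submitted_term;
        if submit {
            self.submitted_term = req_term;
        }
        IoPlan {
            submit_save_term: submit,
            // Waiting depends on durability, not on what is in memory.
            wait_for_save_term: req_term > self.persisted_term,
        }
    }

    /// Called by the IO layer when a save-term IO has completed.
    pub fn on_save_term_done(&mut self, term: u64) {
        self.persisted_term = self.persisted_term.max(term);
    }
}
```

<p>In the t5/t6 scenario, the first request with term 5 submits and waits, the second one only waits, and once <code class="language-plaintext highlighter-rouge">on_save_term_done(5)</code> has run, later requests with term 5 neither submit nor wait.</p>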

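<h2 id="主流实现的方案">How Mainstream Implementations Handle It</h2>

<p>Mainstream implementations (TiKV, HashiCorp Raft, SOFAJRaft) prevent save-term and save-entries from being reordered, so checking only <code class="language-plaintext highlighter-rouge">current_term</code> is safe for them:</p>

<ol>
  <li>
    <p><strong>Atomic batching (TiKV)</strong>: save-term and save-entries are placed in a single IO request and submitted in one shot. The situation where the second appendEntries submits only save_entries simply cannot occur.</p>
  </li>
  <li>
    <p><strong>Ordered separation (HashiCorp Raft)</strong>: save-term and save-entries run sequentially and are never reordered. The fsync of the term completes first (with a panic on failure), and only then is the log written.</p>
  </li>
  <li>
    <p><strong>Hybrid ordering (SOFAJRaft)</strong>: the term is written synchronously (blocking until fsync returns), while the log is batched asynchronously. This guarantees that save_entries is enqueued only after save_term has completed.</p>
  </li>
</ol>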

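<h2 id="总结">Summary</h2>

<p>There is a subtle mapping between the Raft paper’s abstract model (persisted state only) and real implementations (in-memory state plus persisted state).</p>

<p><strong>Key invariant</strong>: log entry (term=T) is on disk → persisted_term ≥ T must also be on disk</p>

<p>Two ways to maintain this invariant:</p>

<ol>
  <li><strong>Eliminate IO reordering</strong>: atomic batching, ordered execution, or a hybrid approach (mainstream implementations)</li>
  <li><strong>Handle IO reordering</strong>: check the persisted state and wait for the necessary IOs to complete</li>
</ol>

<h2 id="相关资源">Related Resources</h2>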

<ul>
  <li><a href="https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md">OpenRaft docs: io-ordering</a></li>
  <li><a href="https://github.com/tikv/tikv">tikv/tikv</a></li>
  <li><a href="https://github.com/hashicorp/raft">hashicorp/raft</a></li>
  <li><a href="https://github.com/sofastack/sofa-jraft">sofastack/sofa-jraft</a></li>
</ul>

<p>Reference:</p>

<ul>
  <li>
    <p>OpenRaft docs: io-ordering : <a href="https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md">https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md</a></p>
  </li>
  <li>
    <p>hashicorp/raft : <a href="https://github.com/hashicorp/raft">https://github.com/hashicorp/raft</a></p>
  </li>
  <li>
    <p>sofastack/sofa-jraft : <a href="https://github.com/sofastack/sofa-jraft">https://github.com/sofastack/sofa-jraft</a></p>
  </li>
  <li>
    <p>tikv/tikv : <a href="https://github.com/tikv/tikv">https://github.com/tikv/tikv</a></p>
  </li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="分布式" /><category term="raft" /><category term="cn" /><summary type="html"><![CDATA[深入分析 Raft 实现中 IO 重排序导致数据丢失的问题。问题不在 Raft 的设计，而在于实现中内存状态与持久化状态的区分导致的陷阱]]></summary></entry><entry><title type="html">Raft Configuration Change with Single Log Entry</title><link href="https://blog.openacid.com/algo/single-log-joint/" rel="alternate" type="text/html" title="Raft Configuration Change with Single Log Entry" /><published>2025-10-07T00:00:00+00:00</published><updated>2025-10-07T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/single-log-joint</id><content type="html" xml:base="https://blog.openacid.com/algo/single-log-joint/"><![CDATA[<p><img src="/post-res/single-log-joint/c915c4fcc98591ed-single-log-joint-banner.webp" alt="" /></p>

<h1 id="preface">Preface</h1>

<p><strong>TL;DR</strong></p>

<p>Standard Raft configuration changes use two log entries with multi-phase commits and careful state management. Can we complete a configuration change with just one log entry? We’ll introduce <strong>effective-config</strong>, prove its correctness, then discover why the simple approach isn’t so simple after all. The standard Joint Consensus method wins for good reasons.</p>

<p><strong>What We’ll Cover</strong></p>

<ol>
  <li>How Raft’s Joint Consensus works (the two-phase approach)</li>
  <li>The single-log-entry idea and its mechanics</li>
  <li>Why it’s theoretically correct</li>
  <li>Why it’s practically problematic (and the patches we’d need)</li>
  <li>Why we should stick with Joint Consensus</li>
</ol>

<h1 id="introduction-to-raft-joint-consensus-2-config-log-entries">Introduction to Raft Joint Consensus: 2 Config Log Entries</h1>

<p>Changing cluster membership in Raft is tricky. Switching from the old configuration <code class="language-plaintext highlighter-rouge">{a,b,c}</code> to a new one <code class="language-plaintext highlighter-rouge">{x,y,z}</code> in one step is dangerous.</p>

<p>Nodes can’t all switch configurations at the exact same moment. During the transition, some nodes (say <code class="language-plaintext highlighter-rouge">a,b</code>) might still be using <code class="language-plaintext highlighter-rouge">C_old</code> while others (<code class="language-plaintext highlighter-rouge">x,y,z</code>) have moved to <code class="language-plaintext highlighter-rouge">C_new</code>. If these two groups don’t overlap—meaning a quorum from <code class="language-plaintext highlighter-rouge">C_old</code> (like <code class="language-plaintext highlighter-rouge">{a,b}</code>) and a quorum from <code class="language-plaintext highlighter-rouge">C_new</code> (like <code class="language-plaintext highlighter-rouge">{x,y}</code>) share no common nodes—we could elect two leaders in the same term, violating Raft’s fundamental safety guarantee.</p>

<p>The Raft paper solves this with a two-phase protocol called <strong>Joint Consensus</strong>:</p>

<p><img src="/post-res/single-log-joint/a7acea752fd84833-raft-joint.x.svg" alt="Figure 1: Joint Consensus Two-Phase Process" /></p>

<ol>
  <li>
    <p><strong>Phase 1: Enter the Joint phase (<code class="language-plaintext highlighter-rouge">C_old_new</code>)</strong>
When the leader receives a configuration change request, it writes a log entry containing <code class="language-plaintext highlighter-rouge">C_old_new</code>—a joint configuration that includes both old and new members. In this state, any decision (like committing a log entry) needs approval from a quorum of <code class="language-plaintext highlighter-rouge">C_old</code> <em>and</em> a quorum of <code class="language-plaintext highlighter-rouge">C_new</code>. The leader starts using <code class="language-plaintext highlighter-rouge">C_old_new</code> as soon as it writes this entry to its own log.</p>
  </li>
  <li>
    <p><strong>Phase 2: Move to the new configuration (<code class="language-plaintext highlighter-rouge">C_new</code>)</strong>
Once <code class="language-plaintext highlighter-rouge">C_old_new</code> commits, the leader writes a second log entry containing just <code class="language-plaintext highlighter-rouge">C_new</code>. From this point forward, the leader uses only <code class="language-plaintext highlighter-rouge">C_new</code>, and all subsequent log entries need only commit on a <code class="language-plaintext highlighter-rouge">C_new</code> quorum. When this second entry commits, the configuration change is complete.</p>
  </li>
</ol>

<p>The intermediate joint phase ensures that any two quorums—whether based on <code class="language-plaintext highlighter-rouge">C_old</code>, <code class="language-plaintext highlighter-rouge">C_new</code>, or <code class="language-plaintext highlighter-rouge">C_old_new</code>—must overlap, preventing split brain. This requires <strong>two</strong> log entries for each configuration change.</p>
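<p>The quorum rules above can be checked with a short sketch. The names <code class="language-plaintext highlighter-rouge">NodeSet</code>, <code class="language-plaintext highlighter-rouge">is_quorum</code>, <code class="language-plaintext highlighter-rouge">is_joint_quorum</code>, and <code class="language-plaintext highlighter-rouge">nodes</code> are hypothetical and introduced only for this illustration, which assumes simple majority quorums and single-character node ids:</p>

```rust
use std::collections::BTreeSet;

/// A set of node ids, e.g. {'a', 'b', 'c'}.
pub type NodeSet = BTreeSet<char>;

/// True if `granted` contains a strict majority of `config`.
pub fn is_quorum(config: &NodeSet, granted: &NodeSet) -> bool {
    config.intersection(granted).count() > config.len() / 2
}

/// Joint rule: a majority of C_old and a majority of C_new must both agree.
pub fn is_joint_quorum(c_old: &NodeSet, c_new: &NodeSet, granted: &NodeSet) -> bool {
    is_quorum(c_old, granted) && is_quorum(c_new, granted)
}

/// Convenience constructor: `nodes("abc")` builds {'a', 'b', 'c'}.
pub fn nodes(ids: &str) -> NodeSet {
    ids.chars().collect()
}
```

<p>With <code class="language-plaintext highlighter-rouge">C_old = {a,b,c}</code> and <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code>, the sets <code class="language-plaintext highlighter-rouge">{a,b}</code> and <code class="language-plaintext highlighter-rouge">{x,y}</code> are each a quorum of one configuration and share no node, which is the split-brain risk of a one-step switch. Neither of them passes <code class="language-plaintext highlighter-rouge">is_joint_quorum</code>, while <code class="language-plaintext highlighter-rouge">{a,b,x,y}</code> does.</p>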

<h1 id="can-we-do-it-with-just-one-log-entry">Can We Do It With Just One Log Entry?</h1>

<p>Can we do this safely with just <strong>one</strong> log entry?</p>

<p>We need a new concept: <strong>effective-config</strong>. This is the configuration the leader <em>actually uses</em> to determine if log entries are committed. It might not match any specific configuration stored in a log entry—it’s a runtime state that changes as the configuration change progresses.</p>

<h2 id="terminology">Terminology</h2>

<ul>
  <li><strong>effective-config</strong>: The runtime configuration the leader uses to determine if entries are committed</li>
  <li><strong>Joint config</strong>: A configuration containing both old and new members, like <code class="language-plaintext highlighter-rouge">C_old_new = [{a,b,c}, {x,y,z}]</code></li>
  <li><strong>Uniform config</strong>: A configuration with just one set of members, like <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code></li>
  <li><strong>Barrier entry</strong>: A marker log entry that signals the joint phase has safely ended</li>
</ul>

<h2 id="how-it-works">How It Works</h2>

<ul>
  <li>
    <p><strong>Starting point</strong>: The cluster is running with <code class="language-plaintext highlighter-rouge">C_old = {a,b,c}</code>, and that configuration has been committed. The effective-config is <code class="language-plaintext highlighter-rouge">C_old</code>.</p>

    <p><img src="/post-res/single-log-joint/667db9f105260fea-single-1-start.x.svg" alt="Figure 2: Single Log Entry Change - Initial State" /></p>
  </li>
  <li>
    <p><strong>Propose the change</strong>: To change to <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code>, the leader writes a single log entry <code class="language-plaintext highlighter-rouge">entry-i</code> containing just <code class="language-plaintext highlighter-rouge">C_new</code>.</p>
  </li>
  <li>
    <p><strong>Enter joint mode immediately</strong>: The moment the leader appends <code class="language-plaintext highlighter-rouge">entry-i</code> to its own log—before it commits, before it replicates—the leader switches its effective-config to the joint configuration <code class="language-plaintext highlighter-rouge">C_old_new = [{a,b,c}, {x,y,z}]</code>. Now <code class="language-plaintext highlighter-rouge">entry-i</code> and all subsequent entries must commit on a quorum from <em>both</em> <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and <code class="language-plaintext highlighter-rouge">{x,y,z}</code>.</p>

    <p><img src="/post-res/single-log-joint/64bd23d7d6bd8fd7-single-2-joint.x.svg" alt="Figure 3: Single Log Entry Change - Entering Joint Phase" /></p>
  </li>
  <li>
    <p><strong>Normal operation continues</strong>: The cluster keeps processing requests. Every entry commits using the joint quorum rules.</p>
  </li>
  <li>
    <p><strong>Exit joint mode</strong>: Once <code class="language-plaintext highlighter-rouge">entry-i</code> commits under <code class="language-plaintext highlighter-rouge">C_old_new</code>, the leader switches effective-config to <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code>. All subsequent entries need only a <code class="language-plaintext highlighter-rouge">C_new</code> quorum.</p>
  </li>
</ul>

<p>With one log entry, the system transitions through three states: <code class="language-plaintext highlighter-rouge">C_old → C_old_new → C_new</code>.</p>
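<p>The commit rule for each of these states can be sketched in a few lines of Rust. This is only an illustration: the names <code class="language-plaintext highlighter-rouge">EffectiveConfig</code>, <code class="language-plaintext highlighter-rouge">has_majority</code>, <code class="language-plaintext highlighter-rouge">is_quorum</code> and <code class="language-plaintext highlighter-rouge">members</code> are hypothetical and not taken from any particular Raft library. A uniform config needs a majority of one member set, while the joint config needs a majority of each set.</p>

```rust
use std::collections::BTreeSet;

/// A set of voting members, e.g. {a, b, c}. (Hypothetical type for this sketch.)
type Members = BTreeSet<&'static str>;

/// The configuration a leader uses to decide whether an entry is committed.
#[derive(Debug, Clone, PartialEq)]
enum EffectiveConfig {
    /// One member set: C_old or C_new.
    Uniform(Members),
    /// Two member sets: C_old_new. A majority is required in each of them.
    Joint(Members, Members),
}

/// Returns true if `acks` contains a strict majority of `members`.
fn has_majority(members: &Members, acks: &Members) -> bool {
    let granted = members.intersection(acks).count();
    granted * 2 > members.len()
}

impl EffectiveConfig {
    /// Returns true if the nodes in `acks` are enough to commit an entry.
    fn is_quorum(&self, acks: &Members) -> bool {
        match self {
            EffectiveConfig::Uniform(m) => has_majority(m, acks),
            EffectiveConfig::Joint(old, new) => {
                has_majority(old, acks) && has_majority(new, acks)
            }
        }
    }
}

/// Helper for building a member set from a list of node names.
fn members(ids: &[&'static str]) -> Members {
    ids.iter().copied().collect()
}
```

<p>Under this sketch, acknowledgements from <code class="language-plaintext highlighter-rouge">{a,b,x}</code> do not commit an entry in joint mode, because only one of the three new members has acknowledged it; adding <code class="language-plaintext highlighter-rouge">y</code> completes the joint quorum.</p>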

<h3 id="correctness-proof">Correctness Proof</h3>

<p>We need to show that we can’t elect two leaders—neither during the configuration change nor afterward.</p>

<p>Assume leader <code class="language-plaintext highlighter-rouge">t</code> is doing the configuration change (writing <code class="language-plaintext highlighter-rouge">entry-i</code>). Later, some candidate <code class="language-plaintext highlighter-rouge">u</code> tries to get elected in term <code class="language-plaintext highlighter-rouge">u &gt; t</code>. We prove <code class="language-plaintext highlighter-rouge">t</code> and <code class="language-plaintext highlighter-rouge">u</code> can’t both be leaders.</p>

<p><strong>Analyzing candidate <code class="language-plaintext highlighter-rouge">u</code>’s election</strong></p>

<p>Candidate <code class="language-plaintext highlighter-rouge">u</code> either has <code class="language-plaintext highlighter-rouge">entry-i</code> in its log or it doesn’t.</p>

<ul>
  <li>
    <p><strong>Case 1: <code class="language-plaintext highlighter-rouge">u</code> has <code class="language-plaintext highlighter-rouge">entry-i</code></strong></p>

    <p>Then <code class="language-plaintext highlighter-rouge">u</code>’s effective-config includes <code class="language-plaintext highlighter-rouge">{x,y,z}</code>. Leader <code class="language-plaintext highlighter-rouge">t</code>’s effective-config is either <code class="language-plaintext highlighter-rouge">C_old_new = [{a,b,c}, {x,y,z}]</code> (still in joint mode) or <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code> (finished). Either way, it includes <code class="language-plaintext highlighter-rouge">{x,y,z}</code>.</p>

    <p>Since <code class="language-plaintext highlighter-rouge">u</code> needs a quorum from <code class="language-plaintext highlighter-rouge">{x,y,z}</code> to get elected, and <code class="language-plaintext highlighter-rouge">t</code> needs a quorum from <code class="language-plaintext highlighter-rouge">{x,y,z}</code> to stay leader, these quorums must overlap. No split brain.</p>
  </li>
  <li>
    <p><strong>Case 2: <code class="language-plaintext highlighter-rouge">u</code> doesn’t have <code class="language-plaintext highlighter-rouge">entry-i</code></strong></p>

    <p>Then <code class="language-plaintext highlighter-rouge">u</code>’s effective-config is <code class="language-plaintext highlighter-rouge">C_old = {a,b,c}</code>. Now we consider where leader <code class="language-plaintext highlighter-rouge">t</code> is:</p>

    <ul>
      <li>
        <p>If <code class="language-plaintext highlighter-rouge">t</code>’s effective-config is <code class="language-plaintext highlighter-rouge">C_old_new</code>, then <code class="language-plaintext highlighter-rouge">t</code> needs a quorum from <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and <code class="language-plaintext highlighter-rouge">u</code> needs a quorum from <code class="language-plaintext highlighter-rouge">{a,b,c}</code>. These must overlap. No split brain.</p>
      </li>
      <li>
        <p>If <code class="language-plaintext highlighter-rouge">t</code>’s effective-config is <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code>, that means <code class="language-plaintext highlighter-rouge">entry-i</code> committed under <code class="language-plaintext highlighter-rouge">C_old_new</code>. So <code class="language-plaintext highlighter-rouge">entry-i</code> must exist on a quorum of <code class="language-plaintext highlighter-rouge">{a,b,c}</code>. Those nodes have logs at least as long as index <code class="language-plaintext highlighter-rouge">i</code>.</p>

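        <p>But <code class="language-plaintext highlighter-rouge">u</code> doesn’t have <code class="language-plaintext highlighter-rouge">entry-i</code>, so its log is shorter than <code class="language-plaintext highlighter-rouge">i</code>. When <code class="language-plaintext highlighter-rouge">u</code> requests votes from nodes in <code class="language-plaintext highlighter-rouge">{a,b,c}</code>, every node that holds <code class="language-plaintext highlighter-rouge">entry-i</code> rejects it because its own log is more up-to-date. Since those nodes form a quorum of <code class="language-plaintext highlighter-rouge">{a,b,c}</code>, <code class="language-plaintext highlighter-rouge">u</code> cannot collect a majority there. The election fails.</p>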
      </li>
    </ul>
  </li>
</ul>

<p>In every case, we can’t have both <code class="language-plaintext highlighter-rouge">t</code> and <code class="language-plaintext highlighter-rouge">u</code> as leaders. The algorithm is safe.</p>
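<p>Case 2 relies on Raft’s voting rule: a voter grants its vote only if the candidate’s log is at least as up-to-date as its own, comparing the term of the last entry first and the last index second. A minimal sketch of that comparison, with the hypothetical names <code class="language-plaintext highlighter-rouge">LogId</code> and <code class="language-plaintext highlighter-rouge">candidate_is_up_to_date</code> chosen for this example:</p>

```rust
/// Identifies the last entry of a log. (Hypothetical type for this sketch.)
/// The derived ordering compares `term` first and `index` second,
/// because fields are compared in declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// Raft's voting rule: grant the vote only if the candidate's log is
/// at least as up-to-date as the voter's log.
fn candidate_is_up_to_date(candidate_last: LogId, voter_last: LogId) -> bool {
    candidate_last >= voter_last
}
```

<p>A voter whose log ends with <code class="language-plaintext highlighter-rouge">entry-i</code> therefore rejects a candidate whose log ends before index <code class="language-plaintext highlighter-rouge">i</code> in the same term.</p>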

<p>The algorithm is correct in theory, but a real implementation runs into the problems described below.</p>

<h2 id="problem-1-the-memory-only-transition">Problem 1: The Memory-Only Transition</h2>

<p>When we move from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>, we only change the in-memory effective-config. Nothing hits disk. This creates trouble.</p>

<p>Because the switch leaves no trace in the log, <code class="language-plaintext highlighter-rouge">C_old</code> logs are as long as <code class="language-plaintext highlighter-rouge">C_new</code> logs, so nodes from <code class="language-plaintext highlighter-rouge">C_old</code> can still initiate elections and compete with <code class="language-plaintext highlighter-rouge">C_new</code> nodes. Even after the configuration change completes, <code class="language-plaintext highlighter-rouge">C_old</code> nodes can steal leadership from <code class="language-plaintext highlighter-rouge">C_new</code> nodes, which means a node intended for removal can still become leader. The root cause is that the state change is not recorded on the persistent layer.</p>

<p>Compare this to standard Joint Consensus: it writes a second log entry containing <code class="language-plaintext highlighter-rouge">C_new</code>. That entry acts as a barrier. Nodes from <code class="language-plaintext highlighter-rouge">C_old</code> have shorter logs and lose elections. The single-entry approach has no such barrier—the transition from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code> is invisible on disk.</p>

<p>Look at the diagram below. The cluster transitions from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>, but no logs change. Leadership moves to node <code class="language-plaintext highlighter-rouge">x</code> in <code class="language-plaintext highlighter-rouge">{x,y,z}</code>. But nodes from <code class="language-plaintext highlighter-rouge">C_old</code> can still start elections and steal leadership from <code class="language-plaintext highlighter-rouge">x</code>.</p>

<p><img src="/post-res/single-log-joint/3e6fb36ca8cf87de-single-3-elect.x.svg" alt="Figure 4: Patch-1 Persistent Layer Problem Example" /></p>

<p><strong>Patch-1</strong>: After entering <code class="language-plaintext highlighter-rouge">C_new</code>, immediately append a no-op entry. This lengthens the logs of <code class="language-plaintext highlighter-rouge">C_new</code> nodes, blocking elections from <code class="language-plaintext highlighter-rouge">C_old</code> nodes.</p>

<h2 id="problem-2-the-restart-ambiguity">Problem 2: The Restart Ambiguity</h2>

<p>When a node restarts, it can’t tell if the cluster is in joint mode or has finished the change.</p>

<ul>
  <li>
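    <p>The restarting node reads its log. It sees two config entries: <code class="language-plaintext highlighter-rouge">entry-i</code> containing <code class="language-plaintext highlighter-rouge">C_old</code> and <code class="language-plaintext highlighter-rouge">entry-j</code> containing <code class="language-plaintext highlighter-rouge">C_new</code>. (From here on, <code class="language-plaintext highlighter-rouge">entry-j</code> is the new-config entry that the sections above called <code class="language-plaintext highlighter-rouge">entry-i</code>.)</p>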
  </li>
  <li>
    <p>We know <code class="language-plaintext highlighter-rouge">entry-i</code> is committed (Raft requires it before starting a new change).</p>
  </li>
  <li>
    <p>But what about <code class="language-plaintext highlighter-rouge">entry-j</code>? The node can’t tell just from its local log:</p>

    <ul>
      <li>If <code class="language-plaintext highlighter-rouge">entry-j</code> isn’t committed yet, the cluster is in joint mode with effective-config <code class="language-plaintext highlighter-rouge">C_old_new</code></li>
      <li>If <code class="language-plaintext highlighter-rouge">entry-j</code> is committed, the cluster is using <code class="language-plaintext highlighter-rouge">C_new</code></li>
    </ul>
  </li>
</ul>

<p>Without talking to other nodes, there’s no way to know.</p>

<p><img src="/post-res/single-log-joint/004bf3df9c4de529-restart.x.svg" alt="Figure 5: Patch-2 New Node Restart State Example" /></p>

<p>In the diagram above, even if <code class="language-plaintext highlighter-rouge">entry-3</code> has committed, the restarting nodes <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code> can’t tell whether the cluster is in joint mode or using the new configuration. (Nodes <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">z</code> never received <code class="language-plaintext highlighter-rouge">entry-3</code> and are still using <code class="language-plaintext highlighter-rouge">{a,b,c}</code>.)</p>

<p><strong>Patch-2</strong>: Always start in joint mode after a restart.</p>

<ol>
  <li>When a node starts up, it sets effective-config to the joint configuration formed from the last two config entries in its log</li>
  <li>It uses this joint config for elections and normal operation</li>
  <li>Only after confirming that the latest config entry has committed under the joint configuration can it switch to the new configuration</li>
</ol>

<p><strong>Example</strong>: A node sees configs <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and <code class="language-plaintext highlighter-rouge">{u,v,w}</code> in its log. It starts with effective-config <code class="language-plaintext highlighter-rouge">[{a,b,c}, {u,v,w}]</code>. To become leader, it needs quorums from both groups. Only after it confirms the new config committed under the joint rules can it switch to just <code class="language-plaintext highlighter-rouge">{u,v,w}</code>.</p>
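<p>A minimal sketch of the Patch-2 startup rule, assuming a simplified log in which each entry is either a config entry or a normal entry. The names <code class="language-plaintext highlighter-rouge">Entry</code>, <code class="language-plaintext highlighter-rouge">EffectiveConfig</code> and <code class="language-plaintext highlighter-rouge">effective_config_on_restart</code> are hypothetical:</p>

```rust
use std::collections::BTreeSet;

/// A set of voting members. (Hypothetical type for this sketch.)
type Members = BTreeSet<&'static str>;

/// A simplified log entry: a membership config or anything else.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Config(Members),
    Normal,
}

#[derive(Debug, Clone, PartialEq)]
enum EffectiveConfig {
    Uniform(Members),
    Joint(Members, Members),
}

/// Patch-2: after a restart, start from the joint of the last two config
/// entries found in the local log. Returns None if the log has no config.
fn effective_config_on_restart(log: &[Entry]) -> Option<EffectiveConfig> {
    // Walk the log backwards, keeping only config entries.
    let mut configs = log.iter().rev().filter_map(|e| match e {
        Entry::Config(m) => Some(m.clone()),
        Entry::Normal => None,
    });
    let newest = configs.next()?;
    match configs.next() {
        Some(previous) => Some(EffectiveConfig::Joint(previous, newest)),
        // Only one config entry exists, so there is nothing to join with.
        None => Some(EffectiveConfig::Uniform(newest)),
    }
}

/// Helper for building a member set from a list of node names.
fn members(ids: &[&'static str]) -> Members {
    ids.iter().copied().collect()
}
```

<p>Note that this rule never returns the uniform new config while an older config entry is still in the log. The node has to learn from its peers that the latest config entry committed before it can leave joint mode.</p>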

<h2 id="problem-3-calling-home-to-dead-nodes">Problem 3: Calling Home to Dead Nodes</h2>

<p>Patch-2 solves the ambiguity problem but creates a worse one: <strong>nodes might try to contact old cluster members that no longer exist, making elections impossible</strong>.</p>

<p><strong>Example</strong>:</p>

<p><img src="/post-res/single-log-joint/0507172bcfea4f35-restart-after-uniform.x.svg" alt="Figure 6: Regression to C-old-new After Restart" /></p>

<ol>
  <li>
    <p>The cluster changes from <code class="language-plaintext highlighter-rouge">{a,b,c}</code> to <code class="language-plaintext highlighter-rouge">{x,y,z}</code></p>
  </li>
  <li>
    <p>The config entry commits under <code class="language-plaintext highlighter-rouge">C_old_new</code></p>
  </li>
  <li>
    <p>The cluster transitions to <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code></p>
  </li>
  <li>
    <p>Nodes <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code> are no longer members. They get shut down, their data gets wiped, and they’re gone</p>
  </li>
  <li>
    <p>Then something happens and all remaining nodes restart</p>
  </li>
  <li>
    <p>Node <code class="language-plaintext highlighter-rouge">x</code> restarts and follows Patch-2: it sees configs <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and <code class="language-plaintext highlighter-rouge">{x,y,z}</code> in its log, so it sets effective-config to <code class="language-plaintext highlighter-rouge">[{a,b,c}, {x,y,z}]</code></p>
  </li>
  <li>
    <p>Node <code class="language-plaintext highlighter-rouge">x</code> tries to run an election, but <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, and <code class="language-plaintext highlighter-rouge">c</code> don’t exist anymore! It can’t get a quorum from <code class="language-plaintext highlighter-rouge">{a,b,c}</code>, so the joint quorum is unreachable. The election fails. The cluster is stuck.</p>
  </li>
</ol>

<p>This is state regression. The transition from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code> wasn’t persisted, so after a restart, the system rolls back to needing <code class="language-plaintext highlighter-rouge">C_old</code>.</p>

<h2 id="adding-a-barrier-to-prevent-regression">Adding a Barrier to Prevent Regression</h2>

<p>Restarting nodes need to <strong>know for certain</strong> that the joint phase has ended—proof that it’s safe to use <code class="language-plaintext highlighter-rouge">C_new</code> without calling back to <code class="language-plaintext highlighter-rouge">C_old</code>.</p>

<p><strong>Patch-3: Add a barrier entry</strong></p>

<p>After <code class="language-plaintext highlighter-rouge">entry-j</code> (containing <code class="language-plaintext highlighter-rouge">C_new</code>) commits under <code class="language-plaintext highlighter-rouge">C_old_new</code>, append a special <strong>barrier entry</strong> to mark that <code class="language-plaintext highlighter-rouge">entry-j</code> has committed.</p>

<blockquote>
  <p><strong>Important</strong>: The barrier must come <em>after</em> <code class="language-plaintext highlighter-rouge">entry-j</code> commits. Otherwise it can’t serve as proof of the commit.</p>
</blockquote>

<p>When a restarting node sees this barrier, it knows the joint phase ended successfully. It can safely use <code class="language-plaintext highlighter-rouge">C_new</code> for elections without trying to contact old nodes that might not exist anymore.</p>

<p>In the diagram below, when <code class="language-plaintext highlighter-rouge">entry-3</code> commits under <code class="language-plaintext highlighter-rouge">C_old_new</code>, we add barrier <code class="language-plaintext highlighter-rouge">entry-4</code>:</p>

<p><img src="/post-res/single-log-joint/a1b3d0adfde6319d-barrier.x.svg" alt="Figure 7: Patch-3 Introducing Barrier Entry Process" /></p>

<p>Now when all nodes restart, there’s no regression. Nodes <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> see the barrier, so they use <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code> directly. Even though <code class="language-plaintext highlighter-rouge">b</code> and <code class="language-plaintext highlighter-rouge">c</code> are gone, <code class="language-plaintext highlighter-rouge">x</code> or <code class="language-plaintext highlighter-rouge">y</code> can still get elected:</p>

<p><img src="/post-res/single-log-joint/431e950cac307092-barrier-restart.x.svg" alt="Figure 8: Barrier Entry After Restart" /></p>

<blockquote>
  <p><strong>Alternative: Persisting commit-index</strong></p>

  <p>Instead of a barrier entry, we could persist the commit-index—an idea from <a href="https://weibo.com/u/1516609505">Ma Jianjiang</a>.</p>

  <p>The rule: the joint phase ends once a commit-index that covers the config entry has been persisted on a quorum of <code class="language-plaintext highlighter-rouge">C_new</code>. To make this work, we’d need to persist commit-index (standard Raft doesn’t require this).</p>

  <p>When a node restarts, it checks: if the persisted commit-index covers the config change entry, it knows <code class="language-plaintext highlighter-rouge">C_old_new</code> finished and can safely use <code class="language-plaintext highlighter-rouge">C_new</code>. No need to contact old nodes.</p>

  <p>But this still has Problem 1: <code class="language-plaintext highlighter-rouge">C_old</code> and <code class="language-plaintext highlighter-rouge">C_new</code> nodes compete for leadership. Here’s why: <code class="language-plaintext highlighter-rouge">C_new</code> nodes don’t have extra log entries, and persisting the commit-index on <code class="language-plaintext highlighter-rouge">C_new</code> alone doesn’t guarantee that <code class="language-plaintext highlighter-rouge">C_old</code> nodes ever learn of it. This is the classic distributed systems dilemma of at-least-once vs at-most-once delivery:</p>

  <ul>
    <li><strong>At-least-once</strong> (commit on <code class="language-plaintext highlighter-rouge">C_old_new</code>): commit-index might succeed, then <code class="language-plaintext highlighter-rouge">C_old</code> nodes get decommissioned, then we can’t commit it again to reach them. We’re stuck.</li>
    <li><strong>At-most-once</strong> (commit on <code class="language-plaintext highlighter-rouge">C_new</code> only): commit-index reaches <code class="language-plaintext highlighter-rouge">C_new</code> but might not reach <code class="language-plaintext highlighter-rouge">C_old</code>. Those nodes don’t know the cluster moved on, so they keep trying to run elections.</li>
  </ul>

  <p>Either way, we can still end up with <code class="language-plaintext highlighter-rouge">C_old</code> and <code class="language-plaintext highlighter-rouge">C_new</code> nodes competing for leadership.</p>
</blockquote>

<p>So here’s what the <strong>patched single-log approach</strong> looks like:</p>

<ol>
  <li>
    <p>Start with <code class="language-plaintext highlighter-rouge">effective-config = C_old = {a,b,c}</code></p>
  </li>
  <li>
    <p>Leader writes <code class="language-plaintext highlighter-rouge">entry-j</code> containing <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code> and immediately switches <code class="language-plaintext highlighter-rouge">effective-config</code> to <code class="language-plaintext highlighter-rouge">C_old_new = [{a,b,c}, {x,y,z}]</code></p>
  </li>
  <li>
    <p>All entries from index <code class="language-plaintext highlighter-rouge">j</code> onward replicate and commit under <code class="language-plaintext highlighter-rouge">C_old_new</code></p>
  </li>
  <li>
    <p><strong>Critical step</strong>: Once <code class="language-plaintext highlighter-rouge">entry-j</code> commits under <code class="language-plaintext highlighter-rouge">C_old_new</code>, the leader writes a special <strong>barrier entry</strong>. This entry has no configuration data—it just marks “the joint phase is done.” The leader can switch to <code class="language-plaintext highlighter-rouge">effective-config = C_new</code> and use <code class="language-plaintext highlighter-rouge">C_new</code> to replicate the barrier.</p>
  </li>
  <li>
    <p>When the barrier entry commits, the configuration change is complete</p>
  </li>
</ol>
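<p>The leader side of these five steps can be sketched as a small state machine driven by the commit index. This is an illustration under simplifying assumptions: replication and quorum checks are left out, and <code class="language-plaintext highlighter-rouge">Phase</code>, <code class="language-plaintext highlighter-rouge">Leader</code>, <code class="language-plaintext highlighter-rouge">propose_config</code> and <code class="language-plaintext highlighter-rouge">on_commit</code> are hypothetical names.</p>

```rust
/// Where the leader is in the patched single-entry change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    /// effective-config = C_old.
    Old,
    /// effective-config = C_old_new; waiting for the config entry to commit.
    Joint { config_index: u64 },
    /// effective-config = C_new; waiting for the barrier entry to commit.
    New { barrier_index: u64 },
    /// The barrier committed: the configuration change is complete.
    Done,
}

struct Leader {
    last_log_index: u64,
    phase: Phase,
}

impl Leader {
    fn new(last_log_index: u64) -> Self {
        Leader { last_log_index, phase: Phase::Old }
    }

    /// Step 2: append entry-j (C_new) and enter joint mode immediately.
    /// Returns the index of entry-j, or None if the leader is not in C_old.
    fn propose_config(&mut self) -> Option<u64> {
        if self.phase != Phase::Old {
            return None;
        }
        self.last_log_index += 1;
        self.phase = Phase::Joint { config_index: self.last_log_index };
        Some(self.last_log_index)
    }

    /// Steps 4 and 5: react to the commit index advancing.
    fn on_commit(&mut self, commit_index: u64) {
        match self.phase {
            // entry-j committed under C_old_new: append the barrier, use C_new.
            Phase::Joint { config_index } if commit_index >= config_index => {
                self.last_log_index += 1;
                self.phase = Phase::New { barrier_index: self.last_log_index };
            }
            // The barrier committed: the change is complete.
            Phase::New { barrier_index } if commit_index >= barrier_index => {
                self.phase = Phase::Done;
            }
            _ => {}
        }
    }
}
```

<p>The barrier is appended only inside the branch that observes <code class="language-plaintext highlighter-rouge">entry-j</code> as committed, which mirrors the requirement that the barrier come after the commit.</p>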

<p><strong>Restart behavior</strong>:</p>

<p>When a node restarts, it reads its log. It sees <code class="language-plaintext highlighter-rouge">entry-i</code> (<code class="language-plaintext highlighter-rouge">C_old</code>) and <code class="language-plaintext highlighter-rouge">entry-j</code> (<code class="language-plaintext highlighter-rouge">C_new</code>). It checks: is there a barrier after <code class="language-plaintext highlighter-rouge">entry-j</code>?</p>

<ul>
  <li>
    <p><strong>Barrier present</strong>: Joint phase ended. Set <code class="language-plaintext highlighter-rouge">effective-config = C_new</code>. No need to contact old nodes.</p>
  </li>
  <li>
    <p><strong>No barrier</strong>: Joint phase might still be active. Set <code class="language-plaintext highlighter-rouge">effective-config = C_old_new</code>.</p>
  </li>
</ul>
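<p>The restart rule can be sketched as a pure function over the local log. As before, this is a simplified model rather than the code of any real implementation, and <code class="language-plaintext highlighter-rouge">Entry</code>, <code class="language-plaintext highlighter-rouge">EffectiveConfig</code>, <code class="language-plaintext highlighter-rouge">as_config</code> and <code class="language-plaintext highlighter-rouge">effective_config_on_restart</code> are hypothetical names:</p>

```rust
use std::collections::BTreeSet;

/// A set of voting members. (Hypothetical type for this sketch.)
type Members = BTreeSet<&'static str>;

/// A simplified log entry.
enum Entry {
    Config(Members),
    /// Marks that the preceding config entry committed under the joint config.
    Barrier,
    Normal,
}

#[derive(Debug, Clone, PartialEq)]
enum EffectiveConfig {
    Uniform(Members),
    Joint(Members, Members),
}

/// Returns the member set if `e` is a config entry.
fn as_config(e: &Entry) -> Option<&Members> {
    match e {
        Entry::Config(m) => Some(m),
        _ => None,
    }
}

/// Patch-3: choose the effective-config from the local log alone.
/// Returns None if the log contains no config entry.
fn effective_config_on_restart(log: &[Entry]) -> Option<EffectiveConfig> {
    // Index of the last config entry (entry-j).
    let j = log.iter().rposition(|e| as_config(e).is_some())?;
    let newest = as_config(&log[j])?.clone();
    // Is there a barrier anywhere after entry-j?
    let barrier_seen = log[j + 1..].iter().any(|e| matches!(e, Entry::Barrier));
    // The config entry before entry-j (entry-i), if any.
    let previous = log[..j].iter().rev().find_map(as_config).cloned();
    Some(match previous {
        // No barrier: the joint phase might still be active.
        Some(old) if !barrier_seen => EffectiveConfig::Joint(old, newest),
        // Barrier present, or no older config: use the newest config alone.
        _ => EffectiveConfig::Uniform(newest),
    })
}

/// Helper for building a member set from a list of node names.
fn members(ids: &[&'static str]) -> Members {
    ids.iter().copied().collect()
}
```

<p>With a log of <code class="language-plaintext highlighter-rouge">[C_old, normal, C_new]</code> this returns the joint config; appending a barrier entry makes it return <code class="language-plaintext highlighter-rouge">C_new</code> alone, so the node never has to contact the old members.</p>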

<p>Patch-3 adds a second log entry. We’re no longer doing “one log entry” configuration changes. We need “one config entry + one barrier entry.”</p>

<h1 id="conclusion">Conclusion</h1>

<p>Configuration changes must pass through three states: <code class="language-plaintext highlighter-rouge">C_old → C_old_new → C_new</code>. One log entry gives us one bit of persistent information: the entry is either absent (<code class="language-plaintext highlighter-rouge">C_old</code>) or present (<code class="language-plaintext highlighter-rouge">C_new</code>). That’s only two states. We can’t represent three states with two values.</p>

<p>To safely handle all three states, we need at least two log entries. That gives us two bits of information and up to four possible states, which is enough to encode the three states we actually need.</p>

<p>The “single-log-entry” approach, after all the patches, ends up needing two entries anyway—one for the configuration and one for the barrier. And it’s more complex than standard Joint Consensus, with trickier edge cases around restarts and state transitions.</p>

<p>Stick with Joint Consensus. It’s cleaner, simpler, and solves the problem directly without patches.</p>

<h2 id="references">References</h2>

<ul>
  <li>Diego Ongaro &amp; John Ousterhout. In Search of an Understandable Consensus Algorithm (Raft paper): https://raft.github.io/raft.pdf</li>
  <li>OpenRaft(rust): https://github.com/databendlabs/openraft</li>
  <li>etcd/raft source code: https://github.com/etcd-io/raft</li>
  <li>Hashicorp Raft implementation: https://github.com/hashicorp/raft</li>
</ul>

<p>Reference:</p>

<ul>
  <li>Ma Jianjiang (马健将): <a href="https://weibo.com/u/1516609505">https://weibo.com/u/1516609505</a></li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="raft" /><category term="config-change" /><category term="joint" /><summary type="html"><![CDATA[Is the single-log-entry approach to Raft configuration change simpler than the standard Joint Consensus?]]></summary></entry><entry><title type="html">Raft IO Execution Order (Revised)</title><link href="https://blog.openacid.com/algo/raft-io-order-fix/" rel="alternate" type="text/html" title="Raft IO Execution Order (Revised)" /><published>2025-10-04T00:00:00+00:00</published><updated>2025-10-04T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/raft-io-order-fix</id><content type="html" xml:base="https://blog.openacid.com/algo/raft-io-order-fix/"><![CDATA[<p><img src="/post-res/raft-io-order-fix/62b7bb390d222f2e-raft-io-order-fix-banner.webp" alt="" /></p>

<h2 id="preface">Preface</h2>

<p>I need to come clean about something. In my <a href="https://blog.openacid.com/algo/raft-io-order/">previous article on IO ordering in Raft</a>, I tried to demonstrate the dangers of “writing log entries before term” using a committed data loss scenario. The problem? That example was fundamentally flawed—it didn’t actually capture the real issue with IO reordering at all.</p>

<p>So let’s fix that. This article walks through what I got wrong and, more importantly, presents a correct understanding of when and why IO reordering becomes dangerous in Raft implementations.</p>

<h2 id="what-went-wrong-in-my-original-analysis">What Went Wrong in My Original Analysis</h2>

<p>Let me show you the timeline I used in the previous article:</p>

<blockquote>
  <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Legend:
Ni:   Node i
Vi:   RequestVote, term=i
Li:   Establish Leader, term=i
Ei-j: Log entry, term=i, index=j

N5 |          V5  L5       E5-1
N4 |          V5           E5-1
N3 |  V1                V5,E5-1  E1-1
N2 |  V1      V5                 E1-1
N1 |  V1  L1                     E1-1
------+---+---+---+--------+-----+---------&gt; time
      t1  t2  t3  t4       t5    t6
</code></pre></div>  </div>

  <p>Here’s what I claimed would happen:</p>

  <ul>
    <li>At t5: N3 receives entry E5-1 from leader L5 (term=5) and needs to persist both term=5 and E5-1</li>
    <li>I argued: “If N3 writes E5-1 first but crashes before writing term=5, it could restart with <code class="language-plaintext highlighter-rouge">term=1, entries=[E5-1]</code>”</li>
    <li>At t6: The old leader L1 (term=1) could then overwrite E5-1, causing data loss</li>
  </ul>

  <p><strong>Here’s the flaw in my reasoning</strong>: Raft’s protocol explicitly requires that <em>both</em> the term update and log entries must be successfully persisted before a follower responds with success. If either IO fails or is incomplete, the leader never receives confirmation and therefore never considers the entry committed. The Raft paper’s design is actually bulletproof here.</p>
</blockquote>

<p>So if Raft’s design is correct, where does the IO ordering problem actually come from? The answer lies in a subtle gap between theory and implementation—specifically, how real Raft systems separate in-memory state from on-disk state.</p>

<h2 id="the-real-culprit-in-memory-vs-persisted-state">The Real Culprit: In-Memory vs Persisted State</h2>

<p>Here’s where things get interesting. The Raft paper describes a beautifully simple world where a server has just one state: what’s on disk. But real implementations need to be fast, so they introduce an optimization—they split their state into two layers:</p>

<ul>
  <li><strong>In-memory state</strong>: The “optimistic” view that updates immediately when receiving RPCs</li>
  <li><strong>Persisted state</strong>: The “durable” view that updates only after IO completes</li>
</ul>

<p>Here’s how a typical implementation handles an appendEntries request:</p>

<ol>
  <li>Receive appendEntries RPC with <code class="language-plaintext highlighter-rouge">req.term</code></li>
  <li>If <code class="language-plaintext highlighter-rouge">req.term &gt; current_term</code>, immediately update <code class="language-plaintext highlighter-rouge">current_term</code> to <code class="language-plaintext highlighter-rouge">req.term</code></li>
  <li>Asynchronously submit a save-term IO operation</li>
  <li>Eventually update <code class="language-plaintext highlighter-rouge">persisted_term</code> when the IO completes</li>
</ol>

<p>In code, this looks like:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">RaftState</span> <span class="p">{</span>
    <span class="c1">// In-memory term - may be ahead of what's on disk</span>
    <span class="n">current_term</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>

    <span class="c1">// Persisted term on disk - the durable truth</span>
    <span class="n">persisted_term</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This separation is where the danger lurks. The Raft paper assumes only one “term” variable—what’s persisted on disk. But implementations now have <em>two</em> term values, and this introduces a behavior the paper never defined or analyzed.</p>

<p>The pattern above is ubiquitous in Raft implementations. And here’s the kicker: <em>without IO reordering, it works perfectly fine</em>. The bug only surfaces when IOs can complete out of order.</p>

<h2 id="a-concrete-example-where-io-reordering-breaks-raft">A Concrete Example: Where IO Reordering Breaks Raft</h2>

<p>Let’s build a scenario that actually exposes the bug. I’ll walk you through it step by step:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Legend:
Ni:   Node i
Vi:   RequestVote, term=i
Li:   Establish Leader, term=i
Ei-j: Log entry, term=i, index=j

N5 |          V5  L5     E5-1     E5-2
N4 |          V5         E5-1     E5-2
N3 |  V1              V5,E5-1  V5,E5-2  E1-1
N2 |  V1      V5                        E1-1
N1 |  V1  L1                            E1-1
------+---+---+---+------+--------+-----+------&gt; time
      t1  t2  t3  t4     t5       t6    t7
</code></pre></div></div>

<p>Here’s the sequence of events:</p>

<ul>
  <li><strong>t1-t4</strong>: Two elections occur. First N1 becomes leader (term=1), then N5 becomes leader (term=5)</li>
  <li><strong>t5</strong>: Leader L5 sends its first entry E5-1 to follower N3
    <ul>
      <li>N3’s current state: <code class="language-plaintext highlighter-rouge">current_term=1</code>, <code class="language-plaintext highlighter-rouge">persisted_term=1</code></li>
      <li>N3 receives <code class="language-plaintext highlighter-rouge">appendEntries(term=5, entries=[E5-1])</code></li>
      <li>N3 must persist both term=5 and entry E5-1</li>
      <li>N3 responds “success” only after both IOs complete</li>
    </ul>
  </li>
  <li><strong>t6</strong>: Leader L5 sends a second entry E5-2 to N3 ← <em>This is the critical moment</em>
    <ul>
      <li>N3 might still be waiting for t5’s IOs to complete</li>
      <li>Whether IO reordering can occur makes all the difference</li>
    </ul>
  </li>
  <li><strong>t7</strong>: The old leader L1 (term=1) attempts to replicate E1-1 to N3</li>
</ul>

<p>The bug manifests in what happens at <strong>t6</strong>—when the second AppendEntries arrives while the first one’s IOs are still in flight. Let’s zoom into N3’s internal state at each step.</p>

<h3 id="at-t5-the-first-appendentries">At t5: The First AppendEntries</h3>

<p>When N3 receives <code class="language-plaintext highlighter-rouge">appendEntries(term=5, entries=[E5-1])</code>, here’s what happens inside:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check: Is the RPC term newer than our in-memory term?</span>
    <span class="k">if</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.current_term</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.current_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span><span class="p">;</span>           <span class="c1">// Update memory immediately: 1 → 5</span>
        <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_term</span><span class="p">(</span><span class="n">req</span><span class="py">.term</span><span class="p">));</span>    <span class="c1">// Queue IO to persist term=5</span>
    <span class="p">}</span>

    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>  <span class="c1">// Queue IO to persist E5-1</span>

    <span class="c1">// Wait for both IOs to complete before responding</span>
    <span class="nf">wait_for_both_ios</span><span class="p">();</span>
    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>After this call executes, N3’s state looks like:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">current_term = 5</code> (memory updated immediately)</li>
  <li><code class="language-plaintext highlighter-rouge">persisted_term = 1</code> (disk not yet updated—IO still in flight)</li>
  <li>IO queue: <code class="language-plaintext highlighter-rouge">[save_term(5), save_entries(E5-1)]</code> waiting to complete</li>
</ul>

<p>So far, so good. This request is handled correctly—N3 won’t respond until both IOs finish. The trouble starts at the next moment.</p>

<h3 id="at-t6-the-second-appendentrieswhere-everything-goes-wrong">At t6: The Second AppendEntries—Where Everything Goes Wrong</h3>

<p>Now here’s the critical moment. Before t5’s IOs have completed, N3 receives a second request: <code class="language-plaintext highlighter-rouge">appendEntries(term=5, entries=[E5-2])</code>.</p>

<p>Most implementations check only the in-memory <code class="language-plaintext highlighter-rouge">current_term</code> to decide whether to persist the term. Watch what happens:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check: Is 5 &gt; 5? Nope!</span>
    <span class="k">if</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.current_term</span> <span class="p">{</span>
        <span class="c1">// We skip this branch entirely</span>
    <span class="p">}</span>

    <span class="c1">// We only queue the entries IO</span>
    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>

    <span class="c1">// We only wait for the entries IO to complete</span>
    <span class="nf">wait_for_io</span><span class="p">(</span><span class="n">save_entries</span><span class="p">);</span>
    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>  <span class="c1">// We're done!</span>
<span class="p">}</span>
</code></pre></div></div>

<p>See the problem? N3 returns success as soon as <code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code> completes. But here’s the dangerous part: <strong>if IO reordering is allowed</strong>, the system might have:</p>

<ul>
  <li>✅ Completed <code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code></li>
  <li>✅ Completed <code class="language-plaintext highlighter-rouge">save_entries(E5-1)</code></li>
  <li>❌ NOT completed <code class="language-plaintext highlighter-rouge">save_term(5)</code> (still in flight from t5)</li>
</ul>

<p>N3 happily returns success to Leader L5, which then considers E5-2 replicated and potentially committed.</p>

<p>Now imagine N3 crashes. When it restarts, its disk state is:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">persisted_term = 1</code> (the save_term(5) never finished)</li>
  <li><code class="language-plaintext highlighter-rouge">entries = [E5-1, E5-2]</code> (both successfully written)</li>
</ul>

<p>This is an inconsistent state that Raft’s protocol assumes can never exist. And it’s about to cause data loss.</p>
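<p>The t5/t6 sequence can be replayed in a small model. This is only a sketch: <code class="language-plaintext highlighter-rouge">Follower</code>, <code class="language-plaintext highlighter-rouge">Disk</code>, <code class="language-plaintext highlighter-rouge">Io</code> and <code class="language-plaintext highlighter-rouge">complete</code> are hypothetical names, and the storage layer is assumed to let the caller pick which in-flight IO finishes next, which stands in for reordering.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical model of the t5/t6 scenario; no name here comes from a real Raft library.
#[derive(Debug, PartialEq)]
enum Io {
    SaveTerm(u64),
    SaveEntry(&amp;'static str),
}

#[derive(Debug)]
struct Disk {
    term: u64,
    entries: Vec&lt;&amp;'static str&gt;,
}

struct Follower {
    current_term: u64,  // in-memory term
    disk: Disk,         // what survives a crash
    in_flight: Vec&lt;Io&gt;, // submitted but not yet durable
}

impl Follower {
    // Buggy handler: decides whether to persist the term from the in-memory term only.
    fn handle_append_entries(&amp;mut self, term: u64, entry: &amp;'static str) {
        if term &gt; self.current_term {
            self.current_term = term;
            self.in_flight.push(Io::SaveTerm(term));
        }
        self.in_flight.push(Io::SaveEntry(entry));
    }

    // Complete one in-flight IO chosen by the caller (assumed: the storage layer may reorder).
    fn complete(&amp;mut self, io: &amp;Io) {
        let pos = self
            .in_flight
            .iter()
            .position(|x| x == io)
            .expect("IO is not in flight");
        match self.in_flight.remove(pos) {
            Io::SaveTerm(t) =&gt; self.disk.term = t,
            Io::SaveEntry(e) =&gt; self.disk.entries.push(e),
        }
    }

    // A crash drops everything that is not on disk.
    fn crash(self) -&gt; Disk {
        self.disk
    }
}

fn crash_after_reordered_ios() -&gt; Disk {
    let mut n3 = Follower {
        current_term: 1,
        disk: Disk { term: 1, entries: vec![] },
        in_flight: vec![],
    };
    n3.handle_append_entries(5, "E5-1"); // t5: queues save_term(5) and save_entries(E5-1)
    n3.handle_append_entries(5, "E5-2"); // t6: queues only save_entries(E5-2)
    n3.complete(&amp;Io::SaveEntry("E5-1"));
    n3.complete(&amp;Io::SaveEntry("E5-2")); // the buggy handler would reply success here
    n3.crash() // save_term(5) never completed
}
</code></pre></div></div>

<p>In this model, <code class="language-plaintext highlighter-rouge">crash_after_reordered_ios()</code> returns a disk with <code class="language-plaintext highlighter-rouge">term = 1</code> and both term-5 entries, which is the inconsistent state described above.</p>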

<h3 id="at-t7-the-data-loss-materializes">At t7: The Data Loss Materializes</h3>

<p>After N3 restarts with <code class="language-plaintext highlighter-rouge">term=1, entries=[E5-1, E5-2]</code>, the old leader L1 (from term=1) sends an appendEntries request: <code class="language-plaintext highlighter-rouge">appendEntries(term=1, entries=[E1-1])</code>.</p>

<p>N3’s logic:</p>

<ol>
  <li>Check: RPC term (1) == my local term (1) ✅</li>
  <li>Accept the request</li>
  <li>Write E1-1 at index=1, overwriting E5-1 and discarding the conflicting E5-2 that follows it</li>
</ol>

<p><strong>The disaster</strong>: Entries E5-1 and E5-2, which Leader L5 believed were successfully replicated and possibly committed, have just been silently destroyed. We’ve lost committed data.</p>
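<p>A rough sketch of the follower-side check at t7 shows why the stale request gets through. It assumes a heavily simplified AppendEntries: one entry per request, no <code class="language-plaintext highlighter-rouge">prev_log</code> matching, and unconditional truncation from the target index (which matches the conflict case here). <code class="language-plaintext highlighter-rouge">Restarted</code>, <code class="language-plaintext highlighter-rouge">append</code> and <code class="language-plaintext highlighter-rouge">replay_t7</code> are hypothetical names.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical follower state right after the restart.
struct Restarted {
    persisted_term: u64,
    log: Vec&lt;(u64, &amp;'static str)&gt;, // (term, payload); position 0 holds index=1
}

impl Restarted {
    // Accept unless the request term is older than the term found on disk.
    // On accept, the entry at `index` and everything after it is replaced.
    fn append(&amp;mut self, req_term: u64, index: usize, payload: &amp;'static str) -&gt; bool {
        if req_term &lt; self.persisted_term {
            return false; // stale leader rejected
        }
        self.persisted_term = req_term;
        self.log.truncate(index - 1);
        self.log.push((req_term, payload));
        true
    }
}

// Replay L1's appendEntries(term=1, entries=[E1-1]) against a given on-disk term.
fn replay_t7(persisted_term: u64) -&gt; (bool, Vec&lt;(u64, &amp;'static str)&gt;) {
    let mut n3 = Restarted {
        persisted_term,
        log: vec![(5, "E5-1"), (5, "E5-2")],
    };
    let accepted = n3.append(1, 1, "E1-1");
    (accepted, n3.log)
}
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">replay_t7(1)</code> the request is accepted and the log shrinks to the single entry E1-1. With <code class="language-plaintext highlighter-rouge">replay_t7(5)</code>, the state N3 would have had if <code class="language-plaintext highlighter-rouge">save_term(5)</code> had completed, the request is rejected and both entries survive.</p>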

<hr />

<p><strong>Important note</strong>: If IO reordering were <em>not</em> allowed, this bug wouldn’t occur. Here’s why: when <code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code> completes at t6, it would guarantee that <code class="language-plaintext highlighter-rouge">save_term(5)</code> (queued earlier) has also completed. The sequential ordering ensures that N3’s disk state remains consistent, and the AppendEntries success response would be legitimate.</p>
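<p>The in-order case can be sketched the same way. The model below assumes a storage layer that makes IOs durable strictly in submission order; <code class="language-plaintext highlighter-rouge">FifoIo</code> and its methods are hypothetical names used only for illustration.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::collections::VecDeque;

// Hypothetical in-order IO queue: IOs become durable strictly in submission order.
struct FifoIo {
    pending: VecDeque&lt;&amp;'static str&gt;,
    durable: Vec&lt;&amp;'static str&gt;,
}

impl FifoIo {
    fn new() -&gt; Self {
        FifoIo { pending: VecDeque::new(), durable: Vec::new() }
    }

    fn submit(&amp;mut self, io: &amp;'static str) {
        self.pending.push_back(io);
    }

    // Drive the queue until `io` is durable; everything submitted before it completes first.
    fn wait_for(&amp;mut self, io: &amp;str) {
        while let Some(next) = self.pending.pop_front() {
            self.durable.push(next);
            if next == io {
                break;
            }
        }
    }
}

fn durable_after_second_request() -&gt; Vec&lt;&amp;'static str&gt; {
    let mut q = FifoIo::new();
    q.submit("save_term(5)");       // t5
    q.submit("save_entries(E5-1)"); // t5
    q.submit("save_entries(E5-2)"); // t6
    q.wait_for("save_entries(E5-2)"); // the t6 handler waits only for its own IO
    q.durable
}
</code></pre></div></div>

<p>Even though the t6 handler waits only for <code class="language-plaintext highlighter-rouge">save_entries(E5-2)</code>, the durable list already contains <code class="language-plaintext highlighter-rouge">save_term(5)</code> when the wait returns.</p>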

<h2 id="the-root-cause-a-mismatch-between-theory-and-practice">The Root Cause: A Mismatch Between Theory and Practice</h2>

<p>Let’s crystallize what we’ve learned:</p>

<p><strong>The core issue</strong>: When deciding whether to persist the term, should we check <code class="language-plaintext highlighter-rouge">current_term</code> or <code class="language-plaintext highlighter-rouge">persisted_term</code>?</p>

<ul>
  <li>If IO reordering is <strong>not allowed</strong> → checking <code class="language-plaintext highlighter-rouge">current_term</code> is safe</li>
  <li>If IO reordering <strong>is allowed</strong> → we must check <code class="language-plaintext highlighter-rouge">persisted_term</code></li>
</ul>

<p>This isn’t obvious because the Raft paper never talks about in-memory vs persisted state—it only knows about one kind of state: what’s on disk. The paper says: <em>“Before responding to RPCs, a server must update its persistent state.”</em></p>

<p>But in real implementations that split state into an in-memory copy and a persisted copy, this requirement needs to be stated more precisely:</p>

<p><strong>Before returning success, we must ensure all IOs that make <code class="language-plaintext highlighter-rouge">persisted_term &gt;= req.term</code> have completed.</strong></p>

<p>Checking only <code class="language-plaintext highlighter-rouge">current_term</code> creates a window where we might respond successfully while the required disk updates are still in flight. If those updates can complete out of order, we’ve violated Raft’s safety guarantees.</p>
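<p>Stated as a predicate, the requirement might look like the sketch below. It assumes the storage layer can report what is durable; <code class="language-plaintext highlighter-rouge">Durable</code>, <code class="language-plaintext highlighter-rouge">last_flushed_index</code> and <code class="language-plaintext highlighter-rouge">may_respond_success</code> are hypothetical names, not part of any particular Raft library.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical durability snapshot reported by the storage layer.
struct Durable {
    persisted_term: u64,
    last_flushed_index: u64,
}

// Success may be sent only when both the term and the entries of this request are durable.
fn may_respond_success(d: &amp;Durable, req_term: u64, req_last_index: u64) -&gt; bool {
    d.persisted_term &gt;= req_term &amp;&amp; d.last_flushed_index &gt;= req_last_index
}
</code></pre></div></div>

<p>At t6 the entries up to index 2 are flushed but <code class="language-plaintext highlighter-rouge">persisted_term</code> is still 1, so the predicate is false for <code class="language-plaintext highlighter-rouge">req_term = 5</code>. The in-memory <code class="language-plaintext highlighter-rouge">current_term</code> never appears in the check.</p>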

<h2 id="how-to-fix-it-check-persisted-state-not-in-memory-state">How to Fix It: Check Persisted State, Not In-Memory State</h2>

<p>If you need to support IO reordering, the fix is conceptually simple—check the on-disk term, not the in-memory term:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Check against disk state, not memory!</span>
    <span class="k">let</span> <span class="n">need_save_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span> <span class="o">&gt;</span> <span class="k">self</span><span class="py">.persisted_term</span><span class="p">;</span>

    <span class="k">if</span> <span class="n">need_save_term</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.current_term</span> <span class="o">=</span> <span class="n">req</span><span class="py">.term</span><span class="p">;</span>
        <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_term</span><span class="p">(</span><span class="n">req</span><span class="py">.term</span><span class="p">));</span>
    <span class="p">}</span>

    <span class="k">self</span><span class="nf">.submit_io</span><span class="p">(</span><span class="nf">save_entries</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">));</span>

    <span class="c1">// Wait for the right IOs based on what we actually need</span>
    <span class="k">if</span> <span class="n">need_save_term</span> <span class="p">{</span>
        <span class="nf">wait_for_both_ios</span><span class="p">();</span>  <span class="c1">// Must wait for term update to complete</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="nf">wait_for_io</span><span class="p">(</span><span class="n">save_entries</span><span class="p">);</span>  <span class="c1">// Only need to wait for entries</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="nf">success</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By checking <code class="language-plaintext highlighter-rouge">persisted_term</code> instead of <code class="language-plaintext highlighter-rouge">current_term</code>, we correctly detect when the term IO is still in flight and wait for it to complete.</p>

<p><strong>Caveat</strong>: This approach might submit multiple <code class="language-plaintext highlighter-rouge">save_term(T)</code> IOs for the same term T (if multiple AppendEntries arrive in quick succession). You’ll need to handle this carefully—either make the IO layer idempotent or add deduplication logic.</p>
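<p>One possible deduplication scheme, sketched under assumptions: use the in-memory term to decide whether to <em>submit</em> <code class="language-plaintext highlighter-rouge">save_term</code>, and the on-disk term to decide what to <em>wait for</em>. The names <code class="language-plaintext highlighter-rouge">Node</code>, <code class="language-plaintext highlighter-rouge">WaitFor</code> and <code class="language-plaintext highlighter-rouge">on_term_persisted</code> are hypothetical, and the <code class="language-plaintext highlighter-rouge">submitted</code> vector only records what would be handed to the IO layer.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical state split: `current_term` doubles as "highest term already submitted".
struct Node {
    current_term: u64,
    persisted_term: u64,
    submitted: Vec&lt;String&gt;, // IOs handed to the storage layer, recorded for illustration
}

// What the handler must wait for before replying.
#[derive(Debug, PartialEq)]
enum WaitFor {
    TermAndEntries,
    EntriesOnly,
}

impl Node {
    fn handle_append_entries(&amp;mut self, req_term: u64, entry: &amp;str) -&gt; WaitFor {
        // Submit save_term at most once per term: compare with the in-memory term.
        if req_term &gt; self.current_term {
            self.current_term = req_term;
            self.submitted.push(format!("save_term({})", req_term));
        }
        self.submitted.push(format!("save_entries({})", entry));

        // Decide what to wait for from the on-disk term.
        if req_term &gt; self.persisted_term {
            WaitFor::TermAndEntries
        } else {
            WaitFor::EntriesOnly
        }
    }

    // Called by the IO layer when a save_term IO becomes durable.
    fn on_term_persisted(&amp;mut self, term: u64) {
        self.persisted_term = self.persisted_term.max(term);
    }
}
</code></pre></div></div>

<p>In this sketch a second request for term 5 submits no new <code class="language-plaintext highlighter-rouge">save_term(5)</code>, yet still reports <code class="language-plaintext highlighter-rouge">TermAndEntries</code> until <code class="language-plaintext highlighter-rouge">on_term_persisted(5)</code> has been called.</p>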

<h2 id="how-production-systems-solve-this">How Production Systems Solve This</h2>

<p>Here’s the interesting part: most mature Raft implementations don’t actually support IO reordering. Instead, they eliminate the problem entirely by ensuring save-term and save-entries execute in order. This lets them safely check <code class="language-plaintext highlighter-rouge">current_term</code> without the bug we just analyzed.</p>

<p>Let’s look at three different approaches from production systems:</p>

<h3 id="1-atomic-batching-tikv">1. Atomic Batching (TiKV)</h3>

<p><strong>Strategy</strong>: Bundle save-term and save-entries into a single atomic IO operation.</p>

<p>When an AppendEntries requires both a term update and log writes, TiKV combines them into one batch and submits it as a single IO request. This makes it impossible for the entries to persist without the term—they’re literally the same operation.</p>

<p>This elegantly sidesteps the entire reordering problem. There’s no “second AppendEntries that only submits save_entries” scenario because term and entries are always written together.</p>
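<p>The all-or-nothing property can be illustrated with a toy in-memory store. This is not TiKV’s actual API: <code class="language-plaintext highlighter-rouge">Batch</code>, <code class="language-plaintext highlighter-rouge">Storage</code>, <code class="language-plaintext highlighter-rouge">write_batch</code> and the <code class="language-plaintext highlighter-rouge">crash_before_commit</code> flag are hypothetical, and the flag merely simulates a crash before the batch becomes durable.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical write batch carrying the term update and the entries together.
struct Batch {
    term: Option&lt;u64&gt;,
    entries: Vec&lt;&amp;'static str&gt;,
}

// Hypothetical durable state.
struct Storage {
    term: u64,
    entries: Vec&lt;&amp;'static str&gt;,
}

impl Storage {
    // Apply the batch as one unit. If the simulated crash happens first,
    // nothing from the batch becomes visible: neither the term nor the entries.
    fn write_batch(&amp;mut self, batch: Batch, crash_before_commit: bool) {
        if crash_before_commit {
            return;
        }
        if let Some(t) = batch.term {
            self.term = t;
        }
        self.entries.extend(batch.entries);
    }
}
</code></pre></div></div>

<p>Either outcome keeps the invariant: after a simulated crash the store still holds <code class="language-plaintext highlighter-rouge">term = 1</code> with no term-5 entries, and after a successful write it holds <code class="language-plaintext highlighter-rouge">term = 5</code> together with E5-1.</p>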

<h3 id="2-ordered-separation-hashicorp-raft">2. Ordered Separation (HashiCorp Raft)</h3>

<p><strong>Strategy</strong>: Persist term and entries separately, but enforce strict ordering.</p>

<p>HashiCorp’s Raft implementation writes the term first (with fsync, panicking on failure), then writes the log entries. The key is that these operations execute sequentially—save_entries can’t start until save_term completes.</p>

<p>This guarantees that if entries reach disk, the term has definitely reached disk first. Sequential ordering prevents the reordering bug.</p>

<h3 id="3-hybrid-ordering-sofajraft">3. Hybrid Ordering (SOFAJRaft)</h3>

<p><strong>Strategy</strong>: Synchronous term writes, asynchronous batched log writes.</p>

<p>SOFAJRaft writes the term synchronously (blocking the current thread for fsync) but batches log entries for asynchronous writing. The crucial property: save_term always completes before save_entries is even enqueued.</p>

<p>This hybrid approach gets you most of the performance benefits of async IO while maintaining the ordering guarantee that prevents the bug.</p>

<h2 id="summary-bridging-theory-and-practice">Summary: Bridging Theory and Practice</h2>

<p>The IO ordering bug in Raft implementations stems from a subtle gap between the paper’s abstract model and real-world code. The Raft paper assumes a single state: what’s on disk. Real implementations split that state into an in-memory copy and a persisted copy as an optimization, which introduces behaviors the paper never analyzed.</p>

<p><strong>The invariant we must maintain</strong>:</p>

<blockquote>
  <p>If a log entry with term=T is on disk, then persisted_term ≥ T must also be on disk.</p>
</blockquote>

<p>Violating this invariant—having entries from term T on disk while <code class="language-plaintext highlighter-rouge">persisted_term &lt; T</code>—breaks Raft’s safety guarantees and can cause committed data loss.</p>

<p><strong>Two ways to maintain the invariant</strong>:</p>

<ol>
  <li>
    <p><strong>Eliminate IO reordering</strong> (mainstream approach)</p>

    <ul>
      <li>Atomic batching: Write term and entries together</li>
      <li>Ordered execution: Guarantee term persists before entries</li>
      <li>Hybrid ordering: Synchronous term, async entries</li>
    </ul>
  </li>
  <li>
    <p><strong>Handle IO reordering explicitly</strong></p>

    <ul>
      <li>Check <code class="language-plaintext highlighter-rouge">persisted_term</code> instead of <code class="language-plaintext highlighter-rouge">current_term</code> when deciding whether to persist the term</li>
      <li>Wait for all required IOs to complete before responding</li>
    </ul>
  </li>
</ol>

<p>Most production systems choose option 1—it’s simpler to reason about and avoids the complexity of tracking multiple in-flight term updates. But if you do need to support IO reordering, now you know where the dragons are hiding.</p>

<h2 id="related-resources">Related Resources</h2>

<ul>
  <li><a href="https://blog.openacid.com/algo/raft-io-order/">The Hidden Danger in Raft: Why IO Ordering Matters</a></li>
  <li><a href="https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md">OpenRaft docs: io-ordering</a></li>
  <li><a href="https://github.com/tikv/tikv">tikv/tikv</a></li>
  <li><a href="https://github.com/hashicorp/raft">hashicorp/raft</a></li>
  <li><a href="https://github.com/sofastack/sofa-jraft">sofastack/sofa-jraft</a></li>
</ul>

<p>Reference:</p>

<ul>
  <li>
    <p>OpenRaft docs: io-ordering : <a href="https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md">https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md</a></p>
  </li>
  <li>
    <p>hashicorp/raft : <a href="https://github.com/hashicorp/raft">https://github.com/hashicorp/raft</a></p>
  </li>
  <li>
    <p>The Hidden Danger in Raft: Why IO Ordering Matters : <a href="https://blog.openacid.com/algo/raft-io-order/">https://blog.openacid.com/algo/raft-io-order/</a></p>
  </li>
  <li>
    <p>sofastack/sofa-jraft : <a href="https://github.com/sofastack/sofa-jraft">https://github.com/sofastack/sofa-jraft</a></p>
  </li>
  <li>
    <p>tikv/tikv : <a href="https://github.com/tikv/tikv">https://github.com/tikv/tikv</a></p>
  </li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="raft" /><category term="cn" /><summary type="html"><![CDATA[I got it wrong in my previous article. The IO ordering bug in Raft isn't about the protocol design—it's about the subtle trap that emerges when implementations split state into in-memory and persisted state. Here's what actually happens.]]></summary></entry><entry><title type="html">The Hidden Danger in Raft: Why IO Ordering Matters</title><link href="https://blog.openacid.com/algo/raft-io-order/" rel="alternate" type="text/html" title="The Hidden Danger in Raft: Why IO Ordering Matters" /><published>2025-10-02T00:00:00+00:00</published><updated>2025-10-02T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/raft-io-order</id><content type="html" xml:base="https://blog.openacid.com/algo/raft-io-order/"><![CDATA[<p><img src="/post-res/raft-io-order/1af51b7ddbd3efd5-raft-io-order-banner.webp" alt="" /></p>

<h2 id="io-reordering-breaks-committed-data">IO Reordering Breaks Committed Data</h2>

<p>In Raft, if you <strong>write log entries before the term</strong> when persisting AppendEntries, you risk <strong>losing committed data</strong>.</p>

<p>This article explores how this happens, how production systems handle it, and how to prevent it.</p>

<h3 id="background">Background</h3>

<p>When a follower receives an <code class="language-plaintext highlighter-rouge">AppendEntries</code> RPC in Raft, it must persist two critical pieces: metadata (<code class="language-plaintext highlighter-rouge">HardState</code>, containing term and vote) and log entries (the actual application data). Only after both are safely on disk can the follower respond to the leader. <strong>Here’s the catch: persistence order matters tremendously</strong>.</p>

<h3 id="how-data-loss-happens">How Data Loss Happens</h3>

<p>Let’s walk through a concrete timeline to see how this plays out:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Legend:
Ni:   Node i
Vi:   RequestVote, term=i
Li:   Establish Leader, term=i
Ei-j: Log entry, term=i, index=j

N5 |          V5  L5       E5-1
N4 |          V5           E5-1
N3 |  V1                V5,E5-1  E1-1
N2 |  V1      V5                 E1-1
N1 |  V1  L1                     E1-1
------+---+---+---+--------+-----+-------------------------------------------&gt; time
      t1  t2  t3  t4       t5    t6
</code></pre></div></div>

<ul>
  <li>t1: N1 starts an election (term=1), receives votes from N1, N2, N3</li>
  <li>t2: N1 becomes leader L1</li>
  <li>t3: N5 starts an election (term=5), receives votes from N5, N4, N2</li>
  <li>t4: N5 becomes leader L5</li>
  <li>t5: L5 replicates its first log entry E5-1 to N4 and N3. Key point: N3’s <strong>stored</strong> term (1) is stale compared to the RPC’s term (5), so N3 must perform two sequential IO operations: persist term=5, then persist E5-1</li>
  <li>t6: L1 attempts to replicate E1-1 (term=1, index=1)</li>
</ul>

<p>The critical moment is t5, where N3’s behavior determines everything:</p>

<p><strong>If IO operations are not reordered</strong> (correct):</p>

<p>N3 executes sequentially: <strong>first</strong> persist term=5, <strong>then</strong> persist E5-1. This guarantees that whenever E5-1 is on disk, term=5 is already there too.</p>

<p><strong>If IO operations can be reordered</strong> (wrong):</p>

<p>IO operations might complete out of order: E5-1 hits disk first, then term=5.</p>

<p>Here’s where disaster strikes: if the server crashes after writing E5-1 but before persisting term=5, N3’s stored term stays at 1 while E5-1 sits in the log.</p>

<p>When N3 recovers and receives L1’s replication request for E1-1 (term=1, index=1), it accepts it—the terms match! E1-1 overwrites E5-1.</p>

<p>The damage is done: E5-1 was already replicated to 3 nodes (N5, N4, N3) and considered committed by L5, but now it’s gone, replaced by stale data. <strong>Committed data has vanished</strong>.</p>
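<p>The commit claim is plain majority arithmetic, which can be written down as a small sketch. <code class="language-plaintext highlighter-rouge">quorum</code> and <code class="language-plaintext highlighter-rouge">is_committed</code> are hypothetical helper names; real Raft additionally requires the entry to come from the leader’s current term, which holds for E5-1 under L5.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Majority size for a cluster of `n` voters.
fn quorum(n: usize) -&gt; usize {
    n / 2 + 1
}

// An entry counts as committed once a majority of the cluster has persisted it.
fn is_committed(acks: usize, cluster_size: usize) -&gt; bool {
    acks &gt;= quorum(cluster_size)
}
</code></pre></div></div>

<p>With five nodes the quorum is 3, so E5-1 on N5, N4 and N3 is committed. Once N3 overwrites its copy, only two replicas remain, fewer than a majority, even though L5 already treated the entry as committed.</p>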

<p>At its core, this problem breaks a critical invariant:</p>

<blockquote>
  <p><strong>If a log entry E (term=T) exists on disk → the stored term must be ≥T</strong></p>
</blockquote>

<p>Proper IO ordering preserves this invariant, guaranteeing that whenever a log entry hits disk, its term is already there.</p>
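<p>As a sketch, the invariant can be written as a check over what is on disk. The function name <code class="language-plaintext highlighter-rouge">invariant_holds</code> and the flat list of entry terms are assumptions made for illustration.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical view of a follower's disk: the stored term plus the term of each log entry.
// The invariant holds when no entry carries a term newer than the stored term.
fn invariant_holds(stored_term: u64, entry_terms: &amp;[u64]) -&gt; bool {
    entry_terms.iter().all(|&amp;t| stored_term &gt;= t)
}
</code></pre></div></div>

<p>For N3 after the crash, <code class="language-plaintext highlighter-rouge">invariant_holds(1, &amp;[5])</code> is false: E5-1 is on disk while the stored term is still 1.</p>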

<h2 id="what-rafts-paper-doesnt-say">What Raft’s Paper Doesn’t Say</h2>

<p>The Raft paper states: “Before responding to RPCs, a server must update its persistent state.”</p>

<p>The paper assumes persistence is atomic without explicitly spelling out the ordering requirements between term and log.</p>

<p><strong>The trap most implementations fall into:</strong> when a follower receives an AppendEntries RPC, it needs to persist two types of data—metadata (term, vote, etc., in MetaStore) and log entries (in LogStore).</p>

<p>For performance and clean separation of concerns, many implementations store these separately and submit IO requests in parallel:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">handle_append_entries</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">req</span><span class="p">:</span> <span class="n">AppendEntries</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Response</span> <span class="p">{</span>
    <span class="k">self</span><span class="py">.meta_store</span><span class="nf">.save_term_async</span><span class="p">(</span><span class="n">req</span><span class="py">.term</span><span class="p">);</span>  <span class="c1">// Async submit</span>
    <span class="k">self</span><span class="py">.log_store</span><span class="nf">.append_async</span><span class="p">(</span><span class="n">req</span><span class="py">.entries</span><span class="p">);</span>   <span class="c1">// Async submit</span>

    <span class="k">self</span><span class="py">.log_store</span><span class="nf">.sync</span><span class="p">();</span>  <span class="c1">// Only wait for log persistence!</span>
    <span class="k">return</span> <span class="nn">Response</span><span class="p">::</span><span class="nf">success</span><span class="p">();</span>  <span class="c1">// Ignore whether term is persisted</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The trap is subtle: developers focus on persisting logs (the “real” application data) while treating term as mere “metadata” that can wait. The result? Logs hit disk while term <strong>is still in memory</strong>—or worse, in a write queue. When the server crashes, the invariant shatters.</p>

<h2 id="production-implementations">Production Implementations</h2>

<p>I examined 4 production Raft implementations to see how they tackle this:</p>

<table>
<tr class="header">
<th>Implementation</th>
<th>Result</th>
<th>How It Avoids the Problem</th>
</tr>
<tr class="odd">
<td><strong>TiKV</strong></td>
<td>✅ Safe</td>
<td>Atomic batching: term and log in the same LogBatch</td>
</tr>
<tr class="even">
<td><strong>HashiCorp Raft</strong></td>
<td>✅ Safe</td>
<td>Ordered writes: write term first (panic on fail), then log</td>
</tr>
<tr class="odd">
<td><strong>SOFAJRaft</strong></td>
<td>✅ Safe</td>
<td>Hybrid order: term sync, log async</td>
</tr>
<tr class="even">
<td><strong>tikv/raft-rs library</strong></td>
<td>⚠️ Depends on application</td>
<td>The library itself is safe, but it does not enforce IO ordering; the application must</td>
</tr>
</table>

<h2 id="three-safe-solutions">Three Safe Solutions</h2>

<p>From successful production implementations, three safe patterns emerge:</p>

<h3 id="atomic-batching-tikv">Atomic Batching (TiKV)</h3>

<p>TiKV bundles term and log entries into a single atomic batch. The code adds both to a batch, then calls <code class="language-plaintext highlighter-rouge">write_batch(sync=true)</code> to commit everything at once with checksum verification.</p>

<p>The beauty: <strong>all-or-nothing</strong>. Order within the batch doesn’t matter, making correctness reasoning trivial.</p>

<p>The trade-off? You need atomic batch support, but <strong>you only pay one fsync</strong>. Perfect for custom storage engines or when you want the simplest possible safety guarantees.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch</span><span class="nf">.put_term</span><span class="p">(</span><span class="n">new_term</span><span class="p">);</span>
<span class="n">batch</span><span class="nf">.put_entries</span><span class="p">(</span><span class="n">entries</span><span class="p">);</span>
<span class="n">storage</span><span class="nf">.write_batch</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">sync</span><span class="o">=</span><span class="k">true</span><span class="p">);</span> <span class="c1">// Atomic write + checksum verification</span>
</code></pre></div></div>

<h3 id="sequential-writes-hashicorp-raft">Sequential Writes (HashiCorp Raft)</h3>

<p>HashiCorp Raft keeps it simple: write term first, then log—both synchronously.</p>

<p>Looking at <code class="language-plaintext highlighter-rouge">raft.go:1414,1922</code>, <code class="language-plaintext highlighter-rouge">setCurrentTerm</code> includes an fsync that panics on failure before <code class="language-plaintext highlighter-rouge">StoreLogs</code> even runs. Once term is on disk, the higher term acts as a shield against stale leader requests.</p>

<p>The upside? It is dead simple to implement, works with any storage backend, and embraces a fail-fast philosophy. The price? Two fsyncs mean slightly higher latency. It is a good fit for general use with standard storage such as files or BoltDB.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// raft.go:1414,1922</span>
<span class="n">r</span><span class="o">.</span><span class="n">setCurrentTerm</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">Term</span><span class="p">)</span>  <span class="c">// Includes fsync, panics on failure</span>
<span class="n">r</span><span class="o">.</span><span class="n">logs</span><span class="o">.</span><span class="n">StoreLogs</span><span class="p">(</span><span class="n">entries</span><span class="p">)</span> <span class="c">// Includes fsync</span>
</code></pre></div></div>

<h3 id="hybrid-approach-sofajraft">Hybrid Approach (SOFAJRaft)</h3>

<p>SOFAJRaft splits the difference: synchronous term writes, asynchronous log batching.</p>

<p>In <code class="language-plaintext highlighter-rouge">NodeImpl.java:1331,2079</code>, <code class="language-plaintext highlighter-rouge">setTermAndVotedFor</code> blocks until fsync completes, while <code class="language-plaintext highlighter-rouge">appendEntries</code> just enqueues the log and returns instantly—background threads handle the batch writes.</p>

<p>The key: logs <strong>queue only after</strong> term’s fsync completes, guaranteeing term persists first. This delivers peak performance because term changes are rare (only during leader switches), making sync acceptable, while log writes are constant (every client request), where async batching shines.</p>

<p>The catch? Complex implementation needing a bulletproof async pipeline (SOFAJRaft uses LMAX Disruptor). Ideal when you’re <strong>pushing &gt;10K writes/sec</strong>.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NodeImpl.java:1331,2079</span>
<span class="k">this</span><span class="o">.</span><span class="na">metaStorage</span><span class="o">.</span><span class="na">setTermAndVotedFor</span><span class="o">(</span><span class="n">req</span><span class="o">.</span><span class="na">term</span><span class="o">,</span> <span class="kc">null</span><span class="o">);</span> <span class="c1">// Sync fsync, blocks</span>
<span class="k">this</span><span class="o">.</span><span class="na">logManager</span><span class="o">.</span><span class="na">appendEntries</span><span class="o">(</span><span class="n">entries</span><span class="o">,</span> <span class="n">closure</span><span class="o">);</span>     <span class="c1">// Async enqueue, returns immediately</span>
</code></pre></div></div>

<h2 id="async-io-scheduling">Async IO Scheduling</h2>

<p>All three approaches <strong>guarantee safety by sacrificing IO concurrency</strong>: either serial execution (wait for one to finish before starting the next) or atomic batching.</p>

<p>For higher performance, <a href="https://github.com/databendlabs/openraft">OpenRaft</a> is exploring an async IO scheduler: the Raft core fires all IO requests into an execution queue, which schedules them and signals completion via callbacks. This maximizes IO parallelism and throughput but surfaces a fundamental question: <strong>which IOs can be reordered safely, and which absolutely cannot?</strong></p>

<h2 id="summary">Summary</h2>

<p><strong>Term must be persisted before (or at the same time as) log</strong></p>

<p>Invariant: <code class="language-plaintext highlighter-rouge">If log(term=T) is on disk → term≥T must also be on disk</code></p>

<p>I like thinking about distributed consensus through time and history. Consensus algorithms create a virtual timeline, and the Raft log is simply the sequence of events on that timeline. In this view: term is time itself, and log entries are the events happening in that time.</p>

<p>When IO lets term roll back, you’re letting time itself rewind. But here’s the paradox: rewinding time doesn’t erase what happened—the system can rewrite history at an earlier point, letting new events overwrite the old. That’s data loss at its core.</p>

<p><strong>Choose your approach:</strong> Atomic batching for simplicity, ordered writes for compatibility, hybrid for maximum throughput.</p>

<h2 id="related-resources">Related Resources</h2>

<ul>
  <li><a href="https://github.com/databendlabs/openraft/blob/main/openraft/src/docs/protocol/io_ordering.md">OpenRaft docs: io-ordering</a></li>
  <li><a href="https://github.com/tikv/tikv">tikv/tikv</a></li>
  <li><a href="https://github.com/hashicorp/raft">hashicorp/raft</a></li>
  <li><a href="https://github.com/sofastack/sofa-jraft">sofastack/sofa-jraft</a></li>
</ul>

<p>Reference:</p>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="分布式" /><category term="raft" /><category term="en" /><summary type="html"><![CDATA[Writing logs before persisting term in Raft can silently destroy committed data. Here's why production systems like TiKV and HashiCorp Raft carefully control IO order—and three battle-tested solutions.]]></summary></entry><entry><title type="html">CRC32 九浅一深: 自包含退化 现象分析</title><link href="https://blog.openacid.com/algo/crc/" rel="alternate" type="text/html" title="CRC32 九浅一深: 自包含退化 现象分析" /><published>2025-08-27T00:00:00+00:00</published><updated>2025-08-27T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/crc</id><content type="html" xml:base="https://blog.openacid.com/algo/crc/"><![CDATA[<p><img src="/post-res/crc/f1900e6460716230-crc-banner.webp" alt="" /></p>

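<h1 id="前言">Preface</h1>

<p>My friend <a href="https://x.com/fuzhe19">fuzhe</a> noticed an interesting detail while reading the LevelDB source code: when the system stores a CRC checksum, it does not store the computed value directly. It first applies a seemingly “redundant” mask operation, which rotates the value right by 15 bits and then adds a mysterious constant, <code class="language-plaintext highlighter-rouge">0xa282ead8ul</code>.</p>

<p>The code comment says this is because “it is problematic to compute the CRC of a string that contains embedded CRCs”. But what exactly is the problem? Why would a bit operation solve it? What principle lies behind this design?</p>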
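<p>The mask step described above (rotate right by 15 bits, then add the constant) can be sketched in Rust as follows. This is a transliteration of the description, not LevelDB’s own code; the names <code class="language-plaintext highlighter-rouge">MASK_DELTA</code>, <code class="language-plaintext highlighter-rouge">mask</code> and <code class="language-plaintext highlighter-rouge">unmask</code> are chosen here, and the addition is assumed to wrap on 32-bit overflow.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The constant quoted in the text above.
const MASK_DELTA: u32 = 0xa282_ead8;

// Rotate right by 15 bits, then add the constant (wrapping on overflow).
fn mask(crc: u32) -&gt; u32 {
    crc.rotate_right(15).wrapping_add(MASK_DELTA)
}

// Inverse of `mask`: subtract the constant, then rotate left by 15 bits.
fn unmask(masked: u32) -&gt; u32 {
    masked.wrapping_sub(MASK_DELTA).rotate_left(15)
}
</code></pre></div></div>

<p>Both steps are reversible, so <code class="language-plaintext highlighter-rouge">unmask(mask(crc))</code> gives back the original value and no information is lost by storing the masked form.</p>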

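<p>With these questions in mind, let’s set out to explore CRC. Step by step we will uncover the mathematics behind CRC, understand the “self-inclusion degeneration” phenomenon, and finally see how the LevelDB engineers solved this deep problem with an elegant mathematical trick.</p>

<h1 id="什么是-crc从自然数除法到校验码">What Is CRC: From Integer Division to Checksums</h1>

<p>When transmitting the number 1234, how do we detect errors? Divide and keep the remainder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1234 ÷ 97 = 12 remainder 70
</code></pre></div></div>

<p>The remainder <code class="language-plaintext highlighter-rouge">70</code> is the <strong>fingerprint</strong> of <code class="language-plaintext highlighter-rouge">1234</code>.</p>

<p><strong>How the check works</strong>:</p>

<ul>
  <li>Send: <code class="language-plaintext highlighter-rouge">data=1234, checksum=70</code></li>
  <li>Receive: recompute the remainder of received % 97</li>
  <li>The remainder is not 70 → an error is detected</li>
</ul>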

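<p><strong>To judge whether a received message is correct</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Received: data=1234, checksum=70

1. Compute 1234 % 97 = 70
2. Compare: computed remainder (70) == received checksum (70)?
3. Equal → the data is very likely correct
4. Not equal → the data definitely contains an error
</code></pre></div></div>

<p>One important point: equal remainders cannot guarantee 100% that the data is correct (collisions are possible), but unequal remainders prove with certainty that an error occurred.</p>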

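<p><strong>Example</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1234 % 97 = 70
1235 % 97 = 71  ← the data changed, so the remainder changed
</code></pre></div></div>

<p>CRC adds one more step to its definition. A message smaller than the divisor would otherwise be its own checksum (for example 42 % 97 = 42), so every input is first multiplied by a fixed, fairly large factor before the remainder is taken:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Multiply by 100 first: 42 → 4200
4200 % 97 = 29  → send: data=42, check=29
</code></pre></div></div>

<p>Collision probability: with the prime 97 as the divisor, a random corruption goes undetected with probability about 1/97 ≈ 1%.</p>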
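<p>The integer scheme above can be sketched in a few lines of Python. This is only an illustration of the toy decimal checksum, not of real CRC; the names <code class="language-plaintext highlighter-rouge">checksum</code> and <code class="language-plaintext highlighter-rouge">verify</code> are made up for this sketch.</p>

```python
DIVISOR = 97   # the prime divisor used in the text
SCALE = 100    # the "multiply first" factor used in the text


def checksum(value: int) -> int:
    """Toy checksum: scale the value, then keep the remainder."""
    return (value * SCALE) % DIVISOR


def verify(value: int, received_check: int) -> bool:
    """Recompute the remainder and compare it with the received one."""
    return checksum(value) == received_check


sent_check = checksum(42)
print(sent_check)                    # 29, as in the text
print(verify(42, sent_check))        # True: data intact
print(verify(43, sent_check))        # False: corruption detected
print(verify(42 + 97, sent_check))   # True: a collision, the error goes unnoticed
```

<p>The last line shows why equal remainders only make the data "very likely" correct: any value that differs by a multiple of 97 produces the same fingerprint.</p>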

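<p>That is the core idea of a CRC check:</p>

<ul>
  <li>Use the remainder of a division by a fixed divisor as the "fingerprint"</li>
</ul>

<p>These principles are exactly those of a standard CRC. The only difference is that CRC uses binary polynomials instead of decimal numbers, and XOR instead of ordinary addition and subtraction. With this foundation in place, let us see how the CRC algorithm is formally defined.</p>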

<h1 id="从自然数四则运算到-gf2多项式的四则运算">From Integer Arithmetic to GF(2) Polynomial Arithmetic</h1>

<h2 id="gf2-运算二进制世界的数学">GF(2) Arithmetic: Math in the Binary World</h2>

<p>In the binary world there are only two numbers, 0 and 1. This is GF(2), the Galois Field of 2 elements. Because there are only two elements, the rules are very simple, yet they behave much like everyday arithmetic (they obey the commutative, distributive and associative laws):</p>

<p><strong>Addition on 0 and 1</strong> (the same as XOR):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0  ← no carry!
</code></pre></div></div>

<p><strong>Subtraction on 0 and 1</strong> (also XOR):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 - 0 = 0
1 - 1 = 0
1 - 0 = 1
0 - 1 = 1  ← no borrow!
</code></pre></div></div>

<p>So in the world of 0 and 1, addition and subtraction are the same operation: <code class="language-plaintext highlighter-rouge">x+1 == x-1</code>, and both correspond to XOR:
<code class="language-plaintext highlighter-rouge">a + b → a ^ b</code></p>

<p><strong>Multiplication on 0 and 1</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 × 0 = 0
0 × 1 = 0
1 × 0 = 0
1 × 1 = 1  ← same as ordinary multiplication
</code></pre></div></div>

<p>Multiplication on 0 and 1 is simply the <code class="language-plaintext highlighter-rouge">AND</code> operation:
<code class="language-plaintext highlighter-rouge">a × b = a &amp; b</code></p>

<p><strong>Division on 0 and 1</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 ÷ 1 = 0
1 ÷ 1 = 1  ← same as ordinary division
</code></pre></div></div>

<p>So GF(2) has all four arithmetic operations, with rules that are essentially those of the integers, except that addition becomes XOR.</p>

<p>One more interesting property: GF(2) arithmetic is <strong>closed</strong>. Whatever you compute, the result is always 0 or 1 and never any other number.</p>

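<h2 id="多项式的四则运算">Polynomial Arithmetic</h2>

<p>For polynomials in x, the four operations are defined as follows:</p>

<ul>
  <li>Addition and subtraction: add or subtract the coefficients of terms with the same power;</li>
  <li>Multiplication: multiply every term of one polynomial by every term of the other, then merge terms with the same power;</li>
  <li>Division is the inverse of multiplication;</li>
</ul>

<p>From this we can see that polynomial arithmetic only requires that the coefficients themselves support addition, subtraction and multiplication. In other words, anything with these operations can serve as polynomial coefficients.
GF(2) qualifies.</p>

<p>Whether the coefficients are integers or elements of GF(2), polynomial arithmetic satisfies the following laws:</p>

<ul>
  <li><strong>Commutativity of addition</strong>: P(x) + Q(x) = Q(x) + P(x)</li>
  <li><strong>Commutativity of multiplication</strong>: P(x) × Q(x) = Q(x) × P(x)</li>
  <li><strong>Associativity of addition</strong>: (P + Q) + R = P + (Q + R)</li>
  <li><strong>Associativity of multiplication</strong>: (P × Q) × R = P × (Q × R)</li>
  <li><strong>Distributivity</strong>: P × (Q + R) = P×Q + P×R</li>
</ul>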

<p><strong>GF(2) polynomial examples</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Checking the distributive law:
(x² + 1) × (x + 1) = (x² + 1)×x + (x² + 1)×1
                   = x³ + x + x² + 1    ← no like terms here, so nothing cancels

An example that shows the GF(2) behavior:
(x + 1) + (x + 1) = x + 1 + x + 1 = (x + x) + (1 + 1)
                  = 0 + 0 = 0       ← identical terms cancel!

Polynomial addition:
(x² + x + 1) + (x² + 1) = x² + x + 1 + x² + 1
                        = (x² + x²) + x + (1 + 1)
                        = 0 + x + 0 = x    ← terms with the same power cancel
</code></pre></div></div>

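<p>By now you may have noticed that GF(2) polynomials and binary numbers are in <strong>one-to-one correspondence</strong>. This means we can compute with binary numbers as if they were polynomials. Put plainly, we borrow polynomial arithmetic to define a new set of arithmetic rules on binary numbers.</p>

<p><strong>Multi-bit examples</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Addition / subtraction:
  1011
⊕ 1101
------
  0110

Multiplication (polynomial multiplication):
  101  × 11 = 101 × (10 + 1) = 101 × 10 + 101 × 1
            = 1010 ⊕ 101 = 1111
</code></pre></div></div>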
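<p>As a rough sketch, the multi-bit rules above map onto integer bit operations in Python: addition and subtraction are XOR, and multiplication is a shift-and-XOR loop (carry-less multiplication). The helper names <code class="language-plaintext highlighter-rouge">gf2_add</code> and <code class="language-plaintext highlighter-rouge">gf2_mul</code> are chosen for this illustration only.</p>

```python
def gf2_add(a: int, b: int) -> int:
    """Add (or subtract) two GF(2) polynomials stored as bit patterns."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiply two GF(2) polynomials: shift-and-XOR, no carries."""
    result = 0
    while b:
        if b & 1:          # this power of x is present in b
            result ^= a    # add the shifted copy of a
        a <<= 1            # multiply a by x
        b >>= 1
    return result


print(bin(gf2_add(0b1011, 0b1101)))  # 0b110, i.e. 0110
print(bin(gf2_mul(0b101, 0b11)))     # 0b1111
print(gf2_add(0b11, 0b11))           # 0: identical terms cancel
```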

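<p>Going one step further: the CRC computation described earlier does not actually need integer division.
What it really needs is <strong>the four arithmetic operations</strong>.
It does not care whether the operands are integers, natural numbers or something else. As long as the arithmetic laws hold, CRC works.</p>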

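<h2 id="使用-gf2多项式的四则运算是为了充分利用值域空间">GF(2) Polynomial Arithmetic Makes Full Use of the Value Space</h2>

<p>Here we treat a binary number as a polynomial with GF(2) coefficients,
and define arithmetic on binary numbers through polynomial arithmetic.</p>

<p>The reason is that this uses the whole value space of a binary field.</p>

<p>For example, if we stayed with integer arithmetic, we would pick a prime such as 97 as the CRC divisor. Every checksum then lies in <code class="language-plaintext highlighter-rouge">[0, 97)</code>.
Stored in one byte, the values 97 through 255 are never used, which weakens the check.</p>

<p>With GF(2) polynomials we instead pick an irreducible polynomial, for example
x³ + x + 1 (binary 1011), as the divisor.
The remainder can then be any of the 2³ polynomials of degree at most 2, so every bit pattern of the checksum field is a possible value,
which maximizes the detection ability. The standard CRC32 likewise uses all 2³² values.</p>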

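<h2 id="二进制数和多项式的对应关系">How Binary Numbers Map to Polynomials</h2>

<p>Every binary number corresponds to a polynomial:</p>

<p><strong>Polynomial representation</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>binary: 1011 → polynomial: 1·x³ + 0·x² + 1·x¹ + 1·x⁰ = x³ + x + 1
binary: 1101 → polynomial: 1·x³ + 1·x² + 0·x¹ + 1·x⁰ = x³ + x² + 1
</code></pre></div></div>

<p>Now the familiar idea of division can be applied to binary data.</p>

<p><strong>CRC is defined as</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(data polynomial × 100..000) % generator polynomial = remainder (the CRC value)
</code></pre></div></div>

<p>Just as we picked the prime 97 as the divisor, the generator polynomial of a CRC is chosen with care, because the choice determines which error patterns are guaranteed to be detected. An <strong>irreducible polynomial</strong> is the direct analogue of a prime. Practical generators are not always irreducible, though: the CRC-32C generator used by LevelDB has an even number of terms and is therefore divisible by (x + 1).</p>

<p>Let us demonstrate the CRC computation with a 4-bit (degree-3) irreducible polynomial:</p>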

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Data polynomial: x² + 1 (binary 101)
Generator polynomial: x³ + x + 1 (binary 1011, an irreducible polynomial)

To compute the CRC we need the remainder of:
(x² + 1) · x³ % (x³ + x + 1)
Appending 3 zeros to the data equals multiplying by x³ (number of zeros = degree of the generator)

Polynomial long division:
Dividend: (x² + 1) · x³ = x⁵ + x³ (binary 101000)

                                  x²
              ___________________________________
 x³ + x + 1  √ x⁵ + 0·x⁴ + x³ + 0·x² + 0·x + 0
               x⁵ + 0·x⁴ + x³ +   x²
               ----------------------------------
                                  x² + 0·x + 0   ← remainder

Result: the CRC of the data (x² + 1) is 100
Send message M | CRC(M): 101 | 100
</code></pre></div></div>
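<p>The long division above can be reproduced with a small Python sketch. It is a bit-at-a-time toy with no initial value, reflection or final XOR, so it is not a production CRC; <code class="language-plaintext highlighter-rouge">gf2_mod</code> and <code class="language-plaintext highlighter-rouge">toy_crc</code> are names invented for this example.</p>

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division (bit patterns as ints)."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Align the divisor with the leading 1 of the dividend and subtract (XOR).
        shift = dividend.bit_length() - divisor_len
        dividend ^= divisor << shift
    return dividend


def toy_crc(message: int, generator: int) -> int:
    """CRC = (message * x^degree) mod generator."""
    degree = generator.bit_length() - 1
    return gf2_mod(message << degree, generator)


crc = toy_crc(0b101, 0b1011)
print(format(crc, "03b"))  # 100, matching the worked example
```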

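<p>Conceptually this is identical to the integer division introduced in the first section, only with polynomial division and XOR.</p>

<p>The remainder-as-fingerprint idea from integer division is the core principle of CRC32.
Replacing integer arithmetic with GF(2) polynomial arithmetic on top of it
makes full use of the value space, which fits the power-of-two widths (such as the 2⁸ values of a byte) that computers work with.</p>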

<h1 id="crc-在-leveldb-中一个看似多余的设计">A Seemingly Redundant Design in LevelDB's CRC</h1>

<h2 id="leveldb-的-mask-操作">LevelDB's Mask Operation</h2>

<p>In the LevelDB code (<a href="https://github.com/google/leveldb/blob/main/util/crc32c.h#L26">source link</a>), after the CRC-32C checksum is computed it is not used directly. It is first transformed:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// https://github.com/google/leveldb/blob/main/util/crc32c.h#L26</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">kMaskDelta</span> <span class="o">=</span> <span class="mh">0xa282ead8ul</span><span class="p">;</span>

<span class="c1">// Return a masked representation of crc.</span>
<span class="c1">//</span>
<span class="c1">// Motivation: it is problematic to compute the CRC of a string that</span>
<span class="c1">// contains embedded CRCs.  Therefore we recommend that CRCs stored</span>
<span class="c1">// somewhere (e.g., in files) should be masked before being stored.</span>
<span class="kr">inline</span> <span class="kt">uint32_t</span> <span class="nf">Mask</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">crc</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// Rotate right by 15 bits and add a constant.</span>
  <span class="k">return</span> <span class="p">((</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&lt;&lt;</span> <span class="mi">17</span><span class="p">))</span> <span class="o">+</span> <span class="n">kMaskDelta</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The operation has two steps:</p>

<ol>
  <li><strong>Rotate right by 15 bits</strong>: <code class="language-plaintext highlighter-rouge">(crc &gt;&gt; 15) | (crc &lt;&lt; 17)</code></li>
  <li><strong>Add a constant</strong>: <code class="language-plaintext highlighter-rouge">+ kMaskDelta</code></li>
</ol>

<h2 id="问题">The Problem</h2>

<p>The code comment points at the key issue:</p>

<blockquote>
  <p><strong>“it is problematic to compute the CRC of a string that contains embedded CRCs”</strong></p>
</blockquote>

<p><strong>Yes, but</strong> what exactly is problematic?</p>

<p>The issue shows up in a common scenario:
<strong>when the CRC checksum itself becomes part of the data being checked</strong>.</p>

<p>Consider this nested structure. We start with a message M and compute <code class="language-plaintext highlighter-rouge">CRC(M)</code>.
We concatenate M and <code class="language-plaintext highlighter-rouge">CRC(M)</code>. The next layer of the transport protocol then treats the whole <code class="language-plaintext highlighter-rouge">M | CRC(M)</code>
as one message and computes a CRC over it again:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 1:
message M  →  M | CRC(M)

Layer 2:
M | CRC(M)  →  M | CRC(M) | CRC(M | CRC(M))
</code></pre></div></div>

<p>Here is the problem: if both layers use the same CRC algorithm, the second-layer CRC is the same fixed constant no matter what the message M contains!</p>

<p>This means the outer CRC value carries no information about the message: however the inner data changes, the outer layer always computes the same checksum.</p>

<p>LevelDB's mask operation is designed for exactly this problem. By applying a reversible transform to the stored CRC value, it avoids the degeneration and keeps multi-layer checksums meaningful.</p>

<h1 id="crc-的自包含退化现象">The CRC "Self-Inclusion Degeneration" Phenomenon</h1>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 1:
message M  →  M | CRC(M)

Layer 2:
M | CRC(M)  →  M | CRC(M) | CRC(M | CRC(M))
</code></pre></div></div>

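<h2 id="用简单-crc-推导退化现象">Deriving the Degeneration with a Simple CRC</h2>

<p>To show the degeneration clearly, we define a simplified CRC algorithm whose divisor polynomial is <code class="language-plaintext highlighter-rouge">111₂</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CRC = (message M × 1000₂) % 111₂
</code></pre></div></div>

<p>In binary:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">1000₂ = 8</code> (that is 2³, the polynomial x³)</li>
  <li><code class="language-plaintext highlighter-rouge">111₂ = 7</code> (the polynomial x² + x + 1, a 3-bit irreducible polynomial of degree 2)</li>
</ul>

<p>We store this CRC in a 3-bit field. Because the divisor has degree 2, the remainder only ever occupies the low 2 bits of that field, which does not affect the derivation below.</p>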

<h2 id="完整的数学推导">The Full Derivation</h2>

<p>∵ CRC(M) = (M × 1000₂) % 111₂</p>

<p>∴ M × 1000₂ = k × 111₂ + CRC(M)  (for some k)</p>

<p><strong>Binary concatenation</strong>: to append the CRC after the message M in binary, we first have to make room for it. Our CRC field is 3 bits wide, so M is shifted left by 3 bits:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M × 1000₂   (shift left by 3 bits to make room for the 3-bit CRC)
</code></pre></div></div>

<p>Then the CRC value is placed into those 3 bits:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M | CRC(M) = M × 1000₂ + CRC(M)
</code></pre></div></div>

<p>(In practice CRC32 is 4 bytes,
so appending a CRC32 to a message M means freeing 4 bytes after M, that is <code class="language-plaintext highlighter-rouge">M * 2³²</code>,
and then adding the CRC32 of M.)</p>

<blockquote>
  <p>For example: M = 101₂, CRC(M) = (101000₂) % 111₂ = 010₂</p>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M × 1000₂ = 101₂ × 1000₂ = 101000₂  (make room for 3 CRC bits)
M | CRC(M) = 101000₂ + 010₂ = 101010₂  (fill the CRC into the freed bits)
</code></pre></div>  </div>
</blockquote>

<p>Substituting the relation above into the concatenated data gives:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M | CRC(M)  =  M × 1000₂ + CRC(M)
            =  (k × 111₂ + CRC(M)) + CRC(M)
            =  k × 111₂ + CRC(M) + CRC(M)
</code></pre></div></div>

<p>In GF(2), anything added to itself is 0:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CRC(M) + CRC(M) = 0  (property of GF(2) addition)
</code></pre></div></div>

<p>Therefore:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M | CRC(M)  =  k × 111₂
</code></pre></div></div>

<p>So we find that <code class="language-plaintext highlighter-rouge">M | CRC(M)</code> is exactly a multiple of 111₂!</p>

<p><strong>Computing the second-layer CRC</strong></p>

<p>Next we compute CRC(M | CRC(M)), using M | CRC(M) = k × 111₂:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CRC(M | CRC(M))  =  CRC(k × 111₂)
                 =  (k × 111₂ × 1000₂) % 111₂
                 =  ((k × 1000₂) × 111₂) % 111₂
                 =  0   // any multiple of 111₂ is 0 modulo 111₂
</code></pre></div></div>

<p><strong>This is the mathematical essence of the degeneration</strong>:
whatever M is, CRC(M | CRC(M)) always equals the same fixed constant (0)!</p>
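<p>The derivation can be checked by brute force. The sketch below follows the toy definition from this section (multiply by 1000₂, divide by 111₂, 3-bit CRC field); the function names are hypothetical and exist only for this check.</p>

```python
GENERATOR = 0b111   # x^2 + x + 1, the toy divisor from the text
CRC_BITS = 3        # width of the CRC field, i.e. the factor 1000 in binary


def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend


def toy_crc(message: int) -> int:
    return gf2_mod(message << CRC_BITS, GENERATOR)


def append_crc(message: int) -> int:
    """Build M | CRC(M): shift M left and drop the CRC into the free bits."""
    return (message << CRC_BITS) | toy_crc(message)


print(format(append_crc(0b101), "06b"))               # 101010
outer = {toy_crc(append_crc(m)) for m in range(1, 256)}
print(outer)                                          # {0}: the outer CRC is constant
```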

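<h2 id="实际系统中的表现">What Happens in Real Systems</h2>

<p>Real CRC implementations add steps such as an initial value and a final XOR, so the fixed value is usually not 0 but some specific constant (called the residue). The core problem is unchanged:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CRC(any message | CRC of that message) = fixed constant (independent of the message)
</code></pre></div></div>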
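<p>The same effect is easy to observe with the CRC-32 that ships in Python's standard library. Note the assumption: <code class="language-plaintext highlighter-rouge">zlib.crc32</code> implements the IEEE CRC-32, not the CRC-32C used by LevelDB, so the constant differs from LevelDB's case, but the phenomenon is the same. The helper <code class="language-plaintext highlighter-rouge">outer_crc</code> is a name made up for this sketch.</p>

```python
import os
import zlib


def outer_crc(message: bytes) -> int:
    """CRC over (message | little-endian CRC of message)."""
    inner = zlib.crc32(message)
    framed = message + inner.to_bytes(4, "little")
    return zlib.crc32(framed)


# Random messages of many different lengths all give the same outer CRC.
results = {outer_crc(os.urandom(n)) for n in range(64)}
print([hex(r) for r in results])  # ['0x2144df1c']
```

<p>The CRC has to be appended in little-endian byte order here, because this CRC variant processes bits in reflected order; with a big-endian append the algebraic relation does not line up and the constant does not appear.</p>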

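<h1 id="mask-机制的影响分析">Analyzing the Effect of the Mask</h1>

<p>So this is the problem the LevelDB comment refers to,
and Mask is an attempt to remove the degeneration by transforming the CRC value.
Clearly, after <code class="language-plaintext highlighter-rouge">((crc &gt;&gt; 15) | (crc &lt;&lt; 17)) + kMaskDelta</code>
the algebraic relation above no longer holds, so the outer CRC is no longer a message-independent constant.</p>

<p>Next, let us analyze what properties Mask actually provides:</p>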

<h2 id="对随机数据损坏的防护能力无变化">Protection Against Random Corruption: Unchanged</h2>

<p>Mathematically, the mask neither improves nor weakens the ability to detect random bit errors.</p>

<p>Let us compare the two schemes.
With M as the original message, the two mappings are:</p>

<ul>
  <li><strong>Without mask</strong>: M → (CRC₁, CRC₂) = (CRC(M), fixed constant)</li>
  <li><strong>With mask</strong>: M → (CRC₁, CRC₂’) = (CRC(M), CRC(M | mask(CRC(M))))</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   CRC(M | mask(CRC(M)))
=  CRC( M * 1000₂         + mask(CRC(M)) )
=  CRC( k * 111₂ + CRC(M) + mask(CRC(M)) )
=     ( k * 111₂ + CRC(M) + mask(CRC(M)) ) * 1000₂ % 111₂
=     ( CRC(M) + mask(CRC(M)) ) * 1000₂ % 111₂
</code></pre></div></div>

<p>The derivation shows that with the mask, the final result still depends only on <code class="language-plaintext highlighter-rouge">CRC(M)</code>. Since <code class="language-plaintext highlighter-rouge">CRC(M)</code> can take at most 2³ values, all messages M map to at most 2³ points in the plane, and the probability that a random corruption goes undetected is still about 1/2³.</p>
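<p>A small experiment in the same toy model illustrates this. The 3-bit <code class="language-plaintext highlighter-rouge">toy_mask</code> below (rotate right by 1, add 101₂ modulo 8) is an invented stand-in that only imitates the shape of LevelDB's mask; it is not LevelDB's function. The check confirms that the outer CRC is no longer constant, yet is fully determined by the inner CRC.</p>

```python
GENERATOR = 0b111   # toy divisor from the text
CRC_BITS = 3        # width of the toy CRC field


def gf2_mod(dividend: int, divisor: int) -> int:
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend


def toy_crc(message: int) -> int:
    return gf2_mod(message << CRC_BITS, GENERATOR)


def toy_mask(crc: int) -> int:
    """Hypothetical 3-bit mask: rotate right by 1, then add a constant mod 8."""
    rotated = ((crc >> 1) | (crc << 2)) & 0b111
    return (rotated + 0b101) & 0b111


# Group the outer CRCs by the inner CRC that produced them.
outer_by_inner = {}
for m in range(1, 64):
    inner = toy_crc(m)
    framed = (m << CRC_BITS) | toy_mask(inner)
    outer_by_inner.setdefault(inner, set()).add(toy_crc(framed))

for inner, outers in sorted(outer_by_inner.items()):
    print(format(inner, "03b"), "->", [format(o, "03b") for o in sorted(outers)])
```

<p>Each inner CRC maps to exactly one outer CRC, so the pair (inner, outer) carries no more information than the inner CRC alone.</p>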

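<p>Interestingly, both schemes map an arbitrary message M to a specific point in a 2-dimensional space. Although the outer CRC is a fixed constant without the mask, this does not reduce the overall error detection ability:</p>

<ol>
  <li>Both schemes have the same number of reachable values (at most 2³ possible combinations)</li>
  <li>For random bit flips, both schemes detect an error with exactly the same probability</li>
  <li>The "degeneration" of the outer CRC does not affect the validity of the inner CRC</li>
</ol>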

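<h2 id="对人为攻击的防护能力显著提升">Protection Against Deliberate Tampering: Noticeably Better</h2>

<p>The real value of the mask lies in resisting deliberate tampering rather than random errors.</p>

<p>Let us compare how hard the forger's job is in the two cases:</p>

<p><strong>Without the mask</strong>:
the attacker knows the outer CRC is always the same constant and can construct any (M’, CRC(M’)) pair. As long as the inner CRC is computed correctly, the outer check is bound to pass, so forging is cheap.</p>

<p><strong>With the mask</strong>:
the outer CRC now depends on the content, so the attacker can no longer reuse a known constant. To pass the outer check, the forger has to reproduce the mask step and recompute the outer CRC for every forged message. Because the mask is public and reversible, this raises the effort of a naive forgery; it is not cryptographic protection.</p>

<p>There is an important distinction here:</p>

<ul>
  <li><strong>Resisting corruption</strong>: defending against random bit changes, which mainly depends on the size of the value space</li>
  <li><strong>Resisting tampering</strong>: defending against deliberate bit changes, which mainly depends on how hard a valid forgery is to construct</li>
</ul>

<p>The mask raises the construction difficulty for a forger while leaving the detection of random errors completely unaffected. That is why LevelDB adds this seemingly "redundant" mask operation.</p>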

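<h1 id="实践验证用代码重现退化现象">Hands-On Verification: Reproducing the Degeneration in Code</h1>

<h2 id="python实现">Python Implementation</h2>

<ul>
  <li>Python example code: https://gist.github.com/drmingdrmer/2b49106b225ca7bf361fd21454f693da</li>
</ul>

<p>Let us use CRC-32C (the algorithm LevelDB uses) to verify the self-inclusion degeneration:</p>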

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="c1"># CRC-32C (Castagnoli) 参数
</span><span class="n">POLY_REV</span> <span class="o">=</span> <span class="mh">0x82F63B78</span>

<span class="k">def</span> <span class="nf">_make_crc32c_table</span><span class="p">():</span>
    <span class="n">tbl</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">256</span><span class="p">):</span>
        <span class="n">crc</span> <span class="o">=</span> <span class="n">i</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">8</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">crc</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">POLY_REV</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">crc</span> <span class="o">&gt;&gt;=</span> <span class="mi">1</span>
        <span class="n">tbl</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">tbl</span>

<span class="n">_CRC32C_TABLE</span> <span class="o">=</span> <span class="n">_make_crc32c_table</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">crc32c</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">"""标准 CRC-32C 实现"""</span>
    <span class="n">crc</span> <span class="o">=</span> <span class="mh">0xFFFFFFFF</span>
    <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
        <span class="n">crc</span> <span class="o">=</span> <span class="n">_CRC32C_TABLE</span><span class="p">[(</span><span class="n">crc</span> <span class="o">^</span> <span class="n">b</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xFF</span><span class="p">]</span> <span class="o">^</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">crc</span> <span class="o">^</span> <span class="mh">0xFFFFFFFF</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span>
</code></pre></div></div>

<h2 id="验证嵌套crc退化现象">Verifying the Nested CRC Degeneration</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">demonstrate_nested_crc_degradation</span><span class="p">():</span>
    <span class="c1"># 模拟不同的LevelDB数据块
</span>    <span class="n">leveldb_blocks</span> <span class="o">=</span> <span class="p">[</span>
        <span class="sa">b</span><span class="s">"user_data_1"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"user_data_2"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"important_record_12345"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"transaction_log_abcdef"</span><span class="p">,</span>
        <span class="nb">bytes</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">)),</span>  <span class="c1"># 二进制数据
</span>    <span class="p">]</span>

    <span class="c1"># 添加一些随机数据块
</span>    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
        <span class="n">leveldb_blocks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">urandom</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">)))</span>

    <span class="k">print</span><span class="p">(</span><span class="s">"=== 嵌套CRC退化验证 ==="</span><span class="p">)</span>
    <span class="n">outer_crcs</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">block_data</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">leveldb_blocks</span><span class="p">):</span>
        <span class="c1"># 第一层：LevelDB内部CRC
</span>        <span class="n">inner_crc</span> <span class="o">=</span> <span class="n">crc32c</span><span class="p">(</span><span class="n">block_data</span><span class="p">)</span>
        <span class="n">leveldb_block</span> <span class="o">=</span> <span class="n">block_data</span> <span class="o">+</span> <span class="n">inner_crc</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">"little"</span><span class="p">)</span>

        <span class="c1"># 第二层：外层协议CRC（如网络传输）
</span>        <span class="n">outer_crc</span> <span class="o">=</span> <span class="n">crc32c</span><span class="p">(</span><span class="n">leveldb_block</span><span class="p">)</span>
        <span class="n">complete_packet</span> <span class="o">=</span> <span class="n">leveldb_block</span> <span class="o">+</span> <span class="n">outer_crc</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">"little"</span><span class="p">)</span>

        <span class="n">outer_crcs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">outer_crc</span><span class="p">)</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"块 #</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">:</span><span class="mi">2</span><span class="n">d</span><span class="si">}</span><span class="s">: 数据长度=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">block_data</span><span class="p">)</span><span class="si">:</span><span class="mi">3</span><span class="n">d</span><span class="si">}</span><span class="s">, "</span>
              <span class="sa">f</span><span class="s">"内层CRC=0x</span><span class="si">{</span><span class="n">inner_crc</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">, "</span>
              <span class="sa">f</span><span class="s">"外层CRC=0x</span><span class="si">{</span><span class="n">outer_crc</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># 检查外层CRC是否都相同
</span>    <span class="n">unique_outer_crcs</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">outer_crcs</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">不同外层CRC的数量: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">unique_outer_crcs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"所有外层CRC都是: 0x</span><span class="si">{</span><span class="nb">list</span><span class="p">(</span><span class="n">unique_outer_crcs</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"&gt;&gt;&gt; 外层CRC完全退化！无论内层数据如何变化，外层CRC都是固定值"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="运行结果">Output</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== 嵌套CRC退化验证 ===
块 # 1: 数据长度= 11, 内层CRC=0x12AB34CD, 外层CRC=0x48674BC7
块 # 2: 数据长度= 11, 内层CRC=0x56EF78AB, 外层CRC=0x48674BC7
块 # 3: 数据长度= 22, 内层CRC=0x9A2B3C4D, 外层CRC=0x48674BC7
块 # 4: 数据长度= 22, 内层CRC=0xDEF01234, 外层CRC=0x48674BC7
块 # 5: 数据长度= 50, 内层CRC=0x567890AB, 外层CRC=0x48674BC7
...

不同外层CRC的数量: 1
所有外层CRC都是: 0x48674BC7
&gt;&gt;&gt; 外层CRC完全退化！无论内层数据如何变化，外层CRC都是固定值
</code></pre></div></div>

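<p>As we can see:</p>

<ul>
  <li>The content of every LevelDB data block is completely different</li>
  <li>The inner CRC of every block is different as well (0x12AB34CD, 0x56EF78AB, and so on)</li>
  <li>Yet every outer CRC has the same value: <code class="language-plaintext highlighter-rouge">0x48674BC7</code>, which is the CRC-32C residue constant after the final XOR</li>
</ul>

<p>This means the checksum of the outer protocol no longer reflects the content at all. However the data inside LevelDB changes, the "checksum" the outer layer computes is always the same constant.</p>

<h2 id="leveldb的mask方案">LevelDB's Mask Scheme</h2>

<p>Now let us see how the mask breaks this degeneration:</p>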

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># LevelDB的mask函数
</span><span class="n">K_MASK_DELTA</span> <span class="o">=</span> <span class="mh">0xA282EAD8</span>

<span class="k">def</span> <span class="nf">mask</span><span class="p">(</span><span class="n">crc</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="c1"># 右旋15位 + 加常数
</span>    <span class="n">rotated</span> <span class="o">=</span> <span class="p">((</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&lt;&lt;</span> <span class="mi">17</span><span class="p">))</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">rotated</span> <span class="o">+</span> <span class="n">K_MASK_DELTA</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span>

<span class="k">def</span> <span class="nf">demonstrate_mask_effect</span><span class="p">():</span>
    <span class="c1"># 使用前面相同的LevelDB数据块
</span>    <span class="n">leveldb_blocks</span> <span class="o">=</span> <span class="p">[</span>
        <span class="sa">b</span><span class="s">"user_data_1"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"user_data_2"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"important_record_12345"</span><span class="p">,</span>
        <span class="sa">b</span><span class="s">"transaction_log_abcdef"</span><span class="p">,</span>
    <span class="p">]</span>

    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">=== 使用LevelDB mask后的结果 ==="</span><span class="p">)</span>
    <span class="n">outer_crcs_masked</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">block_data</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">leveldb_blocks</span><span class="p">):</span>
        <span class="c1"># 第一层：LevelDB内部CRC + mask
</span>        <span class="n">inner_crc</span> <span class="o">=</span> <span class="n">crc32c</span><span class="p">(</span><span class="n">block_data</span><span class="p">)</span>
        <span class="n">masked_crc</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">inner_crc</span><span class="p">)</span>  <span class="c1"># 关键：存储前mask
</span>        <span class="n">leveldb_block_masked</span> <span class="o">=</span> <span class="n">block_data</span> <span class="o">+</span> <span class="n">masked_crc</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">"little"</span><span class="p">)</span>

        <span class="c1"># 第二层：外层协议CRC
</span>        <span class="n">outer_crc</span> <span class="o">=</span> <span class="n">crc32c</span><span class="p">(</span><span class="n">leveldb_block_masked</span><span class="p">)</span>

        <span class="n">outer_crcs_masked</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">outer_crc</span><span class="p">)</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"块 #</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: 原CRC=0x</span><span class="si">{</span><span class="n">inner_crc</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">, "</span>
              <span class="sa">f</span><span class="s">"Mask后=0x</span><span class="si">{</span><span class="n">masked_crc</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">, "</span>
              <span class="sa">f</span><span class="s">"外层CRC=0x</span><span class="si">{</span><span class="n">outer_crc</span><span class="si">:</span><span class="mi">08</span><span class="n">X</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="n">unique_masked</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">outer_crcs_masked</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">使用mask后，不同外层CRC的数量: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">unique_masked</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"&gt;&gt;&gt; Mask成功！外层CRC现在随数据内容变化"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== 使用LevelDB mask后的结果 ===
块 #1: 原CRC=0x12AB34CD, Mask后=0xE5F2A891, 外层CRC=0x3C7B2A15
块 #2: 原CRC=0x56EF78AB, Mask后=0xC4D8B672, 外层CRC=0x8F4E1B29
块 #3: 原CRC=0x9A2B3C4D, Mask后=0x7B3EA254, 外层CRC=0x1D6C9F38
块 #4: 原CRC=0xDEF01234, Mask后=0x2A8C5E91, 外层CRC=0x94E7C2B6

使用mask后，不同外层CRC的数量: 4
&gt;&gt;&gt; Mask成功！外层CRC现在随数据内容变化
</code></pre></div></div>

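<p>The difference is visible. With the mask, every LevelDB data block produces a different outer CRC value, so the outer checksum depends on the content again. The inner and masked hex values in this listing should be read as illustrative placeholders rather than exact results; for instance, evaluating the mask function on 0x12AB34CD by hand gives 0x0C1D102E.</p>

<h1 id="现实应用分层存储系统中的校验挑战">In Practice: Checksum Challenges in Layered Storage Systems</h1>

<h2 id="存储系统的分层结构">The Layered Structure of Storage Systems</h2>

<p>Modern storage systems are usually built in layers, and every layer has its own integrity check:</p>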

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Application data
├── Application-layer CRC (e.g. database record checksum)
├── Storage-engine CRC (e.g. LevelDB block checksum)
├── File-system CRC (e.g. ext4/ZFS checksum)
└── Hardware CRC (e.g. disk sector checksum)
</code></pre></div></div>

<p>In such an architecture, the data handed from one layer to the next often already contains another layer's checksum, which creates a nested structure.</p>

<h2 id="问题场景举例">Example Scenarios</h2>

<p><strong>Scenario 1: database backup</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Original record: [user_data | record_crc]
Backup file:     [backup_header | [user_data | record_crc] | backup_crc]
</code></pre></div></div>

<p>If backup_crc uses the same CRC algorithm, it runs into the self-inclusion degeneration.</p>

<p><strong>Scenario 2: network transfer</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LevelDB block:  [block_data | block_crc(masked)]
Network packet: [packet_header | [block_data | block_crc(masked)] | packet_crc]
</code></pre></div></div>

<p>LevelDB's mask ensures that packet_crc does not degenerate.</p>

<p><strong>Scenario 3: RAID systems</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Disk sector: [user_data | user_crc | sector_crc]
RAID parity: [sector 1 | sector 2 | ... | raid_parity]
</code></pre></div></div>

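<h2 id="其他系统的应对策略">How other systems cope</h2>

<p>There are four common strategies:</p>

<ol>
  <li><strong>Different CRC polynomials.</strong> Each layer uses its own generator polynomial. This is simple and direct, but the layers have to coordinate their choices, which adds implementation complexity.</li>
  <li><strong>Layer identifiers.</strong> A layer tag is placed in front of the data, for example <code class="language-plaintext highlighter-rouge">[layer_id | data | crc]</code>. The layering becomes explicit, at the cost of extra storage.</li>
  <li><strong>Different checksum algorithms.</strong> Layers use different algorithms such as CRC, MD5 or SHA. This rules out any conflict at the algorithm level, but it costs performance and complexity.</li>
  <li><strong>A LevelDB-style mask.</strong> The stored checksum goes through a reversible transformation. The overhead is the lowest of the four, but a correct implementation requires understanding the underlying math.</li>
</ol>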

<h2 id="设计选择的权衡">Design trade-offs</h2>

<table>
<tr class="header">
<th>Approach</th>
<th>Performance cost</th>
<th>Storage cost</th>
<th>Implementation complexity</th>
<th>Safety</th>
</tr>
<tr class="odd">
<td>Different polynomials</td>
<td>Low</td>
<td>None</td>
<td>Medium</td>
<td>Medium</td>
</tr>
<tr class="even">
<td>Layer identifiers</td>
<td>Low</td>
<td>High</td>
<td>Low</td>
<td>Medium</td>
</tr>
<tr class="odd">
<td>Different algorithms</td>
<td>High</td>
<td>Medium</td>
<td>High</td>
<td>High</td>
</tr>
<tr class="even">
<td>Mask transformation</td>
<td>Very low</td>
<td>None</td>
<td>Medium</td>
<td>Medium</td>
</tr>
</table>

<p>The table shows why LevelDB's choice of the mask is reasonable: it keeps performance high and still removes the root cause of the problem.</p>

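<h2 id="适用范围和限制">Scope and limitations</h2>

<p>The mask approach fits high-performance storage systems, workloads where nested data structures are common, and systems that need to keep the same algorithm across layers. It is less suitable in some cases, for example when the checksum is expected to provide cryptographic protection: the mask is reversible and adds no security. Extremely security-sensitive environments may need a stronger integrity scheme. And if a system already has a mature layered checksum design, introducing a mask may cost more than it gains.</p>

<h2 id="实施建议">Implementation advice</h2>

<p>If your system runs into a similar problem, consider the following steps. First confirm that the problem really exists, that is, check whether CRCs are in fact nested. Then pick the mask parameters; LevelDB's rotation amount and constant are a reasonable reference. Do not forget to implement unmask, since reads must be able to recover the original CRC. Test thoroughly to make sure the mask really removes the degeneration. Finally, record the design decision so that future maintainers know why the code is there.</p>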
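<p>As a rough sketch of the mask and unmask steps, the pair below follows the rotation and constant used in LevelDB's <code class="language-plaintext highlighter-rouge">util/crc32c.h</code>. <code class="language-plaintext highlighter-rouge">encode_block</code> and <code class="language-plaintext highlighter-rouge">decode_block</code> are hypothetical helpers, and <code class="language-plaintext highlighter-rouge">zlib.crc32</code> stands in for CRC-32C:</p>

```python
import struct
import zlib

MASK_DELTA = 0xA282EAD8  # kMaskDelta in LevelDB's util/crc32c.h
U32 = 0xFFFFFFFF


def mask(crc: int) -> int:
    """Rotate right by 15 bits, then add the constant modulo 2**32."""
    rotated = ((crc >> 15) | (crc << 17)) & U32
    return (rotated + MASK_DELTA) & U32


def unmask(masked: int) -> int:
    """Inverse of mask: subtract the constant, then rotate left by 15 bits."""
    rotated = (masked - MASK_DELTA) & U32
    return ((rotated >> 17) | (rotated << 15)) & U32


def encode_block(data: bytes) -> bytes:
    """Store the masked CRC after the data (hypothetical block layout)."""
    return data + struct.pack("<I", mask(zlib.crc32(data)))


def decode_block(block: bytes) -> bytes:
    """Unmask the stored CRC and verify it against the data."""
    data = block[:-4]
    (stored,) = struct.unpack("<I", block[-4:])
    if unmask(stored) != zlib.crc32(data):
        raise ValueError("checksum mismatch")
    return data


# Round-trip check over a few edge values.
for crc in (0, 1, 0x12AB34CD, U32):
    assert unmask(mask(crc)) == crc
print(decode_block(encode_block(b"hello")))  # b'hello'
```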

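<p>So the next time you see code that looks "redundant", it is worth asking first whether there is a deeper reason behind it. Quite often such operations are the result of careful thought.</p>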

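<h1 id="结论从数学到工程的完美结合">Conclusion: where math meets engineering</h1>

<p>We started from a simple observation: LevelDB applies a seemingly redundant mask before storing a CRC. Step by step, the analysis showed that this operation solves a rather deep mathematical problem. Mathematically, a CRC is polynomial division over GF(2); once the remainder is appended to the original message, the resulting codeword is divisible by the generator polynomial, so computing the CRC again over this "self-contained" structure always yields a fixed constant. From an engineering point of view, in a layered storage system this degeneration disables the upper-layer check, which can no longer detect corruption of, or tampering with, the lower-layer data.</p>

<p>LevelDB's mask is a good example of engineering design. It intervenes as little as possible: the CRC algorithm itself is unchanged, only a simple transformation is applied when the value is stored, and the performance impact is negligible. It is mathematically sound: the right rotation is a linear permutation, while adding a constant modulo 2^32 is not linear over GF(2) because of carry propagation, and together they break the GF(2) algebraic structure that causes the degeneration. And it is practical: the operation is reversible, so the original CRC can always be recovered; the implementation is simple, which leaves little room for mistakes; and it stays backward compatible, which eases system evolution.</p>

<p>The case also illustrates some general lessons of systems design. Mathematical foundations matter: only by understanding how CRC works can one find the root cause and design a clean fix. Details are often decisive: an operation that looks insignificant may be what protects the integrity of the whole system. And simple is often best: the best engineering answer to a complicated mathematical problem is usually the simplest one.</p>

<p>The same line of thought applies elsewhere. When hash functions are nested, other checksum algorithms may show a similar degeneration. In cryptographic protocol design, self-contained protocol messages may introduce security risks. In coding theory, layered error-correcting codes call for the same caution.</p>

<p>For developers, the takeaways are these. Do not rush to delete "redundant" code; find out why it was written that way before deciding whether to "optimize" it. Mathematical foundations are useful in practice: many problems that look like engineering problems have mathematical roots. Tests should cover edge cases, especially nested and combined scenarios. Good comments matter: they tell the next reader why the code does what it does.</p>

<p>On the surface, LevelDB's mask is just a few lines of code, but it reflects a solid understanding of the math and a strict standard of engineering quality. Perhaps that is what marks good systems software: solving a hard problem with the simplest means, and finding a good balance between correctness and performance. So the next time you meet a "strange" operation in code, stop and ask whether an interesting design idea is hiding there.</p>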

<h1 id="参考资料">References</h1>

<h2 id="源码资料">Source code</h2>

<ul>
  <li>Original article – https://blog.openacid.com/algo/crc</li>
  <li>Python example code – https://gist.github.com/drmingdrmer/2b49106b225ca7bf361fd21454f693da</li>
  <li>fuzhe at x – https://x.com/fuzhe19</li>
</ul>

<p><strong>LevelDB CRC implementation</strong>:</p>

<ul>
  <li><a href="https://github.com/google/leveldb/blob/main/util/crc32c.h">LevelDB CRC32C header</a> - full implementation of the Mask function</li>
  <li><a href="https://github.com/google/leveldb/blob/main/util/crc32c.cc">LevelDB CRC32C implementation</a> - full code of the CRC-32C computation</li>
  <li><a href="https://github.com/google/leveldb/blob/main/doc/table_format.md">LevelDB block format document</a> - description of the data block storage format</li>
</ul>

<h2 id="数学理论基础">Mathematical background</h2>

<p><strong>How CRC works</strong>:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">Wikipedia: Cyclic redundancy check</a> - mathematical basis of CRC</li>
  <li><a href="https://en.wikipedia.org/wiki/Finite_field_arithmetic">Wikipedia: Finite field arithmetic</a> - arithmetic in the finite field GF(2)</li>
  <li><a href="https://en.wikipedia.org/wiki/Irreducible_polynomial">Wikipedia: Irreducible polynomial</a> - mathematical definition of irreducible polynomials</li>
</ul>

<p><strong>Polynomial arithmetic</strong>:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Polynomial_long_division">Wikipedia: Polynomial long division</a> - polynomial long division</li>
  <li><a href="https://en.wikipedia.org/wiki/Galois_field">Wikipedia: Galois field</a> - detailed explanation of the Galois field GF(2)</li>
</ul>

<h2 id="crc-标准和应用">CRC standards and applications</h2>

<p><strong>CRC-32C (Castagnoli)</strong>:</p>

<ul>
  <li><a href="https://tools.ietf.org/html/rfc3720">RFC 3720 - Internet Small Computer Systems Interface (iSCSI)</a> - use of CRC-32C in iSCSI</li>
  <li><a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRC-32C_(Castagnoli)">Wikipedia: Cyclic redundancy check - CRC-32C</a> - details of the CRC-32C algorithm</li>
</ul>

<p><strong>Other CRC implementations</strong>:</p>

<ul>
  <li><a href="https://github.com/madler/zlib/blob/master/crc32.c">zlib CRC-32 implementation</a> - a widely used CRC-32 implementation</li>
</ul>

<h2 id="系统设计相关">System design</h2>

<p><strong>Storage system design</strong>:</p>

<ul>
  <li><a href="https://github.com/google/leveldb/blob/main/doc/index.md">LevelDB design document</a> - overall architecture of LevelDB</li>
  <li><a href="https://www.cs.umb.edu/~poneil/lsmtree.pdf">LSM-Tree paper</a> - principles of Log-Structured Merge Trees</li>
</ul>

<p><strong>Data integrity</strong>:</p>

<ul>
  <li><a href="https://queue.acm.org/detail.cfm?id=1317400">End-to-end data integrity</a> - end-to-end data integrity protection</li>
</ul>

<h2 id="工程实践案例">Engineering practice</h2>

<p><strong>Related techniques in other systems</strong>:</p>

<ul>
  <li><a href="https://github.com/postgres/postgres/blob/master/src/include/utils/pg_crc.h">PostgreSQL CRC implementation</a> - CRC implementation in PostgreSQL</li>
  <li><a href="https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Checksums.html">ZFS checksum algorithms</a> - checksum algorithms of the ZFS file system</li>
</ul>

<p><strong>Error detection techniques</strong>:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction">Reed-Solomon codes</a> - theory of error-correcting codes</li>
  <li><a href="https://en.wikipedia.org/wiki/Hamming_code">Hamming codes</a> - principles of Hamming codes</li>
</ul>

<h2 id="实用工具">Tools</h2>

<p><strong>CRC calculators</strong>:</p>

<ul>
  <li><a href="https://crccalc.com/">Online CRC Calculator</a> - online CRC calculator supporting many CRC algorithms</li>
  <li><a href="https://reveng.sourceforge.io/crc-catalogue/">RevEng CRC Catalogue</a> - parameter tables of CRC algorithms</li>
</ul>

<p><strong>Python implementations</strong>:</p>

<ul>
  <li><a href="https://pypi.org/project/crcmod/">crcmod Python library</a> - CRC library for Python</li>
  <li><a href="https://docs.python.org/3/library/zlib.html#zlib.crc32">zlib.crc32 documentation</a> - Python's built-in CRC-32 function</li>
</ul>

<h2 id="学术资料">Academic material</h2>

<p><strong>Cryptography</strong>:</p>

<ul>
  <li><a href="http://cacr.uwaterloo.ca/hac/">Handbook of Applied Cryptography</a> - handbook of applied cryptography</li>
</ul>
<p>Reference:</p>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="crc" /><category term="data-integrity" /><category term="math" /><category term="galois-field" /><summary type="html"><![CDATA[LevelDB 中 CRC 时额外做一个 mask 操作的原因。深入解析 CRC 原理，揭示 自包含退化 现象的数学本质]]></summary></entry><entry><title type="html">Single log entry 实现的 Raft 配置变更</title><link href="https://blog.openacid.com/algo/single-log-joint-cn/" rel="alternate" type="text/html" title="Single log entry 实现的 Raft 配置变更" /><published>2025-04-19T00:00:00+00:00</published><updated>2025-04-19T00:00:00+00:00</updated><id>https://blog.openacid.com/algo/single-log-joint-cn</id><content type="html" xml:base="https://blog.openacid.com/algo/single-log-joint-cn/"><![CDATA[<p><img src="/post-res/single-log-joint-cn/c915c4fcc98591ed-single-log-joint-banner.webp" alt="" /></p>

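<h1 id="前言">Preface</h1>

<p><strong>TL;DR</strong></p>

<p>A standard Raft config change needs two log entries and involves multi-phase commit and state management. This article explores whether a config change can be done with a single log entry, introduces the notion of <strong>effective-config</strong>, and analyzes its correctness and the practical problems it runs into.
Finally, it compares the alternatives and explains why standard Joint Consensus remains the recommended choice.</p>

<p><strong>Outline</strong></p>

<ol>
  <li>A brief introduction to Raft Joint Consensus</li>
  <li>The basic idea of a single-entry change</li>
  <li>Correctness proof</li>
  <li>Limitations of the single-entry approach and their patches</li>
  <li>Conclusion</li>
</ol>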

<h1 id="raft-joint-consensus-简介-2-条-config-log-entry">A brief introduction to Raft Joint Consensus: two config log entries</h1>

<p>In the Raft consensus algorithm, changing the cluster membership is a critical and intricate operation. Switching directly from the old config (e.g.
<code class="language-plaintext highlighter-rouge">{a,b,c}</code>) to the new config (e.g. <code class="language-plaintext highlighter-rouge">{x,y,z}</code>) is <strong>unsafe</strong>, because the Nodes of a cluster
cannot all switch atomically at the same instant. During the switch there can be a moment when some
Nodes (such as <code class="language-plaintext highlighter-rouge">a,b</code>) still use the old config <code class="language-plaintext highlighter-rouge">C_old</code> while others (such as <code class="language-plaintext highlighter-rouge">x,y,z</code>) have already switched to the new
config <code class="language-plaintext highlighter-rouge">C_new</code>. If a Quorum of <code class="language-plaintext highlighter-rouge">C_old</code> (e.g. <code class="language-plaintext highlighter-rouge">{a,b}</code>) and a Quorum of <code class="language-plaintext highlighter-rouge">C_new</code>
(e.g. <code class="language-plaintext highlighter-rouge">{x,y}</code>) do not intersect, two different Leaders can be elected in the same Term,
which breaks Raft's core safety guarantee and results in "split brain".</p>

<p>To change the config safely, the Raft paper proposes a two-phase method called <strong>Joint Consensus</strong>:</p>

<p><img src="/post-res/single-log-joint-cn/a7acea752fd84833-raft-joint.x.svg" alt="Figure 1: the two phases of Joint Consensus" /></p>

<ol>
  <li><strong>Phase 1: enter the Joint state (<code class="language-plaintext highlighter-rouge">C_old_new</code>)</strong>:</li>
</ol>

<p>When the Leader receives a change request, it creates a log entry containing the Joint config <code class="language-plaintext highlighter-rouge">C_old_new</code>. Under <code class="language-plaintext highlighter-rouge">C_old_new</code>, every decision needs the support of a Quorum of <code class="language-plaintext highlighter-rouge">C_old</code> and of a Quorum of <code class="language-plaintext highlighter-rouge">C_new</code>. The Leader starts using <code class="language-plaintext highlighter-rouge">C_old_new</code> as soon as it appends the Joint config to its log.</p>

<ol start="2">
  <li><strong>Phase 2: switch to the Uniform state (<code class="language-plaintext highlighter-rouge">C_new</code>)</strong>:</li>
</ol>

<p>After <code class="language-plaintext highlighter-rouge">C_old_new</code> is committed, the Leader creates a second config log entry that contains only the new config <code class="language-plaintext highlighter-rouge">C_new</code>.
The Leader starts using <code class="language-plaintext highlighter-rouge">C_new</code> as soon as this new config entry is in its log.
From this entry on, the entry itself and all later log entries only need to be committed on a quorum defined by <code class="language-plaintext highlighter-rouge">C_new</code>.
Once the log entry containing <code class="language-plaintext highlighter-rouge">C_new</code> is committed, the config change is complete.</p>

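<p>By introducing an intermediate "Joint state", this two-phase method guarantees that throughout the config change,
any two Quorums that can be in use at the same time intersect: <code class="language-plaintext highlighter-rouge">C_old</code> may coexist with <code class="language-plaintext highlighter-rouge">C_old_new</code>, and
<code class="language-plaintext highlighter-rouge">C_old_new</code> may coexist with <code class="language-plaintext highlighter-rouge">C_new</code>, but <code class="language-plaintext highlighter-rouge">C_old</code> and <code class="language-plaintext highlighter-rouge">C_new</code> never decide on their own at the same time. This prevents "split brain" and keeps the change safe. However,
the process needs <strong>two</strong> config log entries for one config change.</p>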

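<h1 id="希望只用一个-log-entry-完成-安全的-raft-config-change">A safe Raft config change with only one log entry</h1>

<p>Joint Consensus needs two log entries and two phases, which is somewhat complex.
A natural question is whether a config change can be done safely with only <strong>one</strong> config log entry.</p>

<p>The idea of a single-log-entry config change has been proposed in the community many times. Below we discuss in detail how this single-entry method works and compare it with standard Joint Consensus, to see how it simplifies the procedure while keeping it safe.</p>

<p>We introduce a notion called <code class="language-plaintext highlighter-rouge">effective-config</code>: the config that the Leader
is actually using right now, whose quorum decides whether a log entry is committed.
 This <code class="language-plaintext highlighter-rouge">effective-config</code> may differ from the config in every config log entry
stored in the Leader's log. It is a dynamic notion that reflects what the Leader
bases its decisions on "in real time" during a config change. The steps below describe it in detail.</p>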

<h2 id="术语说明">Terminology</h2>

<ul>
  <li><strong>effective-config</strong>: the dynamic config the Leader currently uses to decide whether a log entry is committed.</li>
  <li><strong>Joint config</strong>: a config that contains both the old and the new config, e.g. <code class="language-plaintext highlighter-rouge">C_old_new</code>.</li>
  <li><strong>Uniform config</strong>: a config that contains only the new config, e.g. <code class="language-plaintext highlighter-rouge">C_new</code>.</li>
  <li><strong>Barrier entry</strong>: a special log entry that marks the safe end of the joint state.</li>
</ul>

<h2 id="实现方法是">The procedure</h2>

<p>A config change driven by a single config log entry works as follows:</p>

<ul>
  <li>
    <p><strong>Initial state</strong>: assume the current cluster config is <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and the corresponding config log entry
has been committed; call it <code class="language-plaintext highlighter-rouge">C_old</code>.
At this point <code class="language-plaintext highlighter-rouge">effective-config</code> is also <code class="language-plaintext highlighter-rouge">C_old</code>. Raft requires the previous config log
entry to be committed before a new config change may be proposed, and this rule is kept.</p>

    <p><img src="/post-res/single-log-joint-cn/667db9f105260fea-single-1-start.x.svg" alt="Figure 2: single-entry change, initial state" /></p>
  </li>
  <li>
    <p><strong>Start the change</strong>: to change the config to <code class="language-plaintext highlighter-rouge">{x,y,z}</code> (call it <code class="language-plaintext highlighter-rouge">C_new</code>), the Leader
proposes a new config log entry <code class="language-plaintext highlighter-rouge">entry-i</code> (at log index <code class="language-plaintext highlighter-rouge">i</code>) that contains the config
<code class="language-plaintext highlighter-rouge">C_new</code> (<code class="language-plaintext highlighter-rouge">{x,y,z}</code>).</p>
  </li>
  <li>
    <p><strong>Enter the Joint state (the key step)</strong>: as soon as the Leader <strong>sees</strong> <code class="language-plaintext highlighter-rouge">entry-i</code> (that is,
as soon as the entry is appended to its own log, <strong>without waiting for it to be committed</strong>), the Leader <strong>immediately</strong> changes its
<code class="language-plaintext highlighter-rouge">effective-config</code> to the <strong>Joint config</strong> <code class="language-plaintext highlighter-rouge">[{a,b,c}, {x,y,z}]</code>, called <code class="language-plaintext highlighter-rouge">C_old_new</code>.</p>

    <ul>
      <li>This means that starting from <code class="language-plaintext highlighter-rouge">entry-i</code> (including <code class="language-plaintext highlighter-rouge">entry-i</code> itself), every log entry
has to be replicated to a quorum defined by <code class="language-plaintext highlighter-rouge">C_old_new</code> (e.g. <code class="language-plaintext highlighter-rouge">[{a,b}, {x,y}]</code>) before it counts as committed.</li>
    </ul>

    <p><img src="/post-res/single-log-joint-cn/64bd23d7d6bd8fd7-single-2-joint.x.svg" alt="Figure 3: single-entry change, entering the Joint state" /></p>
  </li>
  <li>
    <p><strong>Writes are allowed</strong>: during this period the cluster can still process other log entries (such as client requests),
but committing them must also satisfy the Joint Consensus rule of <code class="language-plaintext highlighter-rouge">C_old_new</code>.</p>
  </li>
  <li>
    <p><strong>Finish the change</strong>: once <code class="language-plaintext highlighter-rouge">entry-i</code> <strong>is committed on a quorum of <code class="language-plaintext highlighter-rouge">C_old_new</code></strong>,
the Leader switches <code class="language-plaintext highlighter-rouge">effective-config</code> to the <strong>Uniform config</strong>, i.e. the new
config <code class="language-plaintext highlighter-rouge">C_new</code> (<code class="language-plaintext highlighter-rouge">{x,y,z}</code>).</p>

    <ul>
      <li>From this moment on, all later log entries (those after <code class="language-plaintext highlighter-rouge">entry-i</code>)
only need to be committed on a Quorum of <code class="language-plaintext highlighter-rouge">C_new</code>.</li>
    </ul>
  </li>
</ul>

<p>In this way, in theory, a single config log entry <code class="language-plaintext highlighter-rouge">entry-i</code>
drives the whole config change <code class="language-plaintext highlighter-rouge">C_old -&gt; C_old_new -&gt; C_new</code>.</p>
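<p>The Leader-side transitions of <code class="language-plaintext highlighter-rouge">effective-config</code> described above can be sketched as a small state holder. <code class="language-plaintext highlighter-rouge">LeaderConfigState</code> and its method names are hypothetical; replication, terms and persistence are left out on purpose:</p>

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class LeaderConfigState:
    committed: Set[str]                   # last committed config (C_old)
    pending: Optional[Set[str]] = None    # config in entry-i, not yet committed
    pending_index: Optional[int] = None   # log index i of entry-i

    def effective_config(self) -> List[Set[str]]:
        if self.pending is None:
            return [self.committed]             # Uniform config
        return [self.committed, self.pending]   # Joint config C_old_new

    def on_append_config(self, index: int, new_config: Set[str]) -> None:
        # The previous config entry must be committed before a new change.
        if self.pending is not None:
            raise RuntimeError("previous config change is still in progress")
        self.pending, self.pending_index = set(new_config), index

    def on_commit(self, commit_index: int) -> None:
        # entry-i committed under C_old_new: switch to the Uniform config C_new.
        if self.pending is not None and commit_index >= self.pending_index:
            self.committed, self.pending, self.pending_index = self.pending, None, None


state = LeaderConfigState(committed={"a", "b", "c"})
state.on_append_config(index=5, new_config={"x", "y", "z"})
print([sorted(c) for c in state.effective_config()])  # [['a', 'b', 'c'], ['x', 'y', 'z']]
state.on_commit(5)
print([sorted(c) for c in state.effective_config()])  # [['x', 'y', 'z']]
```

<p>Note that the switch in <code class="language-plaintext highlighter-rouge">on_commit</code> changes only in-memory state, which is the source of the practical problems discussed later in the article.</p>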

<h3 id="正确性证明">Correctness proof</h3>

<p>We need to show that this single-entry change does not break Raft's safety,
that is, it never produces two Leaders (split brain) in the current Term or in any later Term.</p>

<p>Let <code class="language-plaintext highlighter-rouge">t</code> be the term of the Leader that performs the config change (and proposes <code class="language-plaintext highlighter-rouge">entry-i</code>).
Consider a new Candidate that may be elected later, with term <code class="language-plaintext highlighter-rouge">u</code> and <code class="language-plaintext highlighter-rouge">u &gt; t</code>. We need to show that Leader
<code class="language-plaintext highlighter-rouge">t</code> and the Candidate <code class="language-plaintext highlighter-rouge">u</code> that starts an election cannot both be Leader at the same time.</p>

<p><strong>Case analysis: the election of the new Candidate <code class="language-plaintext highlighter-rouge">u</code></strong></p>

<p>The log of the new Candidate <code class="language-plaintext highlighter-rouge">u</code> either contains <code class="language-plaintext highlighter-rouge">entry-i</code> or it does not. We treat the two cases separately:</p>

<ul>
  <li>
    <p>If the log of Candidate <code class="language-plaintext highlighter-rouge">u</code> contains <code class="language-plaintext highlighter-rouge">entry-i</code>, its <code class="language-plaintext highlighter-rouge">effective-config</code> must contain <code class="language-plaintext highlighter-rouge">{x,y,z}</code>.</p>

    <p>The <code class="language-plaintext highlighter-rouge">effective-config</code> of the old Leader <code class="language-plaintext highlighter-rouge">t</code> is either <code class="language-plaintext highlighter-rouge">C_old_new</code>: <code class="language-plaintext highlighter-rouge">[{a,b,c}, {x,y,z}]</code> (config change in progress) or <code class="language-plaintext highlighter-rouge">C_new</code>: <code class="language-plaintext highlighter-rouge">{x,y,z}</code> (config change finished). In both cases it contains <code class="language-plaintext highlighter-rouge">{x,y,z}</code>. Candidate <code class="language-plaintext highlighter-rouge">u</code> must also collect votes from a Quorum of <code class="language-plaintext highlighter-rouge">C_new</code>: <code class="language-plaintext highlighter-rouge">{x,y,z}</code>, and any two Quorums of the same member set intersect. A node that has voted for <code class="language-plaintext highlighter-rouge">u</code> no longer accepts requests from the older term <code class="language-plaintext highlighter-rouge">t</code>, so <code class="language-plaintext highlighter-rouge">u</code> and <code class="language-plaintext highlighter-rouge">t</code> cannot both be Leader.</p>
  </li>
  <li>
    <p>If the log of Candidate <code class="language-plaintext highlighter-rouge">u</code> does not contain <code class="language-plaintext highlighter-rouge">entry-i</code>, its <code class="language-plaintext highlighter-rouge">effective-config</code> must contain <code class="language-plaintext highlighter-rouge">C_old</code>: <code class="language-plaintext highlighter-rouge">{a,b,c}</code>.
Here we distinguish two cases for the <code class="language-plaintext highlighter-rouge">effective-config</code> of Leader <code class="language-plaintext highlighter-rouge">t</code>:</p>

    <ul>
      <li>
        <p>If the <code class="language-plaintext highlighter-rouge">effective-config</code> of Leader <code class="language-plaintext highlighter-rouge">t</code> is <code class="language-plaintext highlighter-rouge">C_old_new</code>: <code class="language-plaintext highlighter-rouge">[{a,b,c}, {x,y,z}]</code>, then, since Candidate <code class="language-plaintext highlighter-rouge">u</code> needs votes from a Quorum of <code class="language-plaintext highlighter-rouge">C_old</code> and Leader <code class="language-plaintext highlighter-rouge">t</code> also needs a Quorum of <code class="language-plaintext highlighter-rouge">C_old</code>, the two Quorums intersect and <code class="language-plaintext highlighter-rouge">u</code> and <code class="language-plaintext highlighter-rouge">t</code> cannot both be Leader.</p>
      </li>
      <li>
        <p>If the <code class="language-plaintext highlighter-rouge">effective-config</code> of <code class="language-plaintext highlighter-rouge">t</code> is <code class="language-plaintext highlighter-rouge">C_new</code>: <code class="language-plaintext highlighter-rouge">{x,y,z}</code>, then <code class="language-plaintext highlighter-rouge">entry-i</code> has been committed on a Quorum of <code class="language-plaintext highlighter-rouge">C_old_new</code>, which means <code class="language-plaintext highlighter-rouge">entry-i</code> is present on every Node of some Quorum of <code class="language-plaintext highlighter-rouge">{a,b,c}</code>. The last log index of every Node in that Quorum is therefore at least the log index of <code class="language-plaintext highlighter-rouge">entry-i</code>: <code class="language-plaintext highlighter-rouge">i</code>.</p>

        <p>Candidate <code class="language-plaintext highlighter-rouge">u</code>, however, does not contain <code class="language-plaintext highlighter-rouge">entry-i</code>, so its last log index is smaller than <code class="language-plaintext highlighter-rouge">i</code>. During the election, the Nodes of that Quorum of <code class="language-plaintext highlighter-rouge">{a,b,c}</code> see that their own logs reach index <code class="language-plaintext highlighter-rouge">i</code> or beyond and refuse to vote for it, so <code class="language-plaintext highlighter-rouge">u</code> cannot collect a Quorum of <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and its election fails.</p>
      </li>
    </ul>
  </li>
</ul>

<p>So in every case, during the whole time Leader <code class="language-plaintext highlighter-rouge">t</code> carries out the config change, Candidate <code class="language-plaintext highlighter-rouge">u</code> cannot become Leader alongside <code class="language-plaintext highlighter-rouge">t</code>, and split brain is avoided.</p>
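<p>The quorum-intersection part of this argument can be checked by brute force for the example configs. The sketch below uses hypothetical helpers <code class="language-plaintext highlighter-rouge">majorities</code>, <code class="language-plaintext highlighter-rouge">quorums</code> and <code class="language-plaintext highlighter-rouge">always_intersect</code>, and models a config as a list of member sets. It covers only the cases that rely on intersection; the last case of the proof relies on the log comparison in the vote rule instead:</p>

```python
from itertools import combinations, product


def majorities(members):
    """Yield every subset of `members` that is a strict majority."""
    ordered = sorted(members)
    need = len(ordered) // 2 + 1
    for size in range(need, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield set(combo)


def quorums(config):
    """A quorum of a config is a union of one majority per member set."""
    for parts in product(*(list(majorities(m)) for m in config)):
        yield set().union(*parts)


def always_intersect(config_a, config_b) -> bool:
    """True if every quorum of config_a shares a node with every quorum of config_b."""
    return all(qa & qb for qa in quorums(config_a) for qb in quorums(config_b))


C_OLD, C_NEW = {"a", "b", "c"}, {"x", "y", "z"}
C_OLD_NEW = [C_OLD, C_NEW]

print(always_intersect([C_OLD], [C_NEW]))     # False: a direct switch is unsafe
print(always_intersect(C_OLD_NEW, [C_OLD]))   # True
print(always_intersect(C_OLD_NEW, [C_NEW]))   # True
print(always_intersect(C_OLD_NEW, C_OLD_NEW)) # True
```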

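<p>This shows that the algorithm, which needs only a single log entry for a config change, is correct.
It simplifies the config change and reduces the complexity of the state transitions.</p>

<p>However, although it is correct in theory, it introduces several problems in a real implementation:</p>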

<h2 id="第-1-个问题-持久层无法区分-joint-和-uniform">Problem 1: the persistent layer cannot tell Joint from Uniform</h2>

<p>The first problem is that when a node switches from the Joint state to the Uniform state (<code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>),
the change happens only in memory (by modifying effective-config) and nothing is written to the persistent layer. As a result:</p>

<p>Nodes in <code class="language-plaintext highlighter-rouge">C_old</code> can still start an election and compete with nodes in <code class="language-plaintext highlighter-rouge">C_new</code>, because the logs of <code class="language-plaintext highlighter-rouge">C_old</code> nodes are just as long as those of <code class="language-plaintext highlighter-rouge">C_new</code> nodes.
Even after the config change is complete, a node in <code class="language-plaintext highlighter-rouge">C_old</code> can still take Leadership away from a node in <code class="language-plaintext highlighter-rouge">C_new</code>.
The root cause is that the state change is not recorded in the persistent layer.</p>

<p>By contrast, standard Raft Joint Consensus writes a new Uniform config (<code class="language-plaintext highlighter-rouge">C_new</code>) log entry, which acts as a barrier: nodes in <code class="language-plaintext highlighter-rouge">C_old</code> are rejected in elections because their logs are not up to date.
The single-entry change has no such persistent record of the transition from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>. The state is therefore not fully expressed in storage, which can also be seen as a form of state regression.</p>

<p>In the figure below, for example, when the cluster switches from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>, no Node's log changes.
Suppose Leadership has been transferred to a node in <code class="language-plaintext highlighter-rouge">C_new</code>: <code class="language-plaintext highlighter-rouge">{x,y,z}</code>, say <code class="language-plaintext highlighter-rouge">x</code>.
Nodes in <code class="language-plaintext highlighter-rouge">C_old</code> can still start an election and take Leadership away from <code class="language-plaintext highlighter-rouge">x</code>.</p>

<p><img src="/post-res/single-log-joint-cn/3e6fb36ca8cf87de-single-3-elect.x.svg" alt="Figure 4: Patch-1, example of the persistence problem" /></p>

<p><strong>Patch-1</strong>: after entering the Uniform config, append a Noop log entry to the log under <code class="language-plaintext highlighter-rouge">C_new</code>, which blocks
elections started by nodes in <code class="language-plaintext highlighter-rouge">C_old</code>.</p>
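<p>Why the Noop entry helps can be seen from Raft's vote rule, which compares the last log term and index of the candidate and the voter. The sketch below is a simplification: it ignores term checks and the one-vote-per-term rule, and the <code class="language-plaintext highlighter-rouge">(term, index)</code> values and helper names are made up for the illustration:</p>

```python
def grants_vote(candidate_last, voter_last) -> bool:
    """Raft vote rule on logs: the candidate's (last_term, last_index)
    must be at least as up to date as the voter's."""
    return candidate_last >= voter_last


def votes_for(candidate_last, voters: dict) -> set:
    """Names of the voters whose log check lets them vote for the candidate."""
    return {name for name, last in voters.items() if grants_vote(candidate_last, last)}


ENTRY_I = (1, 3)  # hypothetical (term, index) of the config entry
NOOP = (1, 4)     # hypothetical (term, index) of the Noop appended by Patch-1

candidate_a = ENTRY_I  # node a in C_old holds entry-i and nothing newer

# Before Patch-1: nodes in C_new have exactly the same log as a.
c_new_before = {"x": ENTRY_I, "y": ENTRY_I, "z": ENTRY_I}
# After Patch-1: the Noop has reached a majority of C_new.
c_new_after = {"x": NOOP, "y": NOOP, "z": ENTRY_I}

print(sorted(votes_for(candidate_a, c_new_before)))  # ['x', 'y', 'z']
print(sorted(votes_for(candidate_a, c_new_after)))   # ['z']
```

<p>Once the Noop is on a majority of <code class="language-plaintext highlighter-rouge">{x,y,z}</code>, a candidate from <code class="language-plaintext highlighter-rouge">C_old</code> that lacks it can no longer gather a majority of <code class="language-plaintext highlighter-rouge">C_new</code>.</p>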

<h2 id="第-2-个问题-新启动-node-必须先进入-joint-config">Problem 2: a newly started Node must first enter the Joint config</h2>

<p>The single-entry config change is concise,
but it brings a thorny problem: a newly started (restarted) Node cannot tell whether the cluster is currently in the
Joint state or in the Uniform config phase:</p>

<ul>
  <li>
    <p>After starting, the Node sees a series of config log entries in its persisted log.
Suppose the last two config log entries it sees (in log order) are <code class="language-plaintext highlighter-rouge">entry-i</code> (containing config
<code class="language-plaintext highlighter-rouge">C_old</code>) and <code class="language-plaintext highlighter-rouge">entry-j</code> (containing config <code class="language-plaintext highlighter-rouge">C_new</code>), with <code class="language-plaintext highlighter-rouge">i &lt; j</code>.</p>
  </li>
  <li>
    <p>By the rules of Raft, <code class="language-plaintext highlighter-rouge">entry-i</code> is certainly committed (because a Leader must wait for the previous config
to be committed before proposing the next one).</p>
  </li>
  <li>
    <p>But for the latest entry, <code class="language-plaintext highlighter-rouge">entry-j</code>, the
Node <strong>cannot decide from local information alone whether it has been committed by the cluster</strong>:</p>

    <ul>
      <li>If <code class="language-plaintext highlighter-rouge">entry-j</code> is not committed, the cluster's <code class="language-plaintext highlighter-rouge">effective-config</code> should be the Joint config <code class="language-plaintext highlighter-rouge">C_old_new</code>;</li>
      <li>If <code class="language-plaintext highlighter-rouge">entry-j</code> is committed, the <code class="language-plaintext highlighter-rouge">effective-config</code> should be the Uniform config <code class="language-plaintext highlighter-rouge">C_new</code>.</li>
    </ul>
  </li>
</ul>

<p>This is the second problem of a single-log-entry config change: <strong>a newly started Node cannot tell whether the cluster is in the Joint state or in the Uniform config phase</strong>.</p>

<p><img src="/post-res/single-log-joint-cn/004bf3df9c4de529-restart.x.svg" alt="Figure 5: Patch-2, state of a newly restarted Node" /></p>

<p>In the figure above, for example, even though <code class="language-plaintext highlighter-rouge">entry-3</code> has been committed, the newly started Nodes b, c, x, y cannot tell whether the cluster is in the Joint state or in the Uniform config phase (the other two Nodes, a and z, have not received the config change entry and still stay on the old config <code class="language-plaintext highlighter-rouge">{a,b,c}</code>).</p>

<p>Once the problem is understood, the fix is straightforward. <strong>Patch-2</strong>: go back to the Joint state after a restart:</p>

<ol>
  <li>After starting, a Node must set its initial <code class="language-plaintext highlighter-rouge">effective-config</code> to the Joint config <code class="language-plaintext highlighter-rouge">C_old_new</code> formed by the last two config log entries, and take part in elections with this Joint config.</li>
  <li>Only after it has confirmed that the last config log entry (the one containing <code class="language-plaintext highlighter-rouge">C_new</code>) is committed on <code class="language-plaintext highlighter-rouge">C_old_new</code> may the Node safely switch its <code class="language-plaintext highlighter-rouge">effective-config</code> to the Uniform config <code class="language-plaintext highlighter-rouge">C_new</code>.</li>
</ol>

<p><strong>Example</strong>: when a newly started Node sees that its last two configs are <code class="language-plaintext highlighter-rouge">{a,b,c}</code> and <code class="language-plaintext highlighter-rouge">{u,v,w}</code>, it must first use the Joint config <code class="language-plaintext highlighter-rouge">[{a,b,c}, {u,v,w}]</code> as its <code class="language-plaintext highlighter-rouge">effective-config</code>. It needs votes from a Quorum of each of the two configs to be elected Leader, and only after confirming that the new config is committed on <code class="language-plaintext highlighter-rouge">[{a,b,c}, {u,v,w}]</code> can it switch to the Uniform config that uses only <code class="language-plaintext highlighter-rouge">{u,v,w}</code>.</p>

<p>But this fix introduces yet another problem:</p>

<h2 id="第-3-个问题-依赖旧-config-可能无法完成选举">Problem 3: depending on the old config may make elections impossible</h2>

<p>Patch-2 (a newly started Node defaults to the Joint config) resolves the uncertainty about the Node's state,
but it creates a new problem:
<strong>a Node may try to contact old Nodes that have been removed and no longer exist, and therefore cannot complete an election</strong>.</p>

<p><strong>Example</strong>:</p>

<p><img src="/post-res/single-log-joint-cn/0507172bcfea4f35-restart-after-uniform.x.svg" alt="Figure 6: falling back to C-old-new after a restart" /></p>

<ol>
  <li>
    <p>The cluster config changes from <code class="language-plaintext highlighter-rouge">C_old</code> (<code class="language-plaintext highlighter-rouge">{a,b,c}</code>) to <code class="language-plaintext highlighter-rouge">C_new</code> (<code class="language-plaintext highlighter-rouge">{x,y,z}</code>).</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">entry-j</code>, which contains <code class="language-plaintext highlighter-rouge">C_new</code>, has been committed on <code class="language-plaintext highlighter-rouge">C_old_new</code>.</p>
  </li>
  <li>
    <p>The cluster now runs under <code class="language-plaintext highlighter-rouge">C_new</code> (<code class="language-plaintext highlighter-rouge">{x,y,z}</code>).</p>
  </li>
  <li>
    <p>By the definition of <code class="language-plaintext highlighter-rouge">C_new</code>, Nodes <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code> are no longer cluster members.
They may have been shut down, had their data wiped, and been removed from the cluster for good.</p>
  </li>
  <li>
    <p>Later, something causes all Nodes of the cluster to restart.</p>
  </li>
  <li>
    <p>After restarting, Node <code class="language-plaintext highlighter-rouge">x</code> follows <strong>Patch-2</strong>: it inspects its log, finds that the last two configs are <code class="language-plaintext highlighter-rouge">C_old</code>
(<code class="language-plaintext highlighter-rouge">{a,b,c}</code>) and <code class="language-plaintext highlighter-rouge">C_new</code> (<code class="language-plaintext highlighter-rouge">{x,y,z}</code>),
and sets its <code class="language-plaintext highlighter-rouge">effective-config</code> to the Joint config (<code class="language-plaintext highlighter-rouge">[{a,b,c}, {x,y,z}]</code>).</p>
  </li>
  <li>
    <p>Node <code class="language-plaintext highlighter-rouge">x</code> decides to start an election.
But Nodes <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code> no longer exist! Node <code class="language-plaintext highlighter-rouge">x</code> therefore cannot collect the votes required by the Joint config <code class="language-plaintext highlighter-rouge">C_old_new</code>.
The election fails, and the cluster may never be able to elect a Leader and recover.</p>
  </li>
</ol>

<p>Because nothing is persisted when switching from <code class="language-plaintext highlighter-rouge">C_old_new</code> to <code class="language-plaintext highlighter-rouge">C_new</code>,
this problem can be seen as the result of a "state regression" (<code class="language-plaintext highlighter-rouge">C_new</code> -&gt; <code class="language-plaintext highlighter-rouge">C_old_new</code>) that the system undergoes after a restart.</p>

<p><strong>Mitigation</strong>:
an external cluster management tool can delay or coordinate the shutdown and cleanup of old Nodes,
so that the cluster has run stably under the new config for a while before they are removed for good. This only mitigates the problem:
it does not eliminate the risk that a restarted Node still tries to contact old Nodes at some moment.</p>

<p>A real solution has to come from inside Raft:</p>

<h2 id="添加-barrier-防止用旧-config-选举">Adding a barrier to prevent elections with the old config</h2>

<p>To solve the problem introduced by Patch-2 (trying to contact removed old Nodes during an election), we need a mechanism
that lets a restarted Node <strong>know for certain</strong> whether the Joint state has <strong>safely ended</strong>,
so that it does not run elections with a Joint config that still contains the old config.</p>

<p><strong>Patch-3: introduce a Barrier entry</strong></p>

<p>The solution: after the config log entry <code class="language-plaintext highlighter-rouge">entry-j</code> (<code class="language-plaintext highlighter-rouge">C_new</code>) has been committed on <code class="language-plaintext highlighter-rouge">C_old_new</code>,
commit one more <strong>log entry of a special type</strong>, which we call a <strong>Barrier entry</strong>. It states that <code class="language-plaintext highlighter-rouge">entry-j</code> has been committed.</p>

<blockquote>
  <p><strong>Note</strong>: the Barrier entry must be appended to the log only after <code class="language-plaintext highlighter-rouge">entry-j</code> is committed; otherwise it cannot serve as evidence that <code class="language-plaintext highlighter-rouge">entry-j</code> has been committed.</p>
</blockquote>

<p>A newly started Node that sees this Barrier entry then knows that the earlier Joint config <code class="language-plaintext highlighter-rouge">C_old_new</code> state has safely ended. It can use the new Uniform config <code class="language-plaintext highlighter-rouge">C_new</code> directly for elections and other operations, without trying to contact old cluster members that may no longer exist.</p>

<p>In the figure below, for example, after <code class="language-plaintext highlighter-rouge">entry-3</code> is committed under <code class="language-plaintext highlighter-rouge">C_old_new</code>,
a Barrier <code class="language-plaintext highlighter-rouge">entry-4</code> is added to state that <code class="language-plaintext highlighter-rouge">entry-3</code> has been committed:</p>

<p><img src="/post-res/single-log-joint-cn/a1b3d0adfde6319d-barrier.x.svg" alt="Figure 7: Patch-3, introducing the Barrier entry" /></p>

<p>With this in place, no state regression occurs even if all nodes restart again. When Nodes x and y see the Barrier entry, they use <code class="language-plaintext highlighter-rouge">C_new</code> <code class="language-plaintext highlighter-rouge">{x,y,z}</code> directly for elections,
and the <code class="language-plaintext highlighter-rouge">C_new</code> state they have entered does not regress.
Even if Nodes b and c are offline, Node x or y can still be elected Leader:</p>

<p><img src="/post-res/single-log-joint-cn/431e950cac307092-barrier-restart.x.svg" alt="Figure 8: restart with a Barrier entry" /></p>
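<p>The restart rule with and without a Barrier entry can be sketched as follows. The log is modeled as a list of hypothetical <code class="language-plaintext highlighter-rouge">(kind, members)</code> tuples, and <code class="language-plaintext highlighter-rouge">effective_config_on_restart</code> and <code class="language-plaintext highlighter-rouge">is_quorum</code> are illustrative names, not part of any real Raft library:</p>

```python
def is_quorum(alive: set, effective_config: list) -> bool:
    """Every member set of the config must have a majority among `alive`."""
    return all(len(alive & m) * 2 > len(m) for m in effective_config)


def effective_config_on_restart(log: list) -> list:
    """log: list of (kind, members); kind is 'config', 'barrier' or 'normal'."""
    config_positions = [i for i, (kind, _) in enumerate(log) if kind == "config"]
    last = config_positions[-1]
    c_new = log[last][1]
    has_barrier = any(kind == "barrier" for kind, _ in log[last + 1:])
    if has_barrier or len(config_positions) == 1:
        return [c_new]          # Joint state has safely ended: Uniform config
    c_old = log[config_positions[-2]][1]
    return [c_old, c_new]       # Patch-2 fallback: Joint config C_old_new


C_OLD, C_NEW = {"a", "b", "c"}, {"x", "y", "z"}
log_without_barrier = [("config", C_OLD), ("normal", None), ("config", C_NEW)]
log_with_barrier = log_without_barrier + [("barrier", None)]

alive = {"x", "y", "z"}  # a, b, c have been removed from the cluster
print(is_quorum(alive, effective_config_on_restart(log_without_barrier)))  # False
print(is_quorum(alive, effective_config_on_restart(log_with_barrier)))     # True
```

<p>Without the Barrier entry the restarted nodes fall back to the Joint config and cannot gather a quorum of the removed members; with it they elect a Leader using <code class="language-plaintext highlighter-rouge">C_new</code> alone.</p>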

<blockquote>
  <p><strong>Plan-B</strong>: 如果不用 log entry 做屏障, 另一个选择是 <strong>记录 commit-index</strong>,
这个方案最初由 <a href="https://weibo.com/u/1516609505">马健将</a> 提出.</p>

  <p>这个方案中, 要求 Joint Consensus 结束的标准为：commit-index 同步到一个 <code class="language-plaintext highlighter-rouge">C_new</code> 的 quorum,
并且要求 Raft 的实现 <strong>持久化 commit-index 的值</strong>.</p>

  <p>虽然标准 Raft 是不需要记录 commit-index 的值的, 但是记录 commit-index 几个题外话的好处,
在这里的好处就是可以当做一个状态存储的补充, 来区分 Joint 和 Uniform 状态.</p>

  <p>因为 commit-index 记录了哪些 log entry 已经被提交, 如果 commit-index 包含 config change entry 的 index,
就可以知道 <code class="language-plaintext highlighter-rouge">C_old_new</code> 已经提交, 可以直接进入 <code class="language-plaintext highlighter-rouge">C_new</code> 状态, 从而避免出现向已清理节点请求选票的问题;</p>

  <p>但是这个方案仍会出现 <code class="language-plaintext highlighter-rouge">C_old_new</code> 与 <code class="language-plaintext highlighter-rouge">C_new</code> 争夺 Leadership 的问题(问题-1),
因为 <code class="language-plaintext highlighter-rouge">C_new</code> 没有更多的 log, 而且 commit-index 只在 <code class="language-plaintext highlighter-rouge">C_new</code> 上提交无法保证一定会同步给 <code class="language-plaintext highlighter-rouge">C_old</code> 中的节点.
这是分布式中消息到底是<strong>投递至少一次</strong>还是<strong>投递至多一次</strong>的选择的问题：</p>

  <ul>
    <li>如果选择至少一次, 即要求 commit-index 在 <code class="language-plaintext highlighter-rouge">C_old_new</code> 上 commit, 则可能出现 commit-index commit 成功了, <code class="language-plaintext highlighter-rouge">C_old</code> 节点被清理了, 然后再次尝试 commit commit-index 时无法完成的情况.</li>
    <li>所以只能选择至多一次, 即 commit-index 只在 <code class="language-plaintext highlighter-rouge">C_new</code> 上 commit, 在 <code class="language-plaintext highlighter-rouge">C_old</code> 上尽量投递, 但这会导致 <code class="language-plaintext highlighter-rouge">C_old</code> 中节点可能没有收到 commit-index 更新的情况, 使它们不知道集群已经变化, 继续发起选举.</li>
  </ul>

  <p>So there is still a chance that <code class="language-plaintext highlighter-rouge">C_old_new</code> and <code class="language-plaintext highlighter-rouge">C_new</code> compete for Leadership (Problem-1).</p>
</blockquote>
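
<p>As a rough sketch of the Plan-B decision described above, the restart logic could look like the following. The function name <code class="language-plaintext highlighter-rouge">config_from_commit_index</code> and its parameters are hypothetical and only illustrate the idea; they do not come from any real Raft implementation.</p>

```python
from typing import FrozenSet, Tuple

Config = FrozenSet[str]


def config_from_commit_index(
    commit_index: int, j: int, c_old: Config, c_new: Config
) -> Tuple[Config, ...]:
    """Pick the effective config on restart from a persisted commit-index.

    `j` is the index of the config change entry (hypothetical parameter).
    The result lists the member sets whose quorums are required.
    """
    if commit_index >= j:
        # entry-j is known to be committed: go straight to C_new.
        return (c_new,)
    # Otherwise the node must still assume the Joint config C_old_new.
    return (c_old, c_new)


C_OLD = frozenset({"a", "b", "c"})
C_NEW = frozenset({"x", "y", "z"})

# Node x persisted a commit-index that covers entry-j (j = 5).
node_x = config_from_commit_index(5, 5, C_OLD, C_NEW)
# Node a never received the update, so its commit-index is still 4.
node_a = config_from_commit_index(4, 5, C_OLD, C_NEW)
```

<p>In this sketch Node a still derives the Joint config and may keep campaigning, which is the Problem-1 scenario that at-most-once delivery cannot rule out.</p>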

<p>Finally, the <strong>revised Single log config change</strong> procedure is as follows:</p>

<ol>
  <li>
    <p>The initial state is the same as above: <code class="language-plaintext highlighter-rouge">effective-config</code> is <code class="language-plaintext highlighter-rouge">C_old</code> (<code class="language-plaintext highlighter-rouge">{a,b,c}</code>).</p>
  </li>
  <li>
    <p>The Leader proposes a config log entry <code class="language-plaintext highlighter-rouge">entry-j</code> containing <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code>, and immediately switches <code class="language-plaintext highlighter-rouge">effective-config</code> to the Joint config <code class="language-plaintext highlighter-rouge">C_old_new = [{a,b,c}, {x,y,z}]</code>.</p>
  </li>
  <li>
    <p>All log entries with <code class="language-plaintext highlighter-rouge">index &gt;= j</code> are replicated and committed under <code class="language-plaintext highlighter-rouge">C_old_new</code>.</p>
  </li>
  <li>
    <p><strong>Key step</strong>: as soon as <code class="language-plaintext highlighter-rouge">entry-j</code> is <strong>successfully committed</strong> under <code class="language-plaintext highlighter-rouge">C_old_new</code>, the Leader proposes a special <strong>Barrier entry</strong>. The Barrier entry carries no config information; it exists only to mark “the safe end point of the Joint state”. When it proposes the Barrier entry, the Leader can immediately switch <code class="language-plaintext highlighter-rouge">effective-config</code> to <code class="language-plaintext highlighter-rouge">C_new = {x,y,z}</code> and use this new config to replicate and commit the Barrier entry.</p>
  </li>
  <li>
    <p><strong>Once the Barrier entry is committed</strong>, the config change is complete.</p>
  </li>
</ol>
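
<p>The quorum rule behind the steps above can be sketched as follows. This is a minimal illustration, assuming that a Joint config requires a majority in each member set; <code class="language-plaintext highlighter-rouge">has_quorum</code> and the constants are hypothetical names introduced only for this example.</p>

```python
from typing import FrozenSet, Iterable, Set


def has_quorum(acked: Set[str], configs: Iterable[FrozenSet[str]]) -> bool:
    """Return True if `acked` holds a strict majority of every member set."""
    return all(len(acked & members) * 2 > len(members) for members in configs)


C_OLD = frozenset({"a", "b", "c"})
C_NEW = frozenset({"x", "y", "z"})
JOINT = (C_OLD, C_NEW)   # effective-config while entry-j is being committed
UNIFORM_NEW = (C_NEW,)   # effective-config used for the Barrier entry

# entry-j needs a majority of both {a,b,c} and {x,y,z}.
entry_j_committed = has_quorum({"a", "b", "x", "y"}, JOINT)
# Acks from the new nodes alone are not enough for entry-j.
entry_j_new_only = has_quorum({"x", "y", "z"}, JOINT)
# The Barrier entry only needs a majority of {x,y,z}.
barrier_committed = has_quorum({"x", "y"}, UNIFORM_NEW)
```

<p>Under these assumptions, <code class="language-plaintext highlighter-rouge">entry-j</code> cannot commit without a majority of <code class="language-plaintext highlighter-rouge">C_old</code>, while the Barrier entry can commit even after the old nodes are gone.</p>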

<p>Correspondingly, this is the <strong>behavior of a Node when it starts up</strong>:</p>

<p>A newly started Node inspects its log. It sees that the last two config entries are <code class="language-plaintext highlighter-rouge">entry-i</code> (<code class="language-plaintext highlighter-rouge">C_old</code>) and <code class="language-plaintext highlighter-rouge">entry-j</code> (<code class="language-plaintext highlighter-rouge">C_new</code>). It <strong>first checks</strong> whether a <strong>Barrier entry</strong> exists after <code class="language-plaintext highlighter-rouge">entry-j</code>:</p>

<ul>
  <li>
    <p><strong>If a Barrier entry exists</strong>: the Joint state <code class="language-plaintext highlighter-rouge">C_old_new</code> has ended safely. The Node should set
<code class="language-plaintext highlighter-rouge">effective-config</code> directly to <code class="language-plaintext highlighter-rouge">C_new</code>. It does not need to contact old Nodes that may already have been removed.</p>
  </li>
  <li>
    <p><strong>If no Barrier entry exists</strong>:
the Joint state may still be in progress. The Node
should set <code class="language-plaintext highlighter-rouge">effective-config</code> to the Joint config
<code class="language-plaintext highlighter-rouge">C_old_new</code>.</p>
  </li>
</ul>
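
<p>The startup check described above can be sketched like this. It is a simplified model rather than a real implementation: the <code class="language-plaintext highlighter-rouge">Entry</code> type, its <code class="language-plaintext highlighter-rouge">kind</code> tags and the <code class="language-plaintext highlighter-rouge">effective_config</code> function are hypothetical names chosen for illustration.</p>

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Config = FrozenSet[str]


@dataclass(frozen=True)
class Entry:
    kind: str                        # "normal", "config" or "barrier" (hypothetical tags)
    config: Optional[Config] = None  # only set for "config" entries


def effective_config(log: List[Entry]) -> Tuple[Config, ...]:
    """Return the member sets a restarting Node must use for quorums."""
    config_idx = [i for i, e in enumerate(log) if e.kind == "config"]
    if not config_idx:
        raise ValueError("log must contain at least one config entry")
    if len(config_idx) == 1:
        # Only one config has ever been proposed: plain Uniform config.
        return (log[config_idx[0]].config,)
    i, j = config_idx[-2], config_idx[-1]
    c_old, c_new = log[i].config, log[j].config
    # Look for a Barrier entry anywhere after entry-j.
    if any(e.kind == "barrier" for e in log[j + 1:]):
        return (c_new,)       # Joint state has ended safely
    return (c_old, c_new)     # Joint state may still be in progress


C_OLD = frozenset({"a", "b", "c"})
C_NEW = frozenset({"x", "y", "z"})
log = [Entry("config", C_OLD), Entry("normal"), Entry("config", C_NEW)]

without_barrier = effective_config(log)
with_barrier = effective_config(log + [Entry("barrier")])
```

<p>Here <code class="language-plaintext highlighter-rouge">without_barrier</code> is the Joint config and <code class="language-plaintext highlighter-rouge">with_barrier</code> contains only <code class="language-plaintext highlighter-rouge">C_new</code>, matching the two cases in the list.</p>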

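<p>However, the attentive reader will have noticed by now that Patch-3 introduces a second log entry. The change is no longer completed with strictly “a single log entry”; it needs “one config entry + one Barrier entry”.</p>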

<h1 id="总结">Summary</h1>

<p>A config change necessarily takes the system through 3 phases: <code class="language-plaintext highlighter-rouge">C_old -&gt; C_old_new -&gt; C_new</code>.
With only one log entry, the persistence layer can hold just 1 bit of information, that is, at most 2 states: <code class="language-plaintext highlighter-rouge">C_old</code> or <code class="language-plaintext highlighter-rouge">C_new</code>.
This is the root of the problem.
A general and safe config change algorithm needs at least 2 log entries, which provide 2 bits of information and can express up to 4 states, enough to cover the 3 states the system requires.</p>
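
<p>The counting argument can be checked with a small sketch. It assumes each log entry contributes one “present or absent” bit; the <code class="language-plaintext highlighter-rouge">phase</code> function is a hypothetical mapping used only to illustrate the point.</p>

```python
from itertools import product


def phase(has_config_entry: bool, has_barrier: bool) -> str:
    """Map the two persisted facts to a config-change phase."""
    if not has_config_entry:
        # A Barrier cannot precede entry-j, so it is ignored in this case.
        return "C_old"
    return "C_new" if has_barrier else "C_old_new"


# One entry: a single present/absent bit gives only 2 distinguishable log states.
one_entry_states = list(product([False, True], repeat=1))

# Two entries: 2 bits give 4 combinations, which cover the 3 required phases.
two_entry_phases = {phase(c, b) for c, b in product([False, True], repeat=2)}
```

<p>With one entry there are 2 states for 3 phases; with two entries the 4 combinations map onto all 3 phases, with one combination left unused.</p>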

<p>So it is better to simply go back to Joint Consensus; it is the simpler design.</p>

<h2 id="参考资料">References</h2>

<ul>
  <li>Diego Ongaro &amp; John Ousterhout. In Search of an Understandable Consensus Algorithm (the original Raft paper): https://raft.github.io/raft.pdf</li>
  <li>OpenRaft (Rust): https://github.com/databendlabs/openraft</li>
  <li>etcd/raft source code: https://github.com/etcd-io/raft</li>
  <li>HashiCorp Raft implementation: https://github.com/hashicorp/raft</li>
</ul>

<p>Reference:</p>

<ul>
  <li>马健将 : <a href="https://weibo.com/u/1516609505">https://weibo.com/u/1516609505</a></li>
</ul>]]></content><author><name>Zhang Yanpo (drdr.xp)</name></author><category term="algo" /><category term="distributed" /><category term="分布式" /><category term="raft" /><category term="config-change" /><category term="joint" /><summary type="html"><![CDATA[A way to implement Raft config change with a single log entry: is it simpler than standard Joint Consensus?]]></summary></entry></feed>