Understanding Cassandra Read Repair vs Anti-Entropy Repair: A Compaction-Aware Runbook

This runbook is for the exact moment when you have decided that one table needs its consistency reconciled and you must choose how — synchronously on the query path with read repair, or on a schedule with anti-entropy repair — without spiking p99 latency or flooding the compaction queue. It assumes Cassandra 4.x or 5.x (where the probabilistic read_repair_chance knobs are gone and incremental repair is the default), a deployment where every node reports Up/Normal, and an operator with nodetool access plus a Python 3.10+ automation host. It sits beneath the read repair vs anti-entropy repair comparison, which explains why the two paths differ; this page is the runnable how. If you are still deciding which compaction strategy will absorb the SSTables repair generates, read the STCS vs LCS vs TWCS trade-offs first, because that choice changes every throttle value below.

Treat repair as a compaction-aware process, not an isolated maintenance task. Every reconciled token range produces fresh SSTables that immediately enter the compaction queue, so an unthrottled repair on a busy table is indistinguishable from a compaction incident.

Read Repair: The Query-Path Configuration

Read repair fires only when a client query at a consistency level above ONE touches divergent replicas. Under the default read_repair = 'BLOCKING', a digest mismatch triggers synchronous reconciliation before the coordinator returns the result. That is correct, but it borrows coordinator CPU from application traffic, and the repair mutations it sends to stale replicas are ordinary writes that eventually flush into new SSTables and feed compaction. The only other valid value is 'NONE', which hands all consistency responsibility to background anti-entropy and tunable consistency levels (QUORUM, LOCAL_QUORUM).
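To make the blocking path concrete, here is a toy model of a digest mismatch. It is a sketch under stated assumptions, not Cassandra's coordinator code: the names Cell, digest, and blocking_read are hypothetical, SHA-256 stands in for whatever digest the server uses, and reconciliation is reduced to last-write-wins on a single cell.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One replica's copy of a value plus its write timestamp (hypothetical model)."""
    value: str
    write_ts: int


def digest(cell: Cell) -> str:
    """Summarise a replica's answer as a hash, as a digest read would."""
    return hashlib.sha256(f"{cell.value}:{cell.write_ts}".encode()).hexdigest()


def blocking_read(replicas: dict[str, Cell]) -> tuple[Cell, list[str]]:
    """Return the reconciled cell and the replicas that need a repair mutation."""
    digests = {digest(cell) for cell in replicas.values()}
    if len(digests) == 1:
        # All replicas agree: no read repair and no extra writes.
        return next(iter(replicas.values())), []
    # Mismatch: last-write-wins picks the newest cell; stale replicas get it written back.
    winner = max(replicas.values(), key=lambda c: c.write_ts)
    stale = sorted(name for name, cell in replicas.items() if cell != winner)
    return winner, stale


replicas = {
    "node1": Cell("alice@new.example", 200),
    "node2": Cell("alice@old.example", 100),
    "node3": Cell("alice@new.example", 200),
}
winner, stale = blocking_read(replicas)
print(winner.value, stale)  # alice@new.example ['node2']
```

The list of stale replicas is the part that matters for this runbook: each entry is an extra write the query path pays for before the client gets its answer.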

-- Inspect the current setting before changing anything (run in cqlsh).
DESCRIBE TABLE my_keyspace.my_table;

-- Disable synchronous read-path reconciliation on a latency-sensitive table.
ALTER TABLE my_keyspace.my_table WITH read_repair = 'NONE';

Safety check: Confirm the current value on active OLTP tables with DESCRIBE TABLE before altering; never toggle blind.
Expected output: ALTER TABLE returns nothing on success. Re-run DESCRIBE TABLE and confirm read_repair = 'NONE'.
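Rollback path: If stale or non-monotonic reads appear after the change, revert with ALTER TABLE my_keyspace.my_table WITH read_repair = 'BLOCKING'; and watch read latency and coordinator thread-pool saturation for 15 minutes, because the revert puts reconciliation work back on the query path.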

Setting 'NONE' is only safe if you guarantee anti-entropy runs often enough to bound drift — and that cadence must stay shorter than gc_grace_seconds to avoid deleted-data resurrection, a coupling covered in tombstone management and garbage collection.
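The cadence rule can be checked mechanically before a schedule is committed. The sketch below is one hedged way to do it: cadence_is_safe is a hypothetical helper, the 864000-second value is Cassandra's default gc_grace_seconds (substitute the table's real setting), and adding the worst-case repair duration to the interval is a conservative assumption rather than a rule from this runbook.

```python
from datetime import timedelta

# Cassandra's default gc_grace_seconds is 864000 (10 days); read the real value per table.
DEFAULT_GC_GRACE = timedelta(seconds=864_000)


def cadence_is_safe(repair_interval: timedelta,
                    worst_case_repair_duration: timedelta,
                    gc_grace: timedelta = DEFAULT_GC_GRACE) -> bool:
    """True when a full repair cycle is expected to finish inside gc_grace_seconds.

    Assumption: the gap that matters runs from the start of one repair to the
    end of the next, so interval and duration are added together.
    """
    return repair_interval + worst_case_repair_duration < gc_grace


print(cadence_is_safe(timedelta(days=7), timedelta(days=1)))   # True
print(cadence_is_safe(timedelta(days=10), timedelta(days=1)))  # False
```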

Pre-Conditions / Safety Gates

Before any anti-entropy run, every check below must pass. Repair reconstructs Merkle trees over token ranges and streams the diffs, so a deployment that is already behind on compaction or split on schema will only get worse.

# 1. Every node must be Up/Normal (UN). Any DN/UJ aborts the run.
nodetool status
# Expected: leading "UN" on every row.

# 2. Exactly one schema version — a split means gossip has not converged.
nodetool describecluster
# Expected: a single hash under "Schema versions".

# 3. Compaction backlog must be below the safety threshold.
nodetool compactionstats -H
# Expected: "pending tasks: N" where N <= 8 (the orchestrator below defers only when N > 8).

# 4. No stream should already be in flight.
nodetool netstats | grep -E "Receiving|Sending" || echo "streams idle"
# Expected: "streams idle".

If pending compactions already exceed the threshold, resolve that first — the compaction backlog analysis and alerting workflow explains how to drain a queue without downtime. On Cassandra 4.x and 5.x use nodetool tablestats for per-table SSTable counts; the legacy cfstats alias is deprecated and should not appear in new automation.
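The first gate (every node Up/Normal) is easy to automate by parsing the two-letter state code that leads each node row of nodetool status. The following is a minimal sketch: non_normal_nodes is a hypothetical helper and SAMPLE is illustrative text shaped like status output, not a capture from a real cluster.

```python
import re

# Node rows start with a status letter (U/D) and a state letter (N/L/J/M).
STATE_RE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def non_normal_nodes(status_output: str) -> list[tuple[str, str]]:
    """Return (state, address) for every node row that is not UN."""
    bad = []
    for line in status_output.splitlines():
        match = STATE_RE.match(line.strip())
        if match and match.group(1) != "UN":
            bad.append((match.group(1), match.group(2)))
    return bad


# Illustrative sample only; column widths and host IDs are made up.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1   1.2 GiB   16      33.3%  aaaa     rack1
DN  10.0.0.2   1.1 GiB   16      33.3%  bbbb     rack1
UJ  10.0.0.3   200 MiB   16      ?      cccc     rack1
"""

print(non_normal_nodes(SAMPLE))  # [('DN', '10.0.0.2'), ('UJ', '10.0.0.3')]
```

An empty list means the gate passes; anything else should abort the run before repair is invoked.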

Implementation: A Gated, Compaction-Aware Orchestrator

The orchestrator below runs an incremental primary-range repair (-pr) only after the pre-flight gate passes, then polls the compaction backlog mid-flight and aborts if repair-generated SSTables overwhelm the queue. It is idempotent by design: it exits cleanly on a breached gate rather than forcing a repair, so a scheduler can retry it on the next window without side effects.

#!/usr/bin/env python3
# requirements: Python 3.10+ standard library only (no third-party packages).
"""Gated, compaction-aware anti-entropy repair for Cassandra 4.x/5.x."""
import logging
import signal
import subprocess
import sys
import time
from types import FrameType

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

MAX_PENDING_COMPACTIONS: int = 8      # Gate: defer repair above this backlog.
REPAIR_TIMEOUT_SEC: int = 3600        # Hard ceiling on a single primary-range run.
KEYSPACE: str = "production_data"


def run_nodetool(cmd: list[str], timeout: int = 30) -> str:
    """Execute a nodetool command with strict error handling."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, check=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as exc:
        logging.error("nodetool failed: %s. Stderr: %s", cmd, exc.stderr)
        raise
    except subprocess.TimeoutExpired:
        logging.error("nodetool timed out: %s", cmd)
        raise


def get_pending_compactions() -> int:
    """Parse `compactionstats` for the current pending-task count."""
    output = run_nodetool(["nodetool", "compactionstats", "-H"])
    for line in output.splitlines():
        # compactionstats prints e.g. "pending tasks: 2".
        if line.strip().lower().startswith("pending tasks:"):
            try:
                return int(line.split(":", 1)[1].strip())
            except ValueError:
                continue
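    # Fail closed: unparseable output must never silently pass the gate.
    raise RuntimeError("could not find a 'pending tasks' line in compactionstats output")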


def execute_repair() -> None:
    """Run repair behind pre-flight, mid-flight, and post-flight guardrails."""
    logging.info("Pre-flight: checking compaction backlog...")
    pending = get_pending_compactions()
    logging.info("Current pending compactions: %d", pending)

    # Guard clause: refuse to add repair SSTables to an already-saturated queue.
    if pending > MAX_PENDING_COMPACTIONS:
        logging.warning("Backlog exceeds threshold. Deferring repair; safe to retry next window.")
        sys.exit(1)

    cmd = ["nodetool", "repair", "-pr", KEYSPACE]
    logging.info("Executing: %s", " ".join(cmd))
    # Inherit stdout/stderr instead of using PIPE: an unread pipe can fill during a
    # long repair and block nodetool; the scheduler's own log captures the output.
    proc = subprocess.Popen(cmd)
    start = time.monotonic()

    while proc.poll() is None:
        if time.monotonic() - start > REPAIR_TIMEOUT_SEC:
            logging.error("Repair timeout reached. Aborting.")
            proc.kill()
            # Killing nodetool stops only the client; server-side work may continue.
            logging.info("Rollback: nodetool killed. Check `nodetool netstats` and run `nodetool stop VALIDATION`.")
            sys.exit(2)

        time.sleep(15)
        if proc.poll() is not None:
            break  # Repair finished during the sleep; skip the mid-flight check.
        # Mid-flight guard: abort once the backlog exceeds twice the gate threshold.
        if get_pending_compactions() > MAX_PENDING_COMPACTIONS * 2:
            logging.error("Critical backlog during repair. Aborting.")
            proc.kill()
            logging.info("Rollback: nodetool killed. Run `nodetool compactionstats` to watch the queue drain.")
            sys.exit(3)

    if proc.returncode != 0:
        logging.error("Repair failed with code %d; nodetool output is in this job's stderr.", proc.returncode)
        logging.info("Rollback: grep system.log for streaming errors; schedule a manual range repair.")
        sys.exit(4)

    logging.info("Repair completed.")
    logging.info("Post-repair pending compactions: %d", get_pending_compactions())


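def _handle_sigterm(signum: int, frame: FrameType | None) -> None:
    """Exit non-zero on SIGTERM so a scheduler never records an interrupted run as success."""
    sys.exit(143)  # 128 + SIGTERM (15), the conventional shell exit status.


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _handle_sigterm)
    execute_repair()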
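Because the orchestrator communicates only through its exit code, the scheduler needs a policy for each one. The mapping below is a sketch, not part of the orchestrator: the codes 0 through 4 come from the script above, while the Action names and the choice of which codes retry automatically are assumptions to adapt.

```python
from enum import Enum


class Action(Enum):
    DONE = "run the verification steps"
    RETRY = "retry in the next window"
    INVESTIGATE = "page an operator before retrying"


# Exit codes as emitted by the orchestrator; the actions are an assumed policy.
EXIT_ACTIONS: dict[int, Action] = {
    0: Action.DONE,         # repair finished
    1: Action.RETRY,        # pre-flight gate deferred the run
    2: Action.INVESTIGATE,  # hit REPAIR_TIMEOUT_SEC
    3: Action.INVESTIGATE,  # backlog guard tripped mid-flight
    4: Action.INVESTIGATE,  # nodetool repair returned non-zero
}


def next_action(exit_code: int) -> Action:
    """Treat unknown codes (crashes, signals) as needing a human."""
    return EXIT_ACTIONS.get(exit_code, Action.INVESTIGATE)


for code in (0, 1, 3, 137):
    print(code, "->", next_action(code).value)
```

Only the deferred gate (exit 1) is retried blindly here, since it is the one outcome where nothing was started.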

Figure (sequence diagram): nodetool repair uses Merkle trees to reconcile replicas while streaming only the differing ranges.
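The Merkle comparison itself fits in a few lines. This is a simplified illustration, assuming four fixed leaf ranges, SHA-256 hashing, and byte strings standing in for range contents; build_tree and differing_ranges are hypothetical names, and Cassandra's real trees are built during validation compaction with a different layout.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels, leaves first and root last. len(leaves) must be a power of two."""
    levels = [[_h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_ranges(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    def walk(level: int, idx: int) -> list[int]:
        if a[level][idx] == b[level][idx]:
            return []      # Identical subtree: nothing to stream.
        if level == 0:
            return [idx]   # Differing leaf: this range must be streamed.
        return walk(level - 1, 2 * idx) + walk(level - 1, 2 * idx + 1)
    return walk(len(a) - 1, 0)


replica_a = [b"r0:v1", b"r1:v1", b"r2:v1", b"r3:v1"]
replica_b = [b"r0:v1", b"r1:v1", b"r2:v2", b"r3:v1"]
print(differing_ranges(build_tree(replica_a), build_tree(replica_b)))  # [2]
```

Only range 2 is reported, which is why a mostly consistent table streams little data even though every replica still pays the cost of building its tree.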

Before invoking the orchestrator, throttle streaming so repair cannot saturate disk. Align the ceiling with the compaction strategy: TWCS isolates repair SSTables by time window, while LCS aggressively merges the deltas and will trigger cascading compaction if you do not rate-limit.

# Cap repair streaming bandwidth (value in megabits/s) before the run.
nodetool setstreamthroughput 100

Safety check: Confirm the current cap with nodetool getstreamthroughput and verify disk util% < 60% via iostat -x 1 first.
Expected output: setstreamthroughput prints nothing; getstreamthroughput echoes the value in megabits/s.
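Rollback path: Restore the value you recorded from nodetool getstreamthroughput before the change. Note that nodetool setstreamthroughput 0 removes the cap entirely (unlimited); it is not the shipped default, which is roughly 200 megabits/s in stock 4.x configurations. If streaming stalls, watch system.log for StreamSession timeouts before raising the cap.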
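A quick calculation shows how the cap interacts with the orchestrator's one-hour ceiling. The helper below is a rough estimate only: stream_seconds is a hypothetical name, a megabit is taken as 10^6 bits (an assumption about the unit), and the figure ignores protocol overhead, parallel sessions, and the fact that repair streams only differing ranges.

```python
def stream_seconds(data_gib: float, cap_megabits_per_s: float) -> float:
    """Lower bound on transfer time for data_gib GiB at a cap given in megabits/s."""
    if cap_megabits_per_s <= 0:
        raise ValueError("0 means unlimited in nodetool; no bound can be computed")
    bits = data_gib * 1024**3 * 8
    return bits / (cap_megabits_per_s * 1_000_000)


minutes = stream_seconds(50, 100) / 60
print(f"{minutes:.1f} minutes")  # 71.6 minutes
```

At 100 megabits/s, 50 GiB of divergent data cannot finish inside a 3600-second REPAIR_TIMEOUT_SEC, so either the cap or the timeout has to move when a table is known to be badly out of sync.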

Verification Steps

After the orchestrator exits 0, confirm the reconciliation actually landed and that the queue is draining rather than growing.

# 1. Confirm the repair session was recorded as successful.
cqlsh -e "SELECT keyspace_name, columnfamily_name, status \
          FROM system_distributed.repair_history LIMIT 5;"
# Expected: recent rows for your keyspace with status = 'SUCCESS'.

# 2. Confirm the backlog is shrinking, not climbing.
nodetool compactionstats -H
# Expected: "pending tasks" trending toward 0.

# 3. Confirm SSTable integrity on the repaired table (positional: keyspace then table).
nodetool verify production_data user_events
# Expected: exit code 0 and no error lines; corruption is reported on stderr and in system.log.

# 4. Confirm SSTable counts are stable, not exploding from unmerged deltas.
nodetool tablestats production_data.user_events | grep "SSTable count"
# Expected: a stable count consistent with the table's compaction strategy.

A SELECT at LOCAL_QUORUM against a previously divergent partition should now return the reconciled value with no blocking read-repair write-back visible in coordinator metrics.
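Step 2 asks for a trend, which a single compactionstats reading cannot show. One possible way to judge it is to sample the pending-task count a few times and apply a simple rule; backlog_is_healthy is a hypothetical helper and the rule itself is an assumption, not an established threshold.

```python
def backlog_is_healthy(samples: list[int]) -> bool:
    """Judge a series of pending-task counts taken after repair.

    Healthy means the queue is already empty, or it ended lower than it
    started without ever rising above the first reading.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")
    if samples[-1] == 0:
        return True
    return samples[-1] < samples[0] and max(samples) <= samples[0]


print(backlog_is_healthy([7, 6, 6, 4, 2]))  # True
print(backlog_is_healthy([3, 5, 9, 12]))    # False
```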

Troubleshooting

StreamingTimeoutException (or hung streams in nodetool netstats). Root cause: streaming throughput outran disk or network capacity, usually because the throttle was left at 0 (unlimited) while compaction competed for the same I/O. Fix: kill the run, lower nodetool setstreamthroughput (start at 50–100 Mb/s), stop stray validation compactions with nodetool stop VALIDATION, confirm nodetool netstats shows zero active streams, and re-run during a lower-traffic window. Interpreting the queue while this unfolds is covered in interpreting nodetool compactionstats output.

Deleted rows reappear after repair (data resurrection). Root cause: repair streamed data from a replica that still held a row whose tombstone had already been purged elsewhere because the repair cadence exceeded gc_grace_seconds. Fix: never let repair intervals drift past gc_grace_seconds; if resurrection already happened, re-delete the affected keys and shorten the schedule. The purge mechanics are detailed in tombstone management and garbage collection.

Compaction backlog keeps climbing mid-repair (orchestrator exits code 3). Root cause: repair-generated SSTables are entering the queue faster than the strategy can merge them, typical of LCS on a write-heavy table. Fix: the mid-flight guard has already killed the nodetool run, so let the queue drain, raise concurrent_compactors or lower stream throughput, and consider whether the table’s strategy suits its write pattern per the STCS vs LCS vs TWCS trade-offs before retrying.

Read repair vs anti-entropy repair — the parent guide explaining why the two reconciliation paths differ and when each fires.
Cassandra architecture and compaction fundamentals — how the storage engine and cluster fabric interact around repair and compaction.
Tombstone management and garbage collection — the gc_grace_seconds mechanics your repair cadence must respect to avoid resurrection.
Understanding STCS vs LCS vs TWCS — how each compaction strategy absorbs the SSTables repair and read repair generate.
