Operational Guide: Read Repair vs Anti-Entropy Repair in Cassandra

Modern Defaults & The Shift to Scheduled Consistency

Cassandra’s consistency guarantees rely on two distinct synchronization pathways: path-dependent read repair and scheduled anti-entropy (AE) repair. Cassandra 4.0 removed the legacy read_repair_chance and dclocal_read_repair_chance table options entirely (historically dclocal_read_repair_chance defaulted to 0.1 and read_repair_chance to 0.0), replacing them with the per-table read_repair option whose only valid values are 'BLOCKING' (the default) and 'NONE'. Blocking read repair still runs at query time on a digest mismatch for consistency levels above ONE. This architectural pivot reflects production reality: reactive, path-dependent synchronization introduces unpredictable tail latency, fragments SSTables, and directly competes with background compaction threads. Modern clusters treat AE repair as the authoritative consistency enforcement mechanism, while read repair (read_repair = 'BLOCKING') handles in-query reconciliation for niche, low-throughput query patterns. For a foundational breakdown of how these mechanisms fit into broader storage engine behavior, see Cassandra Architecture & Compaction Fundamentals.

Read Repair: Path-Dependent Synchronization & Latency Penalties

Read repair triggers synchronously when a coordinator detects divergent data across replicas during a client query. The coordinator reconciles discrepancies by streaming the most recent mutation to lagging nodes before returning the result set. While conceptually elegant, this mechanism directly conflicts with LSM Tree Mechanics in Cassandra. Out-of-band writes generated during read repair bypass normal write-path coordination, producing uncoordinated SSTables, inflating tombstone density, and triggering premature compaction cycles.

The sequence below traces how blocking read repair detects and fixes a divergence at query time.

sequenceDiagram
    participant Co as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2
    Co->>R1: Read full data
    Co->>R2: Read digest
    R1-->>Co: Full data and digest
    R2-->>Co: Digest
    Note over Co: Digest mismatch detected
    Co->>R1: Fetch newest data
    R1-->>Co: Newest version
    Co->>R2: Write back newest version
Blocking read repair reconciles a stale replica at query time
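
The flow in the diagram can be approximated with a toy model. This is a minimal sketch, not Cassandra's implementation: the Cell class, the digest helper, and blocking_read are hypothetical names, each replica holds a single cell, and timestamp ties are ignored.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    write_ts: int  # write timestamp; the newest timestamp wins


def digest(cell: Cell) -> str:
    """Hash of the cell contents, standing in for a replica's digest response."""
    return hashlib.sha256(f"{cell.value}:{cell.write_ts}".encode()).hexdigest()


def blocking_read(replicas: dict[str, Cell]) -> tuple[Cell, list[str]]:
    """Return the reconciled cell and the names of the replicas that were repaired."""
    digests = {digest(cell) for cell in replicas.values()}
    if len(digests) == 1:
        # All digests agree: nothing to repair.
        return next(iter(replicas.values())), []
    # Digest mismatch: pick the newest version and write it back to stale replicas.
    newest = max(replicas.values(), key=lambda cell: cell.write_ts)
    repaired = []
    for name, cell in replicas.items():
        if cell.write_ts < newest.write_ts:
            replicas[name] = newest
            repaired.append(name)
    return newest, repaired


replicas = {"r1": Cell("v2", 200), "r2": Cell("v1", 100)}
result, repaired = blocking_read(replicas)
```

In this model the stale replica r2 is overwritten before the result is returned, which is the step that adds latency to the client read.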

In high-throughput environments, read repair manifests as:

  • Tail latency spikes: With read_repair = 'BLOCKING', the coordinator holds the client response until the repair writes are acknowledged by the replicas needed to satisfy the consistency level.
  • Compaction queue saturation: Fragmented, uncoordinated SSTables force SizeTieredCompactionStrategy (STCS) to merge prematurely, consuming CPU and I/O bandwidth.
  • Tombstone accumulation: Divergent deletes or updates generate overlapping tombstones that exceed safe thresholds before gc_grace_seconds expires.

Operational Directive: Set read_repair = 'NONE' via ALTER TABLE on tables where in-query reconciliation is undesirable; the default 'BLOCKING' only fires on a digest mismatch at CL>ONE. Monitor the read repair metrics (for example org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking) and org.apache.cassandra.metrics:type=Compaction,name=PendingTasks via JMX. Prefer 'NONE' on tables utilizing TimeWindowCompactionStrategy (TWCS) or high-write workloads.
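
As an illustration of the directive above, the following sketch builds the ALTER TABLE statement and rejects any value other than the two that the option accepts. The function name alter_read_repair and the identifier check are assumptions made for this example; only the read_repair option and its values come from the text.

```python
import re

VALID_READ_REPAIR = {"BLOCKING", "NONE"}
# Unquoted CQL identifiers: a letter followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def alter_read_repair(keyspace: str, table: str, mode: str) -> str:
    """Build an ALTER TABLE statement that sets the per-table read_repair option."""
    mode = mode.upper()
    if mode not in VALID_READ_REPAIR:
        raise ValueError(f"read_repair must be one of {sorted(VALID_READ_REPAIR)}")
    for identifier in (keyspace, table):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsupported identifier: {identifier!r}")
    return f"ALTER TABLE {keyspace}.{table} WITH read_repair = '{mode}';"


statement = alter_read_repair("shop", "orders", "none")
```

The returned string can be run through cqlsh or a driver session; generating it from a validated helper keeps typos such as 'OFF' or 'DISABLED' from reaching the cluster.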

Anti-Entropy Repair: Authoritative Range Validation

Anti-entropy repair operates asynchronously via nodetool repair, validating token ranges across replicas and streaming missing or divergent ranges. Modern Cassandra (4.0/5.x) leverages incremental repair, which tracks the repaired status of each SSTable via per-SSTable repairedAt metadata (and the system.repairs table), so subsequent runs only re-validate data that has not yet been marked repaired. This drastically reduces network overhead and disk I/O compared to repeatedly comparing the full data set.

AE repair integrates tightly with Data Partitioning & Token Ring Basics: when invoked with -pr, each node repairs only its primary ranges, so running repair on every node covers the ring without repairing any range twice. Repair coordination relies on Node Gossip & Failure Detection Protocols to identify live endpoints, validate schema agreement, and negotiate streaming sessions. Because repair runs independently of client reads, it can be safely scheduled during maintenance windows or throttled during peak traffic.

Key v4.x/v5.x Improvements:

  • Incremental repair is the default execution mode of nodetool repair (no flag); pass -full to force a full repair.
  • Merkle trees are still built to detect divergence; repair then streams only the differing ranges between replicas.
  • system_distributed.repair_history provides granular audit trails for compliance and troubleshooting.
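
The Merkle tree comparison mentioned above can be sketched with a toy tree. This is an illustrative model under simplifying assumptions, not Cassandra's tree: each leaf stands for the data in one token sub-range, the leaf count is a power of two, and build_tree and differing_leaves are hypothetical names.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return the tree as a list of levels: leaf hashes first, root hash last."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a power of two")
    levels = [[_hash(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_hash(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    if len(a) != len(b):
        raise ValueError("trees must have the same depth")
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    out = []
    while stack:
        level, idx = stack.pop()
        if a[level][idx] == b[level][idx]:
            continue  # identical subtree: nothing to stream
        if level == 0:
            out.append(idx)
        else:
            stack.append((level - 1, 2 * idx))
            stack.append((level - 1, 2 * idx + 1))
    return sorted(out)


replica_a = build_tree([b"range0", b"range1", b"range2", b"range3"])
replica_b = build_tree([b"range0", b"range1-stale", b"range2", b"range3"])
to_stream = differing_leaves(replica_a, replica_b)
```

Only the sub-range at index 1 differs here, so only that range would need to be streamed; matching subtrees are skipped after a single hash comparison.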

Compaction Strategy Alignment & Tombstone Lifecycle

Repair and compaction are deeply coupled. As detailed in Understanding STCS vs LCS vs TWCS, each compaction strategy responds differently to repair-generated SSTables:

  • STCS: Aggressively merges incoming repair streams. High repair frequency can trigger runaway compaction if compaction_throughput_mb_per_sec (renamed compaction_throughput in 4.1) is misconfigured.
  • LCS: Maintains sorted runs that benefit from incremental repair. LCS handles repair deltas gracefully but requires careful sstable_size_in_mb tuning to prevent excessive L0 overlap.
  • TWCS: Isolates data into immutable time windows. Repair must align with window boundaries; otherwise, deleted data can be resurrected across windows if gc_grace_seconds is shorter than the repair interval.

Effective Tombstone Management & Garbage Collection requires gc_grace_seconds to exceed the maximum repair interval. If a node is down longer than gc_grace_seconds, tombstones may be purged before repair synchronizes the deletion, leading to data resurrection. Note that the node-level cassandra.yaml settings tombstone_warn_threshold (default 1000) and tombstone_failure_threshold (default 100000) cap how many tombstones a single read query may scan; they are global yaml parameters, not per-table properties, and cannot be set via nodetool.
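
The relationship between gc_grace_seconds and the repair schedule is simple arithmetic, sketched below. The helper names are hypothetical, and the safety factor of 2 mirrors the checklist at the end of this guide rather than any Cassandra setting; the 10-day figure is Cassandra's default gc_grace_seconds (864000).

```python
from datetime import timedelta

DEFAULT_GC_GRACE = timedelta(seconds=864000)  # Cassandra's default: 10 days


def gc_grace_is_safe(
    gc_grace: timedelta,
    max_repair_interval: timedelta,
    safety_factor: float = 2.0,
) -> bool:
    """True when gc_grace covers the repair interval with the chosen margin."""
    return gc_grace >= max_repair_interval * safety_factor


def longest_safe_interval(gc_grace: timedelta, safety_factor: float = 2.0) -> timedelta:
    """Longest repair interval that still satisfies the margin."""
    return gc_grace / safety_factor


weekly_ok = gc_grace_is_safe(DEFAULT_GC_GRACE, timedelta(days=7))
five_day_ok = gc_grace_is_safe(DEFAULT_GC_GRACE, timedelta(days=5))
```

With the default gc_grace_seconds, a weekly repair cycle fails the 2× check while a five-day cycle passes, so a weekly schedule would need either a larger gc_grace_seconds or a smaller margin.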

Automated Repair Workflows for v4.x/v5.x

Manual nodetool repair execution does not scale. Production environments require deterministic, idempotent automation that respects cluster topology, throttles I/O, and validates completion. Below is a validated Python orchestration pattern aligned with Cassandra 4.x/5.x standards.

1. Primary Range Discovery

Primary token ranges per node come from the ring topology, not from getendpoints (which returns the replica endpoints for a single partition key). Use nodetool ring <keyspace> or nodetool describering <keyspace> to enumerate ranges.

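import subprocess

def get_ring(node_ip: str, keyspace: str) -> str:
    # Pass argv as a list so no argument is re-split on whitespace
    cmd = ["ssh", node_ip, "nodetool", "describering", keyspace]
    # check=True raises CalledProcessError if ssh or nodetool fails
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Returns the raw output; callers parse the TokenRange entries for token boundaries
    return result.stdout.strip()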
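
To turn the raw describering output into token boundaries, a regular expression over the TokenRange entries is one option. This is a sketch under assumptions: it expects entries of the form TokenRange(start_token:..., end_token:..., endpoints:[...], ...) with integer tokens (as with Murmur3Partitioner), and parse_token_ranges is a hypothetical helper. Verify the pattern against the output of your Cassandra version before relying on it.

```python
import re

# Assumed entry shape: TokenRange(start_token:<int>, end_token:<int>, endpoints:[a, b], ...)
_TOKEN_RANGE = re.compile(
    r"TokenRange\(start_token:(-?\d+),\s*end_token:(-?\d+),\s*endpoints:\[([^\]]*)\]"
)


def parse_token_ranges(describering_output: str) -> list[tuple[int, int, list[str]]]:
    """Extract (start_token, end_token, endpoints) for each TokenRange entry."""
    ranges = []
    for match in _TOKEN_RANGE.finditer(describering_output):
        endpoints = [e.strip() for e in match.group(3).split(",") if e.strip()]
        ranges.append((int(match.group(1)), int(match.group(2)), endpoints))
    return ranges


sample = (
    "Schema Version:86afa796-d883-3932-aa73-6b017cef0d19\n"
    "TokenRange:\n"
    "\tTokenRange(start_token:-9223372036854775808, end_token:-3074457345618258603, "
    "endpoints:[10.0.0.1, 10.0.0.2], rpc_endpoints:[10.0.0.1, 10.0.0.2])\n"
    "\tTokenRange(start_token:-3074457345618258603, end_token:3074457345618258602, "
    "endpoints:[10.0.0.2, 10.0.0.3], rpc_endpoints:[10.0.0.2, 10.0.0.3])\n"
)
parsed = parse_token_ranges(sample)
```

The sample text is hand-written for illustration; the addresses and schema version in it are made up.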

2. Throttled Repair Execution

Run primary-range repair (incremental is the default) and throttle streaming separately via setstreamthroughput to prevent compaction starvation. Use -j for job threads; do not combine -pr with -local.

def run_repair(node_ip: str, keyspace: str, table: str, throttle_mbps: int = 50) -> str:
    # Throttle streaming bandwidth first (value is in megabits/s)
    subprocess.run(
        ["ssh", node_ip, "nodetool", "setstreamthroughput", str(throttle_mbps)],
        capture_output=True, text=True, check=True,
    )
    # Primary-range repair with 2 job threads; incremental is the default mode
    cmd = ["ssh", node_ip, "nodetool", "repair", "-pr", "-j", "2", keyspace, table]
    process = subprocess.run(cmd, capture_output=True, text=True)
    if process.returncode != 0:
        raise RuntimeError(f"Repair failed on {node_ip}: {process.stderr}")
    return process.stdout

3. Validation & Alerting Integration

Observe active repair and streaming activity directly via nodetool netstats and nodetool compactionstats (validation compactions appear there). For dashboards, integrate with Prometheus/Grafana using the JMX Exporter against the real metric MBeans (note the colon after the domain and the comma-separated key=value properties):

  • org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
  • org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts (and scope=Write for write timeouts)
  • org.apache.cassandra.metrics:type=Streaming,... for streaming throughput during repair

Schedule repairs using cron or Kubernetes CronJobs, staggering node execution to maintain quorum availability. For multi-region deployments, align with Consistency Level Selection for Multi-DC Deployments to ensure repair does not violate QUORUM or LOCAL_QUORUM guarantees during streaming. When implementing Cross-Cluster Replication & Conflict Resolution, isolate repair domains per cluster and avoid overlapping token ranges to prevent split-brain synchronization loops.
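
Staggered execution can be sketched as a loop that repairs one node at a time and stops at the first failure. The function repair_sequentially and its parameters are hypothetical; the repair callable is injected so the loop can be exercised without a cluster.

```python
import time
from typing import Callable, Iterable


def repair_sequentially(
    nodes: Iterable[str],
    repair_fn: Callable[[str], str],
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Run repair_fn on one node at a time, pausing between nodes.

    An exception from repair_fn propagates and stops the rollout, so the
    remaining nodes are left untouched until the failure is investigated.
    """
    results: dict[str, str] = {}
    for index, node in enumerate(nodes):
        if index > 0 and pause_seconds > 0:
            sleep(pause_seconds)  # let streaming and compaction settle
        results[node] = repair_fn(node)
    return results


# Dry run with a stand-in repair function and a recorded (not real) sleep.
pauses: list[float] = []
outcome = repair_sequentially(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    repair_fn=lambda ip: f"repaired {ip}",
    pause_seconds=300,
    sleep=pauses.append,
)
```

In a real schedule, repair_fn would wrap the run_repair function from the previous step with the target keyspace and table. Running nodes strictly one after another means at most one replica of any range is busy with repair at a time.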

Operational Directives & Validation Checklist

Before deploying repair automation to production, validate against the following SRE standards:

| Checkpoint | v4.x/v5.x Requirement |
| --- | --- |
| Read Repair State | read_repair = 'NONE' (or default 'BLOCKING') per table via ALTER TABLE |
| Repair Mode | Incremental (default, no flag) with primary range (-pr) targeting |
| Throttling | nodetool setstreamthroughput <megabits/s> sized to leave headroom for client I/O |
| GC Grace Alignment | gc_grace_seconds ≥ 2× maximum repair interval |
| Schema Agreement | nodetool describecluster shows a single schema version before execution |
| Tombstone Safety | nodetool tablestats tombstone counts well below the yaml tombstone_failure_threshold |
| Multi-DC Routing | Repair scoped to local DC; cross-DC sync handled via dedicated replication pipelines |
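
The schema agreement checkpoint can be automated by scanning nodetool describecluster output. This is a sketch under assumptions: it expects schema versions to appear as lines of the form "<uuid>: [hosts]" and unreachable nodes to be reported under an UNREACHABLE key, and the function names are hypothetical. Check both assumptions against the output of your Cassandra version.

```python
import re

# Assumed line shape under "Schema versions:":  <uuid>: [host1, host2]
_SCHEMA_LINE = re.compile(
    r"^\s*([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}):\s*\[",
    re.MULTILINE,
)


def schema_versions(describecluster_output: str) -> set[str]:
    """Collect the distinct schema version UUIDs reported by the cluster."""
    return set(_SCHEMA_LINE.findall(describecluster_output))


def schema_in_agreement(describecluster_output: str) -> bool:
    """True when exactly one schema version is listed and no node is unreachable."""
    versions = schema_versions(describecluster_output)
    return len(versions) == 1 and "UNREACHABLE" not in describecluster_output


agreed = (
    "Cluster Information:\n"
    "\tName: Test Cluster\n"
    "\tSchema versions:\n"
    "\t\t86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]\n"
)
split = agreed + "\t\t0b1c2d3e-aaaa-bbbb-cccc-1234567890ab: [10.0.0.3]\n"
ready = schema_in_agreement(agreed)
blocked = schema_in_agreement(split)
```

The sample outputs are hand-written for illustration. A repair job can call schema_in_agreement first and skip the run when it returns False.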

Repair is not a substitute for proper consistency configuration. Always pair scheduled AE repair with appropriate CONSISTENCY levels, monitor streaming validation success rates, and rotate repair windows to distribute I/O load. When executed deterministically, anti-entropy repair guarantees eventual consistency without compromising read latency or compaction stability.
