Read Repair vs Anti-Entropy Repair in Cassandra 4.x and 5.x

Cassandra keeps replicas consistent through two mechanisms that are easy to conflate and expensive to confuse: read repair, which reconciles divergence on the read path at query time, and anti-entropy repair, which reconciles it on a schedule with nodetool repair. This page is for DBAs, distributed-systems engineers, and DevOps teams who need to decide which pathway owns consistency on a given table, and who need runnable checks to run repair without starving compaction or resurrecting deleted data. It sits beneath the broader Cassandra architecture and compaction fundamentals guide, which frames how the storage engine and cluster fabric interact; here we drill into the two reconciliation paths themselves. Reach for it when you are choosing read_repair per table, scheduling repair cadence against gc_grace_seconds, or debugging why a table’s compaction queue spikes whenever traffic does — on Cassandra 4.x or 5.x, where the legacy probabilistic read-repair knobs are gone and incremental repair is the default.

How the Two Reconciliation Paths Differ

Both paths converge divergent replicas, but they differ in when they run, what triggers them, and how much they cost — and that difference dictates which one you lean on in production.

Read repair is path-dependent: it only fires when a client query touches divergent data. At any consistency level above ONE, the coordinator asks one replica for the full data and the others for a digest (a hash of the response). When the digests disagree, the coordinator resolves the newest version by timestamp and, under the default read_repair = 'BLOCKING', blocks the query until it has written that version back to the stale replicas within the read’s consistency scope. The reconciliation is therefore correct but reactive — it repairs only the exact rows a client happened to read, and it does so on the latency-critical path.
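
The digest comparison and write-back described above can be sketched as a toy model. This is a minimal illustration, not Cassandra's implementation: `Cell`, `digest`, and `blocking_read` are hypothetical names, replicas are plain dictionary entries, and SHA-256 stands in for whatever hash the server actually uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp; the highest one wins reconciliation


def digest(cell: Cell) -> str:
    """Hash of a replica's response, standing in for a digest reply."""
    return hashlib.sha256(f"{cell.value}:{cell.timestamp}".encode()).hexdigest()


def blocking_read(replicas: dict[str, Cell], contacted: list[str]) -> Cell:
    """Toy coordinator: one data read, digest reads from the rest, repair on mismatch."""
    data_node, *digest_nodes = contacted
    data = replicas[data_node]
    if all(digest(replicas[n]) == digest(data) for n in digest_nodes):
        return data  # digests agree, nothing to repair
    # Mismatch: compare full data from the contacted replicas and keep the newest.
    newest = max((replicas[n] for n in contacted), key=lambda c: c.timestamp)
    for n in contacted:
        if replicas[n] != newest:
            replicas[n] = newest  # the write-back the client waits on
    return newest


# Three replicas, one of which holds a newer write; a quorum read contacts two.
replicas = {"a": Cell("v2", 20), "b": Cell("v1", 10), "c": Cell("v1", 10)}
result = blocking_read(replicas, ["b", "a"])
print(result, replicas)
```

Replica `c` stays stale after the read because it was outside the set the query contacted, which is the gap scheduled repair has to close.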

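In sequence, a blocking read repair runs as follows: the client issues a read above ONE; the coordinator sends one full data request and digest requests to the other replicas in scope; a digest mismatch triggers full reads from those replicas; the coordinator merges the responses by timestamp, writes the newest version back to the stale replicas, waits for their acknowledgements, and only then returns the result to the client.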

Anti-entropy repair is comprehensive and scheduled. Rather than waiting for a read, nodetool repair walks whole token ranges and compares replicas structurally. Each participating replica builds a Merkle tree — a binary hash tree whose leaves cover contiguous sub-ranges of the token space and whose parent nodes hash their children — over the ranges under repair. Replicas exchange trees and descend only into subtrees whose hashes disagree, so the diff cost is proportional to the divergence, not to the dataset size. Where leaves differ, the replicas stream the missing or newer data to each other. Because it operates over ranges rather than individual rows, anti-entropy repair reconciles data no client has read recently — expired hints, mutations dropped under load, and writes missed during an outage — which read repair never reaches.

The two paths interact with the storage engine very differently. Read repair issues out-of-band writes that bypass normal write-path batching; those writes land as small, uncoordinated SSTables that the LSM tree mechanics in Cassandra must later compact away, inflating tombstone density and read amplification on hot tables. Anti-entropy repair streams larger, range-aligned SSTables and, on 4.x/5.x, uses incremental repair to mark data as already-repaired via per-SSTable repairedAt metadata (tracked in system.repairs), so subsequent runs re-validate only unrepaired data. Repair also depends on the distributed layer: it enumerates ranges over the data partitioning and token ring and relies on the node gossip and failure detection protocols to find live endpoints and confirm schema agreement before negotiating streaming sessions.

The practical consequence is a division of labor that modern clusters standardize on: scheduled anti-entropy repair is the authoritative consistency mechanism, run to completion within every gc_grace_seconds window, while read repair is left at its default only where in-query convergence genuinely helps and disabled where it merely taxes latency and compaction.

Configuration Reference

Read repair is a per-table schema property; repair cadence and throttling are operational settings. Cassandra 4.0 removed the probabilistic knobs entirely: read_repair_chance and dclocal_read_repair_chance are no longer supported table options, so strip them from any schema definitions or migration scripts you carry forward. That leaves a single per-table read_repair option.

| Key | Default (4.x/5.x) | Valid range | Impact on repair / compaction / throughput |
| --- | --- | --- | --- |
| `read_repair` (per table) | `'BLOCKING'` | `'BLOCKING'` or `'NONE'` | `'BLOCKING'` reconciles and blocks on digest mismatch at `CL > ONE`, adding tail latency and out-of-band SSTables; `'NONE'` disables on-read reconciliation, shifting all convergence to scheduled repair. |
| `gc_grace_seconds` (per table) | `864000` (10 days) | `0`–`2^31-1` | Tombstone purge deadline; must exceed the full-coverage repair interval or deletes resurrect. |
| `-pr` / `--partitioner-range` | off | flag | Scopes repair to this node’s primary ranges so a ring-wide cycle repairs each range once. Must not be combined with `-local`. |
| incremental (default) vs `-full` | incremental | flag | Incremental re-validates only unrepaired SSTables; `-full` re-diffs everything and is the choice after data loss or first adoption. |
| `nodetool setstreamthroughput` | `200` (Mb/s, 0 = unlimited) | non-negative integer | Caps repair streaming bandwidth so validation and streaming leave headroom for client I/O and compaction. |
| `compaction_throughput_mb_per_sec` | `64` (yaml) | non-negative integer | Bounds how fast repair-generated SSTables are compacted back into their tiers; too low lets a backlog form. |

Disable on-read reconciliation on a table where it only adds latency and SSTable churn — typically high-write or time-series tables that rely on scheduled repair instead:

```
-- Shift all reconciliation to scheduled anti-entropy repair for a write-heavy table.
ALTER TABLE ks.events WITH read_repair = 'NONE';
```

Leave the default in place where low-throughput, read-mostly tables benefit from converging on the exact rows clients read:

```
-- Keep blocking read repair (the default) on a read-mostly reference table.
ALTER TABLE ks.reference WITH read_repair = 'BLOCKING';
```

Prefer read_repair = 'NONE' on tables using TimeWindowCompactionStrategy: out-of-band read-repair writes can land outside the current time window and defeat the strategy’s window isolation, and the time-series strategy selection guidance assumes writes arrive in timestamp order.

Running a Gated Primary-Range Repair

Manual, ad-hoc nodetool repair does not scale and, run carelessly, saturates streaming or races an unhealthy ring. Run these steps in order and stop if any gate fails.

Confirm a single schema version across the deployment. A split schema means gossip has not converged and streaming will diverge.
```
nodetool describecluster
```
Expected — exactly one schema version listing all live nodes:
```
Cluster Information:
    Name: prod
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        2b6f...c41: [10.0.1.10, 10.0.1.11, 10.0.1.12]
```
Verify every node is UN (Up/Normal). Repairing against a DN node streams into a void and leaves ranges unreconciled.
```
nodetool status my_keyspace
```
Gate: no node shows DN/?. If one is down, defer the repair rather than partially reconciling.
Confirm no streams are already in flight. Overlapping streams saturate inter-node bandwidth and make progress hard to attribute to a single session.
```
nodetool netstats | head -n 2
```
Expected when idle (on 4.x/5.x this one line is printed when the node has no active stream sessions, whether sending or receiving):
```
Mode: NORMAL
Not sending any streams.
```
Throttle streaming, then start the incremental primary-range repair. Incremental is the default on 4.x/5.x (pass -full only to force a full diff); -pr scopes to this node’s primary ranges so a ring-wide sweep repairs each range once.
```
nodetool setstreamthroughput 50
nodetool repair -pr -j 2 my_keyspace events
```
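
A throttle is easier to choose with a rough idea of how long a given volume takes to stream under it. The sketch below is back-of-the-envelope arithmetic, assuming the value passed to setstreamthroughput is in megabits per second (as the configuration table states) and that the cap, not disk or CPU, is the bottleneck; `stream_seconds` is a hypothetical helper.

```python
def stream_seconds(data_gib: float, throttle_megabits: int) -> float:
    """Seconds needed to stream data_gib GiB at throttle_megabits Mb/s."""
    if throttle_megabits <= 0:
        raise ValueError("0 means unlimited, so no estimate is possible")
    bits = data_gib * 1024**3 * 8
    return bits / (throttle_megabits * 1_000_000)


# 10 GiB of divergent data at the 50 Mb/s cap used above, versus the 200 Mb/s default.
for cap in (50, 200):
    minutes = stream_seconds(10, cap) / 60
    print(f"{cap} Mb/s: about {minutes:.0f} minutes")
```

At 50 Mb/s the link carries roughly 6 MB of data per second, so a conservative cap trades a longer repair for more headroom on the client path.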

The Python driver below folds those gates into an idempotent, cron-safe routine. It parses the plain-text output of nodetool (neither status nor netstats supports JSON on 4.x/5.x) and refuses to repair against an unhealthy ring or over a busy stream.

```
#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` on PATH. No third-party deps.
"""Gate an incremental primary-range repair on ring health (Cassandra 4.x/5.x)."""

import re
import subprocess
import sys
from datetime import datetime, timezone

# Node rows in `nodetool status` start with a state letter and a mode letter.
_STATUS_ROW = re.compile(r"^[UD?][NLJM]\s")


def _nodetool(args: list[str], timeout: int = 30) -> str:
    """Run nodetool, raising on a non-zero exit or timeout."""
    proc = subprocess.run(
        ["nodetool", *args], capture_output=True, text=True, timeout=timeout
    )
    if proc.returncode != 0:
        raise RuntimeError(f"nodetool {' '.join(args)} failed: {proc.stderr.strip()}")
    return proc.stdout


def _schema_versions(describe: str) -> list[str]:
    """Return the entries listed under 'Schema versions:' in describecluster output."""
    versions: list[str] = []
    in_section = False
    for ln in describe.splitlines():
        if ln.strip() == "Schema versions:":
            in_section = True
        elif in_section:
            if not ln.strip():
                break  # a blank line ends the section
            versions.append(ln.strip())
    return versions


def ring_is_healthy() -> bool:
    """True only if there is a single schema version and every node is UN."""
    describe = _nodetool(["describecluster"])
    # Scope the check to the schema section: other sections of the output
    # also print "<label>: [<hosts>]" lines and must not be counted.
    if len(_schema_versions(describe)) != 1:
        print(f"Schema disagreement or unparseable output; deferring:\n{describe}")
        return False
    status = _nodetool(["status"])
    not_un = [
        ln for ln in status.splitlines()
        if _STATUS_ROW.match(ln) and not ln.startswith("UN")
    ]
    if not_un:
        listing = "\n".join(not_un)
        print(f"Nodes not Up/Normal; deferring:\n{listing}")
        return False
    return True


def streaming_idle() -> bool:
    """True when no repair or bootstrap streams are already in flight."""
    net = _nodetool(["netstats"])
    # netstats prints this line only when the node has no active stream sessions.
    return "Not sending any streams." in net


def repair(keyspace: str, table: str, throttle_mbps: int = 50, timeout: int = 7200) -> bool:
    """Throttle streaming, then run a gated incremental primary-range repair."""
    if not (ring_is_healthy() and streaming_idle()):
        print("Pre-flight gate failed; not starting repair.")
        return False
    # Cap streaming bandwidth first so repair leaves headroom for client I/O.
    _nodetool(["setstreamthroughput", str(throttle_mbps)])
    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] Starting incremental -pr repair on {keyspace}.{table}")
    # Incremental is the default on 4.x/5.x; -pr repairs each range once per ring sweep.
    try:
        proc = subprocess.run(
            ["nodetool", "repair", "-pr", "-j", "2", keyspace, table],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Only the local nodetool client is killed; the server-side session may still run.
        print(f"Repair exceeded {timeout}s; check nodetool netstats before retrying.")
        return False
    if proc.returncode != 0:
        print(f"Repair failed (exit {proc.returncode}): {proc.stderr.strip()}")
        return False
    print("Repair completed successfully.")
    return True


if __name__ == "__main__":
    # Exit non-zero on a failed or deferred run so cron and CronJobs can alert on it.
    sys.exit(0 if repair("my_keyspace", "events") else 1)
```

Stagger this routine across nodes (cron or a Kubernetes CronJob) so quorum stays available, and complete a full ring sweep well inside each table’s gc_grace_seconds.
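
One way to derive the stagger is to divide a fraction of gc_grace_seconds evenly across the ring. The sketch below is only a planning aid: `stagger` and its `safety` factor are hypothetical, and it assumes every node's repair finishes within its slot.

```python
from datetime import timedelta


def stagger(nodes: list[str], gc_grace_seconds: int, safety: float = 0.5) -> dict[str, timedelta]:
    """Give each node a start offset so one sweep fits in safety * gc_grace_seconds."""
    if not nodes:
        return {}
    window = gc_grace_seconds * safety  # time budget for one full ring sweep
    slot = window / len(nodes)          # spacing between consecutive nodes
    return {node: timedelta(seconds=i * slot) for i, node in enumerate(nodes)}


# Three nodes and the default 10-day grace period: a 5-day sweep, one node every 40 hours.
offsets = stagger(["10.0.1.10", "10.0.1.11", "10.0.1.12"], 864000)
for node, offset in offsets.items():
    print(node, offset)
```

Translate each offset into that node's cron expression or CronJob schedule, and shrink `safety` if observed repair times start to approach the slot length.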

Verification & Observability

Confirm the outcome rather than trusting the exit code alone.

Watch the repair drain, not just finish. nodetool netstats should return to Not sending any streams. once every session closes; a session whose progress stays below 100% usually signals a saturated link or a stalled peer, so lower setstreamthroughput and retry.
Confirm validation compactions ran. Merkle-tree builds surface as validation compactions; watch them and any follow-on merge with nodetool compactionstats -H, and confirm pending tasks returns toward baseline afterward.
Audit repair history (4.x/5.x). The system_distributed.repair_history and system_distributed.parent_repair_history tables record each session’s ranges, participants, and status for compliance and troubleshooting. Per-range status and range bounds live in repair_history:
```
SELECT keyspace_name, columnfamily_name, status, range_begin, range_end
FROM system_distributed.repair_history LIMIT 20;
```
Track read-repair activity via JMX. The org.apache.cassandra.metrics:type=ReadRepair MBeans expose read-repair rates; a sustained climb on a table means replicas are diverging faster than scheduled repair reconciles them. Export it and compaction pending tasks through the Prometheus JMX Exporter:
- org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
- org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking
Grep the logs. system.log records Repair session ... finished on success and Validation failed / Sync failed on divergence that could not be reconciled; repeated Streaming error lines point at bandwidth or a flapping endpoint.

Failure Modes & Rollback

Data resurrection from a repair cadence longer than gc_grace_seconds. If a full-coverage repair does not complete within gc_grace_seconds, compaction can purge a tombstone before every replica received the delete; a replica that missed it then reintroduces the “deleted” row on the next repair. Detect it by comparing your slowest observed ring-sweep duration against each table’s gc_grace_seconds, and watch for reappearing rows after repair. Rollback: raise gc_grace_seconds (ALTER TABLE ks.tbl WITH gc_grace_seconds = 1209600) to cover the real cadence, then re-run nodetool repair -full on the affected table so the delete propagates before the next purge. This ties directly to tombstone management and garbage collection, which governs when a tombstone becomes eligible for purge.
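
The cadence comparison is simple arithmetic, sketched below. `resurrection_risk` and its 0.8 `margin` are hypothetical choices meant to leave slack for a skipped or failed cycle, not recommended constants.

```python
def resurrection_risk(sweep_seconds: int, gc_grace_seconds: int, margin: float = 0.8) -> bool:
    """True when the slowest observed ring sweep uses more than margin * gc_grace_seconds."""
    return sweep_seconds > gc_grace_seconds * margin


DAY = 86400
# A 9-day sweep against the 10-day default leaves too little slack.
print(resurrection_risk(9 * DAY, 864000))
# Raising gc_grace_seconds to 14 days (1209600) covers the same cadence.
print(resurrection_risk(9 * DAY, 1209600))
```

Run the check per table, since gc_grace_seconds is a per-table setting and one short-grace table can be at risk while the rest of the keyspace is fine.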

Compaction backlog from unthrottled read repair. On a hot, high-write table left at read_repair = 'BLOCKING', out-of-band repair writes generate a stream of small SSTables that STCS merges prematurely, pushing the Compaction PendingTasks metric up and stalling flushes. Detect it with nodetool compactionstats climbing in lockstep with query volume and rising ReadRepair metrics. Rollback: run ALTER TABLE ks.tbl WITH read_repair = 'NONE' to stop the on-read writes, let the queue drain, and rely on scheduled repair for convergence.

Streaming storm from a full repair on a busy ring. Kicking off nodetool repair -full cluster-wide at peak, or on many nodes at once, saturates inter-node links and times out client reads at QUORUM. Detect it with nodetool netstats showing many concurrent sessions and a spike in the read timeout metric (org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts). Rollback: there is no clean cancel for an in-flight session, so lower setstreamthroughput immediately to relieve the link, let active sessions finish, and reschedule as staggered incremental -pr runs during off-peak windows.

Frequently Asked Questions

Do I still need anti-entropy repair if read repair is enabled?

Yes. Read repair only reconciles rows that clients actually read, and only under the consistency levels a given query uses; it never touches cold data, expired hints, or mutations dropped during an outage. Anti-entropy repair is the only mechanism that walks entire token ranges and reconciles data no one has queried recently. Treat scheduled repair as authoritative and read repair as a supplementary, path-dependent optimization.

Should I set read_repair to NONE on every table?

No — set it per workload. Disable it on high-write and time-series tables, where out-of-band writes add tail latency and fragment SSTables (especially under TWCS, where they can land outside the current window). Keep the 'BLOCKING' default on low-throughput, read-mostly tables, where converging on exactly the rows clients read is cheap and useful.

What happens if gc_grace_seconds is shorter than my repair cadence?

Tombstones can be purged during compaction before a repair propagates the delete to every replica, resurrecting deleted data on replicas that missed it. Keep your full-coverage repair cycle comfortably shorter than gc_grace_seconds, or raise gc_grace_seconds to cover the real cadence, and gate repairs on ring health so a skipped or failed cycle does not silently exceed the grace window.

Is incremental or full repair the right default on Cassandra 4.x/5.x?

Incremental is the default and the right routine choice: it marks repaired SSTables via repairedAt and re-validates only unrepaired data, so recurring runs are cheap. Use -full after data loss, after restoring from backup, when first adopting incremental on a legacy table, or when you suspect the repaired/unrepaired boundary is inconsistent and want a clean baseline.

Why did removing read_repair_chance not hurt consistency?

Because the probabilistic knobs only reconciled a random fraction of reads in the background and never guaranteed convergence — they masked divergence rather than fixing it, while adding unpredictable latency and SSTable churn. Cassandra 4.0 dropped them in favor of deterministic blocking read repair on digest mismatch plus scheduled anti-entropy repair, which together give stronger, more predictable guarantees.

Related guides

Cassandra architecture and compaction fundamentals: the parent guide framing how the storage engine and cluster fabric fit together.
Understanding Cassandra read repair vs anti-entropy repair: a focused walkthrough of the two reconciliation paths and when each fires.
Data partitioning and token ring basics: the ranges that anti-entropy repair diffs and streams over.
Node gossip and failure detection protocols: how repair finds live endpoints and confirms schema agreement before streaming.
Tombstone management and garbage collection: the gc_grace_seconds mechanics that repair cadence must respect to avoid resurrection.
Understanding STCS vs LCS vs TWCS: how each compaction strategy absorbs repair-generated and read-repair SSTables.