Resolving High Compaction Backlog Without Downtime in Cassandra 4.x/5.x

High compaction backlog in production Apache Cassandra clusters manifests as elevated PendingCompactions, degraded p99 read/write latency, and eventual CompactionExecutor thread starvation. Use this runbook when a live node is already behind (pending tasks climbing while BytesCompacted flattens) and you must recover it without draining traffic or restarting the process. It assumes Cassandra 4.0, 4.1, or 5.0, a nodetool/JMX path to every target node, and Python 3.10+ for the automation.

This runbook sits under Compaction Backlog Analysis & Alerting; establish the velocity-aware telemetry described there before running any procedure below, because every safety gate here reads the same counters.

In Cassandra 5.0, UnifiedCompactionStrategy and the modern I/O scheduler improve predictability, but aggressive write bursts, a compaction strategy that does not match the workload (see the trade-offs between STCS, LCS, and TWCS), or an uncoordinated anti-entropy repair can still saturate the compaction queue. Resolving that without downtime requires deterministic I/O arbitration, dynamic throughput scaling, and precise repair-compaction coordination, implemented below as idempotent automation with explicit safety gates and step-by-step validation.

Pre-conditions & safety gates

Before modifying any runtime parameter, isolate whether the backlog stems from write amplification, compaction-strategy misalignment, or repair-stream contention. These checks are read-only; run the full sequence on a representative node and stop if any gate fails.

# 1. Show pending tasks and the active-compaction table.
#    Remaining bytes = sum of (total - completed) across the table rows;
#    nodetool does NOT print an "active tasks:" / "pending bytes:" line.
nodetool compactionstats --human-readable

Safety Check: Verify nodetool is responsive and the node is in UN state via nodetool status. Do not run if the node is DN or UJ. Expected Output:

pending tasks: 312
id                                   compaction type  keyspace  table    completed  total      unit  progress
c1d2e3f0-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       events   120 GiB    300 GiB    bytes  40.00%
d2e3f4a1-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       sessions  60 GiB    198 GiB    bytes  30.30%
Active compaction remaining time :   1h12m44s

Rollback Path: N/A (read-only diagnostics). The full column-by-column reading of this output is covered in interpreting nodetool compactionstats output.

# 2. Verify CompactionExecutor queue saturation.
#    Gate on the Pending column (tpstats has no "Max" column).
nodetool tpstats | grep -E "Pool Name|CompactionExecutor"

Safety Check: Ensure the JMX port is accessible and no other nodetool operation is running concurrently. Expected Output:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CompactionExecutor                4       312          18452         0                 0

Rollback Path: N/A.

# 3. Cross-reference with active repair streams.
nodetool netstats | grep -i "repair"

Safety Check: Confirm no ongoing nodetool rebuild or streaming operation that could be interrupted. Expected Output: Either empty (no active repairs) or lines showing Repair streaming sessions with byte counts. Rollback Path: N/A.

Intervention thresholds — proceed to remediation only when at least one is breached:

PendingCompactions > 200 per node.
CompactionExecutor Active = concurrent_compactors AND Pending > 50.
Remaining compaction bytes (sum of total - completed from compactionstats) exceed what available disk I/O bandwidth can drain in the maintenance window (e.g. > 400 GB on a 200 MB/s SSD array).
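
The three thresholds can be evaluated mechanically once the counters are collected. The sketch below is one way to do that; NodeSnapshot, breached_thresholds, and the 30-minute window are illustrative names and values, not part of Cassandra or nodetool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSnapshot:
    pending_compactions: int    # "pending tasks" from compactionstats
    executor_active: int        # CompactionExecutor Active from tpstats
    executor_pending: int       # CompactionExecutor Pending from tpstats
    concurrent_compactors: int  # value reported by getconcurrentcompactors
    remaining_bytes: int        # sum of (total - completed) across active rows


def breached_thresholds(snap: NodeSnapshot, disk_bytes_per_s: float,
                        window_s: float) -> list[str]:
    """Return the name of every intervention threshold the snapshot breaches."""
    breached = []
    if snap.pending_compactions > 200:
        breached.append("pending_compactions")
    if (snap.executor_active == snap.concurrent_compactors
            and snap.executor_pending > 50):
        breached.append("executor_saturated")
    # Bytes the device can drain in the window, assuming it is fully available.
    if snap.remaining_bytes > disk_bytes_per_s * window_s:
        breached.append("remaining_bytes")
    return breached


# Figures from the sample output above: 318 GiB remaining on a 200 MB/s array.
snap = NodeSnapshot(312, 4, 312, 4, 318 * 1024**3)
print(breached_thresholds(snap, 200 * 10**6, 30 * 60))
# ['pending_compactions', 'executor_saturated']
```

With a 10-minute window the same snapshot also breaches the remaining-bytes threshold, because 200 MB/s drains only 120 GB in that time.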

If thresholds are breached, continue to dynamic scaling. Critical: do not execute nodetool cleanup or nodetool scrub during active backlog — these compete for the same I/O scheduler and will trigger severe latency spikes or OutOfMemoryError conditions.

Implementation: idempotent throughput scaling and repair arbitration

Cassandra 4.x/5.x supports live adjustment of compaction throughput and concurrency. The Python module below is an idempotent resolver with explicit pre-flight validation, incremental scaling, and automatic rollback on metric regression. It captures baseline values first so every applied change is reversible, and caps concurrency to avoid I/O thrashing.

#!/usr/bin/env python3
# Requires: Python 3.10+, nodetool on PATH, node in UN state.
"""
Idempotent compaction backlog resolver for Cassandra v4.x/v5.x.
Scales compaction throughput and concurrency safely without downtime.
"""
import subprocess
import sys
import time
import logging
import re
from typing import Optional

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)

def run_nodetool(args: list[str], timeout: int = 15) -> tuple[int, str, str]:
    """Execute nodetool with an explicit timeout and error propagation."""
    cmd = ["nodetool"] + args
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, check=True)
        return result.returncode, result.stdout.strip(), result.stderr.strip()
    except subprocess.CalledProcessError as e:
        logging.error(f"nodetool {args} failed: {e.stderr.strip()}")
        return e.returncode, "", e.stderr.strip()
    except subprocess.TimeoutExpired:
        logging.error(f"nodetool {args} timed out after {timeout}s")
        return -1, "", "Timeout"

def get_pending_bytes() -> Optional[int]:
    """Estimate remaining compaction bytes from compactionstats.

    nodetool does NOT emit a "pending bytes:" line. The real output has one row
    per active compaction with `completed` and `total` byte columns, so the
    remaining work is the sum of (total - completed) across those rows.
    """
    rc, out, _ = run_nodetool(["compactionstats", "--human-readable"])
    if rc != 0:
        return None

    multipliers = {
        "bytes": 1, "B": 1,
        "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
    }

    def to_bytes(value: str, unit: str) -> float:
        return float(value) * multipliers.get(unit, 1)

    # Match the completed/total pair on each table row, e.g.
    #   ... 120 GiB    300 GiB    bytes  40.00%
    # Only the --human-readable form is handled: each value must carry its own unit.
    row_re = re.compile(
        r"([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+\w+\s+[\d.]+%"
    )
    remaining = 0.0
    for completed_v, completed_u, total_v, total_u in row_re.findall(out):
        remaining += to_bytes(total_v, total_u) - to_bytes(completed_v, completed_u)
    return int(max(remaining, 0))

def scale_compaction(target_throughput: int, target_concurrency: int,
                     original_throughput: int, original_concurrency: int,
                     max_attempts: int = 3) -> bool:
    """Incrementally scale compaction parameters with validation and rollback."""
    logging.info(f"Target: {target_throughput} MB/s, {target_concurrency} threads")
    baseline = 0

    for attempt in range(1, max_attempts + 1):
        # Apply settings.
        rc1, _, _ = run_nodetool(["setcompactionthroughput", str(target_throughput)])
        rc2, _, _ = run_nodetool(["setconcurrentcompactors", str(target_concurrency)])

        if rc1 != 0 or rc2 != 0:
            logging.error("Failed to apply scaling parameters. Rolling back.")
            run_nodetool(["setcompactionthroughput", str(original_throughput)])
            run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
            return False

        time.sleep(45)  # Allow the scheduler to stabilize.
        pending = get_pending_bytes()
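        if pending is None:
            # Settings were already applied above, so revert before giving up.
            logging.warning("Unable to read pending bytes. Rolling back.")
            run_nodetool(["setcompactionthroughput", str(original_throughput)])
            run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
            return False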

        logging.info(f"Attempt {attempt}: Pending bytes = {pending / (1024**3):.2f} GiB")

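        # Validation gate: expect >= 15% reduction relative to the first sample.
        if attempt == 1:
            baseline = pending
            if baseline == 0:
                # No active compaction bytes: nothing to drain and no rate to measure.
                logging.info("No remaining compaction bytes. Restoring original settings.")
                run_nodetool(["setcompactionthroughput", str(original_throughput)])
                run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
                return True
        else:
            reduction = (baseline - pending) / baseline
            if reduction >= 0.15:
                logging.info("Backlog draining successfully. Scaling complete.")
                return True
            logging.warning(f"Insufficient drain rate ({reduction:.2%}). Rolling back.")
            run_nodetool(["setcompactionthroughput", str(original_throughput)])
            run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
            return False

    # Reached only when max_attempts < 2: no second sample was taken, so revert.
    run_nodetool(["setcompactionthroughput", str(original_throughput)])
    run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
    return False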

if __name__ == "__main__":
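    # Pre-flight safety check: this only proves nodetool can reach the node over JMX.
    # Confirm the UN state yourself in the `nodetool status` output before running.
    rc, _, _ = run_nodetool(["status"])
    if rc != 0:
        logging.critical("nodetool status failed; node unreachable. Exiting.")
        sys.exit(1)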

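    # Capture current state for rollback. Refuse to run if it cannot be read,
    # because a guessed baseline would make the rollback restore the wrong value.
    rc, out, _ = run_nodetool(["getcompactionthroughput"])
    # Output: "Current compaction throughput: 64 MB/s"
    match = re.search(r"(\d+)", out) if rc == 0 else None
    rc, out, _ = run_nodetool(["getconcurrentcompactors"])
    # The count may follow a label line, so take the trailing integer.
    cc_match = re.search(r"(\d+)\s*$", out) if rc == 0 else None
    if match is None or cc_match is None:
        logging.critical("Unable to read baseline compaction settings. Exiting.")
        sys.exit(1)
    current_tp = int(match.group(1))
    current_cc = int(cc_match.group(1))

    # Scale to 2x throughput, +2 concurrency (cap at 8 threads to avoid I/O thrashing).
    # A throughput of 0 means unthrottled, so doubling it leaves it unthrottled.
    new_tp = min(current_tp * 2, 256)
    # Never lower a concurrency setting that is already above the cap.
    new_cc = max(current_cc, min(current_cc + 2, 8))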

    success = scale_compaction(new_tp, new_cc, current_tp, current_cc)
    sys.exit(0 if success else 1)

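Safety Check: The script confirms that nodetool status succeeds before execution, which proves JMX reachability only; confirm the UN state in that output yourself, since the exit code alone does not show it. The script captures baseline throughput/concurrency for rollback and caps concurrency at 8 to prevent CPU/IO thrashing on standard NVMe arrays. The 15% drain gate compares two samples taken 45 seconds apart and auto-reverts if the extra I/O is not actually reducing the queue; on very large backlogs, lengthen the sleep interval so a healthy drain has time to reach that threshold. For the reasoning behind the throughput ceiling itself, see how to tune compaction_throughput_mb_per_sec safely. Expected Output (values are illustrative):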

2026-05-12 14:02:11 [INFO] Target: 32 MB/s, 4 threads
2026-05-12 14:02:56 [INFO] Attempt 1: Pending bytes = 382.10 GiB
2026-05-12 14:03:42 [INFO] Attempt 2: Pending bytes = 318.44 GiB
2026-05-12 14:03:42 [INFO] Backlog draining successfully. Scaling complete.

Rollback Path: Automatic on validation failure or nodetool error. Manual rollback: nodetool setcompactionthroughput <original_value> and nodetool setconcurrentcompactors <original_value>.

Repair operations generate large SSTable merges that compete directly with compaction, so once throughput is scaled you must bound streaming before letting any repair run, or the backlog simply regenerates.

# Throttle repair streaming bandwidth (value in megabits/s) before repairing.
nodetool setstreamthroughput 100

Safety Check: Record the current value via nodetool getstreamthroughput first. Do not set below 50 megabits/s if the deployment is under heavy write load. Expected Output: The command succeeds silently; confirm with nodetool getstreamthroughput. Rollback Path: nodetool setstreamthroughput <recorded_value>. A value of 0 disables throttling entirely; it is not the shipped default (200 megabits/s in 4.0, 24 MiB/s in 4.1 and later).
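
Because setstreamthroughput takes megabits per second while disk and compaction figures are usually quoted in megabytes per second, it helps to convert before comparing them. This is a small sketch that assumes decimal units (1 megabit = 10^6 bits, 1 megabyte = 10^6 bytes); the function name is illustrative.

```python
def megabits_to_megabytes_per_s(megabits_per_s: float) -> float:
    """Convert a stream throttle in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_s / 8


print(megabits_to_megabytes_per_s(100))  # 12.5
print(megabits_to_megabytes_per_s(50))   # 6.25
```

Under that assumption, the 100 megabits/s throttle above leaves repair streaming roughly 12.5 MB/s, a small share of a 200 MB/s array.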

# Run a scoped repair with compaction-friendly flags (-seq == --sequential).
nodetool repair --full --in-local-dc -seq --trace --ignore-unreplicated-keyspaces

Safety Check: Ensure no other nodetool repair is running (nodetool compactionstats should show 0 Validation tasks). Run during an off-peak window where possible. Expected Output: Repair progress logs stream to stdout/JMX; completion returns exit code 0. Rollback Path: Pressing Ctrl+C only detaches the nodetool client — it does not cleanly stop the server-side repair, and a full repair does not resume from a checkpoint. To actually stop in-flight work, run nodetool stop VALIDATION (and nodetool stop COMPACTION if merges are saturating I/O) on the affected node, then revert the streaming throttle if needed and plan to re-run the full repair from the start. Aligning compaction_throughput with disk IOPS capacity and token distribution is covered in the parent Advanced Compaction Strategy Tuning & Monitoring guide.
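
The "0 Validation tasks" check can be scripted against captured compactionstats output. This is a minimal sketch that assumes the compaction type is the second column, as in the sample tables in this runbook; count_validation_tasks is a hypothetical helper and the Validation row in the sample is made up for illustration.

```python
import re


def count_validation_tasks(compactionstats_output: str) -> int:
    """Count active rows whose compaction type column reads 'Validation'."""
    # One row per line: an id, then the compaction type as the second field.
    row_re = re.compile(r"^[ \t]*\S+[ \t]+Validation[ \t]", re.MULTILINE)
    return len(row_re.findall(compactionstats_output))


sample = """\
pending tasks: 3
id                                   compaction type  keyspace  table    completed  total      unit  progress
c1d2e3f0-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       events   120 GiB    300 GiB    bytes  40.00%
f4a5b6c7-2a3b-11ef-8c4d-1a2b3c4d5e6f  Validation       app       events   2 GiB      10 GiB     bytes  20.00%
"""
print(count_validation_tasks(sample))  # 1
```

A non-zero count means a repair is already validating on this node, so starting a second one should wait.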

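End to end, the remediation runs in five phases: read-only diagnosis, incremental throughput and concurrency scaling, stream throttling ahead of repair, verification, and restoration of baseline settings.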

Verification steps

Once the queue drains, confirm stability and revert temporary scaling to baseline so the elevated settings do not exhaust resources during normal operation.

# 1. Verify backlog normalization (pending count + remaining bytes from the table rows).
nodetool compactionstats --human-readable

Safety Check: Confirm pending tasks is low and the remaining bytes (sum of total - completed across rows) are < 10% of total data size per node. Expected Output:

pending tasks: 1
id                                   compaction type  keyspace  table   completed  total    unit  progress
e3f4a5b6-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       events  8 GiB      12.4 GiB bytes  64.52%
Active compaction remaining time :   0h01m03s

Rollback Path: If remaining bytes rebound > 200 GiB, re-run the resolver with conservative scaling (new_tp = int(current_tp * 1.5), new_cc = current_cc + 1); nodetool accepts only whole numbers for both settings.
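
The conservative re-run targets and the 10% verification gate reduce to a few lines of arithmetic. This sketch uses the same caps as the resolver (256 MB/s, 8 threads); both function names are hypothetical, and the 500 GiB node size is an assumed figure.

```python
def conservative_targets(current_tp: int, current_cc: int) -> tuple[int, int]:
    """Return (throughput MB/s, compactors) for a gentler re-run: 1.5x and +1."""
    # nodetool expects whole numbers, so truncate the 1.5x product.
    new_tp = min(int(current_tp * 1.5), 256)
    new_cc = min(current_cc + 1, 8)
    return new_tp, new_cc


def backlog_normalized(remaining_bytes: int, node_data_bytes: int) -> bool:
    """True when remaining compaction bytes are under 10% of the node's data size."""
    return remaining_bytes < 0.10 * node_data_bytes


GIB = 1024**3
print(conservative_targets(16, 2))  # (24, 3)
# 12.4 GiB total minus 8 GiB completed, against an assumed 500 GiB of node data.
print(backlog_normalized(int(4.4 * GIB), 500 * GIB))  # True
```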

# 2. Restore baseline compaction parameters. Substitute the values you recorded
#    before scaling; 16 and 2 match the sample run above.
nodetool setcompactionthroughput 16
nodetool setconcurrentcompactors 2

Safety Check: Monitor tpstats for 5 minutes post-revert. Ensure CompactionExecutor Pending does not spike > 50. Rollback Path: If latency spikes or the pending queue saturates, immediately re-apply scaled values: nodetool setcompactionthroughput 32 and nodetool setconcurrentcompactors 4.
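
The five-minute watch can be automated with a polling loop. This is a sketch rather than a finished monitor: pending_stays_below is a hypothetical helper, the sampler is whatever callable returns the current CompactionExecutor Pending value, and the demo substitutes canned readings and a no-op sleep so it finishes instantly.

```python
import time
from typing import Callable


def pending_stays_below(sample_pending: Callable[[], int], limit: int = 50,
                        duration_s: float = 300.0, interval_s: float = 15.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the pending count for duration_s; fail fast on the first spike."""
    samples = max(1, int(duration_s // interval_s))
    for i in range(samples):
        if sample_pending() > limit:
            return False
        if i < samples - 1:
            sleep(interval_s)
    return True


# Demo with canned readings instead of a live node.
readings = iter([3, 0, 12, 1])
ok = pending_stays_below(lambda: next(readings), duration_s=60, interval_s=15,
                         sleep=lambda _seconds: None)
print(ok)  # True
```

A False result is the signal to re-apply the scaled values as described in the rollback path.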

# 3. Final latency & thread-pool validation.
nodetool tpstats | grep -E "Pool Name|MutationStage|ReadStage|CompactionExecutor"

Safety Check: Cross-reference application p99 latency dashboards. tpstats has no Max column — gate on the Pending, Blocked, and All time blocked columns staying near zero for all pools. Rollback Path: If thread starvation persists, investigate write amplification via nodetool tablestats and consider a strategy migration (e.g. STCS → LCS or TWCS).

Troubleshooting

WriteTimeoutException during scaling. Raising concurrent_compactors too aggressively steals CPU and disk bandwidth from the write path, so mutations time out at the coordinator. Root cause: compaction I/O oversubscribed the device. Fix: lower concurrency back toward min(cores, disks) and cap compaction throughput at 50% of sustained disk write bandwidth; never let compaction plus repair streaming exceed the device ceiling.
DiskFailurePolicy / FSWriteError events while draining. Scaling throughput beyond what the array can sustain saturates the I/O queue and Cassandra interprets stalled writes as disk failure. Root cause: throughput ceiling set above real device bandwidth. Fix: revert to baseline throughput immediately, benchmark the disk with a controlled nodetool setcompactionthroughput sweep, and keep headroom for flushes and repair. Whole-table expiry pressure that shows up as OutOfSpaceException is a related symptom — free space via archival before scaling, not by racing compaction.
Backlog rebounds within minutes of restoring baseline. The resolver reports success but pending tasks climb again as soon as scaled settings revert. Root cause: the true bottleneck is a mismatched compaction strategy or a repair that reintroduces overlapping SSTables faster than baseline throughput can merge them, not a transient burst. Fix: keep the intermediate scaled values, then address the source — re-evaluate the table’s strategy against the STCS, LCS, and TWCS trade-offs and stagger repair so validation never overlaps a compaction storm.
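
The ceilings named in the first two entries can be derived before any scaling is attempted. This is a rough sketch that assumes you already know the device's sustained write bandwidth from a benchmark; safe_ceilings and its parameters are illustrative, and the stream throttle is taken in decimal megabits per second.

```python
def safe_ceilings(cores: int, data_disks: int, disk_write_mb_s: float,
                  stream_megabits_s: float) -> dict[str, float]:
    """Derive conservative compaction limits from hardware figures."""
    concurrency = max(1, min(cores, data_disks))
    stream_mb_s = stream_megabits_s / 8
    # Cap compaction at 50% of sustained write bandwidth, and never let
    # compaction plus repair streaming exceed the device ceiling.
    throughput = min(0.5 * disk_write_mb_s, disk_write_mb_s - stream_mb_s)
    # 0.0 here means "no headroom". Do not pass it to nodetool, where a
    # compaction throughput of 0 disables throttling.
    return {
        "concurrent_compactors": concurrency,
        "compaction_throughput_mb_s": max(throughput, 0.0),
        "stream_mb_s": stream_mb_s,
    }


limits = safe_ceilings(cores=16, data_disks=4, disk_write_mb_s=200,
                       stream_megabits_s=100)
print(limits)
# {'concurrent_compactors': 4, 'compaction_throughput_mb_s': 100.0, 'stream_mb_s': 12.5}
```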

Related

Compaction Backlog Analysis & Alerting: the parent guide that quantifies backlog as a velocity signal and sets the thresholds this runbook gates on.
Interpreting nodetool compactionstats output: column-by-column reading of the diagnostic output used in every gate here.
How to tune compaction_throughput_mb_per_sec safely: choosing the throughput ceiling the resolver scales toward.
Python Monitoring for Cassandra Compaction: the driver-based scheduler that gates repair on compaction pressure so backlog does not regenerate.
