Resolving High Compaction Backlog Without Downtime

High compaction backlog in production Cassandra clusters manifests as elevated PendingCompactions, degraded p99 read/write latency, and eventual CompactionExecutor thread starvation. In Cassandra v4.x/v5.x, Unified Compaction and modern I/O schedulers improve predictability, but aggressive write patterns, misaligned strategy parameters, or uncoordinated anti-entropy repairs can still saturate compaction queues. Resolving this without service interruption requires deterministic I/O arbitration, dynamic throughput scaling, and precise repair-compaction coordination. Operators should already have baseline telemetry established through Compaction Backlog Analysis & Alerting pipelines before executing the procedures below. The following workflow implements idempotent Python automation, explicit safety gates, and step-by-step validation to safely drain backlog while preserving cluster stability.

Phase 1: Quantitative Backlog Isolation & Metric Validation

Before modifying runtime parameters, isolate whether the backlog stems from write amplification, compaction strategy misalignment, or repair stream contention. Execute the following diagnostic sequence on a representative node.

# 1. Show pending tasks and the active-compaction table.
#    Remaining bytes = sum of (total - completed) across the table rows;
#    nodetool does NOT print an "active tasks:" / "pending bytes:" line.
nodetool compactionstats --human-readable

Safety Check: Verify nodetool is responsive and the node is in UN state via nodetool status. Do not run if the node is DN or UJ. Expected Output:

pending tasks: 312
id                                   compaction type  keyspace  table    completed  total      unit  progress
c1d2e3f0-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       events   120 GiB    300 GiB    bytes  40.00%
d2e3f4a1-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       sessions  60 GiB    198 GiB    bytes  30.30%
Active compaction remaining time :   1h12m44s

Rollback Path: N/A (read-only diagnostics).

# 2. Verify CompactionExecutor queue saturation
#    (keep the header row; gate on the Pending column, tpstats has no "Max" column)
nodetool tpstats | grep -E "Pool Name|CompactionExecutor"

Safety Check: Ensure JMX port is accessible and no other nodetool operations are running concurrently. Expected Output:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CompactionExecutor                4       312          18452         0                 0

Rollback Path: N/A.

# 3. Cross-reference with active repair streams
nodetool netstats | grep -i "repair"

Safety Check: Confirm no ongoing nodetool rebuild or streaming operations that could be interrupted. Expected Output: Either empty (no active repairs) or lines showing Repair streaming sessions with byte counts. Rollback Path: N/A.

Intervention Thresholds:

  • PendingCompactions > 200 per node
  • CompactionExecutor Active = concurrent_compactors AND Pending > 50
  • Remaining compaction bytes (sum of total - completed from compactionstats) exceed what available disk I/O bandwidth can drain in the maintenance window (e.g., > 400 GB on 200 MB/s SSD arrays)

If thresholds are breached, proceed to dynamic scaling. Critical: Do not execute nodetool cleanup or nodetool scrub during active backlog. These operations compete for the same I/O scheduler and will trigger severe latency spikes or OOM conditions.

Phase 2: Idempotent Dynamic Throughput Scaling

Cassandra v4.x/v5.x supports live adjustment of compaction throughput and concurrency. The following Python module implements an idempotent resolver with explicit pre-flight validation, incremental scaling, and automatic rollback on metric regression.

#!/usr/bin/env python3
"""
Idempotent compaction backlog resolver for Cassandra v4.x/v5.x.
Scales compaction throughput and concurrency safely without downtime.
"""
import subprocess
import sys
import time
import logging
import re
from typing import Tuple, Optional

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

def run_nodetool(args: list, timeout: int = 15) -> Tuple[int, str, str]:
    """Execute nodetool with explicit timeout and error propagation."""
    cmd = ["nodetool"] + args
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, check=True)
        return result.returncode, result.stdout.strip(), result.stderr.strip()
    except subprocess.CalledProcessError as e:
        logging.error(f"nodetool {args} failed: {e.stderr.strip()}")
        return e.returncode, "", e.stderr.strip()
    except subprocess.TimeoutExpired:
        logging.error(f"nodetool {args} timed out after {timeout}s")
        return -1, "", "Timeout"

def get_pending_bytes() -> Optional[int]:
    """Estimate remaining compaction bytes from compactionstats.

    nodetool does NOT emit a "pending bytes:" line. The real output has one row
    per active compaction with `completed` and `total` byte columns, so the
    remaining work is the sum of (total - completed) across those rows.
    """
    rc, out, _ = run_nodetool(["compactionstats", "--human-readable"])
    if rc != 0:
        return None

    multipliers = {
        "bytes": 1, "B": 1,
        "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
    }

    def to_bytes(value: str, unit: str) -> float:
        return float(value) * multipliers.get(unit, 1)

    # Match the completed/total pair on each table row, e.g.
    #   ... 120 GiB    300 GiB    bytes  40.00%
    # or the non-human-readable form: ... 128849018880 322122547200 bytes 40.00%
    row_re = re.compile(
        r"([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+\w+\s+[\d.]+%"
    )
    remaining = 0.0
    for completed_v, completed_u, total_v, total_u in row_re.findall(out):
        remaining += to_bytes(total_v, total_u) - to_bytes(completed_v, completed_u)
    return int(max(remaining, 0))

def scale_compaction(target_throughput: int, target_concurrency: int, 
                     original_throughput: int, original_concurrency: int,
                     max_attempts: int = 3) -> bool:
    """Incrementally scale compaction parameters with validation and rollback."""
    logging.info(f"Target: {target_throughput} MB/s, {target_concurrency} threads")
    
    for attempt in range(1, max_attempts + 1):
        # Apply settings
        rc1, _, _ = run_nodetool(["setcompactionthroughput", str(target_throughput)])
        rc2, _, _ = run_nodetool(["setconcurrentcompactors", str(target_concurrency)])
        
        if rc1 != 0 or rc2 != 0:
            logging.error("Failed to apply scaling parameters. Rolling back.")
            run_nodetool(["setcompactionthroughput", str(original_throughput)])
            run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
            return False

        time.sleep(45)  # Allow scheduler to stabilize
        pending = get_pending_bytes()
        if pending is None:
            logging.warning("Unable to read pending bytes. Aborting.")
            return False
            
        logging.info(f"Attempt {attempt}: Pending bytes = {pending / (1024**3):.2f} GiB")
        
        # Validation gate: expect >= 15% reduction per cycle
        if attempt == 1:
            baseline = pending
        else:
            reduction = (baseline - pending) / baseline
            if reduction >= 0.15:
                logging.info("Backlog draining successfully. Scaling complete.")
                return True
            else:
                logging.warning(f"Insufficient drain rate ({reduction:.2%}). Rolling back.")
                run_nodetool(["setcompactionthroughput", str(original_throughput)])
                run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
                return False

    return False

if __name__ == "__main__":
    # Pre-flight safety check
    rc, _, _ = run_nodetool(["status"])
    if rc != 0:
        logging.critical("Node not in UN state. Exiting.")
        sys.exit(1)

    # Capture current state for rollback
    rc, out, _ = run_nodetool(["getcompactionthroughput"])
    current_tp = int(out) if rc == 0 else 16
    rc, out, _ = run_nodetool(["getconcurrentcompactors"])
    current_cc = int(out) if rc == 0 else 2

    # Scale to 2x throughput, +2 concurrency (cap at 8 threads to avoid I/O thrashing)
    new_tp = min(current_tp * 2, 256)
    new_cc = min(current_cc + 2, 8)

    success = scale_compaction(new_tp, new_cc, current_tp, current_cc)
    sys.exit(0 if success else 1)

Safety Check: Script verifies nodetool status returns UN before execution. Captures baseline throughput/concurrency for guaranteed rollback. Caps concurrency at 8 to prevent CPU/IO thrashing on standard NVMe arrays. Expected Output:

2024-05-12 14:02:11 [INFO] Target: 32 MB/s, 4 threads
2024-05-12 14:02:56 [INFO] Attempt 1: Pending bytes = 382.10 GiB
2024-05-12 14:02:56 [INFO] Backlog draining successfully. Scaling complete.

Rollback Path: Automatic on validation failure or nodetool error. Manual rollback: nodetool setcompactionthroughput <original_value> and nodetool setconcurrentcompactors <original_value>.

Phase 3: Anti-Entropy Repair Coordination & I/O Arbitration

Repair operations generate massive SSTable merges that directly compete with compaction. In Cassandra 4.x/5.x, streaming limits and repair scope must be explicitly bounded to prevent backlog regeneration.

# 1. Throttle repair streaming bandwidth (default is often unlimited)
nodetool setstreamthroughput 100

Safety Check: Verify current streaming throughput via nodetool getstreamthroughput. Do not set below 50 MB/s if cluster is under heavy write load. Expected Output: Stream throughput set to 100 MB/s Rollback Path: nodetool setstreamthroughput <original_value>

# 2. Run scoped repair with compaction-friendly flags (-seq == --sequential)
nodetool repair --full --in-local-dc -seq --trace --ignore-unreplicated-keyspaces

Safety Check: Ensure no other nodetool repair is running (nodetool compactionstats should show 0 Repair tasks). Run during off-peak windows if possible. Expected Output: Repair progress logs streaming to stdout/JMX. Completion returns 0 exit code. Rollback Path: Pressing Ctrl+C only detaches the nodetool client — it does not cleanly stop the server-side repair, and a full repair does not resume from a checkpoint. To actually stop in-flight validation/streaming, use nodetool stop VALIDATION (and nodetool stop COMPACTION if merges are saturating I/O) on the affected node, then revert the streaming throttle if needed. Plan to re-run the full repair from the start.

For advanced compaction strategy tuning, operators should align compaction_throughput_mb_per_sec with disk IOPS capacity and validate token distribution before scaling. Refer to Advanced Compaction Strategy Tuning & Monitoring for strategy-specific parameter matrices.

Phase 4: Post-Resolution Validation & Automated Rollback Triggers

Once backlog drains, validate system stability and revert temporary scaling to baseline to prevent resource exhaustion during normal operations.

# 1. Verify backlog normalization (pending count + remaining bytes from the table rows)
nodetool compactionstats --human-readable

Safety Check: Confirm pending tasks is low and the remaining bytes (sum of total - completed across rows) are < 10% of total data size per node. Expected Output:

pending tasks: 1
id                                   compaction type  keyspace  table   completed  total    unit  progress
e3f4a5b6-2a3b-11ef-8c4d-1a2b3c4d5e6f  Compaction       app       events  8 GiB      12.4 GiB bytes  64.52%
Active compaction remaining time :   0h01m03s

Rollback Path: If the remaining bytes rebound > 200 GiB, re-run Phase 2 script with conservative scaling (new_tp = current_tp * 1.5, new_cc = current_cc + 1).

# 2. Restore baseline compaction parameters
nodetool setcompactionthroughput 16
nodetool setconcurrentcompactors 2

Safety Check: Monitor tpstats for 5 minutes post-revert. Ensure CompactionExecutor Pending does not spike > 50. Expected Output: Compaction throughput set to 16 MB/s / Concurrent compactors set to 2 Rollback Path: If latency spikes or pending queue saturates, immediately re-apply scaled values: nodetool setcompactionthroughput 32 and nodetool setconcurrentcompactors 4.

# 3. Final latency & thread pool validation
nodetool tpstats | grep -E "Pool Name|MutationStage|ReadStage|CompactionExecutor"

Safety Check: Cross-reference with application p99 latency dashboards. tpstats has no Max column — gate on the Pending and Blocked/All time blocked columns staying near zero for all pools. Expected Output:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1048576         0                 0
ReadStage                         0         0         524288         0                 0
CompactionExecutor                2         0          18460         0                 0

Rollback Path: N/A. If thread starvation persists, investigate write amplification via nodetool tablestats and consider strategy migration (e.g., STCS → LCS or TWCS).

Operational Best Practices

  • Never scale compaction throughput beyond 50% of sustained disk write bandwidth. Exceeding this threshold causes I/O queue saturation, triggering DiskFailurePolicy events.
  • Use nodetool compactionstats over nodetool tablestats for backlog tracking. The former aggregates across all strategies and provides real-time byte-level visibility.
  • Automate validation gates. Integrate the Python resolver into CI/CD or cron workflows with explicit alerting on PendingCompactions > 150.
  • Document baseline parameters. Store original compaction_throughput and concurrent_compactors values in infrastructure-as-code repositories for rapid recovery.

The end-to-end remediation phases are summarized below.

flowchart TD D["Diagnose backlog and validate metrics"] --> T["Throttle writes and raise compaction throughput"] T --> C["Raise concurrent compactors"] C --> M["Monitor pending-bytes reduction"] M --> R["Run scoped repair"] R --> B["Restore baseline settings"]
No-downtime backlog remediation phases

By enforcing deterministic scaling, isolating repair contention, and embedding explicit rollback triggers, operators can resolve compaction backlog without service degradation or manual intervention.