Compaction Error Categorization & Logging in Cassandra: Operational Guide

Compaction failures in Apache Cassandra rarely announce themselves with a single obvious stack trace. More often they surface as latent I/O stalls, tombstone accumulation, or silent SSTable corruption that propagates into repair backlogs and degraded read paths. Distributed systems engineers and DBAs managing v4.x/v5.x clusters therefore need a deterministic error categorization framework. Structured logging, paired with automated remediation workflows, forms the foundation of safe node lifecycle management. This operational guide maps compaction failure signatures to actionable routing logic and complements the broader Advanced Compaction Strategy Tuning & Monitoring practices.

Operational Error Taxonomy & Log Signatures

Cassandra’s compaction subsystem writes telemetry to system.log and debug.log. UnifiedCompactionStrategy, available from v5.0, introduces more granular state transitions and dynamic I/O budgeting, but in both v4.x and v5.x the error patterns fall into four primary categories:

  • I/O & Resource Exhaustion: java.io.IOException: No space left on device, DiskFullException, or thread pool saturation in CompactionExecutor. These indicate storage capacity breaches, filesystem quota limits, or I/O scheduler misalignment.
  • SSTable Integrity Failures: CorruptSSTableException, ChecksumMismatchException, or IndexOutOfBoundsException during read-ahead. Often stem from interrupted compaction cycles, underlying NVMe/SSD degradation, or unclean node shutdowns.
  • Tombstone & GC Pressure: Tombstone threshold exceeded warnings, OutOfMemoryError: Java heap space, or prolonged GC pauses reported by GCInspector. These correlate directly with read amplification and can trigger speculative retry storms if left unaddressed.
  • Strategy & Configuration Drift: InvalidRequestException raised at DDL time when an ALTER TABLE ... WITH compaction = {...} statement carries invalid options (e.g. an out-of-range min_threshold/max_threshold, or a min_threshold greater than max_threshold). This is distinct from runtime throughput governance: compaction_throughput is a cassandra.yaml/nodetool setcompactionthroughput setting and is not validated by table DDL. These drifts typically follow manual ALTER TABLE operations, rolling restarts without proper validation, or mismatched cassandra.yaml deployments.
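Before any of these signatures can be matched reliably, each log line has to be split into its fields. The following is a minimal sketch that assumes the stock logback layout (level, thread in brackets, ISO 8601 timestamp, source file and line, then the message); `LogEvent`, `parse_line`, and `is_compaction_event` are illustrative names, and a customized logback.xml would need a different pattern.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "%-5level [%thread] %date{ISO8601} %F:%L - %msg".
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+:\d+) - (?P<message>.*)$"
)


class LogEvent(NamedTuple):
    level: str
    thread: str
    timestamp: str
    source: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Return a LogEvent, or None for continuation lines such as stack frames."""
    match = LINE_RE.match(line.rstrip("\n"))
    return LogEvent(**match.groupdict()) if match else None


def is_compaction_event(event: LogEvent) -> bool:
    """True when the line was emitted by a compaction worker thread."""
    return event.thread.startswith("CompactionExecutor")
```

Lines that return None are typically stack frames or exception causes and can be appended to the message of the preceding event before classification.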

The following tree classifies a compaction exception and routes it to the right remediation:

flowchart TD
    EX["Compaction exception"] --> CLS{"Classify failure"}
    CLS -->|"transient (DiskFullException)"| TR["Retry with backoff"]
    CLS -->|"structural (CorruptSSTableException)"| ST["Run nodetool scrub or verify"]
    ST --> REPL["Replace SSTable if unrecoverable"]
    CLS -->|"GC or tombstone pressure"| GC["Tune thresholds and monitor"]
Compaction error categorization and remediation
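The decision tree can be approximated in code as a lookup from exception signature to failure class, plus a capped exponential backoff schedule for the transient branch. This is a sketch under stated assumptions: the substring lists, the `UNKNOWN` fallback and its "escalate" action, and the 30-second base and 600-second cap are illustrative choices, not values taken from Cassandra.

```python
from enum import Enum
from typing import List


class FailureClass(Enum):
    TRANSIENT = "transient"
    STRUCTURAL = "structural"
    PRESSURE = "gc_or_tombstone_pressure"
    UNKNOWN = "unknown"  # assumption: anything unmatched goes to a human


# Illustrative substrings; extend them to match what your logs contain.
SIGNATURES = {
    FailureClass.TRANSIENT: ("DiskFullException", "No space left on device"),
    FailureClass.STRUCTURAL: ("CorruptSSTableException", "ChecksumMismatchException"),
    FailureClass.PRESSURE: ("tombstones", "OutOfMemoryError"),
}

REMEDIATION = {
    FailureClass.TRANSIENT: "retry with backoff",
    FailureClass.STRUCTURAL: "run nodetool scrub or verify; replace the SSTable if unrecoverable",
    FailureClass.PRESSURE: "tune thresholds and monitor",
    FailureClass.UNKNOWN: "escalate for manual review",
}


def classify_failure(message: str) -> FailureClass:
    """Return the first failure class whose signature appears in the message."""
    for failure_class, needles in SIGNATURES.items():
        if any(needle in message for needle in needles):
            return failure_class
    return FailureClass.UNKNOWN


def backoff_delays(attempts: int, base_s: float = 30.0, cap_s: float = 600.0) -> List[float]:
    """Capped exponential backoff schedule, in seconds, for the transient branch."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]
```

Keeping the remediation text in a table separate from the classifier makes it easy to review the routing policy without reading the matching logic.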

Automated Parsing & Severity Routing

Manual log inspection does not scale across multi-node deployments. Production clusters require deterministic parsing pipelines that extract compaction-specific events, classify them by severity, and trigger automated responses. A robust approach combines journalctl streaming or tail -F with Python-based regex extraction and JMX polling.

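import re

# Rules are ordered most severe first; the first matching rule wins.
# GC patterns assume the GCInspector message form "<collector> GC in <N>ms"
# used by recent releases; adjust them if your logs differ.
SEVERITY_RULES = [
    (re.compile(r"No space left on device|DiskFullException", re.IGNORECASE), "CRITICAL"),
    (re.compile(r"CorruptSSTableException|RejectedExecutionException", re.IGNORECASE), "CRITICAL"),
    (re.compile(r"Scanned over \d+ tombstones|OutOfMemoryError: Java heap space", re.IGNORECASE), "HIGH"),
    (re.compile(r"(ConcurrentMarkSweep|G1[\w ]*) GC in \d+ ?ms", re.IGNORECASE), "MEDIUM"),
    (re.compile(r"GC in \d+ ?ms", re.IGNORECASE), "LOW"),
]

def classify_compaction_error(log_line: str) -> str:
    """Return the severity of the first matching rule, or INFO if none match."""
    for pattern, severity in SEVERITY_RULES:
        if pattern.search(log_line):
            return severity
    return "INFO"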

This classification logic feeds directly into Python Monitoring for Cassandra Compaction workflows, where parsed events are enriched with JMX metrics and pushed to centralized observability stacks. The severity matrix dictates automated routing:

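  • CRITICAL: Isolate the node from client traffic via nodetool disablebinary, stop new compactions from being scheduled with nodetool disableautocompaction (nodetool stop COMPACTION halts tasks already in flight), and run a targeted nodetool verify or nodetool scrub before re-enabling.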
  • HIGH: Throttle compaction I/O, schedule incremental repair, and validate thresholds against Compaction Backlog Analysis & Alerting baselines.
  • MEDIUM: Adjust runtime parameters via nodetool setcompactionthroughput or JMX, monitor SSTable generation rates, and defer intervention unless backlog exceeds SLA.
  • LOW: Log for capacity trending and GC optimization cycles.
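One way to keep this matrix auditable is to generate the nodetool invocations as data before anything is executed. The sketch below only plans commands; `plan_actions` and the default `throughput_mib` of 16 are hypothetical, and repair scheduling for the HIGH tier is assumed to be handled by a separate repair workflow.

```python
from typing import List


def plan_actions(severity: str, throughput_mib: int = 16) -> List[List[str]]:
    """Return the nodetool invocations for a severity tier, without running them.

    throughput_mib is a hypothetical throttle value; pick one from your own baselines.
    """
    if severity == "CRITICAL":
        return [
            ["nodetool", "disablebinary"],          # stop serving client traffic
            ["nodetool", "disableautocompaction"],  # stop scheduling new compactions
            ["nodetool", "verify"],                 # check SSTables before re-enabling
        ]
    if severity in ("HIGH", "MEDIUM"):
        return [
            ["nodetool", "setcompactionthroughput", str(throughput_mib)],
            ["nodetool", "compactionstats"],        # confirm the backlog trend afterwards
        ]
    return []  # LOW and INFO are only logged for trending
```

Returning argument lists rather than shell strings lets a caller log the plan, require approval for the CRITICAL tier, and then pass each list to `subprocess.run` unchanged.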

Repair Scheduling & Consistency Validation

Compaction errors frequently expose underlying consistency gaps. In Cassandra v4.x/v5.x, repair scheduling must align with incremental repair to avoid anti-patterns such as overlapping full repairs or excessive streaming. Automated repair workflows should use nodetool repair -pr (incremental is the default; pass -full only when a full repair is required) with explicit parallelism controls: -seq for sequential repair, or the default parallel mode, chosen according to cluster topology and network bandwidth. Python automation can orchestrate rolling repair windows by invoking nodetool repair (see the Apache Cassandra nodetool repair documentation) and gating each window on cqlsh health checks.

Consistency validation requires a strict operational sequence:

  1. Halt compaction on the affected node to prevent SSTable overlap during streaming.
  2. Run nodetool verify -e (extended verification) to read and validate all data, not just checksums, and isolate corrupted segments. Plain nodetool verify checks only checksums.
  3. Execute nodetool repair -pr (incremental by default) while monitoring streaming throughput via nodetool netstats and nodetool compactionstats.
  4. Re-enable compaction only after pending tasks drop to zero and backlog metrics stabilize.
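The four steps above can be sketched as a small orchestration function. This is an illustration under assumptions, not a production runbook: `recovery_sequence` and `run_nodetool` are hypothetical helpers, the pending-task count is read from the "pending tasks:" line of nodetool compactionstats, and some releases may require additional flags for verify (check nodetool help verify on your version). The runner is injectable so the sequence can be exercised without a live node.

```python
import re
import subprocess
from typing import Callable

PENDING_RE = re.compile(r"^pending tasks:\s*(\d+)", re.MULTILINE)


def pending_tasks(compactionstats_output: str) -> int:
    """Extract the total from the 'pending tasks: N' line of compactionstats output."""
    match = PENDING_RE.search(compactionstats_output)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))


def run_nodetool(*args: str) -> str:
    """Run nodetool and return stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(["nodetool", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def recovery_sequence(keyspace: str, run: Callable[..., str] = run_nodetool) -> bool:
    """Run steps 1-4; return True only if compaction was re-enabled."""
    run("disableautocompaction", keyspace)   # step 1: halt compaction
    run("verify", "-e", keyspace)            # step 2: extended verification
    run("repair", "-pr", keyspace)           # step 3: incremental repair by default
    if pending_tasks(run("compactionstats")) == 0:
        run("enableautocompaction", keyspace)  # step 4: safe to resume
        return True
    return False  # caller should poll again later
```

A scheduler would typically call this once per node inside a rolling window and retry the final check on an interval when it returns False.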

This sequence prevents cascading read failures and ensures that Fallback Routing & Read Path Optimization mechanisms remain effective during recovery windows.

Throughput & Resource Governance

Misconfigured compaction throughput limits are a primary cause of silent degradation. Operators often set compaction_throughput_mb_per_sec (named compaction_throughput in cassandra.yaml from v4.1 onward) too low, causing backlog accumulation, or too high, starving read/write thread pools. On v5.x clusters running UnifiedCompactionStrategy, the limit also interacts with the strategy’s I/O budgeting and dynamic scaling. Safe adjustment requires correlating disk latency (iostat -x 1), JVM heap utilization, and pending compaction tasks before applying runtime changes. For step-by-step validation and safe tuning procedures, consult How to Tune compaction_throughput_mb_per_sec Safely.
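As a rough illustration of correlating those three signals before changing the limit, the sketch below proposes the next throughput value in small steps. Every threshold here (20 ms await, 85% heap, 50 pending tasks, the 8 to 128 MiB/s bounds) is a hypothetical placeholder to be replaced with figures from your own baselines, and `NodeSample` and `suggest_throughput` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    disk_await_ms: float       # e.g. from the await columns of `iostat -x 1`
    heap_used_ratio: float     # used heap / max heap, between 0.0 and 1.0
    pending_compactions: int   # from `nodetool compactionstats`


def suggest_throughput(current_mib: int, sample: NodeSample,
                       step_mib: int = 8, floor_mib: int = 8, ceiling_mib: int = 128,
                       await_limit_ms: float = 20.0, heap_limit: float = 0.85,
                       backlog_limit: int = 50) -> int:
    """Propose the next compaction throughput in MiB/s (all limits are assumptions)."""
    saturated = (sample.disk_await_ms > await_limit_ms
                 or sample.heap_used_ratio > heap_limit)
    if saturated:
        # Disk or heap is under pressure: back off even if a backlog exists.
        return max(floor_mib, current_mib - step_mib)
    if sample.pending_compactions > backlog_limit:
        # Headroom is available and a backlog is building: raise the limit.
        return min(ceiling_mib, current_mib + step_mib)
    return current_mib
```

Saturation is checked first on purpose: raising the limit on a node whose disks are already slow tends to hurt foreground reads and writes more than it helps the backlog. The returned value would then be applied with nodetool setcompactionthroughput.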

Cross-System Integration & Telemetry

Compaction error categorization does not operate in isolation; it must integrate with broader cluster telemetry pipelines. Async Compaction Tracking & Metrics provides the foundation for non-blocking metric collection, exposing the JMX MBeans registered under the object name org.apache.cassandra.metrics:type=Compaction (for example name=PendingTasks, name=CompletedTasks, name=BytesCompacted) via the Prometheus JMX Exporter. Note that exported names are flattened to dotted forms such as org.apache.cassandra.metrics.Compaction.PendingTasks; these follow Graphite/exporter conventions and are not JMX object names. When tombstone pressure alerts fire, operators must evaluate Speculative Retry & Read Repair Tuning to prevent retry amplification during degraded compaction windows.
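The difference between the two naming forms is easy to get wrong in dashboards and alert rules, so a short sketch may help. The helpers below only perform the string conversion described above; `object_name` and `dotted_name` are illustrative names and do not reproduce the exporter's own rule engine.

```python
def object_name(metric: str,
                domain: str = "org.apache.cassandra.metrics",
                mtype: str = "Compaction") -> str:
    """Build the JMX object-name form, e.g. for a JMX client query."""
    return f"{domain}:type={mtype},name={metric}"


def dotted_name(obj_name: str) -> str:
    """Flatten 'domain:key=value,...' into 'domain.value.value'."""
    domain, _, props = obj_name.partition(":")
    values = [pair.split("=", 1)[1] for pair in props.split(",")]
    return ".".join([domain, *values])


for metric in ("PendingTasks", "CompletedTasks", "BytesCompacted"):
    name = object_name(metric)
    print(name, "->", dotted_name(name))
```

Use the object-name form when polling JMX directly and the flattened form when matching series in the metrics backend.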

For time-series workloads, strategy drift often manifests as compaction backlog spikes. Validating Strategy Selection for Time-Series Workloads ensures that UnifiedCompactionStrategy parameters (target_sstable_size, given as a size string such as 1GiB, along with scaling_parameters, sstable_growth, and min_sstable_size) align with ingestion rates and TTL expiration curves. Finally, capacity planning must incorporate Performance Benchmarking & Capacity Planning baselines to distinguish transient I/O stalls from systemic storage exhaustion.

Operational Readiness

Effective compaction error categorization transforms reactive firefighting into deterministic automation. By mapping log signatures to severity tiers, enforcing v4.x/v5.x repair consistency checks, and integrating telemetry pipelines, SREs can maintain cluster stability under heavy compaction loads. Structured logging, paired with automated routing and safe throughput governance, ensures that compaction remains a background optimization rather than a production bottleneck.
