Advanced Compaction Strategy Tuning & Monitoring

Apache Cassandra’s storage engine is anchored in Log-Structured Merge (LSM) trees, where incoming mutations are appended to the commit log, buffered in memtables, and periodically flushed to immutable Sorted String Tables (SSTables). Compaction is the deterministic background process that merges these SSTables, reclaims tombstone overhead, enforces TTL expiration, and preserves read-path predictability. This guide is written for Cassandra DBAs, distributed systems engineers, and Python automation builders operating v4.x and v5.x clusters, where compaction tuning is not a static configuration exercise but a continuous operational discipline that must balance disk I/O saturation, space reclamation velocity, and query latency SLAs. When compaction intersects with anti-entropy repair, node scaling, or schema evolution, misalignment rapidly manifests as read amplification, coordinator timeouts, or cascading node evictions. If you have not yet internalised the write path and strategy fundamentals, start with Cassandra Architecture & Compaction Fundamentals; this guide assumes that grounding and focuses on the tuning, observability, and automation layers built on top of it.

Architectural Overview: Where Compaction Sits in the Stack

Compaction is not an isolated maintenance job — it is a shared consumer of the same disk I/O, page cache, and CPU budget that serves live reads and repair streaming. Every mutation traverses the commit log and memtable before flushing to an on-disk SSTable, and from that moment the SSTable becomes a candidate for merging. The LSM tree mechanics in Cassandra govern how tiers progress, how read amplification accumulates, and how disk scheduling is bounded. A read at consistency level LOCAL_QUORUM may fan across multiple SSTables per replica; the more un-merged SSTables a partition spans, the more bloom-filter checks, index seeks, and row-merge passes the coordinator pays for. Compaction exists to keep that fan-out low without stalling ingestion.

Because compaction competes directly with the read path and with anti-entropy repair streaming for the same spindles, tuning is fundamentally a scheduling and back-pressure problem rather than a per-table setting. A compaction that runs too aggressively starves foreground queries; one that runs too conservatively lets SSTable counts balloon until reads time out. The overview below traces how a mutation becomes an SSTable and how the compaction subsystem, throughput governor, and observability plane wrap that flow.

How a mutation becomes an SSTable, and how compaction, reads, and repair share one node's I/O budget.

Core Mechanics: Strategies, Executors, and Virtual Tables

Cassandra’s compaction strategies dictate how SSTables are selected, merged, and promoted across storage tiers. SizeTieredCompactionStrategy (STCS) groups similarly sized SSTables, offering minimal write amplification but suffering severe read amplification as SSTable counts grow. LeveledCompactionStrategy (LCS) enforces strict size boundaries across levels, trading higher write amplification for predictable, low-latency reads. TimeWindowCompactionStrategy (TWCS) partitions data into fixed temporal windows, drastically reducing compaction overhead for append-only, time-ordered datasets. The trade-offs between STCS, LCS, and TWCS determine disk provisioning, repair windows, and I/O contention long before any tuning parameter is touched. Starting with v5.0, UnifiedCompactionStrategy (UCS) is available as a single strategy that can behave like a tiered or a leveled one: it shards compaction work so merges can run in parallel, adapts fan-out via the scaling_parameters option, and reduces configuration drift across heterogeneous workloads. STCS remains the default for new tables unless a different default is configured in cassandra.yaml, so adopting UCS is an explicit choice.
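To make the STCS grouping rule concrete, the following is a simplified sketch of size-tiered bucketing. It illustrates the idea and is not Cassandra's implementation: the function names are hypothetical, and the 0.5/1.5 bounds and the threshold of 4 are assumed to mirror the bucket_low, bucket_high, and min_threshold defaults.

```python
def bucket_by_size(sizes_mib, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size (simplified STCS-style rule)."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sizes_mib):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # A size joins a bucket when it lies within [low, high] x the bucket average.
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket was close enough, so start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mib, min_threshold=4):
    """Return only the buckets that hold enough SSTables to be worth merging."""
    return [b for b in bucket_by_size(sizes_mib) if len(b) >= min_threshold]
```

With sizes of 9, 10, 11, 12, 100, 110, and 500 MiB, only the four small SSTables form a mergeable bucket; the larger ones wait until enough peers of similar size accumulate, which is where the read amplification of STCS comes from.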

Every compaction task runs inside a bounded execution pool, the CompactionExecutor, sized by concurrent_compactors. Inspecting live task state is the first operational reflex. On both 4.x and 5.x, nodetool compactionstats -H reports active tasks, bytes processed, and total bytes with human-readable units:

nodetool compactionstats -H

From 4.0 onward you can query the same state without JMX through the system_views virtual tables, which is essential for automation running inside restricted networks:

SELECT keyspace_name, table_name, task_id, completion_ratio, unit
FROM system_views.sstable_tasks;

Historical outcomes live in system.compaction_history, which lets you correlate a latency regression with the exact tables that compacted during the window:

SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history
WHERE keyspace_name = 'telemetry' ALLOW FILTERING;
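The correlation step can be sketched in a few lines once the history rows are in memory. This example assumes each row is a dict keyed by the column names in the query above (with the Python driver, setting the session's row_factory to dict_factory yields that shape); the function name and the window arguments are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def compactions_in_window(rows, start: datetime, end: datetime) -> dict:
    """Sum compaction count and bytes per table for compactions that finished in [start, end]."""
    totals = defaultdict(lambda: {"count": 0, "bytes_in": 0, "bytes_out": 0})
    for row in rows:
        if start <= row["compacted_at"] <= end:
            entry = totals[row["columnfamily_name"]]
            entry["count"] += 1
            entry["bytes_in"] += row["bytes_in"]
            entry["bytes_out"] += row["bytes_out"]
    return dict(totals)
```

Tables that dominate bytes_in during the regression window are the first candidates to inspect with tablehistograms.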

Note the nodetool surface drift between versions: nodetool tablestats (and tablehistograms) is the current form, while the older cfstats/cfhistograms aliases are deprecated and should not be baked into new automation. The SSTablesPerReadHistogram from tablehistograms is the single most diagnostic read-amplification signal, telling you how many SSTables a typical read actually touched.

Strategy & Configuration Surface

Selecting and tuning a strategy requires rigorous profiling of write velocity, tombstone density, and read-to-write ratios. Temporal workloads in particular demand precise window alignment to prevent cross-window compaction storms; the strategy selection guidance for time-series workloads walks through aligning compaction windows with retention policies and query patterns. Misaligned window boundaries or an over-aggressive tombstone_compaction_interval can trigger unnecessary I/O spikes, especially when combined with high-cardinality partition keys or skewed data distribution.

The node-wide levers live in cassandra.yaml and are shared by every table on the node. Treat these as the throughput and concurrency budget for the whole CompactionExecutor:

Option	Type	Default	Recommended range
`concurrent_compactors`	int	min(disks, cores), floor 2, cap 8	2–8; keep ≤ number of physical data disks
`compaction_throughput`	rate string	`64MiB/s`	64–256 MiB/s on NVMe; 16–48 MiB/s on spinning disk
`sstable_preemptive_open_interval`	size string	`50MiB`	50 MiB default; lower to smooth read latency, raise to cut open churn
`concurrent_materialized_view_builders`	int	1	1–4 only if MVs are in use

Note the 4.0→4.1 rename: the legacy compaction_throughput_mb_per_sec (a bare integer) was superseded by the unit-suffixed compaction_throughput in 4.1+. Newer releases still accept the legacy name for backward compatibility, but 4.0 does not recognise the new one, so automation that writes cassandra.yaml must branch on version to avoid emitting an unparseable key.
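A minimal sketch of that version branch, assuming the automation already knows the node's release string (for example from nodetool version). The function name is hypothetical; only the two key names come from the text above.

```python
import re


def throughput_yaml_entry(version: str, mib_per_sec: int) -> dict:
    """Return the cassandra.yaml key/value for compaction throughput on a given release."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised Cassandra version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (4, 1):
        # 4.1+ uses the unit-suffixed form.
        return {"compaction_throughput": f"{mib_per_sec}MiB/s"}
    # 4.0 and earlier expect a bare integer under the legacy key.
    return {"compaction_throughput_mb_per_sec": mib_per_sec}
```

Parsing only the leading major.minor pair keeps the check tolerant of suffixes such as patch levels or pre-release tags.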

Per-table behaviour is set through the compaction map on the table schema. For UCS deployments, the following subproperties govern how aggressively the engine merges tiers:

Option (UCS)	Type	Typical default	Recommended range
`scaling_parameters`	string	`T4`	`T2`–`T8` (tiered) or `L10`+ (leveled-like) per read/write mix
`target_sstable_size`	size string	`1GiB`	256 MiB–2 GiB; smaller for point-lookup latency
`base_shard_count`	int	4	1–16, scaled to node count and disk parallelism
`sstable_growth`	float (0–1)	0.333	0.2–0.5; higher favours fewer, larger SSTables

UCS still respects concurrent_compactors and compaction_throughput, and because it shards compaction work, several merges can proceed in parallel within that budget. This reduces thread contention but requires careful calibration so compaction does not starve repair streaming or hint delivery. A concrete ALTER TABLE for a mixed workload moving off STCS looks like:

ALTER TABLE telemetry.readings
WITH compaction = {
  'class': 'UnifiedCompactionStrategy',
  'scaling_parameters': 'T4',
  'target_sstable_size': '512MiB'
};
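When the same change has to be rolled out across many tables, automation usually renders the statement rather than hand-writing it. The helper below is a hypothetical sketch that builds the CQL text from a dict of options; it does not validate option values against the server and assumes lowercase, unquoted identifiers.

```python
import re

_IDENT = re.compile(r"^[a-z][a-z0-9_]*$")


def render_alter_compaction(keyspace: str, table: str, options: dict) -> str:
    """Render an ALTER TABLE ... WITH compaction statement from an options dict."""
    for name in (keyspace, table):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if "class" not in options:
        raise ValueError("compaction options must include 'class'")
    # Double any single quote so a value cannot terminate its CQL string literal.
    pairs = ", ".join(
        "'{}': '{}'".format(str(k).replace("'", "''"), str(v).replace("'", "''"))
        for k, v in options.items()
    )
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{pairs}}};"
```

Rendering the statement as text makes it easy to log and review before it is executed through the driver or cqlsh.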

The strategy trade-off matrix below is the decision surface most operators return to. Measure the actual amplification for your workload rather than trusting fixed multipliers — LCS write amplification in particular is workload-sensitive.

Strategy	Write amp	Read amp	Space overhead	Best-fit workload
STCS	Low	High	High (up to ~50% during merges)	Append-heavy, low-read, tightly governed tombstones
LCS	High	Low	Low (~10%)	Point-lookup OLTP, update-in-place, read-latency SLAs
TWCS	Very low	Low within window	Low with aligned TTL	Time-series, telemetry, IoT, log data with TTL
UCS	Adaptive	Adaptive	Adaptive	Mixed/heterogeneous workloads; available from v5.0

Automation Patterns: Repair-Safe Compaction Orchestration

Manual compaction tuning does not scale across multi-datacenter, multi-tenant clusters. Production environments require idempotent automation that synchronises strategy adjustments, anti-entropy repair scheduling, and node lifecycle operations. A safe control plane wraps nodetool and the system tables behind strict validation: it verifies cluster state before acting, refuses to overlap destructive operations, and backs off exponentially on transient failure. The scaffold below captures the guardrails every compaction automation should enforce — pre-flight checks, idempotency, bounded retries, and explicit safety gates.

#!/usr/bin/env python3
# requirements: cassandra-driver>=3.29, tenacity>=8.2  (Python 3.10+)
"""Repair-safe compaction throughput governor with pre-flight guardrails."""
from __future__ import annotations

import logging
import subprocess
from dataclasses import dataclass

from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("compaction-governor")

DISK_HEADROOM_MIN = 0.25        # refuse to act below 25% free
PENDING_TASK_CEILING = 100      # abort if backlog already pathological


@dataclass(frozen=True)
class NodeState:
    pending_compactions: int
    disk_free_ratio: float
    active_repair_sessions: int


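# reraise=True surfaces the original exception after the final attempt instead of tenacity's RetryError.
@retry(stop=stop_after_attempt(4), wait=wait_exponential(multiplier=2, min=2, max=60), reraise=True)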
def run_nodetool(args: list[str], timeout: int = 900) -> str:
    """Execute a nodetool command with a hard timeout and exponential backoff."""
    result = subprocess.run(
        ["nodetool", *args],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout.strip()


def preflight(state: NodeState) -> None:
    """Fail closed: raise before any mutating action if the node is unsafe."""
    if state.disk_free_ratio < DISK_HEADROOM_MIN:
        raise RuntimeError(f"disk headroom {state.disk_free_ratio:.0%} below floor")
    if state.pending_compactions > PENDING_TASK_CEILING:
        raise RuntimeError(f"pending compactions {state.pending_compactions} above ceiling")
    if state.active_repair_sessions > 0:
        raise RuntimeError("active repair session in progress; refusing to retune throughput")


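def set_throughput(mib_per_sec: int, state: NodeState) -> None:
    """Idempotent: no-op if the target is already applied."""
    preflight(state)
    current = run_nodetool(["getcompactionthroughput"])
    # Compare whole tokens on the first line only: a substring test would
    # treat a target of 64 as already applied when the node reports 640.
    # Assumes the first line reads like "Current compaction throughput: 64 MiB/s".
    first_line = current.splitlines()[0] if current else ""
    if str(mib_per_sec) in first_line.split():
        log.info("throughput already at %s MiB/s; skipping", mib_per_sec)
        return
    run_nodetool(["setcompactionthroughput", str(mib_per_sec)])
    log.info("compaction throughput set to %s MiB/s", mib_per_sec)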

A production workflow layers safe sequencing on top of that primitive: query system.compaction_history to find tables with heavy compaction churn and use nodetool tablestats to confirm their tombstone density and SSTable counts, adjust throughput during off-peak windows, schedule incremental repair (nodetool repair -pr; incremental has been the default since 2.2 and became reliable for production use in 4.0) so streaming never overlaps a compaction storm, and only trigger nodetool cleanup after compaction has stabilised. Full implementation patterns, including driver-based metric collection and scheduling, are documented in Python monitoring for Cassandra compaction. Two rules are non-negotiable: never run nodetool scrub or nodetool upgradesstables concurrently with active compaction, and always verify disk headroom before triggering a full repair.
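The scaffold above takes a NodeState but leaves its collection open. The sketch below shows one way two of its fields might be gathered; both function names are hypothetical, and the parser assumes nodetool compactionstats prints a line of the form "pending tasks: N". Counting active repair sessions is deliberately left out because the right source for it depends on the deployment.

```python
import re
import shutil


def parse_pending_tasks(compactionstats_output: str) -> int:
    """Extract the pending-task count from nodetool compactionstats output."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE)
    if match is None:
        # Fail closed: an unreadable report must not be mistaken for an idle node.
        raise ValueError("no 'pending tasks' line found in compactionstats output")
    return int(match.group(1))


def disk_free_ratio(data_dir: str) -> float:
    """Fraction of the filesystem holding data_dir that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(data_dir)
    return usage.free / usage.total
```

Raising on an unparseable report keeps the guardrail conservative: the governor refuses to act rather than assuming a backlog of zero.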

Operational Discipline & Observability

Compaction executes asynchronously within bounded execution pools, making synchronous polling inadequate. Health tracking requires observing pending-task counts, throughput rates, disk-utilisation trends, and historical completion patterns. Prioritise PendingTasks and CompletedTasks from the org.apache.cassandra.metrics:type=Compaction MBeans, alongside per-table TombstoneScannedHistogram. High-resolution scraping (15–30s intervals) through the Prometheus JMX exporter lets you compute compaction velocity and predict backlog accumulation before it reaches the read path. When correlated with node-level iowait from node_exporter, these signals distinguish healthy background merging from a pathological compaction storm. The mechanics of capturing and interpreting these asynchronous signals are covered in depth in async compaction tracking & metrics.
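Compaction velocity can be estimated from the scraped PendingTasks series with an ordinary least-squares slope. The sketch below is one possible formulation, not a prescribed method: the function names are hypothetical, and samples are assumed to be (timestamp in seconds, pending tasks) pairs.

```python
def backlog_velocity(samples):
    """Least-squares slope of pending tasks per second over (timestamp_s, pending) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_p = sum(p for _, p in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    cov = sum((t - mean_t) * (p - mean_p) for t, p in samples)
    return cov / var_t


def seconds_until(samples, ceiling):
    """Projected seconds until pending tasks reach ceiling; None if the backlog is not growing."""
    current = samples[-1][1]
    if current >= ceiling:
        return 0.0
    slope = backlog_velocity(samples)
    if slope <= 0:
        return None
    return (ceiling - current) / slope
```

A positive slope sustained across several scrape intervals is the early signal; the projected time to the ceiling turns it into something an operator can act on.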

The pipeline below shows how compaction signals flow from collection to automated action:

Compaction observability-to-action pipeline.

Alerting should be tiered rather than a single static limit, correlating compaction velocity with write throughput and repair load:

Warning: PendingTasks > 2 × concurrent_compactors sustained for more than 10 minutes.
Critical: disk usage above 80% with elevated tombstone scan ratios, or PendingTasks growing linearly.
Emergency: disk usage above 90%, or CompactionExecutor thread starvation detected.
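The three tiers above translate directly into a small classifier. This is a sketch under stated assumptions: the CompactionSignals fields and the alert_tier name are hypothetical, and the boolean inputs (elevated tombstone ratio, linear growth, executor starvation) are presumed to be derived upstream from the metrics pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionSignals:
    pending_tasks: int
    concurrent_compactors: int
    pending_sustained_minutes: float
    disk_used_ratio: float
    tombstone_ratio_elevated: bool
    pending_growing_linearly: bool
    executor_starved: bool


def alert_tier(s: CompactionSignals) -> str:
    """Return the most severe tier whose condition holds, checking the worst case first."""
    if s.disk_used_ratio > 0.90 or s.executor_starved:
        return "emergency"
    if (s.disk_used_ratio > 0.80 and s.tombstone_ratio_elevated) or s.pending_growing_linearly:
        return "critical"
    if s.pending_tasks > 2 * s.concurrent_compactors and s.pending_sustained_minutes > 10:
        return "warning"
    return "ok"
```

Because the warning threshold is expressed as a multiple of concurrent_compactors, the same rule scales across nodes with different executor sizes.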

The methodology for computing backlog velocity and routing these tiers to PagerDuty or OpsGenie is detailed in compaction backlog analysis & alerting. A minimal Prometheus JMX exporter rule set should scrape the Compaction MBeans plus the node's storage load metric (org.apache.cassandra.metrics:type=Storage,name=Load), then expose cassandra_compaction_pendingtasks and a recording rule for its 10-minute rate so alert expressions read cleanly.

Failure Modes & Anti-Patterns

Compaction backlog / storm. When SSTable creation outpaces merge velocity, pending tasks climb, disk fills, and tombstones accumulate until writes are rejected. Detect it with nodetool compactionstats (rising pending count) and tablehistograms (climbing SSTablesPerReadHistogram). Mitigate by raising compaction_throughput with nodetool setcompactionthroughput, temporarily throttling ingestion, and — only after confirming disk headroom — letting the executor drain. Never respond by disabling autocompaction and forgetting it.

Tombstone storm. Aggressive deletes or short TTLs paired with a strategy that cannot purge markers in time trigger TombstoneOverwhelmingException and read timeouts. This is a lifecycle problem rooted in tombstone management & garbage collection: audit gc_grace_seconds against repair cadence, watch TombstoneScannedHistogram, and use nodetool garbagecollect surgically rather than dropping gc_grace_seconds to zero, which risks resurrecting deleted data.
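The gc_grace_seconds audit reduces to one comparison per table: the repair interval must fit inside the grace period, otherwise tombstones can be purged before every replica has seen them. The helper below is a hypothetical sketch; the 0.8 safety margin is an assumption chosen for illustration, not a Cassandra setting.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days


def tables_at_risk(gc_grace_by_table: dict, repair_interval_seconds: int, margin: float = 0.8) -> list:
    """List tables whose repair interval does not fit inside margin x gc_grace_seconds."""
    return sorted(
        table
        for table, gc_grace in gc_grace_by_table.items()
        if repair_interval_seconds > gc_grace * margin
    )
```

A table with gc_grace_seconds set to zero is always reported, which matches the warning above about resurrecting deleted data.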

Streaming vs compaction contention. Anti-entropy repair streaming competes for the same disk I/O pool as compaction. Running a full repair during a compaction storm doubles I/O pressure and can evict nodes. Stagger repair across racks, and gate repair on a pending-compaction ceiling so the two never peak together.
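Staggering and gating can be combined in a simple scheduler. The sketch below assumes a mapping of rack name to node list and a per-node pending-compaction count; all names are hypothetical, and a node with no reported count is treated as unsafe.

```python
def repair_order(nodes_by_rack: dict[str, list[str]]) -> list[str]:
    """Flatten nodes rack by rack so repairs finish one rack before starting the next."""
    return [node for rack in sorted(nodes_by_rack) for node in nodes_by_rack[rack]]


def next_repairable(order: list[str], done: set[str], pending_by_node: dict[str, int], ceiling: int):
    """Return the next node to repair, or None if the head of the queue is over the ceiling."""
    for node in order:
        if node in done:
            continue
        # Only the head of the queue is considered: skipping ahead would break rack staggering.
        if pending_by_node.get(node, ceiling + 1) <= ceiling:
            return node
        return None
    return None
```

Returning None when the next node is busy makes the scheduler wait for the backlog to drain instead of moving on to another rack, so repair and compaction never peak together on overlapping replicas.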

Corrupt SSTables and disk exhaustion. Structural failures surface as CorruptSSTableException, transient ones as DiskFullException, logged under org.apache.cassandra.db.compaction.CompactionTask and org.apache.cassandra.io.sstable.format. Categorising these signatures and routing them to the right runbook — nodetool verify, nodetool scrub, or targeted decommission — is the subject of compaction error categorization & logging. During heavy compaction windows, reads that traverse many SSTable levels amplify latency; fallback routing & read path optimization keeps coordinators from routing to nodes under I/O saturation by tuning speculative_retry and the read_repair table option. Note that read_repair_chance and dclocal_read_repair_chance were both removed in 4.0; read_repair defaults to 'BLOCKING' and may be set to 'NONE'.

Anti-pattern: fixed multipliers and static thresholds. Assuming a constant write-amplification factor per strategy, or hard-coding a single pending-task alert level, guarantees false pages on some tables and silent backlog on others. Compaction requires 20–30% free disk to merge safely; capacity planning must measure the real amplification factor per workload with cassandra-stress and iostat before production rollout, and alert thresholds must scale with concurrent_compactors.

Cassandra Architecture & Compaction Fundamentals — the storage-engine grounding this tuning guide builds on.
Strategy Selection for Time-Series Workloads — aligning TWCS/UCS windows with retention and query patterns.
Async Compaction Tracking & Metrics — capturing and interpreting asynchronous compaction signals.
Compaction Backlog Analysis & Alerting — quantifying backlog velocity and tiered alert routing.
Fallback Routing & Read Path Optimization — keeping reads off I/O-saturated replicas during heavy compaction.
Python Monitoring for Cassandra Compaction — building the driver-based automation control plane.
Compaction Error Categorization & Logging — mapping failure signatures to remediation runbooks.
