Step-by-Step Guide to Switching from STCS to LCS in Cassandra 4.x and 5.x

This is the runbook for one specific, high-stakes task: migrating a single live production table from SizeTieredCompactionStrategy (STCS) to LeveledCompactionStrategy (LCS) on a Cassandra 4.x or 5.x cluster without saturating disk I/O or losing a rollback path. The switch is not a metadata toggle: once the ALTER TABLE commits, every node begins recompacting its existing SSTables for that table into a strict, non-overlapping level hierarchy, a full-table I/O event that can run for hours. This page assumes you have already decided the switch is warranted using the trade-off analysis in understanding STCS vs LCS vs TWCS, its parent guide; if you are still choosing a strategy, start there.

Prerequisites for this procedure: Cassandra 4.0+ or 5.0, Python 3.10+ for the automation scripts (they use only the standard library and shell out to cqlsh and nodetool), superuser CQL access, local cqlsh and nodetool on PATH, and a deployment where no major compaction or repair is currently running. The workflow below folds explicit safety gates, throttled execution, real-time monitoring, and a deterministic rollback into a single automation-ready sequence.

The flow below summarizes the migration stages and the rollback branch that returns the table to STCS if guardrails breach.

Pre-Conditions & Safety Gates

LCS compaction requires substantial temporary disk space and sustained I/O to stage overlapping SSTables during the rewrite. Every gate below must pass on the target node before you touch the schema; each one has an explicit abort condition and remediation path.

Disk space & I/O headroom

LCS compaction typically consumes 1.5–2.5× the current data-directory size in transient staging while it re-levels overlapping ranges. Validate available space with a deterministic check before anything else.

Safety check: Abort if free space falls below current_data_size × 1.5. Expected output: True with logged headroom metrics, or a RuntimeError naming the exact deficit. Remediation: Expand storage volumes or archive cold partitions before proceeding.

#!/usr/bin/env python3
# Requirements: Python 3.10+ (standard library only).
"""Assert the data directory has enough transient free space for an LCS re-level."""

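import logging
import shutil

# Without this, INFO-level messages are dropped by the default root logger.
logging.basicConfig(level=logging.INFO)


def validate_disk_headroom(data_dir: str, multiplier: float = 1.5) -> bool:
    # disk_usage reports the whole filesystem that holds data_dir, so this is
    # only accurate when the data directory sits on its own volume.
    usage = shutil.disk_usage(data_dir)
    current_size = usage.used
    # Gate from the safety check: free space must be >= current size x multiplier.
    required_free = int(current_size * multiplier)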
    if usage.free < required_free:
        raise RuntimeError(
            f"Insufficient free space: {usage.free / 1e9:.1f}GB available, "
            f"{required_free / 1e9:.1f}GB required for LCS transition."
        )
    logging.info(
        "Disk headroom validated: %.1fGB free, %.1fGB required.",
        usage.free / 1e9, required_free / 1e9,
    )
    return True

Compaction backlog & pending repairs

Never start an LCS migration while a major compaction or full repair is active — concurrent compaction threads saturate disk I/O and trigger read timeouts. A migration that competes with in-flight anti-entropy repair streaming is the most common way this task wedges a deployment.

Safety check: Assert zero pending compaction tasks and no active repair sessions. Expected output: pending tasks: 0 (or a small single digit); empty repair output from netstats. Remediation: Defer the migration until nodetool compactionstats and nodetool netstats both report idle.

# Verify the compaction queue is drained (Cassandra 4.x/5.x — text output, no JSON).
nodetool compactionstats | grep -i "pending tasks" | awk '{print $NF}'
# Expected: 0 (or < 5)

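# Verify no repair session is streaming. Filter out the "Read Repair Statistics"
# header, which netstats prints even on an idle node.
nodetool netstats | grep -i "repair" | grep -vi "read repair"
# Expected: empty output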
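
For automation, the two shell checks above can be folded into one pass/fail gate. The sketch below is a minimal example, assuming the plain-text layouts described in its comments. `parse_pending_tasks`, `has_repair_streams`, and `gate_is_open` are hypothetical helper names, and the default of 4 mirrors the "< 5" allowance above.

```python
import re

# Matches the summary line, e.g. "pending tasks: 12" (assumed layout).
PENDING_RE = re.compile(r"^pending tasks:\s*(\d+)", re.IGNORECASE | re.MULTILINE)


def parse_pending_tasks(compactionstats_text: str) -> int:
    """Return N from the 'pending tasks: N' summary line."""
    match = PENDING_RE.search(compactionstats_text)
    if match is None:
        raise ValueError("no 'pending tasks' line found in compactionstats output")
    return int(match.group(1))


def has_repair_streams(netstats_text: str) -> bool:
    """True if any line mentions repair, ignoring the 'Read Repair Statistics'
    header that is assumed to appear even on an idle node."""
    for line in netstats_text.splitlines():
        lowered = line.strip().lower()
        if "repair" in lowered and not lowered.startswith("read repair"):
            return True
    return False


def gate_is_open(compactionstats_text: str, netstats_text: str, max_pending: int = 4) -> bool:
    """Pass only when the compaction queue is near-empty and no repair is streaming."""
    return (
        parse_pending_tasks(compactionstats_text) <= max_pending
        and not has_repair_streams(netstats_text)
    )
```

Feed it the captured stdout of `nodetool compactionstats` and `nodetool netstats`; keeping the parsing separate from the subprocess calls makes the gate easy to test against saved output.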

Schema agreement & snapshot

Confirm every node agrees on the current schema, then capture a point-in-time snapshot. LCS migration cannot be paused; the snapshot is the only reliable rollback for a corrupted or lost SSTable set.

Safety check: A single schema UUID across all nodes in describecluster; the snapshot completes with zero errors. Expected output: one UUID under Schema versions whose address list contains every node, and a snapshot directory at data/<keyspace>/<table>-<table_id>/snapshots/pre_lcs_migration/. Snapshots are node-local, so take one on every node that holds the table. Remediation: If schemas disagree, resolve the split first (nodetool describecluster should converge to one UUID) before proceeding.

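# Schema agreement across the cluster. The UUIDs print on the lines after the
# "Schema versions:" header, so include trailing context.
nodetool describecluster | grep -A 3 "Schema versions"
# Expected: a single UUID mapping to every node

# Snapshot the target table (replace KEYSPACE and TABLE with real names).
# Positional arguments are keyspaces; the table is selected with --table.
nodetool snapshot -t pre_lcs_migration --table TABLE KEYSPACE
# Expected: "Snapshot directory: pre_lcs_migration" created successfully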
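
In an automation pipeline the schema-agreement gate is easier to enforce in code than by eye. The following is a rough sketch, assuming each version appears in describecluster output as a line of the form `<uuid>: [address, address, ...]` and that unreachable nodes are grouped under an `UNREACHABLE` key. `schema_versions` and `assert_schema_agreement` are hypothetical names.

```python
import re

# One schema-version line: "<uuid>: [ip, ip, ...]" or "UNREACHABLE: [ip]" (assumed layout).
VERSION_RE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")


def schema_versions(describecluster_text: str) -> dict[str, list[str]]:
    """Map each reported schema version to the node addresses holding it."""
    versions: dict[str, list[str]] = {}
    for line in describecluster_text.splitlines():
        match = VERSION_RE.match(line)
        if match:
            nodes = [n.strip() for n in match.group(2).split(",") if n.strip()]
            versions[match.group(1)] = nodes
    return versions


def assert_schema_agreement(describecluster_text: str) -> str:
    """Return the single agreed schema UUID, or raise if the cluster is split."""
    versions = schema_versions(describecluster_text)
    if len(versions) != 1 or "UNREACHABLE" in versions:
        raise RuntimeError(f"Schema disagreement or unreachable nodes: {versions}")
    return next(iter(versions))
```

Pass it the stdout of `nodetool describecluster`; a raised RuntimeError is the abort signal for this gate.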

Implementation

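With every gate green, run the migration itself: an idempotent, throttled schema switch followed by continuous progress monitoring. The routine below verifies the current strategy is STCS before altering, caps compaction throughput so the re-level leaves headroom for client I/O, and is safe to re-run: a second invocation on an already-leveled table is a harmless no-op. Note that nodetool setcompactionthroughput affects only the node it runs on and does not survive a restart, while the ALTER applies cluster-wide, so set the same cap on every node before issuing the schema change.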

Safety check: Verify the current compaction class is SizeTieredCompactionStrategy before altering; cap throughput to 50 MB/s to prevent I/O starvation. Expected output: the schema change commits and background compaction begins immediately; a re-run prints that the table is already on LCS and exits.

#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` and `cqlsh` on PATH.
"""Idempotently switch one table from STCS to LCS with compaction throttled first."""

import subprocess


def apply_lcs_migration(
    keyspace: str,
    table: str,
    host: str = "127.0.0.1",
    port: int = 9042,
    throttle_mbps: int = 50,
) -> None:
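    # Idempotency guard: read the live strategy and bail if it is already LCS,
    # so re-running inside an automation pipeline never re-triggers a rewrite.
    check = subprocess.run(
        ["cqlsh", host, str(port), "-e",
         f"SELECT compaction FROM system_schema.tables "
         f"WHERE keyspace_name='{keyspace}' AND table_name='{table}';"],
        capture_output=True, text=True, check=True,
    )
    if "LeveledCompactionStrategy" in check.stdout:
        print("Table already using LCS. Skipping migration.")
        return
    # Safety gate: this runbook only covers STCS sources. Refuse anything else
    # (for example TWCS, or a table that does not exist) instead of altering it.
    if "SizeTieredCompactionStrategy" not in check.stdout:
        raise RuntimeError(f"{keyspace}.{table} is not on STCS; aborting.")

    # Throttle BEFORE the ALTER so the re-level cannot starve client reads.
    # This setting is node-local; apply the same cap on every other node too.
    subprocess.run(["nodetool", "setcompactionthroughput", str(throttle_mbps)], check=True)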

    # Online schema change: the table stays readable/writable throughout.
    subprocess.run(
        ["cqlsh", host, str(port), "-e",
         f"ALTER TABLE {keyspace}.{table} "
         "WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"],
        check=True,
    )
    print("LCS schema transition initiated. Monitor compaction progress.")
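
The routine above interpolates `keyspace` and `table` directly into CQL text, so a malformed or hostile name would reach cqlsh unchecked. A minimal sketch of a guard follows, assuming unquoted identifiers only (a letter followed by letters, digits, or underscores, at most 48 characters in total). `validate_identifier` and `build_lcs_alter` are hypothetical helpers, not part of any Cassandra tooling.

```python
import re

# Unquoted keyspace/table name: starts with a letter, up to 48 characters (assumption).
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,47}")


def validate_identifier(name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Refusing to build CQL with suspicious identifier: {name!r}")
    return name


def build_lcs_alter(keyspace: str, table: str, sstable_size_in_mb: int = 160) -> str:
    """Build the ALTER TABLE statement only from validated parts."""
    ks = validate_identifier(keyspace)
    tbl = validate_identifier(table)
    return (
        f"ALTER TABLE {ks}.{tbl} WITH compaction = "
        f"{{'class': 'LeveledCompactionStrategy', "
        f"'sstable_size_in_mb': {int(sstable_size_in_mb)}}};"
    )
```

Calling `build_lcs_alter("ks", "accounts")` yields the same statement the routine sends, while a name such as `"accounts; DROP TABLE x"` raises before any subprocess is started.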

Background compaction must be polled continuously to catch stalls, a runaway backlog, or disk pressure. The monitor below parses the plain-text nodetool compactionstats output (compactionstats offers no JSON mode on 4.x/5.x; tablestats does accept -F json, but it is not needed here), reports byte-level progress, and raises the moment the pending queue crosses a critical threshold so the caller can decide whether to roll back.

Safety check: Poll every 30 s; abort if pending tasks exceed 500 or the data volume crosses 85% used. Expected output: steady progress lines, then a completion message once the queue drains to zero.

#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` on PATH.
"""Poll nodetool compactionstats until the LCS re-level drains, aborting on backlog."""

import re
import shutil
import subprocess
import time


def monitor_lcs_compaction(
    poll_interval: int = 30,
    max_pending: int = 500,
    data_dir: str = "/var/lib/cassandra/data",
    max_disk_used: float = 0.85,
) -> None:
    # First line reads "pending tasks: N"; each active-compaction row carries
    # per-task "completed" and "total" byte columns we sum for progress.
    pending_re = re.compile(r"pending tasks:\s*(\d+)", re.IGNORECASE)

    while True:
        stats = subprocess.run(
            ["nodetool", "compactionstats"],
            capture_output=True, text=True, check=True,
        )
        lines = stats.stdout.splitlines()

        pending = 0
        for line in lines:
            if (m := pending_re.search(line)):
                pending = int(m.group(1))
                break

        # Row columns: id  type  keyspace  table  completed  total  unit  progress
        # Negative indices keep this working if extra leading columns appear.
        completed = total = 0
        for line in lines:
            cols = line.split()
            if len(cols) >= 8 and cols[-4].isdigit() and cols[-3].isdigit():
                completed += int(cols[-4])
                total += int(cols[-3])

        if pending == 0 and total == 0:
            print("LCS compaction complete. Proceed to repair synchronization.")
            break

        if pending > max_pending:
            raise RuntimeError(
                f"Compaction backlog critical: {pending} pending tasks. "
                "Abort and evaluate rollback."
            )

        # Disk guardrail from the safety check: abort once the volume holding
        # data_dir crosses the used-space ceiling (85% by default).
        usage = shutil.disk_usage(data_dir)
        if usage.used / usage.total > max_disk_used:
            raise RuntimeError(
                f"Data volume {usage.used / usage.total:.0%} used. "
                "Abort and evaluate rollback."
            )

        print(f"Progress: {completed}/{total} bytes merged. Pending tasks: {pending}")
        time.sleep(poll_interval)

Once the queue drains, reconcile replicas. The re-level rewrites SSTables locally but does not repair partition-level inconsistencies, and a burst of streamed SSTables during the transition can leave replicas divergent. Run an incremental repair scoped to the primary range so you do not trigger a repair storm across the ring. Because -pr covers only the ranges this node owns as primary, run it on every node in turn to cover the whole ring.

Safety check: confirm nodetool netstats shows no active streams before starting; use -pr to bound scope. Expected output: Repair session ... completed successfully, and nodetool tablestats reflecting the new level distribution.

#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` on PATH.
"""Run a primary-range repair after the LCS switch, deferring if a repair is active."""

import logging
import subprocess


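def execute_post_migration_repair(keyspace: str, table: str) -> None:
    netstats = subprocess.run(
        ["nodetool", "netstats"], capture_output=True, text=True, check=True,
    )
    # Repair streams are expected to show up as lines starting with "Repair "
    # (assumed 4.x/5.x layout); the "Read Repair Statistics" header, printed
    # even when idle, must not count as an active repair.
    if any(line.startswith("Repair ") for line in netstats.stdout.splitlines()):
        logging.warning("Active repair detected. Deferring until idle.")
        return

    try:
        subprocess.run(["nodetool", "repair", "-pr", keyspace, table], check=True)
        logging.info("Primary-range repair completed successfully.")
    except subprocess.CalledProcessError as exc:
        # Output is not captured (it streams to the console), so exc.stderr
        # would be None here; log the exit code instead.
        logging.error("Repair failed with exit code %d.", exc.returncode)
        raise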

Verification Steps

Confirm the switch actually took hold and the re-level settled — the ALTER returns instantly, but the real work runs for minutes to hours.

Confirm the schema change. Query the strategy directly rather than trusting the ALTER return:

SELECT compaction FROM system_schema.tables
WHERE keyspace_name = 'ks' AND table_name = 'accounts';
-- Expect: {'class': '...LeveledCompactionStrategy', 'sstable_size_in_mb': '160', ...}

Confirm the table actually leveled. On Cassandra 4.x/5.x use nodetool tablestats (the older cfstats alias is deprecated); a healthy LCS table keeps almost everything above L0 with only a few L0 SSTables:
```
nodetool tablestats ks.accounts | grep -E "SSTables in each level"
# Expected e.g.: SSTables in each level: [1, 10, 42, 0, 0, 0, 0, 0, 0]
```
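
To turn that visual check into an assertion, the level line can be parsed directly. This is a minimal sketch, assuming the bracketed list format shown above and assuming an over-full level is printed as `count/limit`. `parse_levels` and `l0_is_healthy` are hypothetical names, and the L0 ceiling of 4 is an illustrative threshold for "only a few", not a value taken from Cassandra.

```python
import re

LEVELS_RE = re.compile(r"SSTables in each level:\s*\[(.*?)\]")


def parse_levels(tablestats_text: str) -> list[int]:
    """Return the per-level SSTable counts, L0 first."""
    match = LEVELS_RE.search(tablestats_text)
    if match is None:
        raise ValueError("no 'SSTables in each level' line; is the table on LCS?")
    # An over-full level may appear as "count/limit"; keep only the count.
    return [int(field.strip().split("/")[0]) for field in match.group(1).split(",")]


def l0_is_healthy(levels: list[int], max_l0: int = 4) -> bool:
    """True when L0 holds no more than max_l0 SSTables."""
    return levels[0] <= max_l0
```

Run it against the stdout of `nodetool tablestats ks.accounts`; a missing level line usually means the table is not on LCS at all.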
Confirm read amplification improved. Compare SSTables-per-read against the pre-change baseline; LCS should drop the 95th percentile toward 1–2:
```
nodetool tablehistograms ks accounts
```
Audit the merge history. The system.compaction_history table records each merge with input/output byte sizes, confirming the re-level ran and how much space it reclaimed:
```
SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history LIMIT 20;
```
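
The bytes_in and bytes_out columns can be reduced to a single reclaimed-space figure. A small sketch follows, assuming the rows have already been fetched and filtered to the migrated table as (bytes_in, bytes_out) pairs. `summarize_merges` is a hypothetical helper name.

```python
def summarize_merges(rows: list[tuple[int, int]]) -> tuple[int, float]:
    """Given (bytes_in, bytes_out) per merge, return (bytes reclaimed, out/in ratio)."""
    total_in = sum(b_in for b_in, _ in rows)
    total_out = sum(b_out for _, b_out in rows)
    # With no input bytes there is nothing to compare, so report a neutral ratio.
    ratio = total_out / total_in if total_in else 1.0
    return total_in - total_out, ratio
```

A ratio close to 1.0 means the merges mostly re-leveled data without dropping anything; a noticeably lower ratio means overwritten or expired data was purged along the way.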

Troubleshooting

OutOfSpaceException (or a stalled large compaction) mid-migration. The re-level ran the volume out of transient staging space. Detect it with nodetool compactionstats showing a wedged compaction and df -h /var/lib/cassandra/data near capacity. Root cause: the pre-flight headroom gate was skipped or the multiplier was too low for this table’s overlap. Fix: raise throughput to 0 (unlimited) only if disk permits so the current compaction finishes and frees space, otherwise add disk or archive cold partitions; if the node is already critical, roll back to STCS (below) and retry after expanding storage.

WriteTimeoutException / client read timeouts during the re-level. Compaction is out-competing client I/O on the same disks. Detect it via rising ReadLatency/WriteLatency and a growing pending tasks count. Root cause: throughput is uncapped or set too high for the underlying storage. Fix: lower the cap with nodetool setcompactionthroughput 16, and schedule the remainder of the migration inside a low-traffic window; apply the switch one table at a time so two full-table re-levels never compete for the same disks.

TombstoneOverwhelmingException on the post-migration repair. A partition carries more droppable tombstones than the query threshold allows, so repair (or a read) aborts. This ties back to tombstone management and garbage collection: the switch itself does not purge tombstones, it only re-levels them. Fix: let a tombstone-only compaction run (LCS honours tombstone_threshold), confirm your repair cadence is shorter than gc_grace_seconds so deletes do not resurrect, then re-run the -pr repair on the affected range.

Rolling back to STCS. If read latency, disk pressure, or a compaction stall makes the migration unacceptable, revert deterministically. Reverting is online but triggers a second full compaction that merges the leveled SSTables back into size tiers, so treat it as a real I/O event, not a free undo. Use the pre-migration snapshot only if SSTables are actually lost or corrupted.

# Schema rollback (idempotent): merges L0-Ln SSTables back into size tiers.
# Replace KEYSPACE and TABLE with real names throughout.
cqlsh -e "ALTER TABLE KEYSPACE.TABLE WITH compaction = {'class': 'SizeTieredCompactionStrategy'};"
nodetool compactionstats | grep "Compaction"   # Expect active merge tasks

# Snapshot restore: ONLY if data integrity is compromised.
nodetool drain
sudo systemctl stop cassandra
# On disk the table directory is named TABLE-<table id>; confirm the glob
# matches exactly one directory before continuing.
TABLE_DIR=$(echo /var/lib/cassandra/data/KEYSPACE/TABLE-*)
# Move the snapshot aside FIRST so clearing live SSTables cannot destroy it.
mv "$TABLE_DIR/snapshots/pre_lcs_migration" /var/lib/cassandra/restore_pre_lcs_migration
# Remove live SSTables but preserve the snapshots directory.
find "$TABLE_DIR" -maxdepth 1 -type f -delete
# Copy the snapshot data back, preserving ownership and permissions.
cp -a /var/lib/cassandra/restore_pre_lcs_migration/. "$TABLE_DIR/"
# The node loads the restored SSTables from disk at startup.
sudo systemctl start cassandra
# Only needed if the files were copied into an already-running node instead.
nodetool refresh KEYSPACE TABLE

Understanding STCS vs LCS vs TWCS — the parent guide comparing all three strategies and when a switch is actually warranted.
LSM tree mechanics in Cassandra — the memtable-flush-to-SSTable write path and compaction engine the re-level rewrites.
Tombstone management and garbage collection — why the switch does not purge tombstones and how gc_grace_seconds gates their removal.
Resolving high compaction backlog without downtime — recovery tactics if the LCS re-level backs up the compaction queue.
