Tombstone Management & Garbage Collection in Cassandra 4.x and 5.x

Tombstones are Cassandra’s immutable markers for deleted rows, columns, and TTL-expired data. Because the storage engine is append-only, a DELETE never mutates an existing record; it appends a timestamped marker that must later be reconciled across replicas and then purged. Left unmanaged, tombstone accumulation triggers TombstoneOverwhelmingException, inflates read latency, exhausts disk, and stalls compaction until I/O queues saturate. This page is for DBAs and distributed-systems engineers who need to reason precisely about when a marker becomes eligible for removal — and who need a runnable procedure to reclaim that space without resurrecting deleted data. It sits beneath the broader Cassandra architecture and compaction fundamentals guide; reach for it when you are tuning gc_grace_seconds, debugging a tombstone-heavy read path, or scheduling the repair that gates safe purging on Cassandra 4.x or 5.x.

How a Delete Becomes a Purgeable Tombstone

Cassandra’s storage engine is a log-structured merge tree, so every delete follows the same append-only trajectory as a write. When a client issues a DELETE (of a cell, row, range, or partition), the coordinator forwards the mutation and each replica appends a tombstone, stamped with the mutation timestamp, to its commit log and active memtable. TTL'd cells need no separate write: they carry their expiry time and are treated as tombstones once it passes. On flush, the marker becomes part of an immutable SSTable. Until compaction merges the tombstone with the older live data it shadows, every read that touches the affected partition must scan the marker, compare timestamps, and reconcile which cells still count as live. The flush cadence, SSTable tiering, and marker visibility windows are all governed by the underlying LSM tree mechanics in Cassandra, which is why tombstone density tracks your write/delete ratio, TTL distribution, and partition cardinality rather than raw row count.
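The timestamp comparison on the read path can be illustrated with a toy model. The sketch below is not Cassandra's implementation: `Cell` and `reconcile` are hypothetical names, and it assumes plain last-write-wins reconciliation in which a tombstone wins a timestamp tie against live data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One version of a column value as found in a single SSTable or memtable."""
    value: Optional[str]  # None stands in for a tombstone
    timestamp: int        # write timestamp in microseconds


def reconcile(versions: list[Cell]) -> Optional[str]:
    """Return the live value for a cell, or None when a tombstone wins."""
    if not versions:
        return None
    # Highest timestamp wins; on a tie, the tombstone (value is None) wins.
    winner = max(versions, key=lambda c: (c.timestamp, c.value is None))
    return winner.value
```

Every extra version a read has to feed into a function like this is work that compaction would otherwise have removed, which is why un-merged tombstones show up as read latency.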

Cassandra writes several distinct kinds of tombstone, and they cost the read path differently:

Cell (column) tombstones — a single deleted or overwritten-to-null column value.
Row tombstones — a deleted clustering row (DELETE ... WHERE pk = ? AND ck = ?).
Range tombstones — a deleted slice of clustering rows (DELETE ... WHERE pk = ? AND ck > ?); one marker can shadow thousands of rows, so it is cheap to write but forces the read path to evaluate the whole range boundary on every scan.
Partition tombstones — a deleted partition (DELETE ... WHERE pk = ?), which shadows everything under the key.
TTL tombstones — cells that reach their expiry become tombstones automatically at read/compaction time, without any client DELETE.

A marker becomes purgeable only when two independent conditions both hold: gc_grace_seconds has elapsed since the tombstone’s timestamp, and compaction actually merges every SSTable fragment that the marker overlaps into a single output. The grace window exists so that a delete has time to propagate to every replica before the evidence of the delete is discarded. If a tombstone were purged before a lagging replica received it, that replica would still hold the original live value, and the next repair or read-repair would treat the stale value as the newest data and resurrect it — the classic “deleted data comes back” failure. Because partition keys are spread across the token ring by consistent hashing, tombstone distribution is itself a function of data partitioning and token ring placement: a hot or oversized partition concentrates markers on one replica set and is where tombstone problems surface first.
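The two conditions can be written as a single predicate. This is a simplified sketch, not the engine's code: `Tombstone`, `is_purgeable`, and `min_ts_outside_compaction` are hypothetical names. It assumes the overlap condition can be reduced to one number: the oldest write timestamp, for the same partition, in any SSTable that is not part of the current compaction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tombstone:
    deletion_time: int    # epoch seconds at which the delete was written
    write_timestamp: int  # mutation timestamp in microseconds


def is_purgeable(
    tombstone: Tombstone,
    now: int,
    gc_grace_seconds: int,
    min_ts_outside_compaction: Optional[int],
) -> bool:
    """True only when the grace window has elapsed AND nothing older survives elsewhere."""
    grace_elapsed = tombstone.deletion_time + gc_grace_seconds < now
    # If an SSTable outside this compaction may hold data older than the
    # tombstone, dropping the marker would let that data reappear.
    nothing_shadowed_elsewhere = (
        min_ts_outside_compaction is None
        or tombstone.write_timestamp < min_ts_outside_compaction
    )
    return grace_elapsed and nothing_shadowed_elsewhere
```

Either condition alone is not enough: an old tombstone with shadowed data in another SSTable must be kept, and a fully merged tombstone inside the grace window must be kept too.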

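In summary, a marker moves through five states: live record, tombstone in the memtable, tombstone flushed to an SSTable, purge-eligible (gc_grace_seconds has elapsed and repair has propagated the delete to every replica), and finally reclaimed disk space once compaction merges it with the data it shadows.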

Configuration Reference

Tombstone behavior is controlled by a mix of per-table CQL properties and node-level cassandra.yaml settings. The two are frequently confused: the grace window is per-table, but the read-scan guards are per-node.

| Key | Scope | Default | Valid range | Impact |
| --- | --- | --- | --- | --- |
| `gc_grace_seconds` | Per-table (CQL) | `864000` (10 days) | `0`–`INT_MAX` seconds | Minimum age before a tombstone is eligible for purge; must exceed your full-coverage repair cycle or deletes resurrect. |
| `tombstone_warn_threshold` | Per-node (`cassandra.yaml`) | `1000` | `1`–`INT_MAX` | Logs a WARN when a single read scans this many tombstones; early signal of a modeling problem. |
| `tombstone_failure_threshold` | Per-node (`cassandra.yaml`) | `100000` | `1`–`INT_MAX` | Aborts the read with `TombstoneOverwhelmingException` to protect the replica serving it from heap exhaustion. |
| `compaction_throughput` (`compaction_throughput_mb_per_sec` pre-4.1) | Per-node (`cassandra.yaml`) | `64` MiB/s | `0` (unthrottled)–`INT_MAX` | Caps background merge rate; too low starves tombstone purging, too high starves foreground reads. |
| `unchecked_tombstone_compaction` | Per-table (compaction subproperty) | `false` | `true`/`false` | Lets a single SSTable be re-compacted to drop tombstones without first checking whether overlapping SSTables would block the purge, once it is past `tombstone_compaction_interval`. |
| `tombstone_compaction_interval` | Per-table (compaction subproperty) | `86400` (1 day) | seconds | Minimum SSTable age before single-SSTable tombstone compaction is considered. |

The grace window and single-SSTable purge behavior are set per table with ALTER TABLE:

```
-- Shorten the grace window for a table whose repair cycle is well under 3 days,
-- and allow lone SSTables to shed tombstones once they age past the interval.
ALTER TABLE sensor.readings
  WITH gc_grace_seconds = 259200  -- 3 days; must exceed the full repair cycle
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1,
    'unchecked_tombstone_compaction': 'true',
    'tombstone_compaction_interval': 86400
  };
```

The node-level scan guards live in cassandra.yaml. Edits to the file take effect only when a node restarts, so apply them with a rolling restart; compaction throughput can also be changed on a running node with nodetool setcompactionthroughput.

```
# cassandra.yaml: per-node read-path guards (not per-table CQL properties)
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
compaction_throughput: 64MiB/s   # 4.1+ syntax; pre-4.1 use compaction_throughput_mb_per_sec: 64
```
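Because the per-table and per-node settings are easy to mix up, it can help to check them together before rolling out a change. The helper below is a minimal sketch: `check_tombstone_settings` is a hypothetical name, and the caller is assumed to have already read the four values from the schema, from cassandra.yaml, and from the repair schedule.

```python
def check_tombstone_settings(
    warn_threshold: int,
    failure_threshold: int,
    gc_grace_seconds: int,
    repair_cycle_seconds: int,
) -> list[str]:
    """Return human-readable problems; an empty list means the values are consistent."""
    problems: list[str] = []
    if warn_threshold >= failure_threshold:
        # The WARN line is only useful if it fires before reads start failing.
        problems.append("tombstone_warn_threshold must be below tombstone_failure_threshold")
    if gc_grace_seconds <= repair_cycle_seconds:
        # A tombstone could be purged before repair has delivered the delete.
        problems.append("gc_grace_seconds must exceed the full repair cycle")
    return problems
```

With the defaults from the table and a 3-day repair cycle, the function returns an empty list.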

Your compaction strategy sets the velocity of purging: SizeTieredCompactionStrategy can delay tombstone removal when bulk deletes create uneven tiers, LeveledCompactionStrategy purges faster at the cost of write amplification, and TimeWindowCompactionStrategy makes cleanup predictable for TTL'd data, because a whole window's SSTable can be dropped once every cell in it has expired and gc_grace_seconds has passed. Choose it against your workload using the comparison of STCS, LCS and TWCS rather than defaulting to STCS on a delete-heavy table.
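As a rough worked example of TWCS predictability, the earliest moment a window can be dropped whole is simple arithmetic. The sketch assumes every cell in the window was written with the same TTL and that no other SSTable overlaps the window; `earliest_drop_time` is a hypothetical helper, not a Cassandra API.

```python
from datetime import datetime, timedelta, timezone


def earliest_drop_time(
    window_end: datetime, ttl_seconds: int, gc_grace_seconds: int
) -> datetime:
    """Last write in the window expires at window_end + TTL; the grace window follows."""
    return window_end + timedelta(seconds=ttl_seconds + gc_grace_seconds)


# A daily window closing at midnight on 2 Jan, 7-day TTL, 3-day grace window.
drop_at = earliest_drop_time(
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    ttl_seconds=7 * 86400,
    gc_grace_seconds=259200,
)
```

Here `drop_at` is midnight UTC on 12 January 2024: two days in, plus seven days of TTL, plus three days of grace.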

Procedure: Safely Reclaiming Tombstone Space

The goal of this runbook is to move markers from tombstone to reclaimed disk without shortening the grace window below your repair cadence. Run it per table, starting from the worst offender.

1. Baseline the table. Capture the tombstone-per-slice histogram and SSTable footprint before touching anything.
```
nodetool tablestats sensor.readings   # cfstats is deprecated; use tablestats on 4.x/5.x
```
```
Table: readings
    SSTable count: 47
    Space used (live): 214.6 GiB
    Maximum tombstones per slice (last five minutes): 184320
    Average tombstones per slice (last five minutes): 2210.4
```
Gate: if Maximum tombstones per slice is anywhere near tombstone_failure_threshold, reads are already at risk. Proceed, but treat it as an incident, not routine maintenance.

2. Confirm the grace window is safe for this table. Read the current value and compare it to your full-coverage repair cycle.
```
cqlsh -e "SELECT gc_grace_seconds FROM system_schema.tables \
  WHERE keyspace_name='sensor' AND table_name='readings';"
```
Safety gate: gc_grace_seconds must be larger than the time it takes one repair cycle to touch every replica of this table. If it is not, fix the repair schedule first; never lower the grace window to force a purge.

3. Ensure the delete has propagated: run repair before you purge. Tombstones are only safe to drop once every replica has the delete. On 4.x/5.x, incremental primary-range repair is the routine cycle. Because -pr covers only the token ranges the local node owns as primary, run it on every node in the cluster before treating the table as repaired.
```
nodetool repair -pr sensor readings
```
The distinction between opportunistic on-read reconciliation and this scheduled validation is covered in read repair vs anti-entropy repair; only the scheduled Merkle-tree comparison guarantees full propagation before purge.

4. Force reclamation once the grace window and repair both clear. Prefer the targeted tool over a major compaction, which collapses all tiers into one SSTable and disrupts STCS/LCS tier structure.
```
nodetool garbagecollect sensor readings   # 4.x+; single-table, tier-preserving GC
```
Use nodetool compact sensor readings (a major compaction) only on a table small enough that a single giant SSTable is acceptable, or as a last resort under an active incident.

5. Watch the merge drain. Confirm the GC/compaction task is running and progressing rather than stalled.
```
nodetool compactionstats
```

The Python driver below folds those gates into an idempotent operation you can put behind cron or an orchestrator. It refuses to purge unless gc_grace_seconds is strictly greater than the repair cadence you pass in, so a misconfigured table cannot resurrect data. Both nodetool commands act on the local node only, so schedule it on every node.

```
#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` and `cqlsh` on PATH. No third-party deps.
"""Gate tombstone reclamation on a safe grace window and a completed repair (Cassandra 4.x/5.x)."""

import re
import subprocess
from datetime import datetime, timezone

# Plain identifiers only, so a name can never break out of the quoted CQL literal.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def _run(cmd: list[str], timeout: int = 7200) -> str:
    """Run a subprocess, raising on non-zero exit or timeout."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed: {proc.stderr.strip()}")
    return proc.stdout


def grace_seconds(keyspace: str, table: str) -> int:
    """Read the per-table gc_grace_seconds from the schema."""
    for name in (keyspace, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    out = _run([
        "cqlsh", "-e",
        f"SELECT gc_grace_seconds FROM system_schema.tables "
        f"WHERE keyspace_name='{keyspace}' AND table_name='{table}';",
    ], timeout=30)
    # The value is the sole integer on the data line of cqlsh's table output.
    for line in out.splitlines():
        token = line.strip()
        if token.isdigit():
            return int(token)
    raise RuntimeError(f"could not parse gc_grace_seconds for {keyspace}.{table}")


def reclaim_tombstones(
    keyspace: str, table: str, repair_cycle_seconds: int
) -> bool:
    """Repair then garbage-collect a table, but only if the grace window is safe."""
    grace = grace_seconds(keyspace, table)
    if grace <= repair_cycle_seconds:
        print(
            f"REFUSING: gc_grace_seconds={grace}s <= repair cycle "
            f"{repair_cycle_seconds}s; purging risks resurrecting deletes."
        )
        return False

    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] Repairing {keyspace}.{table} before purge (grace={grace}s)")
    # -pr repairs only this node's primary ranges; run the script on every node.
    _run(["nodetool", "repair", "-pr", keyspace, table])

    # garbagecollect preserves tier structure, unlike a major compaction.
    print(f"Reclaiming tombstones on {keyspace}.{table}")
    _run(["nodetool", "garbagecollect", keyspace, table])
    print("Reclamation complete.")
    return True


if __name__ == "__main__":
    # A 3-day repair cycle: the table's grace window must be strictly larger.
    reclaim_tombstones("sensor", "readings", repair_cycle_seconds=259200)
```
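The baseline gate from the first step of the runbook can be scripted in the same spirit. The sketch below parses the tablestats line shown in the sample output earlier; `max_tombstones_per_slice` and `read_risk` are hypothetical helpers, and the default thresholds are the cassandra.yaml defaults from the configuration table, which your nodes may override.

```python
import re

_MAX_SLICE = re.compile(
    r"Maximum tombstones per slice \(last five minutes\):\s*([\d.]+)"
)


def max_tombstones_per_slice(tablestats_output: str) -> int:
    """Extract the per-slice maximum from `nodetool tablestats` text output."""
    match = _MAX_SLICE.search(tablestats_output)
    if match is None:
        raise ValueError("tombstones-per-slice line not found in tablestats output")
    return int(float(match.group(1)))


def read_risk(max_per_slice: int, warn: int = 1000, failure: int = 100_000) -> str:
    """Classify a table against the node-level read guards."""
    if max_per_slice > failure:
        return "failing"   # reads of the worst slice are being aborted
    if max_per_slice > warn:
        return "warning"   # WARN lines are being logged
    return "ok"
```

Feeding it the sample output from the baseline step classifies the table as "failing", which matches the advice to treat that state as an incident.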

For a full alerting layer that watches these metrics continuously and pages before thresholds breach, build on the pattern in automating tombstone threshold alerts with Python.

Verification & Observability

Trust the metrics, not the exit code — confirm the tombstones actually dropped.

Re-run the histogram. nodetool tablestats sensor.readings should show Maximum tombstones per slice (last five minutes) fall and SSTable count shrink after the merge. If the number is unchanged, the markers were not yet purgeable (grace window not elapsed, or no overlapping SSTables to merge).
Confirm the merge finished. nodetool compactionstats should return to pending tasks: 0 for the table; a line stuck below 100% indicates a stalled GC.
Query the virtual tables (4.x/5.x). The system_views keyspace exposes active tasks without JMX polling:
```
SELECT keyspace_name, table_name, kind, progress, total
FROM system_views.sstable_tasks;
```
Grep the logs. Search system.log for TombstoneOverwhelmingException and Read N live rows and M tombstone cells WARN lines; both should stop appearing for the table once reclamation succeeds. Persistent WARN lines after a clean purge point to a modeling problem, not a GC problem.
Export to Prometheus/Grafana. Track the org.apache.cassandra.metrics:type=Table,name=TombstoneScannedHistogram and type=Compaction,name=PendingTasks MBeans via the JMX exporter, and alert on read-latency SLO regressions correlated with tombstone scan rate.
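The first two checks can be combined into one pass/fail signal for an orchestrator. This is a sketch under stated assumptions: `pending_compactions` and `reclamation_verified` are hypothetical helpers, the caller supplies the before and after tombstone maxima from tablestats, and the compactionstats output is assumed to contain a `pending tasks: N` line.

```python
import re


def pending_compactions(compactionstats_output: str) -> int:
    """Read the pending task count from `nodetool compactionstats` text output."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))


def reclamation_verified(max_before: int, max_after: int, pending: int) -> bool:
    """Pass only when the merge has drained and the tombstone maximum actually fell."""
    return pending == 0 and max_after < max_before
```

An unchanged maximum with zero pending tasks is the interesting failure: the merge ran, but the markers were not yet purgeable.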

Failure Modes & Rollback

Resurrected (zombie) deletes from a grace window shorter than the repair cycle. If gc_grace_seconds is smaller than the time your repair takes to reach every replica, compaction can purge a tombstone before a lagging replica received the delete; the next repair then treats that replica’s stale live value as newest and re-propagates it. Detect it by diffing expected-deleted rows across replicas with cqlsh CONSISTENCY ALL reads, or by spotting rows reappearing after a repair. There is no clean rollback for already-resurrected data — you must re-issue the deletes, then raise gc_grace_seconds above the repair cycle before the next purge. Prevention is the only real fix, which is why the runbook gates on the grace-versus-cadence comparison.
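One way to audit for this exposure after the fact is to look for gaps between successive full-coverage repairs that were as long as the grace window. The sketch below assumes you can export completion timestamps from whatever schedules your repairs; `unsafe_repair_gaps` is a hypothetical helper, not part of nodetool.

```python
from datetime import datetime, timedelta, timezone


def unsafe_repair_gaps(
    repair_completions: list[datetime], gc_grace_seconds: int
) -> list[tuple[datetime, datetime]]:
    """Return (previous, next) repair pairs separated by at least the grace window."""
    ordered = sorted(repair_completions)
    grace = timedelta(seconds=gc_grace_seconds)
    # Any delete written early in such a gap could have been purged before
    # the next repair delivered it to a replica that missed it.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= grace]
```

Tables deleted from during a reported gap are the ones worth checking for reappearing rows.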

Read aborts under TombstoneOverwhelmingException. A query that scans more than tombstone_failure_threshold markers is aborted on the replica serving it, so the read fails entirely. Detect it in system.log and in client-side driver errors. The immediate mitigation is to identify the offending table (nodetool tablestats → Maximum tombstones per slice) and the offending query (from the tombstone WARN lines in system.log), then run the reclamation procedure above; the durable fix is to stop generating so many tombstones: avoid queue-style tables, wide range deletes, and inserting nulls. Raising tombstone_failure_threshold “to make the error go away” is an anti-pattern: it trades a fast failure for heap exhaustion on the replica.
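To find which reads are closest to the failure threshold, the WARN lines can be ranked by tombstone count. This sketch matches only the `Read N live rows and M tombstone cells` wording and keeps the whole line as context, since the rest of the message varies by version; `worst_tombstone_reads` is a hypothetical helper.

```python
import re
from typing import Iterable

_TOMBSTONE_WARN = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")


def worst_tombstone_reads(
    log_lines: Iterable[str], top: int = 3
) -> list[tuple[int, int, str]]:
    """Return (tombstones, live_rows, line) tuples, highest tombstone count first."""
    hits = []
    for line in log_lines:
        match = _TOMBSTONE_WARN.search(line)
        if match:
            hits.append((int(match.group(2)), int(match.group(1)), line.strip()))
    hits.sort(reverse=True)
    return hits[:top]
```

Typical usage would be `worst_tombstone_reads(open("/var/log/cassandra/system.log"))`, where the log path is an assumption about your install.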

Tombstones that outlive the grace window because shadowed data sits in another SSTable. Even after the grace window elapses, a tombstone is not dropped while a live value it shadows still lives in a different, un-merged SSTable; normal compaction only reclaims it when those SSTables are merged together. Detect it when nodetool tablestats shows high tombstone counts long after the grace window should have cleared them. The fix is to enable unchecked_tombstone_compaction on the table (tuning tombstone_compaction_interval if needed), or to run nodetool garbagecollect; roll back by reverting the ALTER TABLE if the extra single-SSTable compactions raise I/O beyond budget.

Frequently Asked Questions

What happens if gc_grace_seconds is shorter than my repair cadence?

Tombstones for a partition can be purged during compaction before a repair has propagated the delete to every replica holding that token range. Replicas that missed the delete still hold the original value, and the next repair or read-repair treats that value as the newest data and resurrects it. Always keep your full-coverage repair cycle comfortably shorter than gc_grace_seconds, and gate any forced purge on that comparison so a skipped cycle cannot silently exceed the grace window.

Is it safe to set gc_grace_seconds to 0?

Only on tables where you can guarantee the delete never needs to survive a replica outage: for example, a single-node cluster, or a TWCS time-series table whose entire window is dropped atomically and never repaired against divergent replicas. On any multi-replica table that takes normal writes and deletes, gc_grace_seconds = 0 can resurrect data as soon as one replica misses a delete, because the tombstone may be purged elsewhere before that replica ever sees it. Prefer shortening it to just above your repair cycle rather than zeroing it.

Should I run nodetool garbagecollect or a major compaction to purge tombstones?

Prefer nodetool garbagecollect (4.x+). It operates table by table and preserves your compaction tier structure, whereas a major compaction (nodetool compact) collapses every SSTable into one huge file, which breaks STCS tiering and LCS levels and creates a single SSTable that will not compact again for a long time. Reserve major compaction for small tables or one-off incident recovery.

Why do tombstones remain even after gc_grace_seconds has elapsed?

An expired grace window makes a tombstone eligible for purge, but it is only actually dropped when compaction merges every SSTable containing data that the tombstone shadows. If the live value it overwrites still sits in an un-merged SSTable, the marker must be kept, or the older value would reappear. Enable unchecked_tombstone_compaction or run nodetool garbagecollect to force single-SSTable reclamation once the SSTable ages past tombstone_compaction_interval.

How do I stop generating so many tombstones in the first place?

Most tombstone storms are a data-modeling defect. Avoid queue or work-log patterns where rows are inserted then deleted in order, since a read of the queue head scans every deleted row ahead of it. Do not insert explicit null values (each becomes a cell tombstone) — omit the column instead. Prefer TTLs with TimeWindowCompactionStrategy over manual range deletes for time-series data, so whole windows drop as immutable SSTables rather than leaving range tombstones behind.
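The "omit the column" advice is easy to enforce at the point where statements are built. The helper below is a minimal sketch: `build_insert` is a hypothetical name, the table and column names are assumed to be trusted constants, and the `%s` placeholder style is assumed to match what your driver expects for simple statements.

```python
def build_insert(table: str, row: dict[str, object]) -> tuple[str, list[object]]:
    """Build an INSERT that leaves out None-valued columns instead of writing nulls."""
    present = {column: value for column, value in row.items() if value is not None}
    if not present:
        raise ValueError("row has no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("%s" for _ in present)
    statement = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return statement, list(present.values())


statement, params = build_insert(
    "sensor.readings", {"sensor_id": "a1", "ts": 1, "humidity": None}
)
```

Here `humidity` never reaches the statement, so no cell tombstone is written for it.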

Related guides

Cassandra architecture and compaction fundamentals: the parent guide framing how the storage engine and cluster fabric interact.
LSM tree mechanics in Cassandra: why deletes append markers and how compaction reconciles them.
Understanding STCS vs LCS vs TWCS: how each strategy sets the velocity and predictability of tombstone purging.
Read repair vs anti-entropy repair: the reconciliation that must propagate a delete before it is safe to purge.
Automating tombstone threshold alerts with Python: a continuous alerting layer that pages before thresholds breach.