Data Partitioning & Token Ring Basics in Cassandra 4.x and 5.x

The token ring is not an abstract topology; it is the deterministic routing substrate that governs every write, read, compaction sweep, and anti-entropy stream in a production deployment. This page is for DBAs, distributed-systems engineers, and DevOps teams who need to reason precisely about how a partition key becomes a physical placement decision — and who need runnable checks to make that reasoning safe before they stream, repair, or resize a node. It sits beneath the broader Cassandra architecture and compaction fundamentals guide, which frames how the storage engine and cluster fabric interact; here we drill into the ring itself. Reach for this page when you are validating ring consistency, tuning num_tokens, debugging skewed data distribution, or scripting node lifecycle automation on Cassandra 4.x or 5.x, where Murmur3Partitioner and virtual nodes (vnodes) are the defaults and manual token assignment is the exception.

How Consistent Hashing Maps Keys to the Ring

Every partition key in Cassandra is hashed to a 64-bit signed integer token. The ring represents a continuous number space spanning -2^63 to 2^63 - 1, wrapping so that the largest token is adjacent to the smallest. Murmur3Partitioner is the default and only recommended hash for new deployments; it is fast, produces a near-uniform distribution, and — critically — is deterministic, so the same key always lands on the same token on every node. The partitioner is a deployment-wide invariant: it cannot be changed after the first node bootstraps, because every SSTable on disk is already sorted by Murmur3 token order.

When vnodes are enabled, each physical node owns many disjoint token ranges rather than a single contiguous block. Cassandra 4.0+ recommends num_tokens: 16 (the historical default was 256). Fewer tokens per node reduces the number of ranges that repair must diff and shrinks the availability blast radius of a single failure, while the token-allocation algorithm keeps distribution even. On 4.x/5.x, when allocate_tokens_for_local_replication_factor (or the older allocate_tokens_for_keyspace) is set, new nodes select tokens that actively minimise imbalance for the given replication factor rather than choosing them at random.
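To make the trade-off concrete, the sketch below simulates purely random token selection and measures how unevenly the ring is split for different num_tokens values. It is a toy model under stated assumptions: it tracks primary-range ownership only, ignores replication and racks, and does not reproduce Cassandra's token-allocation algorithm. The names random_ownership and spread are illustrative, not Cassandra APIs.

```python
import random

RING_SIZE = 2**64  # number of distinct tokens in -2^63 .. 2^63 - 1


def random_ownership(nodes: int, num_tokens: int, seed: int = 0) -> list[float]:
    """Fraction of the ring each node owns when every token is picked at random."""
    rng = random.Random(seed)
    # (token, owner) pairs; each node draws num_tokens random 64-bit tokens.
    ring = sorted(
        (rng.randrange(-2**63, 2**63), node)
        for node in range(nodes)
        for _ in range(num_tokens)
    )
    owned = [0] * nodes
    for i, (token, node) in enumerate(ring):
        # A token owns the range (previous token, token]; index -1 wraps the ring.
        prev_token = ring[i - 1][0]
        owned[node] += (token - prev_token) % RING_SIZE
    return [width / RING_SIZE for width in owned]


def spread(shares: list[float]) -> float:
    """Gap between the most and least loaded node, as a fraction of the ring."""
    return max(shares) - min(shares)


if __name__ == "__main__":
    for vnodes in (1, 16, 256):
        shares = random_ownership(nodes=6, num_tokens=vnodes)
        print(f"num_tokens={vnodes:>3}  spread={spread(shares):.3f}")
```

Running it with different seeds gives a feel for why a low num_tokens is usually paired with allocate_tokens_for_local_replication_factor: with few random tokens per node, the spread between nodes tends to be much wider.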

Consistent hashing is deterministic: the same key always resolves to the same token, range, and replica set. The routing from partition key to replicas works as follows.

Once a key resolves to a token, the token’s owning range identifies the primary replica; the replication strategy then walks the ring clockwise to place the remaining replicas on distinct nodes (and, with NetworkTopologyStrategy, distinct racks). Because the mapping is purely a function of the key and the ring state, any coordinator can compute placement locally without a lookup service — the ring metadata itself, disseminated by gossip, is the routing table.

Vnodes split the token space into many disjoint ranges per node; a key's token picks the primary range, and replicas follow the next distinct nodes clockwise.
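The lookup described above can be sketched in a few lines. This is a simplified illustration, not Cassandra's implementation: token_for uses BLAKE2b as a stand-in hash (it is not Murmur3 and will not produce Cassandra's tokens), the walk ignores racks and datacenters, and the node names and token values in RING are hypothetical.

```python
import bisect
import hashlib


def token_for(key: bytes) -> int:
    """Stand-in hash: NOT Murmur3, just a deterministic signed 64-bit token."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def replicas_for(token: int, ring: list[tuple[int, str]], rf: int) -> list[str]:
    """Walk the sorted (token, node) ring clockwise, collecting rf distinct nodes."""
    tokens = [t for t, _ in ring]
    # A vnode token owns the range (previous token, token], so the first
    # token >= the key's token is the primary; past the end wraps to index 0.
    start = bisect.bisect_left(tokens, token) % len(ring)
    chosen: list[str] = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen


# Hypothetical three-node ring with two vnodes per node, sorted by token.
RING = sorted([
    (-6_000_000_000_000_000_000, "node-a"), (1_000_000_000_000_000_000, "node-a"),
    (-3_000_000_000_000_000_000, "node-b"), (4_000_000_000_000_000_000, "node-b"),
    (-1_000_000_000_000_000_000, "node-c"), (7_000_000_000_000_000_000, "node-c"),
])

if __name__ == "__main__":
    key_token = token_for(b"user:42")
    print(key_token, replicas_for(key_token, RING, rf=2))
```

Because the function depends only on the token and the ring list, any process holding the same ring metadata computes the same replica set, which is the property that lets every coordinator route without a lookup service.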

Token ownership and endpoint state propagate through the node gossip and failure detection protocols, which carry ring metadata, liveness markers, and schema versions across the deployment using a randomized epidemic exchange. A node failure alone does not move tokens: ownership changes only when a node is bootstrapped, replaced, decommissioned, or removed, and gossip then announces the new ring state to every peer. Validate ring consistency with nodetool describecluster and nodetool status before initiating any streaming operation, because a mismatched ring view during streaming can leave replicas silently divergent.

Partition routing also dictates storage layout. All columns for a given partition key reside on the same replica set and are stored together, sorted by clustering key. Oversized partitions therefore concentrate load on a single replica and stall background maintenance, because the LSM tree mechanics in Cassandra force compaction threads to hold large in-memory merge structures for a single key. Keeping partitions bounded is not a storage nicety — it is what keeps the ring’s per-node load evenly serviceable.

Configuration Reference

The following options govern ring behavior. All live in cassandra.yaml unless noted, and several are immutable after bootstrap.

| Key | Default (4.x/5.x) | Valid range | Impact on distribution / repair / throughput |
| --- | --- | --- | --- |
| `partitioner` | `Murmur3Partitioner` | fixed per cluster | Immutable after first bootstrap; determines token hash and on-disk sort order. Changing it requires a full rebuild. |
| `num_tokens` | `16` | `1`–`256` | Lower values shrink repair diff scope and failure blast radius but raise the risk of skew without token allocation; higher values smooth distribution at the cost of more ranges to repair. |
| `allocate_tokens_for_local_replication_factor` | unset (recommend `3`) | positive integer = RF | When set, new nodes pick tokens that minimise imbalance for that RF instead of choosing at random. Strongly recommended with low `num_tokens`. |
| `initial_token` | unset (auto) | comma-list of tokens | Manual assignment; only for single-token or migration scenarios. Mutually exclusive with `num_tokens` auto-allocation. |
| `auto_bootstrap` | `true` | `true` / `false` | Controls whether a joining node streams its claimed ranges before serving reads. Never set `false` on a real join. |
| `endpoint_snitch` | `SimpleSnitch` (set `GossipingPropertyFileSnitch`) | snitch class | Determines rack/DC awareness of replica placement; wrong snitch collapses fault domains onto one rack. |

A representative production snippet for a rack-aware, low-num_tokens deployment:

```
# cassandra.yaml: ring and placement (Cassandra 4.x/5.x)
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
num_tokens: 16
allocate_tokens_for_local_replication_factor: 3
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: true
```

Keyspace replication is set at the schema layer, and it is where the ring’s placement rules become concrete:

```
-- Rack- and DC-aware placement; replicas walk the ring across distinct racks.
ALTER KEYSPACE my_keyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
  };
```

Because num_tokens participates in token allocation at join time, changing it on an existing node has no retroactive effect — a node keeps the tokens it claimed at bootstrap. To adopt a new num_tokens, you add new nodes with the new value (or replace nodes one at a time) rather than editing the config of a running node and restarting.

Validating Ring State: A Pre-Streaming Runbook

Every operation that moves data across token ranges — bootstrap, decommission, replace, or repair — must be gated on a consistent ring view. Run these steps in order and stop if any gate fails.

1. Confirm a single schema and ring version across the deployment. A split schema means gossip has not converged, and streaming against it can diverge.
```
nodetool describecluster
```
Expected: exactly one schema version, with all live nodes listed under it:
```
Cluster Information:
    Name: prod
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        2b6f...c41: [10.0.1.10, 10.0.1.11, 10.0.1.12]
```
If more than one schema version appears, resolve it (usually by waiting for convergence or restarting the stuck node) before continuing.

2. Verify every node is UN (Up/Normal) and inspect ownership. Ownership skew is the single most common ring problem.
```
nodetool status my_keyspace
```
```
--  Address     Load     Tokens  Owns (effective)  Host ID   Rack
UN  10.0.1.10   184 GiB  16      33.4%             a1b2...   rack1
UN  10.0.1.11   181 GiB  16      33.2%             c3d4...   rack2
UN  10.0.1.12   187 GiB  16      33.4%             e5f6...   rack3
```
Gate: no node should be DN or ?, and effective ownership should cluster tightly around (RF × 100%) / node_count within a datacenter. The sample above corresponds to RF = 1; with RF = 3 on three nodes, every node reads 100.0%. A node more than a few points off its peers is a skew signal; see the failure modes below.

3. Confirm no streaming or pending ranges are already in flight. Overlapping streams compete for inter-node bandwidth and make progress hard to track.
```
nodetool netstats | head -n 5
```
Expected when idle, in the first two lines:
```
Mode: NORMAL
Not sending any streams.
```
On current releases this single line is printed when no stream session is active in either direction.

4. Only now initiate the ring operation (repair shown here). Repair reconciles divergent replicas across token ranges; on 4.x/5.x, incremental repair is the default and primary-range (-pr) repair is the routine cycle. Because each -pr run covers only the ranges that node owns as primary, it must be run on every node in turn to cover the whole ring. The distinction between on-read reconciliation and scheduled validation is covered in read repair vs anti-entropy repair; the ring only concerns us here as the unit repair operates over.

The Python workflow below folds the gates above into an idempotent, repeatable driver you can put behind cron or an orchestrator. It parses the plain-text output of nodetool (neither nodetool status nor nodetool info supports a JSON format on 4.x/5.x) and refuses to repair against an unhealthy ring.

```python
#!/usr/bin/env python3
# Requirements: Python 3.10+, a local `nodetool` on PATH. No third-party deps.
"""Gate a primary-range repair on a consistent ring view (Cassandra 4.x/5.x)."""

import subprocess
import sys
from datetime import datetime, timezone


def _nodetool(args: list[str], timeout: int = 30) -> str:
    """Run nodetool, raising on a non-zero exit or timeout."""
    proc = subprocess.run(
        ["nodetool", *args], capture_output=True, text=True, timeout=timeout
    )
    if proc.returncode != 0:
        raise RuntimeError(f"nodetool {' '.join(args)} failed: {proc.stderr.strip()}")
    return proc.stdout


def _schema_versions(describe: str) -> list[str]:
    """Return the '<version>: [hosts]' lines listed under 'Schema versions:'."""
    versions: list[str] = []
    in_block = False
    for ln in describe.splitlines():
        if ln.strip() == "Schema versions:":
            in_block = True
        elif in_block and "[" in ln:
            versions.append(ln.strip())
        elif in_block:
            break  # a blank line or the next section ends the schema block
    return versions


def ring_is_consistent() -> bool:
    """True only if there is a single schema version and no down nodes."""
    describe = _nodetool(["describecluster"])
    # Scope parsing to the schema block: later sections may reuse the
    # "<x>: [hosts]" shape. An UNREACHABLE entry counts as a second line.
    # Zero lines means the output could not be parsed, so fail closed.
    if len(_schema_versions(describe)) != 1:
        print(f"Schema disagreement or unparseable output:\n{describe}")
        return False
    status = _nodetool(["status"])
    # Node state is the first token on each data line, e.g. "UN 10.0.1.10 ...".
    down = [ln for ln in status.splitlines() if ln[:2] in ("DN", "DL", "DJ", "DM")]
    if down:
        listing = "\n".join(down)
        print(f"Down nodes detected; deferring:\n{listing}")
        return False
    return True


def streaming_idle() -> bool:
    """True when no streams are in flight."""
    net = _nodetool(["netstats"])
    # Current nodetool prints this one line when no stream session is active
    # in either direction.
    return "Not sending any streams." in net


def repair_primary_range(keyspace: str, timeout: int = 7200) -> bool:
    """Run a sequential primary-range repair, gated on ring health."""
    if not (ring_is_consistent() and streaming_idle()):
        print("Pre-flight gate failed; not starting repair.")
        return False
    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] Starting primary-range repair on {keyspace}")
    # Incremental is the default on 4.x/5.x; -pr scopes to this node's primary
    # ranges; -seq avoids saturating streaming during peak compaction windows.
    try:
        proc = subprocess.run(
            ["nodetool", "repair", "-pr", "-seq", keyspace],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        print(f"Repair timed out after {timeout}s; check nodetool netstats before retrying.")
        return False
    if proc.returncode != 0:
        print(f"Repair failed (exit {proc.returncode}): {proc.stderr.strip()}")
        return False
    print("Repair completed successfully.")
    return True


if __name__ == "__main__":
    # Propagate the outcome so cron or an orchestrator can alert on failure.
    sys.exit(0 if repair_primary_range("my_keyspace") else 1)
```

Verification & Observability

After any ring operation, confirm the outcome rather than trusting the exit code alone.

Re-check ownership and load convergence. Run nodetool status my_keyspace again; effective ownership should be balanced and Load should have shifted as expected (it rises on a decommissioned node's former peers, and drops on a new node's neighbours only after nodetool cleanup). A joining node should read UN, not UJ, once bootstrap completes.
Confirm streaming drained. nodetool netstats should return to Not sending any streams., the line printed when no stream session is active in either direction. If a stream is stuck, nodetool netstats will show a non-100% progress line that stops advancing.
Inspect ring detail per token. nodetool ring my_keyspace lists every token and its owner; use it to spot a node holding disproportionately few or many tokens after allocation.

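Query cluster state over CQL (4.x/5.x). The coordinator's view of its peers is available without JMX. system.peers is a regular system table rather than a virtual one and has no liveness column; on 4.1 and later, the system_views.gossip_info virtual table adds per-endpoint gossip state, including status.

```
-- Topology and schema agreement as seen by the coordinator (its own row lives in system.local).
SELECT peer, data_center, rack, host_id, schema_version
FROM system.peers;
```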

Grep the logs for allocation and streaming events. Bootstrap logs Token allocation and JOINING/NORMAL transitions; streaming logs Session … complete. Repeated Streaming error or Timed out lines in system.log indicate bandwidth saturation — throttle with nodetool setstreamthroughput and retry.

Pair ring metrics with compaction backlog: after a bootstrap that streamed many ranges, watch nodetool compactionstats for a spike in pending tasks, since newly streamed SSTables must be compacted into the receiving node’s tiers.
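A small parser makes that check scriptable. The sketch below assumes the output of nodetool compactionstats contains a line of the form "pending tasks: N"; the exact surrounding lines vary by release, so the function looks only for that line. The function name pending_compactions and the sample text are illustrative.

```python
import re


def pending_compactions(compactionstats_output: str) -> int:
    """Extract the total from the 'pending tasks: N' line of compactionstats."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'pending tasks' line found; output format may differ")
    return int(match.group(1))


if __name__ == "__main__":
    # Illustrative sample, not captured from a real cluster.
    sample = "pending tasks: 12\n- my_keyspace.events: 12\n"
    print(pending_compactions(sample))
```

Sampling this value a few times after a bootstrap shows whether the backlog is draining or still growing.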

Failure Modes & Rollback

Ring ownership skew (uneven Owns %). With low num_tokens and no token allocation configured, random token selection can leave one node owning noticeably more of the ring than its peers, concentrating reads, writes, and compaction on it. Detect it with nodetool status (effective ownership several points above that of its peers) or nodetool ring (one node holding wider ranges). You cannot rebalance in place safely by editing tokens; the remedy is to add replacement nodes with allocate_tokens_for_local_replication_factor set and decommission the skewed node so its ranges are re-picked evenly. nodetool move is not an alternative on a vnode deployment: it relocates a single token and is rejected on nodes that own more than one.
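The ownership check can be automated by parsing the Owns column of nodetool status. The sketch below is one possible approach, not an official tool: it compares each node against the mean of the observed percentages rather than 100% / node_count, because effective ownership sums to RF × 100%. The 3-point tolerance, the function name ownership_skew, and the sample addresses are assumptions.

```python
import re

# Two-letter state, then the address, then the first "<number>%" on the line.
STATUS_LINE = re.compile(r"^[UD][NLJM]\s+(\S+).*?(\d+(?:\.\d+)?)%")


def ownership_skew(status_output: str, tolerance_points: float = 3.0) -> dict[str, float]:
    """Return {address: deviation} for nodes whose Owns % strays from the mean."""
    owns: dict[str, float] = {}
    for line in status_output.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            owns[match.group(1)] = float(match.group(2))
    if not owns:
        return {}
    even_share = sum(owns.values()) / len(owns)
    return {
        addr: round(pct - even_share, 2)
        for addr, pct in owns.items()
        if abs(pct - even_share) > tolerance_points
    }


if __name__ == "__main__":
    # Illustrative output with one overloaded node.
    sample = (
        "--  Address     Load     Tokens  Owns (effective)  Host ID   Rack\n"
        "UN  10.0.1.10   250 GiB  16      45.0%             a1b2...   rack1\n"
        "UN  10.0.1.11   150 GiB  16      27.5%             c3d4...   rack2\n"
        "UN  10.0.1.12   152 GiB  16      27.5%             e5f6...   rack3\n"
    )
    print(ownership_skew(sample))
```

An empty result means every node sits within the tolerance; in a multi-datacenter cluster the comparison would need to be made per datacenter, which this sketch does not do.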

Oversized partition hotspot. A partition that grows past roughly 100 MB pins load on one replica set and forces compaction to hold a large merge structure for a single key, driving read amplification, long GC pauses, and eventually TombstoneOverwhelmingException when deletes accumulate. Detect it with nodetool tablestats (watch Compacted partition maximum bytes and Maximum live cells per slice). This is a modeling defect, not a ring defect: the fix is application-side, using bucketed partition keys sized per how to calculate optimal partition sizes for Cassandra, and pruning the accumulated rows via tombstone management and garbage collection. There is no ring-level rollback — you re-model the key.
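As a minimal illustration of bucketing, the sketch below derives a composite partition key from an entity identifier and a time window, so that one entity's rows spread across many bounded partitions instead of one unbounded one. The names sensor_id and bucket_hours and the 24-hour default are hypothetical; the right bucket width depends on write rate and row size.

```python
from datetime import datetime, timezone


def bucketed_key(sensor_id: str, event_time: datetime, bucket_hours: int = 24) -> tuple[str, int]:
    """Return (sensor_id, bucket), where bucket indexes a fixed-width time window."""
    epoch_hours = int(event_time.timestamp()) // 3600
    return (sensor_id, epoch_hours // bucket_hours)


if __name__ == "__main__":
    morning = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
    evening = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)
    next_day = datetime(2024, 1, 2, 8, 0, tzinfo=timezone.utc)
    # Same day shares a partition; the next day starts a new one.
    print(bucketed_key("s-17", morning), bucketed_key("s-17", evening), bucketed_key("s-17", next_day))
```

In the table definition both values would form the composite partition key, so each bucket hashes to its own token and lands on its own replica set. Reads that span several windows then have to query each bucket explicitly.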

Premature decommission leaving orphaned ranges. Terminating a node before nodetool decommission finishes streaming its ranges to peers leaves the ring believing those ranges are covered when they are not, causing missing data under QUORUM. Prevent it by confirming that nodetool netstats on the leaving node reports no active streams and that nodetool status no longer lists the node before you power it off. If a node was killed mid-decommission, roll forward rather than back: bring the ranges back into coverage with nodetool repair -full on the remaining replicas, and use nodetool removenode <host-id> (or assassinate only as a last resort, after describecluster confirms ring consistency) to purge the stale entry.

Frequently Asked Questions

Should I migrate an existing cluster from num_tokens 256 to 16?

Only by rolling in new nodes or replacing old ones — you cannot lower num_tokens on a running node, because a node keeps the tokens it claimed at bootstrap. The migration is worthwhile for large deployments: fewer tokens mean smaller repair diffs and a smaller availability blast radius per failure. Set allocate_tokens_for_local_replication_factor on the new nodes so the reduced token count still distributes evenly, and add them one rack at a time, running cleanup on old nodes afterward.

Can I change the partitioner once nodes are running?

No. Murmur3Partitioner is fixed at first bootstrap because every SSTable is physically sorted in that partitioner’s token order. Switching partitioners requires standing up a new cluster and migrating data (for example via sstableloader or a dual-write cutover). Treat the partitioner choice as permanent.

Why does one node own more data than the others even with vnodes?

Vnodes reduce but do not eliminate skew when tokens are chosen randomly, which is the default if token allocation is not configured. With a low num_tokens, the variance is larger. Configure allocate_tokens_for_local_replication_factor to make new nodes pick balancing tokens, and check nodetool status ownership after each join. Persistent skew is corrected by replacing the offending node, not by editing its tokens.

What happens if gc_grace_seconds is shorter than my repair cadence?

Tombstones for a partition can be purged during compaction before a repair has propagated the delete to every replica holding that token range, resurrecting deleted data on replicas that missed it. Keep your full-coverage repair cycle comfortably shorter than gc_grace_seconds, and gate repairs on ring health as shown in the runbook so a skipped or failed cycle does not silently exceed the grace window.
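One way to keep that margin visible is a scheduled check such as the sketch below. It uses a conservative heuristic that is an assumption of this example, not a Cassandra rule: the window is considered at risk when the time since the last successful full-coverage repair, plus the expected duration of one more cycle, reaches gc_grace_seconds. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def grace_window_at_risk(
    last_full_repair: datetime,
    gc_grace_seconds: int,
    repair_cycle: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True if the next repair cycle may not finish before tombstones become purgeable."""
    now = now or datetime.now(timezone.utc)
    deadline = last_full_repair + timedelta(seconds=gc_grace_seconds)
    # Budget a full cycle ahead of the deadline: the next repair has to
    # complete, not merely start, before the grace window closes.
    return now + repair_cycle >= deadline


if __name__ == "__main__":
    last = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # 864000 seconds (10 days) is the default gc_grace_seconds.
    print(grace_window_at_risk(last, 864000, timedelta(days=2),
                               now=datetime(2024, 1, 5, tzinfo=timezone.utc)))
```

The timestamp of the last successful repair has to come from your own scheduler or orchestration records; the sketch does not query Cassandra for it.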

Do I still need repair if I run vnodes and NetworkTopologyStrategy?

Yes. Vnodes only change how ranges are distributed, not whether replicas can diverge. Dropped mutations, hinted-handoff expiry, and node outages still create inconsistency that only anti-entropy repair reconciles. The ring determines which ranges each repair session covers; it does not remove the need to repair them.

Related guides

Cassandra architecture and compaction fundamentals: the parent guide framing how the storage engine and cluster fabric fit together.
LSM tree mechanics in Cassandra: how per-partition data becomes SSTables and why oversized partitions stall compaction.
Node gossip and failure detection protocols: the epidemic exchange that disseminates ring metadata and endpoint state.
Read repair vs anti-entropy repair: the two reconciliation paths that operate over token ranges.
How to calculate optimal partition sizes for Cassandra: sizing methodology that keeps per-token load evenly serviceable.