Skip to content

[BFTree] Add RangeIndex cluster migration#1731

Open
tiagonapoli wants to merge 40 commits into
mainfrom
tiagonapoli/bftree-migration
Open

[BFTree] Add RangeIndex cluster migration#1731
tiagonapoli wants to merge 40 commits into
mainfrom
tiagonapoli/bftree-migration

Conversation

@tiagonapoli
Copy link
Copy Markdown
Collaborator

@tiagonapoli tiagonapoli commented Apr 23, 2026

Summary

Adds end-to-end cluster migration for RangeIndex keys, supporting both MIGRATE SLOTS and MIGRATE KEYS paths. RangeIndex keys are backed by a native BfTree whose on-disk state (data.bftree) lives outside Tsavorite — shipping just the 51-byte stub is insufficient. This change streams the entire BfTree snapshot file alongside the stub over the existing migration transport.

Architecture

  • Serializer (RangeIndexChunkedSerializer): Pure state machine over Span<byte> — no I/O. Takes file data as input via MoveNext(dest, fileData, out consumed).
  • Migration Reader (RangeIndexMigrationReader): Async wrapper that reads the snapshot file and feeds bytes to the serializer.
  • Deserializer (RangeIndexChunkedDeserializer): Sync state machine that writes received file data to a temp file, validates xxHash64 checksum, recovers the native BfTree, and publishes the stub to the store.
  • Factory (RangeIndexManager.SnapshotRangeIndexAndCreateReader): Snapshots the BfTree under an exclusive lock to a temp file, then creates the reader.

Wire format

Single MigrationRecordSpanType.SerializedRangeIndexStream (tag 4). Stream format across chunks:

[4B keyLen][key bytes][8B fileSize][file bytes][8B xxHash64][4B stubLen][stub]

Key and file bytes may span chunks; all other elements must fit within a single chunk.

SLOTS path

  1. Scan detects RI keys via RecordType == 2, captures to MigrateOperation.RangeIndexes
  2. After scan completes, MigrateRangeIndexKeysAsync runs a sketch-protected batch cycle:
    • Add all RI keys to sketch (INITIALIZING)
    • TRANSMITTING + epoch barrier — blocks writes during snapshot+transmit
    • Transmit each key via TransmitRangeIndexAsync
    • DELETING + epoch barrier — blocks reads+writes during delete
    • Delete each key
    • finally: clear sketch (unblocks clients)

KEYS path

  1. GetRangeIndexKeysForMigration discovers RI keys via RIGET
  2. TransmitKeysAsync skips RI keys (in rangeIndexKeysToIgnore)
  3. Each RI key transmitted via TransmitRangeIndexAsync, then marked in sketch for DeleteKeysAsync

Bug fixes

  • Remove RI+cluster startup guard (GarnetServer.cs)
  • Fix round-trip migration: Publish deletes existing data file before move
  • Fix Publish registration: accept InPlaceUpdated/CopyUpdated status
  • Fix serializer empty-buffer bug in FileData phase
  • TransmitRangeIndexAsync catches all exceptions (never throws)
  • Sketch protection for RI keys in SLOTS path (previously unprotected)

Tests

  • 26 unit tests for serializer/deserializer (round-trip, checksums, error states, buffer boundaries, chunk sizes)
  • 11 cluster integration tests: SingleBySlot, ByKeys, ManyBySlot, WhileModifying, MigrateBack, LargeTree (1KB/256KB chunks), ChunkSize variants, StressAsync (Explicit)

TODO

  • Vector Set index keys have the same sketch protection gap (documented in code)
  • AOF replication of migrated trees to destination replicas

@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 6 times, most recently from 61a467b to 3a6e52f Compare May 5, 2026 17:15
@tiagonapoli tiagonapoli marked this pull request as ready for review May 5, 2026 17:26
Copilot AI review requested due to automatic review settings May 5, 2026 17:26
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch from 3a6e52f to dc78c02 Compare May 5, 2026 17:36
@tiagonapoli tiagonapoli changed the title [BFTree] Add RangeIndex cluster migration Add RangeIndex cluster migration with sketch protection May 5, 2026
@tiagonapoli tiagonapoli changed the title Add RangeIndex cluster migration with sketch protection [BFTree] Add RangeIndex cluster migration May 5, 2026
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 3 times, most recently from 74564ec to 74a1e44 Compare May 5, 2026 17:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 4 times, most recently from 5754d5d to ac06150 Compare May 5, 2026 22:50
@vazois
Copy link
Copy Markdown
Contributor

vazois commented May 6, 2026

Missing RangeIndex feature-enabled guards

This PR doesn't consistently check whether the RangeIndex feature is enabled (i.e., whether RangeIndexManager is non-null) before using it. Two spots are at risk of NullReferenceException:

1. Source side — MigrateSession.RangeIndex.cs:29

var rangeIndexManager = clusterProvider.storeWrapper.DefaultDatabase.RangeIndexManager;
// No null check before calling .SnapshotRangeIndexAndCreateReader(...)

The KEYS path correctly uses ?.GetRangeIndexKeysForMigration(...) with a fallback to an empty dictionary, but TransmitRangeIndexAsync assumes the manager is always available. If the SLOTS scan somehow discovers an RI key on a node where the feature is disabled, this will crash.

2. Receive side — RespClusterMigrateCommands.cs:181

if (!rangeIndexMigrationState.ProcessRecord(...))

rangeIndexMigrationState is set to null when RangeIndexManager is null (line 70 in ClusterSession.cs), but the dispatch at line 181 doesn't guard against that. If a receiving node has RangeIndex disabled and the sender transmits a SerializedRangeIndexStream record, this will throw.

Suggestion: Add a null check on the receive side (e.g., log a warning and skip/error the record), and consider a defensive null check in TransmitRangeIndexAsync as well.

}

public async Task<bool> TransmitKeysAsync(Dictionary<byte[], byte[]> vectorSetKeysToIgnore)
public async Task<bool> TransmitKeysAsync(Dictionary<byte[], byte[]> vectorSetKeysToIgnore, Dictionary<byte[], byte[]> rangeIndexKeysToIgnore)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to combine these two instead of having to maintain two dictionaries?

Comment thread libs/cluster/Server/Migration/MigrateOperation.cs Outdated
Copy link
Copy Markdown
Contributor

@vazois vazois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libs/cluster/Server/Migration/MigrateScanFunctions.cs:53 — maybe not for this PR, but can't we just get these values from an enum?

Comment thread libs/cluster/Server/Migration/MigrateSession.RangeIndex.cs
await WaitForConfigPropagationAsync().ConfigureAwait(false);

// Discover Vector Sets linked namespaces
var allKeys = migrateTask.sketch.Keys.Select(t => t.Item1.ToArray());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid the copy here and just iterate over the container?

Copy link
Copy Markdown
Contributor

@vazois vazois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libs/cluster/Server/Migration/MigrateSessionKeys.cs:43 — seems really wasteful to maintain a separate dictionary just for skipping keys. We can potentially store the info for the key type inline and skip the key once we read it from the sketch list

}

/// <summary>Reset state for the next key stream.</summary>
private void Reset()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that the reset does not race with back to back Migration calls or parallel sessions on different slots

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each ClusterSession has a single RangeIndexMigrationReceiveSession - within a single ClusterSession, processing is single threaded right?

Comment thread libs/cluster/Session/RangeIndexMigrationReceiveSession.cs
@tiagonapoli tiagonapoli changed the base branch from dev to main May 21, 2026 15:03
Tiago Napoli and others added 2 commits May 25, 2026 17:32
w
Signed-off-by: Tiago Napoli <tiagonapoli@microsoft.com>
- Update RangeIndexManager.Migration.cs for 7 API changes from main:
  dataDir->riLogRoot, liveIndexes key nint->Guid, SnapshotToFile->CPR
  pattern, evicted tree paths, RecoverFromCprSnapshot, LogDataPathFor,
  stub.ResetFlags()
- Document TRYAGAIN behavior for RI commands during migration instead
  of spin-wait, with rationale (pipeline stall, client timeout, and
  connection reset risks)
- Add Redis protocol context and SE.Redis client handling details

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch from 9e45701 to 74b9745 Compare May 26, 2026 00:32
Tiago Napoli and others added 27 commits May 25, 2026 17:46
…errors

- Return TRYAGAIN for RI.* commands when key is blocked during migration
  instead of spin-waiting, avoiding pipeline stall and client timeout risks
- Add returnTryAgainForMigratingKeys field to ClusterSlotVerificationInput
- Add MigratingKeyResult local function in SingleKeySlotVerify
- Thread parameter through MultiKeySlotVerify, VerifyKeysInRange,
  NetworkIterativeSlotVerify
- Fix migration API visibility: internal -> public for cross-project access
  (DefaultMigrationChunkSize, RangeIndexRecordType, GetRangeIndexKeysForMigration,
  SnapshotRangeIndexAndCreateReader, DeriveTempMigrationPath, PublishMigratedIndex)
- Fix RangeIndexManager constructor calls in tests (enabled/dataDir -> riLogRoot)
- Fix broken XML doc cref and remove stale Allure.NUnit usings
- Fix dataDir -> riLogRoot references in RangeIndexManager.cs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove RangeIndexMigrationReceiveState allocation from ClusterSession
  constructor; create lazily on first SerializedRangeIndexStream record
- Add XML doc on rangeIndexMigrationState explaining why per-session
  state is needed (chunked BfTree snapshots across CLUSTER MIGRATE calls)
- Separate error paths: RangeIndex not enabled vs ProcessRecord failure
- Use null-conditional pattern for protocol enforcement check

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove redundant returnTryAgainForMigratingKeys parameter from
  SingleKeySlotVerify and NetworkIterativeSlotVerify; read from csvi field
- Fix XML doc cref in ClusterSession
- Remove duplicate clusterSession.Dispose() in RespServerSession

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thread _cts.Token through TransmitRangeIndexAsync chunk read loop
and MigrateRangeIndexKeysAsync key iteration for consistent
cancellation support with rest of migration session.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merge snapshot and transmission into single try/catch block with
using declaration. Update error message to cover both phases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace direct _cts.Token access with explicit CancellationToken
parameter. Remove default chunkSize value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dex.cs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use fixed statement to pin the key byte[] before creating
PinnedSpanByte, matching the pattern in MigrateSessionSlots.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SnapshotForMigration now receives PinnedSpanByte instead of
ReadOnlySpan<byte>, matching the codebase pattern for internal
storage methods. The caller pins the key with fixed statement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pin keyBytes alongside stubBytes in the fixed block and use
FromPinnedPointer instead of FromPinnedSpan.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add migrationTempDir field, created in constructor after cleanup
- Remove dot prefix (.migration-tmp -> migration-tmp) for Windows
- Remove per-call Directory.CreateDirectory from DeriveTempMigrationPath
- Simplify DeriveTempMigrationPath to expression-bodied member

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Read_MainStore can trigger CopyToTail, which calls PostCopyToTail-cold
and tries to take the RI exclusive lock. In SnapshotForMigration the
exclusive lock is already held, causing a deadlock. Use Read_RangeIndex
(which passes ReadCopyOptions.None) to suppress CTT on both read sites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Discovery (GetRangeIndexKeysForMigration) now returns HashSet<byte[]>
instead of Dictionary — it only identifies which keys are RangeIndex,
without capturing stub bytes that could go stale.

SnapshotForMigration reads the stub under exclusive lock and returns
it via out parameter. SnapshotRangeIndexAndCreateReader uses the fresh
stub directly, eliminating the time-of-check/time-of-use gap where
concurrent reads could promote the stub via RIPROMOTE.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These changes belong on the separate tiagonapoli/tryagain-ri-migration
branch. Reverts ClusterSlotVerificationInput.returnTryAgainForMigratingKeys,
MigratingKeyResult local function, and the IsRangeIndexCommand flag setting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 4 ExceptionInjectionType enum values for RI migration phases
- Add TriggerException and ResetAndWaitAsync hooks in migration code
- Add 7 new cluster migration tests (on-disk stubs, concurrent R/W,
  pause during transmit/delete, exception injection, slow transmit)
- Add Stopwatch-based timing to sender overall, sender per-key, and
  receiver paths (logged in TimeSpan.Ticks)
- Expose TotalFileBytes on RangeIndexChunkedSerializer and reader
- Inject ILogger into RangeIndexMigrationReceiveState

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract timing variables into TransmitRangeIndexMetrics,
MigrateRangeIndexMetrics, and ReceiveRangeIndexMetrics structs,
each with a LogSummary method for self-contained logging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move TransmitRangeIndexMetrics, MigrateRangeIndexMetrics, and
ReceiveRangeIndexMetrics into separate files under the new
Server/Migration/RangeIndex/ directory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move from lazy ??= initialization to constructor init. Simplify
pattern match to plain null-check, add chunk count to protocol
violation log message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…= null

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If a concurrent checkpoint already owns the snapshot claim, wait for it
and copy the scratch file it produced — avoids spinning under the
exclusive lock. Matches the pattern used by SnapshotForFlushViaCpr.

Also changed File.Copy to overwrite: false since migration paths are
always unique GUIDs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants