[feat] SID: add SidRqkmeans model (FAISS-trained residual K-Means)#539
Conversation
Second of three PRs splitting the Semantic-ID models onto the shared base from alibaba#538. Adds the concrete RQ-KMeans backend on top of ResidualQuantizer / BaseSidModel; RQ-VAE follows in PR3. - tzrec/modules/sid/kmeans.py: KMeansLayer centroid container + recon_diagnostics. - tzrec/modules/sid/residual_kmeans_quantizer.py: ResidualKMeansQuantizer (FAISS-trained, FX-traceable forward, non-uniform per-layer codebooks). - tzrec/models/sid_rqkmeans.py: SidRqkmeans(BaseSidModel) - gradient -free; reservoir-samples embeddings during the train loop and fits FAISS once in on_train_end. - tzrec/models/model.py: BaseModel.on_train_end() no-op lifecycle hook. - tzrec/main.py: invoke on_train_end after the train loop and force the tail checkpoint so post-hook state is persisted. - protos: SidRqkmeans message + ModelConfig registration (601; 600 is reserved for SidRqvae in PR3). - tests: kmeans_test, ResidualKMeansQuantizerTest, sid_rqkmeans_test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the `last_ckpt_step == i_step -> -1` override (and its stale comment) in the train loop's end-of-loop hook. The normal checkpoint cadence already persists the post-hook state. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- on_train_end() now returns is_ckpt_after_train; the tail save fires on `last_ckpt_step != i_step or is_ckpt_after_train`, so the fitted FAISS codebook is always persisted even when the last periodic checkpoint landed on the final step (main.py, model.py, sid_rqkmeans.py). (alibaba#1) - DDP on_train_end: wrap the rank0 FAISS fit in try/except and broadcast a fit-status flag so a rank0-only failure (or an empty reservoir) makes all ranks raise together instead of deadlocking on the centroid broadcast; correct the empty-reservoir docstring. (alibaba#2, alibaba#3) - KMeansLayer: cache is_initialized as a plain Python bool to drop a per-layer per-batch GPU->CPU .item() sync on the eval/predict path, kept in lockstep with the _is_initialized buffer. (alibaba#6) - _reservoir_add: copy only the kept rows to host instead of the whole batch every training step (keep float64 for n_seen exactness). (alibaba#7) - train_offline: per-layer fit-loss log now reports cumulative reconstruction of the original input (correct under normalize_residuals); align the module normalize_residuals default to True to match the proto. (alibaba#8, alibaba#10) - Drop dead faiss_residual_kmeans (RQ-VAE-only, lands in PR3) and its test; tidy _coerce_proto_numbers into a comprehension. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Flip the default for RQ-KMeans residual normalization to False, in both the SidRqkmeans proto field and the ResidualKMeansQuantizer constructor (kept consistent to avoid the proto/module mismatch). This matches OpenOneRec's residual k-means, which fits raw residuals with no per-layer L2 normalization. Configs that set normalize_residuals explicitly are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- KMeansLayer: add mark_initialized_() so the buffer + cached-bool init flag is owned by the layer; the DDP broadcast in SidRqkmeans uses it instead of poking the private fields. - SidRqkmeans: extract the reservoir-cap setup into _init_reservoir(). - residual_kmeans_quantizer: import faiss at module level (it's a pinned requirement) instead of a lazy in-function import; narrow train_offline(inputs) to torch.Tensor (all callers pass tensors) and drop the dead numpy branch. - Tighten the verbose comments/docstrings across the SID files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add ResidualKMeansQuantizer.default_fit_sample_size() (max(K) * max_points_per_centroid) so the FAISS default lives in the FAISS-owning class; SidRqkmeans._init_reservoir asks the quantizer instead of reading faiss_kwargs and hardcoding 256. Behavior-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use logger.exception() in on_train_end's rank0 except so the underlying error's stack trace is captured (peers raise a coordinated RuntimeError pointing at the rank0 log); drop the now-unused `as e`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment-only; pushed to re-trigger CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review summary — SidRqkmeans (FAISS residual K-Means)Solid, carefully engineered PR. The Vitter Algorithm R reservoir, the int-status-flag DDP fit/broadcast (sidestepping NCCL bool quirks), the cached Should-fix (inline):
Test gaps worth closing:
Minor (docs/consistency):
|
Should-fix: - train_offline: faiss reads `gpu` as a GPU *count*, not a device index, so `gpu=current_device()` was 0 (single-GPU / rank0) -> falsy -> silent CPU fallback. Pass `gpu=1` so the fit actually runs on the (rank0) GPU. Test gaps: - reservoir Phase-2 replacement correctness (identifiable rows: intact, in-range, replacement actually occurs) — beyond the count/shape checks. - normalize_residuals=True end-to-end through train_offline (the F.normalize site the other tests never reached). - eval vs inference predict contract (quantized/input_embedding vs codes-only) and the init_metric/update_metric/compute_metric path. - checkpoint round-trip now asserts codes match the source model exactly (assert_close), not merely non-zero. Minor docs: - on_train_end Returns: clarify only the single-process path returns False; DDP raises on an empty gather. - train_offline docstring: the post-fit index.search streams all N in chunks. - proto train_sample_size comment: K -> max(K) for non-uniform codebooks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review summaryStrong, carefully-engineered PR. The hard parts are handled well: FX-traceable read-only Three items worth a look (posted inline):
Minor (not posted inline):
Nice work overall — the engineering rigor here is clearly high. |
The earlier gpu=1 "fix" was itself wrong and broke the GPU unittest_ci (cpu_ci/h20 passed): faiss reads `gpu` as a COUNT and 1 == True collapses to all-GPUs, so the rank0-only fit sharded over every rank's device and the GPU faiss path (newly activated — it was a silent CPU fallback before) failed on the tiny test data. faiss's count kwarg cannot pin to a single device, so default the fit to CPU (a bounded one-shot; set gpu in faiss_kmeans_kwargs to opt in explicitly). Also: - _init_reservoir docstring: note the cap targets train_sample_size when set, else default_fit_sample_size(). - Add test_on_train_end_ddp_rank0_failure: forces rank0's fit to raise and asserts every rank raises the coordinated RuntimeError, with join(timeout) so a reintroduced deadlock fails CI instead of hanging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Announce CPU vs GPU + N/D at the start of train_offline so the CPU default isn't silent (configs that don't set faiss_kmeans_kwargs.gpu now fit on CPU). Gated by verbose (on_train_end passes verbose=True). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merge upstream/master (1.2.17, incl. alibaba#540 DlrmHSTU fix) and bump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review SummaryReviewed by parallel subagents (code quality, performance, test coverage, documentation accuracy, security), with findings cross-checked before posting. Overall this is carefully written code. The fail-fast guards (reservoir cap vs. largest K at construction), the mid-fit checkpoint poison detection, the strictly-typed FAISS kwargs proto, the bounded reservoir with float64 stream counters, and the exact-round-trip checkpoint tests all show real attention to edge cases. Test coverage of the prediction contracts, non-uniform codebooks, and checkpoint poisoning is genuinely strong. Main concern (inline on
|
…CPU pin - sid_integration_test: force CPU with CUDA_VISIBLE_DEVICES="-1" not "" (empty is treated inconsistently across CUDA runtimes; the GPU CI runner didn't hide devices, tripping the CPU-only guard in the train_eval child). - BaseSidModel: validate codebook entries >=1 and input_dim >=1 at construction; guard feature width in _extract_feature (a (B,1) tensor would otherwise broadcast into a degenerate rank-1 codebook). assert -> raise. - residual_kmeans_quantizer / kmeans_quantize: assert -> raise for the data-corruption guards (N>=max_k, load_centroids_ shape, CPU/shape contract) so they survive python -O. - RelativeL1: float64 sum / long count to avoid float32 rounding past 2**24. - kmeans_quantize: drop the duplicate relative_l1/recon_diagnostics helpers; RelativeL1 (tzrec/metrics) is the single home of the formula. Per-layer offline-fit log now reports MSE only. - sid_rqkmeans: TODO documenting the periodic-checkpointing-disabled contract (codebook can be dropped by save dedupe otherwise). - sid_model.proto: drop stale "(global, across all ranks)" wording. - mock config: set save_checkpoints_steps/epochs = 0 (the documented convention). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review SummaryReviewed at head a9a889c by parallel subagents (code quality, performance, test coverage, docs accuracy, multi-process safety), findings cross-checked before posting. Prior-round items were verified first: the DDP coordination concerns are resolved by removal (single-process guard in Inline comments (5)
Minor (no inline)
🤖 Generated with Claude Code |
- CPU-only guard message recommends CUDA_VISIBLE_DEVICES="-1" (not "", which this PR found unreliable on the GPU CI runner). - Correct the train_offline comment: faiss throws (not warns) for N < K. - Add negative tests for the fail-fast guards: empty/zero codebook, input_dim<1, feature-width mismatch, and train_offline too-few-points / wrong-dim. - sid_integration_test: assert the post-fit eval reports finite mse/rel_loss/ unique_sid_ratio (rel_loss < 1.0, unique_sid_ratio > 0) so a degenerate / unfitted codebook can't keep the test green. - Trim verbose comments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ever 1) The (B, 1) broadcast footgun isn't reachable in practice, so revert _extract_feature to the plain feature read and remove its negative test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The end-to-end train_eval is CPU-only (SidRqkmeans refuses a visible CUDA device). Forcing CPU on the CUDA-built GPU CI image is unreliable (the prior CUDA_VISIBLE_DEVICES="" / "-1" workarounds both still failed in the train_eval child). Skip when CUDA is available so the test runs on the CPU CI job (where it passes) and skips on the GPU runner. Keep nproc=1 for the single-process guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for both github-actions passes — went through them all. Dispositions: Fixed:
Declined / deferred (with reason):
Also fixed the CPU-only end-to-end test that was failing on the GPU CI image: it now runs on the CPU CI job and skips where CUDA is present. |
…_abstract Brings the reviewed alibaba#539 foundation onto feat/sid_abstract (which already carries alibaba#538 + an older RQ-VAE/RQ-Kmeans port), and syncs to upstream/master (alibaba#540, alibaba#541, which alibaba#539 already contains). Conflict resolutions: - sid_rqkmeans.py(+test), residual_kmeans_quantizer.py, sid_model.py: take alibaba#539's canonical versions (BaseSidModel now hosts both SID models, with mse/rel_loss/unique_sid_ratio and the unified x_hat recon key). - types.py: union — keep alibaba#539's QuantizeOutput, retain feat's QuantizeForwardMode enum + ResidualQuantizerOutput (RQ-VAE needs them). - protos/models/sid_model.proto: union — alibaba#539's typed FaissKmeansConfig + clean SidRqkmeans, re-add feat's SinkhornConfig/ClipConfig/SidRqvae; drop the now-unused struct.proto import. - protos/model.proto: enable `SidRqvae sid_rqvae = 600;` (the field alibaba#539 reserved for this follow-up). - main.py / model.py on_train_end: take alibaba#539's wording; drop feat's forced tail-checkpoint (SID models rely on the final=True tail save). Transitional state: old modules/sid/kmeans.py still coexists with alibaba#539's kmeans_quantize.py, and the RQ-VAE stack is still on the old abstraction — both retired in the follow-up refactor commit. All SID modules import.
… kmeans.py Refactors the RQ-VAE stack onto the reviewed alibaba#539 SID foundation so the VQ and K-Means backends share one per-layer interface, and keeps RQ-VAE device-agnostic (CPU + GPU) — unlike the deliberately CPU-only SidRqkmeans. - VectorQuantize now subclasses QuantizeLayer: implements quantize() -> QuantizeOutput and get_codebook_embeddings(); forward() delegates to quantize() so standalone vq(x) still works; the base lookup() returns the raw codebook vector embedding.weight[ids]. - ResidualVectorQuantizer drives the layer through the ABC (layer.quantize / layer.lookup / get_codebook_embeddings) instead of reaching into layer.embedding directly; behavior (raw-vector accumulation, STE-on-aggregate) is unchanged. - SidRqvae drops its update_metric override; alibaba#539's BaseSidModel now scores mse + rel_loss + unique_sid_ratio off predictions["x_hat"]/["codes"]. The train-path mse override stays (RQ-VAE has a train reconstruction). - Retire modules/sid/kmeans.py (replaced by alibaba#539's kmeans_quantize.py): relocate faiss_residual_kmeans into kmeans_quantize.py (CPU fit, centroids returned on the input device — safe from a GPU-resident RQ-VAE) and _squared_euclidean_distance into vector_quantize.py (its only user); drop the now-orphaned KMeansLayer / recon_diagnostics. Tests for the two moved helpers migrate to kmeans_quantize_test.py / vector_quantize_test.py. CPU + GPU: no hard-CUDA assumptions; the only device-sensitive path is the optional FAISS kmeans_init, which fits on CPU and moves centroids to the module's device (DDP: fit on rank 0, broadcast). Sinkhorn's all_reduce works under gloo and nccl. Verified on CPU: all SID unit tests pass (quantize_layer, vector_quantize, kmeans_quantize, residual_quantizer, residual_kmeans_quantizer, relative_l1, sid_rqvae, sid_rqkmeans, residual_vector_quantizer_dist). ruff check/format clean. GPU smoke + the full sid_integration_test must run in the torchgpuv4 container (this shell's CUDA driver is too old and has a stale installed tzrec; checkpoint_util import already fails there independent of this change).
|
Thanks @tiankongdeguiji — all addressed on latest ( Round 1
Round 2
|
…t/sid_abstract upstream/master advanced to 3d4d5a8: alibaba#539 (SidRqkmeans) was merged as a squash and alibaba#542 (docs) landed. feat already carries alibaba#539's content (from the earlier pr-539 merge), so this is effectively the alibaba#542 doc update plus reconciling the SID proto/types surface. Conflicts (all add/add on files feat extended for RQ-VAE) resolved by taking feat's superset: - modules/sid/types.py: keep QuantizeForwardMode + ResidualQuantizerOutput (master's alibaba#539 has only QuantizeOutput, which is identical and auto-merged). - protos/models/sid_model.proto: keep SinkhornConfig/ClipConfig/SidRqvae on top of the shared FaissKmeansConfig + SidRqkmeans. - protos/model.proto: keep `SidRqvae sid_rqvae = 600;` (master still has it reserved as a comment).
Second of three PRs splitting the Semantic-ID models onto the shared base from #538. Adds the concrete RQ-KMeans backend on top of ResidualQuantizer / BaseSidModel; RQ-VAE follows in PR3.