Add MetalIndexIVFPQ with product quantization and residual encoding support#5288
Evandabest wants to merge 9 commits into
Conversation
Thanks for the PR. Could you post a performance comparison between the CPU IVFPQ and the Metal one? This would clarify for which operating points this is beneficial.

@mdouze You were right to question the performance difference between the CPU and Metal IVFPQ. Performance is currently 1:1 with the CPU because Metal IVFPQ's search() delegates entirely to the CPU index. The speedup would come from moving the per-query PQ lookup table construction and the IVF list scan onto the GPU (the merge step can reuse the existing IVF merge kernel). I'll add the Python tests regardless. Would you prefer that I bring the full GPU scan into this PR, or do the GPU scan as a follow-up PR?

Ah right, thanks for the explanation. A subsequent PR is fine.

@mdouze Here are the numbers from the follow-up GPU-scan work (this PR itself still delegates search to the CPU index; the GPU path lands in the follow-up PR). All measurements are on an M3 Pro with d=128, 8-bit PQ, and an fp32 LUT; recall relative to the CPU index is ≈ 1.0 (0.9999+).

PQ LUT computation

The LUT speedup saturates at ~3.7-3.8x once the batch is large enough to hide kernel-launch overhead (nq ≥ 10k), and is ~1.2x for tiny batches.

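For readers following the thread, the two steps discussed above (per-query PQ lookup table construction and the IVF list scan) can be sketched on the CPU as below. This is a minimal illustration, not the code in this PR or in the follow-up: the function names `buildL2Lut` and `scanList` and the `[M][ksub][dsub]` centroid layout are assumptions made for the example. With `by_residual=true`, the vector passed as `query` would be the query residual for the probed list rather than the raw query.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): build the L2 lookup table for one
// query. `centroids` is assumed to be laid out as [M][ksub][dsub], row-major.
// Returns lut[m * ksub + k] = squared L2 distance between the m-th query
// sub-vector and the k-th centroid of sub-quantizer m.
std::vector<float> buildL2Lut(const std::vector<float>& query,
                              const std::vector<float>& centroids,
                              std::size_t M, std::size_t ksub) {
    assert(M > 0 && !query.empty() && query.size() % M == 0);
    const std::size_t dsub = query.size() / M;
    assert(centroids.size() == M * ksub * dsub);

    std::vector<float> lut(M * ksub);
    for (std::size_t m = 0; m < M; ++m) {
        const float* q = query.data() + m * dsub;
        for (std::size_t k = 0; k < ksub; ++k) {
            const float* c = centroids.data() + (m * ksub + k) * dsub;
            float dist = 0.0f;
            for (std::size_t j = 0; j < dsub; ++j) {
                const float diff = q[j] - c[j];
                dist += diff * diff;
            }
            lut[m * ksub + k] = dist;
        }
    }
    return lut;
}

// Hypothetical helper (not from the PR): scan one inverted list of 8-bit PQ
// codes, laid out as [n][M], and return one approximate distance per code.
// Each distance is the sum of M table lookups (asymmetric distance computation).
std::vector<float> scanList(const std::vector<float>& lut,
                            const std::vector<std::uint8_t>& codes,
                            std::size_t M, std::size_t ksub) {
    assert(M > 0 && codes.size() % M == 0);
    assert(lut.size() == M * ksub);

    const std::size_t n = codes.size() / M;
    std::vector<float> dists(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t m = 0; m < M; ++m) {
            const std::size_t code = codes[i * M + m];
            assert(code < ksub);
            dists[i] += lut[m * ksub + code];
        }
    }
    return dists;
}
```

Both loops are independent across queries and across codes, which is why they are natural candidates for GPU kernels; the table is built once per query (or once per query and probed list when residuals are used) and then reused for every code in the list.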
Summary
Adds IVF-PQ (inverted file with product quantization) index support to the Metal GPU backend
- `MetalIndexIVFPQ` with full train/add/search/reset/copyFrom/copyTo support
- `MetalIVFPQImpl` GPU-resident IVF list storage for PQ codes (segment allocator, same pattern as IVFFlat)
- Residual encoding with `by_residual=true` (default for L2)
- `runMetalIVFPQFullSearch` with CPU LUT fallback via `runMetalIVFPQScan`
- `MetalCloner` support for IVFPQ in `index_cpu_to_metal_gpu` / `index_metal_gpu_to_cpu`

Changes
- `MetalIndexIVFPQ.h/.mm` - IVFPQ index class (train, add, search, reset, copyFrom/copyTo, cloner support)
- `impl/MetalIVFPQ.h/.mm` - GPU-resident IVF list storage with segment allocator for PQ codes
- `test/TestMetalIndexIVFPQ.mm` - 4 C++ tests (L2, IP, reset, CPU↔GPU round-trip)
- `test/CMakeLists.txt` - added TestMetalIndexIVFPQ build target

Differences from CUDA IVFPQ
- Training: Delegates to CPU (`IndexIVFPQ::train`). CUDA can train on GPU. Same rationale as IVFFlat - training is a one-time cost.
- Add path: Coarse quantization and PQ encoding run on CPU, then codes are copied to GPU storage. CUDA does both on GPU. On Apple Silicon with unified memory, the copy cost is minimal.
- Residual encoding: When `by_residual=true`, residuals (x - coarse_centroid) are computed on CPU before PQ encoding. CUDA computes residuals on GPU. Functionally equivalent.
- Lookup tables: PQ distance lookup tables are computed on CPU and uploaded to GPU for the scan phase. CUDA computes LUTs on GPU. CPU LUT computation is fast relative to the scan and avoids a separate GPU kernel launch.
- IVF list storage: Same segment allocator pattern as IVFFlat - single contiguous buffer rather than CUDA's per-list `DeviceVector` allocations.

Note
FP16 coarse quantizer and GPU merge kernel are planned optimizations for a future PR. Both apply across all IVF index types (IVFFlat, IVFPQ, IVFSQ).
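
As an illustration of the CPU-side add path described under "Differences from CUDA IVFPQ" (residual computation followed by PQ encoding), here is a minimal sketch. It is not the code in this PR: `computeResidual` and `encodePQ` are hypothetical names, the codebook layout `[M][ksub][dsub]` is an assumption, and the brute-force nearest-centroid search stands in for whatever the CPU `IndexIVFPQ` actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not from the PR): residual = x - coarse_centroid.
std::vector<float> computeResidual(const std::vector<float>& x,
                                   const std::vector<float>& coarseCentroid) {
    assert(x.size() == coarseCentroid.size());
    std::vector<float> r(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
        r[j] = x[j] - coarseCentroid[j];
    }
    return r;
}

// Hypothetical helper (not from the PR): encode a vector into M 8-bit codes.
// For each of the M sub-vectors, pick the index of the nearest centroid in the
// corresponding sub-quantizer. `centroids` is assumed to be [M][ksub][dsub].
std::vector<std::uint8_t> encodePQ(const std::vector<float>& v,
                                   const std::vector<float>& centroids,
                                   std::size_t M, std::size_t ksub) {
    assert(M > 0 && !v.empty() && v.size() % M == 0);
    assert(ksub > 0 && ksub <= 256);  // codes must fit in 8 bits
    const std::size_t dsub = v.size() / M;
    assert(centroids.size() == M * ksub * dsub);

    std::vector<std::uint8_t> code(M);
    for (std::size_t m = 0; m < M; ++m) {
        const float* sub = v.data() + m * dsub;
        float best = std::numeric_limits<float>::max();
        std::size_t bestK = 0;
        for (std::size_t k = 0; k < ksub; ++k) {
            const float* c = centroids.data() + (m * ksub + k) * dsub;
            float dist = 0.0f;
            for (std::size_t j = 0; j < dsub; ++j) {
                const float diff = sub[j] - c[j];
                dist += diff * diff;
            }
            if (dist < best) {
                best = dist;
                bestK = k;
            }
        }
        code[m] = static_cast<std::uint8_t>(bestK);
    }
    return code;
}
```

In this sketch, a vector assigned to a given coarse list would be encoded as `encodePQ(computeResidual(x, centroid), codebook, M, ksub)`, and the resulting `M` bytes are what would then be copied into the GPU-resident list storage.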
Build and test