Add MetalIndexIVFPQ with product quantization and residual encoding support by Evandabest · Pull Request #5288 · facebookresearch/faiss

Evandabest · 2026-06-08T19:05:50Z

Summary

Adds IVF-PQ (inverted file with product quantization) index support to the Metal GPU backend

Add MetalIndexIVFPQ with full train/add/search/reset/copyFrom/copyTo support
Add MetalIVFPQImpl GPU-resident IVF list storage for PQ codes (segment allocator, same pattern as IVFFlat)
Support 8-bit product quantization with precomputed per-query lookup tables
Support both L2 and inner product metrics
Residual encoding when by_residual=true (default for L2)
CPU-side PQ lookup table computation with precomputed tables optimization for L2
GPU scan path via runMetalIVFPQFullSearch with CPU LUT fallback via runMetalIVFPQScan
Update MetalCloner to support IVFPQ in index_cpu_to_metal_gpu / index_metal_gpu_to_cpu
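
For reference, the L2 precomputed-table optimization listed above rests on an identity documented for the CPU IndexIVFPQ: with query q, coarse centroid c, and decoded PQ residual r, the squared distance ||q - (c + r)||^2 splits into a per-(query, list) term, a query-independent term that can be precomputed, and a per-query term. The sketch below only checks that identity on toy vectors. The function names are illustrative and not part of this PR, and whether the Metal path splits the work exactly this way is an assumption.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def direct_sq_l2(q, c, r) -> float:
    """||q - (c + r)||^2 computed directly."""
    return sum((qi - ci - ri) ** 2 for qi, ci, ri in zip(q, c, r))


def decomposed_sq_l2(q, c, r) -> float:
    """Same distance, split into the three terms used by precomputed tables."""
    # Term 1: distance from the query to the coarse centroid
    # (already available from the coarse quantizer search).
    term1 = sum((qi - ci) ** 2 for qi, ci in zip(q, c))
    # Term 2: does not depend on the query, so it can be precomputed
    # per (list, PQ code) at training time.
    term2 = dot(r, r) + 2 * dot(c, r)
    # Term 3: depends on the query but not on the list, so it is
    # computed once per query.
    term3 = -2 * dot(q, r)
    return term1 + term2 + term3
```

Because every term is a sum over dimensions, terms 2 and 3 can be stored per sub-quantizer and per code, which is what makes them table lookups at scan time.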

Changes

New: MetalIndexIVFPQ.h/.mm - IVFPQ index class (train, add, search, reset, copyFrom/copyTo, cloner support)
New: impl/MetalIVFPQ.h/.mm - GPU-resident IVF list storage with segment allocator for PQ codes
New: test/TestMetalIndexIVFPQ.mm - 4 C++ tests (L2, IP, reset, CPU↔GPU round-trip)
Modified: test/CMakeLists.txt - added TestMetalIndexIVFPQ build target

Differences from CUDA IVFPQ

Training: Delegates to CPU (IndexIVFPQ::train). CUDA can train on GPU. Same rationale as IVFFlat - training is a one-time cost.

Add path: Coarse quantization and PQ encoding run on CPU, then codes are copied to GPU storage. CUDA does both on GPU. On Apple Silicon with unified memory, the copy cost is minimal.

Residual encoding: When by_residual=true, residuals (x - coarse_centroid) are computed on CPU before PQ encoding. CUDA computes residuals on GPU. Functionally equivalent.
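
A minimal sketch of the residual step described above, in pure Python so it runs without faiss. The helper names (`nearest`, `encode_residual`) and the two-entry toy codebooks are hypothetical; a real 8-bit PQ has 256 entries per sub-quantizer, and the actual implementation delegates to the CPU faiss code rather than looping like this.

```python
from typing import List, Sequence, Tuple


def sq_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to v in squared L2 distance."""
    return min(range(len(centroids)), key=lambda i: sq_l2(v, centroids[i]))


def encode_residual(
    x: Sequence[float],
    coarse_centroids: Sequence[Sequence[float]],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> Tuple[int, List[int]]:
    """Return (list id, PQ codes) for x, encoding x - coarse_centroid."""
    # Coarse quantization: pick the inverted list.
    list_id = nearest(x, coarse_centroids)
    # by_residual=true: encode the offset from the coarse centroid, not x.
    residual = [xi - ci for xi, ci in zip(x, coarse_centroids[list_id])]
    # PQ encoding: one code per sub-vector of the residual.
    m_count = len(codebooks)
    dsub = len(x) // m_count
    codes = [
        nearest(residual[m * dsub:(m + 1) * dsub], codebooks[m])
        for m in range(m_count)
    ]
    return list_id, codes
```

With `coarse_centroids = [[0, 0, 0, 0], [10, 10, 10, 10]]` and `x = [11, 11, 9, 9]`, the vector lands in list 1 and the PQ sees the residual `[1, 1, -1, -1]` instead of the raw coordinates.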

Lookup tables: PQ distance lookup tables are computed on CPU and uploaded to GPU for the scan phase. CUDA computes LUTs on GPU. CPU LUT computation is fast relative to the scan and avoids a separate GPU kernel launch.
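
To make the lookup-table step concrete, here is a small sketch of how an L2 table is built for one query and then used to score a stored code. It is illustrative only: `build_l2_lut` and `adc_distance` are hypothetical names, the toy codebooks have two entries instead of 256, and with by_residual=true the `query` argument would be the query minus the coarse centroid of the probed list.

```python
from typing import List, Sequence


def sq_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_l2_lut(
    query: Sequence[float],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """lut[m][j] = squared distance from sub-vector m of the query
    to entry j of sub-quantizer m (an M x ksub table)."""
    m_count = len(codebooks)
    dsub = len(query) // m_count
    return [
        [sq_l2(query[m * dsub:(m + 1) * dsub], entry) for entry in codebooks[m]]
        for m in range(m_count)
    ]


def adc_distance(lut: Sequence[Sequence[float]], codes: Sequence[int]) -> float:
    """Approximate distance to an encoded vector: one lookup per sub-quantizer."""
    return sum(lut[m][code] for m, code in enumerate(codes))
```

The table is built once per query (M x 256 entries for 8-bit PQ) and reused for every code in the scanned lists, which is why the scan dominates the cost for large lists.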

IVF list storage: Same segment allocator pattern as IVFFlat - single contiguous buffer rather than CUDA's per-list DeviceVector allocations.
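
The PR does not spell out the allocator's internals, so the following is only a guess at the general idea of "one contiguous buffer, one segment per list": each list owns an (offset, capacity, length) window into a shared buffer and moves to a larger segment at the end of the buffer when it fills up. The class name, growth policy, and the fact that old segments are never reused are all assumptions made for illustration.

```python
class SegmentedLists:
    """Toy IVF list storage: all PQ codes live in a single bytearray."""

    def __init__(self, nlist: int, code_size: int, initial_capacity: int = 4):
        self.code_size = code_size
        self.initial_capacity = initial_capacity
        self.buf = bytearray()        # the single contiguous buffer
        self.offset = [0] * nlist     # segment start, in codes
        self.capacity = [0] * nlist   # segment size, in codes
        self.length = [0] * nlist     # codes currently stored

    def codes(self, list_id: int) -> bytes:
        """Return the packed codes of one list."""
        start = self.offset[list_id] * self.code_size
        end = start + self.length[list_id] * self.code_size
        return bytes(self.buf[start:end])

    def _grow(self, list_id: int) -> None:
        """Move the list to a new, larger segment at the end of the buffer."""
        old = self.codes(list_id)
        new_cap = max(self.initial_capacity, 2 * self.capacity[list_id])
        new_off = len(self.buf) // self.code_size
        self.buf.extend(bytes(new_cap * self.code_size))
        start = new_off * self.code_size
        self.buf[start:start + len(old)] = old
        self.offset[list_id] = new_off
        self.capacity[list_id] = new_cap

    def append(self, list_id: int, code: bytes) -> None:
        """Append one PQ code to a list, growing its segment if needed."""
        if len(code) != self.code_size:
            raise ValueError("code has the wrong size")
        if self.length[list_id] == self.capacity[list_id]:
            self._grow(list_id)
        start = (self.offset[list_id] + self.length[list_id]) * self.code_size
        self.buf[start:start + self.code_size] = code
        self.length[list_id] += 1
```

A single buffer means one allocation to bind for the scan kernel, at the cost of copying a list when its segment overflows. This sketch simply abandons the old segment, which a real allocator would presumably recycle.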

Note

FP16 coarse quantizer and GPU merge kernel are planned optimizations for a future PR. Both apply across all IVF index types (IVFFlat, IVFPQ, IVFSQ).

Build and test

cmake -B build \
  -DFAISS_ENABLE_GPU=OFF \
  -DFAISS_ENABLE_METAL=ON \
  -DBUILD_TESTING=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$(brew --prefix libomp)" \
  .
cmake --build build --target faiss faiss_metal TestMetalIndexIVFPQ -j$(sysctl -n hw.logicalcpu)
cd build && ctest -R TestMetalIndexIVFPQ --output-on-failure

mdouze · 2026-06-10T09:04:03Z

Thanks for the PR. Could you post a performance comparison between the CPU IVFPQ and the Metal one? This would clarify for which operating points this is beneficial.
Also please add tests for the code in Python.

Evandabest · 2026-06-10T12:34:55Z

@mdouze You were right to question the performance of Metal IVFPQ relative to the CPU. Performance is currently 1:1 with the CPU because Metal IVFPQ's search() delegates entirely to the CPU index. The speedup would come from moving the per-query PQ lookup table construction and the IVF list scan onto the GPU (the merge step can reuse the existing IVF merge kernel). I'll add the Python tests regardless. Would you prefer that I bring the full GPU scan into this PR, or do it as a follow-up PR?
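
For readers unfamiliar with the merge step mentioned here: each probed list yields its own candidates, and those are combined into a single top-k per query. The sketch below shows the idea on the CPU with the standard library. It is not the existing IVF merge kernel, and `merge_topk` is a hypothetical name.

```python
import heapq
from itertools import chain
from typing import Iterable, List, Tuple

Candidate = Tuple[float, int]  # (distance or score, vector id)


def merge_topk(
    per_list_results: Iterable[Iterable[Candidate]],
    k: int,
    metric: str = "l2",
) -> List[Candidate]:
    """Combine per-list candidates into one top-k list for a query.

    For L2 the smallest distances win; for inner product the largest
    scores win.
    """
    candidates = chain.from_iterable(per_list_results)
    if metric == "l2":
        return heapq.nsmallest(k, candidates)
    return heapq.nlargest(k, candidates)
```

Ties on distance are broken by vector id here simply because tuples compare element by element.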

mdouze · 2026-06-11T05:58:11Z

Ah right, thanks for the explanation. A subsequent PR is fine.
Could you measure the speedup of PQ LUT computation then, for the record?

Evandabest · 2026-06-11T13:55:43Z

@mdouze Here are the numbers from the follow-up GPU-scan work (this PR itself still delegates search to the CPU index; the GPU path lands in the follow-up PR). All measured on an M3 Pro, d=128, 8-bit PQ, fp32 LUT, recall vs CPU ≈ 1.0 (0.9999+).

PQ LUT computation
Metal ivfpq_build_lut_l2 kernel vs faiss CPU ProductQuantizer::compute_distance_tables (same M×256 tables, numerically equivalent, max rel diff ~1e-7):

| Case | CPU | Metal | Speedup |
| --- | --- | --- | --- |
| M=8, nq=10k | 12.3 ms | 4.4 ms | 2.8x |
| M=16, nq=10k | 17.1 ms | 4.6 ms | 3.7x |
| M=32, nq=10k | 30.0 ms | 8.1 ms | 3.7x |
| M=16, nq=1k | 1.78 ms | 1.49 ms | 1.2x |
| M=16, nq=50k | 84.2 ms | 21.9 ms | 3.8x |
| M=16, nq=100k | 160.2 ms | 43.5 ms | 3.7x |

For M ≥ 16, the LUT speedup saturates at ~3.7-3.8x once the batch is large enough to hide kernel-launch overhead (nq ≥ 10k). It drops to 2.8x for M=8 at nq=10k and to ~1.2x for a small batch (M=16, nq=1k).
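
The speedup column is just CPU time divided by Metal time. A short script that recomputes it from the timings reported above, as a sanity check (the variable names are illustrative):

```python
# (M, nq, CPU ms, Metal ms), copied from the table in this comment.
ROWS = [
    (8, 10_000, 12.3, 4.4),
    (16, 10_000, 17.1, 4.6),
    (32, 10_000, 30.0, 8.1),
    (16, 1_000, 1.78, 1.49),
    (16, 50_000, 84.2, 21.9),
    (16, 100_000, 160.2, 43.5),
]


def speedup(cpu_ms: float, metal_ms: float) -> float:
    """How many times faster the Metal kernel is than the CPU path."""
    return cpu_ms / metal_ms


# Speedup per (M, nq) case, rounded to one decimal as in the table.
SPEEDUPS = {
    (m, nq): round(speedup(cpu_ms, metal_ms), 1)
    for m, nq, cpu_ms, metal_ms in ROWS
}
```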

Evandabest added 7 commits June 8, 2026 13:43

Add MetalIVFPQImpl GPU-resident IVF list storage for PQ codes

24b2570

Add MetalIndexIVFPQ index class with add_core() fix

ba34b35

Add TestMetalIndexIVFPQ with L2, IP, reset, and round-trip tests

cba7dd3

Fix PQ encoding to use residuals when by_residual is true

4c9dce1

Add verifyPQSettings_ validation for constructor and train

c34e4e4

Add by_residual and polysemous_ht validation in copyFrom

0723fd6

Add MetalIndexIVFPQ and MetalIVFPQ to CMake build

68ba2a1

meta-cla Bot added the CLA Signed label Jun 8, 2026

Add TestMetalIndexIVFPQ to Metal CI build and test targets

cd1ce45

Add IVFPQ Python tests for L2, IP, reset, and round-trip

e7f89a6

Evandabest wants to merge 9 commits into facebookresearch:main from Evandabest:metal-ivfpq
