Parent macro-issue: #12
Depends on: #17 (Phase 4)
Overlaps: Phase 4 (Weeks 7–10)
Goal
Leverage the quadrance-exact, rationally-wired Laplacian eigenbasis as a transform coding frame for corpus-topology-coherent embedding compression. The eigenmodes are now reproducible across runs and platforms (Phase 4), which makes the spectral basis a stable codec dictionary.
Design: SpectralQuant as transform coding
SpectralQuant treats the bottom-d eigenvectors of L_F as an orthonormal transform analogous to a DCT basis, but adapted to the corpus manifold topology. Items that are spectrally smooth (low λ / low Rayleigh quotient) have most of their energy in the low-frequency modes and compress aggressively. Items that are rough (high λ) retain energy across many modes and require more bits.
This is automatically aligned with the epiplexity decomposition: structural information (low-frequency modes, learnable) compresses; random information (high-frequency modes, irreducible) does not. The lossless limit at b=32, d=F reproduces the epiplexity compression ratio observed on CVE (38.4×).
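To make "spectrally smooth" concrete, the following is a minimal sketch of the Rayleigh quotient x^T L x / x^T x for a small dense Laplacian. The function name and the dense `Vec<Vec<f64>>` representation are assumptions for illustration only; they are not the representation of L_F used by the index.

```rust
/// Rayleigh quotient x^T L x / x^T x for a dense symmetric matrix `l`.
/// Assumes `l` is square with side `x.len()` and that `x` is non-zero.
pub fn rayleigh_quotient(l: &[Vec<f64>], x: &[f64]) -> f64 {
    // Matrix-vector product L x, one dot product per row.
    let lx: Vec<f64> = l
        .iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f64>())
        .collect();
    let num: f64 = x.iter().zip(&lx).map(|(a, b)| a * b).sum();
    let den: f64 = x.iter().map(|a| a * a).sum();
    num / den
}
```

On the Laplacian of a 3-node path graph, a constant vector has quotient 0 (perfectly smooth), while the alternating vector [1, -1, 1] has quotient 8/3, close to the largest eigenvalue 3.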
New crate: surfface-codec
Create crates/surfface-codec/ with:
```text
surfface-codec/
  src/
    lib.rs
    codec.rs       # SpectralQuantCodec
    quantise.rs    # Lloyd-Max scalar quantisation
    entropy.rs     # Arithmetic coder using H_T budget
    projector.rs   # SpectralQuantProjector (IndexScorePhase)
  benches/
    codec_bench.rs
  tests/
    round_trip.rs
    biochain.rs
```
SpectralQuantCodec
```rust
pub struct SpectralQuantCodec {
    /// Bottom-d eigenvectors of L_F: shape [F x d]
    pub basis: Array2<f32>,
    /// Per-mode variance sigma_i^2(Phi_i): shape [d]
    pub mode_variance: Array1<f32>,
    /// Per-mode quantisation step sizes (Lloyd-Max optimal): shape [d]
    pub step_sizes: Array1<f32>,
    /// Entropy model: H_T estimate per mode
    pub entropy_model: EntropyModel,
    /// WiringMetric used to build this codec (must be Quadrance*)
    pub wiring_metric: WiringMetric,
    /// Number of spectral modes retained
    pub d: usize,
    pub version: u32,
}

impl SpectralQuantCodec {
    /// Build codec from an existing ArrowSpace index.
    /// Uses the already-computed L_F eigenvectors; no extra eigendecomposition.
    pub fn from_index(index: &ArrowSpaceIndex, d: usize) -> Result<Self>;

    /// Compress item x to b bits per spectral coefficient.
    /// Returns arithmetic-coded byte vector.
    pub fn encode(&self, x: &[f32], b: usize) -> Vec<u8>;

    /// Reconstruct from compressed bytes.
    pub fn decode(&self, bytes: &[u8]) -> Vec<f32>;

    /// Theoretical compression ratio at b bits/coefficient.
    /// ratio = (F * 32) / (d * b + entropy_overhead)
    pub fn compression_ratio(&self, b: usize) -> f64;

    /// Reconstruction RMSE on a held-out sample.
    pub fn eval_rmse(&self, sample: &[Vec<f32>], b: usize) -> f32;
}
```
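As a sanity check on the ratio formula in the `compression_ratio` doc comment, the arithmetic can be sketched as a free function. The function itself and its `overhead_bits` parameter are illustrative assumptions; the issue does not specify how `entropy_overhead` is measured.

```rust
/// Theoretical compression ratio for one embedding:
/// ratio = (F * 32) / (d * b + overhead_bits).
/// `f` is the original dimension, `d` the number of retained modes,
/// `b` the bits per coefficient. `overhead_bits` is a hypothetical
/// per-item overhead for the entropy coder.
pub fn compression_ratio(f: usize, d: usize, b: usize, overhead_bits: usize) -> f64 {
    let raw_bits = (f * 32) as f64;
    let coded_bits = (d * b + overhead_bits) as f64;
    raw_bits / coded_bits
}
```

For F = 384, both (d = 64, b = 8) and (d = 128, b = 4) give 12288 / 512 = 24× before any overhead is counted, so the formula caps the ratio at 24× for those configurations.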
Forward and inverse transform
Forward (encode):
```text
x_tilde   = Phi^T x                                   // [d] spectral coefficients
x_tilde_q = quantise(x_tilde, step_sizes, b)          // Lloyd-Max
bytes     = arithmetic_encode(x_tilde_q, entropy_model)
```
Inverse (decode):
```text
x_tilde_q = arithmetic_decode(bytes, entropy_model)
x_hat     = Phi * x_tilde_q                           // [F] reconstructed embedding
```
Reconstruction error: ||x - x_hat||^2 = Q(x, x_hat) — expressible as a quadrance, and available as a score.
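The projection and reconstruction steps above, without quantisation or entropy coding, can be sketched as follows. This is a minimal illustration with plain slices. The row-per-mode layout (d rows of length F, the transpose of the [F x d] shape in the struct) and the function names are assumptions, not the crate's API.

```rust
/// Forward transform: x_tilde = Phi^T x.
/// `basis` holds d orthonormal modes, each a row of length F (assumed layout).
pub fn forward(basis: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    basis
        .iter()
        .map(|phi| phi.iter().zip(x).map(|(p, xi)| p * xi).sum::<f32>())
        .collect()
}

/// Inverse transform: x_hat = Phi * x_tilde, a weighted sum of the modes.
pub fn inverse(basis: &[Vec<f32>], coeffs: &[f32], f: usize) -> Vec<f32> {
    let mut x_hat = vec![0.0f32; f];
    for (phi, c) in basis.iter().zip(coeffs) {
        for (out, p) in x_hat.iter_mut().zip(phi) {
            *out += c * p;
        }
    }
    x_hat
}

/// Quadrance Q(a, b) = ||a - b||^2 (squared Euclidean distance).
pub fn quadrance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(u, v)| (u - v) * (u - v)).sum()
}
```

With the two standard basis vectors e_1, e_2 of R^3 as modes, x = [3, 4, 5] reconstructs to [3, 4, 0], and Q(x, x_hat) = 25 is exactly the energy in the discarded third direction.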
Lloyd–Max quantisation per mode
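Each spectral coefficient x_tilde_i is modelled as Gaussian with variance sigma_i^2 = mode_variance[i]. The initial step size for b bits uses a uniform approximation to the Lloyd–Max quantiser: 2^b levels are spread across a range of width 2 * sqrt(3) * sigma_i, which is the width of a uniform distribution with standard deviation sigma_i:
```text
delta_i = 2 * sigma_i * sqrt(3) / (2^b - 1)   // uniform approximation
```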
For non-Gaussian modes, run 5 iterations of the Lloyd–Max algorithm on a calibration sample (1 % of corpus).
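A minimal sketch of both steps, assuming scalar quantisation of one mode at a time: the uniform step size from the formula above, and a plain Lloyd–Max refinement (nearest-level assignment followed by a centroid update) on calibration samples. Function names and the `Vec<f32>` representation of the reconstruction levels are assumptions for illustration.

```rust
/// Uniform-approximation step size: delta = 2 * sqrt(3) * sigma / (2^b - 1).
/// Assumes 1 <= b <= 32.
pub fn uniform_step(sigma: f32, b: u32) -> f32 {
    2.0 * 3.0f32.sqrt() * sigma / ((1u64 << b) - 1) as f32
}

/// Index of the reconstruction level closest to `s`.
/// Assumes `levels` is non-empty.
pub fn nearest(levels: &[f32], s: f32) -> usize {
    let mut best = 0;
    for (k, &l) in levels.iter().enumerate() {
        if (s - l).abs() < (s - levels[best]).abs() {
            best = k;
        }
    }
    best
}

/// Lloyd-Max refinement: each iteration assigns every sample to its nearest
/// level, then moves each level to the mean of the samples assigned to it.
/// Levels that receive no samples are left unchanged.
pub fn lloyd_max(samples: &[f32], mut levels: Vec<f32>, iters: usize) -> Vec<f32> {
    for _ in 0..iters {
        let mut sum = vec![0.0f64; levels.len()];
        let mut count = vec![0usize; levels.len()];
        for &s in samples {
            let k = nearest(&levels, s);
            sum[k] += s as f64;
            count[k] += 1;
        }
        for k in 0..levels.len() {
            if count[k] > 0 {
                levels[k] = (sum[k] / count[k] as f64) as f32;
            }
        }
    }
    levels
}
```

For example, samples [0, 1, 9, 10] with initial levels [2, 8] converge to the two cluster means [0.5, 9.5] after the first iteration and stay there for the remaining four.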
Entropy coding
Use the epiplexity H_T estimate as the per-item entropy budget:
- Items with low H_T (structurally regular, high epiplexity compression) get a tighter arithmetic code.
- Items with high H_T (high randomness) get more bits allocated.
This is the direct computational realisation of the epiplexity MDL criterion applied to compression.
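How H_T is computed is not specified here, so the following sketch only illustrates the standard relationship the budget relies on: for a given symbol probability model, an arithmetic coder approaches the ideal code length of -log2 p(s) bits per symbol, so a more skewed (lower-entropy) model yields shorter codes. The function names and the flat `probs` table are assumptions, not the proposed EntropyModel.

```rust
/// Ideal code length in bits for `symbols` under the model `probs`,
/// where probs[s] is the modelled probability of symbol s (assumed > 0
/// for every symbol that occurs). Arithmetic coding approaches this sum.
pub fn ideal_code_length_bits(symbols: &[usize], probs: &[f64]) -> f64 {
    symbols.iter().map(|&s| -probs[s].log2()).sum()
}

/// Shannon entropy of the model in bits per symbol.
pub fn entropy_bits(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}
```

With probabilities [0.5, 0.25, 0.25], the model entropy is 1.5 bits per symbol, and the sequence [0, 0, 1, 2] costs 1 + 1 + 2 + 2 = 6 bits, compared with 8 bits under a fixed 2-bit code.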
SpectralQuantProjector
Slots into IndexScorePhase alongside TauModeScore:
```rust
pub struct SpectralQuantProjector {
    pub codec: Arc<SpectralQuantCodec>,
    /// Bits per coefficient for reconstruction error scoring
    pub b: usize,
}

impl IndexScoreProjector for SpectralQuantProjector {
    /// Returns Q(x, x_hat) = ||x - decode(encode(x, b))||^2
    /// High score = item is structurally anomalous (hard to compress spectrally).
    fn project(&self, item: &EmbeddingVector, ctx: &IndexContext) -> f32;

    fn name(&self) -> &'static str { "spectral_quant_error" }
}
```
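The score itself can be sketched without the index types. `RoundTripScorer` below is a hypothetical stand-in that holds only a basis and per-mode step sizes; it snaps each coefficient to the nearest multiple of its step and omits both the b-bit clamp and the arithmetic coder, so it approximates rather than reproduces decode(encode(x, b)).

```rust
/// Hypothetical stand-in for the codec state needed for scoring.
pub struct RoundTripScorer {
    /// d basis vectors of length F, one row per mode (assumed layout).
    pub basis: Vec<Vec<f32>>,
    /// Uniform quantisation step per mode.
    pub step_sizes: Vec<f32>,
}

impl RoundTripScorer {
    /// Q(x, x_hat), where x_hat is rebuilt from quantised coefficients.
    pub fn score(&self, x: &[f32]) -> f32 {
        let mut x_hat = vec![0.0f32; x.len()];
        for (phi, &delta) in self.basis.iter().zip(&self.step_sizes) {
            // Project onto the mode, then snap to the nearest multiple of delta.
            let c: f32 = phi.iter().zip(x).map(|(p, xi)| p * xi).sum();
            let c_q = (c / delta).round() * delta;
            // Accumulate the quantised contribution of this mode.
            for (out, p) in x_hat.iter_mut().zip(phi) {
                *out += c_q * p;
            }
        }
        x.iter().zip(&x_hat).map(|(a, b)| (a - b) * (a - b)).sum()
    }
}
```

With a single mode e_1 and step 0.5, the item [1, 0] lies on the retained mode and scores 0, while [0, 2] lies entirely outside it and scores 4, which is the "hard to compress spectrally" case.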
Target benchmarks (CVE corpus, N = 313 841, F = 384)
| Config | Compression ratio | Reconstruction RMSE | Notes |
|---|---|---|---|
| f32 baseline (floating-point wiring) | < 6× | n/a | existing |
| QuadranceGaussian, d=64, b=8 | ≥ 8× | < 0.01 | target |
| QuadranceGaussian, d=128, b=4 | ≥ 12× | < 0.05 | target |
| QuadranceRational, d=64, b=8 | ≥ 8× | < 0.01, bit-exact | BioChain target |
Tests
- decode(encode(x, b)) has RMSE < eval_rmse threshold for all b ∈ {4, 8, 16}.
- […] b increases.
- QuadranceRational encode → serialise → deserialise → decode produces bit-identical output on x86-64 and Apple Silicon.
- SpectralQuantProjector scores on CVE are positively correlated with TauModeScore λ (Spearman ρ > 0.3).
- compression_ratio consistency: compression_ratio(32) ≈ 38.4× (matching observed epiplexity ratio on CVE).
- from_index no-op: Building SpectralQuantCodec from an existing index adds < 2 s on CVE (reuses already-computed eigenvectors).
Acceptance criteria
- surfface-codec crate compiles and passes all tests.
- SpectralQuantCodec::from_index() operational on CVE with d=64.
- SpectralQuantProjector registered in IndexScorePhase.
- […] <index_name>.sqcodec.bin alongside the main index.
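The Spearman criterion in the Tests list (ρ > 0.3 between projector scores and TauModeScore λ) could be checked with a small helper like the one below. This is a sketch under the assumption that both score vectors are available as `f64` slices of equal length with no NaN values; the function names are illustrative.

```rust
/// 1-based ranks of `v`; tied values share the mean of their positions.
/// Panics if `v` contains NaN.
pub fn ranks(v: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..v.len()).collect();
    idx.sort_by(|&a, &b| v[a].partial_cmp(&v[b]).unwrap());
    let mut r = vec![0.0; v.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to v[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && v[idx[j + 1]] == v[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            r[idx[k]] = avg;
        }
        i = j + 1;
    }
    r
}

/// Spearman rho: the Pearson correlation of the two rank vectors.
pub fn spearman(x: &[f64], y: &[f64]) -> f64 {
    let (rx, ry) = (ranks(x), ranks(y));
    let n = rx.len() as f64;
    let mx = rx.iter().sum::<f64>() / n;
    let my = ry.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mx) * (b - my);
        vx += (a - mx) * (a - mx);
        vy += (b - my) * (b - my);
    }
    cov / (vx * vy).sqrt()
}
```

A test would then assert `spearman(&projector_scores, &tau_lambdas) > 0.3`, where both vectors are hypothetical names for the per-item scores collected on CVE.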