Phase 5: SpectralQuant compression — SpectralQuantCodec, topology-coherent quantisation, and SpectralQuantProjector #18

Description

@Mec-iS

Parent macro-issue: #12
Depends on: #17 (Phase 4)
Overlaps: Phase 4 (Weeks 7–10)

Goal

Leverage the quadrance-exact, rationally-wired Laplacian eigenbasis as a transform coding frame for corpus-topology-coherent embedding compression. The eigenmodes are now reproducible across runs and platforms (Phase 4), which makes the spectral basis a stable codec dictionary.


Design: SpectralQuant as transform coding

SpectralQuant treats the bottom-d eigenvectors of L_F as an orthonormal transform analogous to a DCT basis, but adapted to the corpus manifold topology. Items that are spectrally smooth (low λ / low Rayleigh quotient) have most of their energy in the low-frequency modes and compress aggressively. Items that are rough (high λ) retain energy across many modes and require more bits.

This is automatically aligned with the epiplexity decomposition: structural information (low-frequency modes, learnable) compresses; random information (high-frequency modes, irreducible) does not. The lossless limit at b=32, d=F reproduces the epiplexity compression ratio observed on CVE (38.4×).
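The smooth/rough distinction above can be made concrete with the Rayleigh quotient x^T L x / x^T x. The following is a minimal sketch, not the project's implementation: it uses a dense 3-node path-graph Laplacian as a toy stand-in for L_F, and the function names `rayleigh_quotient` and `path3_laplacian` are hypothetical.

```rust
/// Rayleigh quotient x^T L x / x^T x for a dense symmetric Laplacian `l`.
/// Returns 0.0 for the zero vector to avoid dividing by zero.
fn rayleigh_quotient(l: &[Vec<f32>], x: &[f32]) -> f32 {
    let xtx: f32 = x.iter().map(|v| v * v).sum();
    if xtx == 0.0 {
        return 0.0;
    }
    // Sum over rows of x_i * (L x)_i.
    let xtlx: f32 = l
        .iter()
        .zip(x)
        .map(|(row, xi)| xi * row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .sum();
    xtlx / xtx
}

/// Laplacian of a 3-node path graph (toy example only, not L_F).
fn path3_laplacian() -> Vec<Vec<f32>> {
    vec![
        vec![1.0, -1.0, 0.0],
        vec![-1.0, 2.0, -1.0],
        vec![0.0, -1.0, 1.0],
    ]
}
```

On this toy graph the constant vector [1, 1, 1] has quotient 0 (perfectly smooth), while the alternating vector [1, -1, 1] has quotient 8/3 (rough), so the latter would spread its energy into higher modes.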


New crate: surfface-codec

Create crates/surfface-codec/ with:

surfface-codec/
  src/
    lib.rs
    codec.rs          # SpectralQuantCodec
    quantise.rs       # Lloyd-Max scalar quantisation
    entropy.rs        # Arithmetic coder using H_T budget
    projector.rs      # SpectralQuantProjector (IndexScorePhase)
  benches/
    codec_bench.rs
  tests/
    round_trip.rs
    biochain.rs

SpectralQuantCodec

pub struct SpectralQuantCodec {
    /// Bottom-d eigenvectors of L_F: shape [F x d]
    pub basis: Array2<f32>,
    /// Per-mode variance sigma_i^2(Phi_i): shape [d]
    pub mode_variance: Array1<f32>,
    /// Per-mode quantisation step sizes (Lloyd-Max optimal): shape [d]
    pub step_sizes: Array1<f32>,
    /// Entropy model: H_T estimate per mode
    pub entropy_model: EntropyModel,
    /// WiringMetric used to build this codec (must be Quadrance*)
    pub wiring_metric: WiringMetric,
    pub d: usize,    // number of spectral modes retained
    pub version: u32,
}

impl SpectralQuantCodec {
    /// Build codec from an existing ArrowSpace index.
    /// Uses the already-computed L_F eigenvectors — no extra eigendecomposition.
    pub fn from_index(index: &ArrowSpaceIndex, d: usize) -> Result<Self>;

    /// Compress item x to b bits per spectral coefficient.
    /// Returns arithmetic-coded byte vector.
    pub fn encode(&self, x: &[f32], b: usize) -> Vec<u8>;

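    /// Reconstruct from compressed bytes.
    /// The bit depth b is not a parameter here, so it must be recoverable
    /// from the byte stream itself (for example from a header).
    pub fn decode(&self, bytes: &[u8]) -> Vec<f32>;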

    /// Theoretical compression ratio at b bits/coefficient.
    /// ratio = (F * 32) / (d * b + entropy_overhead)
    pub fn compression_ratio(&self, b: usize) -> f64;

    /// Reconstruction RMSE on a held-out sample.
    pub fn eval_rmse(&self, sample: &[Vec<f32>], b: usize) -> f32;
}
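As a worked check of the compression_ratio formula in the doc comment above, here is a small free-function sketch. The `overhead_bits` parameter is an assumption standing in for the entropy-coder overhead, whose actual value is not specified in this issue.

```rust
/// Ratio of raw f32 storage (F * 32 bits) to coded storage (d * b bits plus
/// overhead), per item. `overhead_bits` is a hypothetical stand-in for the
/// entropy-coder overhead.
fn compression_ratio(f: usize, d: usize, b: usize, overhead_bits: f64) -> f64 {
    (f as f64 * 32.0) / ((d * b) as f64 + overhead_bits)
}
```

With F = 384 and zero overhead, both d=64, b=8 and d=128, b=4 give 12288 / 512 = 24×, since each spends 512 bits per item before entropy coding.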

Forward and inverse transform

Forward (encode):

x_tilde = Phi^T x          // [d] spectral coefficients
x_tilde_q = quantise(x_tilde, step_sizes, b)   // Lloyd-Max
bytes = arithmetic_encode(x_tilde_q, entropy_model)

Inverse (decode):

x_tilde_q = arithmetic_decode(bytes, entropy_model)
x_hat = Phi * x_tilde_q    // [F] reconstructed embedding

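Reconstruction error: ||x - x_hat||^2 = Q(x, x_hat), which is expressible as a quadrance and available as a score. For d < F it includes both the truncation error from the discarded modes and the quantisation error on the retained ones.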
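The forward and inverse steps above can be sketched without any dependencies. This is an illustration under stated assumptions, not the crate's code: the basis is held as d rows of length F (that is, Phi^T), a uniform mid-tread quantiser stands in for Lloyd-Max, the arithmetic-coding stage is omitted, and all function names are hypothetical.

```rust
/// Forward transform: x_tilde = Phi^T x.
/// `basis_t` holds Phi^T as d rows of length F (assumed orthonormal).
fn forward(basis_t: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    basis_t
        .iter()
        .map(|phi| phi.iter().zip(x).map(|(p, v)| p * v).sum::<f32>())
        .collect()
}

/// Uniform mid-tread quantiser (a stand-in for Lloyd-Max).
/// Indices are clamped to the symmetric b-bit range; requires 1 <= b <= 31.
fn quantise(coeffs: &[f32], steps: &[f32], b: u32) -> Vec<i32> {
    let max_idx = (1i32 << (b - 1)) - 1;
    coeffs
        .iter()
        .zip(steps)
        .map(|(c, s)| ((c / s).round() as i32).clamp(-max_idx, max_idx))
        .collect()
}

/// Map quantiser indices back to coefficient values.
fn dequantise(idx: &[i32], steps: &[f32]) -> Vec<f32> {
    idx.iter().zip(steps).map(|(i, s)| *i as f32 * s).collect()
}

/// Inverse transform: x_hat = Phi * coeffs.
fn inverse(basis_t: &[Vec<f32>], coeffs: &[f32]) -> Vec<f32> {
    let f = basis_t.first().map_or(0, |row| row.len());
    let mut x_hat = vec![0.0f32; f];
    for (phi, c) in basis_t.iter().zip(coeffs) {
        for (out, p) in x_hat.iter_mut().zip(phi) {
            *out += c * p;
        }
    }
    x_hat
}

/// Quadrance Q(a, b) = ||a - b||^2.
fn quadrance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(u, v)| (u - v) * (u - v)).sum()
}
```

Chaining `forward`, `quantise`, `dequantise` and `inverse`, then calling `quadrance(x, x_hat)`, yields the reconstruction-error score; keeping the quantiser separate from the transform makes it straightforward to swap in a trained Lloyd-Max codebook later.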


Lloyd–Max quantisation per mode

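Each spectral coefficient x_tilde_i is modelled as Gaussian with variance sigma_i^2 = mode_variance[i]. As a starting point, use the uniform-quantiser step size for b bits, which spreads 2^b levels across a range of 2 * sqrt(3) * sigma_i (the width of a uniform distribution with the same variance). This is an approximation rather than the exact Lloyd–Max solution for a Gaussian: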

delta_i = 2 * sigma_i * sqrt(3) / (2^b - 1)   // uniform approximation

For non-Gaussian modes, run 5 iterations of the Lloyd–Max algorithm on a calibration sample (1 % of corpus).
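A minimal sketch of the two pieces described above: the initial step size from the formula, and a plain Lloyd-Max refinement loop over a one-dimensional calibration sample. The function names and the representation of the codebook as a sorted slice of reconstruction levels are assumptions for illustration.

```rust
/// Initial uniform step: delta = 2 * sigma * sqrt(3) / (2^b - 1).
/// Requires 1 <= b <= 31.
fn initial_step(sigma: f32, b: u32) -> f32 {
    2.0 * sigma * 3.0f32.sqrt() / ((1u32 << b) - 1) as f32
}

/// One Lloyd-Max pass: assign each sample to its nearest level, then move
/// each level to the centroid of its cell. Empty cells keep their level.
fn lloyd_max_step(sample: &[f32], levels: &mut [f32]) {
    let k = levels.len();
    if k == 0 {
        return;
    }
    let mut sums = vec![0.0f64; k];
    let mut counts = vec![0usize; k];
    for &v in sample {
        // Nearest-level search (equivalent to midpoint decision thresholds).
        let mut best = 0;
        for j in 1..k {
            if (v - levels[j]).abs() < (v - levels[best]).abs() {
                best = j;
            }
        }
        sums[best] += v as f64;
        counts[best] += 1;
    }
    for j in 0..k {
        if counts[j] > 0 {
            levels[j] = (sums[j] / counts[j] as f64) as f32;
        }
    }
}

/// Run a fixed number of refinement passes (the issue suggests 5).
fn lloyd_max(sample: &[f32], levels: &mut [f32], iters: usize) {
    for _ in 0..iters {
        lloyd_max_step(sample, levels);
    }
}
```

For example, starting from levels [0.0, 1.0] on the sample [0.0, 0.2, 0.8, 1.0], the levels settle at approximately [0.1, 0.9] after the first pass and stay there.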


Entropy coding

Use the epiplexity H_T estimate as the per-item entropy budget:

  • Items with low H_T (structurally regular, high epiplexity compression) get a tighter arithmetic code.
  • Items with high H_T (high randomness) get more bits allocated.

This is the direct computational realisation of the epiplexity MDL criterion applied to compression.
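How H_T is computed is not specified in this issue, so the sketch below uses the empirical Shannon entropy of an item's quantised symbols as a stand-in, purely to show how a per-item bit budget follows from symbol statistics. Both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Empirical Shannon entropy, in bits per symbol, of a quantised stream.
/// This is a stand-in for H_T, not the epiplexity estimate itself.
fn empirical_entropy_bits(symbols: &[i32]) -> f64 {
    if symbols.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<i32, usize> = HashMap::new();
    for &s in symbols {
        *counts.entry(s).or_insert(0) += 1;
    }
    let n = symbols.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Ideal coded size in bytes under that model, rounded up.
/// A real arithmetic coder adds a small overhead on top of this bound.
fn ideal_code_bytes(symbols: &[i32]) -> usize {
    let bits = empirical_entropy_bits(symbols) * symbols.len() as f64;
    (bits / 8.0).ceil() as usize
}
```

A stream that repeats one symbol has zero entropy and a near-empty ideal code, while a stream spread evenly over four symbols needs 2 bits per symbol, which mirrors the low-H_T versus high-H_T allocation described above.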


SpectralQuantProjector

Slots into IndexScorePhase alongside TauModeScore:

pub struct SpectralQuantProjector {
    pub codec: Arc<SpectralQuantCodec>,
    pub b: usize,  // bits per coefficient for reconstruction error scoring
}

impl IndexScoreProjector for SpectralQuantProjector {
    /// Returns Q(x, x_hat) = ||x - decode(encode(x, b))||^2
    /// High score = item is structurally anomalous (hard to compress spectrally).
    fn project(&self, item: &EmbeddingVector, ctx: &IndexContext) -> f32;
    fn name(&self) -> &'static str { "spectral_quant_error" }
}

Target benchmarks (CVE corpus, N = 313 841, F = 384)

Config                                | Compression ratio | Reconstruction RMSE | Notes
--------------------------------------|-------------------|---------------------|----------------
f32 baseline (floating-point wiring)  | < 6×              | -                   | existing
QuadranceGaussian, d=64, b=8          | ≥ 8×              | < 0.01              | target
QuadranceGaussian, d=128, b=4         | ≥ 12×             | < 0.05              | target
QuadranceRational, d=64, b=8          | ≥ 8×              | < 0.01, bit-exact   | BioChain target

Tests

  • Round-trip: decode(encode(x, b)) has RMSE, as measured by eval_rmse, below the threshold set for that b, for all b ∈ {4, 8, 16}.
  • Monotonicity: RMSE strictly decreases as b increases.
  • BioChain bit-exact: QuadranceRational encode → serialise → deserialise → decode produces bit-identical output on x86-64 and Apple Silicon.
  • Score distribution: SpectralQuantProjector scores on CVE are positively correlated with TauModeScore λ (Spearman ρ > 0.3).
  • compression_ratio consistency: compression_ratio(32) ≈ 38.4× (matching observed epiplexity ratio on CVE).
  • from_index no-op: Building SpectralQuantCodec from an existing index adds < 2 s on CVE (reuses already-computed eigenvectors).
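The score-distribution test above needs a Spearman rank correlation. A dependency-free sketch follows, computing Pearson correlation on average ranks so that ties are handled; the function names are hypothetical, and inputs are assumed to be equal-length, non-constant, and free of NaN (a NaN would panic in the sort).

```rust
/// 1-based average ranks; tied values share the mean of their positions.
fn ranks(v: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..v.len()).collect();
    idx.sort_by(|&a, &b| v[a].partial_cmp(&v[b]).unwrap());
    let mut r = vec![0.0; v.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to v[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && v[idx[j + 1]] == v[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            r[idx[k]] = avg;
        }
        i = j + 1;
    }
    r
}

/// Pearson correlation of two equal-length, non-constant slices.
fn pearson(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len() as f64;
    let (ma, mb) = (a.iter().sum::<f64>() / n, b.iter().sum::<f64>() / n);
    let (mut cov, mut va, mut vb) = (0.0, 0.0, 0.0);
    for (x, y) in a.iter().zip(b) {
        cov += (x - ma) * (y - mb);
        va += (x - ma) * (x - ma);
        vb += (y - mb) * (y - mb);
    }
    cov / (va * vb).sqrt()
}

/// Spearman rho: Pearson correlation of the rank vectors.
fn spearman(a: &[f64], b: &[f64]) -> f64 {
    pearson(&ranks(a), &ranks(b))
}
```

In the test, `a` would hold the SpectralQuantProjector scores and `b` the TauModeScore λ values for the same items, with the assertion `spearman(&a, &b) > 0.3`.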

Acceptance criteria

  • surfface-codec crate compiles and passes all tests.
  • SpectralQuantCodec::from_index() operational on CVE with d=64.
  • Benchmark targets met (see table above).
  • SpectralQuantProjector registered in IndexScorePhase.
  • BioChain bit-exact CI job passing (x86-64 + Apple Silicon).
  • Codec serialises to <index_name>.sqcodec.bin alongside the main index.
  • All previous phase regression tests still pass.

Labels: enhancement
