yuragi — Research Findings

This file contains the key empirical discoveries underlying yuragi. For tool usage and API docs, see README.md.


Key Discoveries (v0.4.1)

R1. Two-Phase Processing

Perturbation propagates through transformer layers in two phases: recognition (entropy decreases as layers identify familiar patterns), then disruption (entropy increases as the perturbation destabilizes the representation). The pattern is observed in 5.1%–57.7% of prompt pairs, depending on the detection criterion (n=49, Pythia-410M white-box experiments; mean critical layer 16.65, stdev 3.98). One specific signed-threshold criterion gives 30.6%; across criteria and subsets, rates vary from 0% (sign-only, strict templates) to 57.7% (permissive sign-only on the initial n=26 capital subset, which was pre-selected and is upwardly biased). The preregistered estimate on the n=159 capital extension is 25.2%.

Data: docs/bench/real/whitebox_n50_pythia410m.json
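The dependence on the detection criterion can be illustrated with a minimal sketch. The criterion below is an assumption made for illustration, not yuragi's actual detector: a prompt pair counts as two-phase if every layer before some split has an entropy change below `-threshold` and every layer from the split onward has a change above `threshold`. The function names and the toy deltas are hypothetical.

```python
from typing import Optional, Sequence


def find_critical_layer(delta: Sequence[float], threshold: float = 0.0) -> Optional[int]:
    """Return the layer index where recognition gives way to disruption.

    delta[i] is the entropy change at layer i (perturbed minus clean prompt).
    A pair counts as two-phase if every layer before index k has
    delta < -threshold and every layer from k onward has delta > threshold.
    Returns k, or None when no such split exists.
    """
    for k in range(1, len(delta)):
        recognition = all(d < -threshold for d in delta[:k])
        disruption = all(d > threshold for d in delta[k:])
        if recognition and disruption:
            return k
    return None


def detection_rate(pairs: Sequence[Sequence[float]], threshold: float = 0.0) -> float:
    """Fraction of prompt pairs that show the two-phase pattern."""
    if not pairs:
        return 0.0
    hits = sum(find_critical_layer(p, threshold) is not None for p in pairs)
    return hits / len(pairs)


# Made-up entropy deltas for three prompt pairs (not yuragi data).
toy_pairs = [
    [-0.2, -0.3, -0.1, 0.1, 0.4, 0.5],  # two-phase, flips at layer 3
    [0.1, 0.2, 0.3, 0.2, 0.1, 0.3],     # disruption only, never two-phase
    [-0.3, -0.4, -0.2, 0.3, 0.5, 0.6],  # two-phase with larger margins
]
print(detection_rate(toy_pairs, threshold=0.0))   # sign-only style criterion
print(detection_rate(toy_pairs, threshold=0.15))  # signed-threshold style criterion
```

On the toy data the rate falls from two of three pairs to one of three once a margin is required, which is the kind of criterion sensitivity the reported range reflects.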

R2. Phase Transitions in Generation

Adding authority prefixes to prompts causes sudden confidence regime shifts at critical thresholds. In Llama 3.1 8B (Cerebras), on a single factual prompt, confidence drops non-linearly as the prefix grows more authoritative:

| Prefix | Confidence |
|---|---|
| "Please" | 0.998 |
| "As a professional expert" | 0.893 |
| "From a scholarly perspective" | 0.730 |
| "In your capacity as the foremost" | 0.528 |
Data: docs/bench/real/phase_transition_cerebras_8b.json
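As a quick arithmetic check on the prefix table above, the snippet below computes the confidence lost at each step. The prefixes are not placed on a numerical scale in the source, so "non-linear" is read here only as "each step drops more than the previous one"; that reading is an assumption.

```python
# Confidence values copied from the prefix table above.
confidence = [0.998, 0.893, 0.730, 0.528]

# Drop between each consecutive pair of prefixes.
drops = [round(a - b, 3) for a, b in zip(confidence, confidence[1:])]
print(drops)  # [0.105, 0.163, 0.202]

# Each step loses more confidence than the previous one.
accelerating = all(later > earlier for earlier, later in zip(drops, drops[1:]))
print(accelerating)  # True
```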

R3. Confidence Stability Scaling Law

Empirical data across 5 models (1.2B to 22B active parameters, 3 perturbation types × logprob mode, seed=42):

| Model | Active params (B) | Mean fragility | n |
|---|---|---|---|
| LFM 2.5 | 1.2 | 0.067 | 25 |
| Llama 3.2 | 3.2 | 0.047 | 15* |
| Gemma 4 (e4b) | 4.5 | 0.039 | 25 |
| Llama 3.1 (Cerebras) | 8.0 | 0.037 | 25 |
| Qwen 3 235B-A22B | 22.0 | 0.025 | 22* |

*Partial category coverage. Measured with typo/tone/paraphrase perturbations only.

Inverse-square-root scaling trend (R²=0.987, AICc-preferred 2-parameter model):

F(N) = a / √N + b

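Fitted: F(N) = 1860/√N + 0.0136, where N is the raw active parameter count (for example N = 1.2×10⁹ for LFM 2.5), not the count in billions. Fragility decreases monotonically with scale; the nonzero asymptote (b ≈ 0.014) suggests irreducible fragility even at very large N. Note: the fit uses n=5 models with mixed architectures, so the functional form is preliminary.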

Full derivation: docs/theory.md Section 15
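The fitted curve can be sanity-checked against the table with a few lines of Python. This is a minimal sketch, not the fitting code behind docs/theory.md: it plugs the reported coefficients into F(N) = a/√N + b, takes N as the raw active parameter count, and recomputes R² from the rounded table values, so the result should be close to the reported value but need not match it exactly.

```python
import math

# (model, active parameters, mean fragility) copied from the table above.
DATA = [
    ("LFM 2.5", 1.2e9, 0.067),
    ("Llama 3.2", 3.2e9, 0.047),
    ("Gemma 4 (e4b)", 4.5e9, 0.039),
    ("Llama 3.1 (Cerebras)", 8.0e9, 0.037),
    ("Qwen 3 235B-A22B", 22.0e9, 0.025),
]

A, B = 1860.0, 0.0136  # reported fit coefficients


def predicted_fragility(n_params: float, a: float = A, b: float = B) -> float:
    """Evaluate F(N) = a / sqrt(N) + b for a raw parameter count N."""
    return a / math.sqrt(n_params) + b


def r_squared(observed, predicted) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [frag for _, _, frag in DATA]
predicted = [predicted_fragility(n) for _, n, _ in DATA]

for (name, _, frag), pred in zip(DATA, predicted):
    print(f"{name:22s} observed={frag:.3f} predicted={pred:.3f}")
print(f"R^2 on rounded table values: {r_squared(observed, predicted):.3f}")
```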

R4. The Confidence-Text Coupling

When the answer text is identical (Jaccard similarity = 1.0), the maximum confidence shift is 0.021 and the mean is 0.007. All 21 perfect-match cases fall below the noise floor (τ=0.06).

Confidence tracks text change, not knowledge uncertainty.
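The perfect-match check can be sketched as follows. Everything here is illustrative: the tokenization (lowercased whitespace tokens), the record layout, and the function names are assumptions, and the records are made up. Note also that a token-set Jaccard of 1.0 ignores word order and repetition, so it is slightly weaker than exact string identity.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def perfect_match_shifts(records):
    """Absolute confidence shifts for pairs whose answers have Jaccard = 1.0.

    Each record is (baseline_answer, perturbed_answer, baseline_conf, perturbed_conf).
    """
    return [
        abs(c1 - c0)
        for a0, a1, c0, c1 in records
        if jaccard(a0, a1) == 1.0
    ]


TAU = 0.06  # noise floor quoted above

# Made-up records for illustration only (not yuragi data).
records = [
    ("Paris", "Paris", 0.98, 0.97),
    ("Paris", "paris", 0.98, 0.96),
    ("Paris", "It is Lyon", 0.98, 0.61),
]
shifts = perfect_match_shifts(records)
print(len(shifts), max(shifts) < TAU)
```

Only the two pairs with matching answers are kept, and both shifts stay under τ; the third pair, where the answer text changes, is where the large confidence move occurs.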

This finding does not motivate using fragility as a hallucination signal beyond the logprob baseline. In the real TruthfulQA experiment (llama-3.1-8b, n=412, 5-fold CV), a LogReg ensemble over 105 features achieves AUC=0.73 [0.68, 0.78] against LLM-judge labels. However, bootstrap ablation (experiments/ablation_delta_significance.py, 2026-04-17) shows that perturbation features are non-significant at best (Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35), and they actively hurt performance in the preregistered paired bootstrap (Round 5-W, commit 566823a):

  • LogReg: Δ=−0.0288, CI [−0.052, −0.006], p=0.016
  • GB: Δ=−0.0470, CI [−0.080, −0.012], p=0.007
  • CatBoost: Δ=−0.0409, CI [−0.081, −0.000], p=0.046

All 13 perturbation features decompose to K-normalized entropy deltas (baseline_confidence differences) and carry zero independent information. The solo AUC of fragility_score is ~0.50 across 6 datasets (not 0.62 as previously claimed; see the retraction in experiments/ablation_solo_universality_report.txt). On TriviaQA (n=200), baseline_confidence alone reaches AUC=0.75 [0.67, 0.82]; this replicates the Kadavath 2022 / Farquhar 2024 logprob-entropy baseline and is not a novel yuragi result.

Experiment source: experiments/ensemble_final.txt, experiments/triviaqa_scale_n200.jsonl, paper/revolutionary_reframe.md
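For readers unfamiliar with the paired-bootstrap comparison described above, here is a minimal standard-library sketch of the general technique. It is not the code in experiments/: the function names, the percentile interval, and the toy data are assumptions, and no p-value is computed. The key point is that both score sets are evaluated on the same resampled items, so Δ measures the effect of the added features rather than sampling noise.

```python
import random


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U statistic); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(labels, base_scores, full_scores, n_boot=2000, seed=42):
    """Point estimate and 95% percentile CI for AUC(full) - AUC(base).

    The same resampled indices are used for both score sets (paired design).
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample contained a single class; AUC undefined
        full = auc(y, [full_scores[i] for i in idx])
        base = auc(y, [base_scores[i] for i in idx])
        deltas.append(full - base)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    point = auc(labels, full_scores) - auc(labels, base_scores)
    return point, lo, hi


# Toy data (not yuragi data): the "full" model is the baseline plus pure noise.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(40)]
base = [y * 0.5 + rng.random() for y in labels]
full = [s + rng.gauss(0, 0.3) for s in base]
point, lo, hi = paired_bootstrap_delta(labels, base, full, n_boot=500)
print(f"delta AUC = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

A confidence interval that lies entirely below zero, as in the reported LogReg, GB, and CatBoost rows, indicates that the added features reduce AUC on the evaluated data.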


Comparison with Related Work

The table below describes design choices, not a competitive ranking. The other libraries listed are excellent within their own scopes; "Out of scope" simply means the feature is outside that library's documented focus, not that the library is deficient.

| Feature | yuragi | lm-polygraph | SelfCheckGPT | PromptBench |
|---|---|---|---|---|
| Confidence fragility measurement (a design choice) | Yes | Out of scope | Out of scope | Out of scope |
| Confidence dissociation (answer same, confidence shifts) | Yes | Out of scope | Out of scope | Out of scope |
| Black-box (any LLM API) | Yes | Partial | Yes | Partial |
| CLI-first (2 commands to result) | Yes | Library API | Library API | Library API |
| Psychology stress tests | 11 | Out of scope | Out of scope | Out of scope |
| Trilayer analysis (logprob+sampling+verbalized) | Yes | Individual layers | Sampling-only | Out of scope |
| White-box layer entropy experiments | Yes | Out of scope | Out of scope | Out of scope |
| Production applications (CI/CD, routing, guard) | 5 wired examples | Out of scope | Out of scope | Out of scope |
| Core dependencies | 3 | torch+transformers | torch+transformers | torch+transformers |

Full comparison with CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS: docs/related_work.md


Known Limitations

  • Multi-model structural reproduction: reproduced across Cerebras 8B, llama3.2 3B, and Claude 4.6 verbalized (v0.4.0 baseline); see docs/bench/real/
  • Psychology experiment fidelity: templates are inspired by cited papers, not full behavioural replications
  • Semantic Entropy uses Jaccard/cosine fallback, not NLI clustering (Farquhar et al. 2024)
  • Meta-audit (2026-04-14 + 2026-04-17) findings:
    • Output-level fragility solo AUC is ~0.50 across 6 datasets (TruthfulQA/TriviaQA/NQ-Open/NIM/Cohere/Mistral); the earlier "0.62 noise floor" figure is retracted as it did not match subsequent bootstrap tests.
    • Δ(AUC) from adding perturbation features to the no-perturbation baseline is not significant (p=0.35 on TruthfulQA/llm_label, p=0.81 on TriviaQA/is_correct); two settings with is_correct labels show a significantly negative Δ (Cerebras n=382, p=0.012; cross-family pooled n=300, p=0.033), meaning perturbations reduce AUC.
    • Circular ground-truth risk (LLM judge cues ≈ classifier features); label source matters: is_correct (flexible judge) and llm_label (LLM judge) agree on only 55.1% of items on TruthfulQA (n=412).
    • The "Confidence Inversion" observation is based on a single model (llama-3.1-8B) and requires cross-model replication before any generalization claim.
    • Data-integrity note: halueval_cerebras.jsonl, missing_axis.jsonl, nli_se_cache_n50.jsonl, qwen35_122b.jsonl contain zero-filled perturbation values and are excluded from claims.
    • See experiments/ablation_delta_significance_report.txt, experiments/ablation_solo_universality_report.txt, experiments/audit_power_and_dataquality_report.txt, and experiments/meta_audit_*.md.

Full details: KNOWN_LIMITATIONS.md


Paper

ICML 2026 MI Workshop submission (negative-result pivot complete):

"When the Baseline Is the Ceiling: 13 Perturbations Add Zero Signal Over Top-k Logprob Entropy"

arXiv submitted: 2026-04-19 (identifier TBD — update after manual submission to https://arxiv.org/submit/new)

Source: paper/icml2026_mi/. arXiv bundle: paper/icml2026_mi/arxiv_submission.tar.gz. Categories: cs.LG (primary), cs.CL, cs.AI. Three confirmed negative findings: (1) 13 perturbations HURT AUC vs baseline_confidence, (2) bc-only pivot null across 54 bootstrap samples, (3) cross-dataset confirmatory AUC=0.782 baseline-only.

Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.