This file contains the key empirical discoveries underlying yuragi. For tool usage and API docs, see README.md.
Perturbation propagates through transformer layers in two phases: recognition (entropy decreases as layers identify familiar patterns) then disruption (entropy increases as the perturbation destabilizes the representation). The pattern appears in 5.1%–57.7% of prompt pairs, depending on the detection criterion (n=49, Pythia-410M white-box experiments; mean critical layer 16.65, stdev 3.98). A specific signed-threshold criterion gives 30.6%; rates vary from 0% (sign-only, strict templates) to 57.7% (permissive sign-only on the initial n=26 capital subset, which was pre-selected and is upwardly biased). The preregistered estimate on the n=159 capital extension is 25.2%.
Data: docs/bench/real/whitebox_n50_pythia410m.json
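
As a rough illustration of what a detection criterion for the two-phase pattern could look like, the sketch below picks the layer split that best separates an early entropy decrease from a late entropy increase. The function name `critical_layer`, the `delta` input (per-layer entropy of the perturbed prompt minus the clean prompt), and the `threshold` parameter are all hypothetical; the criteria actually used in the experiments are defined in the project's own code and may differ.

```python
from typing import Optional, Sequence


def critical_layer(delta: Sequence[float], threshold: float = 0.0) -> Optional[int]:
    """Return the split layer k of a recognition-then-disruption pattern, or None.

    delta[i] is a hypothetical per-layer entropy difference
    (perturbed minus clean). The pattern is accepted when the mean of
    delta[:k] is below -threshold and the mean of delta[k:] is above
    +threshold, for the k that maximizes the gap between the two means.
    """
    n = len(delta)
    if n < 2:
        return None

    def means(k: int) -> tuple:
        early = sum(delta[:k]) / k
        late = sum(delta[k:]) / (n - k)
        return early, late

    # Choose the split with the largest late-minus-early gap.
    best_k = max(range(1, n), key=lambda k: means(k)[1] - means(k)[0])
    early, late = means(best_k)
    if early < -threshold and late > threshold:
        return best_k
    return None


# threshold=0.0 behaves like a sign-only check; a positive value is stricter.
print(critical_layer([-0.2, -0.1, -0.1, 0.3, 0.4]))
print(critical_layer([-0.2, -0.1, -0.1, 0.3, 0.4], threshold=0.2))
```

A stricter threshold rejects pairs that a sign-only check accepts, which is one way the reported rate can depend so strongly on the criterion.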
Adding authority prefixes to prompts causes sudden confidence regime shifts at critical thresholds. In Cerebras LLaMA 3.1-8B on a factual prompt, confidence drops non-linearly:
| Prefix | Confidence |
|---|---|
| "Please" | 0.998 |
| "As a professional expert" | 0.893 |
| "From a scholarly perspective" | 0.730 |
| "In your capacity as the foremost" | 0.528 |
Data: docs/bench/real/phase_transition_cerebras_8b.json
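
To make the non-linearity concrete, the short sketch below computes the confidence drop between consecutive rows of the table. Treating the row order as an increasing "authority level" is an assumption taken from the table layout, not something the data file is known to encode.

```python
# (prefix, confidence) pairs copied from the table above, in table order.
confidence = [
    ("Please", 0.998),
    ("As a professional expert", 0.893),
    ("From a scholarly perspective", 0.730),
    ("In your capacity as the foremost", 0.528),
]

# Drop in confidence from each row to the next.
drops = [
    round(prev_conf - cur_conf, 3)
    for (_, prev_conf), (_, cur_conf) in zip(confidence, confidence[1:])
]
print(drops)  # [0.105, 0.163, 0.202]

# Each step is larger than the one before it.
growing = all(a < b for a, b in zip(drops, drops[1:]))
print(growing)  # True
```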
Empirical data across 5 models (1.2B to 22B active parameters, 3 perturbation types × logprob mode, seed=42):
| Model | Active params (B) | Mean fragility | n |
|---|---|---|---|
| LFM 2.5 | 1.2 | 0.067 | 25 |
| Llama 3.2 | 3.2 | 0.047 | 15* |
| Gemma 4 (e4b) | 4.5 | 0.039 | 25 |
| Llama 3.1 (Cerebras) | 8.0 | 0.037 | 25 |
| Qwen 3 235B-A22B | 22.0 | 0.025 | 22* |
*Partial category coverage. Measured with typo/tone/paraphrase perturbations only.
Inverse-square-root scaling trend (R²=0.987, AICc-preferred 2-parameter model):
F(N) = a / √N + b
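Fitted: F(N) = 1860/√N + 0.0136, where N is the raw active-parameter count (for example, 8.0 × 10⁹ for the 8B model), not the value in billions shown in the table. Fragility decreases monotonically with scale; the nonzero asymptote (b ≈ 0.014) suggests irreducible fragility even at very large N. Note: n=5 models with mixed architectures; the functional form is preliminary.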
Full derivation: docs/theory.md Section 15
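
The fit can be sanity-checked from the table alone. The sketch below runs an unweighted least-squares fit of F(N) = a/√N + b on the five rounded table values, assuming N is the raw active-parameter count. The helper name `fit_inverse_sqrt` is illustrative, and the procedure in docs/theory.md may differ (for example in weighting or in using unrounded data), so small deviations from the reported coefficients are expected.

```python
import math

# (model, active parameters in billions, mean fragility) from the table above.
rows = [
    ("LFM 2.5", 1.2, 0.067),
    ("Llama 3.2", 3.2, 0.047),
    ("Gemma 4 (e4b)", 4.5, 0.039),
    ("Llama 3.1 (Cerebras)", 8.0, 0.037),
    ("Qwen 3 235B-A22B", 22.0, 0.025),
]


def fit_inverse_sqrt(points):
    """Ordinary least squares for F(N) = a / sqrt(N) + b.

    points: iterable of (params_in_billions, fragility).
    Returns (a, b, r_squared), with N converted to a raw count.
    """
    xs = [1.0 / math.sqrt(n_b * 1e9) for n_b, _ in points]
    ys = [f for _, f in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


a, b, r2 = fit_inverse_sqrt([(n_b, f) for _, n_b, f in rows])
print(f"a={a:.0f}  b={b:.4f}  R^2={r2:.3f}")

# Per-model comparison of measured fragility against the fitted curve.
for name, n_b, f in rows:
    predicted = a / math.sqrt(n_b * 1e9) + b
    print(f"{name:22s} measured={f:.3f} fitted={predicted:.3f}")
```

With only five points and two free parameters, a high R² is weak evidence for the functional form, which is consistent with the "preliminary" caveat above.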
When the answer text is identical (Jaccard sim=1.0), max confidence shift is 0.021, mean is 0.007. All 21 perfect-match cases fall below the noise floor (τ=0.06).
Confidence tracks text change, not knowledge uncertainty.
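
The check described above can be sketched as follows. Whitespace tokenization with lowercasing is an assumption made for illustration, as are the names `jaccard` and `is_dissociation`; only the Jaccard criterion and the noise floor τ=0.06 come from the text.

```python
NOISE_FLOOR = 0.06  # tau from the text


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_dissociation(answer_a: str, answer_b: str,
                    conf_a: float, conf_b: float,
                    tau: float = NOISE_FLOOR) -> bool:
    """True when the answers match exactly by Jaccard but confidence moves beyond tau."""
    same_answer = jaccard(answer_a, answer_b) == 1.0
    return same_answer and abs(conf_a - conf_b) > tau


# A shift of 0.021 (the maximum reported above) stays under the noise floor.
print(is_dissociation("Paris", "Paris", 0.950, 0.929))  # False
# A hypothetical larger shift with an unchanged answer would be flagged.
print(is_dissociation("Paris", "Paris", 0.950, 0.800))  # True
```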
This finding does not motivate using fragility as a hallucination signal beyond the logprob baseline. In the real TruthfulQA experiment (llama-3.1-8b, n=412, 5-fold CV), a LogReg ensemble over 105 features achieves AUC=0.73 [0.68, 0.78] against LLM-judge labels. However, bootstrap ablation (experiments/ablation_delta_significance.py, 2026-04-17) shows that perturbation features are non-significant at best (Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35) and actively hurt performance in the pre-registered paired bootstrap (Round 5-W, commit 566823a): LogReg Δ=−0.0288, CI [−0.052, −0.006], p=0.016; GB Δ=−0.0470, CI [−0.080, −0.012], p=0.007; CatBoost Δ=−0.0409, CI [−0.081, −0.000], p=0.046.

All 13 perturbation features decompose to K-normalized entropy deltas (baseline_confidence differences) and carry zero independent information. fragility_score solo AUC is ~0.50 across 6 datasets (not 0.62 as previously claimed; see the retraction in experiments/ablation_solo_universality_report.txt). On TriviaQA (n=200), baseline_confidence solo reaches AUC=0.75 [0.67, 0.82], which replicates the Kadavath 2022 / Farquhar 2024 logprob-entropy baseline and is not a novel yuragi result.
Experiment source: experiments/ensemble_final.txt, experiments/triviaqa_scale_n200.jsonl, paper/revolutionary_reframe.md
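
For readers unfamiliar with the paired bootstrap used for the Δ(AUC) figures, here is a minimal sketch on synthetic data. It is not the code in experiments/ablation_delta_significance.py: the function names, the percentile interval, and the synthetic scores are all assumptions chosen to show the idea of resampling the same rows for both models.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(labels, base, full, n_boot=300, seed=42):
    """Point estimate and 95% percentile CI for AUC(full) - AUC(base).

    Each resample draws one set of row indices and scores both models
    on it, so the difference is paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUC needs both classes in the resample
            continue
        d = auc(y, [full[i] for i in idx]) - auc(y, [base[i] for i in idx])
        deltas.append(d)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return auc(labels, full) - auc(labels, base), lo, hi


# Synthetic stand-in: "full" is the baseline score plus extra noise,
# mimicking features that add no independent information.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(120)]
base = [y + rng.gauss(0, 1) for y in labels]
full = [s + rng.gauss(0, 0.5) for s in base]

delta, lo, hi = paired_bootstrap_delta(labels, base, full)
print(f"delta AUC = {delta:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

An interval that lies entirely below zero corresponds to the "actively hurt" cases reported above; an interval that straddles zero corresponds to the non-significant case.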
The table below describes design choices, not a competitive ranking. The other libraries listed are excellent within their own scopes; "Out of scope" simply means the feature is outside that library's documented focus, not that the library is deficient.
| | yuragi | lm-polygraph | SelfCheckGPT | PromptBench |
|---|---|---|---|---|
| Confidence fragility measurement (a design choice) | Yes | Out of scope | Out of scope | Out of scope |
| Confidence dissociation (answer same, confidence shifts) | Yes | Out of scope | Out of scope | Out of scope |
| Black-box (any LLM API) | Yes | Partial | Yes | Partial |
| CLI-first (2 commands to result) | Yes | Library API | Library API | Library API |
| Psychology stress tests | 11 | Out of scope | Out of scope | Out of scope |
| Trilayer analysis (logprob+sampling+verbalized) | Yes | Individual layers | Sampling-only | Out of scope |
| White-box layer entropy experiments | Yes | Out of scope | Out of scope | Out of scope |
| Production applications (CI/CD, routing, guard) | 5 wired examples | Out of scope | Out of scope | Out of scope |
| Core dependencies | 3 | torch+transformers | torch+transformers | torch+transformers |
Full comparison with CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS: docs/related_work.md
- Multi-model structural reproduction: reproduced across Cerebras 8B, llama3.2 3B, and Claude 4.6 verbalized (v0.4.0 baseline); see docs/bench/real/
- Psychology experiment fidelity: templates are inspired by cited papers, not full behavioural replications
- Semantic Entropy uses Jaccard/cosine fallback, not NLI clustering (Farquhar et al. 2024)
- Meta-audit (2026-04-14 + 2026-04-17) findings:
  - Output-level fragility solo AUC is ~0.50 across 6 datasets (TruthfulQA/TriviaQA/NQ-Open/NIM/Cohere/Mistral); the earlier "0.62 noise floor" figure is retracted as it did not match subsequent bootstrap tests.
  - Δ(AUC) of adding perturbation features vs no-perturbation baseline is not significant (p=0.35 on TruthfulQA/llm_label, p=0.81 on TriviaQA/is_correct); two settings with is_correct labels show a negative significant Δ (Cerebras n=382 p=0.012; cross-family pooled n=300 p=0.033), meaning perturbations reduce AUC.
  - Circular ground-truth risk (LLM judge cues ≈ classifier features); label source matters: is_correct (flexible judge) vs llm_label (LLM-judge) agree only 55.1% on TruthfulQA n=412.
  - "Confidence Inversion" observation is based on a single model (llama-3.1-8B) and requires cross-model replication before any generalization claim.
  - Data-integrity note: halueval_cerebras.jsonl, missing_axis.jsonl, nli_se_cache_n50.jsonl, qwen35_122b.jsonl contain zero-filled perturbation values and are excluded from claims.
  - See experiments/ablation_delta_significance_report.txt, experiments/ablation_solo_universality_report.txt, experiments/audit_power_and_dataquality_report.txt, and experiments/meta_audit_*.md.
Full details: KNOWN_LIMITATIONS.md
ICML 2026 MI Workshop submission (negative-result pivot complete):
"When the Baseline Is the Ceiling: 13 Perturbations Add Zero Signal Over Top-k Logprob Entropy"
arXiv submitted: 2026-04-19 (identifier TBD — update after manual submission to https://arxiv.org/submit/new)
Source: paper/icml2026_mi/. arXiv bundle: paper/icml2026_mi/arxiv_submission.tar.gz. Categories: cs.LG (primary), cs.CL, cs.AI. Three confirmed negative findings: (1) 13 perturbations HURT AUC vs baseline_confidence, (2) bc-only pivot null across 54 bootstrap samples, (3) cross-dataset confirmatory AUC=0.782 baseline-only.
Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.