yuragi — Research Findings

This file contains the key empirical discoveries underlying yuragi. For tool usage and API docs, see README.md.

Key Discoveries (v0.4.1)

R1. Two-Phase Processing

Perturbation propagates through transformer layers in two phases: recognition (entropy decreases as layers identify familiar patterns), then disruption (entropy increases as the perturbation destabilizes the representation). The pattern is observed in 5.1%–57.7% of prompt pairs, depending on the detection criterion (n=49, Pythia-410M white-box experiments; mean critical layer 16.65, stdev 3.98). A specific signed-threshold criterion yields 30.6%; across criteria and subsets, rates range from 0% (sign-only, strict templates) to 57.7% (permissive sign-only on the initial n=26 capital subset, which was pre-selected and is therefore upwardly biased). The preregistered estimate on the n=159 capital extension is 25.2%.

Data: docs/bench/real/whitebox_n50_pythia410m.json
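To make the idea of a detection criterion concrete, here is a minimal sketch of one possible two-phase test on a per-layer entropy trace. The function name `critical_layer`, the use of the minimum-entropy layer as the critical layer, and the `threshold` parameter are all assumptions made for illustration; they are not the criteria used in the yuragi experiments.

```python
from typing import Optional, Sequence


def critical_layer(entropy: Sequence[float], threshold: float = 0.0) -> Optional[int]:
    """Return the layer index where recognition turns into disruption, or None.

    Hypothetical criterion: take the minimum-entropy layer as the candidate
    critical layer, then require that entropy fell by more than `threshold`
    before it and rose by more than `threshold` after it.
    threshold=0.0 behaves like a permissive sign-only test; a larger value
    is a stricter thresholded test.
    """
    if len(entropy) < 3:
        return None  # need at least one layer on each side of the minimum
    k = min(range(len(entropy)), key=lambda i: entropy[i])
    drop = entropy[0] - entropy[k]   # size of the recognition phase
    rise = entropy[-1] - entropy[k]  # size of the disruption phase
    if drop > threshold and rise > threshold:
        return k
    return None


# Entropy falls through layer 2, then rises: two-phase, critical layer 2.
print(critical_layer([2.0, 1.5, 1.0, 1.4, 1.9]))
# Monotone decrease: no disruption phase, so no critical layer.
print(critical_layer([3.0, 2.0, 1.0]))
```

Raising `threshold` in this sketch can only lower the detection rate, which shows why a reported rate needs to be read together with the criterion that produced it.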

R2. Phase Transitions in Generation

Adding authority prefixes to prompts causes sudden confidence regime shifts at critical thresholds. In Cerebras LLaMA 3.1-8B on a factual prompt, confidence drops non-linearly:

Prefix	Confidence
"Please"	0.998
"As a professional expert"	0.893
"From a scholarly perspective"	0.730
"In your capacity as the foremost"	0.528

Data: docs/bench/real/phase_transition_cerebras_8b.json
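As a quick arithmetic check on the table above, the sketch below computes the confidence drop between consecutive prefixes. Treating the four prefixes as equally spaced ordered steps is an assumption; the source gives no numeric authority scale.

```python
# (prefix, confidence) pairs copied from the table above.
CONFIDENCE = [
    ("Please", 0.998),
    ("As a professional expert", 0.893),
    ("From a scholarly perspective", 0.730),
    ("In your capacity as the foremost", 0.528),
]

# Drop in confidence from each prefix to the next one in the table.
drops = [
    round(prev[1] - curr[1], 3)
    for prev, curr in zip(CONFIDENCE, CONFIDENCE[1:])
]
print(drops)
```

Each successive drop is larger than the one before it, which is the sense in which the decline is non-linear over these four rows.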

R3. Confidence Stability Scaling Law

Empirical data across 5 models (1.2B to 22B active parameters, 3 perturbation types × logprob mode, seed=42):

Model	Active params (B)	Mean fragility	n
LFM 2.5	1.2	0.067	25
Llama 3.2	3.2	0.047	15*
Gemma 4 (e4b)	4.5	0.039	25
Llama 3.1 (Cerebras)	8.0	0.037	25
Qwen 3 235B-A22B	22.0	0.025	22*

*Partial category coverage. Measured with typo/tone/paraphrase perturbations only.

Inverse-square-root scaling trend (R²=0.987, AICc-preferred 2-parameter model):

F(N) = a / √N + b

Fitted: F(N) = 1860/√N + 0.0136, with N the active-parameter count in raw units (for example, N = 8.0 × 10⁹ for the 8B model). Fragility decreases monotonically with scale; the nonzero asymptote (b ≈ 0.014) suggests irreducible fragility even at very large N. Note: the fit uses only n=5 models with mixed architectures, so the functional form is preliminary.

Full derivation: docs/theory.md Section 15
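The fitted curve can be checked against the table directly. The sketch below plugs the reported constants into F(N) = a/√N + b and recomputes R² over the five models. It assumes N is the raw active-parameter count (billions × 10⁹), which is the reading under which the constants reproduce the table; the variable and function names are illustrative.

```python
import math

# (model, active parameters in billions, mean fragility) from the table above.
MODELS = [
    ("LFM 2.5", 1.2, 0.067),
    ("Llama 3.2", 3.2, 0.047),
    ("Gemma 4 (e4b)", 4.5, 0.039),
    ("Llama 3.1 (Cerebras)", 8.0, 0.037),
    ("Qwen 3 235B-A22B", 22.0, 0.025),
]

# Fitted constants as reported in the text (rounded).
A, B = 1860.0, 0.0136


def predicted_fragility(active_params_b: float) -> float:
    """F(N) = A / sqrt(N) + B, with N converted from billions to raw count."""
    n = active_params_b * 1e9  # assumption: N is the raw parameter count
    return A / math.sqrt(n) + B


observed = [frag for _, _, frag in MODELS]
predicted = [predicted_fragility(params) for _, params, _ in MODELS]

# Coefficient of determination of the reported curve on the five points.
mean_obs = sum(observed) / len(observed)
ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
ss_tot = sum((o - mean_obs) ** 2 for o in observed)
r_squared = 1.0 - ss_res / ss_tot

for (name, _, obs), pred in zip(MODELS, predicted):
    print(f"{name:22s} observed={obs:.3f} predicted={pred:.3f}")
print(f"R^2 = {r_squared:.3f}")
```

The recomputed R² should land close to the reported 0.987; small differences are expected because the constants in the text are rounded.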

R4. The Confidence-Text Coupling

When the answer text is identical (Jaccard similarity = 1.0), the maximum confidence shift is 0.021 and the mean is 0.007. All 21 perfect-match cases fall below the noise floor (τ=0.06).

Confidence tracks text change, not knowledge uncertainty.
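The identical-text check can be sketched as a token-set Jaccard similarity followed by a comparison against the noise floor. Lowercased whitespace tokenization and the helper names `jaccard` and `is_dissociation` are assumptions for illustration; yuragi's own tokenization may differ.

```python
NOISE_FLOOR = 0.06  # tau, the noise floor quoted in the text


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two answers' token sets.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_dissociation(base_text: str, pert_text: str,
                    base_conf: float, pert_conf: float) -> bool:
    """True when the answer is unchanged but confidence moves beyond tau."""
    same_answer = jaccard(base_text, pert_text) == 1.0
    return same_answer and abs(pert_conf - base_conf) > NOISE_FLOOR


# The largest shift reported for identical answers (0.021) stays below tau.
print(is_dissociation("Paris", "Paris", 0.990, 0.969))
```

Under this check, none of the 21 perfect-match cases described above would count as a dissociation, since their largest shift is 0.021.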

This finding does not motivate using fragility as a hallucination signal beyond the logprob baseline. In the TruthfulQA experiment (llama-3.1-8b, n=412, 5-fold CV), a LogReg ensemble over 105 features achieves AUC=0.73 [0.68, 0.78] against LLM-judge labels. However, bootstrap ablation (experiments/ablation_delta_significance.py, 2026-04-17) shows that perturbation features are non-significant at best (Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35), and they actively hurt performance in the pre-registered paired bootstrap (Round 5-W, commit 566823a):

- LogReg: Δ=−0.0288, CI [−0.052, −0.006], p=0.016
- GB: Δ=−0.0470, CI [−0.080, −0.012], p=0.007
- CatBoost: Δ=−0.0409, CI [−0.081, −0.000], p=0.046

All 13 perturbation features decompose to K-normalized entropy deltas (baseline_confidence differences) and carry zero independent information. The solo AUC of fragility_score is ~0.50 across 6 datasets (not 0.62 as previously claimed; see the retraction in experiments/ablation_solo_universality_report.txt). On TriviaQA (n=200), baseline_confidence alone reaches AUC=0.75 [0.67, 0.82]; this replicates the Kadavath 2022 / Farquhar 2024 logprob-entropy baseline and is not a novel yuragi result.

Experiment source: experiments/ensemble_final.txt, experiments/triviaqa_scale_n200.jsonl, paper/revolutionary_reframe.md
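For readers unfamiliar with the paired bootstrap used for the Δ(AUC) figures above, the following is a minimal standard-library sketch of the general technique. It is not the script in experiments/; the function names, the percentile interval, and the defaults are assumptions. Both score vectors are resampled with the same indices, which is what makes the comparison paired.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mann-Whitney AUC: P(score of positive > score of negative), ties = 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(
    labels: Sequence[int],
    scores_full: Sequence[float],   # e.g. model with perturbation features
    scores_base: Sequence[float],   # e.g. baseline-only model
    n_boot: int = 1000,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Return (point delta, lower, upper) for AUC(full) - AUC(base).

    The interval is a 95% percentile interval over paired resamples.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain a single class
        full = auc(y, [scores_full[i] for i in idx])
        base = auc(y, [scores_base[i] for i in idx])
        deltas.append(full - base)
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    point = auc(labels, scores_full) - auc(labels, scores_base)
    return point, lower, upper
```

An interval that lies entirely below zero, as in the three pre-registered results above, indicates that the added features lower AUC rather than merely failing to help. The pairwise AUC here is quadratic in sample size, so a rank-based implementation would be preferable for large n.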

Comparison with Related Work

The table below describes design choices, not a competitive ranking. The other libraries listed are excellent within their own scopes; "—" simply means the feature is outside that library's documented focus, not that the library is deficient.

	yuragi	lm-polygraph	SelfCheckGPT	PromptBench
Confidence fragility measurement (a design choice)	Yes	Out of scope	Out of scope	Out of scope
Confidence dissociation (answer same, confidence shifts)	Yes	Out of scope	Out of scope	Out of scope
Black-box (any LLM API)	Yes	Partial	Yes	Partial
CLI-first (2 commands to result)	Yes	Library API	Library API	Library API
Psychology stress tests	11	Out of scope	Out of scope	Out of scope
Trilayer analysis (logprob+sampling+verbalized)	Yes	Individual layers	Sampling-only	Out of scope
White-box layer entropy experiments	Yes	Out of scope	Out of scope	Out of scope
Production applications (CI/CD, routing, guard)	5 wired examples	Out of scope	Out of scope	Out of scope
Core dependencies	3	torch+transformers	torch+transformers	torch+transformers

Full comparison with CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS: docs/related_work.md

Known Limitations

- Multi-model structural reproduction: reproduced across Cerebras 8B, llama3.2 3B, and Claude 4.6 verbalized (v0.4.0 baseline); see docs/bench/real/
- Psychology experiment fidelity: templates are inspired by the cited papers and are not full behavioural replications
- Semantic Entropy uses a Jaccard/cosine fallback, not NLI clustering (Farquhar et al. 2024)
- Meta-audit (2026-04-14 + 2026-04-17) findings:
  - Output-level fragility solo AUC is ~0.50 across 6 datasets (TruthfulQA/TriviaQA/NQ-Open/NIM/Cohere/Mistral); the earlier "0.62 noise floor" figure is retracted as it did not match subsequent bootstrap tests.
  - Δ(AUC) from adding perturbation features to the no-perturbation baseline is not significant (p=0.35 on TruthfulQA/llm_label, p=0.81 on TriviaQA/is_correct); two settings with is_correct labels show a significant negative Δ (Cerebras n=382, p=0.012; cross-family pooled n=300, p=0.033), i.e. perturbations reduce AUC.
  - Circular ground-truth risk (LLM judge cues ≈ classifier features); label source matters: is_correct (flexible judge) and llm_label (LLM judge) agree only 55.1% of the time on TruthfulQA (n=412).
  - The "Confidence Inversion" observation is based on a single model (llama-3.1-8B) and requires cross-model replication before any generalization claim.
  - Data-integrity note: halueval_cerebras.jsonl, missing_axis.jsonl, nli_se_cache_n50.jsonl, and qwen35_122b.jsonl contain zero-filled perturbation values and are excluded from claims.
  - See experiments/ablation_delta_significance_report.txt, experiments/ablation_solo_universality_report.txt, experiments/audit_power_and_dataquality_report.txt, and experiments/meta_audit_*.md.

Full details: KNOWN_LIMITATIONS.md

Paper

ICML 2026 MI Workshop submission (negative-result pivot complete):

"When the Baseline Is the Ceiling: 13 Perturbations Add Zero Signal Over Top-k Logprob Entropy"

arXiv submitted: 2026-04-19 (identifier TBD — update after manual submission to https://arxiv.org/submit/new)

Source: paper/icml2026_mi/. arXiv bundle: paper/icml2026_mi/arxiv_submission.tar.gz. Categories: cs.LG (primary), cs.CL, cs.AI. Three confirmed negative findings: (1) 13 perturbations HURT AUC vs baseline_confidence, (2) bc-only pivot null across 54 bootstrap samples, (3) cross-dataset confirmatory AUC=0.782 baseline-only.

Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.
