Summary

Large language models (LLMs) are increasingly deployed in high-stakes domains such as medicine, law, and finance, where model confidence is used as a proxy for reliability. However, confidence is fragile: a single word change in a prompt can shift a model's stated confidence by over 25 percentage points while the answer remains semantically identical. yuragi (Japanese for "fluctuation") is a Python package that systematically measures this confidence fragility through controlled prompt perturbations. It introduces three components: (1) the Confidence Dissociation metric $D(q, q') = \text{sim}(a, a') \times |\Delta C|$, where $a$ and $a'$ are the answers to the original prompt $q$ and the perturbed prompt $q'$ and $\Delta C$ is the change in confidence between them, which isolates cases where the answer is preserved but confidence shifts; (2) a cross-provider confidence normalization $C = 1 - H_{\text{nats}} / \ln(k)$, which enables comparison across providers with different vocabulary sizes; and (3) a trilayer measurement framework that captures logprob-based, sampling-based, and verbalized confidence in a single API call. yuragi is available on PyPI (pip install yuragi) and supports 100+ models via litellm, including OpenAI, Anthropic, Google, and local models through Ollama.
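The two formulas above can be illustrated with a minimal sketch. This is not yuragi's API: the function names are hypothetical, $k$ is assumed to be the number of entries in the supplied probability list, and the similarity score is assumed to be precomputed in $[0, 1]$.

```python
import math


def normalized_confidence(probs):
    """Return C = 1 - H_nats / ln(k) for a distribution over k outcomes.

    Assumption: k is the length of `probs`, and the values are
    renormalized to sum to 1 before the entropy is computed.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two outcomes to normalize entropy")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    p = [x / total for x in probs]
    # Shannon entropy in nats; zero-probability outcomes contribute nothing.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return 1.0 - entropy / math.log(k)


def confidence_dissociation(similarity, conf_original, conf_perturbed):
    """Return D = sim(a, a') * |delta C| for one prompt pair.

    Assumption: `similarity` is a precomputed answer similarity in [0, 1].
    """
    return similarity * abs(conf_perturbed - conf_original)
```

Under these assumptions a uniform distribution yields $C = 0$ and a one-hot distribution yields $C = 1$, while $D$ is large only when the answers stay similar and the confidence moves.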

Statement of Need

Existing uncertainty quantification tools for LLMs focus primarily on two dimensions: calibration (whether stated confidence matches empirical accuracy) and hallucination detection (whether a model fabricates information). Tools such as lm-polygraph and SelfCheckGPT address these dimensions effectively. However, neither captures a third, orthogonal dimension: confidence stability under prompt perturbation.

Recent work has identified this gap from multiple angles. CCPS [@zhang2025ccps] demonstrates that perturbing internal representations reveals calibration failures, but requires white-box access to hidden states and Jacobians. TRUTH DECAY [@li2025truthdecay] and SYCON-Bench [@sun2025syconbench] study sycophantic behavior where models flip their answers under social pressure, but do not isolate confidence shifts when answers remain stable. SycEval [@chen2025syceval] benchmarks sycophancy across tasks but focuses on answer correctness rather than confidence dynamics. FRS [@fastowski2025frs] measures factual robustness under decoding-condition variation (temperature sweeps), complementary to but distinct from prompt-content perturbation. ADVICE [@sharma2025advice] identifies that answer generation and confidence verbalization are internally decoupled but proposes a fine-tuning solution requiring model training access.

yuragi complements existing approaches by occupying a specific design niche: a black-box, zero-shot, CLI-first tool. It applies 13 perturbation types (synonym substitution, paraphrase, tone shift, negation, authority pressure, anchoring, and others) and 11 psychology-inspired experiment protocols (Asch conformity, impostor, gaslighting, framing, cognitive dissonance, halo effect, primacy-recency, Dunning-Kruger, and others) adapted from classical social psychology [@asch1951conformity] to systematically stress-test confidence. Results include effect sizes (Cohen's $d$), nonparametric tests (Wilcoxon signed-rank), and bootstrap confidence intervals [@efron1979bootstrap]. The entire test suite comprises 1300+ tests including property-based tests via Hypothesis, with CI across Python 3.11/3.12 on Ubuntu, macOS, and Windows.
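As an illustration of the statistics named above, the following sketch computes a paired effect size and a percentile bootstrap confidence interval using only the standard library. It is a sketch under stated assumptions, not yuragi's implementation: the function names are hypothetical, the effect size is the paired variant (mean difference divided by the standard deviation of the differences), and the interval is a plain percentile bootstrap of the mean.

```python
import random
import statistics


def cohens_d_paired(baseline, perturbed):
    """Paired Cohen's d: mean of differences / sample SD of differences.

    Assumption: this is the paired variant; other definitions of d exist.
    """
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / sd


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

A Wilcoxon signed-rank test is omitted here because it is usually taken from a statistics library rather than written by hand.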

Functionality

A typical workflow consists of a single CLI command:

yuragi scan "What is the capital of France?" --provider openai

This produces a structured report containing: baseline confidence, per-perturbation confidence deltas, aggregate fragility score (mean $|\Delta C|$), worst-case fragility, the proportion of perturbations exceeding a configurable threshold, Confidence Dissociation scores, and bootstrap confidence intervals. For multi-model comparison, yuragi compare-models generates cross-provider heatmaps with normalized confidence scores. The package also exposes a Python API for programmatic integration and supports semantic entropy computation inspired by @farquhar2024semantic. All confidence computations follow information-theoretic foundations from @cover2006information, with entropy normalized by vocabulary size to enable fair cross-provider comparison.
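The aggregate quantities in this report can be sketched from a baseline confidence and a list of per-perturbation confidences. The function name, the dictionary keys, the default threshold of 0.1, and the strict "greater than" comparison are all assumptions made for illustration, not yuragi's actual report format.

```python
def fragility_summary(baseline, perturbed, threshold=0.1):
    """Summarize confidence shifts relative to a baseline confidence.

    Assumptions: confidences lie in [0, 1]; `threshold` and the strict
    comparison used for the exceed rate are illustrative choices.
    """
    if not perturbed:
        raise ValueError("need at least one perturbed confidence")
    # Absolute confidence delta |delta C| for each perturbation.
    deltas = [abs(c - baseline) for c in perturbed]
    return {
        "fragility": sum(deltas) / len(deltas),  # mean |delta C|
        "worst_case": max(deltas),               # largest single shift
        "exceed_rate": sum(d > threshold for d in deltas) / len(deltas),
    }


summary = fragility_summary(0.9, [0.9, 0.75, 0.6, 0.88])
```

In this example two of the four perturbations shift confidence by more than the threshold, so the exceed rate is one half.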

Related Work

\autoref{tab:comparison} summarizes the positioning of yuragi relative to prior work. Work on conformity in LLMs [@huang2024conformity] demonstrates that LLMs exhibit Asch-like conformity under group pressure, which motivates several of yuragi's psychology experiment protocols.

: Comparison of yuragi with related tools. \label{tab:comparison}

| Tool        | Access    | Perturbation    | Metric             |
|:------------|:----------|:----------------|:-------------------|
| CCPS        | White-box | Internal repr.  | Calibration        |
| FRS         | White-box | Temperature     | Factual robustness |
| TRUTH DECAY | Black-box | Social pressure | Answer flip        |
| SycEval     | Black-box | Sycophancy      | Answer correctness |
| ADVICE      | Fine-tune | None (training) | Calibration        |
| yuragi      | Black-box | 13 prompt types | Confidence delta   |

Acknowledgements

We thank the developers of litellm for providing the unified LLM API layer, and the authors of the Hypothesis property-based testing framework.
