| title | yuragi: A Python Tool for Measuring LLM Confidence Fragility |
|---|---|
| tags | |
| authors | |
| affiliations | |
| date | 12 April 2026 |
| bibliography | paper.bib |

Large language models (LLMs) are increasingly deployed in high-stakes
domains such as medicine, law, and finance, where model confidence is
used as a proxy for reliability. However, confidence is fragile: a
single word change in a prompt can shift a model's stated confidence by
over 25 percentage points while the answer remains semantically
identical. yuragi (Japanese for "fluctuation") is a Python package
that systematically measures this confidence fragility through
controlled prompt perturbations. It introduces the Confidence
Dissociation metric. yuragi is available on PyPI (`pip install yuragi`)
and supports 100+ models via litellm, including models from OpenAI,
Anthropic, and Google, as well as local models through Ollama.

Existing uncertainty quantification tools for LLMs focus primarily on
two dimensions: calibration---whether stated confidence matches
empirical accuracy---and hallucination detection---whether a model
fabricates information. Tools such as lm-polygraph and SelfCheckGPT
address these dimensions effectively. However, neither captures a third,
orthogonal dimension: confidence stability under prompt perturbation.

Recent work has identified this gap from multiple angles. CCPS [@zhang2025ccps] demonstrates that perturbing internal representations reveals calibration failures, but requires white-box access to hidden states and Jacobians. TRUTH DECAY [@li2025truthdecay] and SYCON-Bench [@sun2025syconbench] study sycophantic behavior where models flip their answers under social pressure, but do not isolate confidence shifts when answers remain stable. SycEval [@chen2025syceval] benchmarks sycophancy across tasks but focuses on answer correctness rather than confidence dynamics. FRS [@fastowski2025frs] measures factual robustness under decoding-condition variation (temperature sweeps), complementary to but distinct from prompt-content perturbation. ADVICE [@sharma2025advice] identifies that answer generation and confidence verbalization are internally decoupled but proposes a fine-tuning solution requiring model training access.

yuragi complements existing approaches by occupying a specific design
niche: a black-box, zero-shot, CLI-first tool. It applies 13 perturbation types (synonym substitution,
paraphrase, tone shift, negation, authority pressure, anchoring, and
others) and 11 psychology-inspired experiment protocols (Asch
conformity, impostor, gaslighting, framing, cognitive dissonance, halo
effect, primacy-recency, Dunning-Kruger, and others) adapted from
classical social psychology [@asch1951conformity] to systematically
stress-test confidence. Results include effect sizes (Cohen's

A typical workflow consists of a single CLI command:

```bash
yuragi scan "What is the capital of France?" --provider openai
```

This produces a structured report containing: baseline confidence,
per-perturbation confidence deltas, aggregate fragility score (mean

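The report fields named above (baseline confidence, per-perturbation deltas, an aggregate score, and an effect size) can be illustrated with a short sketch. This is not yuragi's implementation. The function names and the sample numbers are hypothetical, the mean absolute delta is only an assumed form of the aggregate, and the effect size is the textbook pooled-variance Cohen's d, which may differ from what the package reports.

```python
from statistics import mean, stdev


def confidence_deltas(baseline, perturbed):
    """Return perturbed-minus-baseline confidence for each perturbation type."""
    return {name: conf - baseline for name, conf in perturbed.items()}


def mean_abs_delta(deltas):
    """Assumed aggregate: mean of absolute deltas (not confirmed by the source)."""
    return mean(abs(d) for d in deltas.values())


def cohens_d(sample_a, sample_b):
    """Textbook Cohen's d for two independent samples, using a pooled SD."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled_sd


# Hypothetical confidences in [0, 1]; illustrative numbers, not measured results.
baseline = 0.90
perturbed = {"synonym": 0.85, "authority_pressure": 0.60, "anchoring": 0.95}

deltas = confidence_deltas(baseline, perturbed)
fragility = mean_abs_delta(deltas)
```

Using signed deltas per perturbation and an absolute-value aggregate keeps the direction of each shift visible while preventing upward and downward shifts from cancelling in the summary number.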
yuragi compare-models generates cross-provider heatmaps with normalized
confidence scores. The package also exposes a Python API for
programmatic integration and supports semantic entropy computation
inspired by @farquhar2024semantic. All confidence computations follow
information-theoretic foundations from @cover2006information, with
entropy normalized by vocabulary size to enable fair cross-provider
comparison.
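One way to read "entropy normalized by vocabulary size" is Shannon entropy divided by its maximum, $\log |V|$, which maps the value into $[0, 1]$ regardless of how many outcomes a provider exposes. The sketch below makes that assumption explicit; the function name `normalized_entropy` and its `vocab_size` parameter are hypothetical and are not taken from yuragi's API.

```python
import math


def normalized_entropy(probs, vocab_size=None):
    """Shannon entropy of `probs` divided by log(vocab_size).

    Assumes `probs` is a probability distribution (non-negative, sums to 1).
    If `vocab_size` is omitted, the number of listed outcomes is used.
    """
    if vocab_size is None:
        vocab_size = len(probs)
    if vocab_size < 2:
        # A single possible outcome carries no uncertainty.
        return 0.0
    # Zero-probability outcomes contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(vocab_size)


uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
certain = normalized_entropy([1.0, 0.0, 0.0, 0.0])
partial = normalized_entropy([0.5, 0.5], vocab_size=4)
```

Under this normalization a uniform distribution scores 1 and a fully peaked one scores 0, so values from providers with different vocabulary sizes sit on the same scale.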
\autoref{tab:comparison} summarizes the positioning of yuragi relative
to prior work. Conformity in LLMs [@huang2024conformity] demonstrates
that LLMs exhibit Asch-like conformity under group pressure, motivating
several of yuragi's psychology experiment protocols.

: Comparison of yuragi with related tools. \label{tab:comparison}

| Tool | Access | Perturbation | Metric |
|---|---|---|---|
| CCPS | White-box | Internal repr. | Calibration |
| FRS | White-box | Temperature | Factual robustness |
| TRUTH DECAY | Black-box | Social pressure | Answer flip |
| SycEval | Black-box | Sycophancy | Answer correctness |
| ADVICE | Fine-tune | None (training) | Calibration |
| yuragi | Black-box | 13 prompt types | Confidence delta |

We thank the developers of litellm for providing the unified LLM API
layer, and the authors of the Hypothesis property-based testing
framework.