---
title: 'yuragi: A Python Tool for Measuring LLM Confidence Fragility'
tags:
  - Python
  - large language models
  - confidence calibration
  - uncertainty quantification
  - prompt sensitivity
  - AI safety
  - robustness
authors:
  - name: hinanohart
    orcid: 0000-0000-0000-0000
    affiliation: 1
affiliations:
  - name: Independent Researcher
    index: 1
date: 12 April 2026
bibliography: paper.bib
---

Summary

Large language models (LLMs) are increasingly deployed in high-stakes domains such as medicine, law, and finance, where model confidence is used as a proxy for reliability. However, confidence is fragile: a single word change in a prompt can shift a model's stated confidence by over 25 percentage points while the answer remains semantically identical. yuragi (Japanese for "fluctuation") is a Python package that systematically measures this confidence fragility through controlled prompt perturbations. It makes three contributions. First, the Confidence Dissociation metric $D(q, q') = \text{sim}(a, a') \times |\Delta C|$, where $a$ and $a'$ are the answers to the original prompt $q$ and the perturbed prompt $q'$ and $\Delta C$ is the change in confidence between them, isolates cases where the answer is preserved but confidence shifts. Second, a cross-provider confidence normalization $C = 1 - H_{\text{nats}} / \ln(k)$ enables comparison across providers with different vocabulary sizes. Third, a trilayer measurement framework captures logprob-based, sampling-based, and verbalized confidence in a single API call. yuragi is available on PyPI (pip install yuragi) and supports 100+ models via litellm, including OpenAI, Anthropic, Google, and local models through Ollama.
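As an illustration of the two formulas above, the following minimal sketch computes the normalized confidence $C$ and the Confidence Dissociation $D$ in plain Python. The function names are hypothetical and are not taken from yuragi's API. The sketch also assumes that $k$ is the number of outcomes in the distribution passed in, and that the answer similarity has already been computed by some external method and lies in $[0, 1]$.

```python
import math


def normalized_confidence(probs):
    """Return C = 1 - H_nats / ln(k) for a distribution over k outcomes.

    Assumption: k is the length of `probs`; the input is renormalized
    because a truncated list of probabilities may not sum to 1.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two outcomes to normalize entropy")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    normalized = [p / total for p in probs]
    # Shannon entropy in nats; zero-probability outcomes contribute nothing.
    entropy = -sum(p * math.log(p) for p in normalized if p > 0)
    return 1.0 - entropy / math.log(k)


def confidence_dissociation(similarity, conf_original, conf_perturbed):
    """Return D = sim(a, a') * |delta C| (hypothetical helper)."""
    return similarity * abs(conf_perturbed - conf_original)


if __name__ == "__main__":
    # A peaked distribution gives C close to 1, a uniform one gives C close to 0.
    print(normalized_confidence([0.97, 0.01, 0.01, 0.01]))
    print(normalized_confidence([0.25, 0.25, 0.25, 0.25]))
    # Nearly identical answers (sim = 0.95) with a 30-point confidence drop.
    print(confidence_dissociation(0.95, 0.90, 0.60))
```

Under these assumptions $D$ is large only when both factors are large: a perturbation that changes the answer (low similarity) or leaves confidence unchanged (small $|\Delta C|$) scores near zero.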

Statement of Need

Existing uncertainty quantification tools for LLMs focus primarily on two dimensions: calibration (whether stated confidence matches empirical accuracy) and hallucination detection (whether a model fabricates information). Tools such as lm-polygraph and SelfCheckGPT address these dimensions effectively. However, neither captures a third, orthogonal dimension: confidence stability under prompt perturbation.

Recent work has identified this gap from multiple angles. CCPS [@zhang2025ccps] demonstrates that perturbing internal representations reveals calibration failures, but requires white-box access to hidden states and Jacobians. TRUTH DECAY [@li2025truthdecay] and SYCON-Bench [@sun2025syconbench] study sycophantic behavior where models flip their answers under social pressure, but do not isolate confidence shifts when answers remain stable. SycEval [@chen2025syceval] benchmarks sycophancy across tasks but focuses on answer correctness rather than confidence dynamics. FRS [@fastowski2025frs] measures factual robustness under decoding-condition variation (temperature sweeps), complementary to but distinct from prompt-content perturbation. ADVICE [@sharma2025advice] identifies that answer generation and confidence verbalization are internally decoupled but proposes a fine-tuning solution requiring model training access.

yuragi complements existing approaches by occupying a specific design niche: a black-box, zero-shot, CLI-first tool. It applies 13 perturbation types (synonym substitution, paraphrase, tone shift, negation, authority pressure, anchoring, and others) and 11 psychology-inspired experiment protocols (Asch conformity, impostor, gaslighting, framing, cognitive dissonance, halo effect, primacy-recency, Dunning-Kruger, and others) adapted from classical social psychology [@asch1951conformity] to systematically stress-test confidence. Results include effect sizes (Cohen's $d$), nonparametric tests (Wilcoxon signed-rank), and bootstrap confidence intervals [@efron1979bootstrap]. The test suite comprises 1300+ tests, including property-based tests via Hypothesis, and continuous integration runs on Python 3.11 and 3.12 across Ubuntu, macOS, and Windows.
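The statistics named above can be sketched with the standard library alone. The example below is an illustration rather than yuragi's implementation: the function names are hypothetical, the paired form of Cohen's $d$ (mean difference divided by the standard deviation of the differences) is one of several conventions, and the interval is a simple percentile bootstrap of the mean. The Wilcoxon signed-rank test is omitted to keep the sketch dependency-free.

```python
import random
import statistics


def cohens_d_paired(baseline, perturbed):
    """Paired effect size: mean of differences / sample SD of differences."""
    if len(baseline) != len(perturbed):
        raise ValueError("baseline and perturbed must have the same length")
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    sd = statistics.stdev(diffs)  # requires at least two pairs
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.fmean(diffs) / sd


def bootstrap_ci_mean(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Made-up confidences for four questions, before and after perturbation.
    baseline = [0.90, 0.80, 0.85, 0.95]
    perturbed = [0.70, 0.75, 0.60, 0.90]
    abs_deltas = [abs(p - b) for b, p in zip(baseline, perturbed)]
    print("Cohen's d (paired):", cohens_d_paired(baseline, perturbed))
    print("95% CI for mean |delta C|:", bootstrap_ci_mean(abs_deltas))
```

The numbers in the usage section are invented for illustration and are not results from the paper.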

Functionality

A typical workflow consists of a single CLI command:

    yuragi scan "What is the capital of France?" --provider openai

This produces a structured report containing: baseline confidence, per-perturbation confidence deltas, aggregate fragility score (mean $|\Delta C|$), worst-case fragility, the proportion of perturbations exceeding a configurable threshold, Confidence Dissociation scores, and bootstrap confidence intervals. For multi-model comparison, yuragi compare-models generates cross-provider heatmaps with normalized confidence scores. The package also exposes a Python API for programmatic integration and supports semantic entropy computation inspired by @farquhar2024semantic. All confidence computations follow information-theoretic foundations from @cover2006information, with entropy normalized by vocabulary size to enable fair cross-provider comparison.
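To make the aggregate quantities in the report concrete, the sketch below computes the mean $|\Delta C|$, the worst-case fragility, and the proportion of perturbations above a threshold from a baseline confidence and a set of perturbed confidences. It is a hedged illustration, not yuragi's Python API: the `FragilitySummary` class, the `summarize_fragility` function, the field names, and the default threshold of 0.1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FragilitySummary:
    """Hypothetical container for the aggregate quantities described above."""
    mean_abs_delta: float             # aggregate fragility score, mean |delta C|
    worst_case: float                 # largest |delta C| over all perturbations
    proportion_over_threshold: float  # share of perturbations above the threshold


def summarize_fragility(baseline, perturbed, threshold=0.1):
    """Summarize confidence shifts relative to a baseline confidence.

    `perturbed` maps a perturbation name to the confidence observed under it.
    The threshold default is an assumption chosen for illustration only.
    """
    if not perturbed:
        raise ValueError("at least one perturbation result is required")
    abs_deltas = [abs(conf - baseline) for conf in perturbed.values()]
    over = sum(1 for delta in abs_deltas if delta > threshold)
    return FragilitySummary(
        mean_abs_delta=sum(abs_deltas) / len(abs_deltas),
        worst_case=max(abs_deltas),
        proportion_over_threshold=over / len(abs_deltas),
    )


if __name__ == "__main__":
    # Invented confidences for three perturbation types.
    summary = summarize_fragility(
        baseline=0.90,
        perturbed={"synonym": 0.85, "negation": 0.60, "authority": 0.90},
    )
    print(summary)
```

Reporting the worst case alongside the mean matters because a model can look stable on average while a single perturbation type moves its confidence sharply.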

Related Work

\autoref{tab:comparison} summarizes the positioning of yuragi relative to prior work. Conformity in LLMs [@huang2024conformity] demonstrates that LLMs exhibit Asch-like conformity under group pressure, motivating several of yuragi's psychology experiment protocols.

: Comparison of yuragi with related tools. \label{tab:comparison}

| Tool        | Access    | Perturbation    | Metric             |
|-------------|-----------|-----------------|--------------------|
| CCPS        | White-box | Internal repr.  | Calibration        |
| FRS         | White-box | Temperature     | Factual robustness |
| TRUTH DECAY | Black-box | Social pressure | Answer flip        |
| SycEval     | Black-box | Sycophancy      | Answer correctness |
| ADVICE      | Fine-tune | None (training) | Calibration        |
| yuragi      | Black-box | 13 prompt types | Confidence delta   |

Acknowledgements

We thank the developers of litellm for providing the unified LLM API layer, and the authors of the Hypothesis property-based testing framework.

References