The yuragi.guardrails subpackage lands. yuragi is no longer just a
measurement library — it is now a guardrail platform with audit
logging, multi-agent runtime, and Git-like cognitive snapshots.
Origin: ported from the previously-private dacm project. See
CHANGELOG.md for the full list and KNOWN_LIMITATIONS.md G1–G7
for acknowledged tradeoffs.
- `yuragi.guardrails.fuse.ConfidenceFuser`: 4-signal fusion
- `yuragi.guardrails.policy.ConfidencePolicy`: threshold routing
- `yuragi.guardrails.audit.AuditLog`: SQLite-WAL + SHA-256 chain
- `yuragi.guardrails.kernel`: Akka-style actor mesh
- `yuragi.guardrails.bus`: InMemory + (experimental) NATS transport
- `yuragi.guardrails.snapshot`: Merkle DAG checkpoint / restore
- `yuragi.guardrails.scheduler`: Contract Net + reputation fallback
- `yuragi.guardrails.agents`: 6 reference agents
- AutoGen + LangGraph integrations
- 56 new unit tests, all existing tests still pass
- NATS DeadLetter detection via consumer-info polling
- `on_malformed` returns `ACK | NAK | TERM`
- NATS prefetch / `max_in_flight` config
- Background-queue `AuditSink` for high-throughput deployments
- PII / jailbreak / prompt-injection detectors (Presidio bridge)
- Streaming-token guardrails (early-stop on confidence drop)
- Optional executor-backed `AuditSink` (`run_in_executor`)
- Snapshot benchmark against the ≤ 1 s / 10 k-agent target
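
The "SQLite-WAL + SHA-256 chain" design named for `AuditLog` can be illustrated with a minimal sketch. This is not yuragi's actual API: the table layout and the `open_log`, `append`, and `verify` helpers are hypothetical names chosen for illustration, and the sketch assumes each entry's hash covers the previous hash plus a canonical JSON payload.

```python
import hashlib
import json
import sqlite3

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def open_log(path):
    """Open (or create) the log database and request WAL journaling."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # in-memory databases ignore this
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, prev_hash TEXT NOT NULL, hash TEXT NOT NULL)"
    )
    return conn


def _digest(prev_hash, payload):
    """SHA-256 over the previous hash concatenated with the payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(conn, event):
    """Append one event, linking it to the hash of the previous row."""
    row = conn.execute("SELECT hash FROM audit ORDER BY id DESC LIMIT 1").fetchone()
    prev_hash = row[0] if row else GENESIS
    payload = json.dumps(event, sort_keys=True)  # canonical key order
    entry_hash = _digest(prev_hash, payload)
    conn.execute(
        "INSERT INTO audit (payload, prev_hash, hash) VALUES (?, ?, ?)",
        (payload, prev_hash, entry_hash),
    )
    conn.commit()
    return entry_hash


def verify(conn):
    """Recompute every hash in order; return False at the first mismatch."""
    expected_prev = GENESIS
    rows = conn.execute("SELECT payload, prev_hash, hash FROM audit ORDER BY id")
    for payload, prev_hash, entry_hash in rows:
        if prev_hash != expected_prev or _digest(prev_hash, payload) != entry_hash:
            return False
        expected_prev = entry_hash
    return True
```

Because each row commits to its predecessor's hash, editing or deleting any earlier row makes `verify` fail for that row or the one after it, which is the tamper-evidence property a hash chain is meant to provide.
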
- Semantic Entropy (Farquhar et al., Nature 2024) + Verbalized↔Logit Gap
- 13 perturbation types (added `negation`, `counterfactual`, `code_switching`)
- 11 psychology experiments (added `dunning_kruger`, `test_time_compute`)
- Statistical tests (Cohen's d, paired t-test, Wilcoxon, bootstrap CI)
- Multi-model `compare-models` CLI + perturbation × model heatmap
- Agentic UQ trajectory tracking (`track_agentic_session`)
- TruthfulQA / SimpleQA / HaluEval loaders
- `YuragiConfig` frozen dataclass (40 fields) threaded through all 15 analysis modules
- Lazy litellm loader (PEP 562): CLI cold start 2.5 s → 0.11 s (22×)
- `benchmarks/falsifiability_check.py`: executable F1–F5 protocol
- `docs/theory.md`: Sections 1–20 (3048 lines), formal Definitions 1.1a/1.1b/1.2
- Narrative / implementation alignment pass (Round 5–9, 16-agent audit)
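
The lazy loader relies on PEP 562, which lets a module define `__getattr__` so that an attribute is resolved only on first access. The sketch below shows the mechanism in isolation, under stated assumptions: `make_lazy_module` is a hypothetical helper, and the standard-library `json` module stands in for a heavy dependency such as litellm. In a real package the `__getattr__` function would normally be written at the top level of the package's `__init__.py` rather than attached to a synthetic module.

```python
import importlib
import types


def make_lazy_module(name, lazy):
    """Build a module whose listed attributes are imported on first access.

    `lazy` maps a public attribute name to the module path to import.
    """
    mod = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy:
            value = importlib.import_module(lazy[attr])
            setattr(mod, attr, value)  # cache so later lookups skip this hook
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: a function named __getattr__ in the module namespace is used
    # as the fallback for missing attributes.
    mod.__getattr__ = __getattr__
    return mod


# "json" is a stand-in for an expensive import.
demo_pkg = make_lazy_module("demo_pkg", {"backend": "json"})
print("backend" in vars(demo_pkg))   # nothing imported into the module yet
print(demo_pkg.backend.dumps([1]))   # first access triggers the import
print("backend" in vars(demo_pkg))   # now cached on the module
```

Deferring the import this way moves its cost from process start to first use, which is where a cold-start improvement for CLI commands that never touch the dependency would come from.
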
- `yuragi check`: CI/CD fragility regression detection with GitHub Actions workflow
- `yuragi route`: Fragility-aware multi-model routing (3 strategies)
- `yuragi guard`: Abstention system for high-stakes domains (5 domain presets)
- `yuragi recommend`: Model selection based on fragility profiles
- `yuragi red-team`: Automated vulnerability discovery via perturbation probing
- `applications/` Python API for all 5 use cases
- Cerebras API integration (direct logprobs via `_complete_with_logprobs_direct()`)
- Thinking model detection (Qwen3.5/LFM2.5-thinking fallback to sampling)
- Provider detection for Groq, Together AI, Fireworks AI, OpenRouter
- Adaptive fragility metrics (`adaptive_fragility`, `maladaptive_fragility`, `adversarial_fragility`, `fragility_ratio`)
- Perturbation semantic classification (`SemanticClass` enum: PRESERVING/MODIFYING/ADVERSARIAL)
- Phase transition experiment (`experiments/phase_transition.py`): 11-step graduated prefix ladder
- Hallucination prediction experiment (`benchmarks/hallucination_experiment.py`)
- `whitebox_design.py`: 5 whitebox experiments (Exp 1–5) covering layer entropy propagation, causal tracing, attention shift, representation geometry, and SAE feature decomposition (Pythia-410m, Qwen2-1.5B)
- theory.md Sections 12–20 (NN interpretation, PIRI framework, phase transition, cross-species fragility, adaptive fragility)
- ICML 2026 MI Workshop paper (`paper/icml2026_mi/main.tex`)
- JOSS paper (`paper.md` + `paper.bib`)
- 1320+ tests passing / ruff + bandit clean
- Gradio demo (`demo/app.py`) for HuggingFace Spaces / local use
- GitHub Actions reusable workflow (`.github/workflows/yuragi-check.yml`)
- Embedding-based semantic classification (replace Jaccard with SBERT similarity backend)
- Gemma-2 SAE feature decomposition (extend white-box experiments beyond Pythia/Qwen2)
- Full hallucination validation — controlled dissociation benchmark across 3+ models with human-validated ground truth
- Cross-species experiments (Section 19 Universal Fragility Principle empirical validation)
- Sach/Wort probe redesign with paraphrase dataset and n ≥ 30
- Async Scanner API for batch workloads
- Export to W&B / MLflow
- Multi-language prompt support (beyond EN/JA)
- Cross-model benchmark database with shareable results
- Plugin system for custom perturbation types
- `yuragi watch`: continuous monitoring mode
- Pre-built Docker image
- MkDocs documentation site (GitHub Pages)
- Semantic Entropy + Verbalized↔Logit Gap
- 13 perturbation types, 11 psychology experiments
- Statistical tests (Cohen's d, paired t-test, Wilcoxon, bootstrap CI)
- Multi-model `compare-models` CLI
- Agentic UQ trajectory tracking
- TruthfulQA / SimpleQA / HaluEval loaders
- `YuragiConfig` frozen dataclass (40 fields)
- Lazy litellm loader (CLI cold start 22×)
- `benchmarks/falsifiability_check.py`
- `docs/theory.md` Sections 1–10
- 1050 tests passing / 1166 collected / ruff + bandit clean
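
The frozen-dataclass pattern behind `YuragiConfig` can be sketched as follows. `DemoConfig`, its two fields, and `summarise` are hypothetical stand-ins; the real class has 40 fields that are not listed here. The sketch only shows the general idea: settings travel as one immutable object, and variants are derived with `dataclasses.replace` instead of mutation.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Illustrative config; field names are assumptions, not yuragi's."""
    n_samples: int = 3   # how many leading values an analysis step uses
    decimals: int = 2    # rounding applied to reported numbers


def summarise(values, config=DemoConfig()):
    """Toy analysis step that reads all of its settings from the config."""
    kept = values[: config.n_samples]
    return round(sum(kept) / len(kept), config.decimals)


base = DemoConfig()
quick = replace(base, n_samples=2)  # new object; `base` is left unchanged

print(summarise([1.0, 2.0, 4.0, 9.0], base))
print(summarise([1.0, 2.0, 4.0, 9.0], quick))

try:
    base.n_samples = 10  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("config is immutable")
```

Immutability makes it safe to share a single config instance across many modules, since no module can change a setting underneath another.
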
- Core scanner with 10 perturbation types
- 3-layer confidence measurement (logprobs, sampling, verbalized)
- 9 psychology experiments (Asch, Impostor, Authority, Anchoring, Gaslighting, Framing, Cognitive Dissonance, Halo Effect, Primacy-Recency)
- Fragility Profile (CCI, Recovery Elasticity, Non-linearity Score)
- Confidence decay (Learned Helplessness) and recovery analysis
- Model behavioral fingerprinting
- Adversarial confidence analysis (flip, sycophancy, pressure resistance)
- Financial-engineering metrics (VIX, Sharpe, drawdown, regime detection)
- Phase transition detection
- Linguistic confidence gap (hedge detection, assertiveness)
- Calibration metrics (ECE, Brier, mutual information)
- 13 CLI commands, Rich TUI
- 978 tests, 95% coverage
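
For readers unfamiliar with the calibration metrics listed above, here is a minimal standard-library sketch of the Brier score and the expected calibration error (ECE) using the common equal-width binning definition. The function names and the default of 10 bins are assumptions for illustration and are not taken from yuragi's code.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, o))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers: stated confidence, and whether each answer was correct (1/0).
conf = [0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0]
print(brier_score(conf, correct))
print(expected_calibration_error(conf, correct))
```

A perfectly calibrated model has an ECE of zero, while the Brier score also penalises answers that are calibrated but uninformative, so the two are usually reported together.
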