Roadmap

v0.5.0 (Current, 2026-04-17) — Confidence-aware LLM Guardrails

This release adds the yuragi.guardrails subpackage. yuragi is no longer only a measurement library: it is now a guardrail platform with audit logging, a multi-agent runtime, and Git-like cognitive snapshots. The subpackage was ported from the previously private dacm project. See CHANGELOG.md for the full list of changes and KNOWN_LIMITATIONS.md (G1–G7) for acknowledged tradeoffs.

Highlights

  • yuragi.guardrails.fuse.ConfidenceFuser — 4-signal fusion
  • yuragi.guardrails.policy.ConfidencePolicy — threshold routing
  • yuragi.guardrails.audit.AuditLog — SQLite-WAL + SHA-256 chain
  • yuragi.guardrails.kernel — Akka-style actor mesh
  • yuragi.guardrails.bus — InMemory + (experimental) NATS transport
  • yuragi.guardrails.snapshot — Merkle DAG checkpoint / restore
  • yuragi.guardrails.scheduler — Contract Net + reputation fallback
  • yuragi.guardrails.agents — 6 reference agents
  • AutoGen + LangGraph integrations
  • 56 new unit tests, all existing tests still pass
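
To illustrate the "SHA-256 chain" idea behind the audit log highlight, here is a minimal, standalone sketch of a hash-chained log. It is not the yuragi.guardrails.audit.AuditLog API: the names GENESIS, entry_hash, append_entry, and verify_chain are hypothetical, and the SQLite-WAL storage layer is left out. Each entry stores the hash of its predecessor, so editing any earlier record invalidates every later hash.

```python
import hashlib
import json

# Hypothetical sentinel used as the "previous hash" of the first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    """Append a record that is linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited payload or broken link fails the check."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, {"agent": "reviewer", "decision": "abstain", "confidence": 0.41})
append_entry(log, {"agent": "reviewer", "decision": "answer", "confidence": 0.93})
print(verify_chain(log))
```

A real audit sink would also persist the entries durably and record timestamps. The sketch only shows why tampering with a stored record is detectable.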

v0.6.0 (planned)

  • NATS DeadLetter detection via consumer-info polling
  • on_malformed returns ACK | NAK | TERM
  • NATS prefetch / max_in_flight config
  • Background-queue AuditSink for high-throughput deployments
  • PII / jailbreak / prompt-injection detectors (Presidio bridge)

v0.7.0 (planned)

  • Streaming-token guardrails (early-stop on confidence drop)
  • Optional executor-backed AuditSink (run_in_executor)
  • Snapshot benchmark against the ≤ 1 s / 10 k-agent target

v0.4.0 (2026-04-12) — Production Applications & Open-Box Analysis

Core Confidence Measurement

  • Semantic Entropy (Farquhar et al., Nature 2024) + Verbalized↔Logit Gap
  • 13 perturbation types (added negation, counterfactual, code_switching)
  • 11 psychology experiments (added dunning_kruger, test_time_compute)
  • Statistical tests (Cohen's d, paired t-test, Wilcoxon, bootstrap CI)
  • Multi-model compare-models CLI + perturbation × model heatmap
  • Agentic UQ trajectory tracking (track_agentic_session)
  • TruthfulQA / SimpleQA / HaluEval loaders
  • YuragiConfig frozen dataclass (40 fields) threaded through all 15 analysis modules
  • Lazy litellm loader (PEP 562) — CLI cold start 2.5 s → 0.11 s (22×)
  • benchmarks/falsifiability_check.py — executable F1–F5 protocol
  • docs/theory.md — Sections 1–20 (3048 lines), formal Definitions 1.1a/1.1b/1.2
  • Narrative / implementation alignment pass (Round 5–9, 16-agent audit)
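
The lazy loader item above relies on PEP 562, which lets a module define a module-level `__getattr__` that runs only when an attribute is missing. The sketch below shows the general pattern under stated assumptions: it is not yuragi's actual loader, make_lazy_module is a hypothetical helper, and the standard-library fractions module stands in for a heavy dependency such as litellm.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_attrs: dict) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access.

    lazy_attrs maps an attribute name to the dotted module path to import.
    """
    mod = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails (PEP 562).
        if attr in lazy_attrs:
            value = importlib.import_module(lazy_attrs[attr])
            setattr(mod, attr, value)  # cache so later lookups skip this hook
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Storing the hook in the module namespace is what activates PEP 562.
    mod.__getattr__ = __getattr__
    return mod


# "fractions" is a stand-in for an expensive import.
demo = make_lazy_module("demo_pkg", {"fractions": "fractions"})
print("fractions" in vars(demo))  # the import has not happened yet
half = demo.fractions.Fraction(1, 2)  # first access triggers the import
print("fractions" in vars(demo))  # the imported module is now cached
```

In a package, the same hook would normally be written directly as a top-level `def __getattr__(name)` in `__init__.py`, so that commands which never touch the heavy dependency do not pay its import cost.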

Production Applications (5 modules + 5 CLI commands)

  • yuragi check — CI/CD fragility regression detection with GitHub Actions workflow
  • yuragi route — Fragility-aware multi-model routing (3 strategies)
  • yuragi guard — Abstention system for high-stakes domains (5 domain presets)
  • yuragi recommend — Model selection based on fragility profiles
  • yuragi red-team — Automated vulnerability discovery via perturbation probing
  • applications/ Python API for all 5 use cases

Cerebras & Provider Support

  • Cerebras API integration (direct logprobs via _complete_with_logprobs_direct())
  • Thinking model detection (Qwen3.5/LFM2.5-thinking fallback to sampling)
  • Provider detection for Groq, Together AI, Fireworks AI, OpenRouter

Metrics & Analysis

  • Adaptive fragility metrics (adaptive_fragility, maladaptive_fragility, adversarial_fragility, fragility_ratio)
  • Perturbation semantic classification (SemanticClass enum: PRESERVING/MODIFYING/ADVERSARIAL)
  • Phase transition experiment (experiments/phase_transition.py) — 11-step graduated prefix ladder
  • Hallucination prediction experiment (benchmarks/hallucination_experiment.py)

White-Box Experiments

  • whitebox_design.py — 5 whitebox experiments (Exp 1–5): layer entropy propagation, causal tracing, attention shift, representation geometry, SAE feature decomposition (Pythia-410m, Qwen2-1.5B)

Theory & Papers

  • theory.md Sections 12–20 (NN interpretation, PIRI framework, phase transition, cross-species fragility, adaptive fragility)
  • ICML 2026 MI Workshop paper (paper/icml2026_mi/main.tex)
  • JOSS paper (paper.md + paper.bib)

Tests & Infrastructure

  • 1320+ tests passing / ruff + bandit clean
  • Gradio demo (demo/app.py) for HuggingFace Spaces / local use
  • GitHub Actions reusable workflow (.github/workflows/yuragi-check.yml)

v0.5.0 (Next) — Validation & Ecosystem

  • Embedding-based semantic classification (replace Jaccard with SBERT similarity backend)
  • Gemma-2 SAE feature decomposition (extend white-box experiments beyond Pythia/Qwen2)
  • Full hallucination validation — controlled dissociation benchmark across 3+ models with human-validated ground truth
  • Cross-species experiments (Section 19 Universal Fragility Principle empirical validation)
  • Sach/Wort probe redesign with paraphrase dataset and n ≥ 30
  • Async Scanner API for batch workloads
  • Export to W&B / MLflow
  • Multi-language prompt support (beyond EN/JA)
  • Cross-model benchmark database with shareable results
  • Plugin system for custom perturbation types
  • yuragi watch — continuous monitoring mode
  • Pre-built Docker image
  • MkDocs documentation site (GitHub Pages)

v0.2.0 (2026-04-11, Released) — Empirical Reproducibility & Academic-Grade Foundation

  • Semantic Entropy + Verbalized↔Logit Gap
  • 13 perturbation types, 11 psychology experiments
  • Statistical tests (Cohen's d, paired t-test, Wilcoxon, bootstrap CI)
  • Multi-model compare-models CLI
  • Agentic UQ trajectory tracking
  • TruthfulQA / SimpleQA / HaluEval loaders
  • YuragiConfig frozen dataclass (40 fields)
  • Lazy litellm loader (CLI cold start 22×)
  • benchmarks/falsifiability_check.py
  • docs/theory.md Sections 1–10
  • 1050 tests passing / 1166 collected / ruff + bandit clean

v0.1.0 (2026-04-10, Released) — Full Fragility Analysis

  • Core scanner with 10 perturbation types
  • 3-layer confidence measurement (logprobs, sampling, verbalized)
  • 9 psychology experiments (Asch, Impostor, Authority, Anchoring, Gaslighting, Framing, Cognitive Dissonance, Halo Effect, Primacy-Recency)
  • Fragility Profile (CCI, Recovery Elasticity, Non-linearity Score)
  • Confidence decay (Learned Helplessness) and recovery analysis
  • Model behavioral fingerprinting
  • Adversarial confidence analysis (flip, sycophancy, pressure resistance)
  • Financial-engineering metrics (VIX, Sharpe, drawdown, regime detection)
  • Phase transition detection
  • Linguistic confidence gap (hedge detection, assertiveness)
  • Calibration metrics (ECE, Brier, mutual information)
  • 13 CLI commands, Rich TUI
  • 978 tests, 95% coverage
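
For readers unfamiliar with the calibration metrics listed above, the following sketch computes the Brier score and a binned Expected Calibration Error (ECE) in plain Python. It follows the textbook definitions rather than yuragi's implementation; the function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, correct))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Five answers given at 80% confidence, four of them correct.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits = [1, 1, 1, 1, 0]
print(brier_score(confs, hits))
print(expected_calibration_error(confs, hits))
```

In this toy case the model is right 80% of the time at 80% confidence, so the ECE is close to zero while the Brier score stays positive because individual predictions are still uncertain.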