Dynamical Models of AI Governability (blog post + basin explorer)#154
Draft
davidoj wants to merge 71 commits into
Copied from claude-scratch/basin-explorer (canonical model implementation), excluding node_modules and dist. Original left in place for David to retire. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Vite base /basin-explorer/, outDir static/basin-explorer; Hugo serves it as-is with no new site dependencies. Built assets committed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reads query string on load (clamped to slider ranges), writes back via history.replaceState whenever state changes; only non-default values encoded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
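The round trip described above (read, clamp, write back only non-defaults) can be sketched as follows. This is a language-agnostic illustration in Python, not the app's code: the app does this in the browser via `history.replaceState`, and the slider names, ranges and defaults in `SLIDERS` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical sliders: name -> (min, max, default). Illustrative values only,
# not the basin explorer's real ranges.
SLIDERS = {
    "l": (0.0, 2.0, 0.2),
    "k_cu": (0.0, 1.0, 0.05),
}


def decode_state(query: str) -> dict:
    """Read a query string, fall back to defaults, clamp to slider ranges."""
    raw = parse_qs(query.lstrip("?"))
    state = {}
    for name, (lo, hi, default) in SLIDERS.items():
        try:
            value = float(raw[name][0])
        except (KeyError, ValueError):
            value = default  # missing or unparsable -> default
        state[name] = min(hi, max(lo, value))
    return state


def encode_state(state: dict) -> str:
    """Encode only the values that differ from their defaults."""
    changed = {k: v for k, v in state.items() if v != SLIDERS[k][2]}
    return urlencode(changed)
```

Encoding only non-default values keeps shared links short and means a link to the default state is just the bare URL.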
alpha = 1 + l - k_uu was labelled as the k_cu->0 reduction of Condition 1 but omitted the O*(0) factor on l. Badge now shows l*O*(0) vs k_uu-1, matching the actual boundary-stability condition. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Zero level set of a swappable margin function (BASIN_BOUNDARY.margin): delta=1 condition l >= k_cu(a+c) + b(1+c) + 2*sqrt(k_cu*b*(a+c)(1+c)) for b = k_cu+k_uu-1 > 0 (reduces to l >= k_cu(sqrt(a+c)+sqrt(1+c))^2 at k_uu=1), basin always exists for b <= 0. Verified against numeric root scan (~5e-4). Swap margin() for the delta-general condition in Phase B. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
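A minimal sketch of the delta=1 margin stated above, useful for checking the k_uu=1 reduction by hand. The function name and the convention of returning +inf when b <= 0 ("basin always exists") are assumptions for illustration; the in-app `BASIN_BOUNDARY.margin` may differ.

```python
import math


def margin_delta1(l, k_cu, k_uu, a, c):
    """l minus the delta=1 threshold; positive means the condition holds.

    Threshold: k_cu(a+c) + b(1+c) + 2*sqrt(k_cu*b*(a+c)(1+c)), b = k_cu+k_uu-1.
    """
    b = k_cu + k_uu - 1.0
    if b <= 0.0:
        return math.inf  # assumed convention: basin always exists for b <= 0
    x = k_cu * (a + c)
    y = b * (1.0 + c)
    return l - (x + y + 2.0 * math.sqrt(x * y))
```

At k_uu = 1 we have b = k_cu, so the threshold collapses to k_cu(sqrt(a+c) + sqrt(1+c))^2, which is 4*k_cu at a = 1, c = 0.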
URL state, horizon slider, corrected basin badge, analytic boundary overlay. Verified in headless Chromium against the standalone static build (15/15 checks; evidence in _scratch/review/app-test-evidence/). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… presets
- delta-general dynamics: F_c gains (1-delta)*l*O*Q; G = (1-Q)+eta+(k_uu-delta*l*O)Q
- fixed-point scan and outcome-map boundary use the delta-general quadratic (A_delta = b(a+c) - (1-delta)l) with closed-form threshold l* = min(l_A, max(P, l+))
- new delta slider (default 0.7, flagged as fresh filtering-fraction estimate)
- l slider updated to corrected calibration (default 0.4; Petri trend identifies l*O)
- Broad / Strict / AI-2027 preset buttons; k_hu step 0.005 so Strict is representable
- docs panel rewritten for delta (Condition 2 = A_delta > 0; C1 no longer necessary)
- regression: delta=1 reproduces pre-change app bit-for-bit (22/22 checks)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…endix
- F_c gains the redirected flow (1-delta)*l*O*q_u; F = q_c+q_h+(k_uu-delta*l*O)q_u
- destruction-vs-redirection prose and mechanism table (filtering/retraining/control)
- Condition 1 qualified (necessary only at delta=1); Condition 2 delta-general (k_uu+k_cu-1 > (1-delta)l/(a+c), exactly A_delta > 0)
- headline closed-form basin threshold: l*O* >= 4*delta*k_cu (delta>=1/2), k_cu/(1-delta) (delta<=1/2); four-to-one rule at delta=1
- long-run fixed-points appendix reworked to the delta-general quadratic (A_delta = b(a+c)-(1-delta)l); endpoint-stability intuition updated
- delta estimation passage (filtering-fraction heuristic, central 0.7, range 0.3-1.0) and summary-table row; AI 2027 1/16th passage restated delta-generally (2^(-4*delta^2))
- new idealization bullets (per-action vs pool suppression, dilution terms, constant-l stabilisation, pools as persistent behaviours, redirection at par)
- AI control limitation note with class-level-discovery signpost for O
Derivations verified in _scratch/review/scripts/verify-delta-fixed-points.js (12/12).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
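The headline two-branch threshold in this commit can be written as a small function. This is a sketch of the stated inequality only (the function name is hypothetical), and later commits in this PR supersede the formula.

```python
def basin_threshold(delta, k_cu):
    """Minimum product l*O* for the basin, per the two branches in the commit.

    delta >= 1/2: 4*delta*k_cu;  delta <= 1/2: k_cu/(1-delta).
    """
    if delta >= 0.5:
        return 4.0 * delta * k_cu
    return k_cu / (1.0 - delta)
```

The two branches agree at delta = 1/2 (both give 2*k_cu), and delta = 1 recovers the four-to-one rule l*O* >= 4*k_cu.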
- conversion now reads l*O(0) ~ T_auto/T_half; central l = 0.4 (was 0.2), working range 0.2-1.0 (previous 0.1-0.5 scaled by 1/O(0) = 2)
- both identification caveats added: net-of-inflow biases down, falling-k_hu confound biases up; double-duty note on the observed-misbehaviour evidence
- summary-table row updated
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
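The recalibration is a one-line rescaling: the trend identifies the product l*O(0), so l is recovered by dividing by O(0). A minimal sketch, with a hypothetical function name and O(0) = 0.5 taken from the commit's "1/O(0) = 2":

```python
def l_from_product(product, O0=0.5):
    """Recover l from the identified product l*O(0)."""
    return product / O0
```

With O(0) = 0.5 the old central value 0.2 maps to 0.4, and the old range 0.1-0.5 maps to 0.2-1.0.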
- up-front double-duty statement: observed-misbehaviour evidence sets q_u0 AND k_hu (they converge during the human era); only the k_hu reading does work
- Broad = behavioural rates at face value (k's = q_u0 = 0.05); Strict = reproduction-relevant (ii') rates ~10x lower (0.005); all structural parameters shared (l = 0.4, delta = 0.7, k_uu = 1, O* = O(0) = 0.5)
- parameter table and link to app presets
Verified in _scratch/review/scripts/calibration-verdicts.js (12/12 + 8/8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rdicts
- 'uncooperative dominant' single verdict (computed at the mis-identified l = 0.2, delta = 1) replaced by the verdict table over Broad/Strict x delta
- Broad: genuinely marginal (monostable q_u* ~ 0.32 at delta = 0.7; exact knife edge at delta = 1); Strict: cooperative dominant robustly (q_u* ~ 2.5%)
- headline: the decisive uncertainty is how much observed misbehaviour is reproduction-relevant, not any structural parameter
- David's authority caveats and gradual-takeover prose retained
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- lagged-but-unconfounded measurement: re-auditing archived systems with current tools reaches near-ceiling observability (0.86-0.97 at 2-3 sigma lag vs <= 0.67 contemporaneous); retain-development-logs recommendation - the lagged audit directly adjudicates Broad vs Strict (~5% vs ~0.5% reproduction-relevant share of archived dev labour) - thresholds: good outcomes are endemic management, not eradication; the model locates plateaus and margins but cannot set safe levels; decision sketch - per-calibration falsifiable predictions from RK4 trajectories: observed misbehaviour rises on BOTH good paths (replaces the raw note's 'probably not self-stabilizing' conjecture, which the central calibrations contradict); shared near-term observability hump prediction - confidently-good vs not-bad criteria; Strict passes, Broad does not - cross-link to EleutherAI reward-hacking indicators post as a k_hu/l indicator Trajectories: _scratch/review/scripts/good-path-trajectories.js. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- hook: live AI-builds-AI evidence (Codex builds Codex, Claude builds Claude, automated alignment researchers) moved from the estimation section to the intro - model section: explanation in words first (q_u' and F_u remain inline), full equation block demoted to a compact 'Model reference' subsection - 'Long-run model behaviour' retitled as the basin-picture section (already carries the headline inequality) - 'Implications and illustrative scenarios' wrapper removed; 'The default path' promoted to a top-level section - AI 2027 appendix promoted to a body section after the default path, with the 'similar concerns mask different risk models' point promoted to its thesis - calibration compressed: load-bearing judgements and named-calibrations table stay in body; source-by-source literature detail moved to a new 'calibration evidence in detail' appendix - terminology note (uncooperative vs misaligned) and prominent app links added Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- duplicated intro clause removed; citation placeholders replaced with real links (AI Futures model, Eth & Davidson software intelligence explosion, Davidson-Epoch interactive takeoff model) - related-work paragraph (Davidson takeoff, Christiano 'What failure looks like' Pt 1 as the default path informally formalised, Hubinger deceptive alignment as the bistable regime) - typos: possibilities, substantial, robust, elicitation, transition, aggressive, representative, technology, 'tendency to instil', c_M - AI 2027 list renumbered (6 -> 8 skip removed); RepliBench paren closed and rephrased (negative results are data, not absence of data) - '## Appendices' header added; all in-text anchors verified to resolve; explicit link to the fixed-points appendix from the basin section - front-matter description filled (draft: true and date retained) - AI usage note updated to reflect the delta/calibration/app workflow honestly Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- figures_ai2027.py: delta-general model core (F_c redirected flow, G, G0, delta-general r-quadratic classifier); defaults l = 0.4, delta = 0.7 - validation extended: delta=1/l=0.2 reproduces the pre-delta JS ground truth exactly, and delta-general verdicts match calibration-verdicts.js (Broad q_u* = 0.323, Strict 0.0255, AI 2027 escape, threshold l* = 0.28) - fig 1 regenerated at l = 0.4, delta = 0.7 (k_cu = 0.9 trajectory) - fig 2 boundary now k_cu = l*O*/(4*delta); AI 2027 annotation updated (needs l ~ 2.5, ~6x central, even at O* = 0.99) - post captions updated to the delta-general boundary formula and parameters Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- enable goldmark passthrough for block \[ ... \] delimiters only (existing
posts use $$ + escaped-underscore style and are untouched; full-site md5
diff confirms only this post and reward-hacking-indicators change)
- reward-hacking-indicators: two escaped \[..\] bracket literals would now
pass through to MathJax; switched to plain brackets — rendered page verified
byte-identical to the pre-change build
- join bare '=' / '-' lines inside display blocks (goldmark setext headings
were swallowing ~25 equations and polluting the ToC)
- inline math: ^* -> ^{\ast} (38 sites) so markdown emphasis cannot eat
asterisks; set notation via \lbrace/\rbrace
- verified in headless Chromium: 591 MathJax containers, 0 typeset errors,
0 console errors; figures load; ToC and anchors clean
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
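The `^*` -> `^{\ast}` rewrite in the commit above can be sketched as a regex pass over inline math. This assumes inline math is delimited by single `$...$` and uses a deliberately naive matcher; it is an illustration, not the script used for the 38 sites.

```python
import re

# Naive inline-math matcher: a "$", then the shortest run up to the next "$".
INLINE_MATH = re.compile(r"\$(?!\$)(.+?)\$")


def protect_asterisks(markdown: str) -> str:
    """Replace ^* with ^{\\ast} inside $...$ so emphasis parsing cannot eat it."""
    def fix(match):
        return "$" + match.group(1).replace("^*", r"^{\ast}") + "$"
    return INLINE_MATH.sub(fix, markdown)
```

Asterisks outside math spans are left alone, so ordinary markdown emphasis is unaffected.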
All seven confirmed text discrepancies from the adversarial verification report (_scratch/review/verification-report.md rows 1-7):
1. Strict good-path bullet: Q crosses 1% at sigma=3.94 and reaches ~1.4% by sigma=6; reworded "stays below about 1% (sigma 4-6)".
2. Retro-observability ranges scoped to sigma <~ 7; on Broad lag-2 drifts to ~0.83 by sigma=12 (asymptote ~0.79) - "near-ceiling across the whole series" qualified to Strict.
3. Verdict table Strict delta=0.7 attractor cell 0.026 -> 0.025.
4. "q_u* ~ 0.61 for delta < 0.5" attached to delta=0.3; span 0.53-0.88 stated.
5. Summary table q_u0 central aligned to ~5% (was ~5-15%).
6. "Observability roughly flat after an early hump" corrected: settles ~0.49 on Strict, erodes 0.66 -> 0.34 on Broad; conclusion (observed rate climbs) retained.
7. 1/16th-progress and 2^(-4 delta^2) passages reframed explicitly as a per-doubling cost heuristic (the literal model exits its validity envelope, F < 0, in that regime).
Conditions C1/C2 text (report item 12) was verified already delta-general in the current post; no change needed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Expands the bare per-calibration verdicts into a complete default-path exploration, parallel in depth to the AI 2027 section: - Act structure: human era (q_u pinned at the k_hu anchor; initial conditions forgotten, citing the q_u0 0.1%-99% sweep), handoff at sigma = ln(eta0) ~ 1.6 (parity is not an event in the model), and per-calibration post-handoff dynamics with sigma AND calendar landmarks (Broad: 10% @ sigma~4/3y, 20% @ ~10/7y, attractor 0.32 over decades with O eroding 0.66 -> 0.34; Strict: 1% @ 3.94/2.8y, settling at the endemic 2.5% with O ~ 0.49). - Broad framed as genuinely marginal, never a clean verdict: exact delta=1 knife edge (l O* = 0.2 = 4 k_cu; l = 0.396 vs 0.404 flips the asymptotic outcome while observables differ by <= 0.15 pp for ~40 y), plus a newly flagged proximity: the takeover endpoint stabilises at exactly delta = 0.75, just above the central 0.7. - Fire-alarm question answered from runs: on the bad path (Broad, l = 0.2, delta = 0.7) no observable is discontinuous - observed rate peaks at 12.7% near sigma~28 (~20 y) then declines, max movement ~2 pp/year, and is within 1.05-1.2x of the good path through sigma=8. Far-from-boundary regimes (AI 2027) do self-announce. - Rescue analysis by attractor LOCATION (basin existence is not a good-outcome proxy at delta<1), provisional good threshold q_u* <= 0.1: k_cu <= 0.019 or l >= 1.08 (both 2.7x moves) rescue alone; O* = 1 tops out at q_u* ~ 0.14 and delta = 0 at 0.25 - mapped to alignment-training / control-correction / interpretability / mechanism-mix intervention classes. - Observed-vs-true divergence noted as an emergent, unforced result (not the flagship), contrasted with AI 2027's discrete transition. Figures generated by _scratch/review/scripts/default-path-figs.py, which self-validates against the independent verification numbers (11/11) before drawing. Existing verdict table/text absorbed, not duplicated. Hugo build verified clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Delete model.mjs, analytic.mjs, scenarios.mjs, transients.mjs and
explore_ai2027.mjs: all five predate the destruction-fraction (delta)
extension (no delta, default l=0.2, old quadratic), and analytic.mjs
asserts a delta=1-only claim ("regime boundary independent of l") that
is false for delta<1. They were not part of the vite build but shipped
in the repo and contradicted the published model. Deleting rather than
updating: they were one-off exploration scripts fully superseded by the
in-app engine (src/BasinExplorer.jsx, canonical) and the self-validating
delta-general port in figures_ai2027.py.
Add a README noting the canonical math engine and the removal, and fix
figures_ai2027.py docstring references to the deleted files (its
validation values are pinned in-file; it still passes PORT VALID and
reproduces both AI 2027 figures unchanged).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Implement the David-approved production-gated model (derivation audit Part A, G1-G12): observability-gated suppression lO now intercepts the leakage inflow into F_u as well as removing from the q_u stock, with the same delta split of the full intercepted flow lO*L.
- makeDeriv/computeG/calibratedRates: gated F_u, F_c, G (leakage is no longer a pure transfer; destroyed intercepted leakage subtracts from G, including in the calibration-time G0)
- findSteadyStatesR: gated fixed-point function f(r) = k_cu + b*r - lO*(r)(k_cu+r)(1+(1-delta)r)
- BASIN_BOUNDARY: gated lstar (u/V/N discriminant form, three-branch at k_uu=1, a=1); new ERADICATION_BOUNDARY overlay (lO*(0) = 1) on the outcome map; exact-zero crossings now traced
- classifyBasin/badge: new eradication regime (q_u -> 0 reachable when lO*(0) >= 1), incl. bistable eradication/takeover and the two-root eradication-basin + high-attractor case
- presets: identity-based leakage defaults (Broad trend-adjusted k = 0.0407, Strict steady-state k = 0.00542, AI 2027 k_hu follows Broad); k-slider steps 0.005 -> 0.0001; app defaults stay = Broad
- URL schema versioned (v=2); unversioned pre-gating links load defaults
- docs panel + hover cards: gated equations, workaround/pipeline rationale, gated C1, C2-unchanged note, gated quadratic and threshold, eradication bullet, q_h <-> eta notation mapping
Verified: _scratch/review/scripts/gating-app-regression.js (38/38; the gate-zeroed source reproduces the previous model bit-for-bit, and presets reproduce the derivation fixed points to 3 decimals) and gating-browser-drive.mjs (27/27 in headless Chromium); report in _scratch/review/app-test-report-gating.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
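To make the gated fixed-point function concrete, here is a sketch of f(r) together with a generic sign-change scan refined by bisection. It is not a port of `findSteadyStatesR`: the scan parameters are hypothetical, `O_star` is passed in as a callable because its form is not given in this commit, and b = k_cu + k_uu - 1 is taken from the earlier boundary commit.

```python
def gated_f(r, k_cu, k_uu, l, delta, O_star):
    """f(r) = k_cu + b*r - l*O*(r)*(k_cu + r)*(1 + (1 - delta)*r)."""
    b = k_cu + k_uu - 1.0
    lO = l * O_star(r)
    return k_cu + b * r - lO * (k_cu + r) * (1.0 + (1.0 - delta) * r)


def scan_roots(f, r_max=100.0, n=2000, tol=1e-12):
    """Find roots of f on [0, r_max]: uniform sign-change scan plus bisection."""
    roots = []
    grid = [r_max * i / n for i in range(n + 1)]
    for lo, hi in zip(grid, grid[1:]):
        f_lo, f_hi = f(lo), f(hi)
        if f_lo == 0.0:
            roots.append(lo)  # exact zero on a grid point
            continue
        if f_lo * f_hi < 0.0:  # bracketed sign change
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                f_mid = f(mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
    return roots
```

As a sanity check under the illustrative assumption of constant O* and delta = 1, f is linear, f(r) = k_cu(1 - lO) + (b - lO) r, so the scan should return the single root k_cu(1 - lO)/(lO - b) whenever that value is positive and inside the scanned interval. A uniform grid can miss closely spaced root pairs, which is one reason a real implementation might prefer a log grid.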
vite build of the gated BasinExplorer.jsx; bundle index-BTR035jY.js -> index-DTN_euC9.js. Browser-verified against this exact build (27/27, _scratch/review/app-test-report-gating.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-existing working-tree edits found at session start: drops the l = 0.2 sensitivity paragraph from the default-path verdicts, softens the retro-audit prose, and rewords the lagged-development-data paragraph. Committed as-is before the production-gating edit queue. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Suppression now intercepts the leakage inflow as well as self-reproduction
(F_u = (1-lO)(k_cu q_c + k_hu q_h) + (k_uu - lO)q_u; F_c redirects (1-d)lO L;
F = q_c + q_h + k_uu q_u - d lO L). Changes, per derivation-audit Part A:
- dynamics-in-words and model reference: gated equations, L defined, beta=1
prevention-efficacy extension noted, leakage no longer cancels in F
- idealizations: gating motivation cross-ref, stock-vs-flow asymmetry
- basin picture: gated Condition 1 (bracketed form, G8), C2 unchanged note,
eradication regime bullet (G9), three-branch closed-form threshold (G6);
delta=1 rule becomes 4 k_cu (1-k_cu)
- default path: identity-calibrated defaults (Broad k=0.0407 trend-adjusted,
Strict k=0.00542 steady-state); Broad dips 5%->4.3% then drifts to a fifth
(q*=0.205, margin 1.79x); Strict plateau 2.2%; knife-edge coincidence
retired, boundary-adjacent framing rebuilt; fire alarm moved to the gated
bad path (50% at sigma~53) and new knife pair 0.222/0.226; rescue levers
restated - monitoring alone can now rescue (O*>=0.80, product 0.32)
- AI 2027: gated B=0-branch arithmetic (lO* ~ 1.4-1.8, half ungated),
1/16th and 2^{-4 delta^2} heuristics replaced, eradication-regime caveat
- good-path: eradication scope condition on 'management, not eradication'
- appendix: gated g(r), g'(0), convexity condition, quadratic B/C, C<=0
case, u/V/N threshold, endpoint and k_cu=0 gating-unchanged notes
All numbers from verify-gated-fixed-points.js (332 checks),
gated-calibrations.js and gated-landmarks.js; prose claims headers in
_scratch/review/drafts/gating-post-prose.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Estimating parameters: the level->leakage identity displayed and explained (F0 convention noted); three-value table (naive 0.050 / steady-state 0.055 / trend-adjusted 0.041 for Broad; Strict 0.0054); defaults justified (Broad trend-adjusted for coherence with its falling identifying series; Strict steady-state - the trend is not licensed for the (ii') pool); net-of-inflow caveat absorbed into the identity; the same series now flagged as doing triple duty (level, trend, rate); delta-independence of the ell estimate restated under gating - Two-readings passage: the falling observed series is ambiguous between benign (q_u falling 0.22/sigma) and alarming (O falling 0.5->0.23 with flat truth, or 0.5->0.11 with rising truth); separation requires an independent O-trend instrument; tied to the good-path dichotomy - Summary table: k_hu/k_cu rows carry the identity anchors; new trend row g - Named calibrations: double duty -> triple duty; symmetric Broad/Strict framing (measured proxy vs explicit guesswork, no headline hierarchy); parameter table and new variant-verdict table (incl. delta=1 spread) - Dual observability-calibration routes: when to use O*-prior vs T_M,1/2 / T_E,2 rate-judgement route (text + appendix intro) Numbers from gated-calibrations.js; cf. drafts/calibrations-v2.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… scope - Good path: the foregrounded near-term catch-rate prediction is replaced by the medium-term dichotomy (re-verified under gating across O* in [0.15, 0.5], landmarks-v2 s5): either observed misbehaviour rises by sigma ~ 8, or observability itself is collapsing (O 0.5 -> 0.13 in the one non-rising cell); near-term direction explicitly O-sensitive - O-robustness tags appended to the good-path section: robust (dichotomy, no-fire-alarm, ell*O*(0) product structure incl. the gating-modified monitoring-rescue claim) vs O-sensitive (basin verdicts, near-term observed direction, crossing times) - Capability-gating scope statement: single-pool limitation named in the idealizations (non-reproducing (i') misbehaviour has nowhere to live; today's rates may calibrate observability and (i')-leakage, Broad's defence is the proxy hypothesis); sigma-clock starts where (ii') becomes capability-possible; k_uu(sigma) ramp noted, not implemented; calibration scope note tied to why Strict declines the trend - Two-pool extension promoted to lead future-work item in the neglect/subversion appendix (subsumes suppression-channel and control questions) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- O-transient demoted from shared near-term prediction to calibration diagnostic: hump size measures the human monitoring subsidy (q_h0, a_AI/H, both weakly anchored; peak 0.66 at a_AI/H=1 vs ~0.53 at 10, long-run O* untouched); stagnating catch rates degenerate, falling human-era catch rate is the alarm - Path-prediction bullets re-fixed to gated defaults (Broad dip to 4.3%, back through 5% at ~2 y, 10% at ~5 y, endemic fifth, observed ~3.5% at sigma 4; Strict 1% at sigma 4.9, plateau 2.2%) and given an explicit series-binding rule: predictions bind (ii')-targeted instruments; Broad additionally binds the broad series via its proxy hypothesis, Strict predicts the broad series will NOT track q_u - Confounding paragraph tied to the two-readings ambiguity arithmetic - Retro-observability numbers updated (Broad lag-2 drift to ~0.85 by sigma 12, asymptote ~0.83) - Confidently-good paragraph: 2:1 -> 4:1, 1.4x -> 1.8x, margin caveat - Indicators: reward-hacking indicator labelled as binding the broad series; the two instrument gaps named ((ii')-targeted audits and an O-trend instrument: fixed-protocol monitorability testbeds, lagged dev-log audits) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- figures_ai2027.py ported to the gated engine (gated F_u/F_c/G, gated c_M calibration, gated quadratic classifier with eradication regime, three-branch lstar port); self-tests re-baselined by design against gated fixtures (Broad 0.2053 / Strict 0.0222 / l* 0.2241 / d=1 rule 0.38 / B=0 branch 2.835 & 3.6 / eradication trajectory / AI 2027 sigma@50% = 3.02, observed peak 23.6%) - all pass - ai2027-observability-cannot-save.png: gated boundary inverted by bisection (panel a), basin boundary clipped at the new eradication line with AI 2027 marker moved to it (panel b); caption rewritten (old k_cu = lO*/(4 delta) formula retired) - ai2027-high-leakage-run.png: gated trajectory (k_hu follows the Broad default 0.0407), observed peak ~24% - default-path figures regenerated from the gated default-path-figs.py (validates 14/14 against gated-landmarks fixtures): dip annotation, attractor 0.21 / 0.022, rescue panels with reversed O* verdict and attractor-climb caveat; captions updated - AI 2027 policy-contrast sentence: required product 5-9x central Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Strategy-stealing citation on the equal-reproduction assumption; E redefined as a stock of failure surfaces with named channels; the constant-l/l_k idealization recast as exchange-rate assumptions (the coverage race is O's dynamic, not an l assumption); redirection quality discounts and lags folded into delta_eff = 1 - v(1-delta) exactly; draft-history self-references removed or recast as timeless statements throughout (eradication retraction parentheticals, re-centring talk, earlier-version comparisons); inline-math backslash escapes doubled so Goldmark passes them to MathJax intact; hook paragraph trimmed and instrumental-convergence wording tightened (David). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
✅ Deploy Preview for gilded-faun-da8b4f ready!
✅ Deploy Preview for pedantic-newton-e8d424 ready!
The Goldmark passthrough config in config-blog.toml requires Hugo >= 0.122; the pinned 0.109.0 ignores it silently, so inline/display math would be markdown-mangled on the deployed blog. 0.127.0 matches local development. Both site configs verified to build clean under 0.127.0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Four verdict-level results up front (evidence brackets the basin boundary; bad paths are quiet; reported misbehaviour trends are instrument-relative; rescues are multi-fold) so readers get the meat before the model machinery. Wording by David from the reviewed draft; dichotomy texture pending v4 re-verification as flagged in the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Implements the David-specified v4 model change in the basin explorer: passive opacity becomes a per-doubling constant (dE_passive = c0*dA) instead of a production share; only active evasion remains a labour share. Downstream of the one-term change: - Long-run O*(r) is now implicit (minus root of cdP*O^2 - D*O + 1 = 0); fixed-point scan root-finds g with the implicit O* on a log grid with validity-envelope edge bisection (the envelope can clip real roots in deep-suppression corners). - Basin boundary overlay and lstar are numeric (-min g over the cooperative-side validity interval); v3's closed-form threshold chain is superseded, the delta=1 closed form survives as a cross-check. - c0 calibration switches to the O*-dial pin at the long-run rare-uncooperative point: c0/cM = (1-O*)/(O*(1-delta*lk*kcu*O*)), re-solved per parameter cell (badge O*(0) == the dial exactly). - Docs panel: new e' equation, per-doubling c0 reading + rationale, the dropped production-share variant named in one sentence, implicit-O* quadratic + cubic Phi fixed points, v4 endemic floor with the -c*d*k term and its validity bound, two-product claim removed (falsified by v4), C1/C2 displays updated. - URL schema bumped to v=4 (hard fallback: v<4 links load defaults). - Presets unchanged (identity values provably identical under v4). Verification: _scratch/review/scripts/v4/v4-app-regression.js (61/61; pinned bit-for-bit reconstruction of 6539bf1 when the two v4 edits are reverted; c0=0 reproduces v3 exactly; fixtures from calibrations-v4 / landmarks-v4) and v4-browser-drive.mjs (28/28). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
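Two formulas from this commit can be sketched directly: the minus root of the implicit-O* quadratic and the c0 dial pin. The coefficient names `cdP` and `D` are used exactly as written in the commit and are treated here as opaque inputs (with D > 0 assumed); the function names are hypothetical.

```python
import math


def o_star_minus_root(cdP, D):
    """Minus root of cdP*O**2 - D*O + 1 = 0, assuming D > 0.

    Uses the equivalent form 2 / (D + sqrt(D**2 - 4*cdP)), which avoids
    cancellation and reduces to O = 1/D as cdP -> 0.
    """
    disc = D * D - 4.0 * cdP
    if disc < 0.0:
        raise ValueError("no real root: D**2 < 4*cdP")
    return 2.0 / (D + math.sqrt(disc))


def c0_over_cM(O_star, delta, lk, kcu):
    """Dial pin: c0/cM = (1 - O*) / (O* * (1 - delta*lk*kcu*O*))."""
    return (1.0 - O_star) / (O_star * (1.0 - delta * lk * kcu * O_star))
```

The rewritten root is algebraically identical to (D - sqrt(D^2 - 4 cdP)) / (2 cdP) but stays well defined at cdP = 0, where the quadratic degenerates to the linear equation D*O = 1.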
vite build of the v4 app source (asset index-DhUrkXvT.js replaces index-DrxafwgL.js). Browser-verified against the built bundle on a local server: 28/28 checks (badges, dial-pinned c0 readouts, URL v=4 hard fallback, endemic floor, O*-dial rescue, growth-peg docs, outcome map with the numeric boundary, zero console errors). Evidence in _scratch/review/app-test-evidence/v4-*.png and _scratch/review/app-test-report-v4.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Passive opacity becomes a per-doubling constant instead of a production share: e' = c_0 + a_E/M c_M q_u / F - e (F_E := c_0 F + a_E/M c_M q_u). Rewrites the model reference, dynamics-in-words and idealizations (the peg named as a specification choice, with the dropped production-share variant and its consequences stated); the basin picture and fixed-points appendix move to the implicit long-run observability (per-q_u quadratic, minus root) and the cubic fixed-point equation, with the new floor formula and its validity condition, numeric threshold + delta=1 closed form, and re-derived C1/C2; the sigma-clock and observability-calibration appendices re-derive e' and the T_E,2 route; c_0 is now back-solved from a long-run observability dial per parameter cell. AI usage note updated. Derivations verified by _scratch/review/scripts/v4/check_algebra_v4.py (36/36); audit trail in _scratch/review/derivation-audit.md Part A''. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All 12 verdict cells keep their outcome (Broad monostable takeover at
margins 0.48-0.89x, Strict endemic 4.4-4.9% at 4.6-6.6x); transients
move: observability is flat (~0.5) through the human era and handoff
(no hump, no subsidy), crossings shift ~0.5-1 sigma earlier (Strict 1%
at sigma~3.9, ~2.8y), observed-rate path 2.5->3.8->7.0->9.2% peaking
12.8% near sigma~33, knife pairs 0.308/0.316 (<=0.19pp), retro-O bands
0.86-0.88 / 0.94-0.95 vs contemporaneous <=0.5, rescue levers k<=0.016
/ l>=0.76 / dial>=0.88 ("every rescue >= 3.5x"), dichotomy 19/20 with
the same exception cell. The O-hump calibration diagnostic becomes a
dial-trend diagnostic; the two-product robustness clause is rescoped
to directly recomputed claims. AI 2027: delta=0.7 threshold l*~3.9-4.0
(product ~2.0, 20x central, ~1/4 per doubling, appearance 44-54%);
at delta=1 no fix rate yields a basin under the dial convention
(convention stated). Two-readings gains the model-side dial gloss.
Numbers from _scratch/review/scripts/v4/{calibrations,landmarks}_v4.py;
fixture digests in _scratch/review/drafts/{calibrations,landmarks}-v4.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
figures_ai2027.py ported to the v4 engine (per-doubling passive opacity, per-cell dial pin, implicit minus-root observability, numeric basin boundary with the delta=1 closed form as cross-check); its self-tests re-baselined to the v4 fixtures (Broad l* 0.3123, AI-2027 l* 3.926/3.995 at delta=0.7 and NO BASIN at delta=1, tolerable k_cu 0.0395->0.1078, l* 1.674 at the 0.99 dial, trajectory 50% at sigma~3.1 with obs peak ~18%) — 12/12 pass. default-path figures regenerated from the v4 default-path-figs.py (validated 21/21 against landmarks-v4/ calibrations-v4): observability panels now flat through the handoff, crossings and rescue thresholds re-annotated (l*=0.312 appearance at q*~0.58, good-attractor l>=0.76, dial>=0.88, floor 14% at l_k=0.5). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Opening act: add the why-human-labour-is-not-an-observability-subsidy paragraph (worked m*/e* flow-balance two-liner; the rejected production-share variant named with its six-fold dilution). - Handoff: nothing happens even in observability — O crosses parity at 0.493 from 0.500, ranges verified against the v4 runs. - Good-path dial-trend paragraph: rising/stagnating/falling catch-rate readings map onto the dial one-for-one; falling = evasion gaining on coverage now; the v3 a_AI/H stagnation degeneracy is gone. - Calibration routes + observability appendix: the two routes nearly coincide at the central calibration (flat-trend c = 1.01 vs pin 1.02); stagnating catch rates are now consistent with flow balance today. Numbers verified against _scratch/review/scripts/v4 this session; drafts/v4-prose.md carries the claims header. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…k anchor - Rescue section: one-frame paragraph on the lk gate reserve — the identity anchors effective seeding today at (1 - lk*O0)*k = 0.031 under both lk readings; the reserve diverges under O movement in both directions, explaining both the worse calibrated margin at higher lk (0.64x vs 0.89x) and the monitoring-rescue split. - Two destinies: one-sentence extension of the entanglement note to lk (higher lk, higher inferred k, worse margin) with a forward pointer. - Named measurements: third measurement added — the recurrence audit of escaped misbehaviour, estimating O(1-lk)/(1-lk*O) (zero at lk=1, ~1/3 at lk=0.5 and O=0.5), the first proposed empirical anchor for lk; parameter table and Building-the-indicators list updated to match. - Timeless-voice fix in Named calibrations (drop 'as this version of the post does'). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
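The recurrence-audit quantity named in this commit is easy to tabulate. A minimal sketch (function name hypothetical) that reproduces the two reference values quoted above:

```python
def recurrence_audit_target(O, lk):
    """O*(1 - lk) / (1 - lk*O), the quantity the recurrence audit estimates.

    Undefined when lk*O == 1.
    """
    return O * (1.0 - lk) / (1.0 - lk * O)
```

At lk = 1 the value is zero for any O < 1, and at lk = 0.5, O = 0.5 it is 0.25/0.75 = 1/3, matching the figures in the commit message.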
Names each David-specified refinement (delta, production gating, bounded lk, growth-pegged passive opacity), credits the delta heuristic and the two-lineage verification scheme explicitly, and keeps revision provenance confined to this note per the post's convention. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Escape the thin-space in the gate-reserve inline math (Goldmark was
eating the single backslash, rendering a literal comma).
- Reword the gate-reserve frame's opening (the margin pattern lives in
the verdict table, not the rescue section) and the handoff sentence
('passes through the handoff' instead of the ambiguous 'crosses
parity' for an O value).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Verdict-table one-character fixes (l* 0.41 at delta=1; attractor 0.68 at delta=0.3); widen the benign-Strict-corners example; Strict prediction bullet rescoped to fifty-fold growth; O*'(0) slope claim scoped to working ranges; sigma~23 rounding. Verdict-gap headline recast as bracket-the-boundary (the reproduction-relevance input straddles the basin boundary; no near-boundary-tuning claim). Fixed-instrument flip-side added to the instrument theory (declining fixed-harness series ~guaranteed at nonzero fix rate; the series is an l-meter) with a matching reflex note in Building the indicators, and the exec summary's lagged-timeseries sentence anchor-linked. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ionale Pools defined over deployed labour rather than weights (constructive misuse as a present-day human-assisted member of A_u); strategy-stealing reframed as long-run convergence; RepliBench scoped as weights-level with the assemblage-level qualification; the horizon-independence form behind Strict's discount, with its two asymmetries. No parameter changes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ument Development labour acts on successors by construction, so independence of misbehaviour and setting yields the broad rate, not a discount; the low reading rests on evidenced negative correlation (rates fall sharply under naturalisation). Broad restated as zero correlation with setting; Strict's 10x as a judgement about how far the naturalisation trend extends. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Body gains the rarely-farsighted defeater (matching the summary) with its empty-lagged-audit signature, and both body and summary state the semi-positive reading of a quiet signal (occasional findings + flat trend + an explaining signature, else evidence about observability). Dangling dichotomy references removed or relabelled; definition and typo fixes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Occasional reproduction-relevant findings rule out the rarely-farsighted defeater, gradual-not-sharp declines rule out a suppression surge, and a series that includes misuse rules out mispointed instruments — once all three hold, a flat series raises concern (rise coming, or observability eroding) rather than providing reassurance. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Exec summary claim 3 is now two sentences plus the verified bend claim and a link to What each path predicts; defeater detail lives in the body (which already carried it). Body gains the verified bend statement (20/20 cells for true share and observed rate, including shallow-form survival in the O-collapse exception cell) and the two practical corollaries (misuse-inclusive measures; bend-or-investigate reading). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The definition now lives in the state space (low plateau, wide margin, inside the valid regime — with the low bar and validity bar explicitly related but doing different jobs) and the measurement story follows separately, instead of a checklist mixing the two. Jargon translated; verdicts and numbers unchanged. Also fixes the near-term observability paragraph's M-vs-E slip and typo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Confidently good now means: low plateau, wide margin, and both holding across the plausible ranges of every input. Verified at Strict leakage: central is good (4.7%, 4.6-6.6x) but the one-percent plateau is out of reach everywhere (best ~1.5%), the ten-times bar is reachable only in the friendliest corners (10-20x), and the low-observability corner loses the good outcome entirely. Plain-language wording throughout. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Confidence = survives the sweep; Strict fails on exactly one axis — long-run observability within plausible guesses — surviving every other input. Bars and corner detail dropped per David's direction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Seeing a third of misbehaviour long-run (vs half today) lifts Strict's plateau from ~4.7% to ~9-12%; verified by the wiggle-sweep ladder. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claim 1 is now the central-parametrization headline with the four-step intuitive mechanism and an explicit uncertainty paragraph; lead-in and cross-references adjusted to match the one-scenario framing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What this is
A new blog post, Dynamical Models of AI Governability, plus an interactive companion app (the cooperativeness basin explorer, served at /basin-explorer/).
The post applies a small ODE model to the question of whether the AI workforce that builds future AI ends up mostly cooperative or mostly uncooperative. It covers where the basin boundary between a managed-endemic outcome and takeover lies; what current misbehaviour/suppression evidence says about which side we're on (the two named calibrations, Broad and Strict, currently land on opposite sides); whether there's a fire alarm on the bad path; and what observables would tell us we're on the good one. The app exposes the model's full parameter space, with presets matching the post's calibrations.
I've used a lot of AI assistance in developing the model and writing the post, and I'm working through all the items I need to review.
Also included (affect the live site — could be cherry-picked separately)
- `29ace12` fixes a theme bug (`classList +=` clobbering `.post-content` styling on any mathjax post with top-level display math; paragraph spacing was collapsing to a wall of text).
- `f3c3337` removes the `polyfill.io` script from the mathjax partial (the domain was compromised in a 2024 supply-chain attack; the ES6 polyfills are unnecessary).

🤖 Generated with Claude Code