Releases: ethpandaops/panda

v0.38.0-rc.3

Pre-release

@Savid Savid released this 03 Jul 02:49
1bbe2d3

What's Changed

  • docs(runbooks): fix accuracy, contract, and retrieval issues from live testing by @Savid in #278

Full Changelog: v0.38.0-rc.2...v0.38.0-rc.3

🐼 Release qualification

Full eval all files: 37 questions, 216 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 1bbe2d3.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.38.0-rc.3 (this release) | 0% (0/216) | 0.000 | 0 |
| v0.38.0-rc.1 | 94% (204/216) | 0.232 | 22,867 |
| v0.38.0-rc.0 | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |

Tokens per correct run: p50 0 · p90 0 · p99 0 · 0 per solve (all spend / correct answers).

Efficiency vs v0.38.0-rc.1: 0.000 — the harden-loop score with the previous release as the frozen token reference. 0.000 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.38.0-rc.1

question v0.38.0-rc.1 v0.38.0-rc.3
attestation_volume_analysis 6/6 0/6 🔻 regressed
au_block_arrivals_chart 3/6 0/6 🔻 regressed
blob_count_stats 6/6 0/6 🔻 regressed
block_count_24h 6/6 0/6 🔻 regressed
block_size_vs_propagation 6/6 0/6 🔻 regressed
block_time_histogram 6/6 0/6 🔻 regressed
block_timing_analysis 5/6 0/6 🔻 regressed
cbt_table_bounds 6/6 0/6 🔻 regressed
chartkit_default_arrival_distribution 6/6 0/6 🔻 regressed
chartkit_default_entity_bar 6/6 0/6 🔻 regressed
dec4_relay_circuit_breaker 6/6 0/6 🔻 regressed
devnet_error_logs 6/6 0/6 🔻 regressed
devnets_running 6/6 0/6 🔻 regressed
empty_blob_submitters 6/6 0/6 🔻 regressed
engine_getblobs_blob_scaling 5/6 0/6 🔻 regressed
engine_newpayload_el_comparison 5/6 0/6 🔻 regressed
forky_node_coverage 6/6 0/6 🔻 regressed
head_vote_correctness_24h 6/6 0/6 🔻 regressed
late_data_columns 3/6 0/6 🔻 regressed
list_datasources 6/6 0/6 🔻 regressed
mainnet_average_base_fee_per_gas_jun_10_2026 3/3 0/3 🔻 regressed
mainnet_block_arrival_p50 6/6 0/6 🔻 regressed
mainnet_xatu_top_peer_count_node 3/3 0/3 🔻 regressed
mempool_observer_coverage 5/6 0/6 🔻 regressed
mev_max_value_block 6/6 0/6 🔻 regressed
mev_relay_share 6/6 0/6 🔻 regressed
mev_timing_games 6/6 0/6 🔻 regressed
missed_slots_24h 6/6 0/6 🔻 regressed
parent_distance_orphans 6/6 0/6 🔻 regressed
precompile_usage 6/6 0/6 🔻 regressed
sepolia_validator_lookup 6/6 0/6 🔻 regressed
session_data_analysis 6/6 0/6 🔻 regressed
storage_upload_session_scoped 6/6 0/6 🔻 regressed
testnet_status 4/6 0/6 🔻 regressed
tracoor_node_coverage 6/6 0/6 🔻 regressed
validator_count 6/6 0/6 🔻 regressed
validator_head_accuracy 6/6 0/6 🔻 regressed

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| attestations | 3 | 0/18 (0%) | 0 |
| blobs | 4 | 0/24 (0%) | 0 |
| blocks | 14 | 0/84 (0%) | 0 |
| clickhouse | 27 | 0/156 (0%) | 0 |
| execution | 6 | 0/33 (0%) | 0 |
| mev | 4 | 0/24 (0%) | 0 |
| multi_step | 6 | 0/36 (0%) | 0 |
| networks | 8 | 0/45 (0%) | 0 |
| smoke | 8 | 0/48 (0%) | 0 |
| storage | 1 | 0/6 (0%) | 0 |
| timing | 9 | 0/54 (0%) | 0 |
| validators | 4 | 0/24 (0%) | 0 |
| visualization | 7 | 0/42 (0%) | 0 |

Failed runs

question failed sample grader reason
attestation_volume_analysis 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
au_block_arrivals_chart 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
blob_count_stats 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
block_count_24h 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
block_size_vs_propagation 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
block_time_histogram 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
block_timing_analysis 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
cbt_table_bounds 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
chartkit_default_arrival_distribution 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
chartkit_default_entity_bar 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
dec4_relay_circuit_breaker 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
devnet_error_logs 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
devnets_running 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
empty_blob_submitters 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
engine_getblobs_blob_scaling 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
engine_newpayload_el_comparison 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
forky_node_coverage 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
head_vote_correctness_24h 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
late_data_columns 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
list_datasources 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mainnet_average_base_fee_per_gas_jun_10_2026 3/3 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mainnet_block_arrival_p50 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mainnet_xatu_top_peer_count_node 3/3 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mempool_observer_coverage 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mev_max_value_block 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mev_relay_share 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
mev_timing_games 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
missed_slots_24h 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
parent_distance_orphans 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
precompile_usage 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
sepolia_validator_lookup 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
session_data_analysis 6/6 API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`storage_uplo...
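
Every sample reason listed above for this release is the same provider-side 503, so it can help to separate infrastructure failures from genuine grader failures before reading the pass rate as a regression. The following is a minimal sketch, not part of the release tooling: the function names and the three class labels are hypothetical, and the regular expression only assumes the `API error: <status>` and `HTTP <status>` shapes that appear in the sample grader reasons on this page.

```python
import re
from collections import Counter

# Matches the "API error: 503 ..." and "HTTP 503 ..." shapes seen in the
# sample grader reasons. The pattern is an assumption, not a documented format.
_HTTP_STATUS = re.compile(r"\b(?:API error:|HTTP)\s*(\d{3})\b")


def classify_reason(reason: str) -> str:
    """Label a grader reason as 'infrastructure' (5xx), 'request' (4xx) or 'grader'."""
    match = _HTTP_STATUS.search(reason)
    if match is None:
        return "grader"  # free-text rubric verdict, no HTTP status found
    status = int(match.group(1))
    if 500 <= status <= 599:
        return "infrastructure"
    if 400 <= status <= 499:
        return "request"
    return "grader"


def summarize(reasons):
    """Count how many reasons fall into each (hypothetical) class."""
    return Counter(classify_reason(reason) for reason in reasons)
```

If every failed run of a release lands in the `infrastructure` bucket, the per-question "regressed" labels likely reflect the outage rather than the code under test.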

v0.38.0-rc.2

Pre-release

@Savid Savid released this 03 Jul 00:35
ad0426b

What's Changed

  • feat(compute): human-readable default output + list filtering by @samcm in #267
  • feat(compute): add snapshots promote command by @samcm in #271
  • fix(cli): datasource-aware error hints to reduce avoidable CLI errors by @qu0b in #130
  • feat(runbooks): add devnet consensus bug report runbook by @qu0b in #273
  • Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding by @Savid in #274

Full Changelog: v0.38.0-rc.1...v0.38.0-rc.2

v0.38.0-rc.1

Pre-release

@github-actions github-actions released this 30 Jun 08:08
f2f4716

Changelog

  • f2f4716 feat(proxy): forward caller token to compute backend (#268)

🐼 Release qualification

Full eval all files: 37 questions, 216 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit f2f4716.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.38.0-rc.1 (this release) | 94% (204/216) | 0.232 | 22,867 |
| v0.38.0-rc.0 | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |

Tokens per correct run: p50 19,303 · p90 34,731 · p99 72,288 · 24,424 per solve (all spend / correct answers).
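
To make the line above concrete, here is a small sketch of how such figures could be derived from per-run records. It is an illustration under stated assumptions: the `(passed, tokens)` record shape and the function names are hypothetical, and the nearest-rank percentile is only one common definition (the report does not say which it uses). The "per solve" figure follows the parenthetical: tokens from all runs, failed ones included, divided by the number of correct answers.

```python
import math


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a non-empty ascending list (assumed method)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def token_summary(runs):
    """Summarize token spend from (passed, tokens) pairs (hypothetical shape)."""
    correct = sorted(tokens for passed, tokens in runs if passed)
    total_spend = sum(tokens for _, tokens in runs)  # includes failed runs
    if not correct:
        # With no correct runs there is nothing to divide by; report zeros.
        return {"p50": 0, "p90": 0, "p99": 0, "per_solve": 0}
    return {
        "p50": nearest_rank(correct, 50),
        "p90": nearest_rank(correct, 90),
        "p99": nearest_rank(correct, 99),
        "per_solve": round(total_spend / len(correct)),
    }
```

Because failed runs still cost tokens, "per solve" is at least as large as the mean over correct runs, which matches the relationship between 24,424 and 22,867 reported for this release.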

Efficiency vs v0.38.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.236 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.38.0-rc.0

question v0.38.0-rc.0 v0.38.0-rc.1
au_block_arrivals_chart 5/6 3/6 🔻 regressed
block_timing_analysis 6/6 5/6 🔻 regressed
cbt_table_bounds 4/6 6/6 🟢 improved
chartkit_default_entity_bar 5/6 6/6 🟢 improved
engine_getblobs_blob_scaling 6/6 5/6 🔻 regressed
engine_newpayload_el_comparison 6/6 5/6 🔻 regressed
late_data_columns 2/6 3/6 🟢 improved
parent_distance_orphans 5/6 6/6 🟢 improved
testnet_status 3/6 4/6 🟢 improved
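
The labels in the table above can be reproduced by comparing the two pass fractions per question. The sketch below is illustrative only: the helper names are hypothetical, and it assumes cells are written as `passed/total` and that a question with no previous cell is a new question.

```python
def parse_fraction(cell):
    """Turn a 'passed/total' cell such as '5/6' into a pair of ints."""
    passed, total = cell.split("/")
    return int(passed), int(total)


def change_label(previous, current):
    """Label a question by comparing its pass rate across two releases."""
    if previous is None:
        return "new question"
    prev_passed, prev_total = parse_fraction(previous)
    cur_passed, cur_total = parse_fraction(current)
    # Cross-multiply so rates with different totals compare without floats.
    left = cur_passed * prev_total
    right = prev_passed * cur_total
    if left > right:
        return "improved"
    if left < right:
        return "regressed"
    return "unchanged"
```

For example, `change_label("5/6", "3/6")` gives `"regressed"`, as in the `au_block_arrivals_chart` row. Rows that would come out as `"unchanged"` do not appear in the table above.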

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| blobs | 4 | 20/24 (83%) | 25,612 |
| timing | 9 | 46/54 (85%) | 22,610 |
| visualization | 7 | 38/42 (90%) | 23,413 |
| execution | 6 | 30/33 (91%) | 26,034 |
| blocks | 14 | 80/84 (95%) | 20,640 |
| clickhouse | 27 | 149/156 (96%) | 19,788 |
| networks | 8 | 43/45 (96%) | 16,263 |
| multi_step | 6 | 35/36 (97%) | 23,743 |
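
The question counts in this table add up to more than 37, which suggests a question can carry several category tags. A rough sketch of that kind of roll-up follows; the record shape (`categories`, `passed`, `runs`) and the function name are hypothetical, and the "only categories with failures, lowest pass rate first" ordering is inferred from the table rather than documented.

```python
from collections import defaultdict


def category_pass_rates(questions):
    """Aggregate per-category pass counts from multi-tagged question records.

    Each record is a dict with 'categories' (list of str), 'passed' (int)
    and 'runs' (int, assumed > 0). A question counts once per tag it carries.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [questions, passed, runs]
    for question in questions:
        for category in question["categories"]:
            entry = totals[category]
            entry[0] += 1
            entry[1] += question["passed"]
            entry[2] += question["runs"]
    # Keep only categories with at least one failed run, worst first.
    rows = [
        (category, count, passed, runs)
        for category, (count, passed, runs) in totals.items()
        if passed < runs
    ]
    rows.sort(key=lambda row: row[2] / row[3])
    return rows
```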

Failed runs

question failed sample grader reason
au_block_arrivals_chart 3/6 API error: 400 Bad Request {"error":{"message":"Error from provider (Alibaba): <400> InternalError.Algo.DataInspectionFailed: Input text dat
block_timing_analysis 1/6 The agent failed to re-use the session to maintain persistence between turns. It repeatedly attempted to use a non-existent session name ('b
engine_getblobs_blob_scaling 1/6 API call error: Error: Error parsing response from https://opencode.ai/zen/go/v1/chat/completions: Unexpected end of JSON input. HTTP 503 Se
engine_newpayload_el_comparison 1/6 The output fails the criterion to conclude that everything sat far below the 12s slot. Instead, it explicitly states that erigon's max durat
late_data_columns 3/6 string
mempool_observer_coverage 1/6 The agent reported 4 distinct execution clients, which falls outside the accepted range of 19-23 distinct observing clients required by the
testnet_status 2/6 The output provides block-production and participation figures derived from real queries, but it completely omits the current epoch and fina

🔭 Langfuse traces (216 runs; ⚠️ = failed)

v0.38.0-rc.0

Pre-release

@github-actions github-actions released this 30 Jun 04:22
bc33ee9

Changelog

  • bc33ee9 feat(compute): add compute datasource for ephemeral sandboxes (#265)
  • 3c3fa2f feat(resources): resolve the right resource path on a read miss (#261)
  • 695c89c fix(search): activate semantic search on proxy discovery, not only at startup (#260)
  • d388d4a release-eval: serve release reports via the race-proof viewer (#259)

🐼 Release qualification

Full eval all files: 37 questions, 216 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit bc33ee9.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.38.0-rc.0 (this release) | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |

Tokens per correct run: p50 19,454 · p90 36,674 · p99 76,701 · 25,016 per solve (all spend / correct answers).

Efficiency vs v0.37.0: 0.224 — the harden-loop score with the previous release as the frozen token reference. 0.235 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.37.0

question v0.37.0 v0.38.0-rc.0
au_block_arrivals_chart 4/6 5/6 🟢 improved
blob_count_stats 5/6 6/6 🟢 improved
chartkit_default_entity_bar 6/6 5/6 🔻 regressed
devnet_error_logs 5/6 6/6 🟢 improved
engine_getblobs_blob_scaling 5/6 6/6 🟢 improved
engine_newpayload_el_comparison 5/6 6/6 🟢 improved
testnet_status 6/6 3/6 🔻 regressed

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| blobs | 4 | 20/24 (83%) | 31,030 |
| networks | 8 | 40/45 (89%) | 15,956 |
| timing | 9 | 49/54 (91%) | 22,714 |
| visualization | 7 | 40/42 (95%) | 23,132 |
| clickhouse | 27 | 149/156 (96%) | 20,830 |
| blocks | 14 | 81/84 (96%) | 20,879 |
| execution | 6 | 32/33 (97%) | 25,844 |
| multi_step | 6 | 35/36 (97%) | 25,040 |

Failed runs

question failed sample grader reason
au_block_arrivals_chart 1/6 The output successfully filters the data for Australian observers, builds a chart, and provides the storage URL. However, it fails the speci
cbt_table_bounds 2/6 The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds
chartkit_default_entity_bar 1/6 The agent failed to re-use the panda execute session across turns. It used session 169c33497071 in Turn 1 but created a new session 29a67bde
late_data_columns 4/6 The agent's query returned 88,240 late observations, which falls outside the accepted range of roughly 35,000 to 55,000 observations specifi
mempool_observer_coverage 1/6 The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric.
parent_distance_orphans 1/6 The agent failed to report the full distribution of parent distances (distance 1, 2, and 3+). Its query lumped all distances > 1 together, a
testnet_status 3/6 The output reports the current epoch (105,673) and recent block-production figures, but it completely omits the finalized epoch, which is ex

🔭 Langfuse traces (216 runs; ⚠️ = failed)

v0.37.0

@samcm samcm released this 29 Jun 07:03
ace9f8e

What's Changed

  • test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
  • feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
  • fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
  • ci(release): build and push sandbox image before goreleaser by @samcm in #236
  • fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241
  • feat(sandbox): add chartkit charting library by @samcm in #246
  • feat(storage): scope uploads by session and return host path by @samcm in #248
  • fix: It should use chartkits for the actual charting? by @samcm in #249
  • fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
  • feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
  • docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
  • feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
  • [codex] embed network specs in resources by @parithosh in #256
  • feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
  • search: replace inline content with ref URIs + panda read command by @qu0b in #255

Full Changelog: v0.36.0...v0.37.0

🐼 Release qualification

Full eval all files: 37 questions, 216 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit ace9f8e.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.37.0 (this release) | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |

Tokens per correct run: p50 19,116 · p90 32,706 · p99 57,458 · 23,606 per solve (all spend / correct answers).

Efficiency vs v0.37.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.234 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.37.0-rc.0

question v0.37.0-rc.0 v0.37.0
blob_count_stats 6/6 5/6 🔻 regressed
cbt_table_bounds 3/6 4/6 🟢 improved
chartkit_default_entity_bar 5/6 6/6 🟢 improved
devnet_error_logs 6/6 5/6 🔻 regressed
parent_distance_orphans 4/6 5/6 🟢 improved
session_data_analysis 5/6 6/6 🟢 improved
testnet_status 5/6 6/6 🟢 improved
validator_count 5/6 6/6 🟢 improved

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| blobs | 4 | 18/24 (75%) | 28,362 |
| timing | 9 | 47/54 (87%) | 20,283 |
| execution | 6 | 30/33 (91%) | 23,764 |
| networks | 8 | 42/45 (93%) | 17,644 |
| clickhouse | 27 | 146/156 (94%) | 19,286 |
| visualization | 7 | 40/42 (95%) | 22,260 |
| blocks | 14 | 81/84 (96%) | 19,364 |

Failed runs

question failed sample grader reason
au_block_arrivals_chart 2/6 The agent correctly filtered the data for Australian observers on 8 June 2026, built a chart, and provided the storage URL. However, the age
blob_count_stats 1/6 The agent reported an average of 3.13 and a median (P50) of 2. While the average is within the 40% tolerance of the expected ~4.7, the repor
cbt_table_bounds 2/6 The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds
devnet_error_logs 1/6 The agent discovered active devnets and attempted to query the OTel logs in ClickHouse, but the query failed with a primary key error. The o
engine_getblobs_blob_scaling 1/6 The output fails to provide concrete per-client numbers for all execution clients, only giving a vague bound ('sub-50ms') for the fastest ti
engine_newpayload_el_comparison 1/6 The agent correctly reports per-execution-client durations from a real query, with values within the acceptable 2x range and correct fastest
late_data_columns 4/6 string
mempool_observer_coverage 1/6 The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric, even though the total
parent_distance_orphans 1/6 The agent failed to report the full parent distance distribution, completely missing the ~221,000 blocks at distance 1 (which make up 99.6%

🔭 Langfuse traces (216 runs; ⚠️ = failed)

v0.37.0-rc.0

Pre-release

@samcm samcm released this 29 Jun 04:12
ace9f8e

What's Changed

  • feat(storage): scope uploads by session and return host path by @samcm in #248
  • fix: It should use chartkits for the actual charting? by @samcm in #249
  • fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
  • feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
  • docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
  • feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
  • [codex] embed network specs in resources by @parithosh in #256
  • feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
  • search: replace inline content with ref URIs + panda read command by @qu0b in #255

Full Changelog: v0.36.2-rc.0...v0.37.0-rc.0

🐼 Release qualification

Full eval all files: 37 questions, 216 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit ace9f8e.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.37.0-rc.0 (this release) | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |

Tokens per correct run: p50 19,292 · p90 36,003 · p99 65,429 · 25,122 per solve (all spend / correct answers).

Efficiency vs v0.36.2-rc.0: 0.223 — the harden-loop score with the previous release as the frozen token reference. 0.229 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.2-rc.0

question v0.36.2-rc.0 v0.37.0-rc.0
attestation_volume_analysis 5/6 6/6 🟢 improved
au_block_arrivals_chart 5/6 4/6 🔻 regressed
chartkit_default_arrival_distribution 5/6 6/6 🟢 improved
chartkit_default_entity_bar 6/6 5/6 🔻 regressed
engine_getblobs_blob_scaling 4/6 5/6 🟢 improved
engine_newpayload_el_comparison 6/6 5/6 🔻 regressed
late_data_columns 6/6 2/6 🔻 regressed
mempool_observer_coverage 3/6 5/6 🟢 improved
parent_distance_orphans 5/6 4/6 🔻 regressed
session_data_analysis 4/6 5/6 🟢 improved
storage_upload_session_scoped 6/6 🆕 new question
testnet_status 4/6 5/6 🟢 improved
validator_count 6/6 5/6 🔻 regressed

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| blobs | 4 | 19/24 (79%) | 30,196 |
| timing | 9 | 47/54 (87%) | 22,314 |
| visualization | 7 | 38/42 (90%) | 22,744 |
| execution | 6 | 30/33 (91%) | 21,710 |
| networks | 8 | 41/45 (91%) | 16,197 |
| clickhouse | 27 | 144/156 (92%) | 19,976 |
| blocks | 14 | 78/84 (93%) | 21,088 |
| multi_step | 6 | 34/36 (94%) | 22,924 |
| validators | 4 | 23/24 (96%) | 17,916 |

Failed runs

question failed sample grader reason
au_block_arrivals_chart 2/6 The agent successfully geo-filters the data for Australian observers on June 8, 2026, builds a chart, and provides the storage URL. However,
cbt_table_bounds 3/6 The agent performed a full-table MIN/MAX scan on the fct_block table (SELECT min(...) ... FROM mainnet.fct_block) instead of sourcing th
chartkit_default_entity_bar 1/6 string
engine_getblobs_blob_scaling 1/6 The output provides concrete per-client durations from a real query and correctly preserves the fast/slow client grouping. However, the abso
engine_newpayload_el_comparison 1/6 The output fails to conclude that everything sat far below the 12s slot, as it explicitly states that Erigon's max of 7.8s (65% of the slot)
late_data_columns 4/6 The agent executed a real query over the raw libp2p gossipsub data-column-sidecar observations for 8 June 2026 with a >5-minute (300,000 ms)
mempool_observer_coverage 1/6 The agent reported 1 distinct client and 53,148,763 observations. The rubric requires the number of distinct clients to be between 19 and 23
parent_distance_orphans 2/6 The agent failed to report the distribution of parent distances (distance 1, 2, and 3+). It only provided the aggregate count for blocks wit
session_data_analysis 1/6 string
testnet_status 1/6 The output provides concrete participation and block production figures from real queries, but it fails to explicitly report the finalized e
validator_count 1/6 The output reports 200,115,277 active validators, which is significantly outside the accepted range of 870k-910k specified in the rubric.

🔭 Langfuse traces (216 runs; ⚠️ = failed)

v0.36.2-rc.0

Pre-release

@samcm samcm released this 23 Jun 02:53
d607b76

What's Changed

  • feat(sandbox): add chartkit charting library by @samcm in #246

Full Changelog: v0.36.1-rc.0...v0.36.2-rc.0

🐼 Release qualification

Full eval all files: 36 questions, 210 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit d607b76.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.36.2-rc.0 (this release) | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
| v0.33.0-rc.5 | 90% (179/198) | 0.226 | 23,285 |

Tokens per correct run: p50 18,177 · p90 32,718 · p99 71,544 · 24,618 per solve (all spend / correct answers).

Efficiency vs v0.36.0: 0.241 — the harden-loop score with the previous release as the frozen token reference. 0.231 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.0

question v0.36.0 v0.36.2-rc.0
au_block_arrivals_chart 4/6 5/6 🟢 improved
blob_count_stats 5/6 6/6 🟢 improved
block_time_histogram 3/6 6/6 🟢 improved
block_timing_analysis 5/6 6/6 🟢 improved
cbt_table_bounds 4/6 3/6 🔻 regressed
chartkit_default_arrival_distribution 5/6 🆕 new question
chartkit_default_entity_bar 6/6 🆕 new question
engine_getblobs_blob_scaling 6/6 4/6 🔻 regressed
engine_newpayload_el_comparison 4/6 6/6 🟢 improved
late_data_columns 1/6 6/6 🟢 improved
mempool_observer_coverage 6/6 3/6 🔻 regressed
mev_timing_games 5/6 6/6 🟢 improved
parent_distance_orphans 4/6 5/6 🟢 improved
testnet_status 6/6 4/6 🔻 regressed

Categories with failures

| category | questions | pass rate | median tokens (correct) |
| --- | --- | --- | --- |
| execution | 6 | 28/33 (85%) | 20,060 |
| multi_step | 5 | 26/30 (87%) | 24,364 |
| visualization | 7 | 37/42 (88%) | 21,698 |
| networks | 8 | 40/45 (89%) | 16,686 |
| blobs | 4 | 22/24 (92%) | 17,434 |
| blocks | 14 | 79/84 (94%) | 19,552 |
| clickhouse | 27 | 147/156 (94%) | 18,496 |
| attestations | 3 | 17/18 (94%) | 17,408 |
| timing | 9 | 52/54 (96%) | 20,670 |
| smoke | 7 | 41/42 (98%) | 14,943 |

Failed runs

question failed sample grader reason
attestation_volume_analysis 1/6 no reason recorded
au_block_arrivals_chart 1/6 The agent correctly filtered the data for Australian observers on 8 June 2026 and generated a chart with a valid URL. However, the agent's t
cbt_table_bounds 3/6 The agent used a full-table MIN/MAX scan via a ClickHouse SQL query to determine the date range, rather than sourcing the answer from the pi
chartkit_default_arrival_distribution 1/6 string
engine_getblobs_blob_scaling 2/6 The agent provided average durations instead of the medians requested by the rubric. Furthermore, the values it provided for the slowest cli
mempool_observer_coverage 3/6 The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric. Although the agent pe
parent_distance_orphans 1/6 The output provides the total number of blocks with a parent distance greater than 1 and their survival/orphan outcomes, but it fails to rep
session_data_analysis 2/6 The agent failed to produce a real histogram image artifact. Due to the charting library (chartkit) running out of memory (OOM), the agent f
testnet_status 2/6 The output provides slot numbers, block production metrics, and missed slot rates, but it completely omits the current epoch and finalized e

🔭 Langfuse traces (209 runs; ⚠️ = failed)

v0.36.1-rc.0

Pre-release

@samcm samcm released this 22 Jun 05:26
7e4e398

What's Changed

  • test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
  • feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
  • fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
  • ci(release): build and push sandbox image before goreleaser by @samcm in #236
  • fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241

Full Changelog: v0.36.0...v0.36.1-rc.0

v0.36.0

@samcm samcm released this 19 Jun 05:45
3d7dfaa

What's Changed

  • feat(search): re-index on proxy embedding model change by @samcm in #233
  • fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235

Full Changelog: v0.35.0...v0.36.0

🐼 Release qualification

Full eval all files: 34 questions, 198 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.

| release | pass rate | mean score | mean tokens (correct) |
| --- | --- | --- | --- |
| v0.36.0 (this release) | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
| v0.33.0-rc.5 | 90% (179/198) | 0.226 | 23,285 |
| v0.33.0-rc.4 | 91% (180/198) | 0.223 | 22,208 |

Tokens per correct run: p50 19,586 · p90 41,017 · p99 93,631 · 27,560 per solve (all spend / correct answers).

Efficiency vs v0.36.0-rc.0: 0.229 — the harden-loop score with the previous release as the frozen token reference. 0.222 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.0-rc.0

question v0.36.0-rc.0 v0.36.0
attestation_volume_analysis 3/6 5/6 🟢 improved
au_block_arrivals_chart 5/6 4/6 🔻 regressed
blob_count_stats 6/6 5/6 🔻 regressed
block_timing_analysis 4/6 5/6 🟢 improved
engine_getblobs_blob_scaling 5/6 6/6 🟢 improved
engine_newpayload_el_comparison 6/6 4/6 🔻 regressed
late_data_columns 0/6 1/6 🟢 improved
mev_relay_share 5/6 6/6 🟢 improved
mev_timing_games 6/6 5/6 🔻 regressed
parent_distance_orphans 6/6 4/6 🔻 regressed
session_data_analysis 3/6 4/6 🟢 improved
testnet_status 3/6 6/6 🟢 improved

Categories with failures

category questions pass rate median tokens (correct)
visualization 5 21/30 (70%) 25,212
timing 8 34/48 (71%) 24,668
blobs 4 18/24 (75%) 31,129
multi_step 3 14/18 (78%) 30,007
blocks 12 61/72 (85%) 20,684
clickhouse 26 135/150 (90%) 20,152
execution 6 31/33 (94%) 24,714
attestations 3 17/18 (94%) 19,453
networks 8 43/45 (96%) 16,224
mev 4 23/24 (96%) 20,023
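
The question counts in this table add up to 79 across ten categories, while the eval has 34 questions, so a question evidently counts toward more than one category. A minimal Python sketch of such a rollup follows. It assumes every question carries a list of category tags. The tag assignments in the sample are hypothetical, and the function is an illustration rather than the harness's own code.

```python
from collections import defaultdict

def category_pass_rates(results, tags):
    """results: question -> (passed, total); tags: question -> list of categories (hypothetical)."""
    totals = defaultdict(lambda: [0, 0, 0])  # [questions, passed runs, total runs]
    for question, (passed, total) in results.items():
        for category in tags.get(question, []):
            row = totals[category]
            row[0] += 1
            row[1] += passed
            row[2] += total
    rows = [
        (category, questions, passed, runs, round(100 * passed / runs))
        for category, (questions, passed, runs) in totals.items()
        if passed < runs  # keep only categories with at least one failed run
    ]
    # Worst pass rate first, matching the ordering of the table.
    return sorted(rows, key=lambda r: r[2] / r[3])

# Pass counts come from this release; the tag assignments are made up for illustration.
results = {"late_data_columns": (1, 6), "blob_count_stats": (5, 6), "mev_relay_share": (6, 6)}
tags = {
    "late_data_columns": ["blobs", "timing"],
    "blob_count_stats": ["blobs"],
    "mev_relay_share": ["mev"],
}
rollup = category_pass_rates(results, tags)
```

Because a multi-tagged question contributes its runs to each of its categories, the per-category run counts do not sum to the 198 total runs.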

Failed runs

question failed sample grader reason
attestation_volume_analysis 1/6 The agent successfully retrieved the correct attestation event counts for the specified date range, maintained the correct network ordering
au_block_arrivals_chart 2/6 The output fails on two main criteria. First, it does not provide a storage URL for the uploaded image, only stating the file is 'in the wor
blob_count_stats 1/6 The agent reports a median of 2, which is outside the ±40% tolerance of the expected median of 5 (acceptable range: 3 to 7). Although the mi
block_time_histogram 3/6 The output provides a local file path (/tmp/block_arrival_histogram.png) instead of a storage URL for the uploaded image, failing the explic
block_timing_analysis 1/6 The output fails to provide a storage URL for the chart image, which is explicitly required by the rubric. The agent only provided a text de
cbt_table_bounds 2/6 The agent used a full-table MIN/MAX scan (SELECT min(...) max(...) FROM mainnet.fct_block) to find the date range, rather than sourcing the
engine_newpayload_el_comparison 2/6 string
late_data_columns 5/6 The agent's answer reports 46,440 observations and 122 distinct peers. This falls outside the accepted range specified in the rubric, which
mev_timing_games 1/6 no reason recorded
parent_distance_orphans 2/6 The agent used a canonical-only analysis (querying canonical_beacon_block) which cannot see orphaned blocks, and incorrectly concluded tha
session_data_analysis 2/6 The agent successfully queried ~1000 mainnet blocks, calculated a plausible average gas used (29,909,547), generated a histogram, and reused

eval trend

🔭 Langfuse traces (195 runs; ⚠️ = failed)

v0.36.0-rc.0

v0.36.0-rc.0 Pre-release


@samcm samcm released this 19 Jun 04:43
3d7dfaa

What's Changed

  • feat(search): re-index on proxy embedding model change by @samcm in #233
  • fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235

Full Changelog: v0.35.0...v0.36.0-rc.0

🐼 Release qualification

Full eval all files: 34 questions, 198 runs (phrasings × repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.

release pass rate mean score mean tokens (correct)
v0.36.0-rc.0 (this release) 87% (173/198) 0.219 24,067
v0.35.0 87% (173/198) 0.216 22,956
v0.35.0-rc.0 88% (174/198) 0.219 22,182
v0.34.0 85% (169/198) 0.209 23,915
v0.34.0-rc.0 84% (167/198) 0.209 22,725
v0.33.0 87% (173/198) 0.214 21,847
v0.33.0-rc.5 90% (179/198) 0.226 23,285
v0.33.0-rc.4 91% (180/198) 0.223 22,208
v0.33.0-rc.3 87% (173/198) 0.217 22,738

Tokens per correct run: p50 19,156 · p90 38,756 · p99 91,973 · 28,620 per solve (all spend / correct answers).

Efficiency vs v0.35.0: 0.220. This is the harden-loop score computed with the previous release as the frozen token reference. A score of 0.218 would correspond to this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.35.0

question v0.35.0 v0.36.0-rc.0
au_block_arrivals_chart 4/6 5/6 🟢 improved
block_time_histogram 4/6 3/6 🔻 regressed
cbt_table_bounds 6/6 4/6 🔻 regressed
engine_getblobs_blob_scaling 6/6 5/6 🔻 regressed
engine_newpayload_el_comparison 5/6 6/6 🟢 improved
mempool_observer_coverage 3/6 6/6 🟢 improved
parent_distance_orphans 4/6 6/6 🟢 improved
session_data_analysis 4/6 3/6 🔻 regressed
testnet_status 5/6 3/6 🔻 regressed

Categories with failures

category questions pass rate median tokens (correct)
multi_step 3 10/18 (56%) 26,949
visualization 5 18/30 (60%) 26,880
blobs 4 17/24 (71%) 35,660
timing 8 36/48 (75%) 24,830
attestations 3 15/18 (83%) 18,904
blocks 12 62/72 (86%) 19,886
networks 8 40/45 (89%) 16,563
clickhouse 26 134/150 (89%) 20,360
mev 4 23/24 (96%) 19,467
execution 6 32/33 (97%) 25,332

Failed runs

question failed sample grader reason
attestation_volume_analysis 3/6 The agent correctly calculated the attestation events with the expected magnitudes and network ordering. However, it failed two key requirem
au_block_arrivals_chart 1/6 The agent successfully filtered the data for Australian observers on 8 June 2026 and generated a chart. However, it did not provide a storag
block_time_histogram 3/6 The output provides an ASCII text-based histogram and summary statistics, but it does not generate, upload, or provide a URL to an actual im
block_timing_analysis 2/6 The output fails to provide a storage URL for the chart image, only providing a local file path (/tmp/block_arrival_boxplot.png). According
cbt_table_bounds 2/6 The agent used a full-table MIN/MAX scan with a broad WHERE clause to bypass the primary key error, rather than sourcing the answer from the
engine_getblobs_blob_scaling 1/6 The agent's final answer only describes the query it executed but fails to provide the actual results, concrete per-client numbers, or the a
late_data_columns 6/6 The agent's final answer provides 44,700 observations and 122 distinct peers. While this is grounded in a real query over the raw libp2p gos
mev_relay_share 1/6 The agent reported 100%, which falls outside the 85-95% band specified in the rubric. The agent's query incorrectly filtered the denominator
session_data_analysis 3/6 The agent successfully queries ~1000 recent mainnet blocks, computes a plausible average gas used per block (~30.4M), and produces an ASCII
testnet_status 3/6 The output reports the current epoch and block production figures, but explicitly states that the finalized epoch is unavailable and therefo

eval trend

🔭 Langfuse traces (198 runs; ⚠️ = failed)