Releases: ethpandaops/panda
v0.38.0-rc.3
What's Changed
Full Changelog: v0.38.0-rc.2...v0.38.0-rc.3
🐼 Release qualification
Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit 1bbe2d3.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.38.0-rc.3 (this release) | 0% (0/216) | 0.000 | 0 |
| v0.38.0-rc.1 | 94% (204/216) | 0.232 | 22,867 |
| v0.38.0-rc.0 | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
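The pass-rate column can be re-derived from the raw counts in parentheses. A minimal sketch, assuming the percentage is simply passed/total rounded to the nearest whole percent (the function name `pass_rate` is hypothetical and not part of panda's eval tooling):

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a pass rate like the table cells, e.g. '94% (204/216)'.

    Assumption: the percentage is rounded to the nearest integer.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = round(100 * passed / total)
    return f"{pct}% ({passed}/{total})"


# Counts copied from the release table above.
rows = {
    "v0.38.0-rc.3": (0, 216),
    "v0.38.0-rc.1": (204, 216),
    "v0.36.0": (176, 198),
}

for release, (passed, total) in rows.items():
    print(release, pass_rate(passed, total))
```

Under that rounding assumption the sketch reproduces the cells shown for these three releases.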
Tokens per correct run: p50 0 · p90 0 · p99 0 · 0 per solve (all spend / correct answers).
Efficiency vs v0.38.0-rc.1: 0.000 — the harden-loop score with the previous release as the frozen token reference. 0.000 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.38.0-rc.1
| question | v0.38.0-rc.1 | v0.38.0-rc.3 | change |
|---|---|---|---|
| attestation_volume_analysis | 6/6 | 0/6 | 🔻 regressed |
| au_block_arrivals_chart | 3/6 | 0/6 | 🔻 regressed |
| blob_count_stats | 6/6 | 0/6 | 🔻 regressed |
| block_count_24h | 6/6 | 0/6 | 🔻 regressed |
| block_size_vs_propagation | 6/6 | 0/6 | 🔻 regressed |
| block_time_histogram | 6/6 | 0/6 | 🔻 regressed |
| block_timing_analysis | 5/6 | 0/6 | 🔻 regressed |
| cbt_table_bounds | 6/6 | 0/6 | 🔻 regressed |
| chartkit_default_arrival_distribution | 6/6 | 0/6 | 🔻 regressed |
| chartkit_default_entity_bar | 6/6 | 0/6 | 🔻 regressed |
| dec4_relay_circuit_breaker | 6/6 | 0/6 | 🔻 regressed |
| devnet_error_logs | 6/6 | 0/6 | 🔻 regressed |
| devnets_running | 6/6 | 0/6 | 🔻 regressed |
| empty_blob_submitters | 6/6 | 0/6 | 🔻 regressed |
| engine_getblobs_blob_scaling | 5/6 | 0/6 | 🔻 regressed |
| engine_newpayload_el_comparison | 5/6 | 0/6 | 🔻 regressed |
| forky_node_coverage | 6/6 | 0/6 | 🔻 regressed |
| head_vote_correctness_24h | 6/6 | 0/6 | 🔻 regressed |
| late_data_columns | 3/6 | 0/6 | 🔻 regressed |
| list_datasources | 6/6 | 0/6 | 🔻 regressed |
| mainnet_average_base_fee_per_gas_jun_10_2026 | 3/3 | 0/3 | 🔻 regressed |
| mainnet_block_arrival_p50 | 6/6 | 0/6 | 🔻 regressed |
| mainnet_xatu_top_peer_count_node | 3/3 | 0/3 | 🔻 regressed |
| mempool_observer_coverage | 5/6 | 0/6 | 🔻 regressed |
| mev_max_value_block | 6/6 | 0/6 | 🔻 regressed |
| mev_relay_share | 6/6 | 0/6 | 🔻 regressed |
| mev_timing_games | 6/6 | 0/6 | 🔻 regressed |
| missed_slots_24h | 6/6 | 0/6 | 🔻 regressed |
| parent_distance_orphans | 6/6 | 0/6 | 🔻 regressed |
| precompile_usage | 6/6 | 0/6 | 🔻 regressed |
| sepolia_validator_lookup | 6/6 | 0/6 | 🔻 regressed |
| session_data_analysis | 6/6 | 0/6 | 🔻 regressed |
| storage_upload_session_scoped | 6/6 | 0/6 | 🔻 regressed |
| testnet_status | 4/6 | 0/6 | 🔻 regressed |
| tracoor_node_coverage | 6/6 | 0/6 | 🔻 regressed |
| validator_count | 6/6 | 0/6 | 🔻 regressed |
| validator_head_accuracy | 6/6 | 0/6 | 🔻 regressed |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| attestations | 3 | 0/18 (0%) | 0 |
| blobs | 4 | 0/24 (0%) | 0 |
| blocks | 14 | 0/84 (0%) | 0 |
| clickhouse | 27 | 0/156 (0%) | 0 |
| execution | 6 | 0/33 (0%) | 0 |
| mev | 4 | 0/24 (0%) | 0 |
| multi_step | 6 | 0/36 (0%) | 0 |
| networks | 8 | 0/45 (0%) | 0 |
| smoke | 8 | 0/48 (0%) | 0 |
| storage | 1 | 0/6 (0%) | 0 |
| timing | 9 | 0/54 (0%) | 0 |
| validators | 4 | 0/24 (0%) | 0 |
| visualization | 7 | 0/42 (0%) | 0 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| attestation_volume_analysis | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| au_block_arrivals_chart | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| blob_count_stats | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| block_count_24h | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| block_size_vs_propagation | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| block_time_histogram | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| block_timing_analysis | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| cbt_table_bounds | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| chartkit_default_arrival_distribution | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| chartkit_default_entity_bar | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| dec4_relay_circuit_breaker | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| devnet_error_logs | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| devnets_running | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| empty_blob_submitters | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| engine_getblobs_blob_scaling | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| engine_newpayload_el_comparison | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| forky_node_coverage | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| head_vote_correctness_24h | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| late_data_columns | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| list_datasources | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mainnet_average_base_fee_per_gas_jun_10_2026 | 3/3 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mainnet_block_arrival_p50 | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mainnet_xatu_top_peer_count_node | 3/3 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mempool_observer_coverage | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mev_max_value_block | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mev_relay_share | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| mev_timing_games | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| missed_slots_24h | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| parent_distance_orphans | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| precompile_usage | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| sepolia_validator_lookup | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| session_data_analysis | 6/6 | API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv |
| `storage_uplo... |
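Every sample reason visible in the failed-runs table above is a provider-side 503, not a rubric failure. When reading a report like this, it can help to separate infrastructure errors from agent failures before comparing pass rates. A minimal sketch, assuming infrastructure failures can be recognised by the `API error:` / `API call error:` prefixes that appear in these release notes; the `triage` helper is hypothetical and not part of the eval harness:

```python
# Prefixes seen on provider/transport failures in the grader reasons.
# Assumption: agent-side failures never start with these strings.
INFRA_PREFIXES = ("API error:", "API call error:")


def triage(reasons):
    """Count grader reasons by rough origin: infrastructure vs agent."""
    counts = {"infrastructure": 0, "agent": 0}
    for reason in reasons:
        kind = "infrastructure" if reason.startswith(INFRA_PREFIXES) else "agent"
        counts[kind] += 1
    return counts


# Abbreviated sample reasons: the first is from the v0.38.0-rc.3 table,
# the second from the v0.38.0-rc.1 table.
samples = [
    "API error: 503 Service Unavailable",
    "The agent reported 4 distinct execution clients",
]
print(triage(samples))
```

A release where nearly all failures land in the `infrastructure` bucket says little about the agent itself, so a re-run would be needed before drawing conclusions.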
v0.38.0-rc.2
What's Changed
- feat(compute): human-readable default output + list filtering by @samcm in #267
- feat(compute): add snapshots promote command by @samcm in #271
- fix(cli): datasource-aware error hints to reduce avoidable CLI errors by @qu0b in #130
- feat(runbooks): add devnet consensus bug report runbook by @qu0b in #273
- Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding by @Savid in #274
Full Changelog: v0.38.0-rc.1...v0.38.0-rc.2
v0.38.0-rc.1
Changelog
🐼 Release qualification
Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit f2f4716.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.38.0-rc.1 (this release) | 94% (204/216) | 0.232 | 22,867 |
| v0.38.0-rc.0 | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
Tokens per correct run: p50 19,303 · p90 34,731 · p99 72,288 · 24,424 per solve (all spend / correct answers).
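The "per solve" figure divides all token spend, including failed runs, by the number of correct answers, which is why it sits above the mean for correct runs alone. A minimal sketch of that arithmetic on made-up run data; the nearest-rank percentile method, the zero fallback for releases with no correct runs, and the names `nearest_rank` and `token_summary` are assumptions for illustration, not the report generator's actual code:

```python
import math


def nearest_rank(values, q):
    """Nearest-rank percentile (assumed method; the report may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def token_summary(runs):
    """runs: iterable of (correct: bool, tokens: int) pairs."""
    runs = list(runs)
    correct = [tokens for ok, tokens in runs if ok]
    if not correct:
        # Assumed fallback: report zeros when nothing was solved.
        return {"p50": 0, "p90": 0, "p99": 0, "per_solve": 0}
    total_spend = sum(tokens for _, tokens in runs)  # failed runs count too
    return {
        "p50": nearest_rank(correct, 50),
        "p90": nearest_rank(correct, 90),
        "p99": nearest_rank(correct, 99),
        "per_solve": total_spend / len(correct),
    }


# Hypothetical data: three correct runs and one failed run.
example_runs = [(True, 12_000), (True, 20_000), (False, 30_000), (True, 40_000)]
summary = token_summary(example_runs)
print(summary)
```

In this toy data the correct runs average 24,000 tokens, while the per-solve cost is 34,000 because the failed run's 30,000 tokens are charged to the three solves.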
Efficiency vs v0.38.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.236 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.38.0-rc.0
| question | v0.38.0-rc.0 | v0.38.0-rc.1 | change |
|---|---|---|---|
| au_block_arrivals_chart | 5/6 | 3/6 | 🔻 regressed |
| block_timing_analysis | 6/6 | 5/6 | 🔻 regressed |
| cbt_table_bounds | 4/6 | 6/6 | 🟢 improved |
| chartkit_default_entity_bar | 5/6 | 6/6 | 🟢 improved |
| engine_getblobs_blob_scaling | 6/6 | 5/6 | 🔻 regressed |
| engine_newpayload_el_comparison | 6/6 | 5/6 | 🔻 regressed |
| late_data_columns | 2/6 | 3/6 | 🟢 improved |
| parent_distance_orphans | 5/6 | 6/6 | 🟢 improved |
| testnet_status | 3/6 | 4/6 | 🟢 improved |
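The change labels in the per-question table follow from comparing the two pass-count cells. A minimal sketch of that comparison, assuming cells look like `5/6` and that a dash marks a question absent from the previous release (as in the "new question" rows of older releases further down); `parse_cell` and `classify` are hypothetical names:

```python
from fractions import Fraction


def parse_cell(cell):
    """Turn '5/6' into Fraction(5, 6); a dash or blank means no data."""
    cell = cell.strip()
    if cell in ("—", "-", ""):
        return None
    passed, total = cell.split("/")
    return Fraction(int(passed), int(total))


def classify(previous, current):
    """Label a question's change between two releases."""
    prev, curr = parse_cell(previous), parse_cell(current)
    if curr is None:
        raise ValueError("current release has no result for this question")
    if prev is None:
        return "new question"
    if curr > prev:
        return "improved"
    if curr < prev:
        return "regressed"
    return "unchanged"  # such rows are not listed in the release notes


print(classify("5/6", "3/6"))  # au_block_arrivals_chart
print(classify("4/6", "6/6"))  # cbt_table_bounds
```

Comparing fractions and not raw counts keeps the label meaningful if the number of runs per question ever differs between releases.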
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| blobs | 4 | 20/24 (83%) | 25,612 |
| timing | 9 | 46/54 (85%) | 22,610 |
| visualization | 7 | 38/42 (90%) | 23,413 |
| execution | 6 | 30/33 (91%) | 26,034 |
| blocks | 14 | 80/84 (95%) | 20,640 |
| clickhouse | 27 | 149/156 (96%) | 19,788 |
| networks | 8 | 43/45 (96%) | 16,263 |
| multi_step | 6 | 35/36 (97%) | 23,743 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| au_block_arrivals_chart | 3/6 | API error: 400 Bad Request {"error":{"message":"Error from provider (Alibaba): <400> InternalError.Algo.DataInspectionFailed: Input text dat |
| block_timing_analysis | 1/6 | The agent failed to re-use the session to maintain persistence between turns. It repeatedly attempted to use a non-existent session name ('b |
| engine_getblobs_blob_scaling | 1/6 | API call error: Error: Error parsing response from https://opencode.ai/zen/go/v1/chat/completions: Unexpected end of JSON input. HTTP 503 Se |
| engine_newpayload_el_comparison | 1/6 | The output fails the criterion to conclude that everything sat far below the 12s slot. Instead, it explicitly states that erigon's max durat |
| late_data_columns | 3/6 | string |
| mempool_observer_coverage | 1/6 | The agent reported 4 distinct execution clients, which falls outside the accepted range of 19-23 distinct observing clients required by the |
| testnet_status | 2/6 | The output provides block-production and participation figures derived from real queries, but it completely omits the current epoch and fina |
🔭 Langfuse traces (216 runs; ⚠️ = failed)
v0.38.0-rc.0
Changelog
- bc33ee9 feat(compute): add compute datasource for ephemeral sandboxes (#265)
- 3c3fa2f feat(resources): resolve the right resource path on a read miss (#261)
- 695c89c fix(search): activate semantic search on proxy discovery, not only at startup (#260)
- d388d4a release-eval: serve release reports via the race-proof viewer (#259)
🐼 Release qualification
Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit bc33ee9.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.38.0-rc.0 (this release) | 94% (203/216) | 0.236 | 23,507 |
| v0.37.0 | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
Tokens per correct run: p50 19,454 · p90 36,674 · p99 76,701 · 25,016 per solve (all spend / correct answers).
Efficiency vs v0.37.0: 0.224 — the harden-loop score with the previous release as the frozen token reference. 0.235 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.37.0
| question | v0.37.0 | v0.38.0-rc.0 | change |
|---|---|---|---|
| au_block_arrivals_chart | 4/6 | 5/6 | 🟢 improved |
| blob_count_stats | 5/6 | 6/6 | 🟢 improved |
| chartkit_default_entity_bar | 6/6 | 5/6 | 🔻 regressed |
| devnet_error_logs | 5/6 | 6/6 | 🟢 improved |
| engine_getblobs_blob_scaling | 5/6 | 6/6 | 🟢 improved |
| engine_newpayload_el_comparison | 5/6 | 6/6 | 🟢 improved |
| testnet_status | 6/6 | 3/6 | 🔻 regressed |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| blobs | 4 | 20/24 (83%) | 31,030 |
| networks | 8 | 40/45 (89%) | 15,956 |
| timing | 9 | 49/54 (91%) | 22,714 |
| visualization | 7 | 40/42 (95%) | 23,132 |
| clickhouse | 27 | 149/156 (96%) | 20,830 |
| blocks | 14 | 81/84 (96%) | 20,879 |
| execution | 6 | 32/33 (97%) | 25,844 |
| multi_step | 6 | 35/36 (97%) | 25,040 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| au_block_arrivals_chart | 1/6 | The output successfully filters the data for Australian observers, builds a chart, and provides the storage URL. However, it fails the speci |
| cbt_table_bounds | 2/6 | The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds |
| chartkit_default_entity_bar | 1/6 | The agent failed to re-use the panda execute session across turns. It used session 169c33497071 in Turn 1 but created a new session 29a67bde |
| late_data_columns | 4/6 | The agent's query returned 88,240 late observations, which falls outside the accepted range of roughly 35,000 to 55,000 observations specifi |
| mempool_observer_coverage | 1/6 | The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric. |
| parent_distance_orphans | 1/6 | The agent failed to report the full distribution of parent distances (distance 1, 2, and 3+). Its query lumped all distances > 1 together, a |
| testnet_status | 3/6 | The output reports the current epoch (105,673) and recent block-production figures, but it completely omits the finalized epoch, which is ex |
🔭 Langfuse traces (216 runs; ⚠️ = failed)
v0.37.0
What's Changed
- test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
- feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
- fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
- ci(release): build and push sandbox image before goreleaser by @samcm in #236
- fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241
- feat(sandbox): add chartkit charting library by @samcm in #246
- feat(storage): scope uploads by session and return host path by @samcm in #248
- fix: It should use chartkits for the actual charting? by @samcm in #249
- fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
- feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
- docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
- feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
- [codex] embed network specs in resources by @parithosh in #256
- feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
- search: replace inline content with ref URIs + panda read command by @qu0b in #255
Full Changelog: v0.36.0...v0.37.0
🐼 Release qualification
Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit ace9f8e.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.37.0 (this release) | 94% (202/216) | 0.232 | 22,162 |
| v0.37.0-rc.0 | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
Tokens per correct run: p50 19,116 · p90 32,706 · p99 57,458 · 23,606 per solve (all spend / correct answers).
Efficiency vs v0.37.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.234 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.37.0-rc.0
| question | v0.37.0-rc.0 | v0.37.0 | change |
|---|---|---|---|
| blob_count_stats | 6/6 | 5/6 | 🔻 regressed |
| cbt_table_bounds | 3/6 | 4/6 | 🟢 improved |
| chartkit_default_entity_bar | 5/6 | 6/6 | 🟢 improved |
| devnet_error_logs | 6/6 | 5/6 | 🔻 regressed |
| parent_distance_orphans | 4/6 | 5/6 | 🟢 improved |
| session_data_analysis | 5/6 | 6/6 | 🟢 improved |
| testnet_status | 5/6 | 6/6 | 🟢 improved |
| validator_count | 5/6 | 6/6 | 🟢 improved |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| blobs | 4 | 18/24 (75%) | 28,362 |
| timing | 9 | 47/54 (87%) | 20,283 |
| execution | 6 | 30/33 (91%) | 23,764 |
| networks | 8 | 42/45 (93%) | 17,644 |
| clickhouse | 27 | 146/156 (94%) | 19,286 |
| visualization | 7 | 40/42 (95%) | 22,260 |
| blocks | 14 | 81/84 (96%) | 19,364 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| au_block_arrivals_chart | 2/6 | The agent correctly filtered the data for Australian observers on 8 June 2026, built a chart, and provided the storage URL. However, the age |
| blob_count_stats | 1/6 | The agent reported an average of 3.13 and a median (P50) of 2. While the average is within the 40% tolerance of the expected ~4.7, the repor |
| cbt_table_bounds | 2/6 | The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds |
| devnet_error_logs | 1/6 | The agent discovered active devnets and attempted to query the OTel logs in ClickHouse, but the query failed with a primary key error. The o |
| engine_getblobs_blob_scaling | 1/6 | The output fails to provide concrete per-client numbers for all execution clients, only giving a vague bound ('sub-50ms') for the fastest ti |
| engine_newpayload_el_comparison | 1/6 | The agent correctly reports per-execution-client durations from a real query, with values within the acceptable 2x range and correct fastest |
| late_data_columns | 4/6 | string |
| mempool_observer_coverage | 1/6 | The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric, even though the total |
| parent_distance_orphans | 1/6 | The agent failed to report the full parent distance distribution, completely missing the ~221,000 blocks at distance 1 (which make up 99.6% |
🔭 Langfuse traces (216 runs; ⚠️ = failed)
v0.37.0-rc.0
What's Changed
- feat(storage): scope uploads by session and return host path by @samcm in #248
- fix: It should use chartkits for the actual charting? by @samcm in #249
- fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
- feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
- docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
- feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
- [codex] embed network specs in resources by @parithosh in #256
- feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
- search: replace inline content with ref URIs + panda read command by @qu0b in #255
Full Changelog: v0.36.2-rc.0...v0.37.0-rc.0
🐼 Release qualification
Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit ace9f8e.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.37.0-rc.0 (this release) | 92% (198/216) | 0.227 | 23,260 |
| v0.36.2-rc.0 | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
Tokens per correct run: p50 19,292 · p90 36,003 · p99 65,429 · 25,122 per solve (all spend / correct answers).
Efficiency vs v0.36.2-rc.0: 0.223 — the harden-loop score with the previous release as the frozen token reference. 0.229 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.36.2-rc.0
| question | v0.36.2-rc.0 | v0.37.0-rc.0 | change |
|---|---|---|---|
| attestation_volume_analysis | 5/6 | 6/6 | 🟢 improved |
| au_block_arrivals_chart | 5/6 | 4/6 | 🔻 regressed |
| chartkit_default_arrival_distribution | 5/6 | 6/6 | 🟢 improved |
| chartkit_default_entity_bar | 6/6 | 5/6 | 🔻 regressed |
| engine_getblobs_blob_scaling | 4/6 | 5/6 | 🟢 improved |
| engine_newpayload_el_comparison | 6/6 | 5/6 | 🔻 regressed |
| late_data_columns | 6/6 | 2/6 | 🔻 regressed |
| mempool_observer_coverage | 3/6 | 5/6 | 🟢 improved |
| parent_distance_orphans | 5/6 | 4/6 | 🔻 regressed |
| session_data_analysis | 4/6 | 5/6 | 🟢 improved |
| storage_upload_session_scoped | — | 6/6 | 🆕 new question |
| testnet_status | 4/6 | 5/6 | 🟢 improved |
| validator_count | 6/6 | 5/6 | 🔻 regressed |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| blobs | 4 | 19/24 (79%) | 30,196 |
| timing | 9 | 47/54 (87%) | 22,314 |
| visualization | 7 | 38/42 (90%) | 22,744 |
| execution | 6 | 30/33 (91%) | 21,710 |
| networks | 8 | 41/45 (91%) | 16,197 |
| clickhouse | 27 | 144/156 (92%) | 19,976 |
| blocks | 14 | 78/84 (93%) | 21,088 |
| multi_step | 6 | 34/36 (94%) | 22,924 |
| validators | 4 | 23/24 (96%) | 17,916 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| au_block_arrivals_chart | 2/6 | The agent successfully geo-filters the data for Australian observers on June 8, 2026, builds a chart, and provides the storage URL. However, |
| cbt_table_bounds | 3/6 | The agent performed a full-table MIN/MAX scan on the fct_block table (SELECT min(...) ... FROM mainnet.fct_block) instead of sourcing th |
| chartkit_default_entity_bar | 1/6 | string |
| engine_getblobs_blob_scaling | 1/6 | The output provides concrete per-client durations from a real query and correctly preserves the fast/slow client grouping. However, the abso |
| engine_newpayload_el_comparison | 1/6 | The output fails to conclude that everything sat far below the 12s slot, as it explicitly states that Erigon's max of 7.8s (65% of the slot) |
| late_data_columns | 4/6 | The agent executed a real query over the raw libp2p gossipsub data-column-sidecar observations for 8 June 2026 with a >5-minute (300,000 ms) |
| mempool_observer_coverage | 1/6 | The agent reported 1 distinct client and 53,148,763 observations. The rubric requires the number of distinct clients to be between 19 and 23 |
| parent_distance_orphans | 2/6 | The agent failed to report the distribution of parent distances (distance 1, 2, and 3+). It only provided the aggregate count for blocks wit |
| session_data_analysis | 1/6 | string |
| testnet_status | 1/6 | The output provides concrete participation and block production figures from real queries, but it fails to explicitly report the finalized e |
| validator_count | 1/6 | The output reports 200,115,277 active validators, which is significantly outside the accepted range of 870k-910k specified in the rubric. |
🔭 Langfuse traces (216 runs; ⚠️ = failed)
v0.36.2-rc.0
What's Changed
Full Changelog: v0.36.1-rc.0...v0.36.2-rc.0
🐼 Release qualification
Full eval, all files: 36 questions, 210 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit d607b76.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.36.2-rc.0 (this release) | 92% (194/210) | 0.230 | 22,363 |
| v0.36.0 | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
| v0.33.0-rc.5 | 90% (179/198) | 0.226 | 23,285 |
Tokens per correct run: p50 18,177 · p90 32,718 · p99 71,544 · 24,618 per solve (all spend / correct answers).
Efficiency vs v0.36.0: 0.241 — the harden-loop score with the previous release as the frozen token reference. 0.231 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.36.0
| question | v0.36.0 | v0.36.2-rc.0 | change |
|---|---|---|---|
| au_block_arrivals_chart | 4/6 | 5/6 | 🟢 improved |
| blob_count_stats | 5/6 | 6/6 | 🟢 improved |
| block_time_histogram | 3/6 | 6/6 | 🟢 improved |
| block_timing_analysis | 5/6 | 6/6 | 🟢 improved |
| cbt_table_bounds | 4/6 | 3/6 | 🔻 regressed |
| chartkit_default_arrival_distribution | — | 5/6 | 🆕 new question |
| chartkit_default_entity_bar | — | 6/6 | 🆕 new question |
| engine_getblobs_blob_scaling | 6/6 | 4/6 | 🔻 regressed |
| engine_newpayload_el_comparison | 4/6 | 6/6 | 🟢 improved |
| late_data_columns | 1/6 | 6/6 | 🟢 improved |
| mempool_observer_coverage | 6/6 | 3/6 | 🔻 regressed |
| mev_timing_games | 5/6 | 6/6 | 🟢 improved |
| parent_distance_orphans | 4/6 | 5/6 | 🟢 improved |
| testnet_status | 6/6 | 4/6 | 🔻 regressed |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| execution | 6 | 28/33 (85%) | 20,060 |
| multi_step | 5 | 26/30 (87%) | 24,364 |
| visualization | 7 | 37/42 (88%) | 21,698 |
| networks | 8 | 40/45 (89%) | 16,686 |
| blobs | 4 | 22/24 (92%) | 17,434 |
| blocks | 14 | 79/84 (94%) | 19,552 |
| clickhouse | 27 | 147/156 (94%) | 18,496 |
| attestations | 3 | 17/18 (94%) | 17,408 |
| timing | 9 | 52/54 (96%) | 20,670 |
| smoke | 7 | 41/42 (98%) | 14,943 |
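The category table lists only categories with at least one failed run, and its rows appear to be ordered from lowest to highest pass rate. A small sketch of that filtering and ordering is shown here, using three rows from the table plus one hypothetical all-passing category. The function name and the rounding to whole percent are assumptions about how the report is built.

```python
def failing_categories(rows):
    """rows: (category, passed, total). Keep categories with a failure, lowest pass rate first."""
    failing = [row for row in rows if row[1] < row[2]]
    failing.sort(key=lambda row: row[1] / row[2])
    return [f"{name}: {p}/{t} ({round(100 * p / t)}%)" for name, p, t in failing]

rows = [
    ("smoke", 41, 42),
    ("execution", 28, 33),
    ("blobs", 22, 24),
    ("all_passing_example", 6, 6),  # hypothetical category, dropped by the filter
]

for line in failing_categories(rows):
    print(line)
# execution: 28/33 (85%)
# blobs: 22/24 (92%)
# smoke: 41/42 (98%)
```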
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| attestation_volume_analysis | 1/6 | no reason recorded |
| au_block_arrivals_chart | 1/6 | The agent correctly filtered the data for Australian observers on 8 June 2026 and generated a chart with a valid URL. However, the agent's t |
| cbt_table_bounds | 3/6 | The agent used a full-table MIN/MAX scan via a ClickHouse SQL query to determine the date range, rather than sourcing the answer from the pi |
| chartkit_default_arrival_distribution | 1/6 | string |
| engine_getblobs_blob_scaling | 2/6 | The agent provided average durations instead of the medians requested by the rubric. Furthermore, the values it provided for the slowest cli |
| mempool_observer_coverage | 3/6 | The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric. Although the agent pe |
| parent_distance_orphans | 1/6 | The output provides the total number of blocks with a parent distance greater than 1 and their survival/orphan outcomes, but it fails to rep |
| session_data_analysis | 2/6 | The agent failed to produce a real histogram image artifact. Due to the charting library (chartkit) running out of memory (OOM), the agent f |
| testnet_status | 2/6 | The output provides slot numbers, block production metrics, and missed slot rates, but it completely omits the current epoch and finalized e |
🔭 Langfuse traces (209 runs; ⚠️ = failed)
v0.36.1-rc.0
What's Changed
- test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
- feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
- fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
- ci(release): build and push sandbox image before goreleaser by @samcm in #236
- fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241
Full Changelog: v0.36.0...v0.36.1-rc.0
v0.36.0
What's Changed
- feat(search): re-index on proxy embedding model change by @samcm in #233
- fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235
Full Changelog: v0.35.0...v0.36.0
🐼 Release qualification
Full eval all files: 34 questions, 198 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.36.0 (this release) | 89% (176/198) | 0.220 | 24,678 |
| v0.36.0-rc.0 | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
| v0.33.0-rc.5 | 90% (179/198) | 0.226 | 23,285 |
| v0.33.0-rc.4 | 91% (180/198) | 0.223 | 22,208 |
Tokens per correct run: p50 19,586 · p90 41,017 · p99 93,631 · 27,560 per solve (all spend / correct answers).
Efficiency vs v0.36.0-rc.0: 0.229. This is the harden-loop score with the previous release as the frozen token reference; 0.222 would be this pass rate at identical per-question cost. Higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.36.0-rc.0
| question | v0.36.0-rc.0 | v0.36.0 | |
|---|---|---|---|
| attestation_volume_analysis | 3/6 | 5/6 | 🟢 improved |
| au_block_arrivals_chart | 5/6 | 4/6 | 🔻 regressed |
| blob_count_stats | 6/6 | 5/6 | 🔻 regressed |
| block_timing_analysis | 4/6 | 5/6 | 🟢 improved |
| engine_getblobs_blob_scaling | 5/6 | 6/6 | 🟢 improved |
| engine_newpayload_el_comparison | 6/6 | 4/6 | 🔻 regressed |
| late_data_columns | 0/6 | 1/6 | 🟢 improved |
| mev_relay_share | 5/6 | 6/6 | 🟢 improved |
| mev_timing_games | 6/6 | 5/6 | 🔻 regressed |
| parent_distance_orphans | 6/6 | 4/6 | 🔻 regressed |
| session_data_analysis | 3/6 | 4/6 | 🟢 improved |
| testnet_status | 3/6 | 6/6 | 🟢 improved |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| visualization | 5 | 21/30 (70%) | 25,212 |
| timing | 8 | 34/48 (71%) | 24,668 |
| blobs | 4 | 18/24 (75%) | 31,129 |
| multi_step | 3 | 14/18 (78%) | 30,007 |
| blocks | 12 | 61/72 (85%) | 20,684 |
| clickhouse | 26 | 135/150 (90%) | 20,152 |
| execution | 6 | 31/33 (94%) | 24,714 |
| attestations | 3 | 17/18 (94%) | 19,453 |
| networks | 8 | 43/45 (96%) | 16,224 |
| mev | 4 | 23/24 (96%) | 20,023 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| attestation_volume_analysis | 1/6 | The agent successfully retrieved the correct attestation event counts for the specified date range, maintained the correct network ordering |
| au_block_arrivals_chart | 2/6 | The output fails on two main criteria. First, it does not provide a storage URL for the uploaded image, only stating the file is 'in the wor |
| blob_count_stats | 1/6 | The agent reports a median of 2, which is outside the ±40% tolerance of the expected median of 5 (acceptable range: 3 to 7). Although the mi |
| block_time_histogram | 3/6 | The output provides a local file path (/tmp/block_arrival_histogram.png) instead of a storage URL for the uploaded image, failing the explic |
| block_timing_analysis | 1/6 | The output fails to provide a storage URL for the chart image, which is explicitly required by the rubric. The agent only provided a text de |
| cbt_table_bounds | 2/6 | The agent used a full-table MIN/MAX scan (SELECT min(...) max(...) FROM mainnet.fct_block) to find the date range, rather than sourcing the |
| engine_newpayload_el_comparison | 2/6 | string |
| late_data_columns | 5/6 | The agent's answer reports 46,440 observations and 122 distinct peers. This falls outside the accepted range specified in the rubric, which |
| mev_timing_games | 1/6 | no reason recorded |
| parent_distance_orphans | 2/6 | The agent used a canonical-only analysis (querying canonical_beacon_block) which cannot see orphaned blocks, and incorrectly concluded tha |
| session_data_analysis | 2/6 | The agent successfully queried ~1000 mainnet blocks, calculated a plausible average gas used (29,909,547), generated a histogram, and reused |
🔭 Langfuse traces (195 runs; ⚠️ = failed)
v0.36.0-rc.0
What's Changed
- feat(search): re-index on proxy embedding model change by @samcm in #233
- fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235
Full Changelog: v0.35.0...v0.36.0-rc.0
🐼 Release qualification
Full eval all files: 34 questions, 198 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.
| release | pass rate | mean score | mean tokens (correct) |
|---|---|---|---|
| v0.36.0-rc.0 (this release) | 87% (173/198) | 0.219 | 24,067 |
| v0.35.0 | 87% (173/198) | 0.216 | 22,956 |
| v0.35.0-rc.0 | 88% (174/198) | 0.219 | 22,182 |
| v0.34.0 | 85% (169/198) | 0.209 | 23,915 |
| v0.34.0-rc.0 | 84% (167/198) | 0.209 | 22,725 |
| v0.33.0 | 87% (173/198) | 0.214 | 21,847 |
| v0.33.0-rc.5 | 90% (179/198) | 0.226 | 23,285 |
| v0.33.0-rc.4 | 91% (180/198) | 0.223 | 22,208 |
| v0.33.0-rc.3 | 87% (173/198) | 0.217 | 22,738 |
Tokens per correct run: p50 19,156 · p90 38,756 · p99 91,973 · 28,620 per solve (all spend / correct answers).
Efficiency vs v0.35.0: 0.220. This is the harden-loop score with the previous release as the frozen token reference; 0.218 would be this pass rate at identical per-question cost. Higher is leaner, lower is heavier.
📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.
Per-question changes vs v0.35.0
| question | v0.35.0 | v0.36.0-rc.0 | |
|---|---|---|---|
| au_block_arrivals_chart | 4/6 | 5/6 | 🟢 improved |
| block_time_histogram | 4/6 | 3/6 | 🔻 regressed |
| cbt_table_bounds | 6/6 | 4/6 | 🔻 regressed |
| engine_getblobs_blob_scaling | 6/6 | 5/6 | 🔻 regressed |
| engine_newpayload_el_comparison | 5/6 | 6/6 | 🟢 improved |
| mempool_observer_coverage | 3/6 | 6/6 | 🟢 improved |
| parent_distance_orphans | 4/6 | 6/6 | 🟢 improved |
| session_data_analysis | 4/6 | 3/6 | 🔻 regressed |
| testnet_status | 5/6 | 3/6 | 🔻 regressed |
Categories with failures
| category | questions | pass rate | median tokens (correct) |
|---|---|---|---|
| multi_step | 3 | 10/18 (56%) | 26,949 |
| visualization | 5 | 18/30 (60%) | 26,880 |
| blobs | 4 | 17/24 (71%) | 35,660 |
| timing | 8 | 36/48 (75%) | 24,830 |
| attestations | 3 | 15/18 (83%) | 18,904 |
| blocks | 12 | 62/72 (86%) | 19,886 |
| networks | 8 | 40/45 (89%) | 16,563 |
| clickhouse | 26 | 134/150 (89%) | 20,360 |
| mev | 4 | 23/24 (96%) | 19,467 |
| execution | 6 | 32/33 (97%) | 25,332 |
Failed runs
| question | failed | sample grader reason |
|---|---|---|
| attestation_volume_analysis | 3/6 | The agent correctly calculated the attestation events with the expected magnitudes and network ordering. However, it failed two key requirem |
| au_block_arrivals_chart | 1/6 | The agent successfully filtered the data for Australian observers on 8 June 2026 and generated a chart. However, it did not provide a storag |
| block_time_histogram | 3/6 | The output provides an ASCII text-based histogram and summary statistics, but it does not generate, upload, or provide a URL to an actual im |
| block_timing_analysis | 2/6 | The output fails to provide a storage URL for the chart image, only providing a local file path (/tmp/block_arrival_boxplot.png). According |
| cbt_table_bounds | 2/6 | The agent used a full-table MIN/MAX scan with a broad WHERE clause to bypass the primary key error, rather than sourcing the answer from the |
| engine_getblobs_blob_scaling | 1/6 | The agent's final answer only describes the query it executed but fails to provide the actual results, concrete per-client numbers, or the a |
| late_data_columns | 6/6 | The agent's final answer provides 44,700 observations and 122 distinct peers. While this is grounded in a real query over the raw libp2p gos |
| mev_relay_share | 1/6 | The agent reported 100%, which falls outside the 85-95% band specified in the rubric. The agent's query incorrectly filtered the denominator |
| session_data_analysis | 3/6 | The agent successfully queries ~1000 recent mainnet blocks, computes a plausible average gas used per block (~30.4M), and produces an ASCII |
| testnet_status | 3/6 | The output reports the current epoch and block production figures, but explicitly states that the finalized epoch is unavailable and therefo |
🔭 Langfuse traces (198 runs; ⚠️ = failed)






