Releases · ethpandaops/panda

v0.38.0-rc.3 Pre-release

Savid released this 03 Jul 02:49

v0.38.0-rc.3

1bbe2d3

What's Changed

docs(runbooks): fix accuracy, contract, and retrieval issues from live testing by @Savid in #278

Full Changelog: v0.38.0-rc.2...v0.38.0-rc.3

🐼 Release qualification

Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit 1bbe2d3.

release	pass rate	mean score	mean tokens (correct)
v0.38.0-rc.3 (this release)	0% (0/216)	0.000	0
v0.38.0-rc.1	94% (204/216)	0.232	22,867
v0.38.0-rc.0	94% (203/216)	0.236	23,507
v0.37.0	94% (202/216)	0.232	22,162
v0.37.0-rc.0	92% (198/216)	0.227	23,260
v0.36.2-rc.0	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
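
The pass-rate column appears to be the passed/runs ratio rounded to the nearest whole percent. The following Python sketch assumes that rounding rule (it is inferred from the rows above, not stated by the report) and checks it against the printed values. The function name `pass_rate_percent` is hypothetical.

```python
def pass_rate_percent(passed: int, runs: int) -> int:
    """Return the pass rate as a whole-number percentage, rounded to nearest."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return round(100 * passed / runs)


# (release, passed, runs, percentage as printed in the table above)
ROWS = [
    ("v0.38.0-rc.3", 0, 216, 0),
    ("v0.38.0-rc.1", 204, 216, 94),
    ("v0.38.0-rc.0", 203, 216, 94),
    ("v0.37.0", 202, 216, 94),
    ("v0.37.0-rc.0", 198, 216, 92),
    ("v0.36.2-rc.0", 194, 210, 92),
    ("v0.36.0", 176, 198, 89),
    ("v0.36.0-rc.0", 173, 198, 87),
    ("v0.35.0", 173, 198, 87),
]

for release, passed, runs, printed in ROWS:
    # Fails loudly if the assumed rounding rule disagrees with a printed value.
    assert pass_rate_percent(passed, runs) == printed, release
```

Because the percentages are rounded, releases with different pass counts (204, 203 and 202 out of 216) all show 94%; the raw counts are the more precise comparison.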

Tokens per correct run: p50 0 · p90 0 · p99 0 · 0 per solve (all spend / correct answers). No run was graded correct (0/216), so these zeros are placeholders rather than measurements.

Efficiency vs v0.38.0-rc.1: 0.000 — the harden-loop score with the previous release as the frozen token reference. 0.000 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.38.0-rc.1

question	v0.38.0-rc.1	v0.38.0-rc.3	change
`attestation_volume_analysis`	6/6	0/6	🔻 regressed
`au_block_arrivals_chart`	3/6	0/6	🔻 regressed
`blob_count_stats`	6/6	0/6	🔻 regressed
`block_count_24h`	6/6	0/6	🔻 regressed
`block_size_vs_propagation`	6/6	0/6	🔻 regressed
`block_time_histogram`	6/6	0/6	🔻 regressed
`block_timing_analysis`	5/6	0/6	🔻 regressed
`cbt_table_bounds`	6/6	0/6	🔻 regressed
`chartkit_default_arrival_distribution`	6/6	0/6	🔻 regressed
`chartkit_default_entity_bar`	6/6	0/6	🔻 regressed
`dec4_relay_circuit_breaker`	6/6	0/6	🔻 regressed
`devnet_error_logs`	6/6	0/6	🔻 regressed
`devnets_running`	6/6	0/6	🔻 regressed
`empty_blob_submitters`	6/6	0/6	🔻 regressed
`engine_getblobs_blob_scaling`	5/6	0/6	🔻 regressed
`engine_newpayload_el_comparison`	5/6	0/6	🔻 regressed
`forky_node_coverage`	6/6	0/6	🔻 regressed
`head_vote_correctness_24h`	6/6	0/6	🔻 regressed
`late_data_columns`	3/6	0/6	🔻 regressed
`list_datasources`	6/6	0/6	🔻 regressed
`mainnet_average_base_fee_per_gas_jun_10_2026`	3/3	0/3	🔻 regressed
`mainnet_block_arrival_p50`	6/6	0/6	🔻 regressed
`mainnet_xatu_top_peer_count_node`	3/3	0/3	🔻 regressed
`mempool_observer_coverage`	5/6	0/6	🔻 regressed
`mev_max_value_block`	6/6	0/6	🔻 regressed
`mev_relay_share`	6/6	0/6	🔻 regressed
`mev_timing_games`	6/6	0/6	🔻 regressed
`missed_slots_24h`	6/6	0/6	🔻 regressed
`parent_distance_orphans`	6/6	0/6	🔻 regressed
`precompile_usage`	6/6	0/6	🔻 regressed
`sepolia_validator_lookup`	6/6	0/6	🔻 regressed
`session_data_analysis`	6/6	0/6	🔻 regressed
`storage_upload_session_scoped`	6/6	0/6	🔻 regressed
`testnet_status`	4/6	0/6	🔻 regressed
`tracoor_node_coverage`	6/6	0/6	🔻 regressed
`validator_count`	6/6	0/6	🔻 regressed
`validator_head_accuracy`	6/6	0/6	🔻 regressed
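
The 216-run total follows from the denominators in the table above: 35 questions were run six times and two (`mainnet_average_base_fee_per_gas_jun_10_2026` and `mainnet_xatu_top_peer_count_node`) were run three times. A quick arithmetic check in Python, with the counts read off the table by hand:

```python
# Map "runs per question" -> "number of questions with that many runs",
# counted from the denominators in the per-question table.
runs_per_question = {6: 35, 3: 2}

total_questions = sum(runs_per_question.values())
total_runs = sum(runs * count for runs, count in runs_per_question.items())

print(total_questions, total_runs)  # 37 216
```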

Categories with failures

category	questions	pass rate
`attestations`	3	0/18 (0%)
`blobs`	4	0/24 (0%)
`blocks`	14	0/84 (0%)
`clickhouse`	27	0/156 (0%)
`execution`	6	0/33 (0%)
`mev`	4	0/24 (0%)
`multi_step`	6	0/36 (0%)
`networks`	8	0/45 (0%)
`smoke`	8	0/48 (0%)
`storage`	1	0/6 (0%)
`timing`	9	0/54 (0%)
`validators`	4	0/24 (0%)
`visualization`	7	0/42 (0%)

Failed runs

question	failed	sample grader reason
`attestation_volume_analysis`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`au_block_arrivals_chart`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`blob_count_stats`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`block_count_24h`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`block_size_vs_propagation`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`block_time_histogram`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`block_timing_analysis`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`cbt_table_bounds`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`chartkit_default_arrival_distribution`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`chartkit_default_entity_bar`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`dec4_relay_circuit_breaker`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`devnet_error_logs`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`devnets_running`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`empty_blob_submitters`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`engine_getblobs_blob_scaling`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`engine_newpayload_el_comparison`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`forky_node_coverage`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`head_vote_correctness_24h`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`late_data_columns`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`list_datasources`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mainnet_average_base_fee_per_gas_jun_10_2026`	3/3	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mainnet_block_arrival_p50`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mainnet_xatu_top_peer_count_node`	3/3	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mempool_observer_coverage`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mev_max_value_block`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mev_relay_share`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`mev_timing_games`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`missed_slots_24h`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`parent_distance_orphans`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`precompile_usage`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`sepolia_validator_lookup`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`session_data_analysis`	6/6	API error: 503 Service Unavailable {"error":{"message":"Error from provider (Console Go): Inference is temporarily unavailable","type":"serv
`storage_uplo...

Contributors

Savid

Assets 24

v0.38.0-rc.2 Pre-release

Savid released this 03 Jul 00:35

v0.38.0-rc.2

ad0426b

What's Changed

feat(compute): human-readable default output + list filtering by @samcm in #267
feat(compute): add snapshots promote command by @samcm in #271
fix(cli): datasource-aware error hints to reduce avoidable CLI errors by @qu0b in #130
feat(runbooks): add devnet consensus bug report runbook by @qu0b in #273
Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding by @Savid in #274

Full Changelog: v0.38.0-rc.1...v0.38.0-rc.2

Contributors

Savid, samcm, and qu0b

Assets 21

v0.38.0-rc.1 Pre-release

github-actions released this 30 Jun 08:08

v0.38.0-rc.1

f2f4716

Changelog

f2f4716 feat(proxy): forward caller token to compute backend (#268)

🐼 Release qualification

Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit f2f4716.

release	pass rate	mean score	mean tokens (correct)
v0.38.0-rc.1 (this release)	94% (204/216)	0.232	22,867
v0.38.0-rc.0	94% (203/216)	0.236	23,507
v0.37.0	94% (202/216)	0.232	22,162
v0.37.0-rc.0	92% (198/216)	0.227	23,260
v0.36.2-rc.0	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182

Tokens per correct run: p50 19,303 · p90 34,731 · p99 72,288 · 24,424 per solve (all spend / correct answers).
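
The line above defines "per solve" as all token spend divided by the number of correct answers, while the percentiles cover correct runs only. The sketch below illustrates that distinction with made-up numbers. The `Run` record shape and the nearest-rank percentile method are assumptions; the report does not say how it computes percentiles.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    """One eval run; a hypothetical record shape used only for illustration."""
    tokens: int
    correct: bool


def tokens_per_solve(runs: Sequence[Run]) -> Optional[float]:
    """All tokens spent (correct and failed runs) divided by correct answers."""
    solved = sum(1 for run in runs if run.correct)
    if solved == 0:
        return None  # the ratio is undefined when nothing was solved
    return sum(run.tokens for run in runs) / solved


def percentile(values: Sequence[int], p: float) -> int:
    """Nearest-rank percentile of a non-empty sequence (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative data, not taken from the report.
runs = [Run(10_000, True), Run(20_000, True), Run(30_000, False), Run(40_000, True)]
correct_tokens = [run.tokens for run in runs if run.correct]

print(percentile(correct_tokens, 50))  # median over correct runs only
print(tokens_per_solve(runs))          # numerator includes the failed run
```

Failed runs add to the numerator but not the denominator, which is consistent with the per-solve figure for this release (24,424) sitting above the mean over correct runs (22,867).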

Efficiency vs v0.38.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.236 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.38.0-rc.0

question	v0.38.0-rc.0	v0.38.0-rc.1	change
`au_block_arrivals_chart`	5/6	3/6	🔻 regressed
`block_timing_analysis`	6/6	5/6	🔻 regressed
`cbt_table_bounds`	4/6	6/6	🟢 improved
`chartkit_default_entity_bar`	5/6	6/6	🟢 improved
`engine_getblobs_blob_scaling`	6/6	5/6	🔻 regressed
`engine_newpayload_el_comparison`	6/6	5/6	🔻 regressed
`late_data_columns`	2/6	3/6	🟢 improved
`parent_distance_orphans`	5/6	6/6	🟢 improved
`testnet_status`	3/6	4/6	🟢 improved
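
The change labels in the table above can be reproduced by comparing the two passed/runs cells as fractions. The sketch below assumes a simple rule: a higher fraction is "improved", a lower one is "regressed", and a previous cell without a score marks a new question. The helper names `parse_score` and `classify` are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def parse_score(cell: str) -> Optional[Fraction]:
    """Parse a 'passed/runs' cell; return None for a cell without a score."""
    if "/" not in cell:
        return None
    passed, runs = cell.split("/")
    return Fraction(int(passed), int(runs))


def classify(previous: str, current: str) -> str:
    """Label a question by comparing its previous and current pass fractions."""
    before, after = parse_score(previous), parse_score(current)
    if after is None:
        raise ValueError("current release is assumed to always have a score")
    if before is None:
        return "new question"
    if after > before:
        return "improved"
    if after < before:
        return "regressed"
    return "unchanged"


# (question, previous, current, label printed in the table above)
ROWS = [
    ("au_block_arrivals_chart", "5/6", "3/6", "regressed"),
    ("block_timing_analysis", "6/6", "5/6", "regressed"),
    ("cbt_table_bounds", "4/6", "6/6", "improved"),
    ("chartkit_default_entity_bar", "5/6", "6/6", "improved"),
    ("engine_getblobs_blob_scaling", "6/6", "5/6", "regressed"),
    ("engine_newpayload_el_comparison", "6/6", "5/6", "regressed"),
    ("late_data_columns", "2/6", "3/6", "improved"),
    ("parent_distance_orphans", "5/6", "6/6", "improved"),
    ("testnet_status", "3/6", "4/6", "improved"),
]

for question, previous, current, printed in ROWS:
    assert classify(previous, current) == printed, question
```

With six runs per question, a one-run swing moves a question between labels, so single-step changes such as 5/6 to 6/6 may reflect run-to-run noise as much as a real change.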

Categories with failures

category	questions	pass rate	median tokens (correct)
`blobs`	4	20/24 (83%)	25,612
`timing`	9	46/54 (85%)	22,610
`visualization`	7	38/42 (90%)	23,413
`execution`	6	30/33 (91%)	26,034
`blocks`	14	80/84 (95%)	20,640
`clickhouse`	27	149/156 (96%)	19,788
`networks`	8	43/45 (96%)	16,263
`multi_step`	6	35/36 (97%)	23,743

Failed runs

question	failed	sample grader reason
`au_block_arrivals_chart`	3/6	API error: 400 Bad Request {"error":{"message":"Error from provider (Alibaba): <400> InternalError.Algo.DataInspectionFailed: Input text dat
`block_timing_analysis`	1/6	The agent failed to re-use the session to maintain persistence between turns. It repeatedly attempted to use a non-existent session name ('b
`engine_getblobs_blob_scaling`	1/6	API call error: Error: Error parsing response from https://opencode.ai/zen/go/v1/chat/completions: Unexpected end of JSON input. HTTP 503 Se
`engine_newpayload_el_comparison`	1/6	The output fails the criterion to conclude that everything sat far below the 12s slot. Instead, it explicitly states that erigon's max durat
`late_data_columns`	3/6	string
`mempool_observer_coverage`	1/6	The agent reported 4 distinct execution clients, which falls outside the accepted range of 19-23 distinct observing clients required by the
`testnet_status`	2/6	The output provides block-production and participation figures derived from real queries, but it completely omits the current epoch and fina
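
Some sample reasons above describe provider or transport errors rather than a judgement on the answer. The sketch below separates the two using a prefix heuristic, which is an assumption and not part of the eval tooling. The reason strings are leading excerpts of the truncated text in the table.

```python
# Prefixes treated as infrastructure failures (a heuristic chosen for this sketch).
INFRA_PREFIXES = ("API error:", "API call error:")


def is_infrastructure_failure(reason: str) -> bool:
    """True when the reason looks like a provider/transport error."""
    return reason.startswith(INFRA_PREFIXES)


# question -> leading excerpt of its sample grader reason
SAMPLE_REASONS = {
    "au_block_arrivals_chart": "API error: 400 Bad Request",
    "block_timing_analysis": "The agent failed to re-use the session",
    "engine_getblobs_blob_scaling": "API call error: Error: Error parsing response",
    "engine_newpayload_el_comparison": "The output fails the criterion",
    "late_data_columns": "string",
    "mempool_observer_coverage": "The agent reported 4 distinct execution clients",
    "testnet_status": "The output provides block-production and participation figures",
}

infra = sorted(q for q, r in SAMPLE_REASONS.items() if is_infrastructure_failure(r))
print(infra)
```

Each row carries only one sample reason, so this flags questions whose sample was an infrastructure error; it does not establish that every failed run of that question failed for the same cause.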

🔭 Langfuse traces (216 runs; ⚠️ = failed)

Assets 24

v0.38.0-rc.0 Pre-release

github-actions released this 30 Jun 04:22

v0.38.0-rc.0

bc33ee9

Changelog

bc33ee9 feat(compute): add compute datasource for ephemeral sandboxes (#265)
3c3fa2f feat(resources): resolve the right resource path on a read miss (#261)
695c89c fix(search): activate semantic search on proxy discovery, not only at startup (#260)
d388d4a release-eval: serve release reports via the race-proof viewer (#259)

🐼 Release qualification

Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit bc33ee9.

release	pass rate	mean score	mean tokens (correct)
v0.38.0-rc.0 (this release)	94% (203/216)	0.236	23,507
v0.37.0	94% (202/216)	0.232	22,162
v0.37.0-rc.0	92% (198/216)	0.227	23,260
v0.36.2-rc.0	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915

Tokens per correct run: p50 19,454 · p90 36,674 · p99 76,701 · 25,016 per solve (all spend / correct answers).

Efficiency vs v0.37.0: 0.224 — the harden-loop score with the previous release as the frozen token reference. 0.235 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.37.0

question	v0.37.0	v0.38.0-rc.0	change
`au_block_arrivals_chart`	4/6	5/6	🟢 improved
`blob_count_stats`	5/6	6/6	🟢 improved
`chartkit_default_entity_bar`	6/6	5/6	🔻 regressed
`devnet_error_logs`	5/6	6/6	🟢 improved
`engine_getblobs_blob_scaling`	5/6	6/6	🟢 improved
`engine_newpayload_el_comparison`	5/6	6/6	🟢 improved
`testnet_status`	6/6	3/6	🔻 regressed

Categories with failures

category	questions	pass rate	median tokens (correct)
`blobs`	4	20/24 (83%)	31,030
`networks`	8	40/45 (89%)	15,956
`timing`	9	49/54 (91%)	22,714
`visualization`	7	40/42 (95%)	23,132
`clickhouse`	27	149/156 (96%)	20,830
`blocks`	14	81/84 (96%)	20,879
`execution`	6	32/33 (97%)	25,844
`multi_step`	6	35/36 (97%)	25,040

Failed runs

question	failed	sample grader reason
`au_block_arrivals_chart`	1/6	The output successfully filters the data for Australian observers, builds a chart, and provides the storage URL. However, it fails the speci
`cbt_table_bounds`	2/6	The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds
`chartkit_default_entity_bar`	1/6	The agent failed to re-use the panda execute session across turns. It used session 169c33497071 in Turn 1 but created a new session 29a67bde
`late_data_columns`	4/6	The agent's query returned 88,240 late observations, which falls outside the accepted range of roughly 35,000 to 55,000 observations specifi
`mempool_observer_coverage`	1/6	The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric.
`parent_distance_orphans`	1/6	The agent failed to report the full distribution of parent distances (distance 1, 2, and 3+). Its query lumped all distances > 1 together, a
`testnet_status`	3/6	The output reports the current epoch (105,673) and recent block-production figures, but it completely omits the finalized epoch, which is ex

🔭 Langfuse traces (216 runs; ⚠️ = failed)

Assets 24

v0.37.0 Latest

samcm released this 29 Jun 07:03

v0.37.0

ace9f8e

What's Changed

test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
ci(release): build and push sandbox image before goreleaser by @samcm in #236
fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241
feat(sandbox): add chartkit charting library by @samcm in #246
feat(storage): scope uploads by session and return host path by @samcm in #248
fix: It should use chartkits for the actual charting? by @samcm in #249
fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
[codex] embed network specs in resources by @parithosh in #256
feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
search: replace inline content with ref URIs + panda read command by @qu0b in #255

Full Changelog: v0.36.0...v0.37.0

🐼 Release qualification

Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit ace9f8e.

release	pass rate	mean score	mean tokens (correct)
v0.37.0 (this release)	94% (202/216)	0.232	22,162
v0.37.0-rc.0	92% (198/216)	0.227	23,260
v0.36.2-rc.0	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915
v0.34.0-rc.0	84% (167/198)	0.209	22,725

Tokens per correct run: p50 19,116 · p90 32,706 · p99 57,458 · 23,606 per solve (all spend / correct answers).

Efficiency vs v0.37.0-rc.0: 0.240 — the harden-loop score with the previous release as the frozen token reference. 0.234 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.37.0-rc.0

question	v0.37.0-rc.0	v0.37.0	change
`blob_count_stats`	6/6	5/6	🔻 regressed
`cbt_table_bounds`	3/6	4/6	🟢 improved
`chartkit_default_entity_bar`	5/6	6/6	🟢 improved
`devnet_error_logs`	6/6	5/6	🔻 regressed
`parent_distance_orphans`	4/6	5/6	🟢 improved
`session_data_analysis`	5/6	6/6	🟢 improved
`testnet_status`	5/6	6/6	🟢 improved
`validator_count`	5/6	6/6	🟢 improved

Categories with failures

category	questions	pass rate	median tokens (correct)
`blobs`	4	18/24 (75%)	28,362
`timing`	9	47/54 (87%)	20,283
`execution`	6	30/33 (91%)	23,764
`networks`	8	42/45 (93%)	17,644
`clickhouse`	27	146/156 (94%)	19,286
`visualization`	7	40/42 (95%)	22,260
`blocks`	14	81/84 (96%)	19,364

Failed runs

question	failed	sample grader reason
`au_block_arrivals_chart`	2/6	The agent correctly filtered the data for Australian observers on 8 June 2026, built a chart, and provided the storage URL. However, the age
`blob_count_stats`	1/6	The agent reported an average of 3.13 and a median (P50) of 2. While the average is within the 40% tolerance of the expected ~4.7, the repor
`cbt_table_bounds`	2/6	The agent used a full-table MIN/MAX SQL scan to determine the date range instead of sourcing the answer from the pipeline's coverage/bounds
`devnet_error_logs`	1/6	The agent discovered active devnets and attempted to query the OTel logs in ClickHouse, but the query failed with a primary key error. The o
`engine_getblobs_blob_scaling`	1/6	The output fails to provide concrete per-client numbers for all execution clients, only giving a vague bound ('sub-50ms') for the fastest ti
`engine_newpayload_el_comparison`	1/6	The agent correctly reports per-execution-client durations from a real query, with values within the acceptable 2x range and correct fastest
`late_data_columns`	4/6	string
`mempool_observer_coverage`	1/6	The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric, even though the total
`parent_distance_orphans`	1/6	The agent failed to report the full parent distance distribution, completely missing the ~221,000 blocks at distance 1 (which make up 99.6%

🔭 Langfuse traces (216 runs; ⚠️ = failed)

Contributors

samcm, parithosh, and qu0b

Assets 24

file	sha256	size	uploaded (UTC)
checksums.txt	sha256:011246621976fb643399216a37e4c2332093564b39235fc3d89e2eb2afbaadb8	1.66 KB	2026-06-29T07:57:34Z
checksums.txt.pem	sha256:ff9953e75daa11d35c8ea395e8b0c81f7818bfee6e792e3d42441ceb6871f18c	3.23 KB	2026-06-29T07:57:34Z
checksums.txt.sig	sha256:fc42920f515763a6a13ef1df444f5c61f39a1a71c000073f44865d9311ab123e	96 Bytes	2026-06-29T07:57:34Z
eval-qualification.json	sha256:c62b42affd804b29c7650b5dc11996c07ca97e4bdafe06c12fbf6e77f9069f60	8.24 KB	2026-06-29T07:48:59Z
eval-report.html	sha256:40ab0fc7355c7b620e595aff807e46a835acca0043c7f98dffb4a8f2ff7c9268	817 KB	2026-06-29T07:48:59Z
eval-trend.png	sha256:9994643589b5f33de3f6dc807472861109e8a13ff47cb8ab8e69f82c71133b3c	63.8 KB	2026-06-29T07:48:59Z
panda-proxy_0.37.0_linux_amd64.tar.gz	sha256:dd55da63f220e44da0c22cef8631d7d1bd5ef7465613f31ecb51dffca099c205	6.55 MB	2026-06-29T07:57:31Z
panda-proxy_0.37.0_linux_amd64.tar.gz.sbom.json	sha256:f3e48e6b97e3df66b1ba286592fd1fe2669449bc5c48c0f962e32d316e3d971b	45.2 KB	2026-06-29T07:57:33Z
panda-proxy_0.37.0_linux_arm64.tar.gz	sha256:31219b4e68b50c27d018f9a90237b066f1df4e1189871f9a7755be8c8d5cb0f3	5.82 MB	2026-06-29T07:57:31Z
panda-proxy_0.37.0_linux_arm64.tar.gz.sbom.json	sha256:a58374342985439e8221760f0ee973d46e8a42ae866d459a5e83c975cafe8c7d	45.2 KB	2026-06-29T07:57:33Z
Source code (zip)	n/a	n/a	2026-06-29T04:11:36Z
Source code (tar.gz)	n/a	n/a	2026-06-29T04:11:36Z
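
The sha256 digests listed with the assets above can be checked against a downloaded file. The following is a minimal Python sketch using only the standard library. The helper names and the local file path are hypothetical, and it covers the digest comparison only, not verification of `checksums.txt.sig` or `checksums.txt.pem`.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path: Union[str, Path], listed: str) -> bool:
    """Compare a file against a digest written as 'sha256:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return sha256_of(path) == expected


if __name__ == "__main__":
    # Assumes checksums.txt was downloaded next to this script.
    asset = Path("checksums.txt")
    listed = "sha256:011246621976fb643399216a37e4c2332093564b39235fc3d89e2eb2afbaadb8"
    if asset.exists():
        print(matches_listed_digest(asset, listed))
```

Reading in chunks keeps memory use flat for the multi-megabyte `panda-proxy` archives.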

v0.37.0-rc.0 Pre-release

samcm released this 29 Jun 04:12

v0.37.0-rc.0

ace9f8e

What's Changed

feat(storage): scope uploads by session and return host path by @samcm in #248
fix: It should use chartkits for the actual charting? by @samcm in #249
fix(chartkit): harden against visually-broken charts; require network + structured source by @samcm in #251
feat(networks): devnet data + notes.ethereum.org specs in CLI and Python by @samcm in #250
docs(storage): surface host_path and session reuse for the storage upload smoke case by @samcm in #253
feat(evm): EVM execution module — opcode trial, tracing, tx submission against devnet nodes by @qu0b in #252
[codex] embed network specs in resources by @parithosh in #256
feat(auth): make the server the sole owner of proxy credentials by @samcm in #258
search: replace inline content with ref URIs + panda read command by @qu0b in #255

Full Changelog: v0.36.2-rc.0...v0.37.0-rc.0

🐼 Release qualification

Full eval, all files: 37 questions, 216 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit ace9f8e.

release	pass rate	mean score	mean tokens (correct)
v0.37.0-rc.0 (this release)	92% (198/216)	0.227	23,260
v0.36.2-rc.0	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915
v0.34.0-rc.0	84% (167/198)	0.209	22,725
v0.33.0	87% (173/198)	0.214	21,847

Tokens per correct run: p50 19,292 · p90 36,003 · p99 65,429 · 25,122 per solve (all spend / correct answers).

Efficiency vs v0.36.2-rc.0: 0.223 — the harden-loop score with the previous release as the frozen token reference. 0.229 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.2-rc.0

question	v0.36.2-rc.0	v0.37.0-rc.0	change
`attestation_volume_analysis`	5/6	6/6	🟢 improved
`au_block_arrivals_chart`	5/6	4/6	🔻 regressed
`chartkit_default_arrival_distribution`	5/6	6/6	🟢 improved
`chartkit_default_entity_bar`	6/6	5/6	🔻 regressed
`engine_getblobs_blob_scaling`	4/6	5/6	🟢 improved
`engine_newpayload_el_comparison`	6/6	5/6	🔻 regressed
`late_data_columns`	6/6	2/6	🔻 regressed
`mempool_observer_coverage`	3/6	5/6	🟢 improved
`parent_distance_orphans`	5/6	4/6	🔻 regressed
`session_data_analysis`	4/6	5/6	🟢 improved
`storage_upload_session_scoped`	—	6/6	🆕 new question
`testnet_status`	4/6	5/6	🟢 improved
`validator_count`	6/6	5/6	🔻 regressed

Categories with failures

category	questions	pass rate	median tokens (correct)
`blobs`	4	19/24 (79%)	30,196
`timing`	9	47/54 (87%)	22,314
`visualization`	7	38/42 (90%)	22,744
`execution`	6	30/33 (91%)	21,710
`networks`	8	41/45 (91%)	16,197
`clickhouse`	27	144/156 (92%)	19,976
`blocks`	14	78/84 (93%)	21,088
`multi_step`	6	34/36 (94%)	22,924
`validators`	4	23/24 (96%)	17,916

Failed runs

question	failed	sample grader reason
`au_block_arrivals_chart`	2/6	The agent successfully geo-filters the data for Australian observers on June 8, 2026, builds a chart, and provides the storage URL. However,
`cbt_table_bounds`	3/6	The agent performed a full-table MIN/MAX scan on the `fct_block` table (`SELECT min(...) ... FROM mainnet.fct_block`) instead of sourcing th
`chartkit_default_entity_bar`	1/6	string
`engine_getblobs_blob_scaling`	1/6	The output provides concrete per-client durations from a real query and correctly preserves the fast/slow client grouping. However, the abso
`engine_newpayload_el_comparison`	1/6	The output fails to conclude that everything sat far below the 12s slot, as it explicitly states that Erigon's max of 7.8s (65% of the slot)
`late_data_columns`	4/6	The agent executed a real query over the raw libp2p gossipsub data-column-sidecar observations for 8 June 2026 with a >5-minute (300,000 ms)
`mempool_observer_coverage`	1/6	The agent reported 1 distinct client and 53,148,763 observations. The rubric requires the number of distinct clients to be between 19 and 23
`parent_distance_orphans`	2/6	The agent failed to report the distribution of parent distances (distance 1, 2, and 3+). It only provided the aggregate count for blocks wit
`session_data_analysis`	1/6	string
`testnet_status`	1/6	The output provides concrete participation and block production figures from real queries, but it fails to explicitly report the finalized e
`validator_count`	1/6	The output reports 200,115,277 active validators, which is significantly outside the accepted range of 870k-910k specified in the rubric.

🔭 Langfuse traces (216 runs; ⚠️ = failed)

Contributors

samcm, parithosh, and qu0b

Assets 24

v0.36.2-rc.0 Pre-release

samcm released this 23 Jun 02:53

v0.36.2-rc.0

d607b76

What's Changed

feat(sandbox): add chartkit charting library by @samcm in #246

Full Changelog: v0.36.1-rc.0...v0.36.2-rc.0

🐼 Release qualification

Full eval, all files: 36 questions, 210 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit d607b76.

release	pass rate	mean score	mean tokens (correct)
v0.36.2-rc.0 (this release)	92% (194/210)	0.230	22,363
v0.36.0	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915
v0.34.0-rc.0	84% (167/198)	0.209	22,725
v0.33.0	87% (173/198)	0.214	21,847
v0.33.0-rc.5	90% (179/198)	0.226	23,285

Tokens per correct run: p50 18,177 · p90 32,718 · p99 71,544 · 24,618 per solve (all spend / correct answers).

Efficiency vs v0.36.0: 0.241 — the harden-loop score with the previous release as the frozen token reference. 0.231 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.0

question	v0.36.0	v0.36.2-rc.0	change
`au_block_arrivals_chart`	4/6	5/6	🟢 improved
`blob_count_stats`	5/6	6/6	🟢 improved
`block_time_histogram`	3/6	6/6	🟢 improved
`block_timing_analysis`	5/6	6/6	🟢 improved
`cbt_table_bounds`	4/6	3/6	🔻 regressed
`chartkit_default_arrival_distribution`	—	5/6	🆕 new question
`chartkit_default_entity_bar`	—	6/6	🆕 new question
`engine_getblobs_blob_scaling`	6/6	4/6	🔻 regressed
`engine_newpayload_el_comparison`	4/6	6/6	🟢 improved
`late_data_columns`	1/6	6/6	🟢 improved
`mempool_observer_coverage`	6/6	3/6	🔻 regressed
`mev_timing_games`	5/6	6/6	🟢 improved
`parent_distance_orphans`	4/6	5/6	🟢 improved
`testnet_status`	6/6	4/6	🔻 regressed

Categories with failures

category	questions	pass rate	median tokens (correct)
`execution`	6	28/33 (85%)	20,060
`multi_step`	5	26/30 (87%)	24,364
`visualization`	7	37/42 (88%)	21,698
`networks`	8	40/45 (89%)	16,686
`blobs`	4	22/24 (92%)	17,434
`blocks`	14	79/84 (94%)	19,552
`clickhouse`	27	147/156 (94%)	18,496
`attestations`	3	17/18 (94%)	17,408
`timing`	9	52/54 (96%)	20,670
`smoke`	7	41/42 (98%)	14,943

Failed runs

question	failed	sample grader reason
`attestation_volume_analysis`	1/6	no reason recorded
`au_block_arrivals_chart`	1/6	The agent correctly filtered the data for Australian observers on 8 June 2026 and generated a chart with a valid URL. However, the agent's t
`cbt_table_bounds`	3/6	The agent used a full-table MIN/MAX scan via a ClickHouse SQL query to determine the date range, rather than sourcing the answer from the pi
`chartkit_default_arrival_distribution`	1/6	string
`engine_getblobs_blob_scaling`	2/6	The agent provided average durations instead of the medians requested by the rubric. Furthermore, the values it provided for the slowest cli
`mempool_observer_coverage`	3/6	The output reports 1 distinct client, which falls outside the accepted range of 19-23 clients specified in the rubric. Although the agent pe
`parent_distance_orphans`	1/6	The output provides the total number of blocks with a parent distance greater than 1 and their survival/orphan outcomes, but it fails to rep
`session_data_analysis`	2/6	The agent failed to produce a real histogram image artifact. Due to the charting library (chartkit) running out of memory (OOM), the agent f
`testnet_status`	2/6	The output provides slot numbers, block production metrics, and missed slot rates, but it completely omits the current epoch and finalized e

🔭 Langfuse traces (209 runs; ⚠️ = failed)

Contributors

samcm

Assets 24

v0.36.1-rc.0 Pre-release

samcm released this 22 Jun 05:26

v0.36.1-rc.0

7e4e398

What's Changed

test(eval): fix grader false-negatives on chart-URL and late-column cases by @samcm in #238
feat(eval): grade through the Codex Responses API directly (Codex subscription) by @samcm in #239
fix(eval): deterministic per-test sandbox cleanup via owner attribution by @samcm in #237
ci(release): build and push sandbox image before goreleaser by @samcm in #236
fix(docker): run server as the credentials owner for non-1000 host UIDs by @samcm in #241

Full Changelog: v0.36.0...v0.36.1-rc.0

Contributors

samcm

Assets 21

v0.36.0

samcm released this 19 Jun 05:45

v0.36.0

3d7dfaa

What's Changed

feat(search): re-index on proxy embedding model change by @samcm in #233
fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235

Full Changelog: v0.35.0...v0.36.0

🐼 Release qualification

Full eval, all files: 34 questions, 198 runs (phrasings × repeats) against the hosted proxy. Subject: opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.

release	pass rate	mean score	mean tokens (correct)
v0.36.0 (this release)	89% (176/198)	0.220	24,678
v0.36.0-rc.0	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915
v0.34.0-rc.0	84% (167/198)	0.209	22,725
v0.33.0	87% (173/198)	0.214	21,847
v0.33.0-rc.5	90% (179/198)	0.226	23,285
v0.33.0-rc.4	91% (180/198)	0.223	22,208

Tokens per correct run: p50 19,586 · p90 41,017 · p99 93,631 · 27,560 per solve (all spend / correct answers).
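
As a rough illustration of how the token line above could be derived, the sketch below computes percentiles over the correct runs only and a per-solve cost as all token spend divided by the number of correct answers. The `Run` type, the `token_summary` helper, and the nearest-rank percentile method are assumptions made for illustration. The eval harness may compute these figures differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    tokens: int    # tokens spent by this run (hypothetical field name)
    correct: bool  # whether the grader passed this run


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile; an assumed method, not confirmed by the source."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def token_summary(runs):
    # Percentiles are taken over correct runs only.
    correct = sorted(r.tokens for r in runs if r.correct)
    if not correct:
        # With no correct runs every figure is reported as zero.
        return {"p50": 0, "p90": 0, "p99": 0, "per_solve": 0}
    # "Per solve" charges all spend (including failed runs) to the correct answers.
    total_spend = sum(r.tokens for r in runs)
    return {
        "p50": nearest_rank(correct, 50),
        "p90": nearest_rank(correct, 90),
        "p99": nearest_rank(correct, 99),
        "per_solve": round(total_spend / len(correct)),
    }
```

Because failed runs still count toward spend, the per-solve figure sits above the mean tokens of correct runs whenever any run fails.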

Efficiency vs v0.36.0-rc.0: 0.229 — the harden-loop score with the previous release as the frozen token reference. 0.222 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.36.0-rc.0

question	v0.36.0-rc.0	v0.36.0
`attestation_volume_analysis`	3/6	5/6	🟢 improved
`au_block_arrivals_chart`	5/6	4/6	🔻 regressed
`blob_count_stats`	6/6	5/6	🔻 regressed
`block_timing_analysis`	4/6	5/6	🟢 improved
`engine_getblobs_blob_scaling`	5/6	6/6	🟢 improved
`engine_newpayload_el_comparison`	6/6	4/6	🔻 regressed
`late_data_columns`	0/6	1/6	🟢 improved
`mev_relay_share`	5/6	6/6	🟢 improved
`mev_timing_games`	6/6	5/6	🔻 regressed
`parent_distance_orphans`	6/6	4/6	🔻 regressed
`session_data_analysis`	3/6	4/6	🟢 improved
`testnet_status`	3/6	6/6	🟢 improved
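
The improved/regressed labels in the table above follow from comparing per-question pass counts between the two releases. A minimal sketch, assuming unchanged questions are omitted from the table (which is what the listing suggests); the function name `classify_changes` and the dictionary inputs are hypothetical:

```python
def classify_changes(previous, current):
    """Compare passes per question (out of the same number of runs).

    previous/current map question name -> number of passing runs.
    Returns only the questions whose pass count changed.
    """
    changes = []
    for question in sorted(previous.keys() & current.keys()):
        before, after = previous[question], current[question]
        if after > before:
            label = "improved"
        elif after < before:
            label = "regressed"
        else:
            continue  # unchanged questions are left out of the table
        changes.append((question, before, after, label))
    return changes
```

For example, `testnet_status` going from 3 to 6 passes is labelled improved, while `blob_count_stats` going from 6 to 5 is labelled regressed.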

Categories with failures

category	questions	pass rate	median tokens (correct)
`visualization`	5	21/30 (70%)	25,212
`timing`	8	34/48 (71%)	24,668
`blobs`	4	18/24 (75%)	31,129
`multi_step`	3	14/18 (78%)	30,007
`blocks`	12	61/72 (85%)	20,684
`clickhouse`	26	135/150 (90%)	20,152
`execution`	6	31/33 (94%)	24,714
`attestations`	3	17/18 (94%)	19,453
`networks`	8	43/45 (96%)	16,224
`mev`	4	23/24 (96%)	20,023
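
The pass-rate column above is consistent with passed runs divided by total runs, rounded to a whole percent, with rows ordered from weakest to strongest category. A small sketch of that arithmetic, assuming fully passing categories are dropped from the table; `category_rows` and its input shape are hypothetical:

```python
def category_rows(results):
    """results maps category name -> (passed_runs, total_runs)."""
    # Keep only categories that have at least one failed run.
    failing = {
        name: (passed, total)
        for name, (passed, total) in results.items()
        if total > 0 and passed < total
    }
    # Sort ascending by exact pass ratio so the weakest category comes first.
    ordered = sorted(failing.items(), key=lambda kv: kv[1][0] / kv[1][1])
    return [
        f"{name}\t{passed}/{total} ({round(100 * passed / total)}%)"
        for name, (passed, total) in ordered
    ]
```

With the figures from the table, 34/48 rounds to 71% and 23/24 rounds to 96%.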

Failed runs

question	failed	sample grader reason
`attestation_volume_analysis`	1/6	The agent successfully retrieved the correct attestation event counts for the specified date range, maintained the correct network ordering
`au_block_arrivals_chart`	2/6	The output fails on two main criteria. First, it does not provide a storage URL for the uploaded image, only stating the file is 'in the wor
`blob_count_stats`	1/6	The agent reports a median of 2, which is outside the ±40% tolerance of the expected median of 5 (acceptable range: 3 to 7). Although the mi
`block_time_histogram`	3/6	The output provides a local file path (/tmp/block_arrival_histogram.png) instead of a storage URL for the uploaded image, failing the explic
`block_timing_analysis`	1/6	The output fails to provide a storage URL for the chart image, which is explicitly required by the rubric. The agent only provided a text de
`cbt_table_bounds`	2/6	The agent used a full-table MIN/MAX scan (SELECT min(...) max(...) FROM mainnet.fct_block) to find the date range, rather than sourcing the
`engine_newpayload_el_comparison`	2/6	string
`late_data_columns`	5/6	The agent's answer reports 46,440 observations and 122 distinct peers. This falls outside the accepted range specified in the rubric, which
`mev_timing_games`	1/6	no reason recorded
`parent_distance_orphans`	2/6	The agent used a canonical-only analysis (querying `canonical_beacon_block`) which cannot see orphaned blocks, and incorrectly concluded tha
`session_data_analysis`	2/6	The agent successfully queried ~1000 mainnet blocks, calculated a plausible average gas used (29,909,547), generated a histogram, and reused

🔭 Langfuse traces (195 runs; ⚠️ = failed)

Contributors

samcm

Assets 24

v0.36.0-rc.0 Pre-release

Pre-release

samcm released this 19 Jun 04:43

v0.36.0-rc.0

3d7dfaa

What's Changed

feat(search): re-index on proxy embedding model change by @samcm in #233
fix(auth): coordinate credential refresh across processes and guard logins by @samcm in #235

Full Changelog: v0.35.0...v0.36.0-rc.0

🐼 Release qualification

Full eval all files: 34 questions, 198 runs (phrasings x repeats) against the hosted proxy. Subject opencode-go/deepseek-v4-flash:cli, commit 3d7dfaa.

release	pass rate	mean score	mean tokens (correct)
v0.36.0-rc.0 (this release)	87% (173/198)	0.219	24,067
v0.35.0	87% (173/198)	0.216	22,956
v0.35.0-rc.0	88% (174/198)	0.219	22,182
v0.34.0	85% (169/198)	0.209	23,915
v0.34.0-rc.0	84% (167/198)	0.209	22,725
v0.33.0	87% (173/198)	0.214	21,847
v0.33.0-rc.5	90% (179/198)	0.226	23,285
v0.33.0-rc.4	91% (180/198)	0.223	22,208
v0.33.0-rc.3	87% (173/198)	0.217	22,738

Tokens per correct run: p50 19,156 · p90 38,756 · p99 91,973 · 28,620 per solve (all spend / correct answers).

Efficiency vs v0.35.0: 0.220 — the harden-loop score with the previous release as the frozen token reference. 0.218 would be this pass rate at identical per-question cost; higher is leaner, lower is heavier.

📊 Full report — per-run matrix, category breakdowns, token distributions, every grader reason.

Per-question changes vs v0.35.0

question	v0.35.0	v0.36.0-rc.0
`au_block_arrivals_chart`	4/6	5/6	🟢 improved
`block_time_histogram`	4/6	3/6	🔻 regressed
`cbt_table_bounds`	6/6	4/6	🔻 regressed
`engine_getblobs_blob_scaling`	6/6	5/6	🔻 regressed
`engine_newpayload_el_comparison`	5/6	6/6	🟢 improved
`mempool_observer_coverage`	3/6	6/6	🟢 improved
`parent_distance_orphans`	4/6	6/6	🟢 improved
`session_data_analysis`	4/6	3/6	🔻 regressed
`testnet_status`	5/6	3/6	🔻 regressed

Categories with failures

category	questions	pass rate	median tokens (correct)
`multi_step`	3	10/18 (56%)	26,949
`visualization`	5	18/30 (60%)	26,880
`blobs`	4	17/24 (71%)	35,660
`timing`	8	36/48 (75%)	24,830
`attestations`	3	15/18 (83%)	18,904
`blocks`	12	62/72 (86%)	19,886
`networks`	8	40/45 (89%)	16,563
`clickhouse`	26	134/150 (89%)	20,360
`mev`	4	23/24 (96%)	19,467
`execution`	6	32/33 (97%)	25,332

Failed runs

question	failed	sample grader reason
`attestation_volume_analysis`	3/6	The agent correctly calculated the attestation events with the expected magnitudes and network ordering. However, it failed two key requirem
`au_block_arrivals_chart`	1/6	The agent successfully filtered the data for Australian observers on 8 June 2026 and generated a chart. However, it did not provide a storag
`block_time_histogram`	3/6	The output provides an ASCII text-based histogram and summary statistics, but it does not generate, upload, or provide a URL to an actual im
`block_timing_analysis`	2/6	The output fails to provide a storage URL for the chart image, only providing a local file path (/tmp/block_arrival_boxplot.png). According
`cbt_table_bounds`	2/6	The agent used a full-table MIN/MAX scan with a broad WHERE clause to bypass the primary key error, rather than sourcing the answer from the
`engine_getblobs_blob_scaling`	1/6	The agent's final answer only describes the query it executed but fails to provide the actual results, concrete per-client numbers, or the a
`late_data_columns`	6/6	The agent's final answer provides 44,700 observations and 122 distinct peers. While this is grounded in a real query over the raw libp2p gos
`mev_relay_share`	1/6	The agent reported 100%, which falls outside the 85-95% band specified in the rubric. The agent's query incorrectly filtered the denominator
`session_data_analysis`	3/6	The agent successfully queries ~1000 recent mainnet blocks, computes a plausible average gas used per block (~30.4M), and produces an ASCII
`testnet_status`	3/6	The output reports the current epoch and block production figures, but explicitly states that the finalized epoch is unavailable and therefo

🔭 Langfuse traces (198 runs; ⚠️ = failed)

Contributors

samcm

Assets 24
