A public lab notebook for the ARC Prize 2026 - ARC-AGI-3 Kaggle Code Competition ($850K prize pool, ends 2026-11-02). One person's daily attack on the leaderboard, with the agents, experiments, tooling, and decision log produced along the way.
Current state (D16, 2026-05-14): 8 scored submissions + 1 pending
D16 safety resubmit; LB trace 0.19 / 0.00 / 0.24 / 0.10 / 0.21 /
0.00 / 0.17 / 0.12 / PENDING. Best to date is D3's variance probe at
0.24. D9 confirmed the silent-crash
hypothesis on Goose CNN v1 (0.00 → v2 0.17 with enable_gpu=false
plus defensive try/except). D10+D11 ported the dolphin-in-a-coma
frame-segmentation algorithm (arXiv:2512.24156, MIT) into
agents/frame_segmenter.py and wired it as the ACTION6 click-coord
prior in agents/trigger_bfs_agent.py; D15's wired-up submission
landed at LB 0.12, so the segmenter prior alone is a marginal
+0.02 over trigger-bfs v0 but below Goose v2 and master_v7. D16's
slot was used on a FORGE variance safety resubmit of the best-known
completed kernel (ash-s-arc-agi-3-agent v2, prior LB 0.24) while the
fresh unchanged v3 rerun remains queued. Baseline
anchor remains LB 0.19 (vanilla
fork of an upstream public Kaggle notebook implementing FORGE v19;
see NOTICE for upstream credit + paper attributions). All
"delta vs baseline" figures are measured against this 0.19 number.
See .factory/memories.md for the running
narrative and CHANGELOG.md for user-visible changes.
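
As a quick illustration of the "delta vs baseline" arithmetic, the sketch below recomputes the deltas for the eight scored submissions quoted above. The function and variable names are hypothetical and not taken from the repo; the pending D16 resubmit is left out because it has no score yet.

```python
BASELINE = 0.19  # LB score of the vanilla FORGE v19 fork (the anchor)

# Scored submissions in the order listed in the LB trace; PENDING is omitted.
lb_trace = [0.19, 0.00, 0.24, 0.10, 0.21, 0.00, 0.17, 0.12]


def delta_vs_baseline(score: float, baseline: float = BASELINE) -> float:
    """Signed difference from the baseline, rounded to two decimals."""
    return round(score - baseline, 2)


deltas = [delta_vs_baseline(s) for s in lb_trace]
best = max(lb_trace)

for score, delta in zip(lb_trace, deltas):
    print(f"LB {score:.2f}  delta {delta:+.2f}")
print(f"best {best:.2f} ({delta_vs_baseline(best):+.2f} vs baseline)")
```

Rounding to two decimals matches the precision of the leaderboard numbers and avoids floating-point noise in the printed deltas.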
- What this repo is (and isn't)
- Quickstart
- Repository layout
- Daily workflow
- Agents in the box
- Kaggle integration
- Phase plan and target scores
- Tooling and code quality
- Contributing
- Citing
- License
This repo is:
- An agent zoo + harness for ARC-AGI-3: a uniform choose_action(frame) -> GameAction contract that any agent (random, search-based, neural, LLM-driven) can conform to.
- An offline smoke runner (experiments/local_runner.py) that lets you exercise an agent end-to-end without burning a Kaggle daily submission slot, using either the real arc-agi SDK or a tiny built-in mock environment.
- A Kaggle automation toolkit (scripts/): downloads the competition data, installs the SDK from offline wheels, and helps push / track / submit Kaggle kernels.
- A research log: every Kaggle submission, every gotcha, every decision is captured in .factory/memories.md (append-only) and surfaced publicly in CHANGELOG.md.
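
To make the choose_action(frame) -> GameAction contract concrete, here is a minimal sketch using typing.Protocol. The GameAction enum, the Frame dataclass, and UniformRandomAgent below are hypothetical stand-ins written for this example; the real definitions live in agents/__init__.py and the arc-agi SDK and may differ in fields and naming.

```python
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class GameAction(Enum):
    """Hypothetical stand-in for the SDK's action enum."""

    ACTION1 = 1
    ACTION2 = 2
    ACTION6 = 6  # the click action; click coordinates are omitted here


@dataclass
class Frame:
    """Hypothetical observation: a colour grid plus the legal actions."""

    grid: list[list[int]]
    available_actions: list[GameAction] = field(
        default_factory=lambda: list(GameAction)
    )


class Agent(Protocol):
    def choose_action(self, frame: Frame) -> GameAction: ...


class UniformRandomAgent:
    """Satisfies Agent structurally; no inheritance is required."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def choose_action(self, frame: Frame) -> GameAction:
        return self._rng.choice(frame.available_actions)


def run_episode(agent: Agent, frame: Frame, max_actions: int) -> list[GameAction]:
    """Query the agent repeatedly on a static frame (no environment step)."""
    return [agent.choose_action(frame) for _ in range(max_actions)]


actions = run_episode(
    UniformRandomAgent(seed=0), Frame(grid=[[0, 1], [1, 0]]), max_actions=5
)
```

Because the contract is structural, a search-based or neural agent only needs a method with the same signature to be usable by a harness written against Agent.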
This repo is NOT:
- A finished competition entry (we're still climbing the leaderboard).
- A model zoo: trained model weights are NOT included; large weights are downloaded at runtime from HuggingFace and bundled as private Kaggle Datasets.
- A drop-in pip install library: it's a workspace, not a published package. pyproject.toml has package = false under [tool.uv].
# 1. Get the source
git clone https://github.com/cataluna84/arc-agi-3.git
cd arc-agi-3
# 2. Install uv (https://docs.astral.sh/uv/) if you don't already have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# 3. Create the venv and install deps (including dev: ruff, pre-commit, pytest)
uv venv --python 3.12
uv sync --all-groups
# 4. Install the pre-commit hooks (ruff + secret-leak guards on every commit)
uv run pre-commit install
# 5. Run the offline smoke (no GPU, no Kaggle creds, no model weights)
uv run python scripts/qwen_agent_smoke_local.py
# expect: ALL OK (22/22 checks)
# 6. Run the offline runner against the mock environment
uv run python experiments/local_runner.py \
--agent agents.random_agent:RandomAgent \
--games ls20-mock --max-actions 50
# 7. (optional) Configure Kaggle access
cp .env.example .env # paste KAGGLE_USERNAME and KAGGLE_KEY
uv run python scripts/download_kaggle_data.py
uv run python scripts/install_arc_agi_sdk.py

If steps 1-6 work, your dev environment is ready. Step 7 is only needed if you want to push kernels or run agents against the real ARC games.
arc-agi-3/
|-- agents/ # agent classes (one per file)
| |-- __init__.py # Agent Protocol + GameAction/GameState fallbacks + MockFrame
| |-- agent.py # Local stub of upstream agents.agent.Agent
| |-- random_agent.py # baseline 1: uniform-random over available actions
| |-- greedy_explore_agent.py # baseline 2: empirical change-rate epsilon-greedy
| |-- forge_agent.py # adapter for the verbatim FORGE port
| |-- _forge_v19.py # VENDORED: bit-for-bit FORGE v19 cell #1 - do not edit
| |-- qwen_agent.py # Qwen3.6-35B-A3B vision-language agent (exp004)
| |-- trigger_bfs_agent.py # trigger-aware BFS over state-hash graph (exp005)
| |-- graph_explorer.py # priority-threshold scheduler prototype (exp009)
| |-- graph_explorer_agent.py # GraphExplorerAgent wrapper (not submission-ready)
| |-- goose_agent.py # Stochastic-Goose-style CNN with conv coord head (exp007)
| |-- state_graph.py # state-hash graph + trigger scoring used by trigger_bfs
| `-- frame_segmenter.py # stateless port of dolphin-in-a-coma frame-segmentation (exp008)
|-- experiments/
| |-- EXPERIMENTS.md # tracker of all expNNN folders
| |-- SPEC_4WEEKS.md # 4-week per-day SPEC with kernel slugs + targets + outcomes
| |-- local_runner.py # offline smoke harness (mock + arc-agi SDK fallback)
| |-- exp001_baseline_forge/ # LB 0.19 anchor (vanilla FORGE v19 fork)
| |-- exp002_forge_variance_probe/ # variance probe for the FORGE baseline (LB 0.24)
| |-- exp003_baseline_just_explore/ # orthogonal reference baseline
| |-- exp004_qwen_agent/ # Qwen3.6-35B-A3B vision-language agent (LB 0.00; RTX probe load works)
| |-- exp005_trigger_aware_bfs/ # trigger-aware BFS v0 (LB 0.10)
| |-- exp007_goose_cnn/ # Goose CNN v1 (0.00) -> v2 (0.17)
| |-- exp008_trigger_bfs_seg/ # trigger_bfs + frame-segmenter (LB 0.12)
| |-- exp010_forge_variance_resubmit/ # D16 safety resubmit (PENDING)
| |-- kernel_h100_probe/ # sanity probe of Kaggle's H100 image
| `-- kernel_qwen_bridge_probe/ # probes HF -> Kaggle bridge feasibility
|-- scripts/
| |-- download_kaggle_data.py # pulls competition data via Kaggle API (KGAT_-aware)
| |-- install_arc_agi_sdk.py # offline install of arc-agi + arcengine wheels
| |-- qwen_agent_smoke_local.py # pure-Python QwenAgent smoke (no GPU)
| |-- resubmit_forge.sh # variance-probe helper (Track A in RUNBOOK_D2)
| `-- README.md # script catalogue
|-- research/
| |-- 01_landscape_review.md # LB landscape + top public notebooks + attack ranking
| |-- 02_exa_deep_research_2026-04-29.md # Exa Deep Researcher Pro report
| |-- 03_strategy_and_kaggle_compute_2026-04-29.md
| `-- ash_notebook/ # captured upstream notebook + extracted text
|-- documentation/
| `-- kaggle/ # MHTML mirrors of comp pages + extracted text
|-- .factory/ # canonical project memory (Factory.ai convention)
| |-- plan.md # phased D0..D20+ daily Kaggle roadmap
| |-- memories.md # append-only project decisions and gotchas log
| |-- verify.md # V1..V9 pre-Kaggle-submission checklist
| `-- rules/ # split-by-topic project conventions (7 files)
|-- .github/
| |-- ISSUE_TEMPLATE/ # bug, feature, experiment proposal templates
| |-- workflows/ci.yml # ruff + pytest + smoke runner CI
| |-- dependabot.yml # weekly Python + Actions updates
| `-- PULL_REQUEST_TEMPLATE.md
|-- AGENTS.md # repo briefing (also doubles as Factory.ai context)
|-- CONTRIBUTING.md # how to contribute (read me before opening a PR)
|-- CODE_OF_CONDUCT.md # Contributor Covenant 2.1 (link)
|-- CHANGELOG.md # Keep-a-Changelog dated entries
|-- CITATION.cff # Citation File Format (academic)
|-- LICENSE # Apache 2.0
|-- NOTICE # upstream attribution
|-- pyproject.toml # deps + ruff + pytest config
|-- uv.lock # reproducible lockfile
`-- .pre-commit-config.yaml # ruff + hygiene + actionlint + gitleaks
Folders intentionally not in version control (gitignored):
.venv/, data/, runs/, environment_files/, __pycache__/,
.env, model weights of any kind.
The rhythm of the project is one Kaggle daily slot per day:
- Open .factory/plan.md and pick today's experiment (or pick from experiments/EXPERIMENTS.md).
- Implement under experiments/expNNN_<slug>/ and / or agents/<my_agent>.py.
- Smoke-test locally:

      uv run python experiments/local_runner.py \
          --agent agents.<my_agent>:<MyAgent> \
          --games ls20-mock --max-actions 200 --seed 0

- (Optional, free) push to a Kaggle dev kernel for runtime parity:

      uv run kaggle kernels push -p experiments/expNNN_<slug>/dev_kernel

- Submit on Kaggle (this burns the daily slot; note the CLI subcommand is submit, not submit-code; see gotcha #15):

      uv run kaggle competitions submit arc-prize-2026-arc-agi-3 \
          -k cataluna84/<comp-kernel> -v <N> \
          -f submission.parquet -m "expNNN: <one-liner>"

- After the LB result lands, append a dated section to the top of .factory/memories.md with the score, delta vs 0.19, per-game notes, and the next-step decision.
- Update CHANGELOG.md [Unreleased] with the user-visible change.
For tomorrow's specific runbook, see
experiments/exp004_qwen_agent/RUNBOOK_D2.md.
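
The logging step of the workflow above (a dated section added to the top of the memory file) could be scripted roughly as follows. This is only a sketch: the entry layout, the function names, and the demo values are assumptions for illustration, not the format actually used in .factory/memories.md.

```python
import tempfile
from datetime import date
from pathlib import Path


def format_entry(
    day: str, when: date, score: float, baseline: float, decision: str
) -> str:
    """Render one dated section (hypothetical layout)."""
    delta = round(score - baseline, 2)
    return (
        f"## {day} ({when.isoformat()})\n"
        f"- LB score: {score:.2f}\n"
        f"- Delta vs {baseline:.2f}: {delta:+.2f}\n"
        f"- Next step: {decision}\n\n"
    )


def prepend_entry(log_path: Path, entry: str) -> None:
    """Put the new entry at the top, keeping older entries untouched below."""
    existing = log_path.read_text(encoding="utf-8") if log_path.exists() else ""
    log_path.write_text(entry + existing, encoding="utf-8")


# Demo against a throwaway file; the date is a placeholder, not from the log.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "memories.md"
    log.write_text("## older entry\n", encoding="utf-8")
    entry = format_entry("D15", date(2026, 5, 13), 0.12, 0.19, "illustrative next step")
    prepend_entry(log, entry)
    text = log.read_text(encoding="utf-8")
```

Prepending rather than rewriting keeps earlier sections byte-for-byte intact, which is one way to honour the append-only convention while showing the newest entry first.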
| Agent | File | Approach | Status |
|---|---|---|---|
| RandomAgent | agents/random_agent.py | uniform over available_actions; ACTION6 click is uniform-random | working |
| GreedyExploreAgent | agents/greedy_explore_agent.py | epsilon-greedy on per-action empirical frame-change rate | working |
| ForgeAgent | agents/forge_agent.py | adapter around verbatim FORGE v19 (BFS + ForgeNet CNN) | LB 0.19 baseline; MASTER v7 remix reached 0.21 |
| QwenAgent | agents/qwen_agent.py | vision-language MoE: image + hex grid + history -> ACTION (Qwen3.6-35B-A3B BF16); RTX 6000 Phase-0 probe now loads/generates offline | LB 0.00 (D2); LLM-as-direct-policy is structurally worse than random unless constrained (see gotcha #18) |
| TriggerBFSAgent | agents/trigger_bfs_agent.py | trigger-aware BFS over state-hash graph; ACTION6 click coords come from frame_segmenter 5-tier saliency | LB 0.10 (D4 v0); segmenter prior exp008 = 0.12 |
| GraphExplorerAgent | agents/graph_explorer_agent.py | prototype of the paper's priority-threshold action scheduler with segment-keyed ACTION6 candidates and shortest-path frontier routing | local SDK smoke only; not submission-ready (0/25 mounted games) |
| GooseAgent | agents/goose_agent.py | Stochastic-Goose-style 4-layer CNN with conv coord head, BCE on frame_changed over a 200K hash-dedup buffer | LB 0.00 v1 (D6) → 0.17 v2 (D9) after enable_gpu=false + defensive try/except |
| frame_segmenter (lib) | agents/frame_segmenter.py | stateless port of the dolphin-in-a-coma frame-segmentation algorithm (arXiv:2512.24156, MIT): per-color connected components, 5-tier saliency, status-bar detection | used by TriggerBFSAgent for ACTION6 click coords |
The agent contract is documented at the top of
agents/__init__.py. Adding a new agent is
described in CONTRIBUTING.md.
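
For readers unfamiliar with the approach listed for GreedyExploreAgent, the sketch below shows one way to do epsilon-greedy selection on a per-action empirical frame-change rate. It is not the code in agents/greedy_explore_agent.py: the class name, the string action labels, and the optimistic default for untried actions are all assumptions made for this example.

```python
import random
from collections import defaultdict


class ChangeRateBandit:
    """Epsilon-greedy over per-action empirical frame-change rates."""

    def __init__(self, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self._rng = random.Random(seed)
        self._tries: dict[str, int] = defaultdict(int)
        self._changes: dict[str, int] = defaultdict(int)

    def rate(self, action: str) -> float:
        # Untried actions get an optimistic 1.0 so each one is sampled early.
        tries = self._tries[action]
        return self._changes[action] / tries if tries else 1.0

    def choose(self, available: list[str]) -> str:
        # Explore with probability epsilon, otherwise exploit the best rate.
        if self._rng.random() < self.epsilon:
            return self._rng.choice(available)
        return max(available, key=self.rate)

    def update(self, action: str, frame_changed: bool) -> None:
        self._tries[action] += 1
        self._changes[action] += int(frame_changed)


bandit = ChangeRateBandit(epsilon=0.0)  # epsilon=0 keeps the demo deterministic
for _ in range(3):
    bandit.update("ACTION1", frame_changed=False)
    bandit.update("ACTION2", frame_changed=True)
choice = bandit.choose(["ACTION1", "ACTION2"])
```

In this demo ACTION2 always changed the frame and ACTION1 never did, so the greedy branch picks ACTION2.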
- scripts/download_kaggle_data.py reads .env for KAGGLE_USERNAME / KAGGLE_KEY, auto-detects KGAT_-format tokens and switches to Bearer auth, then downloads the competition data + bundled wheels into data/kaggle/arc-prize-2026-arc-agi-3/.
- scripts/install_arc_agi_sdk.py installs arc-agi + arcengine from those wheels into the venv (offline-friendly).
- experiments/local_runner.py --use-sdk uses the real ARC environment; without --use-sdk, it uses a tiny built-in mock.
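
The credential handling described above (read .env, then pick Bearer auth for KGAT_-prefixed tokens) might look roughly like this. The sketch does not reproduce scripts/download_kaggle_data.py; the helper names are hypothetical, and the HTTP Basic fallback for other keys is an assumption about how non-KGAT credentials are sent.

```python
import base64


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def auth_header(username: str, key: str) -> dict[str, str]:
    """Bearer for KGAT_-prefixed tokens, HTTP Basic otherwise (assumed)."""
    if key.startswith("KGAT_"):
        return {"Authorization": f"Bearer {key}"}
    basic = base64.b64encode(f"{username}:{key}".encode()).decode()
    return {"Authorization": f"Basic {basic}"}


# Demo with made-up credentials; never commit a real .env file.
env = parse_env("# local secrets\nKAGGLE_USERNAME=alice\nKAGGLE_KEY=KGAT_example\n")
header = auth_header(env["KAGGLE_USERNAME"], env["KAGGLE_KEY"])
```

Keeping the format detection in one small function means the rest of a download script would not need to know which token style is in use.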
The hardware constraints we've verified on Kaggle's H100 image
(gcr.io/kaggle-gpu-images/python) are documented in
.factory/memories.md:
- 1 x H100 80 GB HBM3, sm_90 Hopper, FP8 native.
- 31.4 GB system RAM (no large CPU offload viable).
- /kaggle/working only 19.5 GB; /tmp 1.2 TB free.
- transformers 5.0.0, accelerate 1.12.0, torchao 0.10.0, triton 3.6.0 pre-installed; vLLM / SGLang / flash-attn NOT.
| Phase | Days | Target | Δ vs 0.19 | Approach |
|---|---|---|---|---|
| 0 - Foundation | D0..D4 | 0.19-0.30 | +0.00..+0.11 | Anchor on FORGE 0.19, variance probe, local runner + agent zoo |
| 1 - Core search + learning | D5..D7 | 0.30-0.35 | +0.11..+0.16 | Trigger-aware BFS, StochasticGoose CNN, hybrid search-and-learn |
| 2 - Object-centric + WM | D8..D12 | 0.40-0.50 | +0.21..+0.31 | Segmentation+click, MCTS+CNN prior, DreamerV3-lite |
| 3 - TTT, DSL, slot WM | D13..D16 | 0.50-0.55 | +0.31..+0.36 | Test-Time Training, DSL synthesis, slot-attention world model |
| 4 - Composition / ensemble | D17..D20+ | 0.60-0.70+ | +0.41+ | Per-game dispatcher, offline pretraining, LLM orchestrator |
Full breakdown in .factory/plan.md.
This project uses a current (2026) Python tooling stack:
| Tool | Role | Config |
|---|---|---|
| uv | dep + venv manager | pyproject.toml + uv.lock |
| Ruff | lint + format + import sort | [tool.ruff] in pyproject.toml |
| pre-commit | git hooks | .pre-commit-config.yaml |
| pytest | test runner | [tool.pytest.ini_options] in pyproject.toml |
| actionlint | GitHub Actions workflow lint | as a pre-commit hook |
| gitleaks | secret leak detection | as a pre-commit hook |
| GitHub Actions | CI | .github/workflows/ci.yml |
| Dependabot | dep updates | .github/dependabot.yml |
The Ruff ruleset is intentionally broad - E, W, F, I, B,
C4, UP, RUF, SIM, TID, PTH, PERF, A, ARG, S, N,
RET, TCH, ICN, ISC - with pragmatic per-file ignores for
intentional patterns (e.g. /tmp paths on Kaggle, mirrors of upstream
APIs, etc.). The vendored FORGE port (agents/_forge_v19.py) is
excluded from linting since it is a verbatim copy.
Run all the checks locally exactly as CI runs them:
uv run pre-commit run --all-files # ruff + secret-leak + JSON/YAML/TOML
uv run pytest # tests + smoke files
uv run python scripts/qwen_agent_smoke_local.py
uv run python experiments/local_runner.py \
    --agent agents.random_agent:RandomAgent --games ls20-mock --max-actions 30

See CONTRIBUTING.md. TL;DR: fork, branch off main,
run all checks locally, open a PR using the template. By submitting a
contribution you agree to license it under Apache 2.0.
If this work informs your research, please cite using the
CITATION.cff file or the BibTeX below:
@software{bhaskar_arc_agi_3_2026,
author = {Mayank Bhaskar},
title = {arc-agi-3: a Kaggle ARC Prize 2026 lab notebook},
year = {2026},
url = {https://github.com/cataluna84/arc-agi-3},
license = {Apache-2.0}
}

Apache License 2.0 - see also NOTICE for upstream attribution.
Copyright 2026 Mayank Bhaskar (cataluna84).