arc-agi-3

A public lab notebook for the ARC Prize 2026 - ARC-AGI-3 Kaggle Code Competition ($850K prize pool, ends 2026-11-02). One person's daily attack on the leaderboard, with the agents, experiments, tooling, and decision log produced along the way.

Current state (D16, 2026-05-14): 8 scored submissions + 1 pending D16 safety resubmit; LB trace 0.19 / 0.00 / 0.24 / 0.10 / 0.21 / 0.00 / 0.17 / 0.12 / PENDING. Best to date is D3's variance probe at 0.24.

D9 confirmed the silent-crash hypothesis on Goose CNN v1 (0.00 → v2 0.17 with enable_gpu=false plus defensive try/except). D10+D11 ported the dolphin-in-a-coma frame-segmentation algorithm (arXiv:2512.24156, MIT) into agents/frame_segmenter.py and wired it as the ACTION6 click-coord prior in agents/trigger_bfs_agent.py. D15's wired-up submission landed at LB 0.12, so the segmenter prior alone is a marginal +0.02 over trigger-bfs v0 (0.10) but still below Goose v2 (0.17) and master_v7 (0.21). D16's slot was used on a FORGE variance safety resubmit of the best-known completed kernel (ash-s-arc-agi-3-agent v2, prior LB 0.24) while the fresh unchanged v3 rerun remains queued.

Baseline anchor remains LB 0.19 (vanilla fork of an upstream public Kaggle notebook implementing FORGE v19; see NOTICE for upstream credit + paper attributions). Every "delta vs baseline" figure is measured against this 0.19 number. See .factory/memories.md for the running narrative and CHANGELOG.md for user-visible changes.
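To make the "delta vs baseline" convention concrete, the scored trace above can be turned into deltas against the 0.19 anchor. This is an illustrative snippet rather than code from the repo; the variable names are made up for the example.

```python
# Scored LB trace from the status paragraph (the PENDING slot is omitted).
BASELINE = 0.19
lb_trace = [0.19, 0.00, 0.24, 0.10, 0.21, 0.00, 0.17, 0.12]

# Delta of each scored submission against the 0.19 anchor, rounded to 2 decimals.
deltas = [round(score - BASELINE, 2) for score in lb_trace]
best = max(lb_trace)

for score, delta in zip(lb_trace, deltas):
    print(f"LB {score:.2f}  delta {delta:+.2f}")
print(f"best to date: {best:.2f} ({best - BASELINE:+.2f} vs baseline)")
```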

What this repo is (and isn't)

This repo is:

An agent zoo + harness for ARC-AGI-3: a uniform choose_action(frame) -> GameAction contract that any agent (random, search-based, neural, LLM-driven) can conform to.
An offline smoke runner (experiments/local_runner.py) that lets you exercise an agent end-to-end without burning a Kaggle daily submission slot, using either the real arc-agi SDK or a tiny built-in mock environment.
A Kaggle automation toolkit (scripts/): downloads the competition data, installs the SDK from offline wheels, and helps push / track / submit Kaggle kernels.
A research log: every Kaggle submission, every gotcha, every decision is captured in .factory/memories.md (append-only) and surfaced publicly in CHANGELOG.md.
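As a rough picture of what an offline smoke run does, the sketch below drives a choose_action callable against a toy environment under an action budget. ToyFrame, ToyEnv, and run_episode are hypothetical names invented for this example; the repo's actual mock environment and runner may be structured differently.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a game frame: a step counter plus legal actions."""
    step: int
    available_actions: tuple[str, ...] = ("ACTION1", "ACTION2", "ACTION3")


class ToyEnv:
    """Hypothetical mock environment; the episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 10) -> None:
        self.horizon = horizon
        self.step_count = 0

    def reset(self) -> ToyFrame:
        self.step_count = 0
        return ToyFrame(step=0)

    def step(self, action: str) -> tuple[ToyFrame, bool]:
        # The toy env ignores which action was taken; it only counts steps.
        self.step_count += 1
        done = self.step_count >= self.horizon
        return ToyFrame(step=self.step_count), done


def run_episode(choose_action, env: ToyEnv, max_actions: int) -> int:
    """Call choose_action(frame) until the episode ends or the budget runs out."""
    frame = env.reset()
    taken = 0
    while taken < max_actions:
        action = choose_action(frame)
        frame, done = env.step(action)
        taken += 1
        if done:
            break
    return taken


rng = random.Random(0)  # seeded so repeated smoke runs behave identically
actions_taken = run_episode(
    lambda frame: rng.choice(frame.available_actions),
    ToyEnv(horizon=10),
    max_actions=50,
)
print(actions_taken)
```

The point of the budget is that a smoke run always terminates, even when an agent never finishes the game.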

This repo is NOT:

A finished competition entry (we're still climbing the leaderboard).
A model zoo: trained model weights are NOT included; large weights are downloaded at runtime from HuggingFace and bundled as private Kaggle Datasets.
A drop-in pip install library: it's a workspace, not a published package. pyproject.toml has package = false under [tool.uv].

Quickstart

# 1. Get the source
git clone https://github.com/cataluna84/arc-agi-3.git
cd arc-agi-3

# 2. Install uv (https://docs.astral.sh/uv/) if you don't already have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# 3. Create the venv and install deps (including dev: ruff, pre-commit, pytest)
uv venv --python 3.12
uv sync --all-groups

# 4. Install the pre-commit hooks (ruff + secret-leak guards on every commit)
uv run pre-commit install

# 5. Run the offline smoke (no GPU, no Kaggle creds, no model weights)
uv run python scripts/qwen_agent_smoke_local.py
# expect: ALL OK (22/22 checks)

# 6. Run the offline runner against the mock environment
uv run python experiments/local_runner.py \
    --agent agents.random_agent:RandomAgent \
    --games ls20-mock --max-actions 50

# 7. (optional) Configure Kaggle access
cp .env.example .env  # paste KAGGLE_USERNAME and KAGGLE_KEY
uv run python scripts/download_kaggle_data.py
uv run python scripts/install_arc_agi_sdk.py

If steps 1-6 work, your dev environment is ready. Step 7 is only needed if you want to push kernels or run agents against the real ARC games.
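The --agent flag in step 6 takes a module:Class string. One common way to resolve such a string is importlib plus getattr, sketched below. load_agent_class is a hypothetical helper written for this example; how experiments/local_runner.py actually parses the flag is not shown here.

```python
import importlib


def load_agent_class(spec: str) -> type:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{module_name!r} has no attribute {class_name!r}") from exc


# Demonstrated on a stdlib class so the snippet runs outside the repo;
# inside the repo the spec would look like "agents.random_agent:RandomAgent".
cls = load_agent_class("collections:OrderedDict")
print(cls.__name__)
```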

Repository layout

arc-agi-3/
|-- agents/                         # agent classes (one per file)
|   |-- __init__.py                 # Agent Protocol + GameAction/GameState fallbacks + MockFrame
|   |-- agent.py                    # Local stub of upstream agents.agent.Agent
|   |-- random_agent.py             # baseline 1: uniform-random over available actions
|   |-- greedy_explore_agent.py     # baseline 2: empirical change-rate epsilon-greedy
|   |-- forge_agent.py              # adapter for the verbatim FORGE port
|   |-- _forge_v19.py                # VENDORED: bit-for-bit FORGE v19 cell #1 - do not edit
|   |-- qwen_agent.py               # Qwen3.6-35B-A3B vision-language agent (exp004)
|   |-- trigger_bfs_agent.py        # trigger-aware BFS over state-hash graph (exp005)
|   |-- graph_explorer.py           # priority-threshold scheduler prototype (exp009)
|   |-- graph_explorer_agent.py     # GraphExplorerAgent wrapper (not submission-ready)
|   |-- goose_agent.py              # Stochastic-Goose-style CNN with conv coord head (exp007)
|   |-- state_graph.py              # state-hash graph + trigger scoring used by trigger_bfs
|   `-- frame_segmenter.py          # stateless port of dolphin-in-a-coma frame-segmentation (exp008)
|-- experiments/
|   |-- EXPERIMENTS.md              # tracker of all expNNN folders
|   |-- SPEC_4WEEKS.md              # 4-week per-day SPEC with kernel slugs + targets + outcomes
|   |-- local_runner.py             # offline smoke harness (mock + arc-agi SDK fallback)
|   |-- exp001_baseline_forge/      # LB 0.19 anchor (vanilla FORGE v19 fork)
|   |-- exp002_forge_variance_probe/ # variance probe for the FORGE baseline (LB 0.24)
|   |-- exp003_baseline_just_explore/   # orthogonal reference baseline
|   |-- exp004_qwen_agent/          # Qwen3.6-35B-A3B vision-language agent (LB 0.00; RTX probe load works)
|   |-- exp005_trigger_aware_bfs/   # trigger-aware BFS v0 (LB 0.10)
|   |-- exp007_goose_cnn/           # Goose CNN v1 (0.00) -> v2 (0.17)
|   |-- exp008_trigger_bfs_seg/     # trigger_bfs + frame-segmenter (LB 0.12)
|   |-- exp010_forge_variance_resubmit/ # D16 safety resubmit (PENDING)
|   |-- kernel_h100_probe/          # sanity probe of Kaggle's H100 image
|   `-- kernel_qwen_bridge_probe/   # probes HF -> Kaggle bridge feasibility
|-- scripts/
|   |-- download_kaggle_data.py     # pulls competition data via Kaggle API (KGAT_-aware)
|   |-- install_arc_agi_sdk.py      # offline install of arc-agi + arcengine wheels
|   |-- qwen_agent_smoke_local.py   # pure-Python QwenAgent smoke (no GPU)
|   |-- resubmit_forge.sh           # variance-probe helper (Track A in RUNBOOK_D2)
|   `-- README.md                   # script catalogue
|-- research/
|   |-- 01_landscape_review.md      # LB landscape + top public notebooks + attack ranking
|   |-- 02_exa_deep_research_2026-04-29.md   # Exa Deep Researcher Pro report
|   |-- 03_strategy_and_kaggle_compute_2026-04-29.md
|   `-- ash_notebook/               # captured upstream notebook + extracted text
|-- documentation/
|   `-- kaggle/                     # MHTML mirrors of comp pages + extracted text
|-- .factory/                       # canonical project memory (Factory.ai convention)
|   |-- plan.md                     # phased D0..D20+ daily Kaggle roadmap
|   |-- memories.md                 # append-only project decisions and gotchas log
|   |-- verify.md                   # V1..V9 pre-Kaggle-submission checklist
|   `-- rules/                      # split-by-topic project conventions (7 files)
|-- .github/
|   |-- ISSUE_TEMPLATE/             # bug, feature, experiment proposal templates
|   |-- workflows/ci.yml            # ruff + pytest + smoke runner CI
|   |-- dependabot.yml              # weekly Python + Actions updates
|   `-- PULL_REQUEST_TEMPLATE.md
|-- AGENTS.md                       # repo briefing (also doubles as Factory.ai context)
|-- CONTRIBUTING.md                 # how to contribute (read me before opening a PR)
|-- CODE_OF_CONDUCT.md              # Contributor Covenant 2.1 (link)
|-- CHANGELOG.md                    # Keep-a-Changelog dated entries
|-- CITATION.cff                    # Citation File Format (academic)
|-- LICENSE                         # Apache 2.0
|-- NOTICE                          # upstream attribution
|-- pyproject.toml                  # deps + ruff + pytest config
|-- uv.lock                         # reproducible lockfile
`-- .pre-commit-config.yaml         # ruff + hygiene + actionlint + gitleaks

Folders intentionally not in version control (gitignored): .venv/, data/, runs/, environment_files/, __pycache__/, .env, model weights of any kind.

Daily workflow

The rhythm of the project is one Kaggle daily slot per day:

Open .factory/plan.md and pick today's experiment (or pick from experiments/EXPERIMENTS.md).
Implement under experiments/expNNN_<slug>/ and / or agents/<my_agent>.py.

Smoke-test locally:

uv run python experiments/local_runner.py \
    --agent agents.<my_agent>:<MyAgent> \
    --games ls20-mock --max-actions 200 --seed 0

(Optional, free) push to a Kaggle dev kernel for runtime parity:

uv run kaggle kernels push -p experiments/expNNN_<slug>/dev_kernel

Submit on Kaggle (this burns the daily slot; note the CLI subcommand is submit, not submit-code — see gotcha #15):

uv run kaggle competitions submit arc-prize-2026-arc-agi-3 \
    -k cataluna84/<comp-kernel> -v <N> \
    -f submission.parquet -m "expNNN: <one-liner>"

After the LB result lands, append a dated section to the top of .factory/memories.md with the score, delta vs 0.19, per-game notes, and the next-step decision.
Update CHANGELOG.md [Unreleased] with the user-visible change.
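The memory-log step can be scripted. The sketch below formats a dated section and writes it above the existing content of a log file. The heading and field layout are assumptions made for this example, not the format .factory/memories.md actually uses, and the demo writes to a temporary file with placeholder values.

```python
import tempfile
from datetime import date
from pathlib import Path

BASELINE = 0.19  # LB anchor stated in this README


def format_entry(day: date, label: str, score: float, notes: str, decision: str) -> str:
    """Build one dated section; the heading and field layout are assumptions."""
    delta = score - BASELINE
    return (
        f"## {day.isoformat()} {label}\n\n"
        f"- LB score: {score:.2f} (delta vs {BASELINE:.2f}: {delta:+.2f})\n"
        f"- Per-game notes: {notes}\n"
        f"- Next step: {decision}\n\n"
    )


def prepend_entry(path: Path, entry: str) -> None:
    """Write the new entry above the existing log so the newest section is on top."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    path.write_text(entry + existing, encoding="utf-8")


# Demo against a temporary file; all field values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "memories.md"
    log.write_text("## older entry\n", encoding="utf-8")
    entry = format_entry(date(2026, 5, 14), "expNNN", 0.24, "<per-game notes>", "<next step>")
    prepend_entry(log, entry)
    result = log.read_text(encoding="utf-8")
print(result)
```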

For tomorrow's specific runbook, see experiments/exp004_qwen_agent/RUNBOOK_D2.md.

Agents in the box

Agent	File	Approach	Status
`RandomAgent`	`agents/random_agent.py`	uniform over `available_actions`; ACTION6 click is uniform-random	working
`GreedyExploreAgent`	`agents/greedy_explore_agent.py`	epsilon-greedy on per-action empirical frame-change rate	working
`ForgeAgent`	`agents/forge_agent.py`	adapter around verbatim FORGE v19 (BFS + ForgeNet CNN)	LB 0.19 baseline; MASTER v7 remix reached 0.21
`QwenAgent`	`agents/qwen_agent.py`	vision-language MoE: image + hex grid + history -> ACTION (`Qwen3.6-35B-A3B` BF16); RTX 6000 Phase-0 probe now loads/generates offline	LB 0.00 (D2); LLM-as-direct-policy is structurally worse than random unless constrained (see gotcha #18)
`TriggerBFSAgent`	`agents/trigger_bfs_agent.py`	trigger-aware BFS over state-hash graph; ACTION6 click coords come from `frame_segmenter` 5-tier saliency	LB 0.10 (D4 v0); segmenter prior exp008 = 0.12
`GraphExplorerAgent`	`agents/graph_explorer_agent.py`	prototype of the paper's priority-threshold action scheduler with segment-keyed ACTION6 candidates and shortest-path frontier routing	local SDK smoke only; not submission-ready (0/25 mounted games)
`GooseAgent`	`agents/goose_agent.py`	Stochastic-Goose-style 4-layer CNN with conv coord head, BCE on `frame_changed` over a 200K hash-dedup buffer	LB 0.00 v1 (D6) → 0.17 v2 (D9) after `enable_gpu=false` + defensive `try/except`
`frame_segmenter` (lib)	`agents/frame_segmenter.py`	stateless port of the dolphin-in-a-coma frame-segmentation algorithm (arXiv:2512.24156, MIT): per-color connected components, 5-tier saliency, status-bar detection	used by `TriggerBFSAgent` for ACTION6 click coords
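For readers unfamiliar with per-color connected components, here is a minimal sketch of the idea on a small integer grid. It assumes 4-connectivity and picks a click point as the member cell nearest the region centroid; both choices are assumptions made for this example. It does not reproduce the 5-tier saliency or status-bar detection, and it is not the code in agents/frame_segmenter.py.

```python
from collections import deque

Grid = list[list[int]]
Cell = tuple[int, int]  # (row, col)


def color_components(grid: Grid) -> list[tuple[int, list[Cell]]]:
    """Return (color, cells) for each 4-connected region of equal color."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    components: list[tuple[int, list[Cell]]] = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            color = grid[r][c]
            seen[r][c] = True
            queue = deque([(r, c)])
            cells: list[Cell] = []
            while queue:  # breadth-first flood fill over same-colored neighbours
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((color, cells))
    return components


def click_point(cells: list[Cell]) -> tuple[int, int]:
    """Pick the member cell closest to the region centroid, returned as (x, y)."""
    cy = sum(y for y, _ in cells) / len(cells)
    cx = sum(x for _, x in cells) / len(cells)
    y, x = min(cells, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)
    return x, y


grid = [
    [0, 0, 0, 0],
    [0, 3, 3, 0],
    [0, 3, 0, 0],
    [5, 0, 0, 0],
]
components = color_components(grid)
sizes = [(color, len(cells)) for color, cells in components]
print(sizes)
print(click_point(components[1][1]))
```

Choosing a member cell rather than the raw centroid guarantees the click lands inside the region, which matters for L-shaped or hollow objects.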

The agent contract is documented at the top of agents/__init__.py. Adding a new agent is described in CONTRIBUTING.md.
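As a hedged illustration of what a choose_action(frame) -> GameAction contract can look like, the sketch below uses typing.Protocol. The GameAction members, the Frame fields, and UniformAgent are hypothetical stand-ins written for this example; the authoritative definitions are the ones in agents/__init__.py and the SDK.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Protocol, runtime_checkable


class GameAction(Enum):
    """Hypothetical action enum; the real one comes from the SDK or the repo fallback."""
    ACTION1 = 1
    ACTION2 = 2
    ACTION6 = 6  # click action; click coordinates are out of scope for this sketch


@dataclass
class Frame:
    """Hypothetical frame: a color grid plus the currently legal actions."""
    grid: list[list[int]]
    available_actions: list[GameAction]


@runtime_checkable
class Agent(Protocol):
    """Structural contract: anything with a matching choose_action conforms."""

    def choose_action(self, frame: Frame) -> GameAction: ...


class UniformAgent:
    """Toy agent: uniform choice over the legal actions, seeded for reproducibility."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def choose_action(self, frame: Frame) -> GameAction:
        return self._rng.choice(frame.available_actions)


agent = UniformAgent(seed=0)
frame = Frame(grid=[[0]], available_actions=[GameAction.ACTION1, GameAction.ACTION2])
action = agent.choose_action(frame)
print(isinstance(agent, Agent), action in frame.available_actions)
```

A structural Protocol means search-based, neural, and LLM-driven agents do not need a shared base class, only a method with the same shape.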

Kaggle integration

scripts/download_kaggle_data.py reads .env for KAGGLE_USERNAME / KAGGLE_KEY, auto-detects KGAT_-format tokens and switches to Bearer auth, then downloads the competition data + bundled wheels into data/kaggle/arc-prize-2026-arc-agi-3/.
scripts/install_arc_agi_sdk.py installs arc-agi + arcengine from those wheels into the venv (offline-friendly).
experiments/local_runner.py --use-sdk uses the real ARC environment; without --use-sdk, it uses a tiny built-in mock.
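The token auto-detection can be pictured as a prefix check that selects the Authorization header. The sketch below is an assumption about the general shape, not the script's code: parse_env and kaggle_auth_header are hypothetical helpers, and the HTTP Basic fallback for non-KGAT_ keys is assumed rather than taken from this README.

```python
import base64


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env


def kaggle_auth_header(username: str, key: str) -> dict[str, str]:
    """Bearer for KGAT_-prefixed tokens; HTTP Basic otherwise (assumed fallback)."""
    if key.startswith("KGAT_"):
        return {"Authorization": f"Bearer {key}"}
    token = base64.b64encode(f"{username}:{key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


# Dummy credentials for illustration only.
env = parse_env("# local secrets\nKAGGLE_USERNAME=alice\nKAGGLE_KEY=KGAT_abc123\n")
bearer = kaggle_auth_header(env["KAGGLE_USERNAME"], env["KAGGLE_KEY"])
basic = kaggle_auth_header("alice", "legacy-key")
print(bearer["Authorization"].split()[0], basic["Authorization"].split()[0])
```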

The hardware constraints we've verified on Kaggle's H100 image (gcr.io/kaggle-gpu-images/python) are documented in .factory/memories.md:

1 x H100 80 GB HBM3, sm_90 Hopper, FP8 native.
31.4 GB system RAM (no large CPU offload viable).
/kaggle/working only 19.5 GB; /tmp 1.2 TB free.
transformers 5.0.0, accelerate 1.12.0, torchao 0.10.0, triton 3.6.0 pre-installed; vLLM / SGLang / flash-attn NOT.
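A back-of-the-envelope check shows why these limits matter for a BF16 model of the size used by QwenAgent. The sketch assumes roughly 35B total parameters (read off the model name), 2 bytes per parameter, and decimal gigabytes, and it ignores KV cache and activations, so treat the result as a lower bound rather than a measurement.

```python
GB = 1e9  # decimal gigabytes; the listed sizes are taken at face value


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights only (no KV cache, no activations)."""
    return params_billion * 1e9 * bytes_per_param / GB


# Limits copied from the list above; /tmp is 1.2 TB expressed in GB.
LIMITS_GB = {
    "H100 HBM3": 80.0,
    "system RAM": 31.4,
    "/kaggle/working": 19.5,
    "/tmp": 1200.0,
}

bf16 = weights_gb(35, 2)  # assumed ~35B parameters at 2 bytes each
fits = {name: bf16 <= limit for name, limit in LIMITS_GB.items()}

print(f"BF16 weights: about {bf16:.0f} GB")
for name, ok in fits.items():
    print(f"{name}: {'fits' if ok else 'does not fit'}")
```

Under these assumptions the weights fit in GPU memory and on /tmp, but not in system RAM or /kaggle/working, which is consistent with the note that large CPU offload is not viable.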

Phase plan and target scores

Phase	Days	Target	Δ vs 0.19	Approach
0 - Foundation	D0..D4	0.19-0.30	+0.00..+0.11	Anchor on FORGE 0.19, variance probe, local runner + agent zoo
1 - Core search + learning	D5..D7	0.30-0.35	+0.11..+0.16	Trigger-aware BFS, StochasticGoose CNN, hybrid search-and-learn
2 - Object-centric + WM	D8..D12	0.40-0.50	+0.21..+0.31	Segmentation+click, MCTS+CNN prior, DreamerV3-lite
3 - TTT, DSL, slot WM	D13..D16	0.50-0.55	+0.31..+0.36	Test-Time Training, DSL synthesis, slot-attention world model
4 - Composition / ensemble	D17..D20+	0.60-0.70+	+0.41+	Per-game dispatcher, offline pretraining, LLM orchestrator

Full breakdown in .factory/plan.md.

Tooling and code quality

This project uses a current (2026) Python tooling stack:

Tool	Role	Config
uv	dep + venv manager	`pyproject.toml` + `uv.lock`
Ruff	lint + format + import sort	`[tool.ruff]` in `pyproject.toml`
pre-commit	git hooks	`.pre-commit-config.yaml`
pytest	test runner	`[tool.pytest.ini_options]` in `pyproject.toml`
actionlint	GitHub Actions workflow lint	as a pre-commit hook
gitleaks	secret leak detection	as a pre-commit hook
GitHub Actions	CI	`.github/workflows/ci.yml`
Dependabot	dep updates	`.github/dependabot.yml`

The Ruff ruleset is intentionally broad - E, W, F, I, B, C4, UP, RUF, SIM, TID, PTH, PERF, A, ARG, S, N, RET, TCH, ICN, ISC - with pragmatic per-file ignores for intentional patterns (e.g. /tmp paths on Kaggle, mirrors of upstream APIs). The vendored FORGE port (agents/_forge_v19.py) is excluded from linting since it is a verbatim copy.

Run all the checks locally exactly as CI runs them:

uv run pre-commit run --all-files   # ruff + secret-leak + JSON/YAML/TOML
uv run pytest                        # tests + smoke files
uv run python scripts/qwen_agent_smoke_local.py
uv run python experiments/local_runner.py \
    --agent agents.random_agent:RandomAgent --games ls20-mock --max-actions 30

Contributing

See CONTRIBUTING.md. TL;DR: fork, branch off main, run all checks locally, open a PR using the template. By submitting a contribution you agree to license it under Apache 2.0.

Citing

If this work informs your research, please cite using the CITATION.cff file or the BibTeX below:

@software{bhaskar_arc_agi_3_2026,
  author  = {Mayank Bhaskar},
  title   = {arc-agi-3: a Kaggle ARC Prize 2026 lab notebook},
  year    = {2026},
  url     = {https://github.com/cataluna84/arc-agi-3},
  license = {Apache-2.0}
}

License

Apache License 2.0 - see also NOTICE for upstream attribution.
