Claude Code for AI Engineers — Open-Source Preview

This is the open-source preview of Claude Code for AI Engineers, a methodology-first skill pack for engineers building RAG systems, multi-agent workflows, and MCP servers.

The preview includes two of six skills, one of three project templates, and one of five slash commands. The full pack is available on Gumroad: surpradhan.gumroad.com/l/claude-code-for-ai-engineers.

Verified by self-test

This pack has its own eval harness. The harness caught two real bugs in the skills before this README went up — fixes in 9b845a4 (baseline gate enforcement) and b8c0215 (refusal script wording). That's the result that matters; the rest is regression coverage.

Latest run (2026-06-15) on the preview skills:

Triggering on rag-eval-harness (n=30): 15/15 should-fire prompts fired, 0/15 near-misses fired. Small N by design — this catches regressions on the description, not generalization. Prompts.
Behavior (5 scenarios): two-variable comparison pushback, baseline required, fixed dataset required, adversarial-skip-the-gate, deterministic variables locked. All 9 assertions pass. The adversarial scenario explicitly asks Claude to skip the methodology gate. Claude held.
Judge calibration: Haiku and Sonnet agreed on all 9 assertions. This shows agreement on easy cases, not robustness; the calibration check will become more informative as harder scenarios are added.
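The trigger and calibration figures above reduce to a few counts. Below is a minimal Python sketch of how such a summary could be computed. The record shape, the function names, and the sample data are assumptions made for illustration (the data only mirrors the counts reported above); this is not the harness's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerResult:
    """One prompt from a triggering test set (hypothetical record shape)."""
    prompt_id: str
    should_fire: bool  # True for should-fire prompts, False for near-misses
    fired: bool        # whether the skill actually activated


def summarize_triggering(results):
    """Return (hits, n_should_fire, false_fires, n_near_miss)."""
    positives = [r for r in results if r.should_fire]
    negatives = [r for r in results if not r.should_fire]
    hits = sum(r.fired for r in positives)
    false_fires = sum(r.fired for r in negatives)
    return hits, len(positives), false_fires, len(negatives)


def judge_agreement(verdicts_a, verdicts_b):
    """Fraction of assertions on which two judges gave the same verdict."""
    if verdicts_a.keys() != verdicts_b.keys():
        raise ValueError("both judges must score the same assertions")
    if not verdicts_a:
        raise ValueError("no assertions to compare")
    same = sum(verdicts_a[k] == verdicts_b[k] for k in verdicts_a)
    return same / len(verdicts_a)


# Illustrative data only: 15 should-fire prompts that fired,
# 15 near-misses that did not.
results = [TriggerResult(f"fire-{i}", True, True) for i in range(15)]
results += [TriggerResult(f"near-{i}", False, False) for i in range(15)]

hits, n_pos, false_fires, n_neg = summarize_triggering(results)
print(f"{hits}/{n_pos} should-fire prompts fired, "
      f"{false_fires}/{n_neg} near-misses fired")
```

Reporting the two rates separately, rather than a single accuracy number, keeps a description that over-fires distinguishable from one that under-fires.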

Harness: claude-code-for-ai-engineers-eval · full report

What a skill firing looks like

You: "Scaffold a RAG benchmark comparing LangChain BM25 to LlamaIndex hybrid retrieval."

Claude (rag-eval-harness loaded): "Hold on — this varies two things at once: the framework AND the retriever. The delta you measure won't be interpretable. Two options: (1) narrow to one variable (e.g. LlamaIndex BM25 vs LlamaIndex hybrid), or (2) keep both but document the confound explicitly. Which one?"

That's the entire pitch. The skills refuse to scaffold the test that won't answer your question.
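The check behind that pushback can be pictured as a diff over the two experiment configurations. The following Python sketch assumes each arm of the comparison is described by a flat dictionary of settings; the function names and config keys are hypothetical and are not taken from the skill itself.

```python
_MISSING = object()  # sentinel so an absent key differs from an explicit None


def differing_variables(config_a, config_b):
    """Return the sorted names of settings whose values differ."""
    keys = set(config_a) | set(config_b)
    return sorted(
        k for k in keys
        if config_a.get(k, _MISSING) != config_b.get(k, _MISSING)
    )


def check_single_variable(config_a, config_b):
    """Raise ValueError if the comparison varies more than one setting."""
    diff = differing_variables(config_a, config_b)
    if len(diff) > 1:
        raise ValueError(
            "Comparison varies " + " AND ".join(diff)
            + "; narrow it to one variable or document the confound."
        )
    return diff


# The comparison from the example above: framework and retriever both change.
langchain_bm25 = {"framework": "LangChain", "retriever": "BM25"}
llamaindex_hybrid = {"framework": "LlamaIndex", "retriever": "hybrid"}

try:
    check_single_variable(langchain_bm25, llamaindex_hybrid)
except ValueError as err:
    print(f"Refused: {err}")

# Option (1) from the example: hold the framework fixed.
llamaindex_bm25 = {"framework": "LlamaIndex", "retriever": "BM25"}
print(check_single_variable(llamaindex_bm25, llamaindex_hybrid))
```

In this sketch the first call is refused because both `framework` and `retriever` differ, while the narrowed comparison passes and reports `retriever` as the single variable under test.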

What's in this preview (open-source, MIT)

Two skills (skills/)

rag-eval-harness — set up a controlled RAG evaluation harness with the variables locked. Enforces fixing the embedding model, chunker, vector store, and generator so comparisons mean something.
agent-trace-debug — systematic debugging for multi-agent workflows from a captured trace. Refuses to propose fixes without one. Classifies the seven canonical failure modes.

One project template (claude_md_templates/)

multi-agent-project.md — multi-agent systems (LangGraph / CrewAI / AutoGen / OpenAI Agents SDK)

One slash command (slash_commands/)

/agent-debug — analyze a captured agent trace; identify failure mode; propose fix

What's in the full pack (Gumroad)

Everything in this preview, plus:

Four more skills

benchmark-scaffold — scaffold a reproducible benchmark project (configs, runners, results)
mcp-server-bootstrap — scaffold a new MCP server in Python (FastMCP) or TypeScript
eval-report-writer — generate a publication-quality evaluation report with mandatory methodology and "where this is honest" sections
paper-reproduce — set up a paper reproduction scaffold with validation gates

Two more project templates

rag-project.md — RAG projects (retrieval, eval, deployment)
mcp-server-project.md — MCP server projects (tool schemas, transport, testing)

Four more slash commands

/eval-run — run the configured evaluation with current changes
/benchmark-new — scaffold a new architecture inside an existing benchmark project
/mcp-scaffold — add a new tool to an existing MCP server
/experiment-log — capture the current project state into the experiment log

Get the full pack: surpradhan.gumroad.com/l/claude-code-for-ai-engineers

How to use this preview

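# Run from the repository root. Create the target directories first; cp fails if they are missing.
mkdir -p ~/.claude/skills ~/.claude/commands
cp -r skills/* ~/.claude/skills/
cp -r slash_commands/* ~/.claude/commands/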

Then open Claude Code in any project and try one of:

"Compare two RAG retrieval architectures on HotpotQA" — rag-eval-harness activates and refuses if your comparison varies more than one thing at a time.
"My LangGraph agent keeps calling the wrong tool, fix the prompt" — agent-trace-debug refuses to propose a fix without a captured trace and walks you through capturing one.

If those gates feel right, the full pack applies the same discipline to benchmarking, MCP server work, paper reproduction, and evaluation reporting.

Design principles

Fix what you can; vary one thing at a time.
Capture before you debug.
Reproducibility before novelty.
Honesty in reporting.

Each skill enforces at least one of these as a hard gate.
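A hard gate, in this sense, is a precondition that stops the work instead of warning and continuing. As a rough illustration of "Capture before you debug", the Python sketch below refuses to proceed without a captured trace. The exception class, the function names, and the trace format (a list of event dictionaries) are all assumptions for illustration; the skills themselves are instructions to Claude, not Python code.

```python
class GateRefusal(Exception):
    """Raised when a precondition for the requested work is not met."""


def require_trace(trace_events):
    """Hard gate: refuse unless a non-empty captured trace is supplied."""
    if not trace_events:
        raise GateRefusal(
            "No captured trace supplied. Capture one before proposing a fix."
        )
    return trace_events


def propose_fix(symptom, trace_events=None):
    """Start analysis only after the trace gate passes."""
    events = require_trace(trace_events)
    # A real workflow would classify the failure here; this sketch
    # only reports how many events it was given.
    return f"Analyzing {len(events)} trace events for: {symptom}"


try:
    propose_fix("agent keeps calling the wrong tool")
except GateRefusal as refusal:
    print(f"Refused: {refusal}")

# Hypothetical two-event trace; the field names are illustrative.
trace = [
    {"step": 1, "agent": "planner", "tool": "search"},
    {"step": 2, "agent": "planner", "tool": "calculator"},
]
print(propose_fix("agent keeps calling the wrong tool", trace))
```

Raising an exception, not returning a warning string, is what makes the gate hard: the caller cannot reach the analysis step without either supplying a trace or handling the refusal explicitly.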

Support

If a skill misfires or you have questions, open an issue or DM me on LinkedIn.

Surabhi Pradhan publishes reproducible AI benchmarks — rag-benchmark, agent-workflow-comparison, rec-bench, forecasting-showdown — and maintains the open-source agent observability protocol agent-event-protocol. She writes on Medium about LLM internals and developer infrastructure.
