This is the open-source preview of Claude Code for AI Engineers, a methodology-first skill pack for engineers building RAG systems, multi-agent workflows, and MCP servers.
The preview includes two of six skills, one of three project templates, and one of five slash commands. The full pack is available on Gumroad: surpradhan.gumroad.com/l/claude-code-for-ai-engineers.
This pack has its own eval harness. The harness caught two real bugs in the skills before this README went up — fixes in 9b845a4 (baseline gate enforcement) and b8c0215 (refusal script wording). That's the result that matters; the rest is regression coverage.
Latest run (2026-06-15) on the preview skills:
- Triggering on rag-eval-harness (n=30): 15/15 should-fire prompts fired, 0/15 near-misses fired. Small N by design — this catches regressions on the description, not generalization. Prompts.
- Behavior (5 scenarios): two-variable comparison pushback, baseline required, fixed dataset required, adversarial-skip-the-gate, deterministic variables locked. All 9 assertions pass. The adversarial scenario explicitly asks Claude to skip the methodology gate. Claude held.
- Judge calibration: Haiku and Sonnet agreed on all 9 assertions. Easy-case agreement, not robustness — the calibration value will rise as new scenarios are added.
Harness: claude-code-for-ai-engineers-eval · full report
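The two summary numbers in that run (trigger counts and judge agreement) are simple tallies. The sketch below shows one way such tallies could be computed; it is not the harness's actual code, and `TriggerResult`, `trigger_summary`, and `judge_agreement` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerResult:
    """One prompt from a trigger test set (hypothetical record shape)."""
    prompt_id: str
    should_fire: bool  # True for should-fire prompts, False for near-misses
    fired: bool        # whether the skill actually activated


def trigger_summary(results):
    """Count hits on should-fire prompts and false fires on near-misses."""
    should = [r for r in results if r.should_fire]
    near = [r for r in results if not r.should_fire]
    return {
        "should_fire_hits": sum(r.fired for r in should),
        "should_fire_total": len(should),
        "near_miss_fires": sum(r.fired for r in near),
        "near_miss_total": len(near),
    }


def judge_agreement(judge_a, judge_b):
    """Fraction of assertions on which two judges gave the same verdict.

    Each argument maps an assertion id to a pass/fail boolean.
    """
    if judge_a.keys() != judge_b.keys():
        raise ValueError("judges must score the same assertions")
    if not judge_a:
        raise ValueError("no assertions to compare")
    same = sum(judge_a[key] == judge_b[key] for key in judge_a)
    return same / len(judge_a)
```

A regression check built on this would fail the run whenever `should_fire_hits` drops below `should_fire_total` or `near_miss_fires` rises above zero. Full agreement between judges on easy cases says little about robustness, which is why the agreement rate is only worth tracking as harder scenarios are added.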
You: "Scaffold a RAG benchmark comparing LangChain BM25 to LlamaIndex hybrid retrieval."
Claude (rag-eval-harness loaded): "Hold on — this varies two things at once: the framework AND the retriever. The delta you measure won't be interpretable. Two options: (1) narrow to one variable (e.g. LlamaIndex BM25 vs LlamaIndex hybrid), or (2) keep both but document the confound explicitly. Which one?"
That's the entire pitch. The skills refuse to scaffold the test that won't answer your question.
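The gate in that exchange can be pictured as a diff over two experiment configurations. The following is a minimal sketch under stated assumptions, not the skill's implementation: the config keys, `varied_keys`, `check_comparison`, and `ConfoundedComparison` are all hypothetical names chosen for the example.

```python
_MISSING = object()  # sentinel so a missing key never equals a real value


class ConfoundedComparison(ValueError):
    """Raised when a comparison changes more than one variable at once."""


def varied_keys(config_a, config_b):
    """Return the sorted names of variables that differ between two configs."""
    keys = set(config_a) | set(config_b)
    return sorted(
        key for key in keys
        if config_a.get(key, _MISSING) != config_b.get(key, _MISSING)
    )


def check_comparison(config_a, config_b, confound_note=None):
    """Allow a comparison only if it varies one variable,
    or if the confound is explicitly documented via confound_note."""
    diff = varied_keys(config_a, config_b)
    if not diff:
        raise ValueError("configs are identical; nothing to compare")
    if len(diff) > 1 and not confound_note:
        raise ConfoundedComparison(
            f"comparison varies {len(diff)} variables at once: {', '.join(diff)}"
        )
    return diff


if __name__ == "__main__":
    langchain_bm25 = {"framework": "LangChain", "retriever": "BM25"}
    llamaindex_bm25 = {"framework": "LlamaIndex", "retriever": "BM25"}
    llamaindex_hybrid = {"framework": "LlamaIndex", "retriever": "hybrid"}

    try:
        check_comparison(langchain_bm25, llamaindex_hybrid)
    except ConfoundedComparison as exc:
        print(f"refused: {exc}")

    # Option 1 from the dialogue: narrow to one variable.
    print(check_comparison(llamaindex_bm25, llamaindex_hybrid))  # ['retriever']
```

Passing a `confound_note` corresponds to the second option in the dialogue: the two-variable comparison is allowed, but only with the confound written down.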
Two skills (skills/)
- rag-eval-harness — set up a controlled RAG evaluation harness with the variables locked. Enforces fixing the embedding model, chunker, vector store, and generator so comparisons mean something.
- agent-trace-debug — systematic debugging for multi-agent workflows from a captured trace. Refuses to propose fixes without one. Classifies the seven canonical failure modes.
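One way to picture "variables locked" is an immutable record of the four fixed components plus a fingerprint that every run must share. This is an illustrative sketch only; `LockedVariables`, `fingerprint`, `assert_same_lock`, and the placeholder component names are assumptions made for the example, not part of the skill.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LockedVariables:
    """The components held constant across every run in a comparison."""
    embedding_model: str
    chunker: str
    vector_store: str
    generator: str

    def fingerprint(self):
        """Short stable hash of the locked values, suitable for tagging results."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def assert_same_lock(run_a, run_b):
    """Refuse to compare two runs whose locked variables differ."""
    if run_a.fingerprint() != run_b.fingerprint():
        raise ValueError("runs were produced under different locked variables")


if __name__ == "__main__":
    # Placeholder values, not real model or library names.
    lock = LockedVariables(
        embedding_model="example-embedder-v1",
        chunker="fixed-512-tokens",
        vector_store="example-store",
        generator="example-generator-v1",
    )
    print(lock.fingerprint())
```

Storing the fingerprint next to each result makes it cheap to detect, after the fact, that two numbers being compared came from different setups.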
One project template (claude_md_templates/)
- multi-agent-project.md — multi-agent systems (LangGraph / CrewAI / AutoGen / OpenAI Agents SDK)
One slash command (slash_commands/)
- /agent-debug — analyze a captured agent trace; identify failure mode; propose fix
Everything in this preview, plus:
Four more skills
- benchmark-scaffold — scaffold a reproducible benchmark project (configs, runners, results)
- mcp-server-bootstrap — scaffold a new MCP server in Python (FastMCP) or TypeScript
- eval-report-writer — generate a publication-quality evaluation report with mandatory methodology and "where this is honest" sections
- paper-reproduce — set up a paper reproduction scaffold with validation gates
Two more project templates
- rag-project.md — RAG projects (retrieval, eval, deployment)
- mcp-server-project.md — MCP server projects (tool schemas, transport, testing)
Four more slash commands
- /eval-run — run the configured evaluation with current changes
- /benchmark-new — scaffold a new architecture inside an existing benchmark project
- /mcp-scaffold — add a new tool to an existing MCP server
- /experiment-log — capture the current project state into the experiment log
mkdir -p ~/.claude/skills ~/.claude/commands
cp -r skills/* ~/.claude/skills/
cp -r slash_commands/* ~/.claude/commands/

Then open Claude Code in any project and try one of:
- "Compare two RAG retrieval architectures on HotpotQA" — rag-eval-harness activates and refuses if your comparison varies more than one thing at a time.
- "My LangGraph agent keeps calling the wrong tool, fix the prompt" — agent-trace-debug refuses to propose a fix without a captured trace and walks you through capturing one.
If those gates feel right, the full pack applies the same discipline to benchmarking, MCP server work, paper reproduction, and evaluation reporting.
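For readers who have never captured an agent trace, the sketch below shows roughly what "capture first" could look like in plain Python. It is not the format agent-trace-debug expects and not tied to any agent framework; `TraceRecorder`, `require_trace`, and the event fields are hypothetical and exist only to make the idea concrete.

```python
import functools
import json
import time


class TraceRecorder:
    """Records each call to a wrapped tool function as one event."""

    def __init__(self):
        self.events = []

    def tool(self, fn):
        """Decorator: log the tool name, arguments, and result or error."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            event = {
                "tool": fn.__name__,
                "args": [repr(a) for a in args],  # repr keeps events JSON-safe
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
                "started_at": time.time(),
            }
            try:
                result = fn(*args, **kwargs)
                event["result"] = repr(result)
                return result
            except Exception as exc:
                event["error"] = repr(exc)
                raise
            finally:
                self.events.append(event)  # recorded on success and on failure
        return wrapper

    def to_jsonl(self):
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(event) for event in self.events)


def require_trace(events):
    """Gate: refuse to continue debugging when no trace was captured."""
    if not events:
        raise ValueError("no captured trace; capture one before proposing a fix")
    return events
```

Wrapping each tool with `@recorder.tool` and saving `recorder.to_jsonl()` after a failing run gives a concrete record of which tool was called with which arguments, which is the evidence a "wrong tool" diagnosis needs.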
- Fix what you can; vary one thing at a time.
- Capture before you debug.
- Reproducibility before novelty.
- Honesty in reporting.
Each skill enforces at least one of these as a hard gate.
If a skill misfires or you have questions, open an issue or DM me on LinkedIn.
Surabhi Pradhan publishes reproducible AI benchmarks — rag-benchmark, agent-workflow-comparison, rec-bench, forecasting-showdown — and maintains the open-source agent observability protocol agent-event-protocol. She writes on Medium about LLM internals and developer infrastructure.