surpradhan/claude-code-for-ai-engineers

Claude Code for AI Engineers — Open-Source Preview

This is the open-source preview of Claude Code for AI Engineers, a methodology-first skill pack for engineers building RAG systems, multi-agent workflows, and MCP servers.

The preview includes two of six skills, one of three project templates, and one of five slash commands. The full pack is available on Gumroad: surpradhan.gumroad.com/l/claude-code-for-ai-engineers.


Verified by self-test

This pack has its own eval harness. The harness caught two real bugs in the skills before this README went up — fixes in 9b845a4 (baseline gate enforcement) and b8c0215 (refusal script wording). That's the result that matters; the rest is regression coverage.

Latest run (2026-06-15) on the preview skills:

  • Triggering on rag-eval-harness (n=30): 15/15 should-fire prompts fired, 0/15 near-misses fired. The small N is by design: it catches regressions in the skill description and does not measure generalization. Prompts.
  • Behavior (5 scenarios): two-variable comparison pushback, baseline required, fixed dataset required, adversarial-skip-the-gate, deterministic variables locked. All 9 assertions pass. The adversarial scenario explicitly asks Claude to skip the methodology gate. Claude held.
  • Judge calibration: Haiku and Sonnet agreed on all 9 assertions. This is agreement on easy cases, not evidence of robustness; the calibration becomes more informative as new scenarios are added.
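
The triggering numbers above reduce to two rates: how often the skill fires when it should, and how often it fires on near-misses. The following is a minimal sketch of that arithmetic, not the harness's actual code; `TriggerResult` and `trigger_rates` are hypothetical names, and the result list is simply rebuilt from the counts reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerResult:
    should_fire: bool  # label assigned to the prompt in the prompt set
    fired: bool        # whether the skill actually activated


def trigger_rates(results):
    """Return (recall on should-fire prompts, fire rate on near-misses)."""
    positives = [r for r in results if r.should_fire]
    negatives = [r for r in results if not r.should_fire]
    recall = sum(r.fired for r in positives) / len(positives)
    false_fire_rate = sum(r.fired for r in negatives) / len(negatives)
    return recall, false_fire_rate


# Rebuilt from the reported counts: 15/15 should-fire, 0/15 near-misses.
results = [TriggerResult(True, True)] * 15 + [TriggerResult(False, False)] * 15
recall, false_fire_rate = trigger_rates(results)
print(recall, false_fire_rate)  # 1.0 0.0
```

With 15 prompts per class, a single misfire moves either rate by about 6.7 percentage points, which is why the run is useful as a regression check rather than as a generalization estimate.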

Harness: claude-code-for-ai-engineers-eval · full report

What a skill firing looks like

You: "Scaffold a RAG benchmark comparing LangChain BM25 to LlamaIndex hybrid retrieval."

Claude (rag-eval-harness loaded): "Hold on — this varies two things at once: the framework AND the retriever. The delta you measure won't be interpretable. Two options: (1) narrow to one variable (e.g. LlamaIndex BM25 vs LlamaIndex hybrid), or (2) keep both but document the confound explicitly. Which one?"

That's the entire pitch. The skills refuse to scaffold the test that won't answer your question.
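
The check behind that pushback can be sketched in a few lines. This is an illustration under assumptions, not the skill's implementation: the skill is a prompt-level gate, and `differing_keys`, `check_single_variable`, and the config dictionaries are hypothetical names chosen to mirror the example above.

```python
def differing_keys(config_a, config_b):
    """Return the sorted names of the variables whose values differ."""
    keys = set(config_a) | set(config_b)
    return sorted(k for k in keys if config_a.get(k) != config_b.get(k))


def check_single_variable(config_a, config_b):
    """Allow a comparison only when exactly one variable differs."""
    diff = differing_keys(config_a, config_b)
    if len(diff) == 1:
        return True, f"OK: only '{diff[0]}' varies."
    if not diff:
        return False, "Refuse: configs are identical, nothing to compare."
    return False, "Refuse: " + ", ".join(diff) + " vary together, so the delta is confounded."


# Hypothetical configs mirroring the dialogue above.
langchain_bm25 = {"framework": "LangChain", "retriever": "bm25"}
llamaindex_bm25 = {"framework": "LlamaIndex", "retriever": "bm25"}
llamaindex_hybrid = {"framework": "LlamaIndex", "retriever": "hybrid"}

print(check_single_variable(langchain_bm25, llamaindex_hybrid))
print(check_single_variable(llamaindex_bm25, llamaindex_hybrid))
```

The first comparison is refused because both the framework and the retriever change; the second passes because only the retriever does, which is option (1) in the dialogue.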


What's in this preview (open-source, MIT)

Two skills (skills/)

  • rag-eval-harness — set up a controlled RAG evaluation harness with the variables locked. Enforces fixing the embedding model, chunker, vector store, and generator so comparisons mean something.
  • agent-trace-debug — systematic debugging for multi-agent workflows from a captured trace. Refuses to propose fixes without a trace. Classifies each failure into one of seven canonical failure modes.
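
To make "variables locked" concrete, here is a small sketch of a check that the four components named above are identical across every arm of a comparison. The function name, the dictionary layout, and the component values (`embed-v1`, `fixed-512`, and so on) are placeholders invented for illustration; the skill's real config format may differ.

```python
# The four components the skill description says must stay fixed.
LOCKED = ("embedding_model", "chunker", "vector_store", "generator")


def unlocked_variables(arms):
    """Return the locked variables that are missing or differ across arms."""
    problems = []
    for name in LOCKED:
        values = {arm.get(name) for arm in arms}
        if None in values or len(values) != 1:
            problems.append(name)
    return problems


# Placeholder values; only the retriever is meant to vary.
baseline = {
    "embedding_model": "embed-v1",
    "chunker": "fixed-512",
    "vector_store": "store-a",
    "generator": "gen-a",
    "retriever": "bm25",
}
candidate = {**baseline, "retriever": "hybrid"}
drifted = {**candidate, "chunker": "fixed-256"}

print(unlocked_variables([baseline, candidate]))  # []
print(unlocked_variables([baseline, drifted]))    # ['chunker']
```

An empty list means the comparison is controlled; any name in the list is a component that changed alongside the variable under test.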

One project template (claude_md_templates/)

  • multi-agent-project.md — multi-agent systems (LangGraph / CrewAI / AutoGen / OpenAI Agents SDK)

One slash command (slash_commands/)

  • /agent-debug — analyze a captured agent trace; identify failure mode; propose fix

What's in the full pack (Gumroad)

Everything in this preview, plus:

Four more skills

  • benchmark-scaffold — scaffold a reproducible benchmark project (configs, runners, results)
  • mcp-server-bootstrap — scaffold a new MCP server in Python (FastMCP) or TypeScript
  • eval-report-writer — generate a publication-quality evaluation report with mandatory methodology and "where this is honest" sections
  • paper-reproduce — set up a paper reproduction scaffold with validation gates

Two more project templates

  • rag-project.md — RAG projects (retrieval, eval, deployment)
  • mcp-server-project.md — MCP server projects (tool schemas, transport, testing)

Four more slash commands

  • /eval-run — run the configured evaluation with current changes
  • /benchmark-new — scaffold a new architecture inside an existing benchmark project
  • /mcp-scaffold — add a new tool to an existing MCP server
  • /experiment-log — capture the current project state into the experiment log

Get the full pack →


How to use this preview

mkdir -p ~/.claude/skills ~/.claude/commands
cp -r skills/* ~/.claude/skills/
cp -r slash_commands/* ~/.claude/commands/

Then open Claude Code in any project and try one of:

  • "Compare two RAG retrieval architectures on HotpotQA" — rag-eval-harness activates and refuses if your comparison varies more than one thing at a time.
  • "My LangGraph agent keeps calling the wrong tool, fix the prompt" — agent-trace-debug refuses to propose a fix without a captured trace and walks you through capturing one.
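
The second gate, capture before you debug, amounts to a precondition on the debugging step. A minimal sketch follows, assuming a trace is a list of event dictionaries; the event fields shown are hypothetical and are not the skill's trace schema.

```python
def debug_gate(trace_events):
    """Return (allowed, message); refuse when no trace has been captured."""
    if not trace_events:
        return False, "Refuse: no trace captured. Capture one run first, then debug."
    return True, f"Proceed: {len(trace_events)} trace events to analyse."


# Hypothetical trace shape, for illustration only.
no_trace = []
captured = [
    {"step": 1, "agent": "planner", "tool": "search"},
    {"step": 2, "agent": "executor", "tool": "calculator"},
]

print(debug_gate(no_trace))
print(debug_gate(captured))
```

The point of the gate is ordering: a prompt change proposed without a trace is a guess, so the refusal comes before any fix is suggested.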

If those gates feel right, the full pack applies the same discipline to benchmarking, MCP server work, paper reproduction, and evaluation reporting.


Design principles

  1. Fix what you can; vary one thing at a time.
  2. Capture before you debug.
  3. Reproducibility before novelty.
  4. Honesty in reporting.

Each skill enforces at least one of these as a hard gate.

Support

If a skill misfires or you have questions, open an issue or DM me on LinkedIn.


Surabhi Pradhan publishes reproducible AI benchmarks — rag-benchmark, agent-workflow-comparison, rec-bench, forecasting-showdown — and maintains the open-source agent observability protocol agent-event-protocol. She writes on Medium about LLM internals and developer infrastructure.

About

Open-source preview of *Claude Code for AI Engineers* - a methodology-first skill pack for RAG eval, agent debugging, MCP servers, paper reproduction, and benchmark reporting. Full pack (6 skills, 3 templates, 5 slash commands) at surpradhan.gumroad.com/l/claude-code-for-ai-engineers
