Production-grade AI engineering skills, audits, workflows, benchmarks, and evaluation frameworks.
The open-source AI engineering framework for architecture reviews, security audits, startup validation, competitor analysis, AI systems design, SEO audits, AI code review, technical due diligence, and technical decision-making.
Built for developers, founders, CTOs, technical teams, and AI builders.
Developed by the LetsCookTech Open Source Team.
Default AI answers are often plausible but hard to trust: they skip evidence, confuse guesses with facts, miss release risks, and leave no test path. AI Engineering Arsenal gives an assistant an operational contract: what to inspect, what to prove, what to refuse, how to verify, and what artifact to hand to a human.
The reputation this project is designed to earn:
This framework catches things normal AI misses.
AI Engineering Arsenal is currently a library of cross-model operating skills. The long-term direction is an AI engineering operating layer for routing, evaluation, policy, and lifecycle. The current repository is intentionally honest about what exists today.
- AI code review
- Security audit AI workflows
- Architecture review and system design
- Startup validation
- Competitor analysis
- Technical due diligence
- Engineering playbooks and engineering workflows
- AI evaluation and benchmark design
- AI CTO operating rhythms
- SaaS, Supabase, Next.js, RAG, and AI agent decision-making
These are the category-defining Arsenal skills. Start here if you want practical value instead of another prompt collection.
| Reviewer | What it catches | Best for |
|---|---|---|
| nextjs-production-architecture-reviewer | Next.js architecture, App Router, Server Actions, performance, SEO, AI-search, security, Vercel cost, Supabase integration, and deployment risks. | Next.js SaaS, AI platforms, ecommerce, dashboards, blogs, marketplaces, agency sites. |
| supabase-production-auditor | RLS bypass, service-role misuse, weak auth, storage exposure, Realtime fan-out, database growth, cost, backups, and production-readiness gaps. | Supabase SaaS, AI apps, mobile apps, internal tools, marketplaces, learning platforms. |
| ai-agent-architecture-reviewer | Planning failures, memory risks, tool abuse, MCP risks, prompt injection, hallucination gaps, cost fan-out, observability, and reliability issues. | AI agents, copilots, workflow agents, browser agents, coding agents, research agents, multi-agent systems. |
Each flagship reviewer includes a benchmark rubric and comparison template under benchmarks/ so future claims can be proven with baseline-vs-framework outputs.
| Without a playbook | With an Arsenal playbook |
|---|---|
| "Add authentication and validate inputs." | Maps assets and trust boundaries; reports evidence, preconditions, impact, remediation, regression tests, confidence, and review gaps. |
| "Use a queue and a database." | Compares designs, records assumptions and trade-offs, specifies timeouts/retries/rollback, and names the test that validates the decision. |
| "Build an AI SaaS." | Produces an acceptance contract plus tenancy, authorization, AI-evaluation, cost-cap, migration, observability, release, and rollback gates. |
Read a safe, concrete finding from the synthetic tenant-review case study. It demonstrates an evidence-linked result; it does not claim a benchmark win.
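The fields named in the comparison table above can be pictured as a single record. The sketch below is illustrative only: the `Finding` class, its field names, the `is_reportable` rule, and the example values are assumptions made here, not the repository's actual schema or case-study content.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical shape of one evidence-linked finding (not the repo's schema)."""

    title: str
    evidence: list[str]          # file paths, config keys, or log excerpts inspected
    preconditions: list[str]     # what must be true for the issue to occur
    impact: str
    remediation: str
    regression_tests: list[str]  # tests that would fail if the issue returned
    confidence: str              # assumed scale: "high", "medium", or "low"
    review_gaps: list[str] = field(default_factory=list)  # what was not inspected

    def is_reportable(self) -> bool:
        # Assumed rule: report only findings that cite evidence and name a test.
        return bool(self.evidence) and bool(self.regression_tests)


# Invented example values, for illustration only.
example = Finding(
    title="Tenant filter missing on list query",
    evidence=["(hypothetical) handler source file"],
    preconditions=["caller is authenticated as any tenant"],
    impact="One tenant could read another tenant's records.",
    remediation="Scope the query by the caller's tenant identifier.",
    regression_tests=["cross-tenant read returns no rows"],
    confidence="medium",
    review_gaps=["deployment configuration was not provided"],
)
```

A record like this keeps guesses out of the report: anything without evidence or a regression test stays in `review_gaps` instead of being presented as confirmed.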
| Playbook | Use it when | Proof path |
|---|---|---|
| security-auditor | You need an authorized code, API, infra, or release-risk review. | Case study · Rubric |
| startup-validator | You need to test whether a product should exist before building it. | Case study · Rubric |
| competitor-analyzer | You need positioning based on evidence rather than a feature grid. | Case study · Rubric |
| cto-operating-system | You need a focused operating plan from engineering signals. | Case study · Rubric |
Copy a skill folder into your agent's skills directory, or attach its SKILL.md to the task. Example:
Use $security-auditor to review this authorized SaaS API. Scope: /api/invoices.
Evidence: repository files and deployment configuration attached.
Return only confirmed findings, review gaps, safe remediation, and verification tests.
Works as portable Markdown with Codex, Claude Code/Projects, ChatGPT, Gemini, Cursor, Windsurf, Cline, Roo Code, Aider, and agent SDKs. See compatibility.
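Copying a skill folder can be scripted. The sketch below assumes each skill lives at `skills/<name>/` and contains a `SKILL.md`; the `install_skill` helper and the destination directory are hypothetical, since each agent defines its own skills location.

```python
import shutil
from pathlib import Path


def install_skill(repo_root: Path, skill_name: str, agent_skills_dir: Path) -> Path:
    """Copy one skill folder into an agent's skills directory.

    Assumes the layout skills/<skill_name>/SKILL.md inside the repository;
    agent_skills_dir is whatever directory your agent reads skills from.
    """
    source = repo_root / "skills" / skill_name
    if not (source / "SKILL.md").is_file():
        raise FileNotFoundError(f"No SKILL.md found under {source}")

    target = agent_skills_dir / skill_name
    # dirs_exist_ok=True lets a re-run refresh an existing copy (Python 3.8+).
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

For example, `install_skill(Path("."), "security-auditor", Path("path/to/agent/skills"))` would copy that one playbook, where the destination path is a placeholder to replace with your agent's real directory.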
nextjs-production-architecture-reviewer · supabase-production-auditor · ai-agent-architecture-reviewer · security-auditor · startup-validator · competitor-analyzer · system-architect · database-architect · technical-debt-hunter · ai-search-optimizer · seo-auditor · cost-explosion-detector · cto-operating-system · production-ai-saas-builder
These pages are written for GitHub, Google, and AI-search discoverability while staying useful to developers.
AI Engineering Arsenal does not claim that a playbook finds more issues, saves money, or outperforms a model until a reproducible result is published. Each benchmark records the model/version, tools, temperature, budget, inputs, rubric, evaluator, and limitations, and holds every setting constant between the baseline and the playbook run. Read the benchmark protocol.
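One way to check that a baseline and a playbook run really share the same controlled settings is to compare their recorded configurations before scoring anything. This is a minimal sketch: `RunConfig`, its field names, and `controlled_variable_drift` are assumptions for illustration, not part of the published benchmark protocol.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the settings a benchmark keeps constant."""

    model: str
    model_version: str
    tools: tuple[str, ...]
    temperature: float
    token_budget: int
    inputs_digest: str   # e.g. a hash of the task inputs
    rubric_version: str
    evaluator: str


def controlled_variable_drift(baseline: RunConfig, playbook: RunConfig) -> list[str]:
    """Return the names of settings that differ between the two runs."""
    a, b = asdict(baseline), asdict(playbook)
    return [name for name in a if a[name] != b[name]]


# Placeholder values; none of these come from a real benchmark run.
baseline = RunConfig(
    model="example-model",
    model_version="0",
    tools=("file_reader",),
    temperature=0.0,
    token_budget=20000,
    inputs_digest="placeholder-digest",
    rubric_version="placeholder",
    evaluator="human-reviewer",
)
playbook = replace(baseline, temperature=0.7)

print(controlled_variable_drift(baseline, baseline))  # []
print(controlled_variable_drift(baseline, playbook))  # ['temperature']
```

If the returned list is not empty, the comparison is not like-for-like and the result should not be reported as a playbook effect.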
AI Engineering Arsenal has a repository-level system for improving itself instead of only adding more skills:
| System | Purpose |
|---|---|
| Repository audit | Finds weak assets, filler risk, missing proof, and deletion candidates. |
| Arsenal constitution | Defines the laws every contribution must follow. |
| AI CTO operating model | Standardizes input, research, verification, risk review, decision, and quality review. |
| Evaluation standard | Scores outputs across accuracy, evidence, verification, actionability, security, and user value. |
| Red-team framework | Attacks outputs before users trust them. |
| Benchmark lab | Defines the proof artifacts required before performance claims. |
| Self-evolution roadmap | Moves the project toward a proof engine, runtime adapters, and Open Source AI CTO workflows. |
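The six dimensions named for the evaluation standard could be combined into one score as sketched below. The 0 to 5 scale, the equal weighting, and the `score_output` function are assumptions for illustration; the repository's own rubric may weight or scale the dimensions differently.

```python
from __future__ import annotations

# Dimension names taken from the evaluation standard row in the table above.
DIMENSIONS = (
    "accuracy",
    "evidence",
    "verification",
    "actionability",
    "security",
    "user_value",
)


def score_output(scores: dict[str, int], max_per_dimension: int = 5) -> float:
    """Return a normalized score in [0, 1], assuming equal weights per dimension."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        # Refuse to score partially: an unscored dimension is a review gap.
        raise ValueError(f"Missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= max_per_dimension:
            raise ValueError(f"{d} must be between 0 and {max_per_dimension}")
    total = sum(scores[d] for d in DIMENSIONS)
    return total / (len(DIMENSIONS) * max_per_dimension)
```

Rejecting incomplete score sheets is a deliberate choice in this sketch: it prevents an output from looking strong only because a weak dimension was never scored.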
No. A prompt repository optimizes for copyable text. AI Engineering Arsenal optimizes for evidence, verification, failure detection, benchmarks, and repeatable engineering decisions.
No. The playbooks are Markdown-first and model-portable. They are designed for Claude, ChatGPT, Gemini, Codex, Cursor, Windsurf, Cline, Roo Code, Aider, OpenAI Agents, Anthropic agents, and future AI systems.
Not yet. The repository includes rubrics, synthetic case studies, and a benchmark protocol. Public benchmark wins should only be claimed after raw baseline and framework outputs are published.
A contribution needs a recurring decision problem, an evidence/verification contract, safety boundaries, and a sanitized evaluation case. Generic personas and untested prompt collections do not qualify. Start with CONTRIBUTING.md.
| Path | Purpose |
|---|---|
| skills/ | Portable operating playbooks. |
| case-studies/ | Safe, concrete demonstrations of the output standard. |
| benchmarks/ | Per-playbook rubrics and reproducibility protocol. |
| evals/ | Versioned task fixtures for baseline-versus-playbook runs. |
| docs/ | Product thesis, compatibility, launch, and publishing guidance. |
| templates/ | Proof-pack templates for graduating skills into trusted assets. |
v0.1.0 is a foundation release. Case studies are synthetic demonstrations; public benchmark results are not yet published. That distinction is intentional.
MIT. See LICENSE.
