The Agentic
Playbook
The AI-native way of working with agents.
Multi-Agent Orchestration · Spec-Driven Development · Guardrails & Enforcement · Way of Working
The Shift
The one thing most developers struggle with when starting out with AI agents: the hard part isn't the technology. The hard part is unlearning twenty years of muscle memory about what it means to build software.
You probably still think your job is to write code. It isn't. Not anymore. Your job is to define what should be built, decide how to verify it was built correctly, and make damn sure the agent doesn't wreck anything along the way. The agent writes code faster than you ever could. But left to its own devices, it'll also cheerfully refactor your authentication layer, introduce three new abstractions nobody asked for, and commit the results to main.
The engineers who are thriving right now have made a specific mental leap. They've stopped thinking in terms of implementation and started thinking in terms of orchestration. They write specs, not functions. They design guardrails, not class hierarchies. They spend 70% of their time on problem definition and 30% on execution, which feels backwards until you realise how good agents are at the execution part, and how catastrophically bad they are at deciding what to execute.
THE CORE INSIGHT
Treat the agent like an incredibly fast junior engineer who never gets tired, never pushes back, and has absolutely no idea whether what you asked for is a good idea.
This playbook won't teach you how to build an agent framework. There are plenty of those guides, and most of them are already outdated. Instead, this is about the way of working: the files you create, the rules you set, the workflows you design. The infrastructure that separates teams shipping real products from teams drowning in AI-generated spaghetti.
The Loop
Every team that reliably ships with agents (Anthropic internally, GitHub, the teams we've worked with) converges on the same loop:
```
INTENT → SPEC → PLAN → IMPLEMENT → VERIFY → REVIEW → SHIP
```

With hard gates between each stage. The agent doesn't advance until checks pass. This isn't a suggestion. It's the pattern that works, discovered independently by basically everyone who's survived long enough to have an opinion.
Looks sequential. Looks slow. It doesn't have to be. With subagents, teams, and parallel worktrees, multiple stages can run simultaneously: a reviewer checking Phase 1 while a builder implements Phase 2 while a security auditor scans Phase 3. The loop stays disciplined. The throughput multiplies. We'll get to that in Chapter 09.
What This Playbook Covers
The examples here use Claude Code because, as of writing, it has the most mature agentic infrastructure: agents, skills, hooks, memory, teams. But the principles are tool-agnostic. Cursor, Copilot, Windsurf, Devin: they're all converging on the same concepts with different filenames. Where it matters, we note the equivalents.
Everything in here comes from production use, ours and others'. If it's in this playbook, it's because someone shipped with it.
Anatomy of an Agentic Project
A codebase isn't just code anymore. It's code plus a layer of configuration that shapes how agents interact with it. Skip this layer, and you'll spend every session re-explaining the same context. Set it up once, and every session, for every team member, starts from a good place.
Here's what a well-structured project looks like:
```
project/
├── CLAUDE.md                 ← The intent layer
├── .claude/
│   ├── settings.json         ← Rules, permissions, hooks
│   ├── settings.local.json   ← Your personal overrides (gitignored)
│   ├── agents/               ← Specialist agents
│   │   ├── reviewer.md
│   │   ├── planner.md
│   │   └── security-auditor.md
│   ├── skills/               ← Reusable workflows
│   │   ├── deploy/
│   │   │   └── SKILL.md
│   │   └── review-pr/
│   │       └── SKILL.md
│   └── rules/                ← Path-scoped instructions
│       ├── frontend.md
│       ├── api.md
│       └── database.md
├── docs/
│   ├── PRD.md                ← What to build and why
│   └── SPEC.md               ← How to build it
└── src/
```

This is the Claude Code layout. If you're using a different agent tool, or want a setup that works across tools, the cross-tool convention is AGENTS.md at the project root and .agents/ for agent definitions. Same ideas, different filenames. Most teams that work across multiple tools maintain both.
None of these files are optional decoration. Each one solves a specific problem you'll hit within the first week of working with agents.
| File / Directory | What it does | In git? |
|---|---|---|
| CLAUDE.md | Agent reads this first, every session | Yes |
| .claude/settings.json | Hooks, permissions, environment | Yes |
| .claude/settings.local.json | Your personal tweaks | No |
| .claude/agents/ | Specialist agent definitions | Yes |
| .claude/skills/ | Workflows you trigger with /name | Yes |
| .claude/rules/ | Instructions scoped to specific paths | Yes |
| docs/PRD.md | The "what" and "why" | Yes |
| docs/SPEC.md | The "how" | Yes |
Configuration Cascades
Settings flow from broad to specific, with each level overriding the one above:
```
Managed policy (org-wide, IT-controlled)
└── User settings (~/.claude/settings.json)
    └── Project settings (.claude/settings.json)
        └── Local settings (.claude/settings.local.json)
            └── CLI arguments (session-only)
```

Your team shares project-level config through git. Your personal quirks stay local. The IT department can enforce policies nobody can override. It's the same cascading model as CSS, and for the same reason: you need defaults that can be overridden without chaos.
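The resolution logic is easy to hold in your head: walk the layers from most specific to least, and the first match wins. A toy sketch of that walk — the keys and values here are illustrative only, and real layers are JSON files rather than key=value strings:

```shell
# Toy resolver for the cascade: most specific layer first, first match wins.
# Keys/values are illustrative, not real Claude Code settings.
resolve() {
  key="$1"; shift
  for layer in "$@"; do
    case "$layer" in
      "$key="*) echo "${layer#*=}"; return 0 ;;
    esac
  done
  return 1
}

managed="telemetry=on"      # org-wide policy
user="model=opus"           # ~/.claude/settings.json
project="model=sonnet"      # .claude/settings.json
local_="model=haiku"        # .claude/settings.local.json

resolve model "$local_" "$project" "$user" "$managed"       # → haiku
resolve telemetry "$local_" "$project" "$user" "$managed"   # → on
```

Local wins for `model` even though three layers define it; `telemetry` falls through to the managed policy because nothing more specific overrides it.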
Cross-Tool Equivalents
Every major AI coding tool has adopted this pattern, just with different filenames:
| Concept | Claude Code | Cursor | Copilot | Codex | Windsurf |
|---|---|---|---|---|---|
| Project instructions | CLAUDE.md | .cursor/rules/*.mdc | .github/copilot-instructions.md | AGENTS.md | .windsurfrules |
| Path-scoped rules | .claude/rules/ | .cursor/rules/ (globs) | — | — | — |
| Cross-tool standard | AGENTS.md | AGENTS.md | AGENTS.md | AGENTS.md | AGENTS.md |
AGENTS.md is the emerging cross-tool standard. If your team uses multiple tools (and most do), maintain both a tool-specific config and an AGENTS.md that covers the common ground.
The bottom line: check all of this into version control. It's infrastructure. When someone clones your repo, the agents should work correctly without a Slack message asking "hey, how do I set up the AI thing?"
CLAUDE.md: The Intent Layer
If you only read one chapter, read this one.
CLAUDE.md is loaded into every session, every agent, every subagent. It is, functionally, the agent's understanding of your project. Get it right and the agent behaves like a team member who's read the onboarding docs. Get it wrong (or skip it entirely) and every conversation starts with the agent guessing what you probably want.
It is not documentation. It is not a README. It's closer to a briefing: short, opinionated, and focused on what the agent needs to know to avoid breaking things. Think less "Wikipedia article" and more "note you'd leave for a contractor before you leave the office."
What belongs in a CLAUDE.md
The commands nobody can guess. The rules your team actually cares about. The files that should never be touched. The way to check if something worked.
```markdown
## Commands
- Build: npm run build
- Test: npm test -- --watchAll=false
- Lint: npm run lint
- Type check: npx tsc --noEmit
- Single test: npm test -- --testPathPattern="filename"

## Stack
Next.js 15 (App Router), React 19, TypeScript 5.7, Tailwind CSS 4
Database: Supabase (PostgreSQL)
Auth: Supabase Auth with RLS policies

## Architecture
- src/app/        : pages and API routes
- src/components/ : React components, colocated with tests
- src/lib/        : shared utilities, database client, types
- src/hooks/      : custom React hooks

## Rules
- Use `unknown` instead of `any`. No exceptions.
- Use early returns. No nesting deeper than 2 levels.
- All database queries go through src/lib/db.ts. Never query directly.
- Error boundaries on every page-level component.
- NEVER modify migration files after they have been committed.

## Verification
After every change, run in this order:
1. npx tsc --noEmit
2. npm test
3. npm run lint
4. npm run build

## Do Not Touch
- .env files: never read, modify, or commit
- src/lib/auth/ : locked, changes need explicit approval
- Database migration files in supabase/migrations/
```

That's roughly 35 lines. That's enough. Anthropic's own production CLAUDE.md files hover around 50 lines. The best ones in the open-source wild rarely exceed 80.
What does NOT belong in a CLAUDE.md
Anything the agent can figure out by reading the code. Standard language conventions (the agent already knows PEP 8 and the Go style guide). Long tutorials. File-by-file codebase descriptions. Platitudes like "write clean, maintainable code."
Here's a good litmus test: for every line, ask "Would removing this cause the agent to make a specific mistake?" If the answer is no, cut it. A 30-line file that gets read and followed will always outperform a 500-line file that gets half-ignored because the important rules are buried somewhere on line 347.
Making it scale
When you need depth beyond those 35 lines, don't bloat the main file. Branch out.
Imports pull in external files without cluttering the root:
```markdown
## API Design
Follow the conventions in @docs/api-conventions.md
```

Subdirectory CLAUDE.md files activate only when the agent works in that directory. A CLAUDE.md inside src/api/ loads when the agent touches API files, stays invisible otherwise.
Path-scoped rules in .claude/rules/ do the same thing with more precision; we'll cover those in the next chapter.
Emphasis works. Words like "IMPORTANT", "YOU MUST", "NEVER" measurably improve adherence. Use them for the rules that actually matter. Overuse them and you get the all-caps-email effect: everything is important, so nothing is.
And treat it like code. Review it when things go wrong. Prune it when rules become obsolete. Let the team contribute. A good CLAUDE.md compounds in value every week.
Rules & Enforcement
Here's something that took us too long to learn: putting a rule in CLAUDE.md is like putting up a "Please Don't Walk on the Grass" sign. Most of the time, people respect it. But when it really matters, when the grass leads to a cliff, you want a fence.
CLAUDE.md is advisory. The agent reads it, generally follows it, and sometimes doesn't. For rules where "sometimes doesn't" is unacceptable, you need enforcement.
Three tiers of enforcement
- **Tier 1: Advisory** — CLAUDE.md, .claude/rules/ files. "The agent should..." Best-effort. Usually works. No guarantees.
- **Tier 2: Deterministic** — Hooks in settings.json. "The system will..." Runs automatically. Cannot be skipped.
- **Tier 3: Infrastructure** — CI/CD, linters, test suites. "The pipeline blocks..." The last line of defence.

If a rule matters, it should live at Tier 2 or 3. Period. Relying solely on Tier 1 for anything critical is hoping for the best, and production systems don't run on hope.
Tier 1: Advisory rules
These live in CLAUDE.md and .claude/rules/ files. Path-scoped rules are particularly useful because they load only when relevant, keeping the main context clean:
```markdown
<!-- .claude/rules/database.md -->
---
globs: src/db/**, supabase/migrations/**
---

# Database Rules
- All migrations must be reversible
- Never modify existing migration files
- Naming convention: YYYYMMDDHHMMSS_description.sql
- Test against local Supabase before committing
```

This rule appears when the agent touches database files. The rest of the time, it's invisible. You can have dozens of rule files without any of them competing for attention.
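The naming convention in that file is still Tier 1: advisory. It's also exactly the kind of rule a Tier 2 hook can make deterministic, because the check itself is trivial. A sketch — the function is ours, not anything Claude Code ships; exit code 2 is the hook "block" signal covered in the next section:

```shell
# Sketch: the migration naming rule as an enforceable check.
# Expects a candidate file path as $1.
valid_migration() {
  case "$(basename "$1")" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_*.sql)
      return 0 ;;
    *)
      echo "migration name must match YYYYMMDDHHMMSS_description.sql" >&2
      return 2 ;;   # "block" in hook terms
  esac
}

valid_migration supabase/migrations/20250101120000_add_users.sql && echo ok   # → ok
```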
Tier 2: Hooks
Hooks are where advisory becomes mandatory. They're scripts that fire at specific points in the agent's lifecycle, and they have teeth: exit code 2 blocks the action entirely, feeding the error message back to Claude as an explanation of what went wrong.
The mechanics are simple:
```
Event fires → Hook runs → Exit code decides

Exit 0: Carry on
Exit 2: Blocked. Stderr sent to Claude as feedback.
Other:  Carry on, but log the stderr.
```

Three hooks that belong in every project:
Don't touch main:
```json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "[ \"$(git branch --show-current)\" != \"main\" ] || { echo 'Create a feature branch first.' >&2; exit 2; }"
      }]
    }]
  }
}
```

Auto-format on save:
```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "npx prettier --write \"$CLAUDE_TOOL_INPUT_FILE_PATH\" 2>/dev/null"
      }]
    }]
  }
}
```

Lock down credentials:
```json
{
  "permissions": {
    "deny": [
      "Read(~/.ssh/**)", "Read(~/.aws/**)", "Read(~/.kube/**)",
      "Bash(rm -rf *)", "Bash(git push --force*)", "Bash(git reset --hard*)"
    ]
  }
}
```

Beyond command hooks, there are three other types: prompt (an LLM makes a judgment call), agent (a full subagent inspects the situation with tool access), and http (POST to an external endpoint for logging or approval). Most teams only need command hooks, but the others exist for when you need nuance.
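Whatever the type, the contract stays the same exit-code protocol. Here it is in miniature, with a plain function standing in for the branch-guard command above — the branch name is passed in as an argument so the sketch runs without a git repo:

```shell
# The hook contract in miniature: exit 0 = proceed, exit 2 = block
# (and stderr goes back to Claude as feedback).
guard_branch() {
  if [ "$1" = "main" ]; then
    echo "Create a feature branch first." >&2
    return 2
  fi
  return 0
}

guard_branch main 2>/dev/null; echo "main → $?"         # → main → 2
guard_branch feature-auth;     echo "feature-auth → $?" # → feature-auth → 0
```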
Key events to know
| Event | When | You'd use it to... |
|---|---|---|
| PreToolUse | Before any tool fires | Block dangerous operations |
| PostToolUse | After a tool succeeds | Auto-format, auto-test |
| UserPromptSubmit | You hit enter | Inject context, route to skills |
| Stop | Agent finishes | Run validation checks |
| SessionStart | Session begins | Set up environment |
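The Stop event deserves a concrete example: a hook that runs your verification checks when the agent declares itself finished, so "done" means "checks pass." A sketch following the same settings.json shape as the examples above — the specific commands are assumptions about your project:

```json
{
  "hooks": {
    "Stop": [{
      "hooks": [{
        "type": "command",
        "command": "npx tsc --noEmit && npm test -- --watchAll=false"
      }]
    }]
  }
}
```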
Tier 3: Infrastructure
The safety net. Even if the agent somehow gets past your hooks, CI catches it. Linters catch style violations. Type checkers catch type errors. Test suites catch broken behaviour. Security scanners catch vulnerabilities. PR reviews, human or automated, catch everything else.
This tier isn't agent-specific. It's just good engineering. But with agents writing the majority of code, it becomes non-negotiable. The agent can't merge what the pipeline won't pass.
Documentation provides hints. Hooks provide rules. CI provides walls. Layer all three.
Skills: Reusable Workflows
If you've ever found yourself typing the same four-paragraph prompt for the third time this week, you need a skill.
Skills are packaged workflows. A markdown file with some metadata that tells the agent what to do, when to do it, and what tools it's allowed to use. You invoke them with a slash (/deploy, /review-pr, /migrate) and the agent executes a multi-step workflow instead of you dictating one from memory.
They replaced the older "commands" system (.claude/commands/), and the upgrade isn't cosmetic. Commands were flat markdown files dumped into the conversation. Skills have frontmatter that controls model selection, tool access, context isolation, and auto-invocation. It's the difference between a recipe card and a cooking robot that follows the recipe.
What a skill looks like
```
.claude/skills/
└── deploy/
    ├── SKILL.md        ← The skill definition
    ├── checklist.md    ← Reference material (loaded on demand)
    └── scripts/
        └── verify.sh
```

The SKILL.md:
```markdown
---
name: deploy
description: Deploy current branch to staging or production.
user-invocable: true
disable-model-invocation: true
allowed-tools: Bash, Read
argument-hint: "[staging|production]"
---

# Deploy Workflow

## Pre-flight
1. Verify all tests pass: `npm test`
2. Verify build succeeds: `npm run build`
3. Confirm current branch is not main
4. Confirm no uncommitted changes

## Deploy
- **staging**: Run `./scripts/deploy.sh staging`
- **production**: Ask for explicit confirmation first

## Post-deploy
1. Run smoke tests: `npm run test:smoke`
2. Report status
```

Two frontmatter fields deserve special attention. disable-model-invocation: true prevents Claude from triggering this skill on its own, because you want a human decision before anything goes to production. And allowed-tools: Bash, Read limits what the skill can access, because a deploy workflow has no business editing source files.
The loading trick
Skills use progressive disclosure to stay out of the way:
- Always present: Just the name and description, a handful of tokens
- When invoked: The full SKILL.md loads
- When needed: Supporting files like checklist.md or examples.md
This means you can have thirty skills in a project and they cost essentially nothing until someone actually uses one. Only the one-line descriptions sit in every conversation.
Dynamic injection
This is where skills get genuinely powerful. The !`command` syntax runs a shell command at invocation time and injects the output:
```markdown
---
name: review-pr
description: Review the current PR
---

## Context
- Branch: !`git branch --show-current`
- Changed files: !`git diff --name-only main...HEAD`
- Diff stats: !`git diff --stat main...HEAD`

## Review for
1. Logic errors and edge cases
2. Missing error handling
3. Test coverage gaps
4. Security implications
```

When you type /review-pr, those backtick commands execute first. The skill arrives in the conversation already loaded with the current branch, the changed files, and the diff summary. No copy-pasting. No "please look at the current branch."
Skills worth building
The ones that pay for themselves fastest:
| Skill | What it saves you |
|---|---|
| /deploy | Pre-flight → deploy → smoke test, every time |
| /review-pr | Consistent review with live diff context |
| /ticket [ID] | Pulls ticket context, starts implementation |
| /release | Version bump → changelog → tag → publish |
| /onboard | Deep-dive into unfamiliar code areas |
| /migrate | Database migration with safety checks |
If you find yourself explaining the same workflow twice, it should be a skill.
Custom Agents: Specialisation
There's a counterintuitive truth about LLMs: give them fewer responsibilities and they perform better at each one. A general-purpose agent asked to "review this code and also maybe fix the tests and update the docs" will do a mediocre job at all three. Three specialised agents (a reviewer, a test writer, a documenter) will each do their part well.
This is the case for custom agents. Not because the model can't multitask, but because a focused brief with restricted tools produces sharper output. A reviewer that literally cannot edit files will never accidentally "fix" the code it was supposed to review.
Defining an agent
Agent definitions are markdown files with YAML frontmatter. Drop them in .claude/agents/ and they're available to the whole team.
```markdown
<!-- .claude/agents/reviewer.md -->
---
name: reviewer
description: Code reviewer. Finds bugs, security issues, missing tests.
tools: Read, Grep, Glob
model: sonnet
maxTurns: 15
---

You are a code reviewer. Your job is to find problems, not fix them.

Be specific: file names, line numbers, concrete examples of what could go wrong.
Prioritise correctness over style. If something works fine and just looks ugly, leave it.
Separate blocking issues from nice-to-haves.

You do NOT rewrite code. You do NOT suggest style changes a linter would catch.
```

Notice what's happening: this agent has Read, Grep, and Glob but no Edit, no Write, no Bash. It physically cannot modify anything. The maxTurns: 15 cap prevents it from spinning forever on a complex review. And the prompt's negative instructions ("You do NOT rewrite code") are as important as the positive ones.
The key fields
| Field | Why it matters |
|---|---|
| name | How you and Claude reference this agent |
| description | Claude reads this to decide when to delegate, so make it precise |
| tools | The most important constraint. Fewer tools = fewer mistakes |
| model | Match the model to the task. Haiku for simple, Opus for architecture |
| maxTurns | Prevent runaway loops. 10-15 for reviews, 20-30 for complex work |
| memory | Give the agent persistent memory across sessions |
| isolation: worktree | The agent works on its own copy of the repo |
| skills | Pre-load specific skills into this agent's context |
Built-in agents
Claude Code ships with three you get for free:
| Agent | Model | Access | Good for |
|---|---|---|---|
| Explore | Haiku | Read-only | Fast codebase search and discovery |
| Plan | Inherited | Read-only | Research during plan mode |
| General-purpose | Inherited | Everything | Complex multi-step tasks |
These fire automatically when Claude decides to delegate. Your custom agents extend the roster.
Agents worth building
| Agent | Why | Key constraint |
|---|---|---|
| reviewer | Consistent code review | Read-only |
| planner | Architecture without accidental changes | Read-only, plan mode |
| security-auditor | Dedicated security lens | Read-only, focused prompt |
| test-writer | Tests written by someone who didn't write the code | Can only write test files |
| documenter | Docs updated without touching source | Can only write .md files |
One agent, one job. Resist the urge to build a Swiss Army knife. A reviewer that also writes tests is a reviewer that sometimes writes tests when it should be reviewing.
MCP: Connecting the Dots
Out of the box, an agent can read files, write files, and run shell commands. That covers a lot of ground. But the moment you need it to check a GitHub issue, query a database, or post to Slack, you need a bridge.
That bridge is MCP (Model Context Protocol). It's an open standard that lets the agent connect to external tools the same way it uses built-in ones. Install a GitHub MCP server and suddenly the agent can search issues, read PRs, post comments, and manage releases. It doesn't feel like an integration. It feels like the agent just... knows how to use GitHub.
How it works in practice
```
You: "Are there any open issues about the login timeout?"
        ↓
Claude → GitHub MCP Server → GitHub API → results
        ↓
Claude: "Found 2 issues: #142 (session expires too early)
         and #187 (remember-me not working)."
```

The agent sees MCP tools alongside its native tools. They show up as mcp__github__search_issues or mcp__postgres__query, structured, typed, and documented. No prompt engineering required.
Setting it up
MCP servers are configured in .mcp.json:
```json
{
  "mcpServers": {
    "github": {
      "type": "stdio",
      "command": "npx",
      "args": ["@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    },
    "postgres": {
      "type": "stdio",
      "command": "mcp-postgres",
      "env": { "DATABASE_URL": "${DATABASE_URL}" }
    }
  }
}
```

Project-level config goes in .mcp.json (checked into git). Personal config goes in ~/.claude/.mcp.json. You can also pass --mcp-config for session-only setups.
When to use MCP vs. CLI
Honest answer: most of the time, the CLI is fine. The agent can already run gh issue list or aws s3 ls through Bash. MCP adds value when you need structured tool interfaces (typed inputs, predictable outputs, less context pollution) or when the CLI output is too verbose and noisy for the context window.
| Service | Just use the CLI | Consider MCP |
|---|---|---|
| GitHub | gh handles most workflows | When you need structured issue/PR data |
| AWS | aws CLI works great | For specific services with complex output |
| Databases | Simple queries via CLI | When you want schema-aware tools |
| Slack | — | Yes, MCP is the way here |
| Figma | — | Yes, MCP required |
Scoping to agents
You can give MCP access to specific agents only, keeping the main conversation uncluttered:
```markdown
---
name: github-ops
description: Handles GitHub issues, PRs, and releases
mcpServers:
  - github
---
```

This agent gets GitHub tools. Others don't. Useful when you have many MCP servers but most conversations only need one or two.
The reality check
MCP is the right idea. The execution isn't there yet. Servers crash, connections drop, tool discovery is flaky, and debugging a broken MCP setup is an exercise in reading logs that tell you nothing useful. The ecosystem is young: most servers are community-maintained, quality varies wildly, and the ones that work great on Monday sometimes don't on Thursday after an update.
This isn't a reason to avoid MCP. It's a reason to be selective. Pick one or two servers that solve a real problem for your workflow, get them stable, and leave the rest alone until the ecosystem matures. The teams that try to wire up eight MCP servers on day one spend more time debugging connections than writing code.
We'll probably be living with these rough edges for a while. The protocol is solid. The tooling around it is catching up.
Memory & Learning
Every new session starts from zero. The agent doesn't remember yesterday's debugging session, the architectural decision you made last week, or the fact that it's the third time it's tried to use Jest when your project uses Vitest. Unless you give it memory.
Claude Code has two memory systems, and understanding the split is important.
What you write vs. what the agent writes
CLAUDE.md is memory you control. It's deterministic: the same instructions, every session, for every team member. This is where your team's conventions, architectural decisions, and non-obvious project context belongs. It's shared through git.
Auto memory is the agent's own notebook. When enabled, Claude saves observations across sessions: build commands it discovered, debugging patterns it found, preferences it learned from your corrections. This is personal and machine-local. Your teammates don't see it.
| | CLAUDE.md | Auto Memory |
|---|---|---|
| Author | You | The agent |
| Contains | Rules, conventions, architecture | Learnings, patterns, quirks |
| Loads | Every session, fully | First 200 lines at start |
| Shared | Yes (git) | No (your machine only) |
How learning actually works
The most interesting part of auto memory isn't the explicit "remember this" but the implicit learning. The agent picks up on corrections:
```
"No, don't use any, use unknown."
  → Stored. Applied in future sessions.

"Perfect, exactly like that."
  → Approach reinforced. Agent leans toward this pattern next time.
```
It's not magic. It's pattern matching on your feedback, saved to a file, and re-injected into future conversations. But it works surprisingly well for building up preferences over time.
Practical tips
Tell it explicitly when something matters. "Remember: we always use Vitest, never Jest." The agent saves it with high confidence.
Review memory periodically. /memory shows you what the agent has saved. Delete anything stale. Outdated memory is worse than no memory, because the agent will confidently act on information that's no longer true.
Custom agents get their own memory. Set memory: project in the agent frontmatter and the reviewer remembers patterns from previous reviews without cluttering the main agent's knowledge.
Auto memory is per machine. If you work from a laptop and a desktop, each has its own memory. CLAUDE.md is the shared mechanism. Auto memory is the personal one.
Multi-Agent Orchestration
A single agent is fine until it isn't. The context window fills up. The task needs both a deep investigation and a careful implementation, but combining them in one session means the investigation artifacts crowd out the implementation. Or you need three things done in parallel and the agent can only do one at a time.
That's when you reach for multiple agents. But the term "multi-agent" gets thrown around loosely, and it actually covers two very different things.
Subagents vs. Agent Teams: the critical distinction
A subagent is a helper. It works for your main agent. You send it on an errand ("go investigate this," "review that diff"), it does the job in its own context window, hands back a summary, and disappears. The main agent stays in control. Think of it as delegating a task to someone in the next room. They come back with an answer. You never left your desk.
An agent team is a crew. Multiple independent Claude Code instances, each with its own full context, working in parallel on different parts of the same problem. They coordinate through a shared task list and can message each other directly, without going through a lead. Think of it as five people in a room, each working on their piece, calling out to each other when they need something.
| | Subagent | Agent Team |
|---|---|---|
| Reports to | The main agent | Each other + shared task list |
| Context | Own window, result summarised back | Fully independent, persistent |
| Communication | One-way: task in, result out | Peer-to-peer messaging |
| Coordination | Main agent manages everything | Self-coordinating |
| Cost | Low (one focused task) | High (N separate sessions) |
| Best for | Investigation, review, focused tasks | Large features, parallel implementation |
Most of the time, subagents are what you want. They're cheap, focused, and keep your main context clean. Agent teams are for the jobs that are genuinely too big for one agent to hold in its head, and where parallelism actually matters.
Three mechanisms, in order of complexity
Subagents
The workhorse. A subagent runs in its own context window, does its job, and reports back. Your main conversation stays clean because the subagent's exploration, dead ends, and intermediate work never touch your context.
```
Main Agent
├── → Explore agent: "Find all files that handle authentication"
├── → Review agent: "Review this diff for security issues"
└── ← Receives summarised results from both
```

Use them for investigation, research, focused analysis, code review, any task where you want the answer but don't need the work to happen in your main session. The main agent decides when to delegate and synthesises the results. Simple, effective, and by far the most common pattern.
Agent Teams
A different beast entirely. Multiple independent Claude Code instances working in parallel with a shared task list and direct messaging between teammates. This is real concurrency: each teammate has its own context, works on its own files, and can coordinate with others without bottlenecking through a lead.
```
Team Lead
├── Teammate A (frontend) ←→ Shared Task List ←→ Teammate B (backend)
└── Teammate C (tests)    ←→ Shared Task List ←→ Teammate D (docs)
```

Worth knowing: Agent Teams is still experimental. You enable it with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS in your settings. It works (the C compiler project proved that), but there are rough edges. Session resumption doesn't restore teammates, task status can lag, and shutdown is sometimes slow. Teammates can't spawn their own teams, and you can't transfer leadership. It's the kind of feature where you should start with research and review tasks to get a feel for it before you throw a full implementation at it.
Practical guidance from teams that have used this: start with 3-5 teammates, give each one clear file ownership to avoid merge conflicts, and aim for 5-6 tasks per teammate. The token cost scales linearly: a 5-agent team costs roughly 5x a single session.
Full setup guide: code.claude.com/docs/en/agent-teams
Worktrees
Git worktrees give each agent its own copy of the repository. Full isolation, so changes can't conflict because each agent works on its own branch.
```shell
claude --worktree feature-auth      # Named worktree
claude --worktree fix-perf --tmux   # Split-pane display
```

Use these when agents need to modify the same files independently, or when you want zero risk of conflict between parallel tasks.
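Under the hood this is plain git worktrees, which you can poke at directly. A self-contained sketch — the temp repo and the branch name are illustrative:

```shell
# What --worktree amounts to in plain git: a second full checkout
# on its own branch, isolated from the first.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init"
git worktree add -q ../wt-auth -b feature-auth   # isolated checkout
git -C ../wt-auth branch --show-current          # → feature-auth
```

Two worktrees share one object database but have separate working directories and separate checked-out branches, which is exactly the isolation parallel agents need.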
Patterns that work
Pipeline. Planner -> Builder -> Reviewer -> Committer. Each stage processes the output of the previous one. Predictable, debuggable, sequential.
Fan-out. An orchestrator decomposes a task into independent subtasks, farms them out to parallel workers, synthesises the results. Good when the subtasks don't depend on each other.
Writer/Reviewer. Two separate sessions. Session A implements. Session B reviews with fresh context, no bias from having written the code. This catches blind spots a single session can't see.
The Interview. Session 1: Claude interviews you about the feature, writing a detailed spec. Session 2: Claude implements from the spec in a clean context. Separates definition from execution, which prevents the most common failure mode: the agent misunderstanding something early and building on it.
The parallelism question
This is the question everyone's really asking: if I run 10 agents in parallel, can I build 10x faster? Can I build the whole app in a day?
Theoretically, yes. If your app breaks down into 20 independent modules with no shared dependencies, you could have 20 agents building them simultaneously. Each one owns its files, writes its tests, commits its work. What used to be a month of sequential development becomes a day of parallel execution.
In practice, perfect parallelism is rare. Here's what limits it:
Dependencies. Module B needs Module A's API to exist before it can call it. The database schema needs to be agreed before anyone writes queries. Authentication needs to work before anything that requires a logged-in user. Most real projects have a dependency graph that limits how much can truly happen at the same time.
Shared files. Two agents editing the same file leads to overwrites. The more agents you run, the more carefully you need to slice the work so nobody steps on anyone else. Worktrees help (each agent gets its own copy of the repo), but merging divergent branches has its own cost.
Review capacity. This is the bottleneck nobody talks about. Ten agents can produce ten PRs in an hour. Can you review ten PRs in an hour? If not, you're just building a queue. And unreviewed code is not shipped code.
Coordination overhead. More agents means more messaging, more task management, more potential for miscommunication. There's a point where adding another agent creates more coordination work than it saves in implementation time.
The realistic version looks more like this: break the project into phases. Within each phase, identify the tasks that are genuinely independent. Parallelise those. Let the agents work. Review and merge. Move to the next phase. You won't get 10x, but 3-5x on the right kind of project is real, and that still turns a month into a week.
The spec is what makes this possible. Without a detailed spec that defines the interfaces between modules, parallel agents will build things that don't fit together. With one, each agent knows exactly what its module should accept and return, even if the other modules don't exist yet.
The numbers
Anthropic stress-tested Agent Teams by having 16 parallel agents build a 100,000-line C compiler that can compile the Linux kernel. It took ~2,000 sessions and ~$20K in API credits. Their multi-agent research system improved quality by 90.2% compared to single-agent, at 15x the token cost.
Those numbers put a real price tag on the trade-off. Subagents are cheap and cover most daily work. Agent teams burn through tokens, but they unlock projects that a single agent can't hold in its head. The five-person, three-month project is the sweet spot: complex enough to justify the cost, parallelisable enough to actually benefit.
The PRD Workflow
You can give an agent perfect tools, flawless memory, and unlimited compute. If you give it a vague prompt, you'll get vague output. The single biggest determinant of quality isn't the model or the infrastructure. It's the spec.
Spec-driven development is not a new idea. But it has become a critical skill in the agentic era, because the cost of a bad spec used to be "a developer asks for clarification." Now the cost is "an agent builds the wrong thing at astonishing speed and you don't notice until it's deeply embedded."
You know that vague Jira ticket, two sentences, no acceptance criteria, and you end up booking a meeting just to figure out what it actually means? Annoying when a human picks it up. Catastrophic when an agent does. The agent won't book a meeting. It'll interpret the ambiguity with full confidence and ship something that technically matches the words but misses the point entirely.
The workflow, step by step
1. INTENT A paragraph or two. What and why.
↓
2. INTERVIEW Claude interviews you. Edge cases,
↓ trade-offs, things you haven't considered.
3. SPEC A structured document with verifiable
↓ acceptance criteria. PRD.md or SPEC.md.
4. PLAN Agent reads the spec. Proposes steps.
↓ You review, adjust, approve.
5. IMPLEMENT Fresh session. Agent executes the plan.
↓
6. VERIFY Tests pass. Checks clear. You review.
↓
7. SHIP
The interview step deserves emphasis. Start with a minimal brief and let Claude ask you questions. It will surface things you haven't thought about: error handling, concurrent access, what happens when the user's session expires mid-upload. Once the interview is done, write the spec together and start a fresh session for implementation. Clean context. Complete specification.
What makes a good spec
# Feature: Image Upload
## Overview
Users can upload profile images. Images are validated, resized,
stored in Supabase Storage, with metadata in the images table.
## Requirements
### Must Have
- [ ] Upload images up to 5MB
- [ ] Resize to max 1200px width
- [ ] Show upload progress
- [ ] Show error state for failures
### Out of Scope
- Video upload
- File type conversion
- CDN integration (separate project)
## Technical Constraints
- Use existing src/lib/storage.ts
- Store images in Supabase Storage
- Max 3 concurrent uploads
- Client-side validation before upload
## Acceptance Criteria
- [ ] All "Must Have" items implemented
- [ ] Unit tests cover upload, resize, and error paths
- [ ] E2E test covers the upload flow
- [ ] npx tsc --noEmit passes
- [ ] All existing tests pass
## Do Not Change
- Authentication flow
- Navigation structure
- Existing image display components
Notice what's happening here: every requirement is checkboxed and verifiable. The "Out of Scope" section prevents scope creep. The "Do Not Change" section protects adjacent code. The acceptance criteria are things a CI pipeline can verify, not vibes like "clean implementation."
Phased specs
For features that take more than a session, split the spec into phases. Each phase has its own objective, requirements, acceptance criteria, and a "Do Not Change" section that locks down previous phases:
## Phase 1: Upload Infrastructure
OBJECTIVE: File storage and upload API
DO NOT CHANGE: Nothing in src/components/
## Phase 2: Upload UI
OBJECTIVE: The upload interface
DO NOT CHANGE: Upload API from Phase 1
## Phase 3: Image Display
OBJECTIVE: Gallery view
DO NOT CHANGE: Upload flow from Phases 1 and 2
Each phase is a commit. Each phase can be verified independently. The "Do Not Change" sections create an expanding protective shell around completed work.
The three-tier boundary system
Write these into every spec:
| Tier | Rule | Examples |
|---|---|---|
| Always | No approval needed | Run tests, format code, lint |
| Ask First | Human reviews | Schema changes, API changes, new dependencies |
| Never | Hard stop | Commit secrets, delete migrations, modify auth |
This isn't just good practice. It's risk management. Without explicit tiers, the agent treats everything with equal confidence, and "equal confidence" means it'll casually drop a migration file with the same energy it uses to fix a typo.
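The tiers map directly onto Claude Code's permission settings. A sketch of the relevant .claude/settings.json fragment; the specific tool patterns are examples, and the exact permission keys should be verified against the current docs:

```json
{
  "permissions": {
    "allow": ["Bash(npm test*)", "Bash(npx prettier*)", "Bash(npx eslint*)"],
    "ask": ["Bash(npm install*)", "Edit(db/migrations/**)"],
    "deny": ["Bash(git push --force*)", "Edit(.env*)", "Edit(src/auth/**)"]
  }
}
```

The "Always" tier goes in allow, "Ask First" in ask, "Never" in deny. Writing the tiers into the spec keeps the agent honest; writing them into settings makes them enforceable.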
Verification at Scale
Here's a number that should keep you up at night: AI-generated code introduces 1.7x as many bugs as human-written code. Seventy-five percent more logic errors per pull request. And the worst part: 66% of developers say the biggest problem isn't code that's obviously broken, it's code that's "almost right, but not quite." It compiles. It runs. The tests pass. And it has a subtle bug in a branch condition that won't surface until production.
When you're shipping 10 PRs a week, manual review covers it. When you're shipping 90 PRs a day (which is where agentic teams end up), manual review becomes the bottleneck that breaks everything. You need verification that scales as fast as generation.
The verification pipeline
Think of it as three layers, each catching what the previous one missed:
Layer 1: Deterministic gates Seconds
Compilation, type checking,
linting, unit tests,
security scanning
Layer 2: AI-powered review Minutes
Agentic code review,
spec compliance checking,
multi-agent parallel review
Layer 3: Continuous validation Background
Property-based testing,
visual regression,
integration/E2E suites
Layer 1 is table stakes; you already have this (or should). Layer 2 is where the agentic era changes the game. Layer 3 is where the most advanced teams are pulling ahead.
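Layer 1 can live in a single CI job. A sketch for a Node/TypeScript project; the script names assume a standard npm setup:

```yaml
# .github/workflows/verify.yml - Layer 1: deterministic gates on every PR
name: verify
on: [pull_request]
jobs:
  gates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx tsc --noEmit               # type checking
      - run: npx eslint .                   # linting
      - run: npm test                       # unit tests
      - run: npm audit --audit-level=high   # basic dependency security scan
```

Every step is deterministic and finishes in seconds to minutes, so there's no excuse for a PR reaching Layer 2 without passing it.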
Spec-first testing
The pattern that works best marries spec-driven development with test-first implementation. The spec defines what should happen. Tests encode the spec into executable checks. Implementation makes the tests pass.
SPEC (what) → TESTS (verify) → CODE (how)
↑ │
└──────── fails? ◄────────────────┘
This can be literal Gherkin if your team uses BDD:
Feature: Image Upload
Scenario: User uploads a valid image
Given a logged-in user on the upload page
When they select a 3MB PNG file
Then the image should be resized to 1200px width
And the upload progress should reach 100%
And the image should appear in the gallery
Scenario: User uploads an oversized file
When they select a 12MB file
Then they should see "File must be under 5MB"
And no upload should be initiated
Or it can be plain acceptance criteria in your spec with corresponding test files. The format matters less than the discipline: tests exist before implementation starts, and they encode what the spec actually says, not what the agent thinks it says.
The strongest version of this pattern uses two separate agents: one writes tests from the spec, another writes code to pass them. The test-writing agent has never seen the implementation, so it can't accidentally mirror its assumptions.
Agentic code review
Anthropic launched their multi-agent Code Review system in March 2026. The architecture is worth understanding because it's a template for how review works at scale. Multiple agents review the same PR in parallel, each examining a different dimension: logic errors, boundary conditions, API misuse, auth flaws, and convention compliance. A final agent aggregates and deduplicates the findings.
The results: 54% of PRs receive substantive review comments (up from 16% with single-agent review), and less than 1% of findings are marked incorrect by developers. Cost is $15-25 per review.
For teams not ready for that, Claude Code Action in GitHub Actions is the simpler entry point. It runs the full Claude Code runtime inside your CI pipeline and can review PRs on open or update.
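A minimal workflow sketch for that entry point. The action's input names have changed between versions, so treat the with: block as an assumption and verify it against the action's README:

```yaml
# .github/workflows/claude-review.yml - AI review on PR open/update (sketch)
name: claude-review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR for logic errors, boundary conditions,
            and convention compliance. Check it against docs/SPEC.md.
```

One file, one secret, and every PR gets a first-pass review before a human looks at it.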
Just-in-Time Tests
Meta published something genuinely new in February 2026: Just-in-Time Tests (JiTTests). The idea breaks a fundamental assumption about testing: that tests live in the codebase.
JiTTests are generated per-diff, on the fly. An LLM reads the PR, understands what changed, infers what could go wrong, and generates a test designed specifically to catch regressions introduced by that exact change. The test runs once. If it fails, the PR needs attention. If it passes, the test is discarded. It never enters the codebase.
This eliminates test maintenance entirely, which, at agentic velocity, becomes a real problem. When agents are changing code faster than test suites can be updated, traditional test maintenance becomes its own bottleneck. JiTTests sidestep it.
We're not saying you should throw away your test suite. The existing tests are your regression safety net. JiTTests are an additional layer that catches the things your existing tests weren't designed to catch, because they were written before the change existed.
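The lifecycle is simple enough to sketch in a few lines of shell. Here generate_jit_test stands in for the LLM call; everything else is the real shape of the loop:

```shell
#!/bin/sh
# Just-in-Time Test lifecycle: generate for this diff, run once, discard.
generate_jit_test() {
  # Real version: send the diff ("$1") to a model and get back a
  # runnable test targeting exactly what this change could break.
  printf '#!/bin/sh\nexit 0\n'
}

diff=$(git diff HEAD 2>/dev/null || true)
jit=$(mktemp)
generate_jit_test "$diff" > "$jit"

if sh "$jit"; then
  echo "jit test passed: discarding"
else
  echo "jit test failed: PR needs human attention"
fi
rm -f "$jit"    # the test never enters the codebase
```

The rm at the end is the whole point: the test is a one-shot gate on this diff, not an asset to maintain.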
What to actually measure
Traditional velocity metrics are breaking down. Story points are meaningless when an agent completes a ticket in twenty minutes. Lines of code is meaningless when the agent writes ten times more code than necessary.
Google's 2025 DORA report added a fifth metric, rework rate: how often teams push unplanned fixes to production. This is the metric that tells you whether AI speed is translating into AI quality or just AI volume.
The metrics that still matter:
| Metric | Why it matters more now |
|---|---|
| Change failure rate | AI code can pass tests but fail in production. Watch this above all else. |
| Rework rate | How often you're fixing what was just shipped. The canary in the coal mine. |
| Lead time (by stage) | Break it down: how long in review? In test? In deploy queue? Find the bottleneck. |
| Deployment frequency | Still useful, but only meaningful paired with failure rate. |
What's meaningless: story points per sprint, velocity charts, lines of code, tickets closed. These measure activity, not outcomes. With agents, you have more activity than you know what to do with. The question is whether any of it matters.
The Death of the Backlog
Something awkward is happening in sprint planning meetings. A developer mentions they used Claude Code over the weekend and the feature that was estimated at 8 story points is already done. The PM opens Jira and realises the board hasn't been updated in three days because nobody's working from it anymore. They're working from a spec file in the repo.
This scene is playing out at companies everywhere. Not because Jira is a bad tool, but because the speed of agentic development has made the traditional ticket-driven workflow feel like filing paperwork after the work is already finished.
What's actually changing
Jira isn't dying. But it's being demoted. It went from "source of truth for all work" to "a place where stakeholders check status," and increasingly, not even that. The problem isn't Jira specifically. It's that ticket-based project management was designed for a world where implementation was the bottleneck. Now implementation is the cheap part. Definition, verification, and review are the bottlenecks.
When a PM can write a spec in markdown and have a running prototype in twenty minutes, the overhead of creating a Jira ticket, estimating story points, assigning it to a sprint, updating the status through five stages, and closing it after merge feels like it belongs to a different era. Because it does.
The spec-as-ticket pattern
The most tangible shift is specs replacing tickets. The PRD or SPEC.md you write isn't just input to the agent. It's the work item itself. It lives in the repo, it's version-controlled, it has acceptance criteria that are machine-verifiable, and when the agent implements it, the commit references the spec file.
GitHub's Spec Kit, AWS Kiro, and Tessl are all building on this idea at different levels of ambition. The lifecycle looks like:
PM writes spec in repo
→ Agent reads spec
→ Agent proposes plan
→ Human approves
→ Agent writes tests from acceptance criteria
→ Agent implements until tests pass
→ CI runs full verification pipeline
→ Human reviews PR
→ Merge
No ticket was created. No status was updated. No sprint was planned. The spec was the ticket, the tests were the acceptance gate, the branch was the status, and CI was the final check.
The PM role isn't dying, it's transforming
This is important to get right, because the hot take is "AI replaces PMs" and the reality is more interesting than that.
What's dying is the administrative layer: creating tickets, grooming backlogs, running status updates, shuffling cards between columns. That work is being automated or eliminated entirely.
What's becoming more important is everything PMs were supposed to be doing instead: understanding users, making trade-off decisions, defining what NOT to build, and writing specs that are precise enough for an agent to implement correctly. The PM who can frame a problem well and write a tight spec is suddenly ten times more productive than before. The PM who was mostly a ticket-shuffler is in trouble.
"Context engineering" is the emerging term for the new core skill. Organising project files, customer feedback, product docs, and existing specs so that agents can reference the right context at the right time. It's less project management and more information architecture.
What happens to sprints
The honest answer: nobody knows yet, and anyone who claims certainty is selling something.
The "Scrum is dead" camp argues that when agents complete features in hours instead of days, two-week sprints are absurd. Features are done before sprint planning can even discuss them. The backlog becomes a graveyard of tickets that were either already completed or no longer relevant.
The "Scrum is evolving" camp argues that sprint boundaries become more important with AI, not less. Without inspection points and Sprint Goals, you get a hyperactive loop of constant iteration with no strategic direction. Faster implementation makes judgment and direction more valuable, not less.
What we're actually seeing in practice:
Estimation is dying. Story points measured human effort. An agent's effort is measured in tokens and compute cost. Teams are moving toward "appetite-based planning," deciding how much time and resources a problem deserves, rather than estimating how long it takes.
Review is the new bottleneck. When individual output surges 98% but review time surges 91%, the constraint has moved from "how fast can we build" to "how fast can we verify." Sprint planning should be organised around review capacity, not implementation capacity.
Continuous flow is winning over time-boxed sprints. Not universally (95% of teams still use some form of Agile). But the teams pushing hardest on agentic workflows are gravitating toward Kanban-style continuous flow with WIP limits, not two-week boxes.
The metrics that matter changed. DORA metrics (deployment frequency, lead time, change failure rate, rework rate) are replacing story points and velocity. Google's 2025 DORA report stopped using the old low/medium/high/elite tiers entirely. Change failure rate is the metric to watch, because shipping faster means nothing if you're shipping more bugs.
The uncomfortable truth
All of this is easier for small teams and individual practitioners than for large organisations. A team of five can switch to spec-driven development in a week. A 500-person engineering org with three years of Jira history, compliance requirements, and cross-team dependencies can't.
The enterprise transformation is still mostly aspirational. The individual practitioner transformation is real and accelerating. This playbook focuses on the latter, because that's where the practices are actually being battle-tested.
Model Selection & Cost
The most expensive model is not always the best model for the job. That sounds obvious. And yet, the most common setup is "Opus for everything," which is like hiring a senior architect to answer the phone.
Match the model to the task
| Model | Good at | Bad at | Cost |
|---|---|---|---|
| Haiku | Classification, routing, commit messages, quick answers | Architecture, complex reasoning | $ |
| Sonnet | Code generation, debugging, reviews, explanations | Simple tasks (overkill), deep architecture (underpowered) | $$ |
| Opus | System design, multi-file refactors, subtle bugs, trade-off analysis | Anything Sonnet handles fine (10-20x the price) | $$$$ |
Most messages in a typical development session are Sonnet-tier. Code generation, file edits, debugging, explanations: Sonnet handles all of this. Opus is reserved for the moments where you genuinely need architectural reasoning or multi-dimensional trade-off analysis. Haiku handles the long tail of simple questions, classifications, and commit messages.
How to apply this
Per-agent: Give your documenter Haiku. Give your reviewer Sonnet. Give your architect Opus.
---
name: documenter
model: haiku
---
Per-skill: A commit message skill doesn't need Opus.
---
name: generate-commit
model: haiku
---
Per-session: When you know you're doing architecture work, start with claude --model opus. For regular implementation, the default Sonnet is fine.
The real cost driver
It's not the model. It's the context window. A conversation at 80% context is dramatically more expensive per message than one at 20%. And performance degrades too: precision drops around 70% context utilisation, hallucinations increase above 85%.
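Back-of-envelope arithmetic makes the point: every message re-sends the whole context as input tokens, so per-message cost scales with how full the window is. The numbers below are assumptions (a 200K-token window, $3 per million input tokens); substitute current pricing:

```shell
#!/bin/sh
# Per-message input cost at different context utilisation.
window=200000    # assumed context window, in tokens
price_mtok=3     # assumed dollars per million input tokens

for pct in 20 50 80; do
  ctx=$((window * pct / 100))
  # every message re-sends the whole context as input tokens
  cents=$((ctx * price_mtok * 100 / 1000000))
  echo "${pct}% full: ~${ctx} input tokens, ~${cents} cents per message"
done
```

Four times the context means four times the per-message input cost, before you account for the quality degradation that sets in at high utilisation.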
This is why /clear between tasks isn't just good hygiene. It's cost management. Subagents save money when the alternative is filling your main context with investigation. Skills with context: fork prevent context bloat.
The /compact command summarises history to free up space when you need to continue a long conversation. Use it before context becomes a problem, not after.
Anti-Patterns & What Not to Do
These are the most common ways agentic workflows fail, collected from production teams, open-source community experience, and our own painful lessons. Each one seemed like a good idea at the time.
The encyclopedia CLAUDE.md
You keep adding rules every time the agent makes a mistake. Six months later it's 400 lines and the agent ignores half of them because the important rules are buried in noise. A CLAUDE.md should be a briefing, not a manual. If it's longer than 200 lines, it's too long.
The kitchen sink session
You start implementing a feature. Then you ask a question about an unrelated module. Then you go back to the feature. The context is now polluted with irrelevant information and the agent's focus is split. One task, one session. /clear between topics. This is the simplest practice and it makes the biggest difference.
If your brain works anything like mine (jumping between ideas, questions popping up mid-flow), keep a scratch file open. Dump every unrelated thought, question, and tangent there instead of firing it at the agent. When you're done with the current task, /clear, and work through the list one by one. Structured input, structured output.
The correction spiral
Agent gets something wrong. You correct it. Still wrong. Correct again. Now you're three corrections deep, the context is full of failed approaches, and the code is worse than when you started. After two failed corrections, stop. /clear. Write a better initial prompt. A fresh start with a clearer spec beats iterating on confusion every time.
Assumption propagation
The agent misunderstands something in your first message and builds an entire feature on that misunderstanding. You don't notice until five commits deep because each individual change looked reasonable. This is the most expensive failure mode: the code works, the tests pass, and the architecture is fundamentally wrong.
Fix: Plan mode. Require the agent to propose a plan before touching code. Read the plan. Catch misunderstandings when they're cheap to fix.
Slop gravity
Without constraints, agents default to verbose, over-engineered solutions. They create class hierarchies where a function would do. They add configuration layers nobody asked for. They scaffold 1,000 lines where 100 would suffice. Early velocity is high, but you're building technical debt at AI speed.
The compound effect is brutal: once there's an over-engineered abstraction in the codebase, the agent treats it as a pattern and applies it everywhere. You wanted a utility function. You got a factory that produces builders that configure strategies that generate utility functions. And now every new feature gets the same treatment because the agent thinks that's how things are done here.
Fix: Explicit instructions in CLAUDE.md. "Prefer simple solutions. Three similar lines are better than a premature abstraction. No frameworks for single-use cases."
The 80% problem
Individual output surges 98%. Code review time surges 91%. The team is shipping faster but shipping more bugs. Nobody fully understands the code anymore because nobody wrote it.
This is the uncomfortable truth of AI-generated code: writing is the easy part. Understanding is the hard part. And when you don't understand the code, you can't review it, you can't debug it, and you definitely can't safely extend it.
Fix: This is where agentic code review (Chapter 11) earns its keep. Let review agents handle the volume (security, correctness, convention compliance) and focus your human attention on the changes that matter architecturally. You can't manually review 90 PRs a day, and you shouldn't try. But someone on the team needs to understand the system well enough to know when the agents are missing the bigger picture. Velocity is not the same thing as productivity.
Sycophantic agreement
The agent never pushes back. It implements whatever you ask, even when the request contradicts existing architecture, introduces a security vulnerability, or is just a bad idea. Models are trained to be helpful, not confrontational.
Fix: Instruct it to push back. "If this request conflicts with existing patterns or seems like a bad idea, say so before implementing." It won't catch everything, but it catches more than you'd expect.
Dead code accumulation
Agents add new code. They rarely clean up what the new code replaces. Old implementations, orphaned comments, unused imports: they pile up. And once they're in the codebase, the agent treats them as context and builds around them.
Fix: "When replacing code, remove the old implementation entirely. No commented-out code." Back it up with linters that catch unused imports and variables.
Putting It All Together
Theory is great. Here's the actual checklist.
Step 1: Setup
mkdir -p .claude/agents .claude/skills .claude/rules docs
Write your CLAUDE.md. Start with 30 lines. Commands at the top, rules in the middle, verification at the bottom. You can always add more. Resist the urge on day one.
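A starter in that shape might look like this; the project name, commands, and rules are illustrative, not prescriptive:

```markdown
# Project: acme-app

## Commands
- Build: npm run build
- Test: npm test
- Typecheck: npx tsc --noEmit
- Lint: npx eslint .

## Rules
- Prefer simple solutions. Three similar lines beat a premature abstraction.
- When replacing code, delete the old implementation. No commented-out code.
- Never modify src/auth/ or database migrations without asking first.

## Verification
- Before declaring any task done: typecheck, lint, run the tests.
```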
Create .claude/settings.json with the basics: deny destructive commands, auto-format on save, block edits on main. These three hooks prevent the most common disasters.
{
"permissions": {
"deny": [
"Bash(rm -rf *)", "Bash(git push --force*)",
"Bash(git reset --hard*)", "Edit(.env*)", "Read(.env*)"
]
},
"hooks": {
"PostToolUse": [{
"matcher": "Edit|Write",
"hooks": [{
"type": "command",
"command": "npx prettier --write \"$CLAUDE_TOOL_INPUT_FILE_PATH\" 2>/dev/null || true"
}]
}]
}
}
Step 2: Agents and skills
Build your first custom agent, a reviewer with read-only access. Build your first skill, probably /review-pr with dynamic diff injection. Use them for a week. Notice what works and what's missing.
Step 3: Specs
Write your first PRD for a real feature. Use the interview pattern: let Claude ask you questions, then produce the spec together. Start a fresh session for implementation. Notice how much smoother it is when the agent has a complete specification instead of a chat conversation to parse.
The ongoing loop
Morning:
Check what's done. Write or refine today's spec.
Plan mode → approve implementation plan.
Working:
One task per session. /clear between topics.
Subagents for investigation. Commit after each unit.
Review:
Fresh session for code review.
Typecheck → test → lint → build.
PR and ship.
The checklist
- CLAUDE.md exists and fits on two screens
- Build, test, lint commands documented
- Critical rules enforced with hooks, not just advisory text
- Sensitive files blocked in deny list
- At least one custom agent (start with a reviewer)
- At least one skill (start with /review-pr)
- Specs live in docs/ with a consistent template
- Verification is automated, not manual
- Everything is in version control
The gap between demo and production is never the model. It's the system around it: the constraints, the checks, the boundaries, the specs. The agent is the engine. Your job is everything else.
One more thing
Everything in this playbook (the folder structure, the CLAUDE.md patterns, the hooks, the skills, the specs, the verification pipelines, the multi-agent orchestration), I built it because I needed it myself. Redshift is built on these exact practices, and I spent a lot of time learning what works and what doesn't.
If setting all of this up sounds like a lot of work, that's because it is. That's why I built Redshift Hub, to handle the infrastructure so you can focus on doing.
redshift.build
The Agentic Playbook · Version 1.0 March 2026
Sources and further reading:
- Anthropic: Best Practices for Claude Code, code.claude.com
- Anthropic: Building Effective Agents, anthropic.com/engineering
- Anthropic: Multi-Agent Research System, anthropic.com/engineering
- Anthropic: Effective Harnesses for Long-Running Agents, anthropic.com/engineering
- GitHub: Spec-Driven Development, github.blog
- GitHub: Spec Kit, github.com/github/spec-kit
- Addy Osmani: The Code Agent Orchestra, addyosmani.com
- Addy Osmani: The 80% Problem, addyo.substack.com
- OpenAI: A Practical Guide to Building Agents, openai.com
- nibzard: The Agentic AI Handbook, nibzard.com
- Trail of Bits: Claude Code Config, github.com/trailofbits
- Sean Goedecke: Ideas in Agentic AI Tooling, seangoedecke.com