The /feature Workflow: Plan Before You Code, Review Before You Ship

Most people use AI coding agents like autocomplete: type a request, get code back, hope it's right. This works for trivial changes. It fails as soon as a task touches more than one file, because the agent doesn't understand your codebase, doesn't ask questions, and doesn't review its own work. The first features land cleanly. Then the codebase grows, the agent makes assumptions nobody sanity-checked, and the output becomes plausible but wrong.

The fix is structural, not a longer prompt. At Mentiora we run a generative-algorithm (GA) pipeline that produces production code through a divergent-convergent loop: generate N candidate artifacts, score them against pre-declared fitness functions, kill the weak, hybridize the survivors. The pipeline ships multiple vertical-slice features into a single application in parallel, with all the integration tests, schema, and contract modules written by the same worker that built each feature.

A pared-back, open-source version of that pipeline demonstrates the core idea. We call it /feature. Last week, I walked through it at the first Claude Code for Builders: Live Builds in Zurich. The slide deck I used was itself built by an autonomous Claude Code run of this exact pipeline: vertical-slice DAG, council review, fitness-function tests. The deck described the system that built the deck. Recursion all the way down.

The projector never worked. The QR-code printout I had prepared as a backup became the entire delivery channel, and it turned out not to be a backup at all, because the "projector view" and the "audience view" were two renders of the same app. The audience watched the fifteen-minute presentation on their phones through the same integrated app that should have been driving the projector. The app was crash-resilient and load-tested by design, because the persistence layer, the resume overlay, and the dev-time content validator all came from explicit acceptance criteria written into the requirements phase, not bolted on after a venue mishap.

This post is a technical writeup of the /feature workflow — plus an outline of the pipeline it sits inside.

What /feature actually is

/feature is a slash command for Claude Code. It's not a plugin or a framework — just two markdown files dropped into .claude/commands/ in any project.

feature.md — the slash-command prompt. A thin wrapper.
checklist.md — the workflow template. An 8-step structured checklist Claude works through.

Type `/feature add dark mode support` in Claude Code. The slash command then:

Resolves $NAME, $SLUG, $DATE from the description.
Creates .features/$DATE-$SLUG/ in the project.
Copies the checklist template into that directory.
Creates an isolated git worktree (git worktree add) instead of switching branches in the main checkout, so you can run arbitrarily many features in parallel without context-switching, each on its own branch in its own working directory.
Starts working through the checklist, ticking boxes and writing evidence into the file on disk after every step.
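The setup steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the contents of feature.md: the slug rule (first six words, hyphen-joined), the sibling worktree path `../wt-<slug>`, and the `feature/<slug>` branch name are all hypothetical choices made for the example.

```python
import re
import shlex
from datetime import date
from pathlib import PurePosixPath


def slugify(description: str, max_words: int = 6) -> str:
    """Lowercase the description and join its first few words with hyphens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return "-".join(words[:max_words])


def plan_feature(description: str, today: date) -> dict:
    """Resolve $NAME, $SLUG, $DATE and the paths derived from them."""
    slug = slugify(description)
    stamp = today.isoformat()
    feature_dir = PurePosixPath(".features") / f"{stamp}-{slug}"
    # Assumption: the worktree lives in a sibling directory on a new branch.
    worktree = PurePosixPath("..") / f"wt-{slug}"
    command = ["git", "worktree", "add", "-b", f"feature/{slug}", str(worktree)]
    return {
        "name": description,
        "slug": slug,
        "date": stamp,
        "feature_dir": str(feature_dir),
        "worktree_cmd": shlex.join(command),
    }
```

The function only builds the command string; running it is left to the caller, which keeps the naming logic easy to test without touching a repository.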

When the workflow completes, the checklist stays in the repo as a permanent record of what was built, what was decided, and what the reviewers caught.

The eight steps, preceded by a Step 0 assessment:

| Step | Name | What it does |
|---|---|---|
| 0 | Assess | Detect BUILD vs FIX mode; assess size as S, M, or L |
| 1 | Scan the Codebase | Sub-agent (`subagent_type: "Explore"`) finds reusable code, similar patterns, collision risks |
| 2 | Clarify Until Confident | ReAct loop: ask one focused question with concrete options, observe, repeat |
| 3 | Plan | Write `plan.md`. For BUILD: itemized, each tagged `[CODE]` or `[MANUAL]`. For FIX: hypothesis → evidence → verdict |
| 4 | Council Review (Plan) | Three independent sub-agents in parallel review the plan: Security, Quality, Devil's Advocate |
| 5 | Build | Implement the approved plan; run tests and linters |
| 6 | Council Review (Code) | Same three reviewers, now on the actual `git diff` |
| 7 | Create the PR | Push branch, open PR with structured body |
| 8 | Wrap Up | Print summary including what the reviewers caught |

S-size features skip the scan and the plan-review council. Trivial changes don't earn the full ceremony. The size threshold is project-tunable.
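As a rough sketch of that gating, assuming size is judged by the number of files a change touches: the source does not say how size is actually assessed, so `assess_size` and its thresholds are hypothetical stand-ins for whatever a project chooses to tune.

```python
STEPS = {
    0: "Assess",
    1: "Scan the Codebase",
    2: "Clarify Until Confident",
    3: "Plan",
    4: "Council Review (Plan)",
    5: "Build",
    6: "Council Review (Code)",
    7: "Create the PR",
    8: "Wrap Up",
}

# S-size features skip the scan (1) and the plan-review council (4).
SKIPPED_FOR_SMALL = frozenset({1, 4})


def assess_size(files_touched: int, small_max: int = 2, medium_max: int = 8) -> str:
    """Hypothetical sizing rule; the thresholds are the project-tunable part."""
    if files_touched <= small_max:
        return "S"
    if files_touched <= medium_max:
        return "M"
    return "L"


def steps_for(size: str) -> list[int]:
    """Return the step numbers that run for a feature of the given size."""
    if size not in {"S", "M", "L"}:
        raise ValueError(f"unknown size: {size!r}")
    skipped = SKIPPED_FOR_SMALL if size == "S" else frozenset()
    return [number for number in STEPS if number not in skipped]
```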

Three named patterns inside the workflow are worth pulling out, because each one travels independently of the rest.

The Living Checklist

The checklist file is the state. After every step, Claude updates checkboxes and fills evidence fields in checklist.md on disk, then re-reads its own progress before the next step.

This is structurally load-bearing.

Proof-of-completion is the constraint. Every step has explicit evidence fields — Files changed:, Test result:, Security issues:. The agent cannot tick the box without writing the evidence. Silent skipping becomes impossible because the next step's check finds the missing evidence.
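A minimal sketch of such an evidence gate, assuming each step's section contains an `Evidence:` line followed by top-level `- Field:` entries (the layout of the excerpt shown later in this post). The function names are illustrative; in /feature itself the check is an instruction in the checklist, not a script.

```python
import re

# A top-level evidence field looks like "- Files changed: src/app.py".
FIELD = re.compile(r"^- (?P<name>[^:\n]+):[ \t]*(?P<value>.*)$")


def empty_evidence_fields(section: str) -> list[str]:
    """Return the names of evidence fields that have no value and no sub-items."""
    lines = section.splitlines()
    if "Evidence:" not in lines:
        return ["Evidence"]
    empty: list[str] = []
    current, filled = None, False
    for line in lines[lines.index("Evidence:") + 1:]:
        match = FIELD.match(line)
        if match:
            # A new field starts; close out the previous one.
            if current is not None and not filled:
                empty.append(current)
            current = match["name"].strip()
            filled = bool(match["value"].strip())
        elif line.strip() and line[0].isspace() and current is not None:
            # Indented sub-items count as evidence for the current field.
            filled = True
    if current is not None and not filled:
        empty.append(current)
    return empty


def may_tick(section: str) -> bool:
    """A step's box may be ticked only when every evidence field is filled."""
    return not empty_evidence_fields(section)
```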

The checklist survives context resets. Long sessions exceed any model's context window. The checklist file does not. A fresh Claude instance picks up exactly where the last one left off without needing to be re-briefed.

The checklist commits to the repo. Six weeks later, when a new engineer is reading a feature directory under .features/, the trail is intact: what was scanned, which library was chosen over which alternative, what the security reviewer flagged at Step 4, what the rebuttal was, what changed between Step 4 and Step 6.

The pattern generalizes beyond /feature. Any agentic workflow that runs across more than one model call benefits from disk-backed state. In our pipeline, every candidate, every score, every checklist tick goes to disk. Crash anywhere, resume from the last green step.
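One way to get "resume from last green" outside of a markdown checklist is a small state file that is rewritten atomically after each successful step. The sketch below assumes a JSON file with a single `green` list; the file format and function names are assumptions for illustration, not what the pipeline uses.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, path)


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty one on first run."""
    if not path.exists():
        return {"green": []}
    return json.loads(path.read_text(encoding="utf-8"))


def run_steps(path: Path, steps: dict[str, Callable[[], None]]) -> list[str]:
    """Run steps in order, skipping any already recorded as green."""
    state = load_state(path)
    for name, action in steps.items():
        if name in state["green"]:
            continue
        action()  # If this raises, nothing is recorded and the step reruns next time.
        state["green"].append(name)
        save_state(path, state)
    return state["green"]
```

A step is recorded only after it returns, so the worst case after a crash is repeating one step, which is why steps should be safe to rerun.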

The Council Pattern

The most interesting structural choice in /feature is that the agent reviews its own work using independent sub-agents with different mandates, launched in parallel in fresh contexts. Because they do not see each other's reviews, groupthink is engineered out.

The three reviewers are spawned with literal prompts. From checklist.md:

Reviewer 1 — Security:

"Review this implementation plan for security issues: auth gaps, injection risks, data leaks, missing input validation. Be specific — cite plan items and suggest fixes."

Reviewer 2 — Quality:

"Review this plan for code quality: unnecessary complexity, missing error handling at system boundaries, over-engineering, duplication with existing code. Be specific."

Reviewer 3 — Devil's Advocate:

"Find problems. What will break? What edge cases are missed? What's the simplest version that actually works — are we overbuilding? Be direct and specific."

After all three return, the agent:

Fixes valid security and quality findings.
Accepts or rejects each Devil's Advocate point with a one-line reason.
If a critical issue surfaces, re-runs the affected reviewer only.
Presents the updated plan plus findings to the human.
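The fan-out half of the council can be sketched as follows. The `ask` callable is a hypothetical stand-in for "launch a sub-agent in a fresh context and return its review"; the mandates are abbreviated from the prompts quoted above. The property being illustrated is isolation: each reviewer receives only its own mandate plus the artifact, never another reviewer's output.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Abbreviated from the reviewer prompts quoted in the text.
REVIEWERS = {
    "security": "Review this implementation plan for security issues.",
    "quality": "Review this plan for code quality.",
    "devils_advocate": "Find problems. What will break?",
}


def run_council(artifact: str, ask: Callable[[str], str]) -> dict[str, str]:
    """Run all reviewers in parallel and return their reviews keyed by role."""

    def review(item: tuple[str, str]) -> tuple[str, str]:
        name, mandate = item
        # Isolation: the prompt contains one mandate and the artifact, nothing else.
        return name, ask(f"{mandate}\n\n{artifact}")

    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        return dict(pool.map(review, REVIEWERS.items()))
```

The same function serves both passes: the artifact is plan.md at Step 4 and the `git diff` output at Step 6.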

The council runs twice — once on the plan (Step 4), once on the actual git diff (Step 6). Catching a problem at Step 4 costs minutes. Catching it at Step 6 costs an unwound implementation. Catching it after deploy costs an incident.

This is not a substitute for human review. It is a substitute for the AI shipping its first draft. The engineer still reviews the PR. The PR they review has already passed three reviewers with different mandates.

The ReAct Clarification Loop

Instead of asking one round of questions and moving on, the agent runs a think-act-observe loop before planning anything:

Think — What is the biggest remaining uncertainty? (Business logic? Data model? UX? Edge cases?)
Act — Ask the user one focused question, with concrete options. No open-ended prompts.
Observe — Record the answer. Update understanding. Re-assess confidence.
Repeat — Until: "I could write a plan the user would approve without changes."

For trivial tasks, zero questions. For complex features, as many as needed. The agent decides when it has enough to plan.
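The loop's control flow is small enough to sketch. Here `next_question` stands in for the agent's "Think" step and returns `None` once it is confident enough to plan; `ask_user` stands in for the prompt shown to the human. Both callables, the dataclasses, and the `max_rounds` safety cap are assumptions for the example. The workflow itself sets no fixed limit.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Question:
    text: str
    options: list[str]  # Concrete options, no open-ended prompts.


@dataclass
class Understanding:
    answers: dict[str, str] = field(default_factory=dict)


def clarify(
    next_question: Callable[[Understanding], Optional[Question]],
    ask_user: Callable[[Question], str],
    max_rounds: int = 10,
) -> Understanding:
    """Ask one focused question at a time until the agent is confident."""
    understanding = Understanding()
    for _ in range(max_rounds):
        question = next_question(understanding)  # Think: biggest remaining uncertainty
        if question is None:                     # Confident enough to write the plan
            break
        answer = ask_user(question)              # Act: one question, concrete options
        understanding.answers[question.text] = answer  # Observe: record and re-assess
    return understanding
```

For a trivial task, `next_question` returns `None` on the first call and the loop asks nothing.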

This is the part most teams skip and shouldn't. The dominant failure mode of agentic coding is not bad code; it is a correctly implemented wrong specification. The agent confidently builds something nobody asked for because nobody made it ask.

A real run: adding the Q&A panel

The audience-Q&A feature in the companion app was added by typing one prompt in Claude Code: `/feature add a Q&A panel so audience members can submit questions during the talk`. Here is what the resulting checklist file looked like at Step 4 (Council Review of the plan):

## Step 4: Council Review (Plan)

- [x] 3 reviewers launched in parallel (Agent tool, fresh contexts)
- [x] Security findings addressed
- [x] Quality findings addressed
- [x] Devil's advocate dispositioned

Evidence:
- Security issues:
    - HIGH: Rendered question text not HTML-escaped before display
      → Fix: sanitize via DOMPurify in QuestionCard.tsx, add render test
    - MEDIUM: No rate limiting on POST /api/qa-submit
      → Fix: 5/min/IP via Upstash, return 429 with retry-after
    - LOW: Presenter token in query string is loggable
      → Accepted: token is single-event, ephemeral; documented in runbook
- Quality issues:
    - Missing error state for failed submissions (offline / 5xx / 429)
    - Submission button has no loading state — submit-twice race possible
    - No empty state for the question list before first submission
- Devil's advocate — accepted:
    - "What if Haiku is too slow for real-time clustering?"
      → Action: pre-measure p95 latency over 1k synthetic questions.
        Result: 240 ms. Acceptable. Held.
- Devil's advocate — rejected:
    - "Why not let everyone vote on questions?"
      → Out of scope; vote-based ranking adds two days; single-event ROI low.

When the audience tried to prompt-inject the live Q&A box during the talk, the per-question classifier flagged the attempts and dropped them. That classifier exists because the security reviewer asked for it at Step 4, before a single line of UI code had been written.

What this scales into

/feature is the workflow. At Mentiora, it sits inside a larger generative-algorithm pipeline. The shape:

brief → requirements GA → architecture GA → experience design
                                                   │
                                                   ▼
                                          foundation scaffolding
                                                   │
                                                   ▼
                                          vertical-slice feature DAG
                                       ┌───────────┼───────────┐
                                       ▼           ▼           ▼
                                   feature_a   feature_b   feature_c
                                   (worktree)  (worktree)  (worktree)
                                       │           │           │
                                       └───────────┼───────────┘
                                                   ▼
                                          integration tests
                                                   │
                                                   ▼
                                          final verification

Each stage is divergent then convergent: generate N candidate artifacts, score them against pre-declared fitness functions, kill the weak, hybridize the survivors, repeat until the field converges. The fitness function changes by stage:

Requirements: an anti-AI-review LLM hunts for hand-wavy acceptance criteria and demands measurable outcomes.
Architecture: a judge ensemble scores candidate decompositions against project-specific lenses — boundary clarity, deployability, extensibility.
Implementation: the test suite is the fitness function. A feature lands when its acceptance-criteria tests, the integration wire-boundary checks, and the render-proof all pass. Council review catches what the tests miss.
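Stripped of the stage-specific scoring, the divergent-convergent loop has a familiar shape. The sketch below is a generic toy version under stated assumptions: a fixed number of generations stands in for "until the field converges", and `fitness` and `crossover` are caller-supplied placeholders for the LLM judges and hybridization steps described above. It is not the pipeline's runtime, which is not open-sourced.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def evolve(
    population: Sequence[T],
    fitness: Callable[[T], float],
    crossover: Callable[[T, T, random.Random], T],
    rng: random.Random,
    survivors: int = 2,
    generations: int = 5,
) -> T:
    """Score candidates, keep the best, hybridize them, and repeat."""
    if survivors < 2 or survivors > len(population):
        raise ValueError("need at least two survivors, and no more than the population")
    candidates = list(population)
    for _ in range(generations):
        ranked = sorted(candidates, key=fitness, reverse=True)
        elite = ranked[:survivors]  # Kill the weak.
        children = [
            crossover(*rng.sample(elite, 2), rng)  # Hybridize two distinct survivors.
            for _ in range(len(candidates) - survivors)
        ]
        candidates = elite + children  # Survivors carry over unchanged.
    return max(candidates, key=fitness)
```

Because the survivors carry over unchanged, the best score in the field never drops from one generation to the next.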

The deck app — the one the audience read off their phones when the projector died — was produced by this pipeline end to end. Five subsystems (deck-engine, companion, content, components, lib), each pinned by a contract module that declared its public surface before any worker started writing code. Multiple vertical-slice features merged into main via squash. Six hundred and seventy-nine passing tests at lock — component, reducer, transport, schema, API route, end-to-end. Every test written by the worker that built the feature, not bolted on afterward.

Four framework-level ideas from the pipeline travel cleanly, even without the full runtime:

1. Fitness function before code. Acceptance criteria are written measurably — so an LLM can decide pass/fail — before any worker is given them. Tests are the convergence signal, not a chore that happens after.

2. Vertical slices over horizontal layers. Each feature owns its schema change, its UI, its types, and its tests. No front-end PR / back-end PR split. If the schema needs a field, the same worker adds the field and the renderer.

3. Disk-backed state. Every candidate, every score, every checklist tick goes to disk. Crash anywhere, resume from last green. The same pattern that powers the Living Checklist inside /feature powers the pipeline as a whole.

4. Two views, one click. The projector / companion split in the deck framework (same beat, forked render) is more than a UX trick. It is a pipeline pattern: design-time artifacts live alongside runtime code in the same repo, so a worker reviewing a feature can see both at once. Specs and the rendered result share a tree.
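The first idea, fitness function before code, is the easiest to try in isolation. A minimal sketch, assuming acceptance criteria can be expressed as predicates over a dictionary of measurements: the criterion names, the 500 ms threshold, and the measurement keys are hypothetical examples, not criteria from the deck project.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[Mapping[str, float]], bool]  # Decides pass/fail from measurements.


# Hypothetical criteria, written before any implementation exists.
CRITERIA = [
    Criterion("p95 latency under 500 ms", lambda m: m["p95_latency_ms"] < 500),
    Criterion("sixth request in a minute gets 429", lambda m: m["status_after_limit"] == 429),
]


def evaluate(criteria: Sequence[Criterion], measurements: Mapping[str, float]) -> dict[str, bool]:
    """Return a pass/fail verdict per criterion."""
    return {criterion.name: criterion.check(measurements) for criterion in criteria}


def converged(results: Mapping[str, bool]) -> bool:
    """A feature lands only when every criterion passes."""
    return all(results.values())
```

A vague criterion such as "the panel should feel fast" cannot be written as a `check`, which is the point: anything that cannot be expressed this way is not yet measurable.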

The pipeline runtime itself — the GA scoring infrastructure, the worker spawn / retreat / merge protocol, the council review templates, the per-phase checklists — is not open-sourced. /feature is the surface-level workflow that captures the spirit at single-feature granularity. Most of what's portable lives in the patterns above, not in any specific tool.

Why this matters

Mentiora builds AI agents for clients in regulated industries such as healthcare and fintech. The agents speak to customers. Customers don't get a "we'll do better next time" if the agent hallucinates a refund, invents medical advice, or commits to a contract term that wasn't sanctioned.

To meet that bar, the model cannot be the whole system. It is one component in a system that includes pre-send judges, simulation harnesses, policy enforcement layers, evaluation pipelines, and a quality platform that scores every output. None of those components is AI-novel in itself: each is conventional software, with measurable acceptance criteria, that has to be correct on the first request.

We ship that software the same way /feature ships features. Plan before code. Council before merge. Fitness function before implementation. Disk-backed state, end to end. The pipeline doesn't compress engineering discipline; it forces it. Most things only become production-ready when an engineer has been forced to write down what ready means.

Try it

/feature is two files and a slash command. MIT-licensed. Drops into any Claude Code project.

Setup (30 seconds):

mkdir -p .claude/commands
# -f makes curl fail on an HTTP error instead of saving the error page
curl -fsSL -o .claude/commands/feature.md \
  https://raw.githubusercontent.com/mentiora-ai/feature-workflow/main/feature.md
curl -fsSL -o .claude/commands/checklist.md \
  https://raw.githubusercontent.com/mentiora-ai/feature-workflow/main/checklist.md

Then type /feature <description> in Claude Code.

Live deck
Deck source (Next.js + Firestore; pnpm dev works without secrets. See docs/HOW-IT-WAS-BUILT.md for the pipeline shape behind the codebase.)
Workflow repo

These are example shapes to adapt, not the system Mentiora runs internally. Take what's useful, reshape it for your team. If you're building agentic systems for customers in regulated industries and want a deeper look at the platform underneath, reach out: hello@mentiora.ai.

This post is based on my talk at the first Claude Code for Builders: Live Builds in Zurich, on 7 May 2026.