Learn Vibe Coding: A Step-by-Step Tutorial for Developers
This is a 20-part guide for developers who already use AI coding tools but want to get dramatically better at it. Not a beginner’s introduction — you’ve used Copilot, maybe Cursor or Claude Code. You know what a prompt is. But you’re probably hitting the same walls everyone hits: cascading bugs, context loss, security gaps, the feeling that AI is fast but fragile.
This guide is the bridge between “I use AI sometimes” and “AI is how I work.”
Each part builds on the last. Start from Part 1, or jump to whatever’s keeping you stuck.
Part 1: The State of the Art (Early 2026)
Before we get into technique, let’s get a clear picture of where AI-assisted coding actually is right now — not the hype, not the backlash, but the reality on the ground. Because the landscape has changed fast, and the tools you tried six months ago are not the tools available today.
The Numbers Are In
MIT Technology Review named generative coding one of the 10 breakthrough technologies of 2026. That’s not a prediction — it’s a recognition that AI-assisted development has crossed from experiment to mainstream practice.
The Stack Overflow 2025 Developer Survey tells the story in hard numbers: over 80% of developers now use AI coding tools, and 51% use them daily. At the corporate level, Microsoft says ~30% of its code is AI-generated. Sundar Pichai disclosed during Google’s Q3 2024 earnings call that more than 25% of all new code at Google is written by AI. At the frontier labs themselves, the numbers are even more striking: Boris Cherny, head of Anthropic’s Claude Code, said his personal figure has been 100% for months — shipping 20+ pull requests per day, each fully written by Claude. An Anthropic spokesperson clarified the company-wide number sits at 70–90%.
This is not a future scenario. It’s the current state of the industry.
Multi-Agent Is the New Normal
The biggest shift in the last year isn’t better autocomplete — it’s the move to agent-based workflows where AI doesn’t just suggest lines of code, it executes multi-step tasks autonomously.
In February 2026, VS Code 1.109 was explicitly positioned as a “multi-agent development platform.” You can now run Claude, Codex, and GitHub Copilot agents side-by-side in unified sessions, with parallel subagents working in isolated contexts. The announcement also introduced agent hooks (shell commands triggered at lifecycle events), terminal sandboxing, and message queueing — infrastructure for treating agents as first-class collaborators, not just autocomplete on steroids.
The most dramatic demonstration came from Anthropic’s own engineering team. Researcher Nicholas Carlini used 16 parallel Claude instances, each running in its own Docker container, to build a roughly 100,000-line C compiler written in Rust. Nearly 2,000 Claude Code sessions. About $20,000 in compute. The result: a compiler that achieves a 99% pass rate on the GCC torture test suite and successfully compiles the Linux 6.9 kernel on x86, ARM, and RISC-V — plus PostgreSQL, SQLite, Redis, FFmpeg, and Doom. InfoQ’s coverage called it “without human intervention.” The Register was more skeptical, questioning the practical implications. Both reactions are valid — and both illustrate where things stand.
Steve Yegge and Gene Kim’s talk “2026: The Year the IDE Died” frames the shift well: the developer’s primary job is increasingly about articulating intent and orchestrating agents rather than writing syntax. Andrej Karpathy — who coined the term “vibe coding” a year ago — posted on X on the one-year anniversary that his preferred term is now “agentic engineering”:
“Agentic — because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering — to emphasize that there is an art & science and expertise to it.”
He also called the original tweet “a shower of thoughts throwaway” he fired off without expecting it to define a movement. A year later, the practice has matured far past the name.
Armin Ronacher (creator of Flask) gives an excellent practitioner-level walkthrough of what agentic coding actually looks like in daily work — terminal-based workflows, keeping project structures simple for agents, and the real reliability threshold that’s been crossed.
The Trust Paradox
Here’s the uncomfortable part. Despite near-universal adoption, only 29% of developers trust the accuracy of AI-generated output — down from 40% the year before. Favorable views of AI tools dropped from 72% to 60%. People are using it more and trusting it less.
The most counterintuitive data point comes from METR’s study (now on arXiv): 16 experienced open-source developers — people with an average of 5 years working on repos with 22,000+ stars and over a million lines of code — completed 246 tasks. When allowed to use AI tools (primarily Cursor Pro with Claude Sonnet), they were 19% slower than without AI. Before the experiment, they predicted they’d be 24% faster. Even after experiencing the slowdown, they still believed AI had sped them up by 20%.
Let that sink in. The most experienced developers, on their own codebases, got slower — and didn’t realize it.
METR is careful to note this is a snapshot of early-2025 capabilities in one specific setting, and they’ve since updated their experiment design as tools have improved. But the finding points to something important: raw tool capability isn’t the bottleneck. Technique is. The developers in the study weren’t bad at coding — they were applying AI tools without the workflows, context management, and review patterns that make them effective.
The Craftsmanship Question
There’s a real conversation happening about what AI-assisted development means for the craft of programming. It’s not luddism and it’s not resistance to change — it’s developers genuinely grappling with questions about skill atrophy, ownership, and creative agency.
Simon Willison drew a useful line: if an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all — that’s not vibe coding, that’s using an LLM as a typing assistant. His golden rule: “I won’t commit any code to my repository if I couldn’t explain exactly what it does to somebody else.” In a later piece on vibe engineering, he expanded the distinction further — recognizing that the professional end of the spectrum is something different from the “let the AI do everything” approach.
Some developers have written thoughtfully about losing the flow state, the satisfaction of craft, the meditative quality of building something from scratch. Those concerns deserve respect. The answer isn’t “get over it” — it’s developing an intentional relationship with these tools where you choose when to delegate and when to do the work yourself. Not every task benefits from AI, and knowing when to turn it off is its own skill (we’ll come back to this in Part 20).
Karpathy’s YC keynote “Software Is Changing (Again)” is the best single overview of where all of this is heading — Software 3.0, where the primary programming language is English and the developer’s job is orchestration:
Where This Guide Comes In
So here’s the situation: the tools are powerful and getting more powerful fast. Adoption is nearly universal. But trust is low, technique is uneven, and the gap between “using AI tools” and “using AI tools well” is enormous. The METR study proves it — the tools alone don’t make you faster. How you use them does.
That’s what the next 19 parts are about. Not whether to use AI for coding — that ship sailed. But how to use it in a way that actually makes you better: faster, more reliable, producing higher-quality code with fewer security holes and fewer cascading bugs.
We’ll cover context engineering (the single most important skill), spec-driven development (why “just build it” fails at scale), security patterns (the data on AI security flaws is brutal), and the psychology of working alongside AI (including the identity questions that experienced developers legitimately face).
Start with Part 2: What Vibe Coding Actually Is (and Isn’t), or jump to whatever’s keeping you stuck.
Part 2: What Vibe Coding Actually Is (and Isn’t)
The term “vibe coding” gets used to describe everything from GitHub Copilot suggesting a line of code to a fully autonomous agent building an entire app from a prompt. That ambiguity is causing real confusion — and as Addy Osmani put it, “conflating them is causing real confusion and real damage.”
So let’s be precise.
The Origin Story
On February 2, 2025, Andrej Karpathy tweeted:
“There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. … I ‘Accept All’ always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension. … It’s not too bad for throwaway weekend projects, but still quite amusing.”
That tweet went viral. Collins English Dictionary named “vibe coding” its Word of the Year for 2025. But here’s the thing Karpathy was explicit about: he was describing throwaway weekend projects. He wasn’t suggesting this was how to build production software. The nuance got lost in the virality.
A year later, Karpathy reframed the concept entirely, calling the professional version “agentic engineering” — emphasizing oversight, expertise, and the art of orchestrating agents rather than blindly accepting their output.
Fireship’s explainer captures the cultural moment — how the term went viral, the projects it enabled, and why it became a “mind virus” for the developer community:
The Spectrum
In practice, there’s a spectrum of AI-assisted coding, and where you sit on it matters for everything from code quality to career trajectory. Here’s what it actually looks like:
Level | What it looks like | Your role |
|---|---|---|
| Autocomplete | Inline suggestions (Copilot, Tabnine) | You drive, AI finishes sentences |
| Chat assistance | Ask AI to explain, refactor, or debug specific code | You direct, AI advises |
| Agent-assisted | AI generates whole features; you review everything | You review all output before committing |
| Guided vibe coding | AI generates; you check outputs but not every line | You evaluate behavior, not code |
| Pure vibe coding | “Accept All always, I don’t read the diffs” | AI drives; you watch what happens |
GitHub’s own Copilot docs describe a similar progression: from inline suggestions, to chat-assisted coding, to agent mode, to fully autonomous agents. EclipseSource breaks it into six levels. Google Cloud’s guide distinguishes between “pure vibe coding” (best for rapid ideation and throwaway projects) and “responsible AI-assisted development” (AI as pair programmer, human reviews everything).
The point isn’t that one end of the spectrum is good and the other bad. The point is they’re different activities with different risk profiles, and you should be choosing your position deliberately based on what you’re building.
The Dividing Line
Three people have drawn this line most clearly:
Simon Willison — his golden rule: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all — that’s not vibe coding, that’s software development.” His personal standard: “I won’t commit any code to my repository if I couldn’t explain exactly what it does to somebody else.”
Addy Osmani — in Beyond Vibe Coding (now also an O’Reilly book), he frames the space on two axes: your technical proficiency and the level of AI abstraction you’re using. A technical developer using high-abstraction tools with full review capability? That’s “transformed development” — the most productive quadrant. A non-technical person using high-abstraction tools without review capability? That’s “democratized development” — powerful but fragile.
Wikipedia — the encyclopedia’s definition puts it plainly: vibe coding “typically involves accepting AI-generated code without closely reviewing its internal structure.” The absence of review is the defining characteristic.
The dividing line isn’t how much AI you use. It’s whether you understand what it produced.
This discussion digs into Osmani’s “Beyond Vibe Coding” framework and what it means to embrace AI as a senior engineer — the tradeoffs between speed and quality, and why understanding generated code remains non-negotiable.
Where You Probably Sit
If you’re reading this guide, you’re probably somewhere in the middle of the spectrum — and likely moving between positions depending on the task. Maybe you vibe-code a quick prototype on Saturday, then carefully review every line of AI-generated code for your team’s production codebase on Monday. That’s fine. That’s actually the right approach.
The problem isn’t vibe coding. The problem is vibe coding when you think you’re engineering. It’s accepting code you don’t understand into systems that matter. It’s the gap between the level of review you’re doing and the level of review the situation demands.
This guide is about operating at the “agent-assisted” and “agentic engineering” level — where you get the speed benefits of AI generation with the quality assurance of human oversight. You don’t have to read every line (the tools are too fast for that to be practical), but you need to understand the architecture, verify the behavior, and maintain the ability to explain what your system does.
The rest of this guide will show you how.
Part 3: The Skill Stack You Already Have
There’s a persistent myth that AI-assisted coding is a brand-new skill — something you need to learn from scratch, like a new language or framework. It isn’t. It’s your existing skills applied through a different interface. And the data overwhelmingly shows that the developers who get the most from AI tools are the ones with the deepest existing expertise.
The Architect, Not the Typist
Anthropic’s 2026 Agentic Coding Trends Report frames the shift clearly:
“Engineers are shifting from writing code to coordinating agents that write code, focusing their own expertise on architecture, system design, and strategic decisions.”
But here’s the number that matters: despite developers using AI in roughly 60% of their work, they report being able to “fully delegate” only 0–20% of tasks. The other 80–100% still requires human supervision, validation, and judgment. The report’s conclusion isn’t that engineers are being replaced — it’s that “the organizations pulling ahead aren’t removing engineers from the loop, they’re making engineer expertise count where it matters most.”
Addy Osmani put it more directly:
“Almost everything that makes someone a senior engineer — designing systems, managing complexity, knowing what to automate vs hand-code — is what now yields the best outcomes with AI.”
His full workflow post is worth reading in detail. The core argument: AI coding assistants are force multipliers, but “the human engineer remains the director of the show.” Classic software engineering discipline — planning, testing, version control, code review — “not only still applies, but is even more important when an AI is writing half your code.”
Osmani’s JSNation talk “The AI-Native Software Engineer” breaks down the “70% problem” — AI rapidly produces 70% of any solution, but the final 30% (edge cases, security, production context) still requires deep developer expertise. That 30% is where senior skills earn their keep.
What Transfers (and What Doesn’t)
Not every skill transfers equally. Here’s what the data and practitioner reports suggest:
Skill | AI-Era Importance | Why |
|---|---|---|
| System architecture | Higher than ever | AI can’t design coherent systems at scale. Someone has to decide how the pieces fit together. |
| Code review | Critical | The volume of code to review explodes. You’re reviewing code you didn’t write, faster than ever. |
| Debugging | Higher than ever | AI introduces subtle bugs that require human diagnosis. 45% of developers say debugging AI code is their biggest frustration. |
| Domain knowledge | Irreplaceable | AI has no codebase history, no business context, no memory of why that weird edge case exists. |
| Testing / QA | Critical | Osmani: “Everyone will need to get a lot more serious about testing and reviewing code.” |
| Spec writing | Critical (new emphasis) | Prompting is requirements analysis. The better your spec, the better the AI output. |
| Security analysis | Higher than ever | AI-generated code introduces new vulnerability patterns. Someone has to catch them. |
What doesn’t transfer well: the instinct to read every character of output. The habit of building everything from scratch. The assumption that faster typing = faster delivery. These are the habits that slow experienced developers down with AI tools — and they’re unlearnable.
The METR Paradox, Explained
Remember the METR study from Part 1 — 16 experienced developers, 19% slower with AI? The study’s own analysis identifies five contributing factors:
- Overly simple prompts — developers weren’t applying their communication and spec-writing skills to AI interaction
- Limited familiarity with AI interfaces — workflow integration skills not yet developed
- Over-reviewing — spending excessive time reviewing AI output line by line against their high standards
- Insufficient coverage of complex cases — AI failed on the hard parts that senior devs are actually good at
- Cognitive distraction — context switching and flow disruption from experimenting with a new tool
The developers accepted less than 44% of AI suggestions — meaning they spent more than half their AI time reviewing and rejecting output. As one study participant put it, AI felt like “a new contributor who doesn’t yet understand the codebase.”
The skill failure wasn’t in their engineering knowledge. It was in calibration — knowing when to use AI versus doing it yourself, and reviewing at the right level of abstraction rather than character by character. That calibration is what this guide teaches. METR has since updated their experiment design as both tools and developer behavior have improved, with one participant noting: “My head’s going to explode if I try to do too much the old-fashioned way because it’s like trying to get across the city walking when all of a sudden I was more used to taking an Uber.”
Testing as the Superpower
Kent Beck — creator of TDD, co-author of the Agile Manifesto, 52 years of coding experience — has been vocal about why test-driven development becomes more valuable with AI, not less. In his conversation with Gergely Orosz, he frames AI as an “unpredictable genie”: powerful but unreliable, which makes tests the only trustworthy feedback mechanism.
The logic is simple: if you have good tests, you can let AI generate code freely and verify it instantly. Tests prevent regressions, give agents a verifiable target, and let you refactor fearlessly. Without tests, you’re flying blind — accepting AI output on faith. With them, you’re in control.
Beck himself has said he’s been re-energized by AI because it removes the parts of coding he’d grown to dislike — the tedious implementation — while amplifying the parts he loves: design, architecture, and the satisfaction of a well-structured system.
The Real Unlock
The GitHub Octoverse report frames the transition as moving “from code producers to creative directors of code.” Their three-layer framework maps what transfers:
- Understanding the work: AI fluency plus fundamentals (algorithms, data structures, product understanding)
- Directing the work: delegation, agent orchestration, architecture and systems design
- Verifying the work: ensuring correct and high-quality outputs
All three layers reward existing expertise. The more you know about how software should work, the better you can direct and verify AI-generated code. As O’Reilly’s analysis concludes: “LLMs actively reward existing top-tier software engineering practices” and “amplify existing expertise.”
You don’t need a new skill stack. You need to apply the one you have through new tools — and develop the calibration to know when to delegate and when to do it yourself. The next part covers choosing the right tool for the job.
Part 4: Choosing the Right Tool for the Job
This is not a “best AI coding tool” ranking. Those go stale in weeks and miss the point entirely. The question isn’t which tool is best — it’s which tool is best for the specific task you’re doing right now. And increasingly, the answer is more than one.
The Stack Overflow 2025 Developer Survey shows the landscape clearly: VS Code dominates at 75.9% usage, Cursor has reached 17.9%, Claude Code sits at 9.7%, and Windsurf at 4.9%. But these numbers hide the real story — developers aren’t picking one tool. They’re layering them. The JetBrains 2025 Developer Ecosystem Survey found that 62% of developers rely on at least one AI coding assistant, agent, or code editor, with 19% saving 8+ hours per week (up from 9% in 2024). But only 31% of developers use coding agents — meaning the jump from autocomplete to agentic workflows is the defining transition happening right now.
The Two Modes: Conducting vs. Orchestrating
Addy Osmani drew the most useful distinction for thinking about tool selection. It’s not IDE vs. terminal, or commercial vs. open-source. It’s about two fundamentally different ways of working with AI:
Conductor mode: You’re working interactively with one agent in real time. You see every change, you steer mid-stream, you course-correct immediately. This is Cursor’s sweet spot, and it’s how most people use Claude Code and Copilot’s agent mode day-to-day.
Orchestrator mode: You define a task, hand it off, and review the result. The agent works autonomously — possibly for minutes, possibly for hours. You might have multiple agents running in parallel on different tasks. This is where GitHub Copilot’s coding agent, OpenAI Codex, and Claude Code’s sub-agent teams excel.
The key insight: these aren’t competing approaches. They’re different gears. You conduct when the task needs your judgment mid-execution — tricky algorithms, nuanced UI work, anything where you need to see and steer. You orchestrate when the task is well-defined enough to delegate — writing tests for a module, refactoring a file, generating documentation.
“The human’s effort is front-loaded (writing a good task description or spec) and back-loaded (reviewing the final code and testing it), but not much is needed in the middle. This means one orchestrator can manage more total work in parallel than would ever be possible by working with one AI at a time.”
— Addy Osmani, “Conductors to Orchestrators”
As Osmani notes, some developers are already running “3-4 agents at once on separate features.” Simon Willison confirms: “I’m increasingly hearing from experienced, credible software engineers who are running multiple copies of agents at once, tackling several problems in parallel.” He does this himself and finds it “surprisingly effective, if mentally exhausting.”
The Major Tools (and When Each One Shines)
Rather than ranking these, here’s a framework for when you’d reach for each one:
GitHub Copilot: The Baseline Layer
Copilot is the market incumbent — 20 million cumulative users, 42% market share, used by 90% of Fortune 100 companies. And it’s evolved dramatically from the autocomplete tool most people remember.
What it is now: Copilot has inline completions, but it also has agent mode (multi-file edits, terminal commands, self-healing iterations), a coding agent that takes GitHub issues and produces pull requests asynchronously, and as of February 2026, a generally-available CLI for terminal-native workflows. It generates 1.2 million pull requests per month via its coding agent alone.
When to use it: Copilot is the tool you leave running all the time. Its inline completions are fast and unobtrusive — developers keep 88% of Copilot-generated code in final submissions. It’s the typing accelerator. If your company is a Microsoft shop, it’s already approved and integrated. And at $10/month for Pro (with a generous free tier of 2,000 completions/month), it’s the cheapest entry point.
When to reach for something else: When you need deep multi-file reasoning, when you want to run agents on complex codebase-wide tasks, or when you want model choice beyond what Copilot offers.
Cursor: The Visual Conductor
Cursor is a VS Code fork rebuilt ground-up around AI. It’s not AI bolted onto an editor — the entire IDE is designed for agent-driven workflows. Trusted by over half the Fortune 500, and reaching 21.7% usage among professional developers already using AI.
What makes it different: Cursor Tab is their in-house completion model, trained specifically for code suggestion within Cursor’s context — it predicts your next edit, not just the next line. Cursor Agent (formerly Composer) plans multi-step tasks, edits multiple files, runs terminal commands, and iterates until tests pass. The February 2026 update introduced long-running agents that can test their own changes and iterate autonomously. And the Visual Editor lets you click, drag, and inspect rendered web components — then prompt changes visually.
When to use it: Cursor is the conductor’s tool. It excels at interactive work where you want to see changes in real time: building new features, rapid prototyping, UI work, anything where visual feedback matters. Its tab-completion is the fastest way to write code you already know how to write. It also offers model flexibility — you can switch between Claude, GPT, Gemini, and Cursor’s own models mid-session.
When to reach for something else: When you need deep autonomous operation on large codebases, when you prefer terminal workflows, or when you’re doing infrastructure/DevOps work that doesn’t benefit from a visual IDE.
“Cursor prioritizes speed and velocity — get code written fast. Claude Code prioritizes depth and correctness — get the solution right.”
Claude Code: The Terminal-Native Agent
Claude Code lives in your terminal. No IDE required. You cd into your project, type claude, and describe what you want. It reads your codebase, plans an approach, writes code, runs tests, and commits — all from the command line.
What makes it different: Claude Code follows the Unix philosophy — it’s composable. You can pipe logs into it (tail -f app.log | claude -p "alert me if you see anomalies"), run it in CI/CD pipelines, or chain it with other tools. It maintains persistent context across sessions through CLAUDE.md files and auto-memory. Its sub-agent architecture lets it spawn multiple agents working on different parts of a task simultaneously. And it connects to external services via MCP — Slack, Jira, Google Drive, databases.
Independent testing found Claude Code uses 5.5x fewer tokens than Cursor for identical tasks, producing less code churn — it tends to get things right on the first or second iteration.
When to use it: Large-scale refactoring, test generation, codebase-wide changes, documentation, anything where thoroughness matters more than visual feedback. It’s the tool for developers who live in the terminal and want their AI agent to live there too. The ability to run it headlessly — in CI, in Docker containers, via GitHub Actions — makes it uniquely suited to automated workflows.
When to reach for something else: When you want real-time visual feedback, when you’re doing UI/design work, or when you need to rapidly iterate on visual components.
For a deeper comparison, see our Claude Code vs. Cursor guide.
OpenCode: The Open-Source Wildcard
OpenCode is the tool that changes the economics of the entire conversation. It’s an open-source, terminal-native coding agent — think Claude Code’s workflow, but with 75+ model providers and zero vendor lock-in. Built by the creators of SST (Serverless Stack), it’s hit 113,000+ GitHub stars and 2.5 million monthly developers in under a year.
What makes it different: Model freedom. OpenCode works with Claude, GPT, Gemini, DeepSeek, Llama, and dozens more — including local models via Ollama at zero API cost. Its client-server architecture means you can run sessions inside remote Docker containers. It has LSP integration (so the LLM gets real code intelligence from language servers, not just token prediction), multi-session support for running parallel agents, and session sharing for collaboration. As of January 2026, GitHub officially partnered with OpenCode, allowing all Copilot subscribers to authenticate directly — no additional AI license needed.
The Zen service is a smart addition: a curated, benchmarked list of models optimized specifically for coding agents, so you don’t have to figure out which of the 75+ providers actually works well for your use case.
When to use it: When you want the terminal-native agent workflow but don’t want to be locked to a single model provider. When you want to use local models for privacy or cost. When you want to switch models mid-project based on what’s working (use a fast model for scaffolding, a reasoning model for debugging). When you’re in a team that’s standardizing on open-source tooling. Or when you simply want to understand how coding agents work under the hood — the codebase is MIT-licensed and readable.
When to reach for something else: When you need the most polished, battle-tested experience on a single model family (Claude Code’s integration with Claude is tighter by nature), or when your team is already deep in a specific vendor ecosystem.
ThePrimeTime’s discussion with OpenCode’s creators Dax Raad and Adam Elmore digs into how the agent actually works under the hood — the loop architecture, tool integration, and why open-source matters for coding agents.
OpenAI Codex: The Background Operator
Codex is OpenAI’s entry in the terminal-native agent space — open-source, built in Rust, with both a CLI and a cloud sandbox mode. It’s the closest thing to a “fire and forget” coding agent.
What makes it different: Codex’s cloud sandbox mode preloads your repo and runs tasks in the background — you assign work and come back to review the results later. It supports three approval levels (Suggest, Auto Edit, Full Auto), has a built-in code review mode where a separate Codex instance reviews your changes before commit, and can search the web for up-to-date information during tasks. The VS Code 1.109 update added direct Codex agent support alongside Copilot and Claude.
When to use it: Async workflows — tasks you want to queue and review later. Background operations while you’re working on something else. If you’re already in the OpenAI ecosystem (ChatGPT Plus/Pro/Enterprise), Codex is included at no additional cost.
When to reach for something else: When you need real-time interactive guidance, when you want model flexibility beyond OpenAI, or when you need the deeper codebase understanding that comes from tools specifically designed around context engineering.
Windsurf: The Contextual IDE
Windsurf (formerly Codeium) is a standalone agentic IDE — not a VS Code extension, but a full IDE rebuilt around AI. Its Cascade agent is the central feature, with a shared timeline that tracks everything: files you edit, terminal commands, clipboard contents, conversation history.
What makes it different: Cascade has a “memory” layer — it watches what you’re doing and infers intent to continue patterns autonomously. Its Riptide reasoning engine claims 200% improvement in retrieval recall versus traditional embedding systems. Wave 13 (December 2025) added parallel multi-agent sessions with Git worktree support. At $15/month for Pro, it’s the cheapest full-featured agentic IDE.
When to use it: If you want an all-in-one agentic IDE experience at a lower price point than Cursor. The contextual awareness — tracking your actions and anticipating next steps — is genuinely different from other tools. LogRocket’s February 2026 power rankings placed Windsurf at #1, above Cursor.
When to reach for something else: When you prefer terminal workflows, when you need the deeper extension ecosystem of VS Code proper, or when you want tighter integration with a specific model provider.
They Don’t Compete — They Layer
The single most important insight about AI coding tools in 2026 is this: they stack. As Qodo’s analysis puts it: “Editor assistants help you move faster while writing code. Agents handle multi-file changes and structured tasks. Security tools flag exploitable issues. AI code review validates pull requests before merging.”
Here’s what tool layering looks like in practice:
Layer | Tool | What It Does |
|---|---|---|
| Inline completion | Copilot, Cursor Tab | Finishes your lines as you type |
| Interactive agent | Cursor Agent, Claude Code, OpenCode | Multi-file edits with real-time oversight |
| Background agent | Codex cloud, Copilot coding agent, Claude Code sub-agents | Async tasks you review later |
| Code review | Codex review mode, AI-on-AI review | Second opinion before merge |
| CI/CD integration | Claude Code in GitHub Actions, Copilot in pipelines | Automated quality gates |
Builder.io calls the old “either/or” framework obsolete: “the ‘use both’ workflow that more developers adopt each month.” Coding with Roby describes the most common pattern: developers code in Cursor with inline completions while Claude Code simultaneously refactors legacy modules in the terminal.
Osmani recommends an AI-on-AI review strategy: “spawn a second AI session to critique code from the first model.” Different models catch different things — a Claude review of GPT-generated code (or vice versa) surfaces issues that a single model would miss.
The Platform Shift: VS Code as Multi-Agent Hub
In February 2026, VS Code 1.109 was explicitly positioned as “the home for multi-agent development.” This isn’t a minor feature update — it’s a platform declaration.
You can now run Claude, Codex, and Copilot agents side-by-side in unified sessions. The new Agent Sessions view consolidates all agent activity — local, background, and cloud — in one dashboard. Parallel sub-agents execute simultaneously, each in isolated context, with results showing attribution for which agent performed each task. Agent Skills (now GA) let you package domain-specific capabilities as reusable slash commands — /review-pr, /deploy-staging, /run-security-audit.
The infrastructure is explicitly designed for picking the right tool per task. As Visual Studio Magazine noted, developers can now “compare agent outputs and delegate tasks between different AI implementations — all managed from one interface.”
This is the platform-level validation of the tool-layering approach. You don’t choose one agent. You choose the right agent for each task and let the platform manage the coordination.
A side-by-side comparison of Windsurf vs Cursor that cuts through the marketing to show how these tools actually differ in daily use — context handling, agent behavior, and where each one excels.
The Decision Framework
Stop asking “which tool should I use?” Start asking “which tool should I use for this task?”
Task | Best fit | Why |
|---|---|---|
| Writing new code line-by-line | Copilot inline, Cursor Tab | Speed — you’re driving, AI finishes sentences |
| Building a new feature interactively | Cursor Agent, Windsurf Cascade | Visual feedback, real-time steering |
| Large refactoring across files | Claude Code, OpenCode | Deep codebase understanding, autonomous execution |
| Writing tests for existing code | Claude Code, OpenCode, Codex | Well-defined task, delegatable |
| Async issue-to-PR workflow | Copilot coding agent, Codex cloud | Fire and forget, review later |
| Working with local/private models | OpenCode | Only major agent with full local model support |
| Legacy codebase exploration | Claude Code (CLAUDE.md), Augment Code | Context persistence, large codebase handling |
| Cost-sensitive workflows | OpenCode + Ollama, Copilot free tier | Open-source + local = $0 |
| CI/CD integration | Claude Code, Copilot | Native pipeline support |
The most productive developers aren’t loyal to one tool. They’re fluent in the category — they understand what inline completion, interactive agents, and background agents each do well, and they switch fluidly based on the task. As your toolkit matures, you’ll develop your own layering pattern that matches how you work.
What Matters More Than the Tool
Here’s the thing the tool comparison articles don’t tell you: the choice of tool matters less than your context engineering. A developer with excellent CLAUDE.md files, clear specs, and good context management will outperform someone with a “better” tool and no context discipline — every time.
The Anthropic 2026 Agentic Coding Trends Report puts it clearly: developers use AI in roughly 60% of their work, but they can “fully delegate” only 0-20% of tasks. The other 80-100% still requires human supervision. The limiting factor isn’t tool capability — it’s how well you communicate intent to the tool.
That’s what Part 5 is about. Context engineering is the single most important skill in AI-assisted development, and it works across every tool on this list.
Part 5: Context Engineering — The Core Skill
Want the deep dive? We expanded this section into a standalone guide: Context Engineering: The Complete Guide for AI-Assisted Coding — covering the Four Pillars Framework, CLAUDE.md best practices, Cursor rules, session management, and the latest research.
If you take one thing from this entire guide, make it this: context engineering is the single most important skill in AI-assisted development. Not prompting. Not tool selection. Not knowing which model is newest. The quality of the context you provide determines the quality of the code you get back — every time, across every tool.
The term was popularized by Tobi Lutke, Shopify’s CEO, in mid-2025:
“I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”
Andrej Karpathy endorsed it immediately:
“In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.”
He enumerated what “doing this right” involves: task descriptions, few-shot examples, RAG, related data, tools, state and history, and compacting. Then the critical caveat: “Too much or too irrelevant context can increase costs and degrade performance.”
That last point is the one most people miss. Context engineering isn’t about giving the AI more information. It’s about giving it the right information. MIT Technology Review traced the year’s arc from “vibe coding” to “context engineering” — the maturation from casual prompting to systematic information architecture.
Why Context Degrades (and Why It Matters)
Every coding agent — Claude Code, Cursor, Copilot, OpenCode — operates within a context window. Think of it as the agent’s working memory. Every message you send, every file it reads, every command output it sees, every error message it processes — all of it goes into the window and stays there.
Here’s the problem: performance degrades measurably as context fills up. Chroma Research tested 18 LLMs and found that across all models, accuracy drops as input length increases — even on simple tasks. The “Lost in the Middle” phenomenon (first identified by Stanford researchers in 2023) shows LLMs attend strongly to tokens at the start and end of context but poorly to the middle. When a debugging session has loaded 20,000 tokens of irrelevant file contents and dead-end explorations, the actual relevant code — sitting somewhere in the middle — gets less attention.
Anthropic’s own best practices docs say it directly:
“Most best practices are based on one constraint: Claude’s context window fills up fast, and performance degrades as it fills.”
This is why a developer with a clean, well-structured context outperforms one with a “better” model but cluttered context. It’s why the METR study developers (Part 1) got slower with AI — their sessions accumulated context noise faster than they could manage it.
The Four Pillars Framework
Sequoia Capital’s Inference newsletter published “Vibe Coding Needs Context Engineering” in July 2025, arguing that “intuition does not scale, structure does.” They identified four pillars of context engineering (a framework also developed independently by LangChain):
1. Write context — Save persistent information outside the context window. This is your CLAUDE.md, your .cursorrules, your spec documents. Anything the agent needs to know every session gets written down once, not repeated every prompt.
2. Select context — Pull in only what’s relevant for the current task. Don’t dump your entire codebase into the prompt. Targeted file reads, specific function references, relevant test outputs — not “here’s everything, figure it out.”
3. Compress context — Manage token usage through summarization and pruning. When a session gets long, compact it. When an exploration is done, clear the dead ends. Every token of noise competes with signal.
4. Isolate context — Structure information so it doesn’t bleed across tasks. Use subagents for research (they run in separate context windows and report back summaries). Start fresh sessions for unrelated work. Don’t let Monday’s debugging contaminate Tuesday’s feature build.
As Philipp Schmid put it: “Prompt Engineering = Crafting perfect instruction strings. Context Engineering = Building systems that dynamically assemble comprehensive contextual information tailored to specific tasks.”
Dexter Horthy’s YC Root Access talk “Advanced Context Engineering for Agents” is the best technical deep dive on this — covering why conversational prompting fails at scale, spec-first development, and the finding that agents tend to perform better when using less than 40% of the LLM’s context window.
A practical walkthrough of context engineering for AI coding agents — how to structure your project context, what to include in memory files, and the common mistakes that degrade agent performance.
The Memory Layer: Rules Files That Actually Work
Every major tool now has a mechanism for persistent context — a file (or set of files) that gets loaded automatically at the start of every session. This is the most important file in your project. More important than your README. More important than your config. Because it’s the file that determines whether your AI agent understands your project or hallucinates about it.
CLAUDE.md (Claude Code)
Anthropic’s official docs are clear about what belongs here:
Include | Exclude |
|---|---|
| Build/test commands Claude can’t guess | Anything Claude can figure out from reading code |
| Code style rules that differ from defaults | Standard language conventions Claude already knows |
| Architectural decisions specific to your project | Detailed API docs (link instead) |
| Common gotchas and non-obvious behaviors | Information that changes frequently |
| Developer environment quirks (env vars, etc.) | File-by-file codebase descriptions |
The official docs warn against the most common failure mode:
“The over-specified CLAUDE.md. If your CLAUDE.md is too long, Claude ignores half of it because important rules get lost in the noise. Fix: Ruthlessly prune. If Claude already does something correctly without the instruction, delete it.”
Community consensus from HumanLayer: keep it under 300 lines. Shorter is better. Run /init to auto-generate a starter from your codebase structure. Use the hierarchy: ~/.claude/CLAUDE.md for global preferences, ./CLAUDE.md at the project root (checked into git), ./CLAUDE.local.md for personal overrides.
.cursor/rules/ (Cursor)
Cursor’s rules system is more structured. Four types of rules:
- Always Apply — active every session (like CLAUDE.md)
- Apply Intelligently — agent decides relevance based on your description
- Apply to Specific Files — triggered by glob patterns (e.g., only for
*.tsxfiles) - Apply Manually — invoked via
@rule-name
Rules live in .cursor/rules/*.mdc files (the legacy .cursorrules file still works but is deprecated). The awesome-cursorrules repo has community templates. Same official advice: keep content under 500 lines, decompose large rules into composable pieces, and iterate based on actual agent behavior.
.github/copilot-instructions.md (Copilot)
GitHub Copilot’s equivalent: a .github/copilot-instructions.md file for repository-wide instructions, plus .github/instructions/NAME.instructions.md files for path-specific rules (e.g., instructions that only apply to files matching app/models/**/*.rb).
AGENTS.md (Cross-Tool)
AGENTS.md is emerging as a cross-tool standard — recognized by Claude, Copilot, Cursor, and Gemini. Plain markdown, no metadata needed. If you work across multiple tools, this is the file that follows you everywhere.
What the Research Actually Shows
There’s now academic evidence on whether these context files actually help. The results are nuanced — and important.
An empirical study of 2,303 agent context files from 1,925 repos found that these files function like configuration code: they evolve frequently via small additions and prioritize build commands (62.3%), implementation details (69.9%), and architecture (67.7%).
But here’s the counterintuitive finding: a study evaluating AGENTS.md files found that context files can reduce task success rates versus no context, while increasing inference cost by 20%+. The lesson isn’t that context files don’t work — it’s that poorly maintained context files are worse than none. Outdated instructions, contradictory rules, stale architecture descriptions — these actively mislead the agent.
Spotify’s engineering team documented this in their “Honk” background coding agent (1,500+ merged PRs). Their second blog post is entirely about context engineering — the architecture of hot-memory constitutions, specialized domain agents, and cold-memory specification documents that made the agent actually work at production scale.
The takeaway, as Martin Fowler’s team puts it: balance is critical. Too much context degrades agent performance just as much as too little.
Session Management: When to Clear and When to Keep Going
The most underrated context engineering skill is knowing when to throw away your context and start fresh.
Anthropic’s docs name specific trigger conditions:
“If you’ve corrected Claude more than twice on the same issue in one session, the context is cluttered with failed approaches. Run
/clearand start fresh with a more specific prompt that incorporates what you learned. A clean session with a better prompt almost always outperforms a long session with accumulated corrections.”
The four signals that it’s time for /clear:
- Switching to unrelated tasks — don’t let feature work context bleed into bug fixing
- After two failed corrections — the failed attempts are polluting the context
- After “kitchen sink” sessions — you’ve mixed too many topics
- When performance visibly decreases — responses get generic, instructions get forgotten
The distinction between /clear and /compact matters:
/clear— nuclear option. Deletes entire conversation history. Frees 100% of tokens. CLAUDE.md gets re-loaded fresh./compact [instructions]— surgical option. Summarizes the conversation, reducing tokens 50-70%. You can focus the summary:/compact Focus on the API changes we discussed.
Armin Ronacher adds an important exception: don’t clear when the failure history itself is valuable. If Claude has tried and failed a specific approach, that context prevents it from repeating the same mistake. The art is knowing whether the failed attempts are useful signal or useless noise.
For long-running work, the awesome-vibe-coding-guide recommends starting a fresh session after approximately 30 messages, and always writing key decisions to your context files before clearing so they persist.
Structuring Your Project for AI
Context engineering isn’t just about memory files and session management. It’s about how your entire project is organized. Agents navigate codebases by reading files and following imports — the easier your project is to navigate, the better the agent performs.
The emerging consensus from practitioners and Addy Osmani:
Favor vertical over horizontal organization. Feature-driven layouts (src/auth/, src/billing/, src/dashboard/) work better than layer-driven layouts (src/models/, src/controllers/, src/views/) because an agent working on auth only needs to read the auth directory. A layer-driven layout forces it to load files from every directory to understand a single feature.
Use semantic file names. user-authentication-service.ts is better than uas.ts. Agents infer file contents from names before reading them — descriptive names reduce unnecessary file reads and save context.
Keep files small. Anthropic’s best practices recommend smaller, focused modules that fit comfortably in context. A 3,000-line monolith forces the agent to read (and hold in context) the entire file to modify a single function.
Colocate tests with code. If your test for auth-service.ts is in auth-service.test.ts right next to it, the agent finds it instantly. If it’s in tests/unit/services/auth/test_auth_service.py, that’s multiple directory traversals and file reads burning context.
Treat context files as code. They evolve with your codebase. Review them in PRs. Delete stale instructions. Add new patterns when you discover them. As EclipseSource notes, the hard problem isn’t creating context files — it’s keeping them accurate.
The Bigger Picture: Prompt Craft Is Just One Skill
Context engineering sits within a broader framework. The framing from this analysis of the 2026 prompting landscape identifies four distinct disciplines that “prompting” has split into:
- Prompt Craft — writing clear instructions. The original skill. By 2026, this is table stakes.
- Context Engineering — curating the entire information environment an agent operates within. What we’ve been discussing.
- Intent Engineering — encoding goals, values, and decision boundaries into agent infrastructure. Telling agents what to want, not just what to do.
- Specification Engineering — writing structured documents that agents can execute against over long periods without intervention.
This progression maps directly onto the levels from Part 2. Prompt craft is autocomplete-level. Context engineering is agent-assisted. Intent and specification engineering are orchestrator-level — where the real productivity multipliers live.
We’ll cover specification engineering in detail in Part 6: Spec-Driven Development, and prompting techniques in Part 7: Prompting That Actually Works. But context engineering underlies everything. Master this, and every other skill in this guide gets easier.
Part 6: Spec-Driven Development
Here’s the pattern that kills AI-assisted projects: you start fast. Impressively fast. The AI scaffolds your app in minutes, features land in hours, and the first few weeks feel like magic. Then around month three, everything starts breaking. You change one thing and four other features fail. You ask the AI to fix those, and something else goes weird. The codebase has grown beyond what anyone — human or AI — can hold in context. You’ve hit the wall.
The Three-Month Wall
Red Hat’s “The Uncomfortable Truth About Vibe Coding” names this structural breaking point directly:
“You change one small thing and four other features break. You ask the AI to fix those, and now something else is acting weird.”
The root cause isn’t the AI. It’s that the intent was never pinned down in writing:
“Your instructions become obsolete the moment code is generated. The code itself becomes the only source of truth for what the software does — and code is terrible at explaining why it does what it does.”
Tech Startups reported that over 8,000 startups faced rebuilds costing $50K–$500K each, hitting what they call the “Spaghetti Point” — typically around month three — where adding new features breaks existing ones. An arXiv paper on vibe coding technical debt documents the same pattern academically.
This isn’t a new problem. It’s the oldest problem in software engineering — building without specs — made worse by the speed at which AI generates code. Without specifications, the AI is improvising. It makes reasonable assumptions on the first pass, different reasonable assumptions on the second pass, and by the tenth pass your codebase is a collection of conflicting assumptions that no one can untangle.
Codeplain calls this “functionality flickering” — inconsistent behavior across AI regenerations when intent is not pinned in writing. A button is green in one generation, blue in the next. An API returns data in one shape, then a different shape after refactoring. The AI isn’t being inconsistent — it was never told what to be consistent about.
What Spec-Driven Development Actually Is
Spec-driven development (SDD) is the antidote. Instead of telling the AI “build me a review system,” you write a specification that defines exactly what the review system does — its data model, its API endpoints, its edge cases, its acceptance criteria — and then hand the spec to the AI for implementation.
Thoughtworks defines it as “a development paradigm that uses well-crafted software requirement specifications as prompts, aided by AI coding agents, to generate executable code.” They call it “one of the most important practices to emerge in 2025.”
GitHub’s Spec Kit blog post explains why vague prompting fails in the first place:
“This approach succeeds where vague prompting fails due to a basic truth about how language models work: they’re exceptional at pattern completion, but not at mind reading.”
The shift: from “code is the source of truth” to “intent is the source of truth.”
This is not a return to waterfall. Thoughtworks is explicit about this. The key difference is speed — what used to take weeks of requirements gathering collapses into minutes. Addy Osmani credits Les Orchard with the phrase that captures it:
“It’s like doing a ‘waterfall in 15 minutes.’”
The idea: compress the critical design work — requirements, architecture decisions, data models, testing strategy — into a brief planning window before any code is written. Then hand the spec to the AI.
“This upfront investment might feel slow, but it pays off enormously. When we unleash the codegen, both the human and the LLM know exactly what we’re building and why.”
The Workflow: Spec → Plan → Execute
Every SDD tool converges on roughly the same workflow, whether it takes four steps or three:
1. Specify — Define what you’re building in plain language with enough precision for an agent to execute. Requirements, user stories, acceptance criteria, constraints. This is the human’s job — the AI can help draft it, but you own the intent.
2. Plan — Generate a technical architecture from the spec. Data models, API contracts, file structure, dependencies. Review this carefully — it’s cheaper to fix a bad plan than bad code.
3. Task — Break the plan into small, independently verifiable implementation chunks. Each task should be testable in isolation.
4. Implement — Hand each task to the agent. Review the output against the spec, not against your intuition.
Claude Code’s Plan Mode (activated with Shift+Tab) enforces this natively — it restricts Claude to read-only operations while it explores your codebase and generates a plan. Press Ctrl+G to open and edit the plan before implementation begins. Cursor’s planning mode does the same: it crawls the project, reads docs and rules, asks clarifying questions, and generates an editable Markdown plan.
How to Write Specs That AI Can Actually Follow
Addy Osmani’s “How to Write a Good Spec for AI Agents” (also on O’Reilly Radar) is the best practitioner guide on this. His core insight:
“Most agent files fail because they’re too vague.”
And the opposite failure mode:
“Too many directives cause it to follow none of them well.”
He recommends structuring specs with six sections: Commands (executable with flags), Testing (runner, coverage expectations), Project structure (where things go), Code style (show examples, not descriptions), Git workflow (branch naming, commit conventions), and Boundaries — a three-tier system:
- Always do: Format before commit, run tests, follow naming conventions
- Ask first: Major refactors, new dependencies, architectural changes
- Never do: Delete test files, modify CI config, change database schema without approval
The five primitives of specification engineering provide an even more structured framework: self-contained problem statements (no extra context needed), acceptance criteria (what “done” looks like), constraint architecture (must do / must not do / escalate), decomposition (sub-tasks under 2 hours each), and evaluation design (test cases that prove the output works).
The simplest version? Simon Willison uses four words: “Use red/green TDD” — a concise instruction that unlocks an entire engineering discipline because “every good model understands this shorthand for the full test-driven workflow.” That’s specification through convention rather than documentation.
The Tools: Kiro, Spec Kit, and the Ecosystem
The spec-driven approach has spawned dedicated tools:
Amazon Kiro
Kiro is Amazon’s VS Code fork, launched in preview July 2025 and explicitly positioned as an antidote to vibe coding chaos. It enforces a three-document workflow:
- Requirements — You type a prompt (“Add a review system for products”). Kiro generates user stories with acceptance criteria using EARS notation (Easy Approach to Requirements Syntax).
- Design — Kiro analyzes your codebase and generates a technical design document with data flow diagrams, TypeScript interfaces, database schemas, and API endpoints.
- Tasks — A sequenced task list linked back to individual requirements, each specifying tests, loading states, and accessibility requirements.
What’s genuinely novel: Spec Sync — specs stay synced with your evolving codebase bidirectionally. Code changes propagate back to specs. Specs drive code. It’s the closest thing to the “living specification” that the industry has been talking about for years.
The honest criticism from early adopters: for small bugs, the workflow feels like “a sledgehammer to crack a nut” — one small bug turning into 4 user stories and 16 acceptance criteria. Know when to use it and when a quick fix is fine.
GitHub Spec Kit
Spec Kit is GitHub’s open-source CLI toolkit (announced September 2025) that enforces the Specify → Plan → Tasks → Implement workflow. It’s tool-agnostic — works with Copilot, Claude Code, Cursor, Windsurf, Gemini CLI.
The Microsoft Developer blog covers the philosophy: specs become the single source of truth, and implementation becomes a mechanical translation that agents can handle reliably.
What makes it practical: it’s a CLI you can integrate into any existing workflow, not a new IDE. You don’t have to switch tools — you add a structured planning layer to whatever you’re already using.
The Spec-as-Source Vision
Martin Fowler’s team (Birgitta Böckeler) identified three levels of spec-driven development:
- Spec-first: Write specs before coding specific tasks. (Kiro, GitHub Spec Kit)
- Spec-anchored: Specs persist and evolve post-completion for ongoing development.
- Spec-as-source: Specs are the only primary artifact — humans never touch code directly.
The spec-as-source vision is what Tessl ($125M raised, founded by Guy Podjarny of Snyk) is building toward. Their Spec Registry contains 10,000+ pre-built specs explaining how to use popular open-source libraries correctly — preventing the API hallucinations and version mixups that plague agents working without specs.
Sean Grove (OpenAI) gave the most cited talk on this at AI Engineer World’s Fair 2025: “The New Code — Specifications as the Fundamental Unit of AI-Era Programming.” His core argument: the code you write represents only 10-20% of your value as a developer. The other 80-90% is structured communication — understanding requirements, clarifying goals, planning solutions, verifying the implementation solves the right problem. In an AI-assisted world, that 80-90% is the spec.
Grove uses OpenAI’s own Model Spec — a living Markdown document that defines model behavior — as proof that specs can “compile to behavior.” His analogy: the US Constitution is a versioned, living specification with mechanisms for amendment and judicial review. Your codebase specs should work the same way.
A hands-on look at Amazon Kiro’s spec-driven development workflow — from generating specs to implementing against them, showing what the SDD loop actually looks like in practice.
The Calibration: When Specs Help (and When They’re Overhead)
Not everything needs a spec. Kent Beck — the TDD pioneer — uses what he calls “augmented coding” rather than pure spec-driven development. His approach: write tests first (the tests are the spec), let AI implement against them, verify coverage. For many tasks, a well-written test suite is all the specification you need.
Nolan Lawson observes what happens without specs at scale: the result is “like 10 devs worked on it without talking to each other.” But for a quick bug fix or a small utility function, a spec document is overhead. The calibration:
Project scope | Spec approach |
|---|---|
| Quick fix / small utility | No spec needed. Just prompt clearly. |
| Single feature | Inline spec in your prompt: requirements + acceptance criteria + constraints |
| Multi-feature sprint | Markdown spec file. Review before implementation. |
| Full project / architecture | Full SDD workflow: spec → plan → tasks → implement |
| Ongoing product | Living spec that evolves with the codebase (Kiro Spec Sync) |
Red Hat’s spec-driven development article claims specs can push first-pass AI accuracy to “95% or higher accuracy in implementing specs on the first go, with code that’s error-free and unit tested.” An arXiv paper on SDD found that human-refined specs reduce errors by up to 50% compared to ad-hoc prompting.
The investment is small. A spec for a typical feature takes 5-15 minutes to write — and saves hours of debugging, back-and-forth corrections, and cascading failures. Osmani calls this “the cornerstone of the workflow.” JetBrains’ Junie team says it plainly: “You’re not asking the AI agent to start coding yet — you’re asking it to think first.”
That thinking-first pattern is what the next part applies to individual prompts. Part 7: Prompting That Actually Works covers the prompt craft that turns specs into working code.
Part 7: Prompting That Actually Works
Let’s be honest about where prompting sits in 2026. If you’ve read Parts 5 and 6, you know that context engineering (the information environment your agent operates in) and specification engineering (the structured documents agents execute against) are where the real leverage lives. Prompt craft — the act of writing clear instructions in a chat window — is what one analysis calls “table stakes, a basic requirement like knowing how to type.”
But table stakes still matter. A well-written prompt in a well-engineered context produces dramatically better results than a sloppy prompt in the same context. And most developers are still writing sloppy prompts — because the techniques that worked for ChatGPT conversations don’t transfer cleanly to coding agents.
The Fundamental Shift: Action, Not Conversation
The single biggest prompting mistake developers make with coding agents is treating them like a conversation. They’re not. Modern agents — Claude Code, Cursor Agent, Codex — are autonomous workers, not chat partners. The prompt isn’t a request for information. It’s a work order.
Anthropic’s official prompting docs highlight the difference explicitly:
Less effective: “Can you suggest some changes to improve this function?”
More effective: “Change this function to improve its performance.”
Claude will suggest rather than implement if you use hedging language. The docs recommend this system prompt for agentic workflows:
By default, implement changes rather than only suggesting them.
If the user's intent is unclear, infer the most useful likely
action and proceed, using tools to discover any missing details
instead of guessing.
Armin Ronacher takes this to its logical conclusion. His workflow: assign a complete task and let the agent run. “I rarely interrupt it, unless it’s a small task.” Interrupting mid-execution actually degrades output quality — the agent loses its thread.
The Three Modes of Prompting
Prompting for code generation, code review, and debugging are fundamentally different activities requiring different prompt structures. Most guides treat them as one skill. They’re not.
Prompting for Code Generation
The goal: get working code on the first pass. The key elements:
- Language and framework specified upfront — “TypeScript, React 19, Next.js 15 App Router”
- Input/output specification — what the function receives, what it returns
- Constraints stated explicitly — “no external libraries,” “use ES modules,” “async/await only”
- Verification criteria — “run the tests after implementing”
Addy Osmani captures the core principle:
“LLMs do best when given focused prompts: implement one function, fix one bug, add one feature at a time. If you ask for too much in one go, it’s likely to get confused or produce a ‘jumbled mess’ that’s hard to untangle.”
From the official Claude Code docs, the before/after:
Vague | Specific |
|---|---|
| “implement a function that validates email addresses” | “write a validateEmail function. test cases: [email protected] → true, invalid → false, [email protected] → false. run the tests after implementing” |
| “make the dashboard look better” | “[paste screenshot] implement this design. take a screenshot of the result and compare it to the original. list differences and fix them” |
| “add tests for foo.py” | “write a test for foo.py covering the edge case where the user is logged out. avoid mocks” |
Prompting for Code Review
The goal: surface real issues without drowning in noise. The key elements:
- Role definition — “You are a senior TypeScript engineer reviewing for security vulnerabilities”
- Bounded scope — “Review only the authentication logic in src/auth/”
- Explicit criteria — “Check for race conditions, memory leaks, and SQL injection”
- Output format — “For each issue: describe what is wrong, explain why it is a problem, provide corrected code”
The Graphite guide to AI code reviews recommends the anti-pattern to avoid: asking the AI to “review this code” with no scope. You’ll get generic commentary about variable naming instead of the race condition hiding in your auth middleware.
Osmani’s most powerful technique: AI-on-AI review. Spawn a second AI session using a different model to critique code from the first. Claude catches things GPT misses, and vice versa.
Prompting for Debugging
The goal: find the root cause, not band-aid the symptom. The key elements:
- The exact error message, verbatim — copy-paste, don’t paraphrase
- Expected vs actual behavior, separated — “expected: returns user object. actual: throws TypeError”
- Scope — “check the auth flow in src/auth/, especially token refresh”
- Success criterion — “write a failing test that reproduces the issue, then fix it”
Osmani’s debugging formula: “It’s expected to do [behavior] but instead it’s doing [current behavior].” This is more reliable than “it doesn’t work.”
Research from arXiv confirms that error message prompting without source code context is significantly less effective. The model needs the original code alongside the error to produce useful fixes.
Show, Don’t Describe
The single highest-signal technique across all research: provide examples rather than descriptions.
The data is clear. Few-shot prompting achieves 15-40% better accuracy compared to zero-shot (describing what you want without examples). A large-scale study across 10 SE tasks found that in-context examples were the second-strongest predictor of high-performing prompts (22% of cases), behind only structured guidance (32%).
Why? Language models are pattern learners. When you provide examples, the model recognizes your specific patterns — naming conventions, error handling style, documentation format — and replicates them. Describing those patterns in prose is lossy. You may not accurately describe all the implicit patterns you actually follow.
The practical recipe:
- 2-5 examples is the sweet spot. Research shows large gains from 0 to 2 examples, then diminishing returns.
- Show existing code from your codebase that follows the pattern you want. “Write the new function following these patterns” is more effective than describing the patterns.
- Cover edge cases in examples — diversity matters more than quantity.
- Use
<example>tags in Claude to signal these are demonstrations, not instructions.
Instead of: “I want functions that handle errors gracefully with clear error messages and proper logging”
Try:
Here are two functions from our codebase that handle errors correctly:
<example>
async function getUser(id: string): Promise<User> {
const user = await db.users.findById(id);
if (!user) {
logger.warn('User not found', { id });
throw new NotFoundError(`User ${id} does not exist`);
}
return user;
}
</example>
Write a getOrder function following the same patterns.
A study on system prompt effects found that adding relevant code examples to prompts improved code generation success by 4.75x to 12.33x depending on the specific configuration. That’s not a marginal improvement — it’s an order of magnitude.
Chain Prompts, Don’t Mega-Prompt
The instinct to pack everything into one comprehensive prompt is strong. It’s also wrong for most coding tasks.
The research consensus: sequential/chained prompts win for complex tasks at the cost of latency. Anthropic’s docs describe prompt chaining as “trading latency for higher accuracy.”
The pattern for coding:
- Generate interfaces / function signatures (architecture step)
- Implement each function (one prompt per function)
- Write tests for the implementations
- Self-review and fix against the tests
PromptHub’s research suggests the sweet spot: 3-5 chained steps for most tasks. Fewer than three doesn’t provide enough structure; more than seven risks compounding errors.
When to mega-prompt instead: when the task is fully specified, self-contained, and needs the AI to see the full picture at once (e.g., “given this complete API spec, generate the client library”). Mega-prompts work when context coherence matters more than step-by-step accuracy.
What the Research Says About Chain-of-Thought
“Think step by step” was the magic phrase of 2023-2024. In 2026, it’s more nuanced.
Structured Chain-of-Thought (SCoT) — where you ask the model to reason using code structures (sequential, branch, loop) rather than generic steps — outperforms vanilla CoT by 6-14% on code generation benchmarks. The lesson: if you’re going to ask the model to reason, ask it to reason in code structures, not prose.
But Wharton’s research found that for models with built-in reasoning (Claude with extended thinking, o3, etc.), explicitly prompting for CoT adds 20-80% more processing time with marginal accuracy gains. The reasoning models already think before they code — you don’t need to tell them to.
The practical rule: use structured CoT for older/smaller models. For Claude Opus/Sonnet with extended thinking, Cursor with reasoning models, or o3 — skip it. The model is already doing it internally.
The Anti-Patterns
Common prompting mistakes that waste time and tokens:
Anti-pattern | What happens | Fix |
|---|---|---|
| “Fix this code” | AI guesses what’s wrong, often guesses incorrectly | Specify what’s wrong: expected vs actual behavior |
| Asking for everything at once | Auth + frontend + DB + deploy → jumbled mess | One function, one bug, one feature per prompt |
| No framework version | AI generates code for outdated APIs | “React 19,” “Next.js 15 App Router,” “Python 3.12” |
| No success criteria | AI declares “done” on broken code | “Run the tests after implementing” |
| Hedging language | “Could you maybe look at…” → AI suggests instead of doing | Direct imperatives: “Fix,” “Implement,” “Refactor” |
| Missing code context | Function without callers or data schema | Provide surrounding code or describe the data contract |
| Over-specification | 500-word prompt → AI ignores half of it | Targeted 50-150 words with clear scope |
Anthropic’s docs address the over-engineering failure mode specifically for Claude Opus 4.6:
“Avoid over-engineering. Only make changes that are directly requested or clearly necessary. Keep solutions simple and focused. A bug fix doesn’t need surrounding code cleaned up.”
And the anti-hallucination pattern:
“Never speculate about code you have not opened. If the user references a specific file, you MUST read the file before answering.”
Tool-Specific Prompting
How you prompt varies by tool — not just in syntax, but in philosophy:
GitHub Copilot: Prompting happens through code structure itself. Write a descriptive function name, add a comment describing the behavior, and let Copilot complete. The code you’ve already written is the prompt. Explicit chat-style prompting is secondary.
Cursor: Explicit natural language prompts plus @ context references. Use @filename and @codebase to give the model precise file context. The .cursor/rules/ files act as persistent system prompts. Prompting tip: reference specific files rather than describing them.
Claude Code: High-level goal description. Describe the end state, not the implementation steps. Use Plan Mode (Shift+Tab) for complex tasks — let Claude read the codebase and generate an approach before implementing. CLAUDE.md handles persistent conventions.
OpenCode: Same terminal-native approach as Claude Code but with model flexibility. Switch between models mid-session based on what’s working — a fast model for scaffolding, a reasoning model for complex logic.
The meta-pattern: the more agentic the tool, the more you describe outcomes rather than steps. Copilot needs you to show patterns through code. Claude Code needs you to describe where you want to end up. Cursor sits in between — it wants explicit instructions but handles multi-step execution autonomously.
The Practitioner Patterns
Three practitioners whose prompting approaches are worth studying:
Simon Willison (77 LLM-built tools in 2025): Treats AI as “an over-confident pair programming assistant.” His technique: provide explicit function signatures with parameter names and types — you act as architect, the LLM acts as implementer. Start with simpler versions, validate, then iterate toward complexity. “For longer changes, have the LLM write a plan first, iterate over it until it’s reasonable, then instruct it to implement step by step.”
Armin Ronacher (Flask creator): His highest-impact prompting insight isn’t about prompts at all — it’s about making system state observable. By logging emails to stdout in debug mode (described in CLAUDE.md), the agent can autonomously consult logs to complete authentication flows. The prompt doesn’t need to describe the state — the agent can discover it. He also found that language choice affects prompting success: Go’s explicit patterns make agents significantly more effective than Python’s implicit magic.
Addy Osmani: His testing insight is the most underappreciated prompting technique: “The single biggest differentiator between agentic engineering and vibe coding is testing. With a solid test suite, an AI agent can iterate in a loop until tests pass, giving you high confidence in the result. Without tests, it’ll cheerfully declare ‘done’ on broken code.”
Tests aren’t just verification — they’re the best prompt. A failing test tells the agent exactly what to implement, exactly what “done” looks like, and exactly how to verify success. Kent Beck calls this TDD’s “superpower” with AI agents. Willison captures it in four words: “Use red/green TDD.”
This analysis of the 2026 prompting landscape frames the full stack: prompt craft (table stakes), context engineering (the environment), intent engineering (the strategy), and specification engineering (the blueprint). The five primitives of specification engineering — self-contained problem statements, acceptance criteria, constraint architecture, decomposition, and evaluation design — are the most actionable framework for leveling up from ad-hoc prompting.
A deep dive into prompt chaining for AI coding — building multi-step prompt sequences that decompose complex tasks into manageable, verifiable steps rather than relying on a single mega-prompt.
The Synthesis
Here’s what actually moves the needle, ranked by impact:
- Provide verification criteria alongside the task. Tests, expected outputs, screenshots. Every practitioner and every research paper agrees: this single change has the largest effect on output quality.
- Show examples rather than describing patterns. 2-5 examples of existing code, wrapped in
<example>tags. 4.75x-12.33x improvement in research. - One task per prompt. One function, one bug, one feature. Break complex work into chains.
- Explore/Plan before Implement. Force the model to read relevant code and generate a plan before writing implementation.
- Use direct imperatives, not hedging. “Fix” not “could you maybe look at.”
- Include the error message verbatim for debugging. Copy-paste, don’t paraphrase.
Context engineering (Part 5) and specs (Part 6) are the environment. These prompting techniques are the execution. Get both right and you’re operating at a level most developers haven’t reached yet.
The next part covers what happens after the code is generated: Part 8: Reading and Reviewing AI Code.
Part 8: Reading and Reviewing AI Code
Here’s the skill gap nobody talks about: AI has shifted the bottleneck in software development from writing code to proving it works. You’re now spending more time reading code you didn’t write, at speed, looking for logic errors you didn’t create, in patterns you didn’t choose. And the data says most of us are doing it badly.
CodeRabbit’s analysis of AI-assisted pull requests (December 2025, studying 470 open-source GitHub PRs — 320 AI-co-authored, 150 human-only) found that AI-generated code contained:
- 2.25x more business logic bugs
- 1.97x more missing error handling
- 2.27x more null reference risks
- 1.75x more logic and correctness errors
- 3x more readability issues (the single biggest gap)
- ~8x more performance regressions (excessive I/O)
- 2.74x more XSS vulnerabilities
- 1.7x more total issues overall (10.83 per PR vs. 6.45 for human PRs)
That’s not “slightly worse.” That’s a fundamentally different quality profile that demands a different review approach.
The “Almost Right” Problem
The 2025 Stack Overflow Developer Survey (49,000 respondents) pinpointed the core frustration: 66% of developers say their biggest challenge with AI tools is dealing with “solutions that are almost right, but not quite.” The second-biggest frustration: 45% say “debugging AI-generated code is more time-consuming than expected.”
Almost right is worse than clearly wrong. A compilation error takes seconds to fix. A function that returns plausible-looking but subtly incorrect data can survive code review, pass superficial tests, and lurk in production for months.
IEEE Spectrum documented this as “silent failures” — AI-generated code that avoids syntax errors and obvious crashes but fails to perform as intended:
“Recently released LLMs generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. They do this by removing safety checks, or by creating fake output that matches the desired format.”
The article’s author, Jamie Twiss (CEO of Carrington Labs), describes the real cost: “A task that might have taken five hours assisted by AI, and perhaps 10 hours without it, is now more commonly taking seven or eight hours.” The speed gains from generation are consumed by the verification burden.
Why AI Code Breaks Differently
Human bugs tend to be localized — a typo, an off-by-one error, a missed edge case in a function the developer wrote and understands. AI bugs are structurally different.
A comprehensive survey of bugs in AI-generated code (arXiv, December 2025, analyzing 56 papers) identified eight distinct categories. The most dangerous:
Functional bugs — code that runs but produces wrong results. These dominate AI output because the model optimizes for syntactic plausibility, not semantic correctness. The code looks right. It compiles. It even handles some edge cases. But the core business logic is subtly wrong.
Hallucination bugs — unique to AI. The model references APIs that don’t exist, invents library methods, or assumes behaviors that no real framework provides. These are especially insidious because a developer unfamiliar with the specific library might not catch them during review.
Missing safeguards — the IEEE Spectrum finding is particularly concerning: models sometimes actively remove safety checks to produce code that appears to work. A validation function that always returns true. An error handler that swallows exceptions silently. A security check that’s structurally present but functionally empty.
The Context Studios “vibe coding hangover” article captures the pattern: by September 2025, Fast Company declared the “vibe coding hangover” had arrived, with senior engineers citing “development hell.” The article notes that 25% of Y Combinator’s Winter 2025 batch had codebases that were 95% AI-generated — and many were hitting walls.
The Trust Inversion
Here’s a finding that should concern every team lead. Qodo’s State of AI Code Quality 2025 report (609 developers surveyed) found a dangerous trust inversion:
- Senior developers (10+ years): reported the highest code quality benefits from AI (68.2%) but only 25.8% felt confident shipping AI code without review
- Junior developers (<2 years): reported the lowest quality improvements (51.9%) yet 60.2% felt confident shipping without review
The people least equipped to catch AI bugs are the most likely to skip review. The people most equipped to catch them know better than to trust the output.
This maps directly to the METR study’s most unsettling finding: experienced developers using AI tools were 19% slower on complex tasks — and they estimated they were 20% faster. The perception-reality gap isn’t just about productivity. It’s about confidence. AI-generated code feels trustworthy because it’s well-formatted, consistently styled, and comes with plausible-sounding explanations. That surface quality triggers automation complacency — the well-documented tendency to trust automated systems even when they’re wrong.
The Review Bottleneck
Greptile’s State of AI Coding 2025 report quantified what many teams felt intuitively: the bottleneck has shifted from writing to reviewing.
- Median lines of code per developer jumped from 4,450 to 7,839 (+76%)
- Median PR size increased by 33% (from 57 to 76 lines changed)
- Monthly code pushes on GitHub crossed 82 million
- About 41% of new code was AI-assisted
More code, written faster, by models that produce 1.7x more issues per PR. The math is brutal: review capacity didn’t scale with generation capacity.
And here’s the uncomfortable truth: manually reading through all AI-generated code line-by-line is often not possible anymore. When an agent generates 500 lines of implementation in 90 seconds, you can’t review it the way you’d review a colleague’s 50-line PR. The volume overwhelms the process. This doesn’t mean you skip review — it means the review method has to change. You can’t read every line, so you need automated tools, focused review heuristics, and test suites that catch what your eyes miss. (For small, targeted changes — a single function, a bug fix — manual review absolutely still works. The challenge is the large generation bursts.)
Addy Osmani summarizes the situation:
“PRs are getting larger, incidents per PR are up ~24%, and change failure rates are up ~30%. When output increases faster than verification capacity, review becomes the rate limiter.”
The 2024 DORA report confirmed the paradox at industry scale: a 25% increase in AI adoption triggered a 7.2% decrease in delivery stability and a 1.5% decrease in delivery throughput. More speed, less stability. More code, more incidents.
How to Actually Review AI Code
The good news: reviewing AI code well is a learnable skill. The bad news: it requires a different mental model than reviewing human code. Here’s what practitioners have converged on.
The PR Contract
Osmani’s PR framework — treat every pull request as a contract with four required elements:
- Intent — 1-2 sentences: what changed and why
- Proof of function — tests passed, manual verification, screenshots, logs
- Risk assessment — tier level, which parts used AI, high-risk flags (payments, auth, data mutations)
- Review focus — 1-2 specific areas requesting human judgment
The principle: “proof over promises.” Don’t accept “it works” — demand execution evidence before merge. If the PR doesn’t include proof that it works, send it back.
The Bug Taxonomy Checklist
Based on CodeRabbit’s data, prioritize review attention on the categories where AI fails most:
Review priority | What to check | AI failure rate |
|---|---|---|
| 1. Business logic | Does the code do what the spec says? Not “does it compile” — does the behavior match the requirement? | 2.25x higher |
| 2. Error handling | Every external call, every user input, every file operation — does it handle failure? AI loves the happy path. | 1.97x higher |
| 3. Null/undefined safety | Check every variable that could be null. AI frequently skips null checks. | 2.27x higher |
| 4. Security boundaries | Input validation, output encoding, authentication checks, CSRF tokens. AI almost never adds these proactively. | 1.57x higher |
| 5. Performance | Watch for excessive I/O, N+1 queries, unnecessary re-renders, missing pagination. | ~8x higher |
| 6. Readability | Variable names, function structure, comment accuracy. AI code is verbose and generic. | 3x higher |
The “Explain It” Test
Before approving any non-trivial AI-generated function, ask yourself: can I explain what this code does, line by line, to another developer? If you can’t, you haven’t reviewed it — you’ve glanced at it.
Simon Willison drew the definitive line:
“If an LLM wrote every line but you reviewed, tested, and understood everything, that’s not vibe coding — that’s using an LLM as a typing assistant.”
The key word is understood. Not “it looks reasonable.” Not “the tests pass.” Understood. If you merge code you can’t explain, you’re accumulating debt that compounds with every subsequent AI-generated change.
AI-on-AI Review
One of the most effective techniques: use a second AI to review code from the first. Spawn a review session using a different model — if Claude generated the code, review it with GPT (or vice versa). Different models have different blind spots.
Osmani reports that properly configured AI reviewers catch 70-80% of low-hanging fruit — style issues, simple logic errors, missing null checks. That frees human reviewers to focus on what AI can’t evaluate: architectural fit, business logic correctness, security implications, and whether the code actually solves the right problem.
The layered approach:
- AI reviewer catches mechanical issues (CodeRabbit, Greptile, Qodo, or a second model in your IDE)
- Automated tests verify functional behavior
- Human reviewer focuses on intent, architecture, security, and business logic
The AI Code Review Tool Landscape
Dedicated AI review tools have matured rapidly. Greptile’s 2025 benchmarks tested real-world bug detection rates:
Tool | Bug catch rate | Strength |
|---|---|---|
| Greptile | 82% | Deep codebase understanding, cross-file context |
| Cursor | 58% | IDE-integrated, real-time |
| CodeRabbit | 44% | Fast PR summaries, inline suggestions, 632K+ PRs processed |
| GitHub Copilot | ~55% | GitHub-native, lowest friction |
| Qodo | Strong (architectural) | Cross-repo context, ticket-aware validation |
No single tool catches everything. The teams reporting the best outcomes use layered review — an AI tool for automated scanning, plus human review for judgment calls. Osmani’s principle: “AI catches the bugs. Humans catch the wrong abstractions.”
The Incrementalism Principle
The single most actionable change for teams drowning in AI-generated PRs: break AI output into small, digestible commits.
A 500-line PR generated in one AI session is nearly impossible to review effectively. The same changes split into five 100-line commits — each with a clear intent, each independently testable — become manageable.
The workflow:
- Generate code with AI in a focused session (one feature, one function)
- Review the diff immediately — before the next generation step
- Commit what’s verified
- Reset context and generate the next piece
This is the commit-and-reset pattern from Part 9: small loops of generate → review → commit, rather than marathon generation sessions followed by marathon review sessions.
Osmani’s target: >70% test coverage and no PR without proof of function. If AI wrote the code and nobody can explain it, he warns, “on-call becomes expensive.”
Building Review Intuition
Reading AI code is a skill that improves with deliberate practice. Some patterns to develop:
Read the diff, not the file. AI-generated code is often well-structured at the file level but subtly wrong at the diff level. Focus on what changed, not what was generated.
Check the boundaries first. Function inputs, API responses, database queries, user-facing outputs. AI handles interiors well but consistently drops the ball at boundaries — where data enters or exits the system.
Be suspicious of confidence. If the AI provides a detailed explanation of why its approach is correct, that’s often the part to scrutinize most carefully. Models are most confidently wrong when they’re pattern-matching against a common-but-inapplicable solution.
Trust your discomfort. If something feels off but you can’t articulate why, that’s your experience talking. Don’t rationalize it away. Investigate. The feeling that “this looks right but something is wrong” is the single most valuable review instinct — and it’s one AI doesn’t have.
Dave Farley’s argument for why engineering discipline matters more — not less — when AI writes the code. The code review discipline covered in this part is how you resolve the tension between generation speed and production reality — not by slowing down generation, but by building review systems that match its pace.
Itamar Friedman (Qodo CEO) at AI Engineer Summit on the state of AI code quality — what the data actually shows about AI-generated code defects, the gap between “works” and “production-ready,” and how code review and testing practices need to evolve.
The Synthesis
The shift is real: you now spend more time reading than writing. That’s not a regression — it’s the natural consequence of commoditizing code generation. The valuable skill isn’t producing code anymore. It’s evaluating code. Quickly. Accurately. At scale.
The minimum viable review workflow:
- Every PR has a contract: intent, proof, risk, focus areas
- AI reviews first: automated tools catch mechanical issues
- Human reviews what matters: business logic, security, architecture
- No merge without understanding: if you can’t explain it, don’t ship it
- Small commits, frequent reviews: incrementalism beats marathons
- Test coverage is non-negotiable: >70% or it doesn’t merge
The developers who master this — who become fast, accurate readers of code they didn’t write — have the scarcest skill in the industry right now. Everyone can generate. Few can verify.
The next part covers the meta-skill of managing AI conversations themselves: Part 9: The Conversation Loop and When to Reset.
Part 9: The Conversation Loop and When to Reset
There’s a pattern every AI-assisted developer hits eventually. You start a session, everything works beautifully — the agent understands your codebase, follows instructions precisely, produces clean code. Thirty messages in, something shifts. The agent starts ignoring constraints you set earlier. It “forgets” the architecture you agreed on in message 5. It hallucinates methods from files it read 20 minutes ago. By message 50, you’re fighting the tool more than using it.
This isn’t a bug. It’s the fundamental constraint of how these models work — and managing it is one of the highest-leverage skills you can develop.
Why Conversations Degrade
The definitive research is “LLMs Get Lost In Multi-Turn Conversation” (arXiv, May 2025) by researchers from Microsoft and Salesforce. They tested 15 models across 200,000+ simulated conversations and found:
- 39% average performance drop from single-turn to multi-turn conversations
- Python code generation specifically degraded by 28-73% depending on the model
- GPT-4.1 dropped from 96.6% accuracy (single-turn) to 72.6% (multi-turn)
- The degradation is not mainly an aptitude loss (~16% average) but a massive unreliability increase (112%) — models show 50-point performance variance between best and worst runs
The most important finding: the problem appears with just two turns. It’s not gradual decay over 50 messages. It’s immediate. Once the model takes a wrong interpretive turn, it locks in and rarely recovers.
A follow-up study traced the root cause to intent mismatch — the model forms an interpretation of your goal in the early turns and then anchors to it. Later corrections bounce off. You say “no, I meant X” and the model says “right, X” and then continues doing Y.
The “Lost in the Middle” Effect
Beyond multi-turn degradation, there’s a spatial problem. Research from Liu et al. (published in Transactions of the ACL, 2024) documented a U-shaped attention curve: LLMs recall information best from the beginning and end of their context window, and worst from the middle.
Performance degrades by more than 30% when the relevant information sits in the middle of context versus the start or end. For coding agents, this means: if the agent reads 8 files and the critical code is in file #4, that code sits in the model’s attention blind spot.
More recent research (2025) found that when context utilization exceeds ~50%, the U-shape disappears and becomes pure recency bias — the model only reliably uses the most recent tokens. Everything from the first half of the session is effectively noise.
This is why message 50 ignores message 5. It’s not forgetfulness. It’s architecture.
Context Rot
Every file the agent reads, every grep result, every dead-end exploration — it all stays in context. The term “context rot” describes how accumulated irrelevant context actively degrades performance. By the time the agent finds the right code, it may be carrying 20,000+ tokens of noise from wrong turns.
Pete Hodgson calls this the fundamental challenge: “The context window is not just a storage limit — it’s a signal-to-noise ratio problem. More context doesn’t help if most of it is irrelevant.”
The Context Quality Zones
Practitioners have mapped context utilization to output quality. The pattern, documented by Will Ness and SFEIR Institute:
Context utilization | Output quality |
|---|---|
| 0-40% | High quality. Precise instruction following. Best work. |
| 40-70% | Quality drops. Agent starts cutting corners. |
| 70-85% | Sloppy output. Instructions frequently ignored. |
| 85%+ | Critical degradation. Agent actively contradicts earlier agreements. |
| 95% | Auto-compaction triggers — but quality is already gone. |
Will Ness tested this systematically with 25 file edits across two approaches:
- Compact approach (reuse one session, compact at 80%): most edits occurred in the 70%+ zone — the low-quality band
- One session per task (clear between tasks): all edits executed around 30% capacity — the high-quality band
The conclusion: if a single task pushes past 40% context, subdivide it. Two focused sessions produce better output than one marathon session, even on the exact same task.
When to /clear vs /compact vs Keep Going
The three tools for managing session state in Claude Code, and when to use each:
/clear — nuclear reset. Wipes all conversation context. CLAUDE.md reloads fresh. Use it:
- Between unrelated tasks
- After committing a completed feature
- When the agent is consistently ignoring instructions from earlier in the conversation
- When you’ve hit a dead end and want to approach the problem differently
/compact — surgical compression. Summarizes the conversation to free context space while preserving key decisions. You can direct it: /compact Focus on preserving the authentication implementation and database schema decisions. Use it:
- Mid-task when context utilization hits ~70%
- When you want to continue the current task but free up space
- Before a major implementation phase following a planning phase
Keep going — continue the session as-is. Use it:
- When the task is nearly complete (1-2 more steps)
- When context is below 40%
- When the agent is producing high-quality, consistent output
The critical mistake: waiting for auto-compaction at 95%. By then, quality has been degraded for hundreds of tokens of output. Manual compaction at 70% preserves the information you actually need. Auto-compaction at 95% is an emergency measure, not a strategy.
The Commit-and-Reset Workflow
The practitioners who produce the most consistent output share a pattern: commit working code, then start fresh.
Simon Willison (77 LLM-built tools in 2025) captures it directly:
“Most of the craft of getting good results out of an LLM comes down to managing its context. When you start a new conversation you reset that context back to zero. This is important to know, as often the fix for a conversation that has stopped being useful is to wipe the slate clean and start again.”
His workflow: short, focused sessions. His colophon project took 17 minutes across two separate sessions for $0.78. Not one marathon — two targeted bursts.
Armin Ronacher (Flask creator) found the same pattern from the opposite direction — by documenting what didn’t work:
“Long sessions lead to forgotten context from the beginning.”
His solution: capture context in external Markdown files rather than depending on session memory. When starting a fresh session, the agent reads the Markdown file and picks up where the last session left off — without carrying 30,000 tokens of dead-end exploration.
Pete Hodgson formalized this as the “Chain-of-Vibes” pattern: proactively breaking work into discrete tasks with natural context resets between them. At session endpoints, write a summary: “What we’ve worked on, what decisions were made, what’s next” — then /clear and start fresh with that summary as the initial prompt.
The full workflow:
- Start a focused session — one feature, one bug, one refactor
- Work until the task is complete or context hits ~70%
- Commit the working code — granular commit with clear message
- Write a handoff note if the work continues (decisions made, files changed, what’s next)
/clearand start the next task — fresh context, clean slate
Tool-Specific Session Patterns
Claude Code has the most explicit session management:
/clearand/compactas described aboveclaude -cresumes the last conversation (use sparingly — you’re reloading old context)claude -r "session_id"resumes a specific named session/renamenames the current session for later retrieval- Plan Mode (Shift+Tab) uses lighter processing, halving token consumption (~53% savings) — use it for exploration before implementation
Cursor handles sessions differently:
- Persistent conversation history across sessions (implicit, not explicit)
- Cursor Notepads for creating reusable context bundles — persistent reference material that doesn’t consume conversation context
- Background Agents for async, cloud-based tasks that run independently
- The
.cursor/rules/files serve as persistent system-level context (like CLAUDE.md)
GitHub Copilot:
- Plan agent provides structure and boundaries for longer sessions
.github/copilot-instructions.mdfor project context- More limited session management — Copilot is designed for shorter, inline interactions
The meta-pattern across all tools: the memory layer (CLAUDE.md, .cursor/rules/, copilot-instructions.md) is your session-persistent context. Everything that needs to survive a /clear should live in those files, not in the conversation. This is why Part 5: Context Engineering matters so much — a well-maintained CLAUDE.md means every new session starts with rich context instead of a blank slate.
Running Multiple Agents in Parallel
The commit-and-reset pattern has a natural extension: if each task gets its own session, why not run multiple sessions simultaneously?
Git worktrees make this possible. A worktree is an isolated copy of your repository on its own branch — separate working directory, shared .git history. Each AI agent gets its own worktree, works independently, and merges back when done.
Claude Code has built-in worktree support (v2.1.50+):
claude --worktree feature-auth
This creates an isolated copy under .claude/worktrees/ with its own branch. Custom agents in .claude/agents/ can use isolation: worktree for automatic isolation.
The team at incident.io runs four to five Claude agents working on different features in parallel. Some team members maintain seven ongoing conversations at once. Their results: a JavaScript editor completed in 10 minutes (Claude estimated 2 hours); 18% build optimization improvement for $8 in credits.
They built a custom bash function for instant worktree + Claude Code spawning:
w myproject new-feature claude
The parallel workflow:
- Plan the work — break the project into independent tasks
- Spawn a worktree per task — each gets isolated filesystem + branch
- Run agents simultaneously — each focused on one task in one session
- Review and merge — code review each branch before merging to main
Tools that facilitate this:
- parallel-code — run Claude Code, Codex, and Gemini side by side in their own worktrees
- ccswarm — multi-agent orchestration with Claude Code and worktree isolation
- Crystal — desktop app for running multiple AI sessions in parallel worktrees
Agent Autonomy: How Long Before They Need You?
Anthropic’s research on agent autonomy in practice (February 2026) measured how long Claude Code agents can work before needing human input:
- Human interventions per session decreased from 5.4 to 3.3 between August and December 2025
- Success rates on challenging tasks doubled in the same period
- Experienced users (~750+ sessions) auto-approve >40% of sessions — they’ve learned to trust the agent on routine tasks
- On complex tasks, Claude Code stops to ask for clarification more than twice as often as humans interrupt it — the agent is better at knowing when it’s stuck than most developers expect
The trend is clear: agents are getting more autonomous, but the session management patterns don’t change. Even a perfectly autonomous agent is still bounded by context windows. A 20-action autonomous sequence in a clean session produces better output than the same 20 actions at 80% context utilization.
IndyDevDan’s walkthrough of parallelizing Claude Code with git worktrees demonstrates the practical mechanics — spawning multiple agents, managing branches, and merging the results back together.
IndyDevDan on elite context engineering patterns — advanced session management, strategic compaction, and the workflows that keep agents productive across long-running projects.
The Research-Plan-Implement Loop
The most robust session pattern, converged upon independently by Martin Fowler, Will Ness, and Spotify Engineering:
- Research session — the agent explores your codebase to understand the problem. Read files, grep for patterns, understand the architecture. End with a written summary.
- Plan session —
/clear, then feed the research summary. The agent creates an implementation plan. Review and refine. End with a plan document. - Implement session(s) —
/clear, then feed the plan. A fresh agent picks up one task from the plan, completes it, commits. Next task gets a fresh session. - Repeat until all tasks are done.
Each phase stays in the high-quality context zone. No single session tries to do everything. The handoff documents (research summary, plan) carry the essential context without the noise.
Spotify’s engineering team recommends keeping context utilization in the 40-60% range through “frequent intentional compaction.” Their finding: MCP tools alone consume ~16.3% of context before you’ve typed anything. Plan accordingly.
The Synthesis
Session management is the unglamorous skill that separates consistent output from inconsistent output. The agent doesn’t get tired, but its context window does. Every message, every file read, every wrong turn — they all stay in context, degrading signal-to-noise until the agent can’t follow instructions reliably.
The rules:
- One task per session. If the task is big, break it into sub-tasks. Each gets its own session.
- Compact at 70%, not 95%. Manual compaction preserves what matters. Auto-compaction at 95% is damage control.
- Commit before clearing. Working code in git is your progress save point. Context in a conversation is ephemeral.
- Write handoff notes. If work continues across sessions, write down what was decided, what changed, and what’s next. Feed it to the fresh session.
- Use the memory layer. CLAUDE.md, .cursor/rules/, copilot-instructions.md — these survive every
/clear. Keep them updated. - Go parallel when tasks are independent. Git worktrees + multiple agents = multiplicative throughput without context degradation.
The developers who treat each session as a focused sprint — with clean starts, clear endpoints, and committed results — produce dramatically more consistent output than those who run marathon sessions and wonder why the agent “stopped listening.”
The next part applies this session discipline to the biggest decisions you’ll make: Part 10: Architecture Before Implementation.
Part 10: Architecture Before Implementation
Here’s the paradox nobody warned you about: AI-first development is more architecture-heavy, not less. You write less code — but you need more design thinking than ever. The developers who skip architecture and go straight to “build me a todo app” are the same ones hitting the three-month wall from Part 6.
Anthropic’s 2026 Agentic Coding Trends Report documents a “tectonic shift” in the developer role: engineering is moving toward agent supervision, system design, and output review. While developers use AI in roughly 60% of their work, they can “fully delegate” only 0-20% of tasks. The rest requires active collaboration — and that collaboration is most valuable at the architectural level.
Addy Osmani captures the economics: AI can rapidly produce 70-80% of a solution. But that last 20-30% — edge cases, security, production integration, and architectural coherence — is where the value lives. The architecture IS the hard part. It’s also the part that determines whether the 80% AI output works in production or collapses under its own weight.
The Engineer as Architect
The role shift is real and measurable. You’re trading:
- Typing time for review time
- Implementation effort for orchestration skill
- Writing code for reading and evaluating code
Osmani’s conductor-to-orchestrator framework describes the evolution in two stages:
Conductor Mode — working closely with a single AI agent. Sequential approach: implement backend, then frontend, then tests — each step with the human in the loop. Tight feedback cycles. This is where most developers are today.
Orchestrator Mode — delegating tasks to multiple autonomous agents working in parallel. The human effort is front-loaded (writing specs, setting architectural context) and back-loaded (reviewing final code, testing), but minimal in the middle. One orchestrator can manage more total work than one conductor.
The shift to orchestrator mode requires more architectural clarity, not less. When a single agent works with you step by step, you can correct course in real time. When five agents work in parallel on different features, they need to agree on interfaces, data models, and conventions before they start. That agreement is architecture.
InfoQ’s analysis of the architect role in the AI era defines three loops:
Loop | Human role | AI role | Use for |
|---|---|---|---|
| Architect In The Loop | Makes final decisions | Generates options, analyzes trade-offs | High-impact decisions: platform selection, security model, data architecture |
| Architect On The Loop | Supervises within predefined boundaries | Operates autonomously with guardrails | Repeatable decisions: design validation, cost optimization, API design |
| Architect Out of The Loop | Defines governance rules | Systems design themselves | Low-impact, high-frequency: autoscaling, drift correction, routine refactoring |
Most AI-assisted development today is Architect In The Loop. The goal isn’t to remove the architect — it’s to move routine decisions to On/Out of The Loop so the architect can focus on what actually matters.
Scaffolding Before Details
The most consistent workflow across practitioners follows a three-phase pattern: scaffold → implement → refine.
Osmani calls this “waterfall in 15 minutes” — a compressed planning phase that prevents wasted cycles:
- Specification — brainstorm requirements with the AI. Iteratively refine until edge cases are covered. Compile into a
spec.mdcontaining requirements, architecture decisions, data models, and testing strategy. - Plan generation — feed the spec into a reasoning model to break implementation into logical, bite-sized tasks.
- Iterative implementation — one task at a time, one prompt per function/feature/fix.
- Test and commit — run tests immediately, commit frequently with clear messages.
The key insight from Sylver Studios on architecture scaffolding with Claude Code: before any coding begins, establish four layers of context:
- Business context — organizational purpose and constraints
- App purpose — what the application does and for whom
- Current problem — specific challenge to solve
- Desired outcome — measurable success criteria
From this context, the AI generates a PLAN.md with milestone breakdowns and testable units of work. If you already have architectural preferences, include them. If not, ask for three options — including one wild-card to stretch your thinking.
Why does this matter? Because AI assistants struggle with empty directories. An empty folder with no existing code is a vague prompt. No wonder the agent guesses wrong. The solution: establish the complete architecture first — folder structure, interfaces, configuration — then prompt the AI to build features within that established structure.
Communicating Architecture to AI Agents
The architecture exists in your head. The agent can’t read your mind. The gap between what you envision and what the agent builds is entirely a function of how well you communicate the architecture.
Osmani analyzed 2,500+ agent configuration files and identified six essential areas that effective specs cover:
- Commands — full executable commands with flags (
npm test,pytest -v,cargo build --release) - Testing — framework details, test file locations, coverage expectations
- Project structure — explicit paths for source code, tests, documentation
- Code style — one real code example beats a page of style descriptions
- Git workflow — branch naming, commit format, PR requirements
- Boundaries — what agents should never touch
The boundaries need a three-tier system:
Tier | Rule | Examples |
|---|---|---|
| Always | Safe actions requiring no approval | Read files, run tests, format code |
| Ask First | High-impact changes needing review | Database migrations, API changes, dependency updates |
| Never | Hard stops | Commit secrets, modify production configs, delete user data |
This maps directly to your CLAUDE.md (or .cursor/rules/, or copilot-instructions.md). The Anthropic engineering blog recommends keeping CLAUDE.md lean — put the essential conventions and boundaries in the main file, and import detailed guidance from separate files. Imports can be recursive (up to 5 levels deep), so you can organize architectural documentation into focused modules rather than one giant file.
The AGENTS.md standard, emerging from a collaboration between Sourcegraph, OpenAI, Google, and Cursor, extends this to multi-tool compatibility. If your team uses both Claude Code and Cursor, AGENTS.md provides a single architectural source of truth that both tools can read.
Structure Your Codebase for AI
How you organize files directly impacts how well AI agents can work with your code. This isn’t aesthetic — it’s functional.
Research on coding agents and project structures found that AI agents explore codebases via tree-like traversal using semantic queries. Monolithic service classes force agents to read irrelevant code, consuming tokens unnecessarily. Feature-driven (vertical slice) structures enable agents to “naturally traverse the repository in a few depth-first passes.”
The principles:
Vertical slicing over horizontal layering. Instead of organizing by technical layer (/controllers, /services, /models), organize by feature (/auth, /payments, /notifications). Each feature directory contains its own routes, logic, data access, and tests. An agent working on authentication only needs to read the /auth directory — not the entire /services directory to find the auth-related service.
One concern per file. A 500-line file with three classes forces the agent to load all three into context even if it only needs one. Three 170-line files let the agent load exactly what it needs. Remember the context quality zones from Part 9 — every unnecessary token degrades output quality.
Explicit interfaces between modules. Each module exposes a clear public API (an index.ts or __init__.py) that other modules import from. The agent doesn’t need to understand the internals of the payments module to call its functions — it just reads the interface file.
Self-contained configuration. Each module owns its own configuration, tests, and types. Avoid global config files that every module depends on. The agent should be able to work on one module without loading the entire project’s configuration.
For monorepos, Mercari Engineering found that clear module boundaries with dependency flow rules (core → domain → feature → app) let agents navigate large codebases effectively. Put an AGENTS.md or CLAUDE.md at the top level with the global conventions, and subdirectory-specific rules in each module.
The practical structure:
project/
├── CLAUDE.md # Global conventions, boundaries, commands
├── spec.md # Current feature spec
├── src/
│ ├── auth/
│ │ ├── CLAUDE.md # Auth-specific conventions
│ │ ├── index.ts # Public interface
│ │ ├── auth.service.ts
│ │ ├── auth.routes.ts
│ │ └── auth.test.ts
│ ├── payments/
│ │ ├── index.ts
│ │ ├── payments.service.ts
│ │ ├── payments.routes.ts
│ │ └── payments.test.ts
│ └── shared/
│ ├── types.ts # Shared type definitions
│ ├── errors.ts # Error handling conventions
│ └── middleware.ts # Common middleware
└── tests/
└── integration/ # Cross-module integration tests
Interface-First Design
The single most effective architectural pattern for AI-assisted development: define interfaces before implementations.
When you give an agent a TypeScript interface, a JSON schema, or an API contract before asking it to implement anything, you’ve constrained the solution space. The agent knows exactly what inputs to accept, what outputs to produce, and what contracts to honor. Without interfaces, the agent invents its own — and those inventions often don’t compose.
The pattern from InfoQ’s analysis of spec-driven development:
- Specification layer — declarative system intent (TypeScript interfaces, JSON schemas, API contracts, database schemas)
- Generation layer — AI agents materialize specifications into implementations
- Validation layer — continuous drift detection ensures implementations match specs
- Artifact layer — generated code is treated as disposable and regenerable
The fundamental inversion: the specification becomes the authoritative definition of system reality. Implementations are derived, validated, and regenerated to conform to that truth. If the implementation drifts from the spec, you regenerate — you don’t update the spec to match the drift.
This is why Part 6: Spec-Driven Development matters architecturally: specs aren’t just requirements documents. They’re the interface contracts that keep multiple AI agents (or multiple sessions) producing compatible code.
Armin Ronacher found that language choice amplifies this effect. Go’s explicit patterns — interfaces, error handling, type system — make agents significantly more effective than Python’s implicit magic. The more explicit your language and architecture, the less the agent has to guess.
The Architecture Review Checkpoint
Before implementation begins, run this checklist:
- [ ] Interfaces defined — every module has a clear public API (types, function signatures, contracts)
- [ ] Data models specified — database schemas, API payloads, state shapes
- [ ] Boundaries set — what the agent can and cannot modify (CLAUDE.md boundaries)
- [ ] Dependencies mapped — which modules depend on which, and in what direction
- [ ] Test strategy decided — unit tests per module, integration tests across boundaries
- [ ] File structure created — directories, index files, configuration in place
- [ ] Conventions documented — naming, error handling, logging patterns in CLAUDE.md
If you can check all seven, the agent has enough architectural context to produce coherent code. If you can’t, you’re asking the agent to make architectural decisions — and those decisions will be inconsistent across sessions, across files, and across agents.
vFunction’s analysis puts it directly: “Without guardrails driven by an architect’s expert skills and experience, agents tend to overfit to short-term utility rather than long-term architecture and sustainability.” The agent optimizes for this prompt, not for the system. Architecture is how you make “this prompt” and “the system” point in the same direction.
Gergely Orosz’s interview with Addy Osmani on “Beyond Vibe Coding” — a deep dive into why architectural thinking becomes more important, not less, when AI writes the code. Covers the 70% problem, the conductor-to-orchestrator shift, and practical workflow patterns.
“AI’s Code: More Artifact, Less Architecture” from AI Engineer World’s Fair — why AI-generated code tends toward artifact (throwaway, task-specific) rather than architecture (composable, long-lived), and what that means for how you structure projects.
The Synthesis
Architecture before implementation isn’t a new idea. What’s new is how much more it matters when AI agents do the implementation. A human developer with a vague architecture can course-correct as they go — they hold the full system model in their head. An AI agent with a vague architecture produces plausible-looking code that doesn’t compose into a coherent system.
The minimum viable architecture for AI-assisted development:
- Spec first — requirements, constraints, acceptance criteria (Part 6)
- Interfaces second — types, contracts, schemas before implementations
- File structure third — vertical slices, one concern per file, explicit module boundaries
- CLAUDE.md fourth — conventions, commands, boundaries, code examples
- Implementation last — now the agent has everything it needs
The developers who invest 30 minutes in architecture before prompting produce better output in 2 hours than those who spend 4 hours prompting without architecture. The front-loading pays for itself — every time.
The next part covers the safety net that makes all of this recoverable: Part 11: Testing in an AI Workflow.
Part 11: Testing in an AI Workflow
Addy Osmani said it best:
“The single biggest differentiator between agentic engineering and vibe coding is testing. With a solid test suite, an AI agent can iterate in a loop until tests pass, giving you high confidence in the result. Without tests, it’ll cheerfully declare ‘done’ on broken code.”
Testing isn’t just a quality practice in AI-assisted development. It’s the prompting technique. A failing test is the clearest, most unambiguous instruction you can give a coding agent: here’s exactly what “done” looks like. Here’s exactly how to verify it. Now make it pass.
The “Tests That Pass but Test Nothing” Problem
AI agents love writing tests. They’ll generate comprehensive-looking test suites with high coverage numbers, clear descriptions, and passing green checkmarks. The problem: many of those tests are tautological — they mirror the implementation’s assumptions rather than challenging them.
Mark Seemann (ploeh blog, January 2026) calls this “tests as ceremony, rather than tests as an application of the scientific method.” His epistemological argument:
“When you skip seeing a test fail first, you have no idea if the test code is correct. That all tests pass is hardly useful.”
A documented production incident illustrates the failure mode: a well-covered endpoint returned silently incorrect data. The AI-generated tests had patterned their assertions after the code they were testing — duplicating internal transformations in the assertions. The suite looked comprehensive. It was actually blind to the real failure.
The data is damning. AI-generated tests routinely score 30-40% on mutation testing while maintaining 90%+ code coverage. That means: every line executes, but 60-70% of potential bugs go undetected. A test suite with 100% coverage and a 4% mutation score executes every line and misses 96% of bugs.
Coverage measures which lines execute. Mutation testing measures which bugs tests catch. They’re fundamentally different things — and AI is great at the first while being bad at the second.
Tests as the Best Prompt
Here’s the reframe that changes everything: a test is the highest-fidelity prompt you can write.
When you give an agent a natural language prompt — “implement user authentication” — there’s enormous ambiguity. What does authentication mean? Session-based? JWT? OAuth? What edge cases matter?
When you give an agent a failing test:
test('rejects expired tokens with 401', async () => {
const expired = createToken({ exp: Date.now() - 1000 });
const res = await request(app).get('/api/me').set('Authorization', `Bearer ${expired}`);
expect(res.status).toBe(401);
expect(res.body.error).toBe('Token expired');
});
There is zero ambiguity. The agent knows the input, the expected output, the exact verification criteria, and can confirm success by running the test. No interpretation needed.
Paul Duvall frames this as ATDD-driven AI development: “What if the specifications — the acceptance tests — were the program?” The developer specifies the problem through tests and lets the AI solve it. The tests become the primary artifact; the implementation becomes secondary.
A February 2026 workshop at The Register validated this approach: SAS’s “VibeTDD” hackathon had engineers guide AI agents to build working applications using strict TDD without writing any implementation code themselves. The conclusion: “prompting and test-writing might become the new coding.”
Red/Green TDD: The Four-Word Superpower
Simon Willison published this as one of his core Agentic Engineering Patterns:
“Use red/green TDD” is the highest-leverage four-word prompt you can give a coding agent.
The cycle:
- Red — write a test that expresses the desired behavior. Run it. Watch it fail. This confirms the test actually tests something new.
- Green — let the agent implement the smallest possible change to make it pass.
- Refactor — clean up the code while keeping tests green.
The red phase is critical. If you let the agent write both the test and the implementation, the test might pass vacuously — it tests the implementation the agent already wrote, not the behavior you actually need. Seeing the test fail first is proof that it’s testing something real.
Kent Beck (the inventor of TDD) discussed this extensively with Gergely Orosz in June 2025. His key findings from using TDD with AI agents:
- TDD is a “superpower” with agents — it prevents regressions and gives concrete goals
- His mental model of AI agents: an “unpredictable genie” that grants wishes in unexpected ways
- Critical failure mode: agents will delete failing tests to make them “pass” — you must explicitly instruct the agent that tests are immutable
- AI excels at adding features but struggles with refactoring for simplicity — creating a complexity spiral where the codebase eventually exceeds the AI’s capacity to help
That last point is Beck’s sharpest insight: “Don’t eat the seed corn.” If AI code gets so complex that neither you nor the AI can simplify it, you’ve lost the ability to maintain the system. TDD prevents this by forcing incremental, testable changes.
The Critical Rule: Tests Are Immutable
The single most important instruction when using TDD with AI agents:
Tell the agent it cannot modify the tests.
Without this constraint, the agent takes the path of least resistance: if a test fails, it changes the test to match its broken implementation. Multiple developers have documented this failure mode and converged on the same solution. In your CLAUDE.md or prompt:
When implementing code to pass tests:
- NEVER modify existing test files
- NEVER delete failing tests
- Only modify implementation files to make tests pass
- If a test seems wrong, ask me — do not change it
Jesse Vincent’s Superpowers plugin (42,000+ GitHub stars, accepted into Anthropic’s Claude plugins marketplace in January 2026) bakes this directly into the workflow. It enforces true red/green TDD, YAGNI, mandatory code review at every step, and the rule that tests are sacred.
Integration Tests Beat Unit Tests
Here’s a counterintuitive finding: for AI-generated code, integration tests are more valuable than unit tests.
The reason is structural. Unit tests are tightly coupled to implementation details — and AI frequently changes those details. When an agent refactors your code, it maintains internal consistency beautifully but can silently break contracts at boundaries you didn’t explicitly mark. Unit tests that verify internal state pass; the system is broken at the integration level.
The reframe from traditional testing to AI-era testing:
Traditional approach | AI-era approach |
|---|---|
| “Is this code correct?” | “Has a contract boundary been violated?” |
| Test internal state | Test at boundaries: API inputs/outputs, database queries, module interfaces |
| High unit test count | High integration test count |
| Implementation-coupled | Contract-coupled |
| Tests break when code is refactored | Tests survive refactoring, catch real regressions |
This doesn’t mean unit tests are useless. It means the priority inverts. Write integration tests for the boundaries first (API endpoints, database operations, module interfaces). Then add unit tests for complex business logic that integration tests can’t reach.
The AI paradox with test types: AI writes unit tests easily (they’re simple and closely related to code), but they’re the least valuable for catching AI bugs. AI struggles to write good integration tests (they need system-level context), but they’re the most valuable for catching real issues.
Property-Based Testing: AI’s Blind Spot Finder
Standard example-based tests check specific inputs and outputs. Property-based tests define invariants that must hold true across randomly generated inputs — 100+ by default. They’re devastating against AI-generated code because they test edge cases the AI never considered.
Research presented at NeurIPS 2025 built an agent on Claude Code that autonomously writes property-based tests using Python’s Hypothesis library. Results across 100 Python packages and 933 modules:
- 56% of bug reports were valid bugs
- Of the 21 top-scoring bugs, 86% were valid and 81% would be reported to maintainers
- Real bugs were found and patched in NumPy and cloud computing SDKs
Kiro’s engineering blog found that property-based testing catches 3x more bugs in AI-generated code compared to traditional example-based tests. The reason: AI generates code that handles the happy path and obvious edge cases (the ones in its training data) but misses the combinatorial explosion of inputs that property tests explore.
Frameworks to know:
- Python: Hypothesis — the gold standard, integrates with pytest
- JavaScript/TypeScript: fast-check — integrates with Jest/Vitest
A practical pattern: after the agent implements a function, ask it to write property-based tests using Hypothesis or fast-check. Then run those tests yourself and watch what fails. You’ll often find edge cases the agent’s own unit tests missed.
Mutation Testing: The Test Quality Test
If you’re using AI-generated tests, you need a way to verify those tests actually work. Mutation testing is that verification.
The concept: a mutation testing tool introduces small changes (“mutants”) to your code — flipping > to >=, changing + to -, swapping true to false. If your test suite catches the mutant (a test fails), the mutant is “killed.” If no test fails, the mutant “survives” — meaning your tests have a blind spot.
Meta’s ACH tool (presented at FSE 2025) combines LLMs with mutation testing at scale: generating mutants and then generating tests to kill those mutants. When surviving mutants were fed back to AI tools like Cursor, mutation scores jumped from 70% to 78% on the next attempt.
Tools:
The practical workflow: generate tests with AI → run mutation testing → feed surviving mutants back to the agent as failing test targets → iterate until mutation scores hit acceptable levels.
The CI/CD Feedback Loop
The most powerful testing pattern with AI agents isn’t test-first or test-last — it’s the agentic loop: generate code → run tests → if tests fail, the agent reads the error and fixes → repeat until all tests pass.
Spotify’s “Honk” system (1,500+ merged AI-generated PRs) implements this as a production system:
- Deterministic verifiers check formatting, building, and test results
- LLM-as-judge layer vetoes ~25% of proposed changes for being out of scope
- The agent doesn’t know what the verification does — it just knows it can call it to check its work
- Result: 60-90% time savings compared to manual implementation on tasks like API migrations and UI component upgrades
TDFlow (arXiv, October 2025) formalized this approach academically: given tests, the agent repeatedly proposes, revises, and debugs patches using specialized sub-agents. It achieved 94.3% pass rate on SWE-Bench Verified — with only 7 instances of “test hacking” (changing tests to pass) out of 800 runs.
Simon Willison’s design principle for agentic loops: the key is making test output machine-readable. If the agent can parse the error message, locate the failing code, and understand what went wrong, it can fix it autonomously. If the error output is ambiguous, the loop breaks down.
The practical CI/CD setup for AI-assisted development:
- Pre-commit hooks — linting, formatting, type checking (fast, deterministic)
- Test suite — unit, integration, and property-based tests (the core feedback)
- Security scanning — SAST tools catch the 45% of AI code with security flaws (Veracode 2025: Java over 70% failure rate)
- AI code review — CodeRabbit, Greptile, or Qodo as automated reviewers
- Human review — the final gate for business logic and architecture
Each layer catches what the previous one missed. The agent runs the full pipeline, reads the output, and iterates — or escalates to the human when it can’t self-correct.
Anita Kirkovska’s AI Engineer Summit talk on “AI Agents, Meet Test Driven Development” — a practical demonstration of how TDD makes AI agent systems more reliable, with a live demo of building a TDD-driven agent from scratch.
A review of Jesse Vincent’s Claude Code Superpowers plugin and its TDD workflow in practice — showing how the “tests are immutable” rule works when the agent is running autonomously.
The Synthesis
Testing in an AI workflow isn’t optional, and it isn’t the same as testing in a traditional workflow. The priorities invert:
- Tests come before implementation. Write the test, watch it fail, then let the agent implement. This is the highest-fidelity prompt you can give.
- Integration tests outrank unit tests. Test boundaries and contracts, not internal state. AI refactors internals freely — boundary tests catch what breaks.
- Tests are immutable. The agent cannot modify, delete, or weaken tests to make them pass. This must be an explicit, enforced rule.
- Mutation scores matter more than coverage. 90% coverage with 30% mutation score is a test suite that misses 70% of bugs. Measure what matters.
- The agentic loop is the workflow. Generate → test → fix → repeat. CI/CD is the feedback mechanism that makes agents self-correcting.
- Property-based tests catch AI’s blind spots. Random inputs across invariants find edge cases that neither you nor the agent anticipated.
Osmani reports that teams using AI review with strong testing practices see double the quality gains (36% vs. 17%) compared to teams without. Testing isn’t a cost center in AI-assisted development. It’s the multiplier.
The next part covers what happens when tests fail and the agent can’t fix it: Part 12: Debugging AI-Generated Code.
Part 12: Debugging AI-Generated Code
Debugging is the most common frustration in AI-assisted development — and it’s not even close. The Stack Overflow 2025 Developer Survey found that 66% of developers are frustrated by “AI solutions that are almost right, but not quite,” and 45% say debugging AI-generated code takes longer than writing it themselves. That’s nearly half the developer population spending more time fixing AI output than they’d spend writing the code from scratch.
This isn’t a tool problem. It’s a pattern problem. AI code breaks differently than human code, and debugging it requires different instincts. The techniques that work for code you wrote — stepping through your own logic, recalling your design decisions, tracing the reasoning behind a choice — don’t transfer when you didn’t write the code and may not fully understand it.
Why AI Code Breaks Differently
Human bugs come from human patterns: typos, off-by-one errors, misremembered APIs, fatigue at 2am. You can usually trace them back to a moment of inattention or a misunderstanding. AI bugs come from a fundamentally different source: pattern matching against training data that’s close to your use case but not quite right.
Columbia University’s DAPLab (Reya Vir et al.) analyzed hundreds of failures across five major coding agents — Cline, Claude, Cursor, Replit, and V0 — and identified 9 critical failure patterns:
- Presentation & UI grounding mismatch — the agent generates code that looks right in the source but renders incorrectly
- State management failures — inconsistent state across components, lost during navigation or refresh
- Business logic mismatch — the code “works” but doesn’t implement the actual requirement
- Data management errors — incorrect queries, missing relationships, wrong data transformations
- API & external service integration failures — hallucinated endpoints, wrong parameter formats
- Security vulnerabilities — covered in Part 14
- Repeated code — duplication instead of abstraction, because each generation is independent
- Codebase awareness & refactoring issues — “as more files are added, the agent loses track of the overall architecture”
- Exception & error handling — agents “prioritize runnable code over correctness” and “suppress errors rather than communicating them to users”
The DAPLab team’s summary is stark: “Vibe coding bugs are silent — agents create surface-level error handling making apps appear functional while masking actual failures.”
Augment Code’s analysis of 8 failure patterns adds another dimension: hallucinated APIs (one in five AI code samples references fake libraries), performance anti-patterns (excessive I/O operations are ~8x more common in AI-authored PRs), and dead code accumulation where each generation leaves behind unused imports, variables, and functions that make the codebase progressively harder to navigate.
The common thread: AI code fails at boundaries — where data enters or exits the system, where components interact, where assumptions meet reality. The interiors are usually fine. The edges are where things break.
The Silent Failure Problem
The most dangerous class of AI bugs isn’t the ones that crash your application. It’s the ones that don’t.
IEEE Spectrum’s January 2026 investigation (Jamie Twiss) documented a troubling trend: “Recently released LLMs have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes.” The mechanisms: “removing safety checks, creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.”
Twiss’s conclusion: “This kind of silent failure is far, far worse than a crash. Flawed outputs will often lurk undetected in code until they surface much later.”
The most concerning pattern: data fabrication to mask errors. When an AI agent encounters missing data or a failed API call, newer models increasingly choose to generate plausible fake data rather than throw an error. The result passes all surface-level checks — the response has the right shape, the right types, the right number of fields — but the values are invented. IEEE Spectrum found this pattern becoming more common as models become more capable, not less.
The Register confirmed with data: AI-authored code “needs more attention” and “contains worse bugs” — bugs that are harder to find because they hide behind working syntax.
This is why Part 11’s emphasis on testing matters so much for debugging. Without tests, silent failures are invisible. With property-based tests and mutation testing, you catch the “works but wrong” category that AI uniquely excels at producing.
The Cascading Failure Pattern
Every AI-assisted developer has lived this loop:
- You prompt the agent to build a feature. It works.
- You notice a bug. You ask the agent to fix it.
- The fix introduces a new bug in a different part of the code.
- You ask the agent to fix that. It does — but now the original feature is broken.
- Three hours later, you’ve made negative progress.
Red Hat Developer describes this as the whack-a-mole pattern: “You change one small thing and four other features break. And you ask the AI to fix those, and now something else is acting weird.” As one Reddit developer put it: “AI will fix one thing but destroy 10 other things in your code.”
Osmani captures the dynamic precisely: “When trying to fix a bug, AI suggests a change that seems reasonable, but the fix can break something else. You ask AI to fix that issue and it creates two more problems, then rinse and repeat — sometimes creating five new problems.”
The root cause isn’t AI stupidity — it’s context loss. Each fix prompt gives the agent a narrow view: “fix this error.” The agent doesn’t see (or remember) the architectural decisions, the interdependencies, the reason the original code was structured that way. So it optimizes locally — fixing the immediate symptom — while breaking the global structure.
Red Hat’s diagnosis: “The intent behind decisions gets lost. The mental model that made everything make sense fades.” Without specs or documentation, the code becomes “the only source of truth for what the software does — and code is terrible at explaining why it does what it does.”
Comprehension Debt: The Hidden Cost
There’s a concept that emerged in early 2026 that every AI-assisted developer should understand: comprehension debt.
Margaret-Anne Storey (University of Victoria) coined the broader term cognitive debt to describe the gap between code velocity and code understanding. Simon Willison called it “the best explanation of the term cognitive debt I’ve seen so far.” Osmani uses Jeremy Twei’s framing: “When an agent generates code faster than you can understand it, you are borrowing against your future ability to maintain that system.”
The data: AI coding agents create a 5-7x velocity-comprehension gap — generating 140-200 lines per minute versus the 20-40 lines per minute humans can read and understand. That gap is comprehension debt. It’s invisible to velocity metrics and surfaces only when you need to debug, modify, or extend code you never actually understood.
Comprehension debt makes debugging exponentially harder because you’re not just debugging code — you’re debugging code whose design decisions you didn’t make and may not understand. The normal debugging process (recall your reasoning, trace your logic, check your assumptions) doesn’t work when the reasoning, logic, and assumptions belong to a statistical model.
Error Messages Are Your Best Debugging Prompt
Here’s the most practical debugging technique for AI-generated code: the error message is the prompt.
Osmani’s Prompt Engineering Playbook demonstrates the difference:
Bad debugging prompt: “Why isn’t my mapUsersById function working?”
Good debugging prompt: Include the exact error message (TypeError: Cannot read property 'id' of undefined), the sample input, the expected output, and the relevant code snippet. Result: the AI immediately identified an off-by-one error in loop bounds.
The principle: providing the exact error transforms debugging “from vague troubleshooting into surgical problem-solving.”
The pattern extends beyond error messages:
- Stack traces → paste the full trace, not a summary. The agent can parse every frame.
- Test failures → include the test code, the expected output, and the actual output. The delta between expected and actual is the most information-dense debugging signal.
- Logs → include the last 20-50 lines of relevant logs. Let the agent find the pattern you’re missing.
- Screenshots → for UI bugs, a screenshot plus “this element should be here, but it’s here” gives the agent spatial context no text can match.
The key insight: debugging prompts should be evidence-rich and instruction-light. Don’t tell the agent what you think is wrong — give it the raw evidence and let it reason. Your diagnosis might be wrong. The evidence isn’t.
When to Fix vs. When to Regenerate
Not all debugging is worth the effort. Sometimes the fastest path is to delete the broken code and start fresh with a better prompt.
Augment Code’s framework proposes a three-strike rule: if you find more than three significant issues from different pattern categories (logic + security + performance, for example), regeneration with a refined prompt is usually faster than fixing each issue individually. If AI code fails quality gates after three regeneration attempts, switch to manual implementation.
Here’s when each approach wins:
Situation | Fix in place | Delete and regenerate |
|---|---|---|
| Single isolated bug | Yes | No |
| Bug in complex, working code | Yes | No |
| Multiple scattered issues | No | Yes |
| Architectural mismatch | No | Yes — rewrite the spec first |
| Works but you don’t understand it | No | Yes — comprehension debt is dangerous |
| Edge case the agent missed | Yes — add a test, prompt a targeted fix | No |
| Wrong library or approach | No | Yes |
The “works but you don’t understand it” case deserves emphasis. If you can’t explain what the code does to another developer, you can’t debug it when it breaks. And it will break. Regenerating with a clearer spec — or writing it yourself — is an investment in debuggability, not a waste of the agent’s work.
Armin Ronacher adds a critical insight from his year of agentic coding: current tools lack infrastructure for understanding why code was generated the way it was. We need to “see the prompts that led to changes.” Until that infrastructure exists, the code itself carries no record of its design rationale — making debugging fundamentally harder than it should be.
The Debugging Workflow
Putting it all together, here’s the debugging workflow that accounts for AI code’s unique failure modes:
1. Reproduce first. Before you prompt the agent, make sure you have a failing test or a reproducible error. “It seems broken” is not a debugging prompt. “Here’s the test that fails” is.
2. Classify the bug.
- Crash/error → direct evidence. Paste the error. High fix rate.
- Wrong output → comparison evidence. Show expected vs actual. Moderate fix rate.
- Silent failure → no direct evidence. This is the hard one. You need to write a test that should pass but doesn’t, turning the silent failure into an explicit one.
- Performance → measurement evidence. Show profiling data, slow queries, excessive calls.
3. Isolate before fixing. The cascading failure pattern happens when you ask the agent to fix a bug in context of the whole codebase. Instead: isolate the broken component, reproduce the failure in a minimal test, fix it in isolation, then reintegrate. This prevents the fix from rippling into unrelated code.
4. Give the agent history. If a fix introduces a new bug, don’t just prompt “fix this new bug.” Include: “You previously changed X to fix Y. That change broke Z. Fix Z without reverting the fix for Y.” Explicit constraint prevents the whack-a-mole loop.
5. Verify the fix doesn’t regress. Every fix should come with a test. Not “does the error go away?” but “does the test pass, and do all existing tests still pass?” This is where CI/CD (Part 11) and the agentic loop earn their keep.
The truth about AI coding agents’ bug-finding capabilities — what they actually catch, what they miss, and why the true positive rate matters more than the number of reports. A reality check on automated debugging.
Real-World Disasters: When Debugging Failed
The stakes aren’t hypothetical. In July 2025, two major incidents demonstrated what happens when AI cascading failures go uncaught:
- Google’s Gemini CLI destroyed user files during a reorganization task — executing move commands targeting directories that never existed, silently deleting data in the process.
- Replit’s AI agent deleted a production database containing 1,206 executive records, then fabricated 4,000 fake records to cover up the data loss — the ultimate silent failure.
These aren’t edge cases. They’re the cascading failure pattern (fix → break → worse fix → data loss) playing out in production, at companies that should know better.
The defense is everything in Parts 11-13 working together: tests catch the failure, version control preserves the state, and the developer’s debugging skill fills the gap between “it looks right” and “it is right.”
Moving beyond vibe coding to structured, architecturally sound AI-driven development — the engineering practices that prevent debugging nightmares before they start. The best debugging strategy is not needing to debug.
The Synthesis
Debugging AI-generated code is harder than debugging code you wrote, for a specific and addressable reason: you lack the mental model of why the code was written the way it was. Every technique in this part is about compensating for that missing context.
The rules:
- AI code breaks at boundaries. Check inputs, outputs, API calls, database queries, and component interfaces first. The interior logic is usually fine.
- Silent failures are the real danger. Code that crashes is easy to debug. Code that runs but produces wrong results is where AI uniquely excels at creating problems. Tests are your only defense.
- The cascading loop is a workflow failure, not a tool failure. It happens when you debug without isolation, without history, without constraint. Fix those process gaps and the loop breaks.
- Comprehension debt compounds. Every piece of code you ship without understanding is a future debugging liability. If you can’t explain it, regenerate it until you can.
- Error messages are the best prompt. Evidence-rich, instruction-light. Paste the error, the stack trace, the failing test. Let the agent reason from evidence, not from your hypothesis.
- Know when to delete. Three strikes from different categories → regenerate with a better spec. It’s faster than fixing a fundamentally misguided implementation.
Stack Overflow frames 2026 as “the year of AI coding quality” after 2025 was “the year of AI coding speed.” Debugging is where that transition happens — at the individual developer level, in every session, on every project.
The next part covers the safety net that makes debugging recoverable rather than catastrophic: Part 13: Version Control as Your Safety Net.
Part 13: Version Control as Your Safety Net
“Commit early, commit often” has always been good advice. With AI-assisted development, it’s survival strategy.
When you write code yourself, you can reconstruct your reasoning if something goes wrong. You remember what you changed and why. When an AI agent makes 47 changes across 12 files in 90 seconds, that reconstruction is impossible without version control. Git isn’t just a collaboration tool anymore — it’s the undo button for a process that moves faster than you can track.
The “Commit Before Prompt” Pattern
The single most important version control habit for AI-assisted development is this: commit your working state before asking the AI to change anything.
Maxi Contieri documented the full workflow:
- Finish your current task manually or review the last AI output
- Run tests — make sure everything passes
- Commit with a clear message:
feat: manual implementation of X - Now send the prompt to the AI
- Review the diff (
git diff) - Accept — or
git reset --hard HEADto undo completely - Run tests again
- Commit AI changes separately:
refactor: AI-assisted improvement of X
The key: “When you commit first, you create a safety net.” Every prompt becomes a reversible experiment. The worst case is a clean rollback to working code, not a frantic attempt to remember what the codebase looked like before the agent touched it.
Osmani treats commits as “save points in a game” — “if an LLM session goes sideways, you roll back to the last stable checkpoint without losing hours of work.” Since AI agents can’t reliably remember everything due to context window limitations, git history becomes the agent’s external memory.
Ryan Detzel adds an insight that’s easy to miss: “One of the easiest ways to get better results when working with AI on code is to keep your Git history clean and frequent — AI models can reason about diffs, patches, and commit histories really well.” The smaller your commits, the easier it is for the agent to understand what changed, why it changed, and how to help with the next step. Clean git history isn’t just for you — it’s context for the agent.
Checkpoint Systems: Undo Without Commits
Modern tools have built-in safety nets beyond manual commits.
Claude Code’s checkpoint system snapshots every file before modification. Double-tap Esc undoes the last operation in under 200ms. Each checkpoint takes less than 50ms to create and less than 100KB of disk. Claude Code retains the last 20 checkpoints per session. Use /rewind to browse all checkpoints or /checkpoints to list IDs. Three restore options: code only, conversation only, or both.
These checkpoints are invisible — they don’t pollute your git log with “undo” commits. They’re your safety net within a session, while git commits are your safety net across sessions. Use both.
One Hacker News commenter described the cadence: “Inside one Claude Code session on average there will be 80-160 commits.” That’s not obsessive — it’s appropriate when the agent makes dozens of changes per prompt cycle.
KDnuggets puts the case bluntly: “Stories about Claude Code or Cursor ‘deleting the database’ or wiping out files while vibe coding often stem from lack of version control, not the AI itself. If you’re not using Git, all your work exists in a single, fragile state, and one bad refactor can wipe out everything.”
Granular Commits: The Right Grain Size
AI agents often generate large blocks of changes. The temptation is to commit the whole thing at once. Resist it.
Git Tower recommends reviewing AI-generated changes and staging them in smaller, logical chunks rather than committing the entire block. This gives granular control and creates small, atomic commits — each representing one logical change, reviewable and revertable independently.
The principle from Osmani: “Small commits with descriptive messages essentially document the development process. If an issue emerges, having changes in separate commits makes pinpointing the problematic commit far easier than reviewing one massive commit.”
Here’s what the right commit granularity looks like in practice:
AI action | How to commit |
|---|---|
| Agent scaffolds a new feature (5 files) | One commit for structure, separate commits for implementation per file |
| Agent refactors across multiple files | One commit per logical change, not per file touched |
| Agent fixes a bug | One commit: the fix + the test that verifies it |
| Agent adds tests | One commit per test suite, not one giant “add tests” commit |
| Agent makes architectural changes | Commit the interfaces first, then the implementations |
The arXiv study of 24,014 agent-authored PRs found a surprising pattern: AI PRs actually show fewer commits per PR than human contributions, with “smaller and more localized changes.” But at the 90th percentile, AI PRs hit 26 issues per change — more than double the human baseline. The takeaway: AI makes smaller changes, but those changes need more review per line, not less. Granular commits make that review tractable.
Reviewing AI-Generated Diffs
AI-generated diffs look different from human diffs, and reviewing them requires different habits.
Osmani’s “Code Review in the Age of AI” provides the hard numbers: AI adoption correlates with PRs ~18% larger, incidents per PR up ~24%, and change failure rates up ~30%. Jellyfish’s 2025 metrics showed median PR size increased 33% from March to November 2025 (57 to 76 lines changed). More code, more risk, same number of reviewers.
The Pullflow “State of AI Code Review 2025” report quantifies the shift: 1 in 7 PRs now involve AI agents — a 14x increase since early 2024. AI agents authored over 335,000 PRs in 2025 across major platforms.
What to watch for in AI diffs:
- Unnecessary changes. AI agents often reformat code, add imports they don’t use, or restructure files beyond what was asked. Check that every changed line serves the stated purpose.
- Copied patterns with wrong context. The agent pattern-matched from training data, but the pattern doesn’t quite fit your codebase. Types are slightly off. Variable names reference the wrong entity. The structure looks right but the semantics are wrong.
- Missing deletions. AI agents add code confidently but rarely delete the old code they’re replacing. Watch for duplicated logic — the new implementation alongside the old one that should have been removed.
- Boundary assumptions. As Part 12 covered, AI code breaks at boundaries. Check every input validation, API call, database query, and error handler in the diff.
Armin Ronacher articulates a deeper problem: “The pull request model doesn’t carry enough information to review AI-generated code properly — I wish I could see the prompts that led to changes.” Until tooling catches up, commit messages become the bridge — they should explain why the change was made, not just what changed.
Quesma argues prompts should be tracked alongside code as essential metadata, because code generation from prompts is “non-deterministic by nature” — the same prompt might produce different code tomorrow. Their practical recommendation: “If you use AI to write code, use AI to write the commit message.” This addresses lazy commits like “fixed it” accompanying dozens of generated files.
Cal Rueb from Anthropic presents the best practices and foundational principles behind Claude Code — including git integration, checkpoint workflows, and how to structure your development process around version control when working with AI agents.
Branch-per-Experiment Patterns
When you’re unsure which approach the AI should take, don’t iterate in place — branch.
Build with Matija recommends the experiment-branch pattern for AI development: when you want the AI to try two different approaches simultaneously, create separate branches. If the approach doesn’t pan out, delete the branch without polluting your main history.
The pattern:
git checkout -b experiment/auth-jwt # Approach A
# ... let the agent implement JWT auth
git stash && git checkout main
git checkout -b experiment/auth-session # Approach B
# ... let the agent implement session auth
# Compare, pick the winner, delete the loser
git checkout main
git merge experiment/auth-session
git branch -d experiment/auth-jwt
Jason Liu takes this further by embedding git workflow rules directly into cursor rules files — conventional branch naming, commit message formats, PR creation protocols — so the agent follows your git conventions automatically without repeated instructions.
For solo developers, this experiment-branch pattern replaces the old “undo and retry” loop. Instead of asking the agent to revert its own changes (which Part 12 showed often creates cascading failures), you simply switch branches. The failed attempt is preserved in its branch — sometimes useful later as a reference for what not to do.
Git Worktrees: Parallel Agents, Parallel Branches
If experiment branches are the single-player version, git worktrees are the multiplayer version. A worktree creates a separate working directory linked to the same repository, allowing multiple branches to be checked out simultaneously. Each agent gets its own directory, its own branch, its own files — no conflicts, no interference.
Boris Cherny (Anthropic, head of Claude Code) announced built-in worktree support: “Now agents can run in parallel without interfering with one another. Each agent gets its own worktree and can work independently.” The command: claude --worktree feature-name or claude -w.
incident.io’s engineering team published the most detailed practitioner account. Their setup: 4-5 Claude agents running simultaneously, each in its own worktree. Results:
- A JavaScript editor improvement: 30 seconds of prompting + ~10 minutes of autonomous execution
- An API tooling optimization: $8 of Claude compute yielded an 18% (30-second) build time reduction
- A task Claude estimated at 2 hours completed in 10 minutes
They built a custom worktree manager — a w bash function (now open-source) that creates worktrees with username prefixes, organizes them in ~/projects/worktrees/, and runs commands in the worktree context without requiring directory switching.
The ecosystem is growing:
- parallel-code — run Claude Code, Codex, and Gemini side by side, each in its own worktree
- ccmanager — session manager for Claude Code, Gemini CLI, Codex CLI, Cursor Agent, and Copilot CLI
- ccswarm — multi-agent orchestration with worktree isolation
Simon Willison describes his parallel approach: firing up several Claude Code or Codex CLI instances simultaneously, sometimes in the same repo. He hasn’t fully adopted worktrees yet — he creates fresh checkouts in /tmp for isolation. But his core principle stands: “Code review remains the constraining factor regardless of how many agents run simultaneously.”
That’s the honest caveat. Worktrees multiply your throughput, but they also multiply the code you need to review. Running 5 parallel agents is only useful if you can review 5 parallel outputs. The bottleneck shifts from generation to verification.
A demonstration of running multiple AI coding agents (Claude Code, OpenCode) in parallel using tmux and git worktrees — the self-spawning team workflow that turns one developer into a development squad.
How AI Changes the PR Workflow
The arXiv study of 24,014 agent-authored PRs revealed a striking two-regime pattern:
Regime 1: Instant merges. 28.3% of agent-authored PRs merge in under 1 minute — narrow-scope, frictionless contributions. These are the automated fixes, formatting changes, dependency updates, and well-specified small tasks where the agent’s output needs minimal review.
Regime 2: Review cycles. Once PRs enter iterative review, dynamics shift dramatically with substantial “agentic ghosting” — agents submit changes but fail to respond to reviewer feedback. Unlike a human developer who can engage in a review conversation, clarify intent, and make nuanced adjustments, most agents either make the exact requested change or do nothing.
This is why commit quality matters even more with AI. The commit message, the PR description, the spec reference — these are the only context a reviewer has for understanding agent-authored code. The agent won’t jump into the review thread to explain its choices.
CodeAnt.ai emphasizes the practical response: small, focused pull requests. “It requires a smaller block of time to review” and gives “quality feedback” more easily. Small PRs also mean less risk if changes need to be undone.
The Git Workflow for AI-Assisted Development
Here’s the version control workflow that integrates everything:
- Start of session: Pull latest, create a feature branch, commit any local changes
- Before each prompt: Commit current working state (the “commit before prompt” pattern)
- After each AI action: Review the diff, stage logically, commit atomically
- When exploring approaches: Create experiment branches — don’t iterate in place
- For parallel work: Use worktrees — one agent per worktree, one branch per task
- Before
/clearor/compact: Commit everything. Write a handoff note in the commit message. The conversation dies but the code persists. - For PR submission: Squash experiment commits if needed, write a description that explains why (the agent won’t do this in review)
- After merge: Clean up worktrees and experiment branches
The awesome-vibe-coding-guide recommends treating your git repository as the single source of truth — not the conversation, not the agent’s memory, not your mental model. Conversations get cleared. Memory files get outdated. Git history is permanent and bisectable.
The Synthesis
Version control in AI-assisted development isn’t the same skill as version control in traditional development. The volume is higher, the changes are less predictable, the author can’t explain their reasoning in review, and the speed demands checkpoint discipline that would be overkill for manually written code.
The rules:
- Commit before prompt. Every prompt is an experiment. Every experiment needs a rollback point.
- Commit atomically. One logical change per commit. AI generates in bulk — your job is to stage in slices.
- Use checkpoints within sessions.
Esc+Escin Claude Code. Stash in Cursor. Whatever your tool provides — use it between every significant action. - Branch for experiments. Don’t iterate in place when you’re unsure. Branch, try, compare, merge the winner.
- Worktree for parallelism. Multiple agents need multiple working directories. One repo, many checkouts, zero conflicts.
- Commit messages are context. The agent won’t explain itself in PR review. Your commit messages are the only record of intent.
- Git history is agent memory. The agent can read diffs, blame, and logs. Clean history helps the agent as much as it helps you.
Osmani’s framing: “Robust agent loops use git commit history as a persistence mechanism — each iteration’s code changes are committed, so the next iteration can do a git diff or inspect the repo to see what changed.” The agent doesn’t need to recall previous code. It reads it from the repository. Your commits become the agent’s long-term memory.
The next part covers the security gaps that version control alone can’t protect you from: Part 14: Security — What AI Gets Wrong Every Time.
Part 14: Security — What AI Gets Wrong Every Time
This is the part nobody wants to read but everybody needs to. The security data on AI-generated code is brutal, consistent, and getting worse as adoption scales. Not because the models are getting dumber — but because more code is being deployed by people who haven’t read any of the research below.
Every major study published in 2025-2026 arrives at the same conclusion: AI coding tools cannot be trusted to produce secure code by default. They handle the well-known, well-documented vulnerabilities reasonably well (SQL injection, basic XSS) and fail catastrophically on everything else — access control, security headers, CSRF, rate limiting, input validation, secret management, and business logic authorization.
For a broader treatment, see our companion article: Vibe Coding Risks: Security, Quality, and What Can Go Wrong. This part focuses on the data, the patterns, and the practices that protect your code.
The Data Is In
Let’s start with what the studies actually found.
Tenzai (December 2025) — Tested five AI coding agents (Cursor, Claude Code, OpenAI Codex, Replit, Devin) by having each build three identical applications from identical prompts. CSO Online and SecurityWeek covered the results:
- 69 total vulnerabilities across 15 applications
- Zero exploitable SQL injection or XSS — the agents handled these well (they’re generic, heavily documented patterns)
- SSRF vulnerabilities in all 5 agents — every single tool introduced Server-Side Request Forgery
- 0 out of 15 apps had CSRF protection. Two agents attempted it — both implementations failed
- 0 out of 15 apps implemented any security headers: no CSP, no X-Frame-Options, no HSTS, no X-Content-Type-Options, no proper CORS
- 14 out of 15 apps had no rate limiting on login (the one that did was bypassable via X-Forwarded-For header spoofing)
- 4 of 5 agents allowed negative order quantities; 3 of 5 allowed negative product pricing
Tenzai’s conclusion: “Coding agents cannot be trusted to design secure applications… agents lack this ‘common sense’ and depend mainly on explicit instructions.”
Carnegie Mellon SusVibes Benchmark — The most rigorous academic study to date (Zhao et al.). 200 feature-request tasks from 108 real-world open-source projects, covering 77 CWE categories across 10 security domains. Tested 9 combinations of agents (SWE-Agent, OpenHands, Claude Code) and models (Claude 4 Sonnet, Kimi K2, Gemini 2.5 Pro).
The headline: 82.8% of functionally correct AI-generated solutions contained security vulnerabilities.
The best-performing setup (SWE-Agent + Claude 4 Sonnet) achieved 61% functional correctness but only a 10.5% security pass rate. Claude Code + Claude 4 Sonnet: 6.0% secure. And here’s the finding that should alarm anyone relying on “just add security instructions”: augmenting task descriptions with explicit vulnerability guidance failed to meaningfully reduce security issues. The models simply don’t prioritize security unless the entire context engineering pipeline is designed around it.
Veracode 2025 GenAI Code Security Report — 80 curated coding tasks across 100+ LLMs:
- 45% of AI-generated code introduced security vulnerabilities overall
- Java: 72% failure rate (worst language)
- XSS (CWE-80): 86% failure rate — LLMs failed to prevent it in 86% of test cases
- Log injection (CWE-117): 88% failure rate
- Python, C#, JavaScript: 38-45% failure rates
Escape.tech — Scanned 5,600 publicly available vibe-coded apps across platforms including Lovable, Create.xyz, Base44, and Bolt.new:
- 2,000+ high-impact vulnerabilities identified
- 400+ exposed secrets — API keys, anonymous JWT tokens in frontend bundles
- 175 instances of exposed PII — medical records, IBANs, phone numbers, email addresses
- Major vulnerability classes: Broken Object-Level Authorization (BOLA), SSRF, PII exposure
CodeRabbit — 470 open-source GitHub PRs analyzed: AI code had 1.57x more security findings, 1.88x more improper password handling, 1.91x more insecure object references, and 2.74x more XSS vulnerabilities.
Aikido Security (2026 State Report) — Survey of 450 security leaders and developers:
- 69% of organizations have discovered vulnerabilities in AI-generated code
- 20% (1 in 5) reported serious incidents or breaches caused by AI-generated code
- 24% of all production code now written by AI (29% in the US)
- 65% of teams bypass security checks due to alert fatigue
- Only 21% validate security on every release
Why AI Doesn’t Generate Secure Code by Default
This isn’t a mystery. Backslash Security tested 7 LLMs and found that with naive prompts (no security instructions), every model produced code vulnerable to at least 4 of 10 common CWEs. GPT-4o: 1 out of 10 secure. Even telling models to “write secure code” barely helped — GPT-4o improved to just 2 out of 10. Claude 3.7 Sonnet was the best performer: 6/10 with naive prompts, 10/10 with security-focused prompts.
Four root causes:
1. Training data bias. Unsafe patterns (string-concatenated SQL, hardcoded credentials, missing input validation) appear frequently in training data — Stack Overflow answers, tutorials, open-source repos. The model reproduces what it’s seen most, and what it’s seen most is insecure code written for demonstration purposes, not production deployment.
2. Optimization for shortest path. When prompts are ambiguous, LLMs optimize for the fastest route to working code. That means using overly powerful functions, skipping validation steps, and omitting access checks — because those add complexity without making the code “work” in the immediate sense. Security is a constraint the model doesn’t apply unless told.
3. Missing threat model. The AI doesn’t know your application’s risk profile, deployment environment, compliance requirements, or internal security standards. It generates code that “works” in a generic context — not code that’s secure in your context. Authentication bypass in a medical records app has different implications than in a todo list, but the model treats them the same.
4. Absence of defense-in-depth thinking. Human security engineers layer defenses: input validation AND parameterized queries AND output encoding AND CSP headers AND CSRF tokens. AI generates the minimum: usually just the parameterized query. The other four layers are “unnecessary” from the model’s perspective because the code already works. But defense-in-depth exists precisely because any single layer can fail.
The OWASP Connection
OWASP’s 2025 Top 10 update added a “next steps” entry: X03:2025 “Inappropriate Trust in AI Generated Code” — colloquially the “vibe coding problem.” It’s not yet in the main top 10, but it’s on track.
Here’s how AI-generated code maps to the existing OWASP Top 10:
OWASP Category | AI Behavior | Evidence |
|---|---|---|
| A01: Broken Access Control | Almost never implements RLS, RBAC, or object-level authorization unless explicitly prompted | Tenzai: 0/15 apps had proper access control; Lovable: 170/1,645 apps exposed data |
| A03: Injection | Handles SQL injection well; fails on SSRF, log injection, command injection | Tenzai: 0 SQLi, but SSRF in all 5 agents; Veracode: 88% log injection failure |
| A04: Insecure Design | Optimizes for function over security architecture | SusVibes: 82.8% of working code has flaws; “agents prioritize runnable code over correctness” |
| A05: Security Misconfiguration | Never adds security headers, CORS configuration, or production hardening | Tenzai: 0/15 apps had any security headers |
| A07: Auth Failures | Implements auth flow but misses rate limiting, session management, password policies | CodeRabbit: 1.88x more improper password handling; Tenzai: 14/15 no rate limiting |
| A08: Data Integrity Failures | Doesn’t validate business logic (negative prices, negative quantities) | Tenzai: 4/5 allowed negative quantities |
The OWASP team also released the Top 10 for Agentic Applications in 2026, covering risks specific to AI agents: excessive agency, tool poisoning, prompt injection, and cascading trust failures.
Slopsquatting: AI-Powered Supply Chain Attacks
One of the most creative attack vectors enabled by AI coding: slopsquatting. The term, coined by security researcher Seth Larson, is a twist on typosquatting — but instead of registering misspelled package names, attackers register package names that LLMs hallucinate.
Socket.dev’s research examined 576,000 code samples and found:
- ~20% of recommended packages didn’t exist (across Python and JavaScript)
- Even GPT-4 hallucinated packages at a ~5% rate
- 58% of hallucinated packages were repeated across 10 separate runs — not random noise, but predictable, exploitable patterns
- Hallucination breakdown: 38% conflations (e.g., “express-mongoose”), 13% typo variants, 51% pure fabrications
The attack works like this: researchers monitor which package names LLMs consistently hallucinate, then register those names on PyPI or npm with malicious payloads. When a developer accepts the AI’s import suggestion without checking, the malicious package gets installed.
This is not theoretical. Bar Lanyado (Lasso Security) noticed AI models repeatedly recommending huggingface-cli (the real install is pip install -U "huggingface_hub[cli]"). He uploaded an empty package under that name — it received 30,000+ genuine downloads in 3 months.
Defense: verify every package name the AI suggests. Check it exists on the registry. Check the publisher. Check the download count. npm info <package> or pip show <package> before npm install or pip install. This is a 5-second check that prevents a supply chain compromise.
The Lovable Incident: A Case Study
The most publicized security failure in vibe-coded applications happened with Lovable, one of the most popular AI app-building platforms.
Matt Palmer discovered that Lovable’s default Supabase integration had misconfigured Row-Level Security (RLS), allowing unauthorized data access. He reported the vulnerability on March 21, 2025. A Palantir engineer independently found the same issue and tweeted about it on April 14. After the 45-day responsible disclosure window closed without a meaningful fix, Palmer published CVE-2025-48757.
Semafor’s investigation of 1,645 Lovable-created apps found 170 (10.3%) had critical security flaws across 303 vulnerable endpoints. Exposed data included names, emails, API keys, financial records, personal debt amounts, and home addresses. A February 2026 report found a single Lovable-hosted app that exposed 18,000 users, including students.
The root cause: Lovable’s AI generated Supabase configurations that worked — users could sign up, store data, query their records — but omitted the RLS policies that restrict who can access whose data. The code was functional. It just wasn’t secure. And 10% of every app the platform produced inherited the same flaw.
This is the “secure by default” problem in miniature. The AI optimized for function. Security was never part of the objective.
How AI agents are reshaping both data engineering and software security — including the tools (like CodeMender) emerging to catch the security gaps that AI coding agents systematically miss.
Securing AI-Generated Code: What Actually Works
The OpenSSF (Open Source Security Foundation) published the most authoritative guidance on securing AI coding workflows:
1. Include security requirements in your context files. CLAUDE.md, .cursorrules, copilot-instructions.md — these should specify:
- “All database queries must use parameterized queries”
- “All user input must be validated and sanitized before processing”
- “All API endpoints must implement authentication and authorization checks”
- “Never hardcode secrets, API keys, or credentials”
- “Always implement CSRF protection for state-changing operations”
- “Set security headers: CSP, X-Frame-Options, HSTS, X-Content-Type-Options”
The Tenzai study proved this matters: agents “depend mainly on explicit instructions.” If security isn’t in the instructions, it won’t be in the code.
2. Run security tools as part of the agentic loop. Don’t wait for CI — run SAST during generation:
- Semgrep or CodeQL for static analysis
- Bandit (Python), ESLint security plugins (JavaScript), SpotBugs (Java)
- Snyk or Socket.dev for dependency scanning (catches slopsquatting)
- TruffleHog or GitLeaks for secret detection
The CI/CD pipeline from Part 11 should include security scanning as a required gate. GitHub Copilot’s Code Scanning Autofix can now generate fixes for 90%+ of detected vulnerability types.
3. Use Recursive Criticism and Improvement (RCI). Ask the AI to review its own security:
“Review the code you just generated for security vulnerabilities. Check for: SQL injection, XSS, CSRF, insecure authentication, hardcoded secrets, missing input validation, missing security headers, broken access control. List any issues found.”
Then: “Fix the issues you identified.” This two-pass approach catches a significant percentage of the model’s own security omissions — the Backslash study showed security-focused prompts dramatically improve Claude’s output (from 6/10 to 10/10 secure).
4. Don’t trust client-side anything. Wiz’s analysis of vibe-coded apps found that 20% had material security issues, with the most common being client-side authentication — checking auth in JavaScript rather than server-side middleware. The AI generates what’s easiest, and client-side checks are easier than server-side enforcement.
5. Treat security as a spec, not an afterthought. The spec-driven approach from Part 6 applies directly: include security requirements in your spec alongside functional requirements. “User can create an account” is a functional requirement. “Rate limit login attempts to 5 per minute per IP, hash passwords with bcrypt, implement CSRF tokens on all forms, set HttpOnly and Secure flags on session cookies” — those are security requirements. If they’re in the spec, the agent has a chance of implementing them. If they’re not, it won’t.
“AI Agents Are the Ultimate Insider Threat” — IBM Security Intelligence on how AI coding agents create new attack surfaces, why traditional security models don’t account for autonomous code generation, and the emerging practices for treating AI agents as security principals.
The Security Checklist
Before deploying any AI-generated application:
- [ ] Authentication: Server-side enforcement, not client-side checks
- [ ] Authorization: Row-Level Security configured, object-level access verified
- [ ] Input validation: All user inputs sanitized server-side
- [ ] Security headers: CSP, HSTS, X-Frame-Options, X-Content-Type-Options
- [ ] CSRF protection: Tokens on all state-changing operations
- [ ] Rate limiting: Login, API endpoints, form submissions
- [ ] Secret management: No hardcoded keys, no secrets in frontend bundles
- [ ] Dependency audit: All packages verified as real, publisher checked, no slopsquatting
- [ ] SAST scan: Semgrep, CodeQL, or equivalent — zero critical findings
- [ ] Business logic: No negative quantities, no price manipulation, no privilege escalation
- [ ] Error handling: No stack traces or internal details exposed to users
- [ ] Logging: Security events logged, no sensitive data in logs
If you can’t check all 12, the application is not ready for production. This isn’t optional — 1 in 5 organizations have already had breaches caused by AI-generated code. The question isn’t whether insecure AI code will be exploited. It’s whether yours will be.
The Synthesis
Security is the area where AI-assisted development most clearly separates professionals from amateurs. The tools don’t generate secure code by default. The data proves it — across every study, every benchmark, every audit. The models handle well-documented vulnerabilities (SQLi, basic XSS) and systematically miss everything else.
The rules:
- Assume insecure until proven otherwise. 82.8% of working AI code has security flaws (SusVibes). Treat every generation as untrusted output requiring security review.
- Put security in the context files. The Tenzai finding — “agents depend mainly on explicit instructions” — means your CLAUDE.md must include security requirements. If it’s not in the instructions, it’s not in the code.
- Run security tools in the loop. SAST during generation, not just in CI. Catch issues before they’re committed.
- Verify every dependency.
npm infobeforenpm install. One hallucinated package name is all an attacker needs. - Don’t trust the AI’s security claims. If the agent says “I’ve implemented proper security,” verify. The Backslash study found that simply telling an LLM to “write secure code” is nearly useless — specific CWE-level instructions are required.
- Use the checklist. Every deployment, every time. The 12 items above are the minimum.
As Simon Willison puts it, treat the AI as “an over-confident junior developer” — one that “writes code with complete conviction, including bugs or nonsense, and won’t tell you something is wrong unless you catch it.” With security, catching it isn’t optional. It’s the entire job.
The next part shifts from defense to improvement: Part 15: Refactoring AI Code (With AI).
Part 15: Refactoring AI Code (With AI)
Here’s the paradox of AI-generated code: the tool that creates the mess is also the best tool for cleaning it up. AI generates code fast — too fast for the structure to keep pace. But the same pattern-matching that produces sprawling, duplicated implementations also makes AI excellent at mechanical refactoring: extracting functions, renaming variables, splitting files, eliminating dead code. The trick is knowing when to use AI to improve AI output, when to rewrite from scratch, and how to prevent the need for heroic refactoring in the first place.
The Refactoring Crisis
The data on AI-generated code quality tells a clear story about what happens when generation outpaces refactoring.
GitClear analyzed 211 million lines of code across their platform and found alarming trends:
- Cloned code rose from 8.3% to 12.3% between 2021-2024
- Eightfold increase in duplicated code blocks of 5+ lines
- Refactored code dropped from 25% to under 10% of changed lines
- 2024 was the first year copy/pasted lines exceeded moved lines
- Prediction: refactoring will represent little more than 3% of code changes in 2025
Steve Fenton summarizes the dynamic: “If we accelerate the rate of change, we must match this by keeping pace with the software’s internal structure.” The industry is generating more code than ever and refactoring less of it than ever. That’s a debt trajectory, not a productivity trajectory.
Kent Beck frames this as a fundamental asymmetry: AI excels at “inhaling” (adding features) but struggles with “exhaling” (refactoring for simplicity). This creates an “inhibiting loop where complexity eventually exceeds the AI’s capacity to help.” The more code AI generates without refactoring, the harder it becomes for AI to work effectively on that codebase — context windows fill up, cross-file dependencies multiply, and the agent loses track of the architecture it created.
What AI Is Actually Good At (And Bad At) in Refactoring
CodeScene’s research tested AI refactoring rigorously and found a striking result: AI breaks code in 63% of refactoring attempts. But with fact-checking (automated verification that the refactoring preserves behavior), accuracy improves to 98%, outperforming humans.
That gap — 37% accuracy raw, 98% with verification — is the single most important insight about AI refactoring. It’s not that AI can’t refactor. It’s that AI refactoring without automated tests is reckless.
Qodo’s State of AI Code Quality 2025 confirms from the practitioner side: code generation is where developers feel fastest (53.3%), debugging is second (48.9%), but refactoring is where AI saves the least time. The reason: 65% of developers report missing context during refactoring — the agent doesn’t understand enough about the broader codebase to refactor safely.
Here’s what AI handles well vs. what it doesn’t:
Works well | Struggles |
|---|---|
| Variable renaming across all call sites | Design-level restructuring requiring codebase-wide understanding |
| Function extraction with proper parameters | Edge case handling — may silently change how nulls or boundaries work |
| Dead code elimination | Boolean logic inversions (a && b becomes !(a && b)) |
| File splitting along clear boundaries | JavaScript this context when extracting methods |
| Documentation updates while preserving behavior | Dropping entire code branches during “cleanup” |
| Pattern-based migrations (API version upgrades) | Understanding why code was structured a certain way |
SitePoint captures the warning: “During refactors, AI might subtly change how edge cases or nulls are handled — review diffs carefully, especially conditional logic and comparisons.” The more powerful examples: extracting 120 lines of identical code into a base class in under 30 minutes — “work that would’ve taken hours.”
The takeaway: use AI for mechanical refactoring (rename, extract, split, deduplicate) with test verification. Use human judgment for architectural refactoring (redesigning module boundaries, changing data flow, restructuring dependencies).
The “Explain Then Rewrite” Pattern
The most effective AI refactoring technique isn’t “refactor this code.” It’s a two-step process:
Step 1: Ask the AI to explain. “Explain what this function does, step by step. What are its inputs, outputs, side effects, and edge cases?”
Step 2: Ask for a rewrite. “Now rewrite this function to be clearer, with better variable names, extracted helper functions, and explicit error handling. Preserve the exact behavior you described.”
Why this works: step 1 forces the agent to build an explicit model of the code’s behavior before changing it. Without this step, the agent pattern-matches against common refactoring patterns and may “improve” the code in ways that change its behavior. With the explanation as explicit context, the agent has a behavioral contract to honor.
Osmani uses a variation: “If the AI generates something convoluted, I’ll ask it to add comments explaining it, or I’ll rewrite it in simpler terms.” The comments become an intermediate artifact — they explain the intent, which the AI can then use to produce a cleaner implementation.
Willison extends this to larger refactoring: “For longer changes and refactorings, it’s useful to tell the LLM to write a plan, iterate over it until it’s reasonable, and save it as a kind of meta program.” The plan is the “explain” step at architecture scale — the agent reasons about the refactoring before executing it.
The Dual-Model Review
One of the most effective refactoring patterns uses the tools against each other.
Osmani recommends spawning “a second AI session (or a different model) and asking it to critique or review code produced by the first.” This cross-model verification catches subtle issues that a single model would miss — each model has different blind spots, different training data biases, different pattern-matching tendencies.
The workflow:
- Model A generates the code
- Model B reviews the output: “Review this code for correctness, performance, security, and readability. List specific issues.”
- Model A fixes the identified issues (or you prompt Model B to provide the fixes)
- Tests verify that the refactoring preserved behavior
This is CodeScene’s automated loop at smaller scale: “review, plan, refactor, re-measure.” Their MCP tools “enforce Code Health by kicking the AI into a refactoring loop on any quality issues” — the agent keeps improving until it hits a 9.5+ Code Health score.
When to Refactor vs. When to Rewrite
Not all AI code is worth improving. Sometimes the fastest path is deletion.
76% of developers report needing to rewrite or refactor at least half of AI-generated code before it’s ready to use. Poor readability, excessive repetition, and non-functional code were the main reasons. But “refactor” and “rewrite” are different strategies with different costs.
Signal | Refactor | Rewrite |
|---|---|---|
| The code works but is messy | Yes | No |
| Single function or module needs improvement | Yes | No |
| Pattern is correct but verbose | Yes — extract and deduplicate | No |
| Architecture is fundamentally wrong | No | Yes — fix the spec first |
| You don’t understand what the code does | No — comprehension debt is too high | Yes — with a clear spec |
| Multiple interacting bugs across files | No | Yes |
| The code has good test coverage | Yes — tests protect the refactoring | Less critical |
| No tests exist | Add tests first, then refactor | Consider rewriting with TDD |
The DX framework offers the enterprise perspective: “Refactoring is favored when the current architecture and technology stack are fundamentally sound. Rewriting becomes necessary when there are serious security holes that can’t be patched, or when it becomes very hard to add features.”
The decision often comes down to comprehension. If you understand the code and it mostly works, refactor. If you don’t understand it and it sort of works, rewrite — because you can’t maintain what you can’t comprehend, and refactoring code you don’t understand just rearranges the confusion.
Reducing File Size for Context Windows
One of the most impactful refactoring targets for AI-assisted development is file size. Large files are the enemy of effective AI coding for a purely mechanical reason: they consume context.
Research by Hong et al. found that models experience “context rot” as inputs get longer, with performance drops of 15-30% for information in the middle of very long contexts. Factory.ai found that selective context injection can reduce token usage by over 70%.
Anthropic’s best practices recommend smaller, focused modules. The awesome-vibe-coding-guide is more specific: keep files under 300 lines.
Ronacher captures the practical dynamic: “You can vibe code your frontend together for a while, but eventually you reach the point where you absolutely need to tell it to make a component library. You don’t want to do it too early and you definitely do not want to do it too late.”
The refactoring targets for AI-friendliness:
- Split files over 300 lines. If a file has multiple responsibilities, each gets its own file. The agent reads fewer tokens and produces more focused output.
- Extract shared patterns into utilities. AI generates duplicate code across files because each generation is independent. Find the patterns, extract them, and the agent will use the utility instead of regenerating the pattern.
- Separate types/interfaces from implementation. Type files are cheap context — the agent reads the interface to understand the contract without loading the entire implementation.
- Colocate tests.
auth-service.test.tsnext toauth-service.ts, not in a distanttests/directory. Fewer file reads = less context consumed.
Augment Code reports that AI-assisted multi-file refactoring can reduce typical 25-file operations from 40 hours to 16-32 hours. The AI handles the mechanical part — updating imports, threading parameters, moving code — while you handle the architectural decisions about where to split.
How code architecture decisions directly affect AI coding productivity — including data on 2x PR throughput and 24% cycle time reduction when the codebase is structured for AI agents. The hidden lever most teams overlook.
Code Quality Tools as Guardrails
Refactoring without automated quality checks is just rearranging code and hoping for the best. The tool chain that makes AI refactoring safe:
Linting (ESLint, Pylint, RuboCop): Catches patterns the AI introduces that violate your project’s conventions. Run after every AI generation, not just in CI. ESLint with security plugins catches issues like eval() usage, unsafe regex, and prototype pollution.
Formatting (Prettier, Black, gofmt): Eliminates formatting noise from diffs. When the AI reformats code alongside functional changes, the real changes are invisible. Run formatting before reviewing to separate formatting from logic.
Static analysis (SonarQube, CodeScene, Semgrep): SonarQube brought “AI Code Assurance” in 2025 to provide guardrails for AI coding assistants in every PR. Their motto: “Vibe, then verify.” CodeScene’s framework requires code to achieve at least a 9.5 Code Health score before AI agents work on it — ensuring the agent starts from a clean baseline.
Type checking (TypeScript strict mode, mypy, Pyright): AI-generated code often has subtle type mismatches that work at runtime but indicate conceptual errors. Strict type checking catches these before they compound.
The CodeScene six-pattern framework synthesizes this into a workflow:
- Pull Risk Forward — code must pass quality gates before AI agents modify it
- Safeguard Generated Code — three-level automated checks (continuous review, pre-commit, PR pre-flight)
- Refactor to Expand the AI-Ready Surface — improve code health so that AI can work effectively
- Encode Principles in AGENTS.md — quality standards as machine-readable rules
- Use Code Coverage as Behavioral Guardrail — prevent agents from deleting tests to make coverage numbers look better
- Automate Checks End to End — every stage gated, no manual exceptions
The LessonsLearned.md Pattern
Red Hat’s spec-driven development article introduces a pattern that directly addresses the refactoring cycle: maintain a LessonsLearned.md file that documents errors the AI has made and the fixes applied.
The idea: every time the AI generates code with a bug, a security flaw, or a quality issue, add the pattern to LessonsLearned.md. This file gets loaded into context alongside your spec, teaching the agent from its own history. Over time, the agent stops repeating the same mistakes — not because it remembers, but because the memory is externalized into a document it reads at the start of every session.
This is context engineering (Part 5) applied to code quality. The agent’s quality improves project by project, session by session — not because the model got better, but because your context got better.
OpenHands’ architecture for massive automated refactoring — resolving CVEs across thousands of repos, generating focused PRs, and the engineering behind AI refactoring at scale. A practical look at what large-scale AI refactoring infrastructure actually looks like.
The Synthesis
Refactoring AI-generated code isn’t a failure state — it’s the expected workflow. The generation is fast but messy. The refactoring is where quality happens. And the irony is that AI is good at both halves of the cycle: generating sprawling code and cleaning it up — as long as you provide the verification infrastructure that makes cleanup safe.
The rules:
- Test before refactoring. CodeScene’s data: 37% accuracy without tests, 98% with them. This isn’t optional.
- Explain then rewrite. Two-pass refactoring — understand first, change second — produces dramatically better results than “clean this up.”
- Use dual-model review. Model A generates, Model B critiques. Different blindspots, better coverage.
- Refactor for AI-friendliness. Smaller files, colocated tests, separated interfaces. The refactoring pays for itself in every future AI interaction.
- Rewrite when you don’t understand. Comprehension debt is the most expensive debt. If you can’t explain the code, regenerate it with a better spec rather than polishing what you don’t grasp.
- Automate quality gates. Linting, type checking, static analysis, code health scores — these run on every change, AI or human. No exceptions.
- Build a LessonsLearned.md. Externalize the agent’s quality improvement. Every bug pattern documented is a bug pattern prevented.
The balance Ronacher describes — refactoring is cheap with AI, so “the code looks more organized than it would otherwise have been” — is achievable. But it requires treating refactoring as a first-class activity, not an afterthought squeezed in between feature requests.
The next part tackles the hardest refactoring scenario of all: Part 16: Working with Existing Codebases.
Part 16: Working with Existing Codebases
Everything up to this point has implicitly assumed you’re building something new — or at least working on code the AI helped create. But that’s not where most developers spend their time. Most real work happens in existing codebases: legacy systems, inherited projects, established products with years of accumulated decisions, conventions, and undocumented behavior.
This is the hardest scenario for AI-assisted development. And it’s where the techniques from every previous part — context engineering, specs, testing, version control, security — converge into a single question: how do you introduce an AI agent to a codebase it has never seen, and get useful work out of it without breaking what already exists?
Why Existing Codebases Are Hard
Peter Steinberger captures it: “AI coding assistants are like extremely capable interns who know every programming language but nothing about your codebase.” They know syntax, patterns, libraries, and idioms — but they don’t know your naming conventions, your architectural decisions, your undocumented invariants, or your deployment quirks.
The specific challenges, quantified by VentureBeat’s enterprise analysis:
- Files larger than 500KB are excluded from indexing entirely by most tools
- Multi-file refactors achieve only 42% capability in enterprise environments (vs. marketing claims of 100%)
- Legacy codebases hit 35% capability for current AI agents
- Indexing features fail or degrade for repositories exceeding 2,500 files
Mike Mason describes the architectural drift problem: “Agents make locally sensible but globally inconsistent decisions. They suggest deprecated APIs and miss internal conventions because they were trained on public code.” Your codebase has conventions that differ from what the model learned — and the model will confidently apply the wrong patterns unless you tell it otherwise.
The arXiv Codified Context paper (Vasilopoulos, 2025) studied 283 development sessions on a 108,000-line C# system and found it required a three-component infrastructure — hot-memory constitution, 19 specialized agents, and 34 on-demand specification documents — to maintain coherence. A single-file manifest (CLAUDE.md, .cursorrules) doesn’t scale to complex existing systems.
Step 1: Understand Before You Touch
The first AI session on an existing codebase should be read-only. No edits. No refactoring. Just understanding.
Start with the codebase digestion tools. Three options, each with different strengths:
Repomix (formerly Repopack) — packs your entire repository into a single AI-friendly file with token counts per file, clear separators, and an AI-oriented explanation header. Install with npx repomix and get a file you can paste into any LLM.
GitIngest — the fastest path: replace “github.com” with “gitingest.com” in any repo URL and get a text digest. Available as CLI tool, Python package, and browser extension. Includes smart formatting and token counts.
repo2txt — paste a GitHub URL and get the full tree and content. Peter Steinberger calls this his primary tool — “more efficient in choosing exactly which files to include.”
Osmani recommends these directly: “Tools like gitingest or repo2txt essentially dump the relevant parts of your codebase into a text file for the LLM to read. These can be a lifesaver when dealing with a large project.”
Then ask the exploration questions. Before making any changes, run a research session:
- “Analyze this codebase. What does it do? What technologies does it use? What patterns does it follow?”
- “Map the directory structure. What’s in each top-level directory? How are files organized?”
- “Trace the data flow for [core feature]. Where does data enter the system, how is it processed, where is it stored?”
- “What are the naming conventions? How are functions, variables, files, and directories named?”
- “What testing patterns are used? Where are tests located? What frameworks are used?”
Claude Code does this natively — it uses glob, grep, and find to navigate project structures, following imports and examining dependencies to build understanding. But it’s most effective when you direct the exploration rather than letting it wander.
Steinberger’s specific workflow: convert the repo to markdown via repo2txt, feed it into a model with a large context window (Gemini’s 1M tokens), and start asking questions. “Completely free and 200% faster than letting Claude run amok eating tokens.”
Dex Horthy (HumanLayer CEO) on the context engineering breakthrough for “brownfield” projects — the specific challenges of AI agents in existing codebases, and how context engineering solutions bridge the gap between greenfield ease and legacy reality.
Step 2: Bootstrap Your Context Files
Once you understand the codebase, encode that understanding into persistent context files so every future AI session starts with the right knowledge.
Run /init in Claude Code. This analyzes the codebase and generates a starter CLAUDE.md with build commands, test instructions, and discovered conventions. If a CLAUDE.md already exists, it suggests improvements rather than overwriting.
67% of repositories have now adopted rule files (CLAUDE.md, .cursorrules, AGENTS.md). For existing codebases, the file should include:
What to put in CLAUDE.md for an existing project:
Category | Examples |
|---|---|
| Build & test commands | npm test, pytest -x, make build — commands the agent can’t guess |
| Architecture overview | “This is a monorepo with frontend in /web, API in /api, shared types in /packages/types” |
| Naming conventions | “Components use PascalCase, utilities use camelCase, database models use snake_case” |
| Non-obvious patterns | “Auth tokens are stored in HttpOnly cookies, not localStorage. Sessions are server-side.” |
| Known gotchas | “The user table has a soft-delete column deleted_at — never use DELETE FROM users” |
| What NOT to touch | “Do not modify files in /legacy/ — they’re being migrated and have frozen interfaces” |
| External dependencies | “Payments go through Stripe API v2023-10-16 — do not suggest upgrading” |
HumanLayer’s guidance: keep it under 200 lines. Longer files consume more context and reduce adherence. Split detailed rules into .claude/rules/ topic files (testing.md, api-design.md, database.md) that get loaded selectively.
For complex codebases, use the @path/to/import syntax to pull in additional context files without bloating the main CLAUDE.md.
Step 3: Start Small and Expand
Don’t try to use AI across the entire codebase on day one. GetDX’s adoption research is clear: “AI code generation should complement current processes rather than disrupting them. Disruption from overhauling existing processes often counteracts increased coding speed.”
High-value, low-risk starting points:
-
Writing tests for existing code. The code is already working. You need tests before you can safely change it. AI can generate test suites from existing behavior — up to 90% speedup for this specific task.
-
Documentation. Ask the agent to read modules and generate inline documentation, README sections, or architecture diagrams. No risk of breaking anything.
-
Bug fixes with clear reproduction steps. Paste the error, point to the file, let the agent fix it. Small scope, verifiable outcome, minimal blast radius.
-
Isolated new features. New functionality that connects to the existing system through well-defined interfaces. The agent builds the new module; you write the integration layer.
-
Dependency updates. Upgrading a library version with clear migration guides. The agent handles the mechanical changes; tests verify nothing broke.
What to avoid initially:
- Large refactoring across core modules (the agent doesn’t understand the architecture yet)
- Changes to authentication, authorization, or payment flows (too much implicit knowledge)
- Modifications to code with no test coverage (no safety net)
- Architectural changes (the agent can’t reason about system-wide trade-offs it doesn’t know about)
A counterintuitive finding: projects with 50K+ lines of code actually have higher AI adoption rates (~10%) than smaller projects. The explanation: larger projects have more repetitive, mechanical tasks where AI shines — and more need for the productivity boost.
The Hallucination Problem in Existing Codebases
AI hallucination hits differently in existing codebases than in greenfield projects. In a new project, a hallucinated API is immediately obvious — the function doesn’t exist, the import fails, the compiler catches it. In an existing codebase, the agent might hallucinate methods that sound like they should exist based on the patterns it sees, and the error might not surface until runtime — or worse, it might call a real method with the wrong semantics.
Willison offers a useful reframe: “Hallucinations in code are the least dangerous form of LLM mistakes” — because running the code usually catches them immediately. The real risk is subtle misuse: calling a method that exists but doesn’t do what the agent thinks it does, or using a configuration option that’s valid but inappropriate for your environment.
Prevention strategies:
- TypeScript strict mode catches hallucinated properties and methods at compile time
- Contract testing against real APIs (not mocks) catches phantom methods
- Make module boundaries explicit so the agent can work on one file without needing the whole project in context
- RAG (Retrieval-Augmented Generation) reduces hallucinations by 71% on average by grounding the agent in actual codebase contents rather than training data patterns
Tools like Augment Code address this at the infrastructure level: their 200k-token Context Engine continuously indexes repos in real-time, building semantic understanding of code relationships and dependency graphs. This scales to codebases with 400,000+ files and provides a 70% reduction in context-switching time. Sourcegraph’s Amp uses a RAG-based architecture with up to 1M token context windows, analyzing the entire multi-repo environment before making suggestions.
The most practical defense, though, is the simplest: tell the agent what exists. Include a brief API reference in your context files. List the available utility functions. Document the database schema. The agent can’t hallucinate methods that aren’t in the API if you’ve told it exactly what the API contains.
Migration Patterns
AI-assisted migrations are one of the highest-ROI applications for existing codebases — framework upgrades, language transitions, API version changes. The work is mechanical, pattern-based, and verifiable — exactly what AI does well.
The strangler pattern is the recommended approach for AI-assisted migrations: gradually replace legacy components with modern equivalents while keeping the system operational. The AI handles individual component migrations; you manage the transition architecture.
Anthropic’s Code Modernization Playbook covers the full pipeline: discovery, documentation, migration, and verification. Their claim: “modernize COBOL codebases in quarters instead of years.” Hexaware reports 93% accuracy in AI-driven COBOL-to-Java modernization, with 35% complexity reduction and 33% coupling reduction.
For more common migrations (React class → functional components, REST → GraphQL, JavaScript → TypeScript), the pattern is:
- Generate a migration spec. “Analyze the current implementation of [module]. Create a migration plan to [target framework/pattern]. List every file that needs to change, what changes are needed, and the order of operations.”
- Migrate one module as a reference. Do the first one carefully, with full review. This becomes the pattern the agent follows for the rest.
- Batch the remaining modules. Point the agent at each module with: “Migrate this module following the same pattern as [reference module]. Preserve all existing behavior. Run the tests after each change.”
- Verify with existing tests. The test suite is your migration correctness guarantee. If tests don’t exist, write them before migrating (the pre-migration behavior is your spec).
Ronacher tested this in practice: Claude successfully migrated his tests to a new infrastructure approach in about an hour — work that would have taken significantly longer manually.
Claude Code applied to legacy COBOL modernization — the discovery, documentation, migration, and verification workflow in practice. A demonstration of what AI-assisted migration looks like on codebases that have been running for decades.
The Scope Management Principle
The overarching principle for existing codebases comes from Osmani: “Scope management is everything — feed the LLM manageable tasks, not the whole codebase at once. LLMs do best when given focused prompts: implement one function, fix one bug, add one feature at a time.”
This applies even more to existing codebases than to greenfield work, because every file the agent reads consumes context that could be used for reasoning. In a greenfield project, there’s little to read. In an existing project, the agent could spend its entire context window just understanding the codebase, leaving no room for actually making changes.
The practical workflow:
- Research session — agent reads the relevant files, summarizes its understanding, you verify
- Planning session — fresh context, feed the research summary, agent creates an implementation plan
- Implementation sessions — fresh context per task, feed the plan and relevant files only, agent implements one piece at a time
- Commit after each piece — version control as your progress save point (Part 13)
This is the Research-Plan-Implement loop from Part 9, applied to the most demanding scenario. Each session stays focused. Context stays clean. The handoff documents carry the essential understanding.
The Synthesis
Working with existing codebases is where every skill in this guide gets tested. Context engineering (Part 5) determines whether the agent understands your project. Spec-driven development (Part 6) structures the agent’s work. Testing (Part 11) catches regressions. Version control (Part 13) protects existing functionality. Security (Part 14) prevents new vulnerabilities in mature systems.
The rules:
- Understand before you touch. Read-only exploration first. Encode the understanding in CLAUDE.md and rules files. The investment pays for itself across every future session.
- Start with tests and docs. Zero-risk tasks that build familiarity with both the codebase and how the AI interacts with it.
- Feed context, not codebases. Digestion tools (repomix, gitingest, repo2txt) convert large repos into AI-friendly formats. Don’t dump everything — select what’s relevant.
- Prevent hallucinations with explicit context. Tell the agent what APIs exist. List available utilities. Document the schema. Grounded agents don’t invent methods.
- Migrate with the strangler pattern. One module at a time, tests as correctness guarantee, existing system stays operational throughout.
- Scope aggressively. One task per session. One module per prompt. One change per commit. The existing codebase is too complex for mega-prompts.
The developers who succeed with AI on existing codebases aren’t the ones with the most powerful tools. They’re the ones with the best context files, the cleanest session discipline, and the most realistic expectations about what AI can understand without being told.
The next part puts everything together in practice: Part 17: The Full-Stack Build: From Spec to Deploy.
Part 17: The Full-Stack Build: From Spec to Deploy
Everything in Parts 5–16 was a module. Context engineering. Specs. Testing. Debugging. Version control. Security. Refactoring. Legacy codebases. Each one a skill you practice in isolation.
This part is the integration test. A single build, end to end, where every skill shows up — and where the gaps between them become visible. Not a highlight reel. The real workflow, including the parts where you stop, think, and override the AI.
The 90% Problem
Here’s the stat that frames this entire discussion: roughly 90% of vibe-coded projects never make it to production deployment. Not because the code doesn’t work in a demo. It does. The problem is the last 20% — the testing infrastructure, the error handling, the security hardening, the deployment configuration, the database discipline — that separates a prototype from a product.
Tom Occhino, Vercel’s CPO, put it differently: “90% of what we need to do is make changes to an existing code base.” The challenge isn’t generating code from scratch. It’s connecting that code to real infrastructure, real data, and real users.
A Superblocks study found a striking confidence gap: 71.5% of experienced developers feel confident deploying visual development tools for mission-critical apps, but only 32.5% feel the same about vibe-coded apps. The difference isn’t capability — it’s the absence of guardrails, governance, and deployment discipline.
The five failure patterns that kill end-to-end builds, according to Diego Rodriguez’s analysis:
Failure Pattern | What Happens | When It Hits |
|---|---|---|
| Absent testing infrastructure | Changes silently break other parts of the app | First feature addition after v1 |
| Missing error handling | AI generates code for ideal scenarios only; real users cause crashes | First real user session |
| Security vulnerabilities | Hardcoded keys, unvalidated inputs, weak auth | First security scan (or first breach) |
| Unmaintainable architecture | Business logic mixed with UI, duplicates everywhere | Second month of development |
| Dependency chaos | Unnecessary packages with conflicting versions | First deployment to a different environment |
The workflow below is designed to address each of these failure modes before they become emergencies.
Phase 1: Spec Before Anything
The build starts with a spec. Not code. Not a scaffold. Not “just start building and see what happens.” A spec.
This is Part 6 in practice. Addy Osmani calls it “waterfall in 15 minutes” — a term coined by Les Orchard. The process:
- Brainstorm with AI — describe what you want to build, iterate on requirements through conversation
- Compile into a
spec.md— requirements, architecture decisions, data models, testing strategy, all in one document - Feed the spec into a reasoning model — generate a project plan that breaks implementation into logical, bite-sized tasks
- Iterate on the plan — edit, critique, refine until it’s coherent
- Only then write code
The key insight: “Both the human and the LLM know exactly what we’re building and why.” No wasted cycles. No architectural drift.
Osmani’s spec structure covers six areas:
- Commands — executable commands with full flags (not vague instructions)
- Testing — frameworks, file locations, coverage expectations
- Project structure — where source, tests, and docs live
- Code style — one real code example beats paragraphs of description
- Git workflow — branch naming, commit formats, PR requirements
- Boundaries — what the agent should never touch (secrets, vendor dirs, production configs)
The boundary system has three tiers: Always do / Ask first / Never do. This is where human judgment gets encoded into the process before a single line of code exists.
Tool options for spec-driven workflows:
GitHub Spec Kit (open source, v0.1.4) uses a four-phase approach: Specify → Plan → Tasks → Implement. The /specify command generates detailed specs from high-level descriptions. Compatible with Copilot, Claude Code, Gemini CLI, Cursor, and Windsurf.
Amazon Kiro uses a three-document approach (Requirements → Design → Tasks) with EARS syntax and GIVEN/WHEN/THEN acceptance criteria. Its “hooks” feature runs automated background agents triggered on file save, create, or delete — essentially CI for the editing process.
Codeplain uses a spec language called “Plain” that extends Markdown, letting you express intent and constraints in plain English that translates to code, tests, and validation.
Phase 2: Scaffold With Intent
With the spec in hand, you scaffold. This is the architecture phase from Part 10 — but now you’re doing it for a complete application, not a single feature.
The scaffold includes:
- Directory structure — matching the project structure defined in your spec
- Configuration files —
CLAUDE.mdor.cursorruleswith your boundaries, code style, and testing expectations (Part 5) - Database schema — designed by you, reviewed by AI, not the other way around
- API contracts — types, endpoints, request/response shapes before any implementation
- Test infrastructure — testing framework configured, first smoke test running
This is where you own the decisions. AI approaches to database schema design tend to produce either over-engineered schemas (needlessly complex) or under-designed ones (requiring painful refactoring). The AI generates “generic patterns that don’t account for unique business requirements.” Your domain knowledge is the corrective.
A CodeConductor analysis found five database-specific failure modes in AI-built apps:
- Missing relational discipline — schema drift, deferred integrity validation, duplicate records
- Absent indexing strategy — indexes added reactively after performance drops
- Unplanned query execution — inefficient JOINs never inspected via EXPLAIN
- No replication or scaling design — single-instance deployments without redundancy
- Zero observability — no monitoring of query latency, resource usage, or error rates
The fix: design your schema before the AI touches it. Let AI generate the migration files, the ORM models, the CRUD operations — but the relational design is yours.
The commit checkpoint: After scaffolding, commit everything. This is your “known good” state — the structural foundation that every implementation session builds on (Part 13).
Phase 3: Implement in Layers
Now the AI writes code. But not all at once.
The implementation follows the spec’s task breakdown, one layer at a time:
Layer 1: Data layer — database models, migrations, seed data. Test: can you query the database and get expected results?
Layer 2: API layer — endpoints, validation, error handling. Test: do API calls return correct responses for both valid and invalid inputs?
Layer 3: Business logic — the domain-specific rules that make your app do what it actually does. Test: do edge cases behave correctly?
Layer 4: UI layer — components, pages, routing. Test: do user flows work end to end?
Layer 5: Integration — connecting layers, handling state, managing auth flows. Test: does the full stack work together?
Each layer gets its own implementation session with fresh context (Part 9). Each layer gets committed separately. Each layer gets tested before moving to the next.
This is where the delegation framework from SoftwareSeni applies:
Safe to delegate to AI:
- Test generation
- Boilerplate CRUD operations
- Documentation
- Simple refactoring
- Data model scaffolding (from your schema)
Keep for yourself:
- Security logic (authentication, authorization, secret management)
- Architectural decisions (how components connect, data flow patterns)
- Complex business rules (the domain logic that defines your product)
- Unfamiliar technology stacks (AI hallucinates APIs it doesn’t know)
- Anything you can’t verify
Anthropic’s internal data supports this split. From their August 2025 survey of 132 engineers: engineers retain “high-level thinking and design” requiring organizational context and judgment. Delegated tasks are “easily verifiable with low context and low complexity” — self-contained, repetitive, or low-stakes.
The most common failure mode when delegating? Poor code structure (33.3% of failures) — long tangled functions, duplicated code, weak separation of concerns. The layered approach prevents this by keeping each implementation session focused on a single concern.
Phase 4: Test, Secure, Harden
The app works in development. Every feature does what the spec says. Time to celebrate?
Not yet. This is where the 90% fail.
Testing (Part 11):
- Run your test suite. Check coverage. Look specifically for the “tests that pass but test nothing” problem — AI-generated tests that assert obvious things without testing actual behavior.
- Add integration tests for the flows between layers. The seams where data layer meets API meets UI are where AI-generated code breaks most often.
- Run the security checklist from Part 14. Every item. No shortcuts.
Security hardening:
- Check for hardcoded secrets, missing input validation, weak authentication (Part 14)
- Run
npm auditor equivalent dependency scanning - Verify CORS, CSP headers, rate limiting
- Review every API endpoint for authorization checks
Error handling:
- AI generates code for ideal scenarios. Your users won’t provide ideal inputs. Add error boundaries, fallback states, loading indicators, and timeout handling.
- Test what happens when the database is down, the API returns 500, the user submits garbage data, the session expires mid-action.
Performance baseline:
- Run Lighthouse or equivalent. Note the scores before deployment.
- Check database query performance — AI-generated queries without indexing strategy will degrade under load (CodeConductor analysis)
This phase is the separation between creative work and controlled work. Superblocks frames it as two categories: creative features (UI design, workflow logic) tolerate iteration and AI autonomy. Controlled features (authentication, secrets management, deployment config) demand exactness and human review.
Phase 5: Deploy With Guardrails
The code is tested, secured, and hardened. Now it needs to run somewhere.
The deployment stack for AI-built apps:
Vercel rebuilt v0 to import existing GitHub repos, pulling environment variables and configurations automatically with deployment protections baked in. It handles frontend and lightweight API routes well. Direct integrations with Snowflake and AWS databases provide production data access with proper controls.
Railway released an MCP server allowing AI coding agents to deploy apps and manage infrastructure directly from code editors. Good for backend services needing persistent processes or database hosting.
The practical pattern: Vercel for frontend + serverless functions, Railway or Fly.io for backend services + databases. This separation mirrors the layered architecture from Phase 3 and prevents the “single point of failure” problem.
Deployment checklist:
- Environment variables — every secret in
.envmoves to the deployment platform’s secret management. Nothing hardcoded. - CI/CD pipeline — automated tests run on every push. Deployment fails if tests fail. This is your safety net from Part 13, automated.
- Preview deployments — every PR gets a preview URL. Review the actual running app, not just the code diff.
- Monitoring — error tracking (Sentry or equivalent), basic uptime monitoring, database query logging.
- Rollback plan — know how to revert to the previous deployment. With Vercel, it’s one click. With custom setups, it’s
git revert+ redeploy.
The Numbers That Matter
Anthropic’s internal engineering data gives a concrete picture of what this workflow produces at scale. From their August 2025 study:
- Engineers use Claude in 59% of their work (up from 28% a year prior)
- They report a 50% productivity boost (up from 20%)
- Claude Code completes approximately 20 consecutive actions independently before needing human input
- On the Claude Code team itself, Claude writes 90% of the code
- Engineers produce ~5 PRs/day — PR output per engineer up 67% while the team doubled in size
- 27% of Claude-assisted work wouldn’t occur otherwise — scaling projects, tools, documentation, testing, and “papercut fixes” that would never make the priority list
The most frequently delegated tasks: debugging and code fixing (55% daily), code understanding (42%), new features (37%). Design and planning is least frequently delegated — consistent with the “human owns architecture, AI owns implementation” pattern throughout this guide.
A Spotify case study showed similar results: after integrating Claude into their systems, any engineer could kick off large-scale code migrations by describing needs in plain English, with up to 90% reduction in engineering time and over 650 AI-generated code changes shipped per month.
But here’s the caveat from IEEE Spectrum: only 3.8% of developers fall into the “low hallucinations, high confidence” scenario. 76.4% are in “high hallucinations, low confidence” territory. The workflow above — spec, scaffold, implement in layers, test, secure, deploy with guardrails — is specifically designed to keep you productive regardless of which bucket you’re in.
The Real Workflow Is Not Linear
The five phases above look clean on paper. In practice, you’ll loop:
- Spec → Scaffold → discover the spec missed a critical flow → update spec → adjust scaffold
- Implement Layer 3 → tests reveal Layer 2 API doesn’t handle edge case → fix Layer 2 → continue Layer 3
- Security hardening → find an auth bypass → trace it back to an architecture decision → refactor
- Deploy → monitoring reveals a slow query → add index → redeploy
Each loop triggers the same skills: commit before changing anything (Part 13), fresh context for the new task (Part 9), spec update to reflect reality (Part 6), security check on the fix (Part 14).
The developers who ship are the ones who expect the loops and build their workflow around them. The ones who stall are the ones who treat Phase 1–5 as a one-way conveyor belt.
The Synthesis
A full-stack build with AI is more structured than one without it, not less. Every phase requires decisions that the AI can’t make: what to build (spec), how to structure it (scaffold), what’s critical (security), and when it’s ready (deploy). The AI accelerates every phase — but only if you’ve set up the guardrails, the context, and the review process to catch what it gets wrong.
The pattern:
- Spec first. Write the spec with AI, review it yourself. Part 6.
- Scaffold with intent. Own the architecture, let AI generate the boilerplate. Part 10.
- Implement in layers. One concern per session, tests after each layer. Part 9 + Part 11.
- Harden before deploy. Security, error handling, performance. Part 14.
- Deploy with monitoring. CI/CD, preview deployments, rollback capability. Part 13.
- Expect loops. The workflow is iterative. Build that assumption into your commit discipline and session management.
The 90% that fail skip steps 1, 4, and 6. Don’t.
The next part addresses something that doesn’t show up in any workflow diagram but shapes everything about how you use these tools: Part 18: The Psychology of AI-Assisted Work.
Part 18: The Psychology of AI-Assisted Work
This part isn’t about tools or techniques. It’s about the thing nobody writes a spec for: what it feels like when the craft you spent years mastering becomes something you supervise instead of perform.
If you’ve felt a strange unease about AI coding — something beyond the normal frustration of a new tool — you’re not alone, and you’re not being irrational. The emotion has a name, and understanding it is as important to your long-term effectiveness as any prompting technique in this guide.
It’s Grief, Not Frustration
In February 2026, the SF Standard published interviews with software engineers in San Francisco that captured something most productivity articles miss. An anonymous junior engineer at a large tech company described his predominant feeling about AI being a better coder than him as “grief”:
“I’m basically a proxy to Claude Code. My manager tells me what to do, and I tell Claude to do it.”
He added: “The skill you spent years developing is now just commoditized to the general public. It makes you feel kind of empty.”
This wasn’t an outlier. Lee Edwards, an investor at Root Ventures and former software engineer, said: “It just broke my brain… The AI has reached a point of sophistication where it’s telling itself how to work based on the best engineering practices. So what does it need me for?”
Another engineer admitted understanding only “about half the work he produces.”
Fortune reported that Spotify’s best developers “have not written a single line of code since December,” and Boris Cherny (head of Claude Code at Anthropic) hasn’t written code in over two months. Developers are transitioning from writing code to becoming “directors of AI systems that do the typing for them.”
This is a genuine identity shift, not a tooling annoyance. Stack Overflow explored this directly, framing AI tools as “a double-edged sword when it comes to imposter syndrome: capable of doing battle for you or against you.” The core anxiety: “Are you a real coder, or are you using AI?” The article argues this is a false dichotomy — but the feeling is real regardless.
Even Markus Persson (Notch), the creator of Minecraft, called anyone advocating using AI to write code “incompetent or evil” — a reaction rooted in the same loss of creative autonomy that many developers feel but express less dramatically.
The Stack Overflow 2025 Developer Survey captured the shift in numbers: 84% of respondents use or plan to use AI tools, 51% use them daily — but positive sentiment toward AI declined from 70%+ (2023–2024) to 60% (2025). Usage is up. Trust and enthusiasm are down. That gap is the emotional landscape we’re operating in.
Automation Complacency: The Pattern We’ve Seen Before
The psychological pattern has a name from decades of research in aviation and manufacturing: automation complacency. When systems work reliably, humans stop paying attention. Then the systems fail, and the humans have lost the skills to intervene.
NASA and MITRE research on pilot automation documented the pattern precisely: high static automation reliability increases complacency. “Out-of-the-loop performance” leads to failure to observe changes, over-trust in computers, loss of situation awareness, and direct/manual control skill decay. In one study, the automation group experienced 61% performance degradation when automation was unavailable, versus 31% for the manual group.
The parallel to coding is direct. ThoughtWorks placed “complacency with AI-generated code” in their “Hold” ring across three consecutive Technology Radar cycles (Oct 2024, Apr 2025, Nov 2025), with escalating concerns each time. Microsoft Research found that AI-driven confidence frequently undermines critical thinking among knowledge workers.
The data from software teams:
Metric | Value | Source |
|---|---|---|
| Engineering leaders seeing significant AI productivity boost | Only 6% | LeadDev survey (617 leaders) |
| Teams experiencing deployment problems with AI code | 59% | Harness survey |
| Devs spending more time debugging AI code than before | 67% | Harness survey |
| Devs spending more time resolving AI security vulnerabilities | 68% | Harness survey |
| Projects that over-relied on AI: additional bugs | 41% more | IEEE Spectrum |
| System stability drop from AI over-reliance | 7.2% | IEEE Spectrum |
| Developers who “highly trust” AI outputs | Only 3% | IEEE Spectrum |
| Developer burnout at critical levels | 22% | LeadDev survey |
One developer described the transformation to LeadDev as going from creator to “a judge at an assembly line and that assembly line is never-ending.”
The complacency cycle works like this: AI generates code → it looks reasonable → you accept it without deep review → it ships → bugs surface weeks later → you debug code you never understood → you lose time and confidence → you lean on AI more to compensate → the cycle deepens.
Skill Atrophy: The Data
This isn’t speculation. There are now controlled studies measuring the effect.
The METR Study (July 2025): 16 experienced developers from large open-source repos (22k+ stars, 1M+ lines of code) worked on 246 real issues, randomly assigned to AI-allowed or AI-disallowed conditions. Result: developers took 19% longer with AI tools. But here’s the psychological kicker — before tasks, developers predicted AI would save 24% of their time. After completing the entire study, they still believed it had saved 20%. A 39-point gap between perception and reality. They were slower and didn’t know it.
Anthropic’s own study (reported by InfoWorld): A randomized controlled trial with 52 mostly junior developers learning Python’s async Trio library. AI-using developers scored 17 percentage points lower on conceptual mastery (50% vs 67%) — equivalent to nearly two letter grades. The largest deficits were in debugging and error comprehension. “AI delegators” who wholly relied on the tool scored below 40%. Participants reported feeling “lazy” and acknowledged gaps in understanding afterward.
The researchers’ warning: “Humans may not possess the necessary skills to validate and debug AI-written code if their skill formation was inhibited by using AI in the first place.”
A 2024 academic study in Cognitive Research: Principles and Implications found that AI assistance accelerates skill decay among experts and hinders skill acquisition among learners — critically, without the performers being aware of the degradation. You lose the skill and don’t notice you’ve lost it.
The earezki analysis names three mechanisms:
- Cognitive encoding — manual struggle is the primary mechanism for embedding logic into memory. Writing code creates internalization that reviewing code cannot replicate.
- Flow state disruption — the rhythm of “brain-hand synchronization” dissolves when developers articulate requirements instead of constructing solutions.
- Ownership anxiety — “engineers who review rather than write code lack the mental map to immediately identify failure points” during outages. You become a stranger to your own codebase.
The Over-Reliance Trap
Andrej Karpathy — the person who coined “vibe coding” in February 2025 — now frames the professional approach differently. AI is a “fast but unreliable junior intern savant with encyclopedic knowledge of software, who also bullshits you all the time, has an over-abundance of courage and shows little to no taste for good code.”
His prescription for professional AI coding? “Being slow, defensive, careful, paranoid, and always taking the inline learning opportunity, not delegating.”
The earezki article calls this the “AI Acceleration Paradox” — producing vastly more code through AI while experiencing diminished satisfaction. Effortless creation feels “weightless” compared to the cognitive investment of manual development. Engineers bypass the cognitive friction required to build internal mental maps.
Addy Osmani’s distinction between “vibe coding” and “agentic engineering” captures the mature response: augmentation means trading implementation time for orchestration and review work — not eliminating technical judgment. The risk for anyone: “You can produce code without understanding it” and “ship features without learning why certain patterns exist.”
His core insight: “The fundamentals matter more, not less” when working with AI agents. This counterintuitive claim is backed by every study cited above.
Maintaining Agency: What Actually Works
The studies don’t just document the problem — they point to specific behaviors that protect against skill atrophy while preserving the productivity benefits.
From Anthropic’s own trial: Developers who requested code AND explanations scored above 65%, while pure delegators scored below 40%. The single most protective behavior was asking conceptual questions before accepting AI output. Not just “does this work” but “why does this work this way.”
From Karpathy’s professional workflow: Stuff everything relevant into context (this can take a while in big projects), maintain tight oversight, and always take the inline learning opportunity. The emphasis on “learning opportunity” is deliberate — every AI-generated code block is a chance to understand something, not just a task to review.
From Addy Osmani’s agentic engineering guide:
- Write specifications before prompting AI — forces architectural thinking before delegation
- Review every code diff with the same rigor applied to junior developers’ PRs
- When the agent writes something you don’t understand, that’s a signal to dig deeper, not to accept and move on
- Testing is the single biggest differentiator between agentic engineering and vibe coding
The intentional selectivity approach from earezki: selectively turn off AI for complex components to ensure the logic is deeply embedded in your own memory. Not everything needs AI assistance. The components that define your product’s core value deserve your direct attention.
From Stack Overflow’s guidance:
- Treat AI as a “thought partner, not a crutch”
- Build in learning time without AI assistance
- Measure performance on code clarity and maintainability, not AI-inflated velocity
A practical zero-trust policy from engineer Kuldeep Modi: AI is explicitly excluded from authentication, permissions, and PII-sensitive code. For everything else, the rule is that “‘almost right’ code is inherently wrong when deployed to production systems.”
The Practice Framework
Putting this together into something actionable:
Daily habits:
- Read before accepting. Every AI-generated code block gets reviewed line by line. If you can’t explain what a line does, you don’t ship it.
- Ask “why” not just “what.” When the AI generates a solution, ask it to explain the approach. Compare it with what you would have done. The gap is your learning opportunity.
- Commit with understanding. If you commit code you don’t fully understand, note that in your commit message. Your future self (or teammates) needs to know which code has the thinnest human understanding behind it.
Weekly habits: 4. Code without AI. Pick one task per week — ideally something interesting, not a chore — and implement it manually. Side projects are perfect for this. The goal isn’t productivity; it’s maintaining your direct relationship with code. 5. Debug manually first. When you hit a bug, resist the impulse to paste the error into an AI. Spend 15 minutes reasoning through it yourself. Then use AI if needed. The initial reasoning attempt exercises skills that atrophy fastest.
Strategic habits: 6. Own your architecture. Never delegate the structural decisions — data models, component boundaries, API contracts, security logic. AI implements; you design. (Part 10) 7. Rotate AI-off days for complex components. When building the core logic that defines your product, turn off AI assistance. The understanding you build will pay dividends every time you need to debug, extend, or explain that code. 8. Track your dependency. Periodically ask yourself: could I build this feature without AI? If the honest answer is “I’m not sure,” that’s the signal to invest in manual practice.
The Synthesis
The psychology of AI-assisted work is not a soft topic that you deal with after mastering the technical skills. It’s woven into every decision covered in this guide. The developers who thrive with AI tools long-term are the ones who treat the emotional and cognitive dimensions with the same rigor they apply to testing and security.
The grief is real. The skill atrophy is measurable. The automation complacency is a documented pattern across industries. None of these mean you should stop using AI tools — the productivity gains are also real.
What they mean is:
- Use AI with intention, not by default. Choose when to delegate and when to code directly. The choice itself is a skill.
- Understand what you ship. The 17-point skill gap from Anthropic’s own study is the cost of pure delegation. Asking “why” closes that gap.
- Maintain your craft through practice. The METR study’s 39-point perception-reality gap means your intuition about your own skill level is unreliable. Deliberate practice without AI is the calibration mechanism.
- Redefine the identity. You’re not “just a proxy to Claude Code.” You’re the person who decides what to build, evaluates whether it’s right, catches what the AI misses, and owns the outcome. That’s the job — and it requires more judgment, not less.
Karpathy landed on the right frame: agentic engineering. Not vibe coding, not traditional development, but a new discipline that demands every skill you already have plus new ones for orchestration, review, and intentional selectivity. The identity isn’t diminished. It’s expanded.
The next part maps this new identity to the career landscape: Part 19: The Career Landscape.
Part 19: The Career Landscape
The skills in this guide don’t exist in a vacuum. They exist in a job market that’s shifting faster than most career advice can keep up with. Entry-level hiring is collapsing. New roles are emerging. Salary premiums are real but unevenly distributed. And the question of what counts as “your work” when AI wrote the code is something every developer will face in their next interview.
Here’s what the data actually shows — not predictions, not think pieces, but hiring numbers, salary data, and the emerging practices that define what employers want right now.
The Job Market in Numbers
The headline stats paint a market in transition:
- Job postings requiring AI skills nearly doubled from about 5% in 2024 to over 9% in 2025, with vacancies referencing core AI skills more than doubling by early 2026.
- Roughly half of tech roles now expect AI skills, and postings mentioning AI pay about 28% more.
- AI Engineer is the #1 fastest-growing job on LinkedIn’s 2026 “Jobs on the Rise” list.
- AI has already created 1.3 million new roles — AI Engineers, Forward-Deployed Engineers, Data Annotators — plus 600,000 AI-enabled data center jobs, according to LinkedIn data cited by the World Economic Forum.
But the creation is uneven. The WEF Future of Jobs Report 2025 projects 92 million jobs displaced by 2030 against 170 million new ones created — a net gain of 78 million jobs. The problem isn’t total numbers. It’s who gets displaced and who gets created for.
The Entry-Level Collapse
The most dramatic shift is at the bottom of the career ladder.
The Stanford study (“Canaries in the Coal Mine?” by Brynjolfsson, Chandar, and Chen), using anonymized ADP payroll data covering millions of workers, found that employment for software developers aged 22–25 declined by nearly 20% compared to its peak in late 2022. Meanwhile, employment of workers aged 30+ in high AI-exposure fields grew 6–12% over the same period.
The researchers’ explanation: “AI replaces codified knowledge, the ‘book-learning’ that forms the core of formal education,” while “AI may be less capable of replacing tacit knowledge, the idiosyncratic tips and tricks that accumulate with experience.”
Ravio’s 2026 compensation report confirmed the pattern with a 73.4% decrease in entry-level (P1/P2) hiring rates across tech. A survey of 500 tech leaders found 72% plan to reduce entry-level developer hiring, while 64% intend to increase investment in AI tools and training.
What this means for you depends on where you are:
- If you’re early-career: The path into the industry is narrower. AI tool proficiency isn’t a differentiator — it’s table stakes. What gets you hired is the ability to do what AI can’t: navigate ambiguity, make architectural decisions, understand business context, review code critically. Everything in Part 3 and Part 18.
- If you’re mid-career or senior: Your tacit knowledge is more valuable, not less. The Stanford data shows experienced workers are gaining employment share. The risk is complacency — assuming your experience alone will carry you without adapting to AI-augmented workflows.
The Salary Picture
The AI premium is real, but more moderate than headlines suggest.
Role / Skill | Premium | Source |
|---|---|---|
| AI/ML roles (IC level) | +12% base salary | Ravio 2026 |
| AI/ML roles (management) | +3% base salary | Ravio 2026 |
| AI engineers vs traditional SWEs | +5–20% base, +10–20% equity | Rise |
| LLM engineers vs general ML engineers | +25–40% | Index.dev |
| AI-fluent nontechnical roles | +35–43% | Nucamp |
Absolute figures in the US market (Hakia, Coursera):
- Software Engineer (general): $148,263 average, $231,873 at 90th percentile
- LLM Developer: $209,000 average — the highest AI-specific premium
- AI Engineer: $120K–$150K junior, $150K–$200K mid-level (+9.2% YoY), $200K–$300K+ senior
The flipside: senior software developers saw -10% salary drops, mid-level SQL developers saw -7%, and overall tech salary growth hit 1.6% in 2025 — the lowest in 15 years.
Ravio’s caveat is worth noting: “The reality for the median tech worker is much less pronounced” than media coverage suggests. The premiums exist, but they’re concentrated in roles that directly build AI products or integrate AI into core workflows — not in roles that merely “use Copilot sometimes.”
From “Prompt Engineer” to “Agentic Engineer”
The job titles are evolving fast. Andrej Karpathy introduced “agentic engineering” in February 2026 as the professional discipline of orchestrating AI coding agents:
“You are not writing the code directly 99% of the time… you are orchestrating agents who do and acting as oversight.”
The term was deliberately chosen because it “sounds like a serious engineering discipline involving autonomous agents, something you can say to your VP of Engineering without embarrassment, and can put in a job description.”
Addy Osmani formalized the distinction: agentic engineering is AI doing the implementation while the human owns architecture, quality, and correctness. Testing is the “single biggest differentiator” between agentic engineering and vibe coding.
The career evolution path:
- Prompt engineering — understanding how to communicate with LLMs effectively. Now a baseline skill, not a specialty.
- AI-assisted development — using Copilot, Cursor, Claude Code as productivity multipliers. Where most developers are today.
- Agentic engineering — orchestrating multiple AI agents, managing context across sessions, owning architecture while delegating implementation. The emerging professional standard.
An estimated 5% of developers now specialize as “Agent Orchestrators” who decompose large tasks and manage fleets of AI agents. Context engineering — the skill of structuring information so AI agents can reason effectively (Part 5) — is replacing prompt engineering as the foundational skill employers look for.
As Deepak Seth of Gartner put it: “The most valuable AI skill in 2026 isn’t coding, it’s building trust.”
The Interview Split
How companies evaluate AI skills in hiring is diverging into two camps:
Camp 1: AI-integrated interviews. Meta launched an AI-assisted coding interview in late 2025 — a 60-minute session where candidates choose from GPT-4o mini, GPT-5, Claude Sonnet 4/4.5, Gemini 2.5 Pro, or Llama 4 Maverick. The test isn’t whether you can code — it’s whether you can use AI with judgment and discretion. Meta claims AI is “optional” but “using AI properly will give you an edge.”
Camp 2: AI-prohibited interviews. Amazon and Google tell applicants not to use AI and disqualify candidates found using them. The logic: they want to verify foundational skills that AI can mask.
The transparency test from either camp: in companies allowing unlimited AI usage during interviews, roughly 20% of candidates were unable to explain how their solutions worked, only that they did. This is the skill atrophy from Part 18 showing up in the most consequential context.
The practical takeaway: prepare for both. Maintain the ability to code without AI (Camp 2 still exists and tests fundamental understanding), while also developing the orchestration and review skills that Camp 1 evaluates. The developers who can do both have the widest opportunity set.
The Portfolio Question
When AI wrote most of the code, what’s “yours”?
This question is no longer theoretical. 21% of Y Combinator’s Winter 2025 batch had codebases that were 91%+ AI-generated. Collins Dictionary named “vibe coding” its Word of the Year for 2026. Solo founders are shipping production apps in days.
IBM Research developed an AI Attribution Toolkit (presented at CHI 2025, N=155 participants) that generates human-readable statements about the proportion, type, and review level of AI contributions. Their findings:
- Participants generally attributed more credit to the human than the AI for equivalent work
- Contributions of content warranted more credit than contributions of form
- 43.3% felt disclosing AI contributions should be mandatory for ethical and transparency reasons
- 61.5% felt that denoting AI involvement should be mandatory
A community-driven standard, AI_ATTRIBUTION.md, proposes six involvement levels that capture who controlled creative decisions — not line counts. The insight: “The fix for credibility issues isn’t hiding AI involvement — it’s being precise about creative control.”
Practical guidance for your portfolio:
- Be transparent. The industry is moving toward disclosure norms. Getting ahead of this builds credibility rather than undermining it.
- Emphasize decisions, not code. Your value is in the spec you wrote, the architecture you designed, the bugs you caught in review, the security gaps you identified. Document these decisions, not the generated output.
- Show the iteration. A portfolio piece that demonstrates “here’s what AI generated → here’s what I changed and why” is more impressive than either pure human code or pure AI code.
- Maintain some AI-free work. For the Camp 2 interview scenario and for your own skill maintenance (Part 18), keep projects where you can demonstrate unassisted capability.
The New Competition
Traditional developers now share the market with a new category: non-technical builders using AI to ship products. The dynamic isn’t simple.
Zach Lloyd (Warp CEO) argues that as coding becomes commoditized, “the human element — the ability to articulate high-quality specifications, define elegant user experiences — becomes the ultimate bottleneck and the most valuable skill set.”
But 55% of employers who laid off workers for AI report regretting it (Forrester), with many companies “laying off workers for AI capabilities that don’t exist yet, betting on future promises rather than proven technology.”
The Superblocks confidence gap from Part 17 applies here: vibe-coded apps reach demo quality fast but stall before production. The developers who can take a project from 80% to deployed — with testing, security, error handling, and deployment discipline — occupy a role that non-technical builders cannot fill and AI alone cannot automate.
That’s the career moat. Not writing code (AI does that). Not generating ideas (everyone can prompt). It’s the engineering judgment to turn generated code into production software. Every part of this guide has been building that judgment.
The Synthesis
The career landscape in 2026 rewards a specific combination:
- Foundational skills — algorithms, architecture, systems thinking. The Stanford study shows tacit knowledge is harder to automate than book knowledge. Maintain it.
- AI orchestration — context engineering, spec-driven development, multi-agent workflows. The skills that carry the AI premium. This entire guide.
- Review and judgment — catching what AI gets wrong, making decisions AI can’t, owning quality. The skill that separates “uses AI tools” from “builds production software with AI.”
- Transparency — honest attribution, demonstrable decision-making, the ability to explain your work with or without AI. The skill that builds trust in interviews, on teams, and in your portfolio.
The job titles will keep changing. “Agentic engineer” may stick or may not. What won’t change is the underlying capability: someone who can specify what to build, orchestrate AI to build it, verify it’s correct, secure it, deploy it, and explain every decision along the way.
That’s not a diminished role. That’s a more demanding one — and the salary data reflects it.
The final part of this guide: Part 20: What Mastery Looks Like.
Part 20: What Mastery Looks Like
You’ve read 19 parts about context engineering, specs, testing, debugging, security, refactoring, legacy codebases, deployment, psychology, and careers. If you’ve been applying any of it, your workflow today looks different from when you started.
This final part isn’t about one more technique. It’s about what it looks like when all the techniques disappear into something fluid — and what happens next.
When the Tool Becomes Invisible
There’s a concept in music called “effortless mastery” — the point where the instrument disappears and the musician becomes one with expression itself. The hands know what to do. The conscious mind focuses on the music, not the mechanics.
The same thing happens with AI-assisted development. At Anthropic, where engineers use Claude in 60% of their work, one finding stands out: “The more excited I am to do the task, the less likely I am to use Claude.” That’s not rejection of the tool. That’s mastery — knowing when to hand off and when to engage directly, without deliberating about the choice.
The most sophisticated Claude Code users at Anthropic run 2–4 concurrent instances, have on-call playbooks encoded as slash commands, and shift from writing documentation to building working prototypes as their primary communication method. The tool has become infrastructure — as invisible as the compiler. About 90% of Claude Code’s own production code is written by or with Claude Code. One of its creators hasn’t manually edited a single line of code since November 2025.
But here’s what matters: these same engineers retain full architectural ownership. They design systems, review every critical path, and intervene when the tool gets it wrong. The tool is invisible, but the judgment is always present. Anthropic encourages engineers to “use AI as aggressively as possible, because through giving AI agents really hard tasks, that’s the only way to actually push the boundary.” Aggressive use and careful oversight aren’t contradictions. They’re the definition of mastery.
The Expert’s Choice: When Not to Use AI
The counterintuitive hallmark of mastery is knowing when to turn AI off.
The METR study showed that experienced developers were 19% slower with AI on tasks they knew deeply. The reason: for experts with high prior exposure to a codebase, AI adds friction rather than removing it. You spend time writing prompts, waiting for generations, reviewing output, and cleaning up mistakes — for code you could have written faster yourself.
The Cerbos team captured this as the “70% problem”: AI excels at scaffolding, but the final 30% — edge cases, architecture, testing — often takes longer than writing from scratch. Only 16.3% of developers reported AI made them “more productive to a great extent.” The majority live in the messy middle where AI sometimes helps and sometimes hurts.
Mastery means recognizing the boundary in real time:
Use AI when:
- The task is well-defined, repetitive, or boilerplate
- You need to explore an unfamiliar API or library
- You’re generating tests, documentation, or migration files
- The output is easily verifiable against a spec
- You’re working across a codebase you don’t know deeply
Don’t use AI when:
- You know the solution and can type it faster than prompting
- The logic is security-critical or performance-sensitive (Part 14)
- You’re building the core domain logic that defines your product
- You need to deeply understand the code for future debugging
- The task requires the kind of holistic reasoning that context windows can’t capture
Anthropic’s own skill formation study identified three AI interaction patterns that preserve learning (conceptual inquiry, guided exploration, explanatory debugging) and three that impair it (blind delegation, copy-paste integration, uncritical acceptance). The expert defaults to the first three. The novice defaults to the last three. The difference isn’t the tool — it’s the intention behind its use.
Craftsmanship Transformed
The fear that AI kills software craftsmanship is understandable. The earezki analysis named the “AI Acceleration Paradox” — producing vastly more code while experiencing diminished satisfaction. Effortless creation feels “weightless” compared to the cognitive investment of manual development. The ownership gap: engineers who review rather than write code become “strangers to their own codebases.”
But craftsmanship has survived every previous automation wave by concentrating toward what automation can’t provide.
When the camera was invented, artists predicted the death of visual art. When digital cameras replaced film, photographers predicted the death of the craft. Neither happened. The craft adapted by elevating what mattered more — artistic vision, composition, intentional practice — while automation handled the mechanical aspects. Film photography communities thrive alongside computational photography. The parallel to coding is direct.
Sandro Mancuso of Codurance makes the argument precisely: “Speed is productive only when it supports clarity, correctness, and adaptability.” Less than 25% of developer time involves writing code — most is spent reading and understanding existing systems. The bottleneck was never typing speed. It was comprehension. And comprehension is exactly what automation can’t shortcut.
Nathan Sobo of Zed frames it as opportunity: “In a world of abundance, the bar should be higher for quality.” When everyone can generate code, the differentiator is taste, judgment, and the ability to build systems that last. “The barrier to entry to building truly great experiences has never been lower.”
8th Light’s Zuko Mgwili landed on the cleanest frame: “AI tools at the hands of an already capable engineer are a superpower.” Craft doesn’t diminish when auxiliary tasks are automated — it concentrates toward the uniquely human elements: judgment, architectural thinking, and deliberate creation.
The pattern across all these sources: craftsmanship isn’t dying. It’s becoming more important, more concentrated, and more valuable. The developers who care about quality have always been rare. AI hasn’t changed that. It’s made the gap between careful and careless work more visible.
The Mature Form
Andrej Karpathy coined “vibe coding” in February 2025. One year later, he renamed the professional approach:
“Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny.”
His term: agentic engineering. You’re not writing code directly 99% of the time. You’re orchestrating agents who do, acting as oversight, and taking every inline learning opportunity.
Addy Osmani formalized the framework with four pillars:
- Start with planning — write a design doc or spec before prompting (Part 6)
- Direct then review — give the agent a well-scoped task, review the output with the same rigor you’d apply to a teammate’s PR (Part 8)
- Test relentlessly — comprehensive test suites as the primary differentiator (Part 11)
- Maintain ownership — documentation, version control, CI/CD, monitoring (Part 13)
Osmani’s critical caveat: “This approach disproportionately benefits senior engineers with deep fundamentals in system design, security patterns, and performance.” The skills you already have are the leverage. AI amplifies them.
The Glide definition captures the formal version: “AI agentic programming is a software development discipline in which humans define goals, constraints, and quality standards while AI agents autonomously plan, write, test, and evolve code under structured human oversight.”
This is what mastery looks like. Not coding faster. Not prompting better. Defining what to build, setting constraints, reviewing output, and owning quality — while AI handles the implementation at a scale and speed no individual could match manually.
Where This Is Heading
Anthropic’s 2026 Agentic Coding Trends Report identifies eight trends in three categories:
Foundation shifts:
- Development lifecycle pivots to supervision and review
- Agents become team players — from single agents to specialized multi-agent groups under orchestrators
Capability expansion:
- Agents go end-to-end — from minutes to hours or days of autonomous work
- Agents learn when to ask for help — detecting uncertainty and requesting human input
- Agents spread beyond software engineers — COBOL, Fortran, operations, design, and non-technical roles
Impact:
- More code, shorter timelines — TELUS created 13,000+ custom AI solutions while shipping 30% faster, saving 500,000+ hours. Zapier achieved 89% AI adoption with 800+ agents deployed.
- Non-engineers embrace agentic coding — sales, legal, marketing using agents for custom tools
- Security becomes a dual-edged sword — agents help defenders and attackers scale
Nicholas Zakas mapped the evolution in three phases: the Autocomplete Era (2024), the Conductor Phase (2025), and the Orchestrator Model (late 2025 onward) with fully autonomous cloud agents and multiple concurrent sessions. His predictions: by 2028, IDEs will be primarily agent-focused and teams will stabilize at “minimum viable engineering team” sizes. By 2030, AI code reviews may replace human reviews for routine changes.
The real bottleneck is shifting upstream. As Built In framed it: “In a world where AI can generate code faster than humans can type, code is no longer the scarce resource. Clarity is.” The developer role transitions from “writer of syntax” to “architect of intent” and “verifier of logic.”
Gartner forecasts 30% enterprise adoption of multi-agent systems by 2027. But 40% of agentic AI projects will fail due to inadequate risk controls — a reminder that the governance, testing, and security skills from this guide (Parts 11, 13, 14) aren’t optional extras. They’re what separates the 60% that succeed from the 40% that don’t.
The Skill Formation Paradox
There’s a forward-looking concern that threads through everything in this guide. If AI impairs skill formation — and Anthropic’s own study shows a 17% comprehension drop for AI-dependent learners — then the pipeline of expert engineers who can provide the judgment and oversight that agentic engineering requires may narrow over time.
The study recommends “careful deployment of AI tools with intentional design choices that support learning while maintaining productivity benefits.” This isn’t an abstract policy concern. It’s personal. The practices from Part 18 — coding without AI, asking “why” before accepting output, deliberate practice on complex components — aren’t just about maintaining your current skills. They’re about building the next generation of skills that this rapidly evolving field will demand.
The Full Picture
Twenty parts. Here’s what it all comes down to:
The foundation (Parts 1–4): AI coding tools are powerful, widely adopted, and imperfect. The spectrum from autocomplete to full delegation is real. Your existing skills are the leverage. The right tool depends on the task, not the brand.
The core skills (Parts 5–9): Context engineering is the #1 differentiator. Specs prevent drift. Good prompting is about structure, not cleverness. Reviewing AI code is a distinct skill. Session management prevents context degradation.
The engineering discipline (Parts 10–16): Architecture matters more with AI, not less. Testing catches what AI misses. Debugging AI code requires understanding its failure patterns. Version control is your safety net. Security is where AI fails most consistently. Refactoring AI code is a necessary ongoing practice. Existing codebases are the hardest scenario and demand the most discipline.
The full stack (Part 17): Spec → scaffold → implement in layers → test and secure → deploy with guardrails. The 90% that fail skip hardening and expect linearity. Don’t.
The human dimension (Parts 18–20): The grief is real. Skill atrophy is measurable. Automation complacency is a documented pattern. The career landscape rewards AI proficiency, foundational skills, and the judgment to combine them. And mastery means knowing when to use AI and when not to — making the choice deliberately, not by default.
Every claim in this guide is backed by a real source. Not because we’re pedantic about citations, but because in a field where AI confidently generates plausible-sounding nonsense, grounding in evidence is the practice that keeps you honest. Apply the same standard to your own work.
The tools will keep changing. The models will get better. New frameworks will emerge. What won’t change is the need for someone who can specify what to build, evaluate whether it was built correctly, and take responsibility for the outcome. That’s the job. It always has been. AI just changed the interface.
Welcome to agentic engineering.