The Pattern We All Use, But Shouldn’t Celebrate
In the Planner-Executor pattern, a human planner decomposes a problem into small steps and provides strategic coherence, while an AI executor handles the tactical implementation of each step.
This pattern appears in nearly every ambitious project built with today’s conversational interfaces. The development of the ‘cursed’ programming language, for instance, was a three-month exercise in this loop. The developer acted as project manager, feeding Claude a continuous stream of high-level tasks: “create a compiler,” “add editor extensions,” “implement a new language feature.” The model executed, the developer tested and integrated, and the cycle repeated. The project’s success depended entirely on the developer’s constant direction to maintain long-term coherence.
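Stripped to its control flow, the loop is simple. The sketch below is a hypothetical rendering, not anyone's actual tooling: `ask_model`, `human_review`, and `integrate` are stand-ins for a chat UI and a manual test-and-merge process.

```python
# Hypothetical sketch of the Planner-Executor loop; ask_model,
# human_review, and integrate are stand-ins, not a real API.

def ask_model(prompt: str) -> str:
    """Stand-in for a chat-completion call to the model."""
    return f"<code for: {prompt}>"

def human_review(draft: str) -> bool:
    """Stand-in for the developer reading and testing the output."""
    return True  # the optimistic case: the draft passes review

def integrate(draft: str) -> None:
    print(f"merged: {draft}")

def planner_executor(plan: list[str]) -> None:
    # The human owns the plan; the model only ever sees one step.
    for step in plan:
        draft = ask_model(step)
        while not human_review(draft):          # human verifies every output
            draft = ask_model(f"fix: {draft}")  # and re-prompts on failure
        integrate(draft)

planner_executor([
    "create a compiler",
    "add editor extensions",
    "implement a new language feature",
])
```

Everything that makes the project cohere lives in the parts the human supplies by hand: the plan, the review, the integration.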
This pattern is so common it’s now prescribed as a best practice. In ‘How I Use Claude Code’, a post on coding with Claude, Boris Tane explicitly advises against open-ended prompting. His method: first write a detailed, structured plan with implementation notes, then feed it to the model section by section. The human-authored plan is the source of truth. The AI is a tool for filling in blanks, not for charting the course.
This is micromanagement, equivalent to giving a brilliant but distractible junior engineer a checklist of small functions to write because you can’t trust them with the entire feature spec. The pattern mistakes the map for the territory. It celebrates the execution of small steps while ignoring that a human still has to draw the map, hold the compass, and point the way at every turn.
The Hidden Cost of Manual Scaffolding
The pattern’s primary cost is cognitive load. You become the external executive function for a powerful but stateless tool. You must hold the entire project context—architecture, dependencies, long-term goals, remaining tasks—in your head. The model can write code for step_4b, but only you know how it connects to step_1a and what it implies for step_7c.
The human bottlenecks both speed and complexity. A project can only move as fast as you can type the next instruction and verify the last output. The system’s complexity is capped not by the model’s coding ability, but by the limits of your own working memory. When the context grows too large, the project stalls. The workflow doesn’t scale; it’s a single-player mode for software development.
The pattern is brittle. If the human planner gets distracted or has an off day, the project’s coherence degrades. The model has no persistent state or understanding of the overarching goal beyond the current prompt. It cannot correct a flawed plan or suggest a better architecture. It will diligently execute a series of locally-optimal steps that lead to a globally-incoherent result if that’s what the plan dictates. The burden of maintaining architectural integrity rests on the planner’s constant attention.
The pattern is expensive, consuming a senior engineer’s focused time. The hours spent breaking down problems, writing step-by-step plans, and verifying outputs are hours not spent on higher-level architectural decisions. This reduces frontier models to advanced autocomplete, chaining them to a manual process that resembles artisanal craft more than software engineering.
The Frontier Is Already Past This
The standard justification for the Planner-Executor pattern is that models lack the long-term coherence to manage a complex project autonomously. This is true for models used through standard chat UIs. The failure, however, is not in the model’s core capability but in the tooling we use to access it.
This limitation is not fundamental. For example, one report showed an agent backed by the Kimi-K2.6 model operating for an extended period on operations tasks with minimal human oversight. This demonstrates a capability—sustained, goal-directed work—that is absent from mainstream coding assistants.
This creates a sharp disconnect. We have evidence of models capable of sustained, autonomous operation, yet the daily experience of engineers using tools built on models like GPT-4o or Claude 3 Opus is one of constant intervention. The problem isn’t that autonomous capability is impossible; it’s that it is not yet a feature of mainstream tools. The chat window is a poor interface for long-running, stateful tasks, forcing the human into the loop by its design.
The gap is one of tooling, not fundamental model capability. We have powerful engines but are still trying to steer them with a tiller. The Planner-Executor pattern is a symptom of this tooling gap. We build manual scaffolding around models because our tools don’t provide the automated scaffolding required for autonomy.
Closing the Tooling Gap: Measure Mean Time Between Interventions
To move past this, we need to change how we measure progress. The current focus on task completion rates on isolated benchmarks is misleading. An agent that can refactor a single file on command is useful, but it says nothing about its ability to contribute to a real-world project. The true measure of an agentic system’s value in software engineering is its autonomy.
We propose a different metric: Mean Time Between Interventions (MTBI). MTBI measures how long a system can work productively towards a high-level goal before a human needs to step in to correct its course, provide a new sub-plan, or unstick it from a loop. This is the real-world metric that matters. A low MTBI means the human is still the planner. A high MTBI means the system is beginning to exhibit genuine autonomy.
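The arithmetic is the same as for MTBF in reliability engineering: total autonomous operating time divided by the number of interventions. Here is a minimal sketch, assuming nothing more than a session window and a list of intervention timestamps:

```python
from datetime import datetime, timedelta

def mtbi(session_start: datetime, session_end: datetime,
         interventions: list[datetime]) -> timedelta:
    """Mean Time Between Interventions: total session time divided
    by the number of human interventions, mirroring MTBF."""
    elapsed = session_end - session_start
    if not interventions:
        return elapsed  # the agent never needed a human at all
    return elapsed / len(interventions)

# A two-hour session in which the human stepped in eight times:
start = datetime(2024, 5, 1, 9, 0)
end = start + timedelta(hours=2)
events = [start + timedelta(minutes=15 * i) for i in range(1, 9)]
print(mtbi(start, end, events))  # 0:15:00 -- still minutes, not hours
```

An eight-intervention, two-hour session yields an MTBI of fifteen minutes: the human is still the planner.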
To ground this, consider a common software task: refactoring a large Python script to use a new database library and its modern idioms. With a state-of-the-art tool like the Claude 3 Opus web UI, the model makes progress on a chunk of the file, but then needs a new prompt to tackle the next section, a clarification on an ambiguous class, or a correction after it loses the context of an earlier change. The workflow is a series of short sprints, with a human planner required at the end of each one.
The goal for the next generation of AI engineering tools should be to drive this number up. We need to move from an MTBI measured in minutes to one measured in hours, and eventually, days. This requires tools that are more than chat windows. We need systems with persistent state, access to a filesystem, and the ability to run tests and self-correct based on the output. The goal is to build an environment where the model can pursue a high-level objective like ‘refactor this service to use the new auth library’ over hours or days, not a tool that requires a new prompt for every file it touches. When our tools’ MTBI is measured in days, not minutes, we’ll know we’ve moved beyond the Planner-Executor crutch.
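In code, the structural difference is where the feedback comes from. The sketch below is an assumption-laden outline rather than any real tool: `model_propose_patch` is a hypothetical stand-in for an LLM call that edits the working tree, and the pytest invocation assumes a Python project. The point is that the inner loop is closed by the test suite, not by a human.

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite; assumes pytest, but any runner works."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def model_propose_patch(goal: str, feedback: str) -> None:
    """Hypothetical stand-in: ask the model for a patch toward `goal`,
    given the latest test output, and apply it to the working tree."""
    ...

def autonomous_loop(goal: str, max_iterations: int = 100) -> bool:
    feedback = ""
    for _ in range(max_iterations):
        model_propose_patch(goal, feedback)  # the model edits the filesystem
        passed, feedback = run_tests()       # the environment, not a human, judges
        if passed:
            return True                      # goal reached with zero interventions
    return False                             # only now does a human step in

autonomous_loop("refactor this service to use the new auth library")
```

Every pass through that loop that succeeds without a human extends the MTBI; the human appears only when the budget is exhausted.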