Agents Write Faster Than They Verify: Put Real App Testing in the Loop
AI agents generate faster than they verify. Across six developer threads, the missing layer is the same: proof the code actually ran, against the right context, with a record of what was tested.

Generation got fast. Verification did not. An AI agent can scaffold a feature, wire it through three files, and hand back a clean-looking diff in under a minute. But nothing in that minute started a server, clicked a real button, or confirmed that the flow you shipped last week still works. The distance between "code written" and "code exercised" is where most of the pain now lives, and it usually shows up days later as a regression nobody watched happen.
The pattern repeats across recent developer threads with unsettling consistency. The model writes confidently. The human assumes the model tested. Neither of those is the same as the application actually running against the change. You end up trusting a description of correctness instead of evidence of it, and the cost of that trust compounds quietly until something breaks in front of a user.
This post is about that missing layer specifically: not more generation, not faster agents, but proof that the generated code was run, against the right context, with a record of what was checked. Generation is no longer the bottleneck. Verification is.
The source signals for this post include thread 1, thread 2, thread 3, thread 4, thread 5, and thread 6.

When "it builds" is mistaken for "it works"
The clearest version of the problem is an agent that adds features quickly but never runs the app it is building. The diff compiles. Types check. The summary reads like a passing test report. And yet no browser was opened, no real flow was clicked through, and no edge case was triggered by real input. So a change that looks finished is really only half-finished: written, but never exercised.
The regressions from this gap are the worst kind, because they are invisible at the moment of merge. A form still submits, but a side effect on an adjacent page is now broken. The new route works, but the redirect after login silently changed. These surface days later, far from the commit that caused them, when context is cold and the blame trail has gone quiet.
The instinct in these threads is correct: pull real testing into the loop. Browser-driving tools like Playwright and MCP-style test harnesses can let an agent actually load the app, fill a field, click submit, and read back the result instead of imagining it. That closes the gap in principle. The harder part in practice is keeping a durable record of what was run, so that "I tested it" stops being a claim and becomes an artifact you can point at later.
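One way to make "I tested it" an artifact is to wrap whatever command exercises the app and append its result to a log. The sketch below is a minimal illustration, not a prescribed setup: the function name `run_and_record`, the JSONL file layout, and the field names are all assumptions made for this example, and the command shown is a stand-in for a real browser test run (for instance `npx playwright test`, if that is what your project uses).

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_and_record(command: list[str], log_path: Path, timeout: float = 120.0) -> dict:
    """Run a real command, capture its output, and append a record to a JSONL log.

    The record layout is hypothetical; adapt the fields to your own evidence trail.
    """
    started = time.time()
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    record = {
        "command": command,
        "exit_code": proc.returncode,
        "passed": proc.returncode == 0,  # rely on the exit code, not on narration
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "started_at": started,
        "duration_s": round(time.time() - started, 3),
    }
    # Append-only: earlier runs stay in the file so they can be searched later.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    log_file = Path(tempfile.gettempdir()) / "evidence.jsonl"
    # Stand-in command; replace with the command that actually drives your app.
    result = run_and_record([sys.executable, "-c", "print('checkout flow ok')"], log_file)
    print(result["passed"], result["stdout"].strip())
```

The design choice worth noting is that pass or fail comes from the process exit code and the raw output is stored alongside it, so a later reader can check the claim rather than trust the summary.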

The same bug, three times, on the meter
The second pattern is more frustrating because it is a feedback loop that costs money. A developer working on a simple website hands the task to Claude Code with explicit instructions to test before declaring done. The agent reports success, the bug remains, the developer points it out, and the agent confidently introduces a new bug while fixing the old one. Each round burns a meaningful slice of a monthly plan, and the project is barely complex enough to justify any of it.
What is actually failing here is not the model's raw ability. It is the absence of a real verification step between "I changed something" and "I am done." When the only signal the agent gets back is its own narration, every iteration is a guess. Tell it to test, and without a runnable target and captured output, "testing" collapses back into describing what the code should do. The loop never gets a hard stop from reality, so it keeps spending.
The fix is structural, not motivational. You do not solve this by writing a sterner system prompt. You solve it by making the test step something the agent must produce evidence for, with the run and its output recorded, so a failed check ends the iteration instead of starting a new billable one.
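The structural fix can be sketched as a gate around the agent loop: each attempt must be followed by a real check, and a failing check stops the loop and hands the evidence to a human instead of silently starting another round. Everything here is hypothetical scaffolding for illustration: `propose_change` stands in for whatever invokes the agent, `run_check` for whatever actually runs the app or its tests, and the default of one attempt is an assumption you can loosen.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CheckResult:
    passed: bool
    output: str  # captured output from the real run, kept as evidence


@dataclass
class LoopOutcome:
    status: str  # "verified" or "needs_human"
    attempts: int
    evidence: list = field(default_factory=list)


def gated_loop(
    propose_change: Callable[[int], None],
    run_check: Callable[[], CheckResult],
    max_attempts: int = 1,
) -> LoopOutcome:
    """Let the agent change code only while a hard attempt budget remains.

    A passing check ends the loop as "verified". Running out of attempts ends
    it as "needs_human" with every check result attached.
    """
    evidence = []
    for attempt in range(1, max_attempts + 1):
        propose_change(attempt)   # hypothetical hook: ask the agent for a change
        result = run_check()      # hypothetical hook: exercise the real app
        evidence.append(result)
        if result.passed:
            return LoopOutcome("verified", attempt, evidence)
    return LoopOutcome("needs_human", len(evidence), evidence)
```

With `max_attempts=1`, a single failed check ends the iteration and returns the captured output for a person to read. Raising the budget allows retries, but the spend is bounded and every attempt leaves a result behind.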

Context is a discipline, not a default
The third signal comes from heavier daily use. A developer running Cursor Composer at work reports it is genuinely fun and capable, costing somewhere around eight to ten dollars a day in real usage. The catch is the manual labor underneath that figure: to get good output, they have to keep linking the right folders and files into context by hand. Left to its own defaults, the agent works from too little or the wrong context and produces plausible code that does not fit the codebase.
This is worth naming plainly. Context attachment is not a convenience feature you reach for occasionally. It is the discipline that decides whether the output is worth verifying at all. An agent pointed at the wrong three files will write something internally consistent and externally wrong, and it will do so confidently enough that the error survives review.
The expensive part is that this attachment work is invisible and unrepeatable. The developer figures out the right file set for a task, gets a good result, and then loses that mapping the next time a similar task comes up. Project-scoped context that can be attached deliberately, saved, and reused turns a daily improvisation into something the whole team inherits.
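A saved context mapping does not need to be elaborate. The sketch below assumes a hypothetical JSON manifest checked into the repository that maps a task name to the files worth attaching; the file format and the function names `save_context_set` and `load_context_set` are invented for this example and do not describe any particular tool's API.

```python
import json
from pathlib import Path


def save_context_set(manifest: Path, task: str, paths: list[str]) -> None:
    """Store the file set that worked for a task so it can be reused later."""
    data = json.loads(manifest.read_text(encoding="utf-8")) if manifest.exists() else {}
    data[task] = sorted(set(paths))  # de-duplicate and keep a stable order
    manifest.write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_context_set(manifest: Path, task: str, root: Path) -> tuple[list[str], list[str]]:
    """Return (present, missing) paths for a task, relative to the project root.

    Reporting missing files makes a stale mapping visible instead of letting
    the agent quietly work from less context than intended.
    """
    data = json.loads(manifest.read_text(encoding="utf-8")) if manifest.exists() else {}
    present, missing = [], []
    for rel in data.get(task, []):
        (present if (root / rel).exists() else missing).append(rel)
    return present, missing
```

Because the manifest is plain JSON keyed by task, it can be reviewed and shared like any other project file, which is what turns one developer's improvisation into something a team can inherit.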

When your tools are entangled
The fourth signal is about everything around the agent. A developer moving toward DeepSeek and opencode to control cost and stay flexible hits a wall that has nothing to do with the model: VS Code's rename and refactor features appear to be blocked by Copilot's quota state. A basic editor capability is entangled with the billing status of an unrelated assistant, and the workaround is not obvious.
This is the cost of accidental coupling. The more your verification path depends on a single vendor's good graces, the more a quota counter you forgot about can stall work that should be free. People reach for DeepSeek or opencode precisely to avoid this kind of lock-in, and then discover the lock-in lives one layer down in the editor itself.
The lesson generalizes to the whole stack. Cost-aware switching across Claude Code, Cursor, Codex, Gemini, and local models is only real if your evidence trail, your context, and your history survive the switch. If changing models means losing your record of what was tested, you have not switched tools; you have started over.
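One way to keep an evidence trail portable is to key it by what was checked and treat the tool as a plain metadata field. The following is a sketch under that assumption: the `Evidence` record, its fields, and the tool labels are hypothetical and chosen only to show that a lookup by commit works the same no matter which assistant made the change.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Evidence:
    commit: str         # what was tested
    check: str          # which check ran, e.g. "login-redirect"
    passed: bool
    tool: str           # which assistant produced the change; metadata only
    recorded_at: float  # timestamp used to pick the most recent result


def latest_status(records: Iterable[Evidence], commit: str) -> dict[str, Evidence]:
    """Return the most recent result per check for a commit, regardless of tool."""
    latest: dict[str, Evidence] = {}
    for rec in records:
        if rec.commit != commit:
            continue
        prev = latest.get(rec.check)
        if prev is None or rec.recorded_at > prev.recorded_at:
            latest[rec.check] = rec
    return latest
```

Since nothing in the lookup depends on `tool`, switching assistants mid-project changes one field in new records and leaves the existing history queryable.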

Evidence over assurances
Underneath all four scenarios is one request: stop accepting the agent's word for it. The best-practices threads converge on the same hard-won conclusion: the teams that stay sane are the ones who treat agent output as unverified until something runs it. That means a real app run, the right context attached on purpose, and a record of what was checked that outlives the chat session.
The multi-agent pipeline experiments point in the same direction. The value is not a cleverer prompt chain; it is the structure that forces each step to leave evidence. A pipeline that turns a vague task into a checked result is really just verification made mandatory and observable, and that is the part worth keeping regardless of which agents you run inside it.

Where 1DevTool Fits
1DevTool is a workspace and control-plane layer that sits over your AI coding tools rather than replacing them. Its concern is exactly this verification gap. Searchable terminal and command history means an app run is captured, not narrated. Annotated evidence trails let a passing or failing check become an artifact you can revisit, not a sentence in a scrollback that is already gone. Project-scoped context attachment turns the daily improvisation of linking the right files into something saved and reusable. Approval workflows put a deliberate stop between generated code and the place it lands. And cost-aware switching across Claude Code, Cursor, Codex, Gemini, and local models means that evidence and context survive when you move between them.
None of this generates code faster. That is the point. The leverage now is in making sure the code you already generated was actually exercised.

| Without a verification layer | With 1DevTool |
|---|---|
| Agent reports "tested" with no run behind it | App runs captured in searchable terminal history |
| Regressions surface days later, far from the commit | Annotated evidence trail records what was checked, when |
| Right context re-discovered by hand every task | Project-scoped context attached, saved, and reused |
| Iteration loops burn plan budget on unverified guesses | Approval workflows gate changes before they land |
| Switching models loses your history and setup | Evidence and context persist across Claude Code, Cursor, Codex, Gemini, and local models |

The Takeaway
The agents are fast enough. The bottleneck moved. What is missing in most setups is not more generation but proof: that the code ran against the real app, that the right context was attached on purpose, and that there is a record of what was tested when the regression shows up next week. Treat verification as the deliverable, not the afterthought, and most of these threads stop being recurring complaints. Give the work a control plane that captures runs, holds context, and gates changes, and "I tested it" finally becomes something you can point at instead of something you have to believe.