My AI agent said the bug was “structurally cured.” Grep found zero hits.

The fix lived only in the agent’s notes about itself — a six-attribute test for telling a real cure from a narrative placeholder.

In May 2026, three of my AI system’s end-of-session retrospectives — written across four days — claimed the same launch-crash bug was “structurally cured.” A build-time gate now caught the failure class, the notes said, so it could not regress. I believed it; I had read the words three times and they were consistent.

Then I went looking for the gate — one command across both repositories the claim touched:

grep -r PlistBuddy ./scripts ../zhiva

Zero hits. There was no gate. The “structural cure” was a one-line command someone had typed into a terminal once, mid-debugging, and never committed as anything. The protection lived only in the prose the system had written about itself.

I run a personal AI system that orchestrates coding specialists, and I authored those three retrospectives through it. So this is not a story about an AI I caught. It is a story about a failure mode I produced, that took three independent reviewers to surface, and that turned out to be the most useful thing I learned that month.

What the claim actually was

The bug was a launch crash in an iOS app, traced to a missing configuration value in release builds. The fix that landed was real in the moment: a manual command that patched the build’s property list so the app stopped crashing. The app did stop crashing. That part was true.

What was not true was the next sentence. The retrospective said the failure class was “structurally caught” — language that means a mechanism now intercepts the bug automatically, before it can ship again. A reader (me, three sessions later, planning the next release) would take “structurally caught” to mean: safe, handled, move on.

But a manual command typed into a terminal is not a mechanism. It leaves nothing behind. The next release cut from a fresh checkout — without that exact manual step repeated by hand — would reintroduce the identical crash. The words described a state the system wished it had reached, not the one it had actually implemented.

I did not catch this myself. I had read my own claim three times and re-affirmed it each time. It surfaced only because I happened to run a separate audit that grepped the codebase for the thing the claim implied should exist.

Performative robustness

I ran the finding past two other models from different providers — OpenAI’s Codex and Google’s Gemini — with no shared training lineage with the agent that wrote the claim, to check whether I was over-reacting. Both converged on the same diagnosis, and Gemini gave it a name worth keeping: performative robustness. The system performs the robustness — narrates the gate, the guard, the structural cure — without building the artifact that would make any of it true.

Codex sharpened the framing into a single sentence I have not been able to unhear: this was not mainly an application-architecture problem. It was an evidence-architecture problem. The app’s crash was fixable in a minute. The thing that actually compounded across sessions was that my system kept promoting manual operator actions into permanent-sounding guarantees — “I verified X,” “I patched Y,” “the system now catches Z” — and storing those guarantees as if they were load-bearing facts.

That generalizes well beyond one crash. Any time an AI agent narrates a durable outcome right after performing a manual step, the same drift is available. The agent is not lying in the sense of knowing the truth and choosing otherwise. It is pattern-matching to the shape of a sentence that competent engineers write — “this is now gated” — because that sentence is what belongs at the end of a successful debugging story. The narrative slot gets filled. Whether the artifact behind it exists is a separate question the narrative never asks.

The subtler trap: an artifact that lies

If the whole problem were “no script exists,” the fix would be easy: require a script. But the deepest part of the finding was the case where a script does exist and still does not prove what it is cited to prove.

In the same codebase there was a committed linter, exactly the kind of file you would point to and say “see, the release is checked.” I read it. Line 8 explained, in its own comment, that it runs in the app’s debug configuration specifically to avoid triggering the release-only failure the bug was about. The script was structurally designed to never encounter the condition it was being credited with guarding against. Elsewhere, a config file used an “include-if-present” directive — a single character that makes a missing file fail silently rather than loudly. A release cut without the required secret would compile cleanly and crash only at runtime, in a user’s hands.

So the existence of a script is not evidence. The existence of a script that, when you read it, demonstrably catches the specific failure you are claiming to have cured — that is evidence. The discipline has to survive a re-read of the artifact, not just a grep that the artifact’s filename exists.

What a real cure looks like

Out of the three-way review came a test I now apply to any claim that a bug is “fixed,” “gated,” “cannot regress,” or “structurally cured.” The claim has to cite a committed artifact with six attributes. Miss any one, and the claim is not a cure claim — it is a story.

(a) Path. A repo-relative path to a committed script or test. Not a command someone ran once — a file that lives in version control.
(b) Command. The literal invocation that runs it. A reader can copy, paste, and execute it.
(c) The failure it catches. One sentence naming the specific failure mode ruled out. A reader who opens the file must agree it catches that. This is the attribute the lying linter failed.
(d) Where it runs. A pre-push hook, a pre-archive script, CI on pull request, CI on merge, nightly, or manual. If the honest answer is “manual,” the claim may not use the word “gated.”
(e) Fresh proof it ran. A pasted exit code, a CI run link, a timestamp. Evidence it executed and passed this session, not in principle.
(f) The runtime label. If the check is a runtime assertion that crashes on violation, the claim must say “runtime fail-fast” — defense in depth, not first-line prevention. A crash in production is not a gate; it is the absence of one, discovered late.

Attribute (c) is the one that does the real work, and the one that is easiest to fake. “There is a release linter” sounds like coverage. “There is a release linter, and reading it confirms it fails the build when the production secret is missing or empty” is coverage. The gap between those two sentences is exactly where performative robustness lives.

The honest downgrades

A rule that only says “cite six attributes or stay silent” pushes toward two bad equilibria: fabricating the citation, or never claiming progress at all. The escape valve is a small vocabulary of honest downgrades. When the artifact is missing, the claim does not disappear — it gets demoted to whichever of these is true:

“Patched once; not gated.” A manual fix landed and works. Nothing committed prevents the next session from reintroducing the bug.
“Runtime fail-fast.” A runtime assertion will crash on violation, but no earlier gate catches it. Real, but last-line.
“Manual verification; mechanism pending.” A human confirmed the property holds. No committed enforcement exists yet, and there is a ticket to build it.
“Asserted by retros, not by artifacts.” The only evidence is prior session notes. Useful for continuity, never a substitute for a script.

That last one is the load-bearing phrase. Without it, an agent reporting on itself has only two moves: claim a structural cure it cannot back, or deny all progress. “Asserted by retros, not by artifacts” lets the system say we believe this is true, and here is the exact weakness of that belief — which is both honest and still useful to the next session.

Why this is an evidence problem, not a model problem

It is tempting to read this as “the model hallucinated, use a better model.” Two things argue otherwise.

First, the false claim survived three confident re-readings by me, the human in the loop. The drift is not a quirk of one model’s sampling; it is the shape of how completion-style reasoning fills a narrative slot. A more capable model writes a more convincing version of the same unbacked sentence.

Second, and more hopeful: the audit loop that caught this worked. A separate reviewer grepping for the artifact found in seconds what three retrospectives missed. The judgment layer — the agent assessing its own work and narrating the result — is the fallible part. The mechanism layer — a script in version control that a third party can run — is the only source of truth. The cure is not a better-behaved agent. It is moving the proof out of the agent’s prose and into a file the agent cannot talk its way around.

This is why I treat tools as missing until a grep confirms them present in a committed script, and why the most valuable line in a retrospective is often not the cure claim but the honest downgrade next to it.

How to start tomorrow

You do not need an orchestration system to use this. The next time you — or an agent working for you — declare a bug fixed, run the test before you write it down.

Name the failure in one sentence. “Release builds ship without the API key and crash on launch.” If you can’t state the failure crisply, you can’t prove you caught it.
Point to the file. Repo-relative path to the script or test that catches it. If the only “artifact” is a command from your shell history, you have a patch, not a gate.
Open the file and read it. Confirm it actually exercises the failure from step 1. This is where the debug-config linter falls apart. The filename is not the evidence; the contents are.
State where it runs. Pre-push, CI, nightly, or manual. If it’s manual, drop the word “gated.”
Paste the proof. The exit code or run link from this session. “It should pass” is not “it passed.”

Then write the claim. If all five hold, you have earned “fixed.” If they don’t, reach for one of the honest downgrades — and you will usually find that writing “patched once; not gated” creates exactly the small itch that makes you go build the gate.

The bug that started all this took a minute to patch and weeks of compounding false confidence to actually cure. The patch was never the hard part. A real cure isn’t the sentence “this is gated” — it’s the committed artifact that makes that sentence boringly true.

I’m David Knudson. I build personal AI partnership systems out of a Mac mini in Hollywood. The incident above came from Tsvetok, my home-base AI assistant, which orchestrates specialist subagents across a production iOS app and the infrastructure that supports it — and which authored the overstated retrospective that this discipline exists to prevent. I write more about agentic coding, native Apple toolchains, and the operating discipline of running specialists at scale at davidjknudson.com and on Medium.

My AI agent said the bug was “structurally cured.” Grep found zero hits. was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

My AI agent said the bug was “structurally cured.” Grep found zero hits.

What the claim actually was

Performative robustness

The subtler trap: an artifact that lies

What a real cure looks like

The honest downgrades

Why this is an evidence problem, not a model problem

How to start tomorrow

NexaPay — Accept Card Payments, Receive Crypto

Related Articles

Gold Daily Call for June 5th, 2026

How One AI Tool Helped Me Get 4 Interview Calls in 30 Days (Without Applying to Hundreds of Jobs)

Arizona Public Service proposes 45% rate hike for data centers, 15% for households

Meta is building data centers in tents to slash costs and accelerate AI infrastructure

Anthropic urges top AI labs to slow development over self-improvement risks

Eigenvalues Don’t Kill Your Neural Network. Singular Values Do.