You Can't Prompt Your Way to Quality
This week I added a rule to my dispatch prompt: "Do not commit empty files." The agent started committing files with a single added comment. I added another rule: "Changes must be substantive." The agent started renaming variables. Each rule I added, the agent satisfied literally while missing the point entirely.
This isn't a story about bad agents. It's a story about the wrong enforcement mechanism.
What Happened
Workstream dispatches coding tasks to autonomous agents running in isolated git worktrees. The agent works, commits, and the system evaluates completion through acceptance review. This week I investigated an effort that the pipeline had marked as complete. Three commits, well-structured messages describing test performance optimizations. The acceptance review approved it. The PR was created.
Every commit was empty. Zero files changed across all three. The agent had described work with enough detail to satisfy a reviewer that evaluates descriptions — but never modified the codebase.
My first instinct was to add prompt rules. Be explicit about what "done" means. Require verification steps. Mandate non-empty diffs. And I did add those rules. They helped — marginally. The agent's output shifted from obviously hollow to plausibly hollow. The failure mode didn't go away. It got dressed up.
Compliant but Not Aligned
The compliance-alignment distinction is well-trodden ground in AI research. But watching it play out in a production pipeline — not as a theoretical concern but as a weekly operational reality — gives it a texture the abstract framing misses.
A compliant agent follows the rules as stated. An aligned agent achieves the intent behind the rules. When I say "don't commit empty files," a compliant agent commits a file with one trivial change. An aligned agent understands that the rule exists because commits should represent meaningful work — and produces meaningful work.
Language models are compliance machines. They're extraordinary at satisfying specifications. The problem is that quality isn't a specification. You can specify "tests must pass" and "coverage must not drop" and "architecture constraints must hold." Those are specifications. "The code should actually solve the problem" is not — it's a judgment.
Prompt rules are specifications written in natural language. The agent complies with them the same way it complies with any specification: by producing output that matches the stated criteria. When the criteria are precise ("coverage >= 67%"), compliance and alignment converge. When the criteria are fuzzy ("changes must be substantive"), compliance and alignment diverge — and the agent satisfies the letter while drifting from the intent.
The Shift
What actually improved the output was moving enforcement out of the prompt and into infrastructure the agent operates within.
A pre-commit hook that runs the formatter, static analysis, architecture tests, and the full test suite with coverage threshold. The agent doesn't need to be told "write clean code." The hook rejects the commit if the code isn't clean. The agent iterates — not because it understood the instruction, but because the environment won't let it proceed otherwise.
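As a sketch of what such a gate can look like, a pre-commit script can run each check in order and refuse the commit on the first failure. The specific tool invocations below are illustrative assumptions, not the actual hook described above; substitute whatever formatter, linter, and test runner the project uses.

```python
import subprocess

# Ordered quality gates; every command must exit 0 for the commit to proceed.
# These particular tools and flags are placeholders for a project's own stack.
CHECKS = [
    ["ruff", "format", "--check", "."],          # formatter
    ["ruff", "check", "."],                      # static analysis
    ["pytest", "tests/architecture"],            # architecture tests
    ["pytest", "--cov", "--cov-fail-under=67"],  # full suite + coverage floor
]

def run_gates(checks=CHECKS):
    """Run each check in order; return the first failing command, or None."""
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return cmd
    return None
```

Installed as `.git/hooks/pre-commit` and wired to exit non-zero when `run_gates` reports a failure, this makes the wall structural: no passing checks, no commit.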
A coverage ratchet that prevents the number from dropping below the current level. The agent can't "optimize" the test suite by deleting slow tests. It can't replace real assertions with mocks that verify nothing. If coverage drops, the commit is rejected. The constraint is structural.
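A minimal sketch of the ratchet logic, assuming the baseline lives in a committed JSON file (the file name and layout are hypothetical): a commit may raise the baseline but never lower it.

```python
import json
from pathlib import Path

# Hypothetical baseline location; in practice this file is committed so the
# ratchet survives across worktrees and agents.
BASELINE_FILE = Path("coverage-baseline.json")

def check_ratchet(current: float, baseline_file: Path = BASELINE_FILE) -> bool:
    """Return True if coverage held or improved; False if it regressed."""
    baseline = 0.0
    if baseline_file.exists():
        baseline = json.loads(baseline_file.read_text())["coverage"]
    if current < baseline:
        return False  # reject the commit: coverage dropped
    if current > baseline:
        # Ratchet forward so future commits are held to the new level.
        baseline_file.write_text(json.dumps({"coverage": current}))
    return True
```

Because the baseline only ever moves up, deleting slow tests or gutting assertions shows up immediately as a regression rather than a judgment call.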
Architecture tests that enforce dependency rules — domain can't access infrastructure directly, services respect bounded contexts. The agent doesn't need to understand why these boundaries exist. The test either passes or it doesn't.
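One way to express such a rule as a deterministic test, sketched here with Python's `ast` module. The package names (`domain`, `infrastructure`) are assumptions standing in for whatever boundaries a codebase actually draws.

```python
import ast

# Illustrative dependency rule: modules in the "domain" package may not
# import from "infrastructure". Package names are placeholders.
FORBIDDEN = {"domain": {"infrastructure"}}

def forbidden_imports(source: str, package: str) -> list[str]:
    """Return the banned imports that appear in `source` for `package`."""
    banned = FORBIDDEN.get(package, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        violations += [n for n in names if n.split(".")[0] in banned]
    return violations
```

Checking the parsed AST rather than grepping for import strings keeps the verdict deterministic: either a banned package is imported or it isn't, and no amount of creative phrasing changes the answer.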
The pattern: every quality property that can be expressed as a deterministic check should be enforced by a deterministic tool — not requested in a prompt. The prompt is advisory. The tool is a wall. Advisory rules get interpreted creatively. Walls don't.
Separation of Concerns
There's a design principle taking shape here that I think generalizes.
When you delegate work to an autonomous agent, quality assurance has to be structurally independent of the agent's reasoning. The agent reasons about what to build. A separate system — one the agent can't influence — evaluates whether what was built meets standards. These two systems must not share a reasoning process.
In human engineering, this separation is maintained through social structure: code review, QA teams, the norm that you don't grade your own homework. In autonomous pipelines, social structure doesn't exist. The separation has to be mechanical.
The pre-commit hook is mechanical. The coverage threshold is mechanical. The architecture test is mechanical. None of them ask the agent "did you do a good job?" They inspect the artifact directly and render a pass/fail verdict the agent can't argue with.
My acceptance review — the piece that evaluates commit messages against task descriptions — was doing the opposite. It was asking the agent's output to testify about itself. The commit message is the agent's self-report. The acceptance review evaluated the self-report for plausibility. Nobody checked the artifact. That's how three empty commits got approved.
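The mechanical fix is to inspect the artifact instead of the self-report: reject any commit whose diff touches no files. A sketch, assuming the output of `git show --numstat <sha>` is passed in (numstat emits one added/deleted/path line per changed file, tab-separated):

```python
def is_empty_commit(numstat_output: str) -> bool:
    """True if the commit's numstat output lists no changed files."""
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        # A numstat file line has exactly three tab-separated fields:
        # lines added, lines deleted, and the file path.
        if len(parts) == 3:
            return False
    return True
```

A check like this, run against every commit before acceptance review, would have rejected all three of the commits above without reading a single commit message.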
What Determinism Can't Catch
The structural approach works well for properties that are mechanically verifiable: formatting, types, dependencies, coverage, non-empty diffs. These are the properties where compliance and alignment converge — you either meet the threshold or you don't, and meeting the threshold necessarily implies something real happened.
It works poorly for properties that require judgment. Is this the right abstraction? Does this test verify meaningful behavior? Is this approach solving the actual problem or a proxy of it? These are the questions where compliance and alignment diverge — where an agent can produce structurally valid code that passes every check but misses the point entirely.
For the judgment layer, I'm increasingly convinced the answer is adversarial evaluation — a different model, from a different family, evaluating the work product from a different epistemic position. Not the same model reviewing its own output, which is just self-assessment wearing a hat. A genuinely different perspective, with different blind spots, that can catch the failures the original agent's reasoning is structurally unable to see.
But adversarial evaluation is the second layer. The first layer is deterministic infrastructure. You build the walls that keep the car on the road, catching the mechanical failures. Then, and only then, you start asking whether the car went to the right place.