Archetype Is a Verb, Not a Noun
I surveyed hundreds of real issues from several production systems. The goal was classification: figure out what kinds of work the evaluation system would encounter so I could build specialized investigation pipelines instead of running every ticket through the same universal chain.
I started with 9 clusters. After two rounds of consolidation, I had 5. The interesting part wasn't the number. It was what collapsed and what didn't — and why.
The 9 That Became 5
The initial clusters were descriptive. Things like "bug with stack trace," "bug without stack trace," "UI change request," "new feature with existing patterns," "new feature without existing patterns," "performance issue," "refactor," "dependency upgrade," "research question." They made intuitive sense as categories. They also turned out to be almost useless for building investigation pipelines.
The problem: "bug with stack trace" and "bug without stack trace" look different on the surface, but the distinction that actually matters isn't whether a trace exists. It's whether the investigation has a thread to pull. A stack trace is one kind of thread. An error message is another. A failing test, a log line, a specific user report with reproduction steps — these are all threads. The presence of a thread changes the shape of the investigation entirely. With a thread, you follow it. Without one, you have to create one.
That distinction, follow vs. create, turned out to be the real boundary. The two surface categories were redrawn as two archetypes defined by investigation strategy: one for cases where there's evidence to follow, another for cases where you need to reconstruct the conditions first.
Similar collapses happened across the rest of the categories. "UI change request," "new feature with existing patterns," and "refactor" all collapsed into one archetype, because in each case the investigation needs to understand what's already there before proposing changes. "New feature without existing patterns" separated out because the investigation needs to survey the surrounding architecture for integration points, not map existing behavior. "Research question" and "performance issue" collapsed together, because the investigation needs to evaluate hypotheses, not map terrain or follow traces.
Nine surface categories. Five investigation strategies.
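A rough sketch of those collapses in Python, under a few assumptions: the archetype names are the ones used later in this piece, `has_thread` is a hypothetical flag standing in for "is there any evidence to follow," and `collapse` is an illustrative helper, not the actual classifier. "Dependency upgrade" has no entry because the text above doesn't say where it landed.

```python
from enum import Enum
from typing import Optional


class Archetype(Enum):
    TRACED_BUG = "traced-bug"
    REPRODUCTION_BUG = "reproduction-bug"
    MODIFICATION = "modification"
    CREATION = "creation"
    SPIKE = "spike"


BUG_CATEGORIES = {"bug with stack trace", "bug without stack trace"}

# Collapses stated in the text for the non-bug categories.
# "dependency upgrade" is deliberately absent: its archetype is not stated.
COLLAPSES = {
    "UI change request": Archetype.MODIFICATION,
    "new feature with existing patterns": Archetype.MODIFICATION,
    "refactor": Archetype.MODIFICATION,
    "new feature without existing patterns": Archetype.CREATION,
    "research question": Archetype.SPIKE,
    "performance issue": Archetype.SPIKE,
}


def collapse(category: str, has_thread: bool) -> Optional[Archetype]:
    """Map a surface category to an archetype; None if no mapping is stated."""
    if category in BUG_CATEGORIES:
        # The boundary is follow vs. create, not trace vs. no trace.
        return Archetype.TRACED_BUG if has_thread else Archetype.REPRODUCTION_BUG
    return COLLAPSES.get(category)
```

Note that `collapse("bug without stack trace", has_thread=True)` comes out as a traced bug: an error message or a failing test is a thread even when no stack trace exists.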
Categories Describe. Archetypes Prescribe.
What I was actually doing, though I didn't have this language for it at first, was shifting from descriptive classification to prescriptive classification. A category tells you what the issue looks like. An archetype tells you what the investigation needs to do.
This distinction sounds subtle, but it changes the entire downstream architecture. When you classify descriptively, you get labels. Labels can inform a pipeline, but they don't select one: you still need a mapping layer that translates "bug with stack trace" into a set of investigation actions. When you classify prescriptively, the archetype is the selection. Selecting the "traced bug" archetype doesn't just mean "there's a trace." It means: follow the evidence chain backward from the error to the root cause, check for related failures in adjacent modules, verify the fix surface is bounded.
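One way to picture the difference, as a minimal sketch: a descriptive label is just a string, while a prescriptive archetype carries its own steps. The three steps are taken from the paragraph above; `InvestigationPlan` and `worklist` are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InvestigationPlan:
    archetype: str
    steps: Tuple[str, ...]


# Descriptive: a label that still needs a separate mapping layer
# before anything can run.
descriptive_label = "bug with stack trace"

# Prescriptive: the classifier's output already is the plan.
TRACED_BUG = InvestigationPlan(
    archetype="traced-bug",
    steps=(
        "follow the evidence chain backward from the error to the root cause",
        "check for related failures in adjacent modules",
        "verify the fix surface is bounded",
    ),
)


def worklist(plan: InvestigationPlan) -> List[str]:
    """Return the plan's steps as an ordered, numbered worklist."""
    return [f"{i}. {step}" for i, step in enumerate(plan.steps, start=1)]
```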
The archetype selects the cognitive mode.
Cognitive Modes
This is where it gets interesting for agent systems generally. Each archetype implies a fundamentally different kind of thinking.
The traced-bug archetype is detective work. You have evidence. You follow it. The reasoning pattern is deductive — this trace says the error occurred here, which means this function received unexpected input, which means the caller passed the wrong thing, which means the upstream validation is missing or wrong. The investigation is a chain of inferences anchored in concrete evidence.
The reproduction-bug archetype is forensic reconstruction. You don't have evidence yet — you have a report. The reasoning pattern starts abductive: given the described behavior, what conditions could produce it? Then it shifts to experimental: can we construct those conditions and verify they produce the same result? The investigation is hypothesis-driven, not evidence-driven.
The modification archetype is surveying. Before you can change existing behavior, you need to understand the current terrain. What does this feature do today? What depends on it? Where are the boundaries? The reasoning pattern is cartographic — you're building a map, not solving a puzzle.
The creation archetype is different from modification in a way that matters. You're not mapping existing terrain, you're identifying integration surfaces in adjacent terrain. The reasoning pattern is architectural — where does this new thing connect, what interfaces does it need, what existing patterns should it follow or deliberately break?
The spike archetype is scientific inquiry. You have a question, not a task. The investigation needs to evaluate feasibility, compare approaches, assess tradeoffs. The reasoning pattern is analytical — gather evidence, weigh it, synthesize a recommendation.
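The five modes can be summarized as a small lookup table. This is only a restatement of the descriptions above in Python; the field names `reasoning` and `starts_from` are my own shorthand, not part of any real system.

```python
from typing import Dict, NamedTuple


class CognitiveMode(NamedTuple):
    reasoning: str    # the reasoning pattern named in the text
    starts_from: str  # what the investigation has in hand at the start


MODES: Dict[str, CognitiveMode] = {
    "traced-bug": CognitiveMode("deductive", "concrete evidence"),
    "reproduction-bug": CognitiveMode("abductive, then experimental", "a report"),
    "modification": CognitiveMode("cartographic", "the current terrain"),
    "creation": CognitiveMode("architectural", "adjacent terrain"),
    "spike": CognitiveMode("analytical", "a question, not a task"),
}


def describe(archetype: str) -> str:
    """One-line summary of the mode an archetype selects."""
    mode = MODES[archetype]
    return f"{archetype}: {mode.reasoning} reasoning, starting from {mode.starts_from}"
```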
Why the Universal Pipeline Hit a Ceiling
Before archetypes, I used a single 6-stage chain for every ticket. Intake, context gathering, analysis, question generation, question resolution, synthesis. Every ticket got the same treatment regardless of type. It worked. For a while.
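Structurally, that universal chain looks something like the sketch below. The stage names come from the list above; the stage bodies are placeholders that only record that they ran, and `make_stage` and `investigate` are hypothetical names. The point is the shape: one fixed sequence, applied to every ticket.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

STAGE_NAMES = [
    "intake",
    "context gathering",
    "analysis",
    "question generation",
    "question resolution",
    "synthesis",
]


def make_stage(name: str) -> Stage:
    """Placeholder stage: appends its name to a trail instead of doing real work."""
    def stage(state: Dict) -> Dict:
        return {**state, "trail": state.get("trail", []) + [name]}
    return stage


UNIVERSAL_CHAIN: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def investigate(ticket: Dict) -> Dict:
    """Run every ticket through the same six stages, in the same order."""
    state = dict(ticket)
    for stage in UNIVERSAL_CHAIN:
        state = stage(state)
    return state
```

Nothing in `investigate` looks at what kind of ticket it was handed, so a bug with a trace and an open research question leave identical trails.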
The ceiling showed up as a quality problem, not a capability problem. The universal pipeline could investigate anything, but it investigated everything the same way. A traced-bug ticket would go through the full context-gathering stage even when the trace already pointed directly at the problem. A modification ticket would try to "solve" the issue when what it actually needed was a thorough map of existing behavior. A spike would generate implementation-oriented pushback questions when what the ticket needed was a feasibility assessment.
The analogy that kept coming back: it was like sending a detective to do a surveyor's job. The detective arrives at a modification ticket and starts asking "who did it?" — looking for a root cause, a culprit, a failure. But there is no failure. The right question is "what's already here?" The detective's entire cognitive orientation is wrong for the problem.
This isn't a prompting issue. You can't fix it by adding more instructions to the universal pipeline. The problem is structural: a single pipeline encodes a single mode of thinking. Adding branching logic within the pipeline just creates a worse version of what archetypes give you cleanly — different cognitive architectures for different problem shapes.
Specialization as Cognitive Selection
The word "specialization" in agent systems usually means micro-optimized prompts or fine-tuned models for narrow domains. That framing misses the deeper point.
What I found is that specialization is about matching the shape of the thinking to the shape of the problem. The traced-bug specialist doesn't just have better prompts for debugging — it has a different investigation structure. It starts from the evidence and works backward. The modification specialist starts from the codebase and works outward. These aren't prompt variations on a single architecture. They're different architectures.
When the system classifies a ticket into an archetype, it's not labeling an input. It's selecting an investigator. The archetype determines which specialist runs, and the specialist embodies a cognitive mode — a way of thinking about the problem, a set of reasoning patterns, a structure for the investigation.
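A minimal sketch of "selecting an investigator," assuming each specialist is a plain function and the archetype is a string key. Only two specialists are shown, and the `evidence` and `area` ticket fields are hypothetical; the real specialists are whole investigation structures, not one-liners.

```python
from typing import Callable, Dict

Ticket = Dict[str, str]
Investigator = Callable[[Ticket], str]


def traced_bug_investigator(ticket: Ticket) -> str:
    # Starts from the evidence and works backward.
    return f"start at evidence: {ticket['evidence']}; work backward to the cause"


def modification_investigator(ticket: Ticket) -> str:
    # Starts from the codebase and works outward.
    return f"start at code area: {ticket['area']}; map outward to dependents"


INVESTIGATORS: Dict[str, Investigator] = {
    "traced-bug": traced_bug_investigator,
    "modification": modification_investigator,
}


def select_investigator(archetype: str) -> Investigator:
    """The archetype picks which specialist runs; it is not just a tag."""
    try:
        return INVESTIGATORS[archetype]
    except KeyError:
        raise ValueError(f"no investigator registered for {archetype!r}") from None
```

The two functions differ in where they start and which direction they move, not in how the same steps are worded.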
Classification in traditional ML is about sorting inputs into bins. Classification in agent systems is about selecting the cognitive mode the system operates in. The archetype is a verb — it tells the system what to do, not what the thing is.
The Classification Itself
One thing I'm still watching: the classifier is an LLM call. It reads the ticket title, description, and any attached context, and it selects an archetype. It's fast (a single lightweight model call with structured output), and so far it's been accurate across the validation set. But I haven't yet hit the genuinely ambiguous cases, such as a ticket that could plausibly be either a reproduction-bug or a modification because the behavior might be a bug or might be intended behavior that needs changing.
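For concreteness, here is roughly what a classifier of that shape could look like. This is a sketch under assumptions, not the actual implementation: the model is injected as a plain `call_model` function so no particular LLM API is implied, the prompt wording and JSON shape are invented for illustration, and `fake_model` is a stub so the example runs offline.

```python
import json
from typing import Callable

ARCHETYPES = ("traced-bug", "reproduction-bug", "modification", "creation", "spike")


def build_prompt(title: str, description: str, context: str = "") -> str:
    """Assemble the classification prompt (wording is illustrative only)."""
    options = ", ".join(ARCHETYPES)
    return (
        "Pick the archetype for this ticket.\n"
        f"Allowed values: {options}\n"
        'Answer as JSON: {"archetype": "<value>"}\n\n'
        f"Title: {title}\nDescription: {description}\nContext: {context}"
    )


def classify(
    title: str,
    description: str,
    call_model: Callable[[str], str],
    context: str = "",
) -> str:
    """One model call; reject anything that is not a known archetype."""
    raw = call_model(build_prompt(title, description, context))
    data = json.loads(raw)
    archetype = data.get("archetype") if isinstance(data, dict) else None
    if archetype not in ARCHETYPES:
        raise ValueError(f"model returned unknown archetype: {archetype!r}")
    return archetype


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; always answers "traced-bug".
    return json.dumps({"archetype": "traced-bug"})
```

Validating the output against the closed set only catches malformed answers. It does nothing for the harder case, where the model returns a valid archetype that is the wrong one.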
The failure mode matters here because misclassification doesn't just produce a wrong label — it selects the wrong cognitive mode. A modification ticket classified as a traced-bug gets an investigator that's looking for a root cause in a situation where there is no root cause, just behavior that needs to change. The investigation doesn't fail gracefully — it fails structurally. The detective doesn't just get the wrong answer; it asks the wrong questions entirely.
I'm also watching the archetype boundaries themselves. Five feels right for the current problem space, but I've only validated against a limited set of production systems. A different domain — infrastructure operations, data pipeline work, security issues — might reveal problem shapes that don't map to any of the five. The question isn't whether five is the right number. The question is whether the principle — classify by investigation strategy, not by surface appearance — holds as the problem space expands.
What I've seen so far suggests it does. The collapses that got me from 9 to 5 were all driven by investigation structure, and the resulting archetypes have held up across every ticket I've run through them. The archetype isn't describing the issue. It's describing what needs to happen next.