Lux Goes to Production

Lux started as a codebase search tool — semantic retrieval over source files, built so agents could answer questions about a codebase without dumping the entire thing into context. This week it shipped v1.0 and went inside a Firecracker VM for the first time. It indexed a 40,000-file codebase in 259ms. But the interesting part isn't the speed. It's what happened to Recon's behavior with and without it.

The Number

Without codebase search, Recon's autonomous question resolution rate is 0%. Every pushback question the analyst generates about a ticket — "what validation exists on this input?", "does this endpoint already handle the edge case?", "what's the current retry behavior?" — goes to a human. The agent generates good questions. It can't answer any of them.

With Lux available, that rate goes to 70-77%.

I ran four validation tickets through Recon's specialized pipelines after the archetype work landed. The results:

- A feature_modification ticket generated 12 pushback questions. The specialist resolved 9 from the codebase; 3 went to a human — genuine ambiguities that required product decisions, not code knowledge.
- An 8-question bug_traced ticket: 8 for 8, zero human escalation. The trace pointed at the problem, and Lux provided the surrounding context to verify the analysis.
- An 8-question bug_repro ticket: same result, 8 for 8. The specialist reconstructed the conditions from code patterns Lux surfaced.
- A spike generated no pushback questions at all — its output was a feasibility synthesis, not a question set.

Across the three question-generating runs: 28 questions, 25 resolved from the codebase, 3 escalated to humans. The 3 that escalated were legitimately unanswerable from code — they required knowing product intent, not implementation state.

From Questionnaire Generator to Analyst

The qualitative difference is more striking than the numbers. Without Lux, Recon's pushback report reads like a questionnaire. It identifies the right ambiguities — it's good at figuring out what's unclear or risky about a ticket — but every identified ambiguity becomes a question for a human. The report is useful the way a good checklist is useful: it surfaces what needs attention. But it doesn't resolve anything.

With Lux, the same ambiguities get investigated. The specialist identifies "this ticket says 'modify the retry behavior' but doesn't specify what the current behavior is" — and then looks it up. It finds the retry configuration, the backoff strategy, the error types that trigger retries, the places where retry logic is duplicated or inconsistent. The pushback question transforms from "what's the current retry behavior?" (which a human would have to go read the code to answer) to "the current retry behavior uses exponential backoff with a max of 3 attempts, but the payment and notification services implement different backoff curves — does this ticket intend to unify them?"

That second question is a fundamentally different kind of question. The first asks the human to do research. The second presents research and asks the human to make a decision. The agent went from being a question generator to being an analyst — someone who does the legwork and surfaces the decision points.

RAG as Load-Bearing Infrastructure

There's a framing in the broader AI/ML space that treats RAG as an enhancement. A nice-to-have. You have an agent, it works okay, you add retrieval and it works a bit better — more grounded, fewer hallucinations, maybe more relevant responses. The retrieval is supplementary. The agent is the thing; retrieval is the seasoning.

What I observed is categorically different. Lux isn't making Recon slightly better. It's enabling a mode of operation that doesn't exist without it. The 0% to 70% jump isn't a quality improvement. It's a capability threshold. Below it, you have a system that identifies problems and asks humans to investigate them. Above it, you have a system that investigates problems and asks humans to make decisions about them. These are different systems that happen to share a codebase.

This reframes how I think about the retrieval layer in agent architectures. If RAG is supplementary, you can ship without it and add it later. If RAG is load-bearing, you can't. The agent without retrieval isn't a slightly worse version of the agent with retrieval. It's a different agent with different capabilities operating in a different mode.

What 259ms Buys You

The indexing speed matters, but not for the obvious reason. 259ms to index 40,000 files means Lux can re-index on every run. The codebase index is always fresh — no stale embeddings, no incremental update logic, no "the index is from last Tuesday" problems. Every time Recon evaluates a ticket, Lux indexes the current state of the codebase. The agent's knowledge of the code is as current as the code itself.

This is a different architecture from a persistent vector store that gets updated periodically. Persistent stores trade freshness for speed — you pay the indexing cost once and amortize it across queries. That works when the corpus is stable. Codebases aren't stable. A PR merged an hour ago might be exactly the context the agent needs to evaluate the current ticket. With 259ms indexing, the cost of freshness is negligible.
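The re-index-per-run pattern is simple to express. A minimal sketch, with a hypothetical `build_index` standing in for Lux's indexer (the real one builds embeddings for semantic retrieval; a path-to-contents dict is enough to show the freshness property):

```python
import tempfile
from pathlib import Path

def build_index(root: Path) -> dict[str, str]:
    """Hypothetical stand-in for Lux's indexer: path -> file contents."""
    return {str(p): p.read_text() for p in sorted(root.rglob("*.py"))}

def run_evaluation(checkout: Path, query: str) -> list[str]:
    # Re-index on every run: the agent's view of the codebase is always
    # the current checkout, never a snapshot from last Tuesday.
    index = build_index(checkout)
    return [path for path, text in index.items() if query in text]

# Demo: a file added between runs appears in the very next index.
checkout = Path(tempfile.mkdtemp())
(checkout / "payments.py").write_text("RETRY_ATTEMPTS = 3\n")
first = run_evaluation(checkout, "RETRY")
(checkout / "notify.py").write_text("RETRY_BACKOFF = 'linear'\n")
second = run_evaluation(checkout, "RETRY")
```

There is no update path to maintain and no invalidation logic to get wrong; the index's lifetime is the run's lifetime.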

Running inside Firecracker VMs matters for the same architectural reason. Each Recon run gets its own VM, its own Lux instance, its own index built from the current checkout. There's no shared state between runs, no index corruption, no concurrency issues. The isolation is total. This means I can run multiple Recon evaluations in parallel against the same repo without coordination — each one indexes independently and operates on its own snapshot of the codebase.

The Resolution Quality Gap

Not all resolved questions are resolved equally. I'm watching three quality dimensions.

First, accuracy. Are the specialist's answers actually correct? So far, yes — the codebase is the ground truth, and Lux is retrieving from it, not hallucinating about it. But I haven't yet tested against a very large codebase where relevant code might be scattered across dozens of files. There's a retrieval quality ceiling I haven't found yet.

Second, completeness. When the specialist says "the current retry behavior is X," is it accounting for all the places retry behavior is implemented? The feature_modification ticket surfaced an interesting case: the specialist found the main retry configuration but also discovered that two services had divergent implementations. That's the system working well — Lux retrieved enough context for the specialist to notice the inconsistency. But it raises the question of whether there's a third service with yet another implementation that didn't surface in the retrieval results.

Third, the boundary between code-answerable and human-required questions. The 3 questions that escalated to humans in the validation were clean escalations — genuinely required product decisions. But there's a murky middle ground: questions where the code technically contains the answer but the specialist doesn't find it, or questions where the answer requires synthesizing information from code and product context. I'm watching for false escalations (questions that could have been resolved but weren't) and false resolutions (questions the specialist answered from code but answered incorrectly or incompletely).
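The four outcomes described above form a small 2x2 worth naming explicitly. A sketch of the bookkeeping (illustrative only; Recon doesn't necessarily track outcomes this way):

```python
from enum import Enum

class Outcome(Enum):
    TRUE_RESOLUTION = "answered from code, correctly"
    FALSE_RESOLUTION = "answered from code, but wrong or incomplete"
    TRUE_ESCALATION = "escalated, and a human was genuinely required"
    FALSE_ESCALATION = "escalated, but the code contained the answer"

def classify(resolved_from_code: bool, answer_correct: bool,
             human_required: bool) -> Outcome:
    """Bucket a question's outcome after the fact.

    `answer_correct` only matters for resolved questions;
    `human_required` only matters for escalated ones.
    """
    if resolved_from_code:
        return Outcome.TRUE_RESOLUTION if answer_correct else Outcome.FALSE_RESOLUTION
    return Outcome.TRUE_ESCALATION if human_required else Outcome.FALSE_ESCALATION
```

The two failure cells are the ones worth auditing: FALSE_ESCALATION wastes human attention, and FALSE_RESOLUTION is worse, because a confidently wrong answer propagates downstream unchecked.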

What This Changes About the Architecture

The original Recon design treated codebase search as one of several context-gathering tools. The agent could search code, read documentation, check ticket history — retrieval was part of a toolkit. What the validation data suggests is that codebase search isn't one tool among many. It's the foundation that makes the other tools useful.

Without Lux, the specialist can't verify its own analysis against the codebase. It can hypothesize about what the code does, but it can't check. Every hypothesis becomes a question for a human. With Lux, hypotheses become verifiable claims. The specialist thinks "the validation probably happens in the controller" and then looks. Either it finds validation logic or it doesn't. Either way, the investigation moves forward without human intervention.

This changes how I think about the specialist pipelines. The pipeline isn't: classify → investigate → generate questions → resolve what you can → escalate the rest. It's: classify → investigate with continuous codebase access → synthesize findings → escalate only genuine decision points. Lux isn't a step in the pipeline. It's the substrate the pipeline runs on.
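The investigate-synthesize-escalate portion of that pipeline can be sketched directly. Names here are illustrative, not Recon's actual API, and `search` again stands in for Lux:

```python
from typing import Callable

Search = Callable[[str], list[str]]

def run_pipeline(questions: list[str], search: Search) -> dict[str, list]:
    """Investigate with continuous codebase access, synthesize findings,
    and escalate only genuine decision points.

    `search` is not one step in the loop; every question passes through
    it before a human ever sees anything.
    """
    findings, escalations = [], []
    for question in questions:
        context = search(question)        # the substrate: always available
        if context:
            findings.append({"question": question, "evidence": context})
        else:
            escalations.append(question)  # only genuine decision points remain
    return {"findings": findings, "escalations": escalations}

def toy_search(query: str) -> list[str]:
    # Toy retrieval: only validation-related queries hit anything.
    return ["Validator in controllers/input.py"] if "validation" in query else []

report = run_pipeline(
    ["what validation exists on this input?",
     "should the legacy endpoint be deprecated?"],
    toy_search,
)
```

The structural difference from the old pipeline is where retrieval sits: inside the loop body rather than as a discrete stage whose output gets handed forward.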

I built Lux to make agents smarter about code. What it actually did was make the difference between an agent that asks questions and an agent that answers them.

Next

Archetype Is a Verb, Not a Noun