Routing as Attention in a Hierarchical Agent System

This week I added streaming to Lux's expert routing layer. A small change - just threading an `onChunk` callback through the call chain - but tracing the full query path from outer agent through routing to expert execution forced a closer read than I'd done in a while. What came back was a recognition: the routing architecture has quietly become a mixture-of-experts system. The interesting part isn't the resemblance. It's where the pattern diverges.

The Mapping

Mixture-of-experts architectures use a gating function to route inputs to specialized sub-networks. Lux does this, but with two meaningful substitutions.

The gate is an LLM. Haiku reads natural language descriptions of each expert's capabilities - pulled from their CLAUDE.md preambles - and selects the best match for the incoming query. This is a semantic gate. It routes on meaning rather than on learned weights, which makes it interpretable in a way that traditional softmax gates aren't. You can read the gate's reasoning. You can see why it chose what it chose.
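In miniature, the gate's job looks something like the sketch below: format the experts' preambles into a routing prompt, then read the expert's name back out of the gate's free-text reply. All names here (`Expert`, `build_gate_prompt`, `parse_gate_choice`) are illustrative, not Lux's actual API, and the LLM call itself is elided.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    description: str  # drawn from the expert's CLAUDE.md preamble

def build_gate_prompt(experts, query):
    """Format the routing prompt the gating LLM (e.g. Haiku) would see."""
    listing = "\n".join(f"- {e.name}: {e.description}" for e in experts)
    return (
        "You are a router. Pick the single best expert for the query.\n"
        f"Experts:\n{listing}\n"
        f"Query: {query}\n"
        "Answer with the expert name and one sentence of reasoning."
    )

def parse_gate_choice(reply, experts):
    """Interpret the gate's free-text reply; its reasoning stays readable."""
    for e in experts:
        if e.name.lower() in reply.lower():
            return e
    return None  # no recognizable choice: hand off to the fallback chain

experts = [
    Expert("sqlite-expert", "Schema design, FTS5, query planning"),
    Expert("infra-expert", "Deployment, process supervision, logging"),
]
prompt = build_gate_prompt(experts, "Why is my FTS5 query slow?")
```

The interpretability claim falls out of the shape: both the prompt and the reply are plain text you can log and read.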

FTS5 full-text scoring runs as a hard pre-filter ahead of the semantic gate. It narrows the candidate pool by document relevance, eliminating clearly irrelevant experts before the gate makes its final selection. This is analogous to top-k pre-filtering in sparse MoE - you don't evaluate every expert, you reduce the field first.
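The pre-filter stage can be sketched directly against SQLite. The table and column names below are illustrative assumptions (Lux's actual schema isn't shown here), and the snippet requires an SQLite build with FTS5 enabled, which is true of most CPython distributions.

```python
import sqlite3

# Hypothetical index mapping each document to its owning expert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(expert, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("sqlite-expert", "fts5 match queries and bm25 ranking"),
        ("infra-expert", "systemd units and log rotation"),
    ],
)

def candidate_experts(query, k=3):
    """Rank matching docs by bm25 (lower is better) and return up to k
    distinct expert names; only these survivors reach the semantic gate."""
    rows = conn.execute(
        "SELECT expert FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
        (query,),
    ).fetchall()
    out = []
    for (expert,) in rows:
        if expert not in out:
            out.append(expert)
        if len(out) == k:
            break
    return out
```

An expert with no matching documents never appears in the result at all, which is what makes this a hard filter rather than a soft re-ranking.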

There's also a degradation chain: if the semantic gate fails, FTS5 takes over entirely. If FTS5 finds nothing, the first active expert gets the query. The system never just shrugs.
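The chain itself is small enough to write out. This is a sketch of the control flow only - `route_semantic` and `route_fts` are assumed hooks standing in for the gate and FTS5 stages, and the expert records are illustrative.

```python
def route(query, experts, route_semantic, route_fts):
    """Degradation chain: semantic gate, then FTS5, then first active expert."""
    chosen = None
    try:
        chosen = route_semantic(query, experts)  # LLM gate may fail or time out
    except Exception:
        pass
    if chosen is None:
        chosen = route_fts(query, experts)       # lexical fallback
    if chosen is None:
        active = [e for e in experts if e.get("active")]
        chosen = active[0] if active else None   # last resort: first active expert
    return chosen
```

The invariant is that every branch terminates in an answerer as long as any expert is active - the "never just shrugs" property.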

Ontologically Distinct Experts

Here's where Lux departs from the standard formulation. The experts aren't weight partitions of a shared architecture. They're full agent instances, each with its own system prompt, working directory, conversation history, and grounded file access.

In the ontological framework I've been developing, agents are defined not by what they do but by what they are - their perceptual boundaries, reasoning mode, capability scope. Lux's experts are a clean instantiation of this. Their CLAUDE.md preambles function as ontological definitions, and the routing system treats them as such. Each expert has bounded epistemology by construction: it knows its domain and is architecturally prevented from seeing beyond it.

This matters because the routing layer isn't just load-balancing across interchangeable processors. It's selecting between genuinely different cognitive contexts. The choice of expert changes not just who answers but what world of knowledge the answer comes from.
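Structurally, "ontologically distinct" cashes out to each expert being a full record of its own context, not a slice of shared weights. A minimal sketch, with illustrative field names (Lux's actual expert representation isn't shown here):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class ExpertAgent:
    name: str
    system_prompt: str   # its CLAUDE.md preamble: the ontological definition
    working_dir: Path    # grounded file access, bounded by construction
    history: list = field(default_factory=list)  # per-expert conversation memory

    def can_see(self, path: Path) -> bool:
        """Bounded epistemology as a predicate: the agent is architecturally
        prevented from reading outside its own working directory."""
        try:
            path.resolve().relative_to(self.working_dir.resolve())
            return True
        except ValueError:
            return False
```

Choosing between two such instances swaps the system prompt, the visible filesystem, and the accumulated history all at once - the "world of knowledge" the answer comes from.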

Where Retrieval and Routing Collapse

The part that really caught my attention: the FTS5 stage doesn't just filter experts. It simultaneously retrieves the documents that get injected into the chosen expert's augmented query. The router picks who answers and gathers the evidence they'll need - in one pass.

This is Attention-Based Retrieval in practice. The routing query acts as an attention signal, transforming a broad knowledge base into a curated, high-relevance context package. What would normally be two separate systems - a retrieval pipeline and a routing mechanism - fuse into a single operation.

The practical consequence is that the expert operates with bounded context without sacrificing quality. The routing layer has already determined what's relevant. The expert doesn't need to know everything - it needs the right things for this specific query. Bounded epistemology enforced at the infrastructure level, not just the prompt level.
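The fusion can be shown in miniature: one scoring pass picks the expert and the documents injected into its augmented query. Plain term-overlap scoring stands in for FTS5/bm25 here, and all names are illustrative.

```python
def route_and_retrieve(index, query, top_docs=3):
    """index: {expert_name: [document strings]}.
    One pass returns both the routing decision and the evidence package."""
    terms = set(query.lower().split())
    best_expert, best_score, evidence = None, 0, []
    for expert, docs in index.items():
        scored = sorted(
            ((len(terms & set(d.lower().split())), d) for d in docs),
            reverse=True,
        )
        total = sum(s for s, _ in scored)
        if total > best_score:  # this expert both wins routing...
            best_expert, best_score = expert, total
            evidence = [d for s, d in scored[:top_docs] if s > 0]  # ...and supplies context
    return best_expert, evidence

def augmented_query(query, evidence):
    """The expert sees the query plus router-selected context, nothing more."""
    context = "\n".join(f"> {d}" for d in evidence)
    return f"{context}\n\n{query}" if evidence else query
```

The same scores that rank the experts rank the documents, which is why the two systems collapse into one operation instead of running back to back.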

The Hierarchical Separation

The outer agent - Opus, via MCP - sees `lux_ask` as a black box. It can suggest an expert but doesn't need to. This creates a clean separation: the outer agent reasons about what needs to happen; Lux reasons about who should handle it and with what context.

This is hierarchical cognition working the way I'd hoped. A high-level process sets direction. A lower-level process handles specialized execution. The levels coordinate through a defined interface without needing to understand each other's internals.
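The boundary is easy to sketch: one opaque entry point, an optional expert hint, and no visibility into routing internals. The signature below is illustrative, not Lux's actual MCP tool schema; `experts` maps names to callables standing in for agent invocations.

```python
def lux_ask(query, experts, suggested_expert=None, router=None):
    """Black box from the caller's side: honor a valid hint if given,
    otherwise route internally; the outer agent only sees the answer."""
    if suggested_expert in experts:
        name = suggested_expert            # the hint is a suggestion, not a command
    elif router is not None:
        name = router(query, experts)      # stand-in for the full routing chain
    else:
        name = next(iter(experts))         # trivial placeholder fallback
    return experts[name](query)
```

Everything above the call reasons about *what* needs to happen; everything below it reasons about *who* answers and with what context.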

What I'm Still Watching

The semantic gate works well enough, but I haven't pushed it against genuinely ambiguous queries - ones where multiple experts are plausible matches. Does routing quality degrade gracefully, or is there a cliff?

The retrieval-routing fusion means the expert receives documents selected by the *router's* criteria, not its own. There may be cases where an expert would surface better evidence if it had the chance to search independently. The efficiency gain is real, but I don't yet know the ceiling.

Session continuity via `--resume` gives experts conversational memory, but it's per-expert, per-session. If one expert discovers something relevant to another's domain, that knowledge stays siloed. I think this is bounded epistemology working as designed - but I'm watching for the moment it becomes a genuine limitation.

And as the expert count grows, does the semantic gate's ability to discriminate degrade? At what point does a natural language description of an expert's capabilities become insufficient for accurate routing?