Proving Agent Quality With Data

The Thesis

Can a system of specialized AI agents running on local models (Gemma 26B, free) match or beat a monolithic cloud agent (Claude Sonnet, paid) for personal task management?

We’re testing this with a series of experiments, each building evidence for the next. Every claim is backed by eval scores, not opinions.

The Scorecard

Domain         Rusty   Specialist (Sonnet)   Best Local   Model       Verdict
Media          2.52    3.53 (+40%)           3.57         Gemma 26B    Local viable
Productivity   3.85    3.85 (+0%)            4.11         Gemma 31B    Local wins (+6.8%)
Cross-domain   -       -                     -            -            Planned
Supervisor     -       -                     -            -            Planned

The Experiments

EXP-001: Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores? Specialization improves media evals by 40%. Gemma 26B matches Sonnet on the focused domain. Also discovered that eval infrastructure quality (harness mocks, scorer bugs) matters more than you think.

EXP-002: Do Mock Evals Predict Real-World Agent Quality? Real APIs score 5.6% higher than mocks, and 93% of evals are stable across 3 runs. Mock evals are a trustworthy, conservative lower bound. The only variance source is model non-determinism, not API instability.
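
The stability check is simple to express. Here is a minimal sketch, assuming a hypothetical run_eval(case, backend) helper that returns a numeric score, of how run-to-run stability and the mock-vs-real gap could be measured:

```python
# Sketch only: `run_eval(case, backend)` and `cases` are assumed stand-ins for
# the project's eval harness, not its real API.
from statistics import mean

N_RUNS = 3

def stability_and_gap(cases, run_eval):
    stable = 0
    mock_scores, real_scores = [], []
    for case in cases:
        mock_runs = [run_eval(case, backend="mock") for _ in range(N_RUNS)]
        real_runs = [run_eval(case, backend="real") for _ in range(N_RUNS)]
        # A case counts as stable if every repeat produced the same score.
        if len(set(mock_runs)) == 1 and len(set(real_runs)) == 1:
            stable += 1
        mock_scores.append(mean(mock_runs))
        real_scores.append(mean(real_runs))
    gap = mean(real_scores) - mean(mock_scores)  # positive gap = real APIs score higher
    return stable / len(cases), gap
```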

EXP-003: Does Agent Specialization Replicate for Productivity Tasks? After fixing another scorer bug (stale gold files), the story changed: specialization provides zero benefit on Sonnet for productivity. But Qwen 3.5 is only 5.7% behind Sonnet — close enough that scaffolding techniques might close the gap.

EXP-004: Cross-Domain Routing What happens when a request spans two agents? (“Find that movie and add a watch party to the calendar.”) Tests routing architectures for multi-agent coordination.
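
One routing design the experiment could test, sketched with hypothetical helper and agent names: a classifier splits the request into per-domain sub-tasks, and each sub-task is dispatched to the matching specialist.

```python
# Illustrative sketch of fan-out routing; agent names and the toy classifier are
# hypothetical, not the project's actual architecture.
def route(request, classify, agents):
    """classify(request) -> {domain: sub_task}; agents maps domain -> callable."""
    subtasks = classify(request)
    return {domain: agents[domain](task) for domain, task in subtasks.items()}

agents = {
    "media": lambda task: f"media agent: {task}",
    "productivity": lambda task: f"productivity agent: {task}",
}

def toy_classifier(request):
    # A real router would use a model here; this is a fixed plan for illustration.
    return {
        "media": "find the movie",
        "productivity": "add a watch party to the calendar",
    }

print(route("Find that movie and add a watch party to the calendar.",
            toy_classifier, agents))
```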

EXP-005: Failure Modes & Recovery How do agents handle errors, ambiguity, and wrong information? Production quality means graceful degradation.

EXP-007: Rusty on Local Models (Capstone) Can the supervisor agent run on a local model once specialized agents handle the domains? The final test of a fully local stack.

What We’ve Learned So Far

  1. Specialization works for narrow domains, not broad ones. +40% for media (46→7 tools, an ~85% cut), +0% for productivity (46→20 tools, only an ~57% cut). The threshold appears to be around 70-80% tool reduction. Below that, Sonnet doesn’t need the help.

  2. Local models can beat cloud. Gemma 26B matches Sonnet on media. Gemma 31B beats Sonnet on productivity by 6.8%. The 26B→31B jump (+15%) mattered more than any prompt engineering. Model scale > scaffolding tricks.

  3. Your eval infrastructure matters as much as your agents. We’ve found scorer bugs in two separate experiments that changed our conclusions by 30-40%. Validate your evaluator every time you change your evals, not just when you set them up (a sketch of this check follows the list).

  4. Mock evals are a trustworthy lower bound. Validated against real APIs — real-world scores are 5.6% higher than mocks. Safe to develop against mocks, validate against production periodically.
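
A minimal sketch of the evaluator check from point 3, assuming a hypothetical score_transcript() function and a directory of hand-labeled fixture transcripts:

```python
# Sketch only: keep a handful of known-pass / known-fail transcripts as fixtures
# and replay them through the scorer whenever the evals change.
# score_transcript() and the fixture layout are assumptions, not the real API.
import json
from pathlib import Path

def validate_scorer(score_transcript, fixtures_dir="eval_fixtures"):
    for path in sorted(Path(fixtures_dir).glob("*.json")):
        fixture = json.loads(path.read_text())
        got = score_transcript(fixture["transcript"])
        want = fixture["expected_pass"]  # labeled by the domain expert
        assert got == want, f"{path.name}: scorer returned {got}, expected {want}"
```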

Methodology

All experiments follow practitioner best practices from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org:

  • Binary pass/fail scoring on specific dimensions (not vague 1-5 “quality”; see the judge sketch after this list)
  • Domain expert calibration (one person’s judgment drives the system)
  • Data-first: look at outputs before defining criteria
  • Production flywheel: traces → human judgment → eval cases → automation
  • Different model families for generation vs judging (avoids self-enhancement bias)
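
To make the first and last bullets concrete, here is a minimal sketch of a binary, per-dimension judge. The dimension names, prompt format, and call_judge() helper are assumptions for illustration, not the project’s actual scorer; the only requirement is that call_judge() is served by a different model family than the agent under test.

```python
# Sketch of binary, per-dimension judging. Dimensions and call_judge() are
# illustrative assumptions.
DIMENSIONS = [
    "used_the_correct_tool",
    "answer_grounded_in_tool_output",
    "no_unrequested_side_effects",
]

def judge(transcript, call_judge):
    """call_judge(prompt) should hit a judge from a different model family than
    the generator and return the string 'PASS' or 'FAIL'."""
    results = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Dimension: {dim}\n"
            f"Transcript:\n{transcript}\n"
            "Answer PASS or FAIL only."
        )
        results[dim] = call_judge(prompt).strip().upper() == "PASS"
    return results  # each dimension is a binary pass/fail, not a 1-5 score
```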

Full reference: Production Eval Reference