Proving Agent Quality With Data

The Thesis

Can a system of specialized AI agents running on local models (Gemma 26B, free) match or beat a monolithic cloud agent (Claude Sonnet, paid) for personal task management?

We’re testing this with a series of experiments, each building evidence for the next. Every claim is backed by eval scores, not opinions.

The Scorecard

Domain         Rusty   Specialist (Sonnet)   Best Local   Model       Verdict
Media          2.52    3.53 (+40%)           3.57         Gemma 26B    Local viable
Productivity   3.85    3.85 (+0%)            4.11         Gemma 31B    Local wins (+6.8%)
Cross-domain   -       -                     -            -            Planned
Supervisor     -       -                     -            -            Planned

The Experiments

EXP-001: Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores? Specialization improves media evals by 40%. Gemma 26B matches Sonnet on the focused domain. Also discovered that eval infrastructure quality (harness mocks, scorer bugs) matters more than you think.

EXP-002: Do Mock Evals Predict Real-World Agent Quality? Real APIs score 5.6% higher than mocks, and 93% of evals are stable across 3 runs. Mock evals are a trustworthy, conservative lower bound. The only variance source is model non-determinism, not API instability.
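
The stability check is simple to express. Here is a minimal sketch, assuming a hypothetical run_eval(case, backend) helper that returns a numeric score, of how run-to-run stability and the mock-vs-real gap could be measured:

```python
# Sketch only: `run_eval(case, backend)` and `cases` are assumed stand-ins for
# the project's eval harness, not its real API.
from statistics import mean

N_RUNS = 3

def stability_and_gap(cases, run_eval):
    stable = 0
    mock_scores, real_scores = [], []
    for case in cases:
        mock_runs = [run_eval(case, backend="mock") for _ in range(N_RUNS)]
        real_runs = [run_eval(case, backend="real") for _ in range(N_RUNS)]
        # A case counts as stable if every repeat produced the same score.
        if len(set(mock_runs)) == 1 and len(set(real_runs)) == 1:
            stable += 1
        mock_scores.append(mean(mock_runs))
        real_scores.append(mean(real_runs))
    gap = mean(real_scores) - mean(mock_scores)  # positive gap = real APIs score higher
    return stable / len(cases), gap
```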

EXP-003: Does Agent Specialization Replicate for Productivity Tasks? After fixing another scorer bug (stale gold files), the story changed: specialization provides zero benefit on Sonnet for productivity. But Qwen 3.5 is only 5.7% behind Sonnet — close enough that scaffolding techniques might close the gap.

EXP-004: Cross-Domain Routing What happens when a request spans two agents? (“Find that movie and add a watch party to the calendar.”) Tests routing architectures for multi-agent coordination.
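
One routing design the experiment could test, sketched with hypothetical helper and agent names: a classifier splits the request into per-domain sub-tasks, and each sub-task is dispatched to the matching specialist.

```python
# Illustrative sketch of fan-out routing; agent names and the toy classifier are
# hypothetical, not the project's actual architecture.
def route(request, classify, agents):
    """classify(request) -> {domain: sub_task}; agents maps domain -> callable."""
    subtasks = classify(request)
    return {domain: agents[domain](task) for domain, task in subtasks.items()}

agents = {
    "media": lambda task: f"media agent: {task}",
    "productivity": lambda task: f"productivity agent: {task}",
}

def toy_classifier(request):
    # A real router would use a model here; this is a fixed plan for illustration.
    return {
        "media": "find the movie",
        "productivity": "add a watch party to the calendar",
    }

print(route("Find that movie and add a watch party to the calendar.",
            toy_classifier, agents))
```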

EXP-005: Failure Modes & Recovery How do agents handle errors, ambiguity, and wrong information? Production quality means graceful degradation.

EXP-007: Rusty on Local Models (Capstone) Can the supervisor agent run on a local model once specialized agents handle the domains? The final test of a fully local stack.

What We’ve Learned So Far

  1. Specialization works for narrow domains, not broad ones. +40% for media (46→7 tools, an ~85% cut), +0% for productivity (46→20 tools, only an ~57% cut). The threshold appears to be around 70-80% tool reduction. Below that, Sonnet doesn’t need the help.

  2. Local models can beat cloud. Gemma 26B matches Sonnet on media. Gemma 31B beats Sonnet on productivity by 6.8%. The 26B→31B jump (+15%) mattered more than any prompt engineering. Model scale > scaffolding tricks.

  3. Your eval infrastructure matters as much as your agents. We’ve found scorer bugs in two separate experiments that changed our conclusions by 30-40%. Validate your evaluator every time you change your evals, not just when you set them up (a sketch of this check follows the list).

  4. Mock evals are a trustworthy lower bound. Validated against real APIs — real-world scores are 5.6% higher than mocks. Safe to develop against mocks, validate against production periodically.
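
A minimal sketch of the evaluator check from point 3, assuming a hypothetical score_transcript() function and a directory of hand-labeled fixture transcripts:

```python
# Sketch only: keep a handful of known-pass / known-fail transcripts as fixtures
# and replay them through the scorer whenever the evals change.
# score_transcript() and the fixture layout are assumptions, not the real API.
import json
from pathlib import Path

def validate_scorer(score_transcript, fixtures_dir="eval_fixtures"):
    for path in sorted(Path(fixtures_dir).glob("*.json")):
        fixture = json.loads(path.read_text())
        got = score_transcript(fixture["transcript"])
        want = fixture["expected_pass"]  # labeled by the domain expert
        assert got == want, f"{path.name}: scorer returned {got}, expected {want}"
```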

Methodology

All experiments follow practitioner best practices from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org:

  • Binary pass/fail scoring on specific dimensions (not vague 1-5 “quality”; see the judge sketch after this list)
  • Domain expert calibration (one person’s judgment drives the system)
  • Data-first: look at outputs before defining criteria
  • Production flywheel: traces → human judgment → eval cases → automation
  • Different model families for generation vs judging (avoids self-enhancement bias)
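
To make the first and last bullets concrete, here is a minimal sketch of a binary, per-dimension judge. The dimension names, prompt format, and call_judge() helper are assumptions for illustration, not the project’s actual scorer; the only requirement is that call_judge() is served by a different model family than the agent under test.

```python
# Sketch of binary, per-dimension judging. Dimensions and call_judge() are
# illustrative assumptions.
DIMENSIONS = [
    "used_the_correct_tool",
    "answer_grounded_in_tool_output",
    "no_unrequested_side_effects",
]

def judge(transcript, call_judge):
    """call_judge(prompt) should hit a judge from a different model family than
    the generator and return the string 'PASS' or 'FAIL'."""
    results = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Dimension: {dim}\n"
            f"Transcript:\n{transcript}\n"
            "Answer PASS or FAIL only."
        )
        results[dim] = call_judge(prompt).strip().upper() == "PASS"
    return results  # each dimension is a binary pass/fail, not a 1-5 score
```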

Full reference: Production Eval Reference