The cheapest tier that holds quality. Measured.

Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems and evaluate them honestly: cost-aware routing, adversarial verification, deterministic scoring, honest nulls.

#
Tier
Route condition
Cost
Status
T0
DeepSeekdeepseek-v3 / call.sh
Input already in context, no tools needed. Classification, scoring, extraction, summarization.
$0.001per 1M tok (in)
Pending
T1
Haikuclaude-haiku-4
Needs tools but mechanical, single-correct-form. Read file, run command, grep-and-list.
$0.008per 1M tok (in)
Pending
T2
Sonnetclaude-sonnet-4
Real judgment with tools. Code review, exploration, debugging, multi-step investigation. DEFAULT.
$0.30per 1M tok (in)
Pending
T3
Opusclaude-opus-4
Convergence vote, architecture (3+ trade-offs), creative synthesis. Must justify over Sonnet.
$1.50per 1M tok (in)
Pending
Run cost$0.000Evaluating...

Builds agentic systems. Evaluates them honestly.

Four artifacts, one shared substrate. Quorum's core/ provides cost-aware routing, adversarial multi-agent verification, and full tracing. Aegis, FieldAgent, and the Skill-Tuning Council all vendor it.

The eval discipline is non-negotiable: deterministic scoring (no LLM judge in the success path), adversarial verification, cost-gated reproducible runs, and honest nulls reported when the data doesn't hold.

Kernelquorum/core/ -- cost-aware routing + adversarial K=3 verification + full tracing. Vendored by Aegis, FieldAgent, Skill-Tuning Council.
Honest finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

Task-aware agent orchestrator: cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product.

Fans out finders per file, K skeptics per finding (concurrency cap 8). Approximately $0.25 total per run. 58 tests, ruff plus mypy plus CI green.make eval-dry reproduces offline.

Cost-routing claim: harness committed, live multi-tier number gated on Anthropic key.

MetricValue
False positives (pre-verification)27.8%
False positives (K=3 verification)0.0%
95% CI post-verification[0, 0]
Recall after verification77.8%
Held-out bugs found3 / 3
Est. cost per run~$0.25Gated on Anthropic key
Test suite58 tests
https://quorum.thomaspeng.caOpen live
Honest finding (the sophisticated null)

A reasoning model is significantly more robust pre-defense (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap: 1.7% vs 2.8%, p=0.40, not significant. The defenses are the finding, not the model.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Layered defenses measurably cut attack success. Vendors Quorum's core/.

Adaptation lift 24.0% to 29.9% became significant only after scaling the benchmark (McNemar b=17/c=0, p approx 0; was a null at small n). Frame: scaling is the legit power lever, not p-hacking. Defense reduction 29.2% to 4.2%. Input-classifier is the workhorse. 78 tests, CI plus Pages green.

MetricValue
Injection ASR: reasoning model49.3%
Injection ASR: baseline68.1%
p-value (model robustness)0.0012
Full defense: reasoning vs baselinep=0.40 (n.s.)
Defense reduction29.2% to 4.2%
Adaptation lift (at scale)24.0% to 29.9%
Test suite78 tests
https://7p3ng.github.io/aegis/Open live
Honest finding (the honest null)

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637]. +0.21 F1 over a keyword floor (robust, baseline-independent). The "agentic chunking lift" is model-specific noise, not a real advantage. Looked like +0.45 on DeepSeek only because of a truncation artifact. Fair rerun collapses it to +0.07, CIs overlap, ties on Claude Sonnet. This honesty is the point.

An agent reads a real commercial contract and flags risk-bearing clauses (span plus severity plus plain-English risk), graded span-IoU against CUAD gold (no LLM judge). Vendors Quorum's core/.

47 tests, CI green. Party names and dollar figures are redacted in the demo. The +0.45 looked real until a truncation audit killed it: both arms need max_tokens headroom and cross-model validation before any lift claim ships.

MetricValue
Detection F10.548
Precision / Recall0.741 / 0.435
95% CI[0.460, 0.637]
F1 lift over keyword floor+0.21
Agentic chunking lift (honest)+0.07 (CIs overlap)
Contracts evaluated20 held-out CUAD
Test suite47 tests
https://fieldagent.thomaspeng.caOpen live
04

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary to editors to merger to council, with escalation on disagreement.

576 tests. Internal infra, no public URL. Shown here as a systems-design piece: the log below is a real pipeline run, replayed.

ComponentRole
Adversary agentProbes candidate for regressions
Editor agents (A, B)Rewrite flagged anti-patterns
MergerConflict-resolves editor outputs
Council (4 proxies)Votes: all 4 must pass to ship
Test suite576 tests
skill-tuning-councilrun_id: st-001
05

Eval discipline

PrincipleWhat it means in practice
Deterministic scoringNo LLM judge in the success path. Exact match, span-IoU, p-values against a labeled holdout. If the metric is fuzzy, the result is not publishable.-- FieldAgent: span-IoU vs CUAD gold
Adversarial verificationK skeptic agents challenge every finding before it ships. False positives caught at the verification layer, not in production.-- Quorum: K=3 cut FP 27.8% to 0.0%
Cost-gated reproducibilityEvery result has a make eval-dry command that reproduces it offline. Live tier numbers are gated on real API keys, stated honestly.-- Quorum: make eval-dry reproduces offline
Honest nulls reportedWhen the data doesn't hold, it leads. The FieldAgent agentic-chunking lift collapsed from +0.45 to +0.07 under a fair rerun. That rerun is in the write-up.-- FieldAgent: truncation artifact disclosed
Scaling, not p-hackingAegis adaptation lift was a null at small n, became significant only after scaling the benchmark. The distinction is stated, not elided.-- Aegis: McNemar b=17/c=0, p approx 0 at scale