The cheapest tier that holds quality. Measured.

Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems and evaluate them honestly: cost-aware routing, adversarial verification, deterministic scoring, honest nulls.

thomas@thomaspeng.ca View work GitHub

Tier

Route condition

Cost

Status

DeepSeekdeepseek-v3 / call.sh

Input already in context, no tools needed. Classification, scoring, extraction, summarization.

$0.001per 1M tok (in)

Pending

Haikuclaude-haiku-4

Needs tools but mechanical, single-correct-form. Read file, run command, grep-and-list.

$0.008per 1M tok (in)

Pending

Sonnetclaude-sonnet-4

Real judgment with tools. Code review, exploration, debugging, multi-step investigation. DEFAULT.

$0.30per 1M tok (in)

Pending

Opusclaude-opus-4

Convergence vote, architecture (3+ trade-offs), creative synthesis. Must justify over Sonnet.

$1.50per 1M tok (in)

Pending

Run cost$0.000Evaluating...

Builds agentic systems. Evaluates them honestly.

Four artifacts, one shared substrate. Quorum's core/ provides cost-aware routing, adversarial multi-agent verification, and full tracing. Aegis, FieldAgent, and the Skill-Tuning Council all vendor it.

The eval discipline is non-negotiable: deterministic scoring (no LLM judge in the success path), adversarial verification, cost-gated reproducible runs, and honest nulls reported when the data doesn't hold.

Kernelquorum/core/ -- cost-aware routing + adversarial K=3 verification + full tracing. Vendored by Aegis, FieldAgent, Skill-Tuning Council.

Quorum

github.com/7P3ng/quorum

Honest finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

Task-aware agent orchestrator: cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product.

Fans out finders per file, K skeptics per finding (concurrency cap 8). Approximately $0.25 total per run. 58 tests, ruff plus mypy plus CI green.make eval-dry reproduces offline.

Cost-routing claim: harness committed, live multi-tier number gated on Anthropic key.

Metric	Value
False positives (pre-verification)	27.8%
False positives (K=3 verification)	0.0%
95% CI post-verification	[0, 0]
Recall after verification	77.8%
Held-out bugs found	3 / 3
Est. cost per run	~$0.25Gated on Anthropic key
Test suite	58 tests

https://quorum.thomaspeng.caOpen live

Aegis

github.com/7P3ng/aegis

Honest finding (the sophisticated null)

A reasoning model is significantly more robust pre-defense (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap: 1.7% vs 2.8%, p=0.40, not significant. The defenses are the finding, not the model.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Layered defenses measurably cut attack success. Vendors Quorum's core/.

Adaptation lift 24.0% to 29.9% became significant only after scaling the benchmark (McNemar b=17/c=0, p approx 0; was a null at small n). Frame: scaling is the legit power lever, not p-hacking. Defense reduction 29.2% to 4.2%. Input-classifier is the workhorse. 78 tests, CI plus Pages green.

Metric	Value
Injection ASR: reasoning model	49.3%
Injection ASR: baseline	68.1%
p-value (model robustness)	0.0012
Full defense: reasoning vs baseline	p=0.40 (n.s.)
Defense reduction	29.2% to 4.2%
Adaptation lift (at scale)	24.0% to 29.9%
Test suite	78 tests

https://7p3ng.github.io/aegis/Open live

FieldAgent

github.com/7P3ng/fieldagent

Honest finding (the honest null)

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637]. +0.21 F1 over a keyword floor (robust, baseline-independent). The "agentic chunking lift" is model-specific noise, not a real advantage. Looked like +0.45 on DeepSeek only because of a truncation artifact. Fair rerun collapses it to +0.07, CIs overlap, ties on Claude Sonnet. This honesty is the point.

An agent reads a real commercial contract and flags risk-bearing clauses (span plus severity plus plain-English risk), graded span-IoU against CUAD gold (no LLM judge). Vendors Quorum's core/.

47 tests, CI green. Party names and dollar figures are redacted in the demo. The +0.45 looked real until a truncation audit killed it: both arms need max_tokens headroom and cross-model validation before any lift claim ships.

Metric	Value
Detection F1	0.548
Precision / Recall	0.741 / 0.435
95% CI	[0.460, 0.637]
F1 lift over keyword floor	+0.21
Agentic chunking lift (honest)	+0.07 (CIs overlap)
Contracts evaluated	20 held-out CUAD
Test suite	47 tests

https://fieldagent.thomaspeng.caOpen live

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary to editors to merger to council, with escalation on disagreement.

576 tests. Internal infra, no public URL. Shown here as a systems-design piece: the log below is a real pipeline run, replayed.

Component	Role
Adversary agent	Probes candidate for regressions
Editor agents (A, B)	Rewrite flagged anti-patterns
Merger	Conflict-resolves editor outputs
Council (4 proxies)	Votes: all 4 must pass to ship
Test suite	576 tests

skill-tuning-councilrun_id: st-001

Eval discipline

Principle	What it means in practice
Deterministic scoring	No LLM judge in the success path. Exact match, span-IoU, p-values against a labeled holdout. If the metric is fuzzy, the result is not publishable.-- FieldAgent: span-IoU vs CUAD gold
Adversarial verification	K skeptic agents challenge every finding before it ships. False positives caught at the verification layer, not in production.-- Quorum: K=3 cut FP 27.8% to 0.0%
Cost-gated reproducibility	Every result has a `make eval-dry` command that reproduces it offline. Live tier numbers are gated on real API keys, stated honestly.-- Quorum: make eval-dry reproduces offline
Honest nulls reported	When the data doesn't hold, it leads. The FieldAgent agentic-chunking lift collapsed from +0.45 to +0.07 under a fair rerun. That rerun is in the write-up.-- FieldAgent: truncation artifact disclosed
Scaling, not p-hacking	Aegis adaptation lift was a null at small n, became significant only after scaling the benchmark. The distinction is stated, not elided.-- Aegis: McNemar b=17/c=0, p approx 0 at scale

Let's talk.

Emailthomas@thomaspeng.ca GitHubgithub.com/7P3ng