K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.
Task-aware agent orchestrator: cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product.
Fans out finders per file, K skeptics per finding (concurrency cap 8). Approximately $0.25 total per run. 58 tests, ruff plus mypy plus CI green.make eval-dry reproduces offline.
Cost-routing claim: harness committed, live multi-tier number gated on Anthropic key.
| Metric | Value |
|---|---|
| False positives (pre-verification) | 27.8% |
| False positives (K=3 verification) | 0.0% |
| 95% CI post-verification | [0, 0] |
| Recall after verification | 77.8% |
| Held-out bugs found | 3 / 3 |
| Est. cost per run | ~$0.25Gated on Anthropic key |
| Test suite | 58 tests |