Last updated: February 22, 2026
These runs compare the claude-agent-sdk baseline against the Raysurfer drop-in replacement on rotating task variants. When cache files are present, the agent is required to inspect them, along with their reputation signals, before executing.
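To make the protocol concrete, here is a minimal sketch of the rotating-variant comparison loop. The runner interface (`run_agent`, its `result.output` and `result.read_cache_before_exec` attributes), the task dict shape, and the `rotate_prompt` / `check_solution` helpers are hypothetical stand-ins for illustration, not the actual harness:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task_id: str
    round_idx: int
    passed: bool
    reviewed_cache: bool  # did the transcript show a cache/reputation read before execution?

def run_rounds(tasks, rounds, run_agent, rotate_prompt, check_solution,
               max_turns=8, timeout_s=120):
    """Run every task for the given number of rounds, rotating the prompt
    wording each round so cached answers can't be replayed verbatim."""
    attempts = []
    for round_idx in range(rounds):
        for task in tasks:
            prompt = rotate_prompt(task, round_idx)  # fresh variant per round
            result = run_agent(prompt, max_turns=max_turns, timeout_s=timeout_s)
            attempts.append(Attempt(
                task_id=task["id"],
                round_idx=round_idx,
                passed=check_solution(task, result.output),
                reviewed_cache=result.read_cache_before_exec,
            ))
    return attempts
```

The same loop is run once per agent (baseline, then Raysurfer) with identical tasks, turn limits, and timeouts, so the only variable is the agent under test.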
- **Best Uplift (Mixed Existing):** +32.5 pp (45.0% baseline vs 77.5% Raysurfer)
- **Best Showcase (MBPP Rotating):** +33.3 pp (0.0% baseline vs 33.3% Raysurfer)
- **Cache Review Compliance:** 90%+ on the strongest persistence-focused runs
| Run Set | Attempts | Model | Turns | Timeout | Baseline | Raysurfer | Delta |
|---|---|---|---|---|---|---|---|
| Mixed existing benchmarks (10 HumanEval + 10 MBPP), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 45.0% (18/40) | 77.5% (31/40) | +32.5 pp |
| MBPP showcase (best persistence demo), rotating prompts | 30 (10 tasks x 3 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 0.0% (0/30) | 33.3% (10/30) | +33.3 pp |
| Public one-shot eval (20 implementation-heavy tasks), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 180s | 55.0% (22/40) | 40.0% (16/40) | -15.0 pp |
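For clarity on how the table's numbers are derived: pass rate is passes over attempts, and the delta is a difference in percentage points (pp), i.e. a simple subtraction of rates, not a relative change. A short sketch, using the mixed-existing row's own counts:

```python
def pass_rate(passes: int, attempts: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passes / attempts

def delta_pp(baseline: float, raysurfer: float) -> float:
    """Difference in percentage points, not relative improvement."""
    return raysurfer - baseline

# Mixed existing benchmarks row: 18/40 baseline vs 31/40 Raysurfer
baseline = pass_rate(18, 40)   # 45.0
raysurfer = pass_rate(31, 40)  # 77.5
print(f"{delta_pp(baseline, raysurfer):+.1f} pp")  # +32.5 pp
```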
Full methodology and re-run commands live on the docs benchmark page.
The recommended public setup is a single canonical page at raysurfer.com/benchmarks, with the reproducibility details and commands kept in the docs.