Last updated: February 22, 2026

Raysurfer Benchmark Results

These runs compare the claude-agent-sdk baseline against the Raysurfer drop-in replacement on rotating task variants. When cache files are present, the agent is required to inspect them, along with reputation signals, before execution.

Best Uplift (Mixed Existing)

+32.5 pp

45.0% baseline vs 77.5% Raysurfer

Best Showcase (MBPP Rotating)

+33.3 pp

0.0% baseline vs 33.3% Raysurfer

Cache Review Compliance

90%+

On the strongest persistence-focused runs

| Run Set | Attempts / Mode | Model | Turns | Timeout | Baseline | Raysurfer | Delta |
|---|---|---|---|---|---|---|---|
| Mixed existing benchmarks (10 HumanEval + 10 MBPP), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 45.0% (18/40) | 77.5% (31/40) | +32.5 pp |
| MBPP showcase (best persistence demo), rotating prompts | 30 (10 tasks x 3 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 0.0% (0/30) | 33.3% (10/30) | +33.3 pp |
| Public one-shot eval (20 implementation-heavy tasks), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 180s | 55.0% (22/40) | 40.0% (16/40) | -15.0 pp |
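The pass rates and deltas above reduce to simple ratio arithmetic over the raw counts. A minimal sketch that recomputes them (the run labels here are illustrative, not identifiers from the harness):

```python
def pass_rate(passed: int, attempts: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / attempts

# Raw counts copied from the table; "pp" means percentage points.
runs = {
    "mixed_existing": {"attempts": 40, "baseline": 18, "raysurfer": 31},
    "mbpp_showcase": {"attempts": 30, "baseline": 0, "raysurfer": 10},
    "public_one_shot": {"attempts": 40, "baseline": 22, "raysurfer": 16},
}

for name, r in runs.items():
    base = pass_rate(r["baseline"], r["attempts"])
    rays = pass_rate(r["raysurfer"], r["attempts"])
    print(f"{name}: {base:.1f}% -> {rays:.1f}% ({rays - base:+.1f} pp)")
```

Note that the delta is reported in percentage points (an absolute difference of rates), not a relative percentage change.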

Methodology

  • Task order rotates each round to stress persistence under variation.
  • Prompts require cache-file and reputation review before execution whenever a cache exists.
  • Both modes use the same model and turn/timeout budgets.
  • Primary score is completion-within-SLA; validation pass is reported separately.
  • Per-run details include `status`, `validation`, `cache_hit`, and `cache_review`.
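The per-run fields listed above roll up into the headline metrics. A hedged sketch of that aggregation, assuming a record layout of our own invention (only the field names `status`, `validation`, `cache_hit`, and `cache_review` come from the methodology notes):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    status: str          # "completed" if the task finished within the SLA
    validation: bool     # validation pass, reported separately from the primary score
    cache_hit: bool      # a cache file existed for this task
    cache_review: bool   # agent reviewed cache + reputation before executing

def summarize(records: list[RunRecord]) -> dict:
    total = len(records)
    hits = [r for r in records if r.cache_hit]
    return {
        # Primary score: completion within the SLA.
        "completion_within_sla": sum(r.status == "completed" for r in records) / total if total else 0.0,
        "validation_pass": sum(r.validation for r in records) / total if total else 0.0,
        # Compliance counts only runs where a cache file actually existed.
        "cache_review_compliance": sum(r.cache_review for r in hits) / len(hits) if hits else None,
    }
```

The compliance denominator matters: scoring review over all runs, rather than only cache-hit runs, would understate compliance on rounds where few caches exist yet.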

Reproduce Results

Full methodology and re-run commands live in the docs benchmark page.

Publish Recommendation

The best public setup is a single canonical page at raysurfer.com/benchmarks, with the reproducibility details and re-run commands kept in the docs.