Last updated: February 22, 2026

Raysurfer Benchmark Results

These runs compare the claude-agent-sdk baseline against the Raysurfer drop-in replacement on rotating task variants. When cache files are present, the agent is required to inspect them, along with reputation signals, before execution.

Best Uplift (Mixed Existing)

+32.5 pp

45.0% baseline vs 77.5% Raysurfer

Best Showcase (MBPP Rotating)

+33.3 pp

0.0% baseline vs 33.3% Raysurfer

Cache Review Compliance

90%+

On the strongest persistence-focused runs

| Run Set | Attempts / Mode | Model | Turns | Timeout | Baseline | Raysurfer | Delta |
|---|---|---|---|---|---|---|---|
| Mixed existing benchmarks (10 HumanEval + 10 MBPP), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 45.0% (18/40) | 77.5% (31/40) | +32.5 pp |
| MBPP showcase (best persistence demo), rotating prompts | 30 (10 tasks x 3 rounds) | claude-haiku-4-5-20251001 | 8 | 120s | 0.0% (0/30) | 33.3% (10/30) | +33.3 pp |
| Public one-shot eval (20 implementation-heavy tasks), rotating prompts | 40 (20 tasks x 2 rounds) | claude-haiku-4-5-20251001 | 8 | 180s | 55.0% (22/40) | 40.0% (16/40) | -15.0 pp |
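The pass rates and deltas above reduce to simple ratio arithmetic over the raw counts. A minimal sketch that recomputes them (the run labels here are illustrative, not identifiers from the harness):

```python
def pass_rate(passed: int, attempts: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / attempts

# Raw counts copied from the table; "pp" means percentage points.
runs = {
    "mixed_existing": {"attempts": 40, "baseline": 18, "raysurfer": 31},
    "mbpp_showcase": {"attempts": 30, "baseline": 0, "raysurfer": 10},
    "public_one_shot": {"attempts": 40, "baseline": 22, "raysurfer": 16},
}

for name, r in runs.items():
    base = pass_rate(r["baseline"], r["attempts"])
    rays = pass_rate(r["raysurfer"], r["attempts"])
    print(f"{name}: {base:.1f}% -> {rays:.1f}% ({rays - base:+.1f} pp)")
```

Note that the delta is reported in percentage points (an absolute difference of rates), not a relative percentage change.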

Methodology

  • Task order rotates each round to stress persistence under variation.
  • Prompts require cache-file and reputation review before execution whenever a cache exists.
  • Both modes use the same model and turn/timeout budgets.
  • Primary score is completion-within-SLA; validation pass is reported separately.
  • Per-run details include `status`, `validation`, `cache_hit`, and `cache_review`.
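The per-run fields listed above roll up into the headline metrics. A hedged sketch of that aggregation, assuming a record layout of our own invention (only the field names `status`, `validation`, `cache_hit`, and `cache_review` come from the methodology notes):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    status: str          # "completed" if the task finished within the SLA
    validation: bool     # validation pass, reported separately from the primary score
    cache_hit: bool      # a cache file existed for this task
    cache_review: bool   # agent reviewed cache + reputation before executing

def summarize(records: list[RunRecord]) -> dict:
    total = len(records)
    hits = [r for r in records if r.cache_hit]
    return {
        # Primary score: completion within the SLA.
        "completion_within_sla": sum(r.status == "completed" for r in records) / total if total else 0.0,
        "validation_pass": sum(r.validation for r in records) / total if total else 0.0,
        # Compliance counts only runs where a cache file actually existed.
        "cache_review_compliance": sum(r.cache_review for r in hits) / len(hits) if hits else None,
    }
```

The compliance denominator matters: scoring review over all runs, rather than only cache-hit runs, would understate compliance on rounds where few caches exist yet.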

Reproduce Results

Full methodology and re-run commands live in the docs benchmark page.

Publish Recommendation

The best public setup is a single canonical page at raysurfer.com/benchmarks, with the reproducibility details and re-run commands kept in the docs.