mcp_server_development_guide_with_evaluation_harness.py
# SKILL.md

---
name: mcp-builder
description: Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).
license: Complete terms in LICENSE.txt
---

# MCP Server Development Guide

## Overview

To create high-quality MCP (Model Context Protocol) servers that enable LLMs to effectively interact with external services, use this skill. An MCP server provides tools that allow LLMs to access external services and APIs. The quality of an MCP server is measured by how well it enables LLMs to accomplish real-world tasks using the tools provided.

---

# Process

## šŸš€ High-Level Workflow

Creating a high-quality MCP server involves four main phases:

### Phase 1: Deep Research and Planning

#### 1.1 Understand Agent-Centric Design Principles

Before diving into implementation, understand how to design tools for AI agents by reviewing these principles:

**Build for Workflows, Not Just API Endpoints:**
- Don't simply wrap existing API endpoints - build thoughtful, high-impact workflow tools
- Consolidate related operations (e.g., `schedule_event` that both checks availability and creates the event)
- Focus on tools that enable complete tasks, not just individual API calls
- Consider what workflows agents actually need to accomplish
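
The `schedule_event` consolidation above can be sketched as follows (a minimal illustration with a hypothetical in-memory calendar; `check_availability` and `schedule_event` are example names, not part of any real API):

```python
from datetime import datetime, timedelta

# Hypothetical in-memory calendar, used only for illustration.
BOOKED: list[tuple[datetime, datetime]] = []

def check_availability(start: datetime, duration_minutes: int) -> bool:
    """Return True if no booked event overlaps the requested slot."""
    end = start + timedelta(minutes=duration_minutes)
    return all(end <= s or start >= e for s, e in BOOKED)

def schedule_event(title: str, start: datetime, duration_minutes: int) -> str:
    """Workflow tool: checks availability AND creates the event in one call,
    so the agent does not need two round-trips."""
    if not check_availability(start, duration_minutes):
        return (f"Slot at {start:%Y-%m-%d %H:%M} is taken. "
                "Call check_availability with a different time.")
    BOOKED.append((start, start + timedelta(minutes=duration_minutes)))
    return f"Scheduled '{title}' at {start:%Y-%m-%d %H:%M} for {duration_minutes} minutes."
```

One consolidated tool like this replaces a check-then-create pair of raw endpoint wrappers and removes a whole class of race conditions from the agent's plan.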

**Optimize for Limited Context:**
- Agents have constrained context windows - make every token count
- Return high-signal information, not exhaustive data dumps
- Provide "concise" vs "detailed" response format options
- Default to human-readable identifiers over technical codes (names over IDs)
- Consider the agent's context budget as a scarce resource

**Design Actionable Error Messages:**
- Error messages should guide agents toward correct usage patterns
- Suggest specific next steps: "Try using filter='active_only' to reduce results"
- Make errors educational, not just diagnostic
- Help agents learn proper tool usage through clear feedback

**Follow Natural Task Subdivisions:**
- Tool names should reflect how humans think about tasks
- Group related tools with consistent prefixes for discoverability
- Design tools around natural workflows, not just API structure

**Use Evaluation-Driven Development:**
- Create realistic evaluation scenarios early
- Let agent feedback drive tool improvements
- Prototype quickly and iterate based on actual agent performance

#### 1.2 Study MCP Protocol Documentation

**Fetch the latest MCP protocol documentation:**

Use WebFetch to load: `https://modelcontextprotocol.io/llms-full.txt`

This comprehensive document contains the complete MCP specification and guidelines.

#### 1.3 Study Framework Documentation

**Load and read the following reference files:**

- **MCP Best Practices**: [šŸ“‹ View Best Practices](./reference/mcp_best_practices.md) - Core guidelines for all MCP servers

**For Python implementations, also load:**
- **Python SDK Documentation**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- [šŸ Python Implementation Guide](./reference/python_mcp_server.md) - Python-specific best practices and examples

**For Node/TypeScript implementations, also load:**
- **TypeScript SDK Documentation**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`
- [⚔ TypeScript Implementation Guide](./reference/node_mcp_server.md) - Node/TypeScript-specific best practices and examples

#### 1.4 Exhaustively Study API Documentation

To integrate a service, read through **ALL** available API documentation:
- Official API reference documentation
- Authentication and authorization requirements
- Rate limiting and pagination patterns
- Error responses and status codes
- Available endpoints and their parameters
- Data models and schemas

**To gather comprehensive information, use web search and the WebFetch tool as needed.**

#### 1.5 Create a Comprehensive Implementation Plan

Based on your research, create a detailed plan that includes:

**Tool Selection:**
- List the most valuable endpoints/operations to implement
- Prioritize tools that enable the most common and important use cases
- Consider which tools work together to enable complex workflows

**Shared Utilities and Helpers:**
- Identify common API request patterns
- Plan pagination helpers
- Design filtering and formatting utilities
- Plan error handling strategies

**Input/Output Design:**
- Define input validation models (Pydantic for Python, Zod for TypeScript)
- Design consistent response formats (e.g., JSON or Markdown) and configurable levels of detail (e.g., Detailed or Concise)
- Plan for large-scale usage (thousands of users/resources)
- Implement character limits and truncation strategies (e.g., 25,000 tokens)
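
A truncation strategy can be as simple as the following sketch (the character cap and the notice wording are illustrative assumptions):

```python
CHARACTER_LIMIT = 100_000  # illustrative cap, roughly 25,000 tokens

def truncate_response(text: str, limit: int = CHARACTER_LIMIT) -> str:
    """Truncate oversized tool output and tell the agent how to narrow the query."""
    if len(text) <= limit:
        return text
    notice = (
        "\n\n[Output truncated. Use a more specific filter or request "
        "the 'concise' response format to see the remaining data.]"
    )
    # Reserve room for the notice so the final string stays within the limit.
    return text[: limit - len(notice)] + notice
```

Truncating with an explanatory notice, rather than silently cutting the string, keeps the agent from mistaking a partial result for a complete one.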

**Error Handling Strategy:**
- Plan graceful failure modes
- Design clear, actionable, LLM-friendly natural-language error messages that prompt further action
- Consider rate limiting and timeout scenarios
- Handle authentication and authorization errors

---

### Phase 2: Implementation

Now that you have a comprehensive plan, begin implementation following language-specific best practices.

#### 2.1 Set Up Project Structure

**For Python:**
- Create a single `.py` file or organize into modules if complex (see [šŸ Python Guide](./reference/python_mcp_server.md))
- Use the MCP Python SDK for tool registration
- Define Pydantic models for input validation

**For Node/TypeScript:**
- Create a proper project structure (see [⚔ TypeScript Guide](./reference/node_mcp_server.md))
- Set up `package.json` and `tsconfig.json`
- Use the MCP TypeScript SDK
- Define Zod schemas for input validation

#### 2.2 Implement Core Infrastructure First

**To begin implementation, create shared utilities before implementing tools:**
- API request helper functions
- Error handling utilities
- Response formatting functions (JSON and Markdown)
- Pagination helpers
- Authentication/token management

#### 2.3 Implement Tools Systematically

For each tool in the plan:

**Define Input Schema:**
- Use Pydantic (Python) or Zod (TypeScript) for validation
- Include proper constraints (min/max length, regex patterns, min/max values, ranges)
- Provide clear, descriptive field descriptions
- Include diverse examples in field descriptions
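
A minimal Pydantic v2 input model might look like this sketch (the tool and field names are hypothetical):

```python
from pydantic import BaseModel, ConfigDict, Field

class SearchIssuesInput(BaseModel):
    """Input schema for a hypothetical search_issues tool."""
    model_config = ConfigDict(extra="forbid")  # reject unexpected arguments

    query: str = Field(
        ...,
        min_length=1,
        max_length=200,
        description="Search text, e.g. 'login timeout' or 'label:bug payments'",
    )
    max_results: int = Field(
        20,
        ge=1,
        le=100,
        description="Number of results to return (1-100), e.g. 10",
    )
```

The constraints and example-laden descriptions end up in the generated JSON schema, which is exactly what the agent reads when deciding how to call the tool.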

**Write Comprehensive Docstrings/Descriptions:**
- One-line summary of what the tool does
- Detailed explanation of purpose and functionality
- Explicit parameter types with examples
- Complete return type schema
- Usage examples (when to use, when not to use)
- Error handling documentation, which outlines how to proceed given specific errors

**Implement Tool Logic:**
- Use shared utilities to avoid code duplication
- Follow async/await patterns for all I/O
- Implement proper error handling
- Support multiple response formats (JSON and Markdown)
- Respect pagination parameters
- Check character limits and truncate appropriately

**Add Tool Annotations:**
- `readOnlyHint`: true (for read-only operations)
- `destructiveHint`: false (for non-destructive operations)
- `idempotentHint`: true (if repeated calls have same effect)
- `openWorldHint`: true (if interacting with external systems)
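
On the wire, these annotations sit on the tool definition roughly as follows (a hand-built dict standing in for what an SDK such as FastMCP generates at registration time; `search_issues` is a hypothetical tool):

```python
# Sketch of a tools/list entry carrying annotations; in practice the SDK
# builds this structure from the values you pass when registering the tool.
tool_definition = {
    "name": "search_issues",
    "description": "Search issues in the tracker (read-only).",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "annotations": {
        "readOnlyHint": True,      # never mutates state
        "destructiveHint": False,  # cannot delete or overwrite data
        "idempotentHint": True,    # repeating the same query has the same effect
        "openWorldHint": True,     # talks to an external service
    },
}
```

Clients can use these hints, for example, to skip confirmation prompts for read-only tools while gating destructive ones.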

#### 2.4 Follow Language-Specific Best Practices

**At this point, load the appropriate language guide:**

**For Python: Load [šŸ Python Implementation Guide](./reference/python_mcp_server.md) and ensure the following:**
- Using the MCP Python SDK with proper tool registration
- Pydantic v2 models with `model_config`
- Type hints throughout
- Async/await for all I/O operations
- Proper import organization
- Module-level constants (CHARACTER_LIMIT, API_BASE_URL)

**For Node/TypeScript: Load [⚔ TypeScript Implementation Guide](./reference/node_mcp_server.md) and ensure the following:**
- Using `server.registerTool` properly
- Zod schemas with `.strict()`
- TypeScript strict mode enabled
- No `any` types - use proper types
- Explicit `Promise<T>` return types
- Build process configured (`npm run build`)

---

### Phase 3: Review and Refine

After initial implementation:

#### 3.1 Code Quality Review

To ensure quality, review the code for:
- **DRY Principle**: No duplicated code between tools
- **Composability**: Shared logic extracted into functions
- **Consistency**: Similar operations return similar formats
- **Error Handling**: All external calls have error handling
- **Type Safety**: Full type coverage (Python type hints, TypeScript types)
- **Documentation**: Every tool has comprehensive docstrings/descriptions

#### 3.2 Test and Build

**Important:** MCP servers are long-running processes that wait for requests over stdio or SSE/HTTP. Running one directly in your main process (e.g., `python server.py` or `node dist/index.js`) will cause your process to hang indefinitely.

**Safe ways to test the server:**
- Use the evaluation harness (see Phase 4) - recommended approach
- Run the server in tmux to keep it outside your main process
- Use a timeout when testing: `timeout 5s python server.py`

**For Python:**
- Verify Python syntax: `python -m py_compile your_server.py`
- Check that imports work correctly by reviewing the file
- To manually test: run the server in tmux, then test with the evaluation harness in your main process
- Or use the evaluation harness directly (it manages the server for stdio transport)

**For Node/TypeScript:**
- Run `npm run build` and ensure it completes without errors
- Verify that `dist/index.js` is created
- To manually test: run the server in tmux, then test with the evaluation harness in your main process
- Or use the evaluation harness directly (it manages the server for stdio transport)
#### 3.3 Use Quality Checklist

To verify implementation quality, load the appropriate checklist from the language-specific guide:
- Python: see "Quality Checklist" in [šŸ Python Guide](./reference/python_mcp_server.md)
- Node/TypeScript: see "Quality Checklist" in [⚔ TypeScript Guide](./reference/node_mcp_server.md)

---

### Phase 4: Create Evaluations

After implementing your MCP server, create comprehensive evaluations to test its effectiveness.

**Load [āœ… Evaluation Guide](./reference/evaluation.md) for complete evaluation guidelines.**

#### 4.1 Understand Evaluation Purpose

Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions.

#### 4.2 Create 10 Evaluation Questions

To create effective evaluations, follow the process outlined in the evaluation guide:

1. **Tool Inspection**: List available tools and understand their capabilities
2. **Content Exploration**: Use READ-ONLY operations to explore available data
3. **Question Generation**: Create 10 complex, realistic questions
4. **Answer Verification**: Solve each question yourself to verify answers

#### 4.3 Evaluation Requirements

Each question must be:
- **Independent**: Not dependent on other questions
- **Read-only**: Only non-destructive operations required
- **Complex**: Requiring multiple tool calls and deep exploration
- **Realistic**: Based on real use cases humans would care about
- **Verifiable**: Single, clear answer that can be verified by string comparison
- **Stable**: Answer won't change over time

#### 4.4 Output Format

Create an XML file with this structure:

```xml
<evaluation>
  <qa_pair>
    <question>Find discussions about AI model launches with animal codenames. One model needed a specific safety designation that uses the format ASL-X. What number X was being determined for the model named after a spotted wild cat?</question>
    <answer>3</answer>
  </qa_pair>
  <!-- More qa_pairs... -->
</evaluation>
```

---

# Reference Files

## šŸ“š Documentation Library

Load these resources as needed during development:

### Core MCP Documentation (Load First)
- **MCP Protocol**: Fetch from `https://modelcontextprotocol.io/llms-full.txt` - Complete MCP specification
- [šŸ“‹ MCP Best Practices](./reference/mcp_best_practices.md) - Universal MCP guidelines including:
  - Server and tool naming conventions
  - Response format guidelines (JSON vs Markdown)
  - Pagination best practices
  - Character limits and truncation strategies
  - Tool development guidelines
  - Security and error handling standards

### SDK Documentation (Load During Phase 1/2)
- **Python SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- **TypeScript SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`

### Language-Specific Implementation Guides (Load During Phase 2)
- [šŸ Python Implementation Guide](./reference/python_mcp_server.md) - Complete Python/FastMCP guide with:
  - Server initialization patterns
  - Pydantic model examples
  - Tool registration with `@mcp.tool`
  - Complete working examples
  - Quality checklist

- [⚔ TypeScript Implementation Guide](./reference/node_mcp_server.md) - Complete TypeScript guide with:
  - Project structure
  - Zod schema patterns
  - Tool registration with `server.registerTool`
  - Complete working examples
  - Quality checklist

### Evaluation Guide (Load During Phase 4)
- [āœ… Evaluation Guide](./reference/evaluation.md) - Complete evaluation creation guide with:
  - Question creation guidelines
  - Answer verification strategies
  - XML format specifications
  - Example questions and answers
  - Running an evaluation with the provided scripts

# connections.py

```python
"""Lightweight connection handling for MCP servers."""

from abc import ABC, abstractmethod
from contextlib import AsyncExitStack
from typing import Any

from mcp import ClientSession, StdioServerParameters
from mcp.client.sse import sse_client
from mcp.client.stdio import stdio_client
from mcp.client.streamable_http import streamablehttp_client


class MCPConnection(ABC):
    """Base class for MCP server connections."""

    def __init__(self):
        self.session = None
        self._stack = None

    @abstractmethod
    def _create_context(self):
        """Create the connection context based on connection type."""

    async def __aenter__(self):
        """Initialize the MCP server connection."""
        self._stack = AsyncExitStack()
        await self._stack.__aenter__()

        try:
            ctx = self._create_context()
            result = await self._stack.enter_async_context(ctx)

            # stdio/sse contexts yield (read, write); the streamable HTTP
            # context yields a third element that we don't need here.
            if len(result) == 2:
                read, write = result
            elif len(result) == 3:
                read, write, _ = result
            else:
                raise ValueError(f"Unexpected context result: {result}")

            session_ctx = ClientSession(read, write)
            self.session = await self._stack.enter_async_context(session_ctx)
            await self.session.initialize()
            return self
        except BaseException:
            await self._stack.__aexit__(None, None, None)
            raise

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Clean up MCP server connection resources."""
        if self._stack:
            await self._stack.__aexit__(exc_type, exc_val, exc_tb)
        self.session = None
        self._stack = None

    async def list_tools(self) -> list[dict[str, Any]]:
        """Retrieve available tools from the MCP server."""
        response = await self.session.list_tools()
        return [
            {
                "name": tool.name,
                "description": tool.description,
                "input_schema": tool.inputSchema,
            }
            for tool in response.tools
        ]

    async def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
        """Call a tool on the MCP server with the provided arguments."""
        result = await self.session.call_tool(tool_name, arguments=arguments)
        return result.content


class MCPConnectionStdio(MCPConnection):
    """MCP connection using standard input/output."""

    def __init__(self, command: str, args: list[str] | None = None, env: dict[str, str] | None = None):
        super().__init__()
        self.command = command
        self.args = args or []
        self.env = env

    def _create_context(self):
        return stdio_client(
            StdioServerParameters(command=self.command, args=self.args, env=self.env)
        )


class MCPConnectionSSE(MCPConnection):
    """MCP connection using Server-Sent Events."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return sse_client(url=self.url, headers=self.headers)


class MCPConnectionHTTP(MCPConnection):
    """MCP connection using Streamable HTTP."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return streamablehttp_client(url=self.url, headers=self.headers)


def create_connection(
    transport: str,
    command: str | None = None,
    args: list[str] | None = None,
    env: dict[str, str] | None = None,
    url: str | None = None,
    headers: dict[str, str] | None = None,
) -> MCPConnection:
    """Factory function to create the appropriate MCP connection.

    Args:
        transport: Connection type ("stdio", "sse", or "http")
        command: Command to run (stdio only)
        args: Command arguments (stdio only)
        env: Environment variables (stdio only)
        url: Server URL (sse and http only)
        headers: HTTP headers (sse and http only)

    Returns:
        MCPConnection instance
    """
    transport = transport.lower()

    if transport == "stdio":
        if not command:
            raise ValueError("Command is required for stdio transport")
        return MCPConnectionStdio(command=command, args=args, env=env)

    elif transport == "sse":
        if not url:
            raise ValueError("URL is required for sse transport")
        return MCPConnectionSSE(url=url, headers=headers)

    elif transport in ["http", "streamable_http", "streamable-http"]:
        if not url:
            raise ValueError("URL is required for http transport")
        return MCPConnectionHTTP(url=url, headers=headers)

    else:
        raise ValueError(f"Unsupported transport type: {transport}. Use 'stdio', 'sse', or 'http'")
```

# evaluation.py

```python
"""MCP Server Evaluation Harness

This script evaluates MCP servers by running test questions against them using Claude.
"""

import argparse
import asyncio
import json
import re
import sys
import time
import traceback
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Any

from anthropic import Anthropic

from connections import create_connection

EVALUATION_PROMPT = """You are an AI assistant with access to tools.

When given a task, you MUST:
1. Use the available tools to complete the task
2. Provide a summary of each step in your approach, wrapped in <summary> tags
3. Provide feedback on the tools provided, wrapped in <feedback> tags
4. Provide your final response, wrapped in <response> tags

Summary Requirements:
- In your <summary> tags, you must explain:
  - The steps you took to complete the task
  - Which tools you used, in what order, and why
  - The inputs you provided to each tool
  - The outputs you received from each tool
  - A summary of how you arrived at the response

Feedback Requirements:
- In your <feedback> tags, provide constructive feedback on the tools:
  - Comment on tool names: Are they clear and descriptive?
  - Comment on input parameters: Are they well-documented? Are required vs optional parameters clear?
  - Comment on descriptions: Do they accurately describe what the tool does?
  - Comment on any errors encountered during tool usage: Did the tool fail to execute? Did the tool return too many tokens?
  - Identify specific areas for improvement and explain WHY they would help
  - Be specific and actionable in your suggestions

Response Requirements:
- Your response should be concise and directly address what was asked
- Always wrap your final response in <response> tags
- If you cannot solve the task, return <response>NOT_FOUND</response>
- For numeric responses, provide just the number
- For IDs, provide just the ID
- For names or text, provide the exact text requested
- Your response should go last"""


def parse_evaluation_file(file_path: Path) -> list[dict[str, Any]]:
    """Parse XML evaluation file with qa_pair elements."""
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        evaluations = []

        for qa_pair in root.findall(".//qa_pair"):
            question_elem = qa_pair.find("question")
            answer_elem = qa_pair.find("answer")

            if question_elem is not None and answer_elem is not None:
                evaluations.append({
                    "question": (question_elem.text or "").strip(),
                    "answer": (answer_elem.text or "").strip(),
                })

        return evaluations
    except Exception as e:
        print(f"Error parsing evaluation file {file_path}: {e}")
        return []


def extract_xml_content(text: str, tag: str) -> str | None:
    """Extract content from XML tags."""
    pattern = rf"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches[-1].strip() if matches else None


async def agent_loop(
    client: Anthropic,
    model: str,
    question: str,
    tools: list[dict[str, Any]],
    connection: Any,
) -> tuple[str, dict[str, Any]]:
    """Run the agent loop with MCP tools."""
    messages = [{"role": "user", "content": question}]

    response = await asyncio.to_thread(
        client.messages.create,
        model=model,
        max_tokens=4096,
        system=EVALUATION_PROMPT,
        messages=messages,
        tools=tools,
    )

    messages.append({"role": "assistant", "content": response.content})

    tool_metrics = {}

    while response.stop_reason == "tool_use":
        # A single assistant turn may contain several tool_use blocks, and
        # each one needs a matching tool_result in the next user message.
        tool_results = []
        for tool_use in (block for block in response.content if block.type == "tool_use"):
            tool_start_ts = time.time()
            try:
                tool_result = await connection.call_tool(tool_use.name, tool_use.input)
                # MCP tool results are usually lists of content-block objects,
                # so fall back to str() for anything json can't serialize.
                tool_response = (
                    json.dumps(tool_result, default=str)
                    if isinstance(tool_result, (dict, list))
                    else str(tool_result)
                )
            except Exception as e:
                tool_response = f"Error executing tool {tool_use.name}: {e}\n"
                tool_response += traceback.format_exc()
            tool_duration = time.time() - tool_start_ts

            metrics = tool_metrics.setdefault(tool_use.name, {"count": 0, "durations": []})
            metrics["count"] += 1
            metrics["durations"].append(tool_duration)

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": tool_response,
            })

        messages.append({"role": "user", "content": tool_results})

        response = await asyncio.to_thread(
            client.messages.create,
            model=model,
            max_tokens=4096,
            system=EVALUATION_PROMPT,
            messages=messages,
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})

    # Default to "" rather than None so downstream XML extraction never
    # crashes on a response that contains no text block.
    response_text = next(
        (block.text for block in response.content if hasattr(block, "text")),
        "",
    )
    return response_text, tool_metrics


async def evaluate_single_task(
    client: Anthropic,
    model: str,
    qa_pair: dict[str, Any],
    tools: list[dict[str, Any]],
    connection: Any,
    task_index: int,
) -> dict[str, Any]:
    """Evaluate a single QA pair with the given tools."""
    start_time = time.time()

    print(f"Task {task_index + 1}: Running task with question: {qa_pair['question']}")
    response, tool_metrics = await agent_loop(client, model, qa_pair["question"], tools, connection)

    response_value = extract_xml_content(response, "response")
    summary = extract_xml_content(response, "summary")
    feedback = extract_xml_content(response, "feedback")

    duration_seconds = time.time() - start_time

    return {
        "question": qa_pair["question"],
        "expected": qa_pair["answer"],
        "actual": response_value,
        "score": int(response_value == qa_pair["answer"]) if response_value else 0,
        "total_duration": duration_seconds,
        "tool_calls": tool_metrics,
        "num_tool_calls": sum(len(metrics["durations"]) for metrics in tool_metrics.values()),
        "summary": summary,
        "feedback": feedback,
    }


REPORT_HEADER = """
# Evaluation Report

## Summary

- **Accuracy**: {correct}/{total} ({accuracy:.1f}%)
- **Average Task Duration**: {average_duration_s:.2f}s
- **Average Tool Calls per Task**: {average_tool_calls:.2f}
- **Total Tool Calls**: {total_tool_calls}

---
"""

TASK_TEMPLATE = """
### Task {task_num}

**Question**: {question}
**Ground Truth Answer**: `{expected_answer}`
**Actual Answer**: `{actual_answer}`
**Correct**: {correct_indicator}
**Duration**: {total_duration:.2f}s
**Tool Calls**: {tool_calls}

**Summary**
{summary}

**Feedback**
{feedback}

---
"""


async def run_evaluation(
    eval_path: Path,
    connection: Any,
    model: str = "claude-3-7-sonnet-20250219",
) -> str:
    """Run evaluation with MCP server tools."""
    print("šŸš€ Starting Evaluation")

    client = Anthropic()

    tools = await connection.list_tools()
    print(f"šŸ“‹ Loaded {len(tools)} tools from MCP server")

    qa_pairs = parse_evaluation_file(eval_path)
    print(f"šŸ“‹ Loaded {len(qa_pairs)} evaluation tasks")

    results = []
    for i, qa_pair in enumerate(qa_pairs):
        print(f"Processing task {i + 1}/{len(qa_pairs)}")
        result = await evaluate_single_task(client, model, qa_pair, tools, connection, i)
        results.append(result)

    correct = sum(r["score"] for r in results)
    accuracy = (correct / len(results)) * 100 if results else 0
    average_duration_s = sum(r["total_duration"] for r in results) / len(results) if results else 0
    average_tool_calls = sum(r["num_tool_calls"] for r in results) / len(results) if results else 0
    total_tool_calls = sum(r["num_tool_calls"] for r in results)

    report = REPORT_HEADER.format(
        correct=correct,
        total=len(results),
        accuracy=accuracy,
        average_duration_s=average_duration_s,
        average_tool_calls=average_tool_calls,
        total_tool_calls=total_tool_calls,
    )

    report += "".join([
        TASK_TEMPLATE.format(
            task_num=i + 1,
            question=qa_pair["question"],
            expected_answer=qa_pair["answer"],
            actual_answer=result["actual"] or "N/A",
            correct_indicator="āœ…" if result["score"] else "āŒ",
            total_duration=result["total_duration"],
            tool_calls=json.dumps(result["tool_calls"], indent=2),
            summary=result["summary"] or "N/A",
            feedback=result["feedback"] or "N/A",
        )
        for i, (qa_pair, result) in enumerate(zip(qa_pairs, results))
    ])

    return report


def parse_headers(header_list: list[str]) -> dict[str, str]:
    """Parse header strings in 'Key: Value' format into a dictionary."""
    headers = {}
    if not header_list:
        return headers

    for header in header_list:
        if ":" in header:
            key, value = header.split(":", 1)
            headers[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed header: {header}")
    return headers


def parse_env_vars(env_list: list[str]) -> dict[str, str]:
    """Parse environment variable strings in 'KEY=VALUE' format into a dictionary."""
    env = {}
    if not env_list:
        return env

    for env_var in env_list:
        if "=" in env_var:
            key, value = env_var.split("=", 1)
            env[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed environment variable: {env_var}")
    return env


async def main():
    parser = argparse.ArgumentParser(
        description="Evaluate MCP servers using test questions",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Evaluate a local stdio MCP server
  python evaluation.py -t stdio -c python -a my_server.py eval.xml

  # Evaluate an SSE MCP server
  python evaluation.py -t sse -u https://example.com/mcp -H "Authorization: Bearer token" eval.xml

  # Evaluate an HTTP MCP server with custom model
  python evaluation.py -t http -u https://example.com/mcp -m claude-3-5-sonnet-20241022 eval.xml
        """,
    )

    parser.add_argument("eval_file", type=Path, help="Path to evaluation XML file")
    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"], default="stdio", help="Transport type (default: stdio)")
    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219", help="Claude model to use (default: claude-3-7-sonnet-20250219)")

    stdio_group = parser.add_argument_group("stdio options")
    stdio_group.add_argument("-c", "--command", help="Command to run MCP server (stdio only)")
    stdio_group.add_argument("-a", "--args", nargs="+", help="Arguments for the command (stdio only)")
    stdio_group.add_argument("-e", "--env", nargs="+", help="Environment variables in KEY=VALUE format (stdio only)")

    remote_group = parser.add_argument_group("sse/http options")
    remote_group.add_argument("-u", "--url", help="MCP server URL (sse/http only)")
    remote_group.add_argument("-H", "--header", nargs="+", dest="headers", help="HTTP headers in 'Key: Value' format (sse/http only)")

    parser.add_argument("-o", "--output", type=Path, help="Output file for evaluation report (default: stdout)")

    args = parser.parse_args()

    if not args.eval_file.exists():
        print(f"Error: Evaluation file not found: {args.eval_file}")
        sys.exit(1)

    headers = parse_headers(args.headers) if args.headers else None
    env_vars = parse_env_vars(args.env) if args.env else None

    try:
        connection = create_connection(
            transport=args.transport,
            command=args.command,
            args=args.args,
            env=env_vars,
            url=args.url,
            headers=headers,
        )
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)

    print(f"šŸ”— Connecting to MCP server via {args.transport}...")

    async with connection:
        print("āœ… Connected successfully")
        report = await run_evaluation(args.eval_file, connection, args.model)

        if args.output:
            args.output.write_text(report)
            print(f"\nāœ… Report saved to {args.output}")
        else:
            print("\n" + report)


if __name__ == "__main__":
    asyncio.run(main())
```