mcp_server_development_guide_with_evaluation_harness.py

An MCP (Model Context Protocol) server evaluation harness that connects to MCP servers via stdio, SSE, or HTTP transports, runs test questions against them using Claude as the evaluating LLM, and generates accuracy reports with tool usage metrics and feedback.

# SKILL.md

---
name: mcp-builder
description: Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).
license: Complete terms in LICENSE.txt
---

# MCP Server Development Guide

## Overview

Create MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. The quality of an MCP server is measured by how well it enables LLMs to accomplish real-world tasks.

---

# Process

## 🚀 High-Level Workflow

Creating a high-quality MCP server involves four main phases:

### Phase 1: Deep Research and Planning

#### 1.1 Understand Modern MCP Design

**API Coverage vs. Workflow Tools:**
Balance comprehensive API endpoint coverage with specialized workflow tools. Workflow tools can be more convenient for specific tasks, while comprehensive coverage gives agents the flexibility to compose operations. Performance varies by client: some clients benefit from code execution that combines basic tools, while others work better with higher-level workflows. When uncertain, prioritize comprehensive API coverage.

**Tool Naming and Discoverability:**
Clear, descriptive tool names help agents find the right tools quickly. Use consistent prefixes (e.g., `github_create_issue`, `github_list_repos`) and action-oriented naming.

**Context Management:**
Agents benefit from concise tool descriptions and the ability to filter and paginate results. Design tools that return focused, relevant data. Some clients support code execution, which can help agents filter and process data efficiently.

**Actionable Error Messages:**
Error messages should guide agents toward solutions with specific suggestions and next steps.

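To make "actionable" concrete, an error helper along these lines can turn a bare failure into a next step for the agent (the helper name, wording, and the `list_projects` tool it mentions are illustrative, not part of any SDK):

```python
def actionable_error(resource: str, resource_id: str, suggestion: str) -> str:
    """Build an error message that says what went wrong and what to try next."""
    return (
        f"Error: {resource} '{resource_id}' was not found. "
        f"Suggestion: {suggestion}"
    )

# Example: point the agent at the tool that lists valid IDs.
message = actionable_error(
    "project", "proj_123",
    "call list_projects to see valid project IDs, then retry with one of those.",
)
```

The key property is that the message names a specific follow-up action rather than just reporting the failure.
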
#### 1.2 Study MCP Protocol Documentation

**Navigate the MCP specification:**

Start with the sitemap to find relevant pages: `https://modelcontextprotocol.io/sitemap.xml`

Then fetch specific pages with the `.md` suffix for markdown format (e.g., `https://modelcontextprotocol.io/specification/draft.md`).

Key pages to review:
- Specification overview and architecture
- Transport mechanisms (streamable HTTP, stdio)
- Tool, resource, and prompt definitions

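The `.md` convention can be captured in a small helper; a minimal sketch (the function name is ours, not part of the MCP docs):

```python
from urllib.parse import urlparse

def to_markdown_url(page_url: str) -> str:
    """Append a .md suffix to an MCP docs page URL so it returns markdown.

    Leaves URLs alone if they already end in .md (or .xml, like the sitemap).
    """
    path = urlparse(page_url).path
    if path.endswith((".md", ".xml")):
        return page_url
    return page_url.rstrip("/") + ".md"
```
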
#### 1.3 Study Framework Documentation

**Recommended stack:**
- **Language**: TypeScript (high-quality SDK support and good compatibility with many execution environments, e.g. MCPB; AI models are also good at generating TypeScript code, which benefits from broad usage, static typing, and good linting tools)
- **Transport**: Streamable HTTP for remote servers, using stateless JSON (simpler to scale and maintain than stateful sessions and streaming responses); stdio for local servers.

**Load framework documentation:**

- **MCP Best Practices**: [📋 View Best Practices](./reference/mcp_best_practices.md) - Core guidelines

**For TypeScript (recommended):**
- **TypeScript SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`
- [⚡ TypeScript Guide](./reference/node_mcp_server.md) - TypeScript patterns and examples

**For Python:**
- **Python SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- [🐍 Python Guide](./reference/python_mcp_server.md) - Python patterns and examples

#### 1.4 Plan Your Implementation

**Understand the API:**
Review the service's API documentation to identify key endpoints, authentication requirements, and data models. Use web search and WebFetch as needed.

**Tool Selection:**
Prioritize comprehensive API coverage. List the endpoints to implement, starting with the most common operations.

---

### Phase 2: Implementation

#### 2.1 Set Up Project Structure

See the language-specific guides for project setup:
- [⚡ TypeScript Guide](./reference/node_mcp_server.md) - Project structure, package.json, tsconfig.json
- [🐍 Python Guide](./reference/python_mcp_server.md) - Module organization, dependencies

#### 2.2 Implement Core Infrastructure

Create shared utilities:
- API client with authentication
- Error handling helpers
- Response formatting (JSON/Markdown)
- Pagination support

#### 2.3 Implement Tools

For each tool:

**Input Schema:**
- Use Zod (TypeScript) or Pydantic (Python)
- Include constraints and clear descriptions
- Add examples in field descriptions

**Output Schema:**
- Define `outputSchema` where possible for structured data
- Use `structuredContent` in tool responses (TypeScript SDK feature)
- Helps clients understand and process tool outputs

**Tool Description:**
- Concise summary of functionality
- Parameter descriptions
- Return type schema

**Implementation:**
- Async/await for I/O operations
- Proper error handling with actionable messages
- Support pagination where applicable
- Return both text content and structured data when using modern SDKs

**Annotations:**
- `readOnlyHint`: true/false
- `destructiveHint`: true/false
- `idempotentHint`: true/false
- `openWorldHint`: true/false

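Put together, a single tool definition covering these points might look like the following (a hypothetical `github_list_repos` tool; the field names follow the MCP tool definition shape, but the specific schema is illustrative):

```python
# Hypothetical tool definition: input schema with constraints and examples,
# output schema for structured results, and behavior annotations.
github_list_repos = {
    "name": "github_list_repos",
    "description": "List repositories for a user or organization. Returns name, URL, and star count.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "User or org login, e.g. 'octocat'"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 100, "default": 30,
                      "description": "Maximum number of repositories to return"},
            "cursor": {"type": "string", "description": "Opaque pagination cursor from a previous call"},
        },
        "required": ["owner"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "repos": {"type": "array", "items": {"type": "object"}},
            "next_cursor": {"type": ["string", "null"]},
        },
    },
    "annotations": {
        "readOnlyHint": True,       # only reads data
        "destructiveHint": False,   # cannot delete or overwrite anything
        "idempotentHint": True,     # repeat calls return the same result
        "openWorldHint": True,      # talks to an external service
    },
}
```
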
---

### Phase 3: Review and Test

#### 3.1 Code Quality

Review for:
- No duplicated code (DRY principle)
- Consistent error handling
- Full type coverage
- Clear tool descriptions

#### 3.2 Build and Test

**TypeScript:**
- Run `npm run build` to verify compilation
- Test with MCP Inspector: `npx @modelcontextprotocol/inspector`

**Python:**
- Verify syntax: `python -m py_compile your_server.py`
- Test with MCP Inspector

See the language-specific guides for detailed testing approaches and quality checklists.

---

### Phase 4: Create Evaluations

After implementing your MCP server, create comprehensive evaluations to test its effectiveness.

**Load the [✅ Evaluation Guide](./reference/evaluation.md) for complete evaluation guidelines.**

#### 4.1 Understand Evaluation Purpose

Use evaluations to test whether LLMs can effectively use your MCP server to answer realistic, complex questions.

#### 4.2 Create 10 Evaluation Questions

To create effective evaluations, follow the process outlined in the evaluation guide:

1. **Tool Inspection**: List available tools and understand their capabilities
2. **Content Exploration**: Use READ-ONLY operations to explore available data
3. **Question Generation**: Create 10 complex, realistic questions
4. **Answer Verification**: Solve each question yourself to verify answers

#### 4.3 Evaluation Requirements

Ensure each question is:
- **Independent**: Not dependent on other questions
- **Read-only**: Requires only non-destructive operations
- **Complex**: Requires multiple tool calls and deep exploration
- **Realistic**: Based on real use cases humans would care about
- **Verifiable**: Has a single, clear answer that can be checked by string comparison
- **Stable**: Answer won't change over time

#### 4.4 Output Format

Create an XML file with this structure:

```xml
<evaluation>
  <qa_pair>
    <question>Find discussions about AI model launches with animal codenames. One model needed a specific safety designation that uses the format ASL-X. What number X was being determined for the model named after a spotted wild cat?</question>
    <answer>3</answer>
  </qa_pair>
<!-- More qa_pairs... -->
</evaluation>
```

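Question/answer pairs verified by hand can be serialized into this format with the standard library, for example (the `qa_pairs` content here is illustrative):

```python
import xml.etree.ElementTree as ET

def build_evaluation_xml(qa_pairs: list[tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs into the <evaluation> XML format."""
    root = ET.Element("evaluation")
    for question, answer in qa_pairs:
        pair = ET.SubElement(root, "qa_pair")
        ET.SubElement(pair, "question").text = question
        ET.SubElement(pair, "answer").text = answer
    return ET.tostring(root, encoding="unicode")

xml_text = build_evaluation_xml([
    ("What number X was being determined for the model named after a spotted wild cat?", "3"),
])
```

Using ElementTree rather than string concatenation also guarantees that questions containing `<`, `>`, or `&` are escaped correctly.
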
---

# Reference Files

## 📚 Documentation Library

Load these resources as needed during development:

### Core MCP Documentation (Load First)
- **MCP Protocol**: Start with the sitemap at `https://modelcontextprotocol.io/sitemap.xml`, then fetch specific pages with the `.md` suffix
- [📋 MCP Best Practices](./reference/mcp_best_practices.md) - Universal MCP guidelines including:
  - Server and tool naming conventions
  - Response format guidelines (JSON vs Markdown)
  - Pagination best practices
  - Transport selection (streamable HTTP vs stdio)
  - Security and error handling standards

### SDK Documentation (Load During Phase 1/2)
- **Python SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- **TypeScript SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`

### Language-Specific Implementation Guides (Load During Phase 2)
- [🐍 Python Implementation Guide](./reference/python_mcp_server.md) - Complete Python/FastMCP guide with:
  - Server initialization patterns
  - Pydantic model examples
  - Tool registration with `@mcp.tool`
  - Complete working examples
  - Quality checklist

- [⚡ TypeScript Implementation Guide](./reference/node_mcp_server.md) - Complete TypeScript guide with:
  - Project structure
  - Zod schema patterns
  - Tool registration with `server.registerTool`
  - Complete working examples
  - Quality checklist

### Evaluation Guide (Load During Phase 4)
- [✅ Evaluation Guide](./reference/evaluation.md) - Complete evaluation creation guide with:
  - Question creation guidelines
  - Answer verification strategies
  - XML format specifications
  - Example questions and answers
  - Running an evaluation with the provided scripts

# connections.py

```python
"""Lightweight connection handling for MCP servers."""

from abc import ABC, abstractmethod
from contextlib import AsyncExitStack
from typing import Any

from mcp import ClientSession, StdioServerParameters
from mcp.client.sse import sse_client
from mcp.client.stdio import stdio_client
from mcp.client.streamable_http import streamablehttp_client


class MCPConnection(ABC):
    """Base class for MCP server connections."""

    def __init__(self):
        self.session = None
        self._stack = None

    @abstractmethod
    def _create_context(self):
        """Create the connection context based on connection type."""

    async def __aenter__(self):
        """Initialize MCP server connection."""
        self._stack = AsyncExitStack()
        await self._stack.__aenter__()

        try:
            ctx = self._create_context()
            result = await self._stack.enter_async_context(ctx)

            # stdio/sse clients yield (read, write); the streamable HTTP
            # client yields a third element (a session-ID callback).
            if len(result) == 2:
                read, write = result
            elif len(result) == 3:
                read, write, _ = result
            else:
                raise ValueError(f"Unexpected context result: {result}")

            session_ctx = ClientSession(read, write)
            self.session = await self._stack.enter_async_context(session_ctx)
            await self.session.initialize()
            return self
        except BaseException:
            await self._stack.__aexit__(None, None, None)
            raise

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Clean up MCP server connection resources."""
        if self._stack:
            await self._stack.__aexit__(exc_type, exc_val, exc_tb)
        self.session = None
        self._stack = None

    async def list_tools(self) -> list[dict[str, Any]]:
        """Retrieve available tools from the MCP server."""
        response = await self.session.list_tools()
        return [
            {
                "name": tool.name,
                "description": tool.description,
                "input_schema": tool.inputSchema,
            }
            for tool in response.tools
        ]

    async def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
        """Call a tool on the MCP server with provided arguments."""
        result = await self.session.call_tool(tool_name, arguments=arguments)
        return result.content


class MCPConnectionStdio(MCPConnection):
    """MCP connection using standard input/output."""

    def __init__(self, command: str, args: list[str] | None = None, env: dict[str, str] | None = None):
        super().__init__()
        self.command = command
        self.args = args or []
        self.env = env

    def _create_context(self):
        return stdio_client(
            StdioServerParameters(command=self.command, args=self.args, env=self.env)
        )


class MCPConnectionSSE(MCPConnection):
    """MCP connection using Server-Sent Events."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return sse_client(url=self.url, headers=self.headers)


class MCPConnectionHTTP(MCPConnection):
    """MCP connection using Streamable HTTP."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return streamablehttp_client(url=self.url, headers=self.headers)


def create_connection(
    transport: str,
    command: str | None = None,
    args: list[str] | None = None,
    env: dict[str, str] | None = None,
    url: str | None = None,
    headers: dict[str, str] | None = None,
) -> MCPConnection:
    """Factory function to create the appropriate MCP connection.

    Args:
        transport: Connection type ("stdio", "sse", or "http")
        command: Command to run (stdio only)
        args: Command arguments (stdio only)
        env: Environment variables (stdio only)
        url: Server URL (sse and http only)
        headers: HTTP headers (sse and http only)

    Returns:
        MCPConnection instance
    """
    transport = transport.lower()

    if transport == "stdio":
        if not command:
            raise ValueError("Command is required for stdio transport")
        return MCPConnectionStdio(command=command, args=args, env=env)

    elif transport == "sse":
        if not url:
            raise ValueError("URL is required for sse transport")
        return MCPConnectionSSE(url=url, headers=headers)

    elif transport in ["http", "streamable_http", "streamable-http"]:
        if not url:
            raise ValueError("URL is required for http transport")
        return MCPConnectionHTTP(url=url, headers=headers)

    else:
        raise ValueError(f"Unsupported transport type: {transport}. Use 'stdio', 'sse', or 'http'")
```


# evaluation.py

```python
"""MCP Server Evaluation Harness

This script evaluates MCP servers by running test questions against them using Claude.
"""

import argparse
import asyncio
import json
import re
import sys
import time
import traceback
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Any

from anthropic import Anthropic

from connections import create_connection

EVALUATION_PROMPT = """You are an AI assistant with access to tools.

When given a task, you MUST:
1. Use the available tools to complete the task
2. Provide a summary of each step in your approach, wrapped in <summary> tags
3. Provide feedback on the tools provided, wrapped in <feedback> tags
4. Provide your final response, wrapped in <response> tags

Summary Requirements:
- In your <summary> tags, you must explain:
  - The steps you took to complete the task
  - Which tools you used, in what order, and why
  - The inputs you provided to each tool
  - The outputs you received from each tool
  - A summary of how you arrived at the response

Feedback Requirements:
- In your <feedback> tags, provide constructive feedback on the tools:
  - Comment on tool names: Are they clear and descriptive?
  - Comment on input parameters: Are they well-documented? Are required vs optional parameters clear?
  - Comment on descriptions: Do they accurately describe what the tool does?
  - Comment on any errors encountered during tool usage: Did the tool fail to execute? Did the tool return too many tokens?
  - Identify specific areas for improvement and explain WHY they would help
  - Be specific and actionable in your suggestions

Response Requirements:
- Your response should be concise and directly address what was asked
- Always wrap your final response in <response> tags
- If you cannot solve the task, return <response>NOT_FOUND</response>
- For numeric responses, provide just the number
- For IDs, provide just the ID
- For names or text, provide the exact text requested
- Your response should go last"""


def parse_evaluation_file(file_path: Path) -> list[dict[str, Any]]:
    """Parse XML evaluation file with qa_pair elements."""
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        evaluations = []

        for qa_pair in root.findall(".//qa_pair"):
            question_elem = qa_pair.find("question")
            answer_elem = qa_pair.find("answer")

            if question_elem is not None and answer_elem is not None:
                evaluations.append({
                    "question": (question_elem.text or "").strip(),
                    "answer": (answer_elem.text or "").strip(),
                })

        return evaluations
    except Exception as e:
        print(f"Error parsing evaluation file {file_path}: {e}")
        return []

def extract_xml_content(text: str | None, tag: str) -> str | None:
    """Extract content from XML tags (the last occurrence, if several)."""
    if not text:
        return None
    pattern = rf"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches[-1].strip() if matches else None


async def agent_loop(
    client: Anthropic,
    model: str,
    question: str,
    tools: list[dict[str, Any]],
    connection: Any,
) -> tuple[str | None, dict[str, Any]]:
    """Run the agent loop with MCP tools."""
    messages = [{"role": "user", "content": question}]

    response = await asyncio.to_thread(
        client.messages.create,
        model=model,
        max_tokens=4096,
        system=EVALUATION_PROMPT,
        messages=messages,
        tools=tools,
    )

    messages.append({"role": "assistant", "content": response.content})

    tool_metrics = {}

    while response.stop_reason == "tool_use":
        # Answer every tool_use block in the turn: the API requires a
        # tool_result for each one, and models may call tools in parallel.
        tool_results = []
        for tool_use in (block for block in response.content if block.type == "tool_use"):
            tool_name = tool_use.name
            tool_input = tool_use.input

            tool_start_ts = time.time()
            try:
                tool_result = await connection.call_tool(tool_name, tool_input)
                tool_response = json.dumps(tool_result) if isinstance(tool_result, (dict, list)) else str(tool_result)
            except Exception as e:
                tool_response = f"Error executing tool {tool_name}: {str(e)}\n"
                tool_response += traceback.format_exc()
            tool_duration = time.time() - tool_start_ts

            if tool_name not in tool_metrics:
                tool_metrics[tool_name] = {"count": 0, "durations": []}
            tool_metrics[tool_name]["count"] += 1
            tool_metrics[tool_name]["durations"].append(tool_duration)

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": tool_response,
            })

        messages.append({"role": "user", "content": tool_results})

        response = await asyncio.to_thread(
            client.messages.create,
            model=model,
            max_tokens=4096,
            system=EVALUATION_PROMPT,
            messages=messages,
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})

    response_text = next(
        (block.text for block in response.content if hasattr(block, "text")),
        None,
    )
    return response_text, tool_metrics


async def evaluate_single_task(
    client: Anthropic,
    model: str,
    qa_pair: dict[str, Any],
    tools: list[dict[str, Any]],
    connection: Any,
    task_index: int,
) -> dict[str, Any]:
    """Evaluate a single QA pair with the given tools."""
    start_time = time.time()

    print(f"Task {task_index + 1}: Running task with question: {qa_pair['question']}")
    response, tool_metrics = await agent_loop(client, model, qa_pair["question"], tools, connection)

    response_value = extract_xml_content(response, "response")
    summary = extract_xml_content(response, "summary")
    feedback = extract_xml_content(response, "feedback")

    duration_seconds = time.time() - start_time

    return {
        "question": qa_pair["question"],
        "expected": qa_pair["answer"],
        "actual": response_value,
        "score": int(response_value == qa_pair["answer"]) if response_value else 0,
        "total_duration": duration_seconds,
        "tool_calls": tool_metrics,
        "num_tool_calls": sum(len(metrics["durations"]) for metrics in tool_metrics.values()),
        "summary": summary,
        "feedback": feedback,
    }


REPORT_HEADER = """
# Evaluation Report

## Summary

- **Accuracy**: {correct}/{total} ({accuracy:.1f}%)
- **Average Task Duration**: {average_duration_s:.2f}s
- **Average Tool Calls per Task**: {average_tool_calls:.2f}
- **Total Tool Calls**: {total_tool_calls}

---
"""

TASK_TEMPLATE = """
### Task {task_num}

**Question**: {question}
**Ground Truth Answer**: `{expected_answer}`
**Actual Answer**: `{actual_answer}`
**Correct**: {correct_indicator}
**Duration**: {total_duration:.2f}s
**Tool Calls**: {tool_calls}

**Summary**
{summary}

**Feedback**
{feedback}

---
"""


async def run_evaluation(
    eval_path: Path,
    connection: Any,
    model: str = "claude-3-7-sonnet-20250219",
) -> str:
    """Run evaluation with MCP server tools."""
    print("🚀 Starting Evaluation")

    client = Anthropic()

    tools = await connection.list_tools()
    print(f"📋 Loaded {len(tools)} tools from MCP server")

    qa_pairs = parse_evaluation_file(eval_path)
    print(f"📋 Loaded {len(qa_pairs)} evaluation tasks")

    results = []
    for i, qa_pair in enumerate(qa_pairs):
        print(f"Processing task {i + 1}/{len(qa_pairs)}")
        result = await evaluate_single_task(client, model, qa_pair, tools, connection, i)
        results.append(result)

    correct = sum(r["score"] for r in results)
    accuracy = (correct / len(results)) * 100 if results else 0
    average_duration_s = sum(r["total_duration"] for r in results) / len(results) if results else 0
    average_tool_calls = sum(r["num_tool_calls"] for r in results) / len(results) if results else 0
    total_tool_calls = sum(r["num_tool_calls"] for r in results)

    report = REPORT_HEADER.format(
        correct=correct,
        total=len(results),
        accuracy=accuracy,
        average_duration_s=average_duration_s,
        average_tool_calls=average_tool_calls,
        total_tool_calls=total_tool_calls,
    )

    report += "".join([
        TASK_TEMPLATE.format(
            task_num=i + 1,
            question=qa_pair["question"],
            expected_answer=qa_pair["answer"],
            actual_answer=result["actual"] or "N/A",
            correct_indicator="✅" if result["score"] else "❌",
            total_duration=result["total_duration"],
            tool_calls=json.dumps(result["tool_calls"], indent=2),
            summary=result["summary"] or "N/A",
            feedback=result["feedback"] or "N/A",
        )
        for i, (qa_pair, result) in enumerate(zip(qa_pairs, results))
    ])

    return report


def parse_headers(header_list: list[str]) -> dict[str, str]:
    """Parse header strings in format 'Key: Value' into a dictionary."""
    headers = {}
    if not header_list:
        return headers

    for header in header_list:
        if ":" in header:
            key, value = header.split(":", 1)
            headers[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed header: {header}")
    return headers


def parse_env_vars(env_list: list[str]) -> dict[str, str]:
    """Parse environment variable strings in format 'KEY=VALUE' into a dictionary."""
    env = {}
    if not env_list:
        return env

    for env_var in env_list:
        if "=" in env_var:
            key, value = env_var.split("=", 1)
            env[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed environment variable: {env_var}")
    return env


async def main():
    parser = argparse.ArgumentParser(
        description="Evaluate MCP servers using test questions",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Evaluate a local stdio MCP server
  python evaluation.py -t stdio -c python -a my_server.py eval.xml

  # Evaluate an SSE MCP server
  python evaluation.py -t sse -u https://example.com/mcp -H "Authorization: Bearer token" eval.xml

  # Evaluate an HTTP MCP server with custom model
  python evaluation.py -t http -u https://example.com/mcp -m claude-3-5-sonnet-20241022 eval.xml
        """,
    )

    parser.add_argument("eval_file", type=Path, help="Path to evaluation XML file")
    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"], default="stdio", help="Transport type (default: stdio)")
    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219", help="Claude model to use (default: claude-3-7-sonnet-20250219)")

    stdio_group = parser.add_argument_group("stdio options")
    stdio_group.add_argument("-c", "--command", help="Command to run MCP server (stdio only)")
    stdio_group.add_argument("-a", "--args", nargs="+", help="Arguments for the command (stdio only)")
    stdio_group.add_argument("-e", "--env", nargs="+", help="Environment variables in KEY=VALUE format (stdio only)")

    remote_group = parser.add_argument_group("sse/http options")
    remote_group.add_argument("-u", "--url", help="MCP server URL (sse/http only)")
    remote_group.add_argument("-H", "--header", nargs="+", dest="headers", help="HTTP headers in 'Key: Value' format (sse/http only)")

    parser.add_argument("-o", "--output", type=Path, help="Output file for evaluation report (default: stdout)")

    args = parser.parse_args()

    if not args.eval_file.exists():
        print(f"Error: Evaluation file not found: {args.eval_file}")
        sys.exit(1)

    headers = parse_headers(args.headers) if args.headers else None
    env_vars = parse_env_vars(args.env) if args.env else None

    try:
        connection = create_connection(
            transport=args.transport,
            command=args.command,
            args=args.args,
            env=env_vars,
            url=args.url,
            headers=headers,
        )
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)

    print(f"🔗 Connecting to MCP server via {args.transport}...")

    async with connection:
        print("✅ Connected successfully")
        report = await run_evaluation(args.eval_file, connection, args.model)

        if args.output:
            args.output.write_text(report)
            print(f"\n✅ Report saved to {args.output}")
        else:
            print("\n" + report)


if __name__ == "__main__":
    asyncio.run(main())
```