mcp_server_development_guide_with_evaluation_harness.py

An MCP (Model Context Protocol) server evaluation harness that connects to MCP servers via stdio, SSE, or HTTP transports, runs test questions against them using Claude as the evaluating LLM, and generates accuracy reports with tool usage metrics and feedback.
# SKILL.md

---
name: mcp-builder
description: Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).
license: Complete terms in LICENSE.txt
---

# MCP Server Development Guide

## Overview

Create MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. The quality of an MCP server is measured by how well it enables LLMs to accomplish real-world tasks.

---

# Process

## High-Level Workflow

Creating a high-quality MCP server involves four main phases:

### Phase 1: Deep Research and Planning

#### 1.1 Understand Modern MCP Design

**API Coverage vs. Workflow Tools:**
Balance comprehensive API endpoint coverage with specialized workflow tools. Workflow tools can be more convenient for specific tasks, while comprehensive coverage gives agents the flexibility to compose operations. Performance varies by client: some clients benefit from code execution that combines basic tools, while others work better with higher-level workflows. When uncertain, prioritize comprehensive API coverage.

**Tool Naming and Discoverability:**
Clear, descriptive tool names help agents find the right tools quickly. Use consistent prefixes (e.g., `github_create_issue`, `github_list_repos`) and action-oriented naming.

**Context Management:**
Agents benefit from concise tool descriptions and the ability to filter and paginate results. Design tools that return focused, relevant data. Some clients support code execution, which can help agents filter and process data efficiently.

**Actionable Error Messages:**
Error messages should guide agents toward solutions with specific suggestions and next steps.

#### 1.2 Study MCP Protocol Documentation

**Navigate the MCP specification:**

Start with the sitemap to find relevant pages: `https://modelcontextprotocol.io/sitemap.xml`

Then fetch specific pages with a `.md` suffix for markdown format (e.g., `https://modelcontextprotocol.io/specification/draft.md`).

Key pages to review:
- Specification overview and architecture
- Transport mechanisms (streamable HTTP, stdio)
- Tool, resource, and prompt definitions
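Sitemaps are plain XML, so extracting candidate pages needs only the standard library. A sketch using a trimmed inline sample rather than a live HTTP fetch; the second sample URL is illustrative:

```python
import xml.etree.ElementTree as ET

# A trimmed, inline stand-in for the real sitemap; in practice, fetch
# https://modelcontextprotocol.io/sitemap.xml over HTTP first.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://modelcontextprotocol.io/specification/draft</loc></url>
  <url><loc>https://modelcontextprotocol.io/docs/concepts/tools</loc></url>
</urlset>"""

# Sitemap elements live in the sitemap.org namespace, so findall needs it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
pages = [
    loc.text + ".md"  # append .md to get the markdown rendering
    for loc in ET.fromstring(SITEMAP).findall(".//sm:loc", NS)
]
```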

#### 1.3 Study Framework Documentation

**Recommended stack:**
- **Language**: TypeScript (high-quality SDK support and good compatibility across execution environments, e.g. MCPB; AI models are also good at generating TypeScript code, which benefits from broad usage, static typing, and good linting tools)
- **Transport**: Streamable HTTP for remote servers, using stateless JSON (simpler to scale and maintain than stateful sessions and streaming responses); stdio for local servers.

**Load framework documentation:**

- **MCP Best Practices**: [View Best Practices](./reference/mcp_best_practices.md) - Core guidelines

**For TypeScript (recommended):**
- **TypeScript SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`
- [TypeScript Guide](./reference/node_mcp_server.md) - TypeScript patterns and examples

**For Python:**
- **Python SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- [Python Guide](./reference/python_mcp_server.md) - Python patterns and examples

#### 1.4 Plan Your Implementation

**Understand the API:**
Review the service's API documentation to identify key endpoints, authentication requirements, and data models. Use web search and WebFetch as needed.

**Tool Selection:**
Prioritize comprehensive API coverage. List the endpoints to implement, starting with the most common operations.

---

### Phase 2: Implementation

#### 2.1 Set Up Project Structure

See the language-specific guides for project setup:
- [TypeScript Guide](./reference/node_mcp_server.md) - Project structure, package.json, tsconfig.json
- [Python Guide](./reference/python_mcp_server.md) - Module organization, dependencies

#### 2.2 Implement Core Infrastructure

Create shared utilities:
- API client with authentication
- Error handling helpers
- Response formatting (JSON/Markdown)
- Pagination support

#### 2.3 Implement Tools

For each tool:

**Input Schema:**
- Use Zod (TypeScript) or Pydantic (Python)
- Include constraints and clear descriptions
- Add examples in field descriptions

**Output Schema:**
- Define `outputSchema` where possible for structured data
- Use `structuredContent` in tool responses (TypeScript SDK feature)
- Helps clients understand and process tool outputs

**Tool Description:**
- Concise summary of functionality
- Parameter descriptions
- Return type schema

**Implementation:**
- Async/await for I/O operations
- Proper error handling with actionable messages
- Support pagination where applicable
- Return both text content and structured data when using modern SDKs

**Annotations:**
- `readOnlyHint`: true/false
- `destructiveHint`: true/false
- `idempotentHint`: true/false
- `openWorldHint`: true/false
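Whichever validation library authors the input schema, what the client ultimately receives is JSON Schema. A sketch of what a hypothetical `github_list_issues` tool might expose, with constraints, defaults, and inline examples so the agent can self-correct before making a failed call:

```python
# JSON Schema for a hypothetical `github_list_issues` tool, as a client
# would see it regardless of whether it was authored with Zod or Pydantic.
LIST_ISSUES_SCHEMA = {
    "type": "object",
    "properties": {
        "repo": {
            "type": "string",
            "description": "Repository in 'owner/name' form, e.g. 'octocat/hello-world'",
            "pattern": "^[^/]+/[^/]+$",
        },
        "state": {
            "type": "string",
            "enum": ["open", "closed", "all"],
            "default": "open",
            "description": "Filter issues by state",
        },
        "page_size": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "default": 30,
            "description": "Results per page (max 100)",
        },
    },
    "required": ["repo"],
}
```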
---

### Phase 3: Review and Test

#### 3.1 Code Quality

Review for:
- No duplicated code (DRY principle)
- Consistent error handling
- Full type coverage
- Clear tool descriptions

#### 3.2 Build and Test

**TypeScript:**
- Run `npm run build` to verify compilation
- Test with MCP Inspector: `npx @modelcontextprotocol/inspector`

**Python:**
- Verify syntax: `python -m py_compile your_server.py`
- Test with MCP Inspector

See the language-specific guides for detailed testing approaches and quality checklists.

---

### Phase 4: Create Evaluations

After implementing your MCP server, create comprehensive evaluations to test its effectiveness.

**Load the [Evaluation Guide](./reference/evaluation.md) for complete evaluation guidelines.**

#### 4.1 Understand Evaluation Purpose

Use evaluations to test whether LLMs can effectively use your MCP server to answer realistic, complex questions.

#### 4.2 Create 10 Evaluation Questions

To create effective evaluations, follow the process outlined in the evaluation guide:

1. **Tool Inspection**: List available tools and understand their capabilities
2. **Content Exploration**: Use READ-ONLY operations to explore available data
3. **Question Generation**: Create 10 complex, realistic questions
4. **Answer Verification**: Solve each question yourself to verify the answers

#### 4.3 Evaluation Requirements

Ensure each question is:
- **Independent**: Not dependent on other questions
- **Read-only**: Only non-destructive operations required
- **Complex**: Requires multiple tool calls and deep exploration
- **Realistic**: Based on real use cases humans would care about
- **Verifiable**: Has a single, clear answer that can be verified by string comparison
- **Stable**: The answer won't change over time

#### 4.4 Output Format

Create an XML file with this structure:

```xml
<evaluation>
  <qa_pair>
    <question>Find discussions about AI model launches with animal codenames. One model needed a specific safety designation that uses the format ASL-X. What number X was being determined for the model named after a spotted wild cat?</question>
    <answer>3</answer>
  </qa_pair>
  <!-- More qa_pairs... -->
</evaluation>
```
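Because the format is plain XML, verification reduces to parsing with the standard library and an exact string comparison, which is what the harness below does. The question and answer here are illustrative:

```python
import xml.etree.ElementTree as ET

DOC = """<evaluation>
  <qa_pair>
    <question>How many widgets were shipped in Q3?</question>
    <answer>42</answer>
  </qa_pair>
</evaluation>"""

root = ET.fromstring(DOC)
pairs = [
    (qa.findtext("question").strip(), qa.findtext("answer").strip())
    for qa in root.findall(".//qa_pair")
]

# Scoring is plain string equality, which is why answers must be short,
# exact values (numbers, IDs, names) with stable ground truth.
question, answer = pairs[0]
score = int("42" == answer)
```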

---

# Reference Files

## Documentation Library

Load these resources as needed during development:

### Core MCP Documentation (Load First)
- **MCP Protocol**: Start with the sitemap at `https://modelcontextprotocol.io/sitemap.xml`, then fetch specific pages with the `.md` suffix
- [MCP Best Practices](./reference/mcp_best_practices.md) - Universal MCP guidelines including:
  - Server and tool naming conventions
  - Response format guidelines (JSON vs Markdown)
  - Pagination best practices
  - Transport selection (streamable HTTP vs stdio)
  - Security and error handling standards

### SDK Documentation (Load During Phase 1/2)
- **Python SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- **TypeScript SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`

### Language-Specific Implementation Guides (Load During Phase 2)
- [Python Implementation Guide](./reference/python_mcp_server.md) - Complete Python/FastMCP guide with:
  - Server initialization patterns
  - Pydantic model examples
  - Tool registration with `@mcp.tool`
  - Complete working examples
  - Quality checklist

- [TypeScript Implementation Guide](./reference/node_mcp_server.md) - Complete TypeScript guide with:
  - Project structure
  - Zod schema patterns
  - Tool registration with `server.registerTool`
  - Complete working examples
  - Quality checklist

### Evaluation Guide (Load During Phase 4)
- [Evaluation Guide](./reference/evaluation.md) - Complete evaluation creation guide with:
  - Question creation guidelines
  - Answer verification strategies
  - XML format specifications
  - Example questions and answers
  - Running an evaluation with the provided scripts

# connections.py

```python
"""Lightweight connection handling for MCP servers."""

from abc import ABC, abstractmethod
from contextlib import AsyncExitStack
from typing import Any

from mcp import ClientSession, StdioServerParameters
from mcp.client.sse import sse_client
from mcp.client.stdio import stdio_client
from mcp.client.streamable_http import streamablehttp_client


class MCPConnection(ABC):
    """Base class for MCP server connections."""

    def __init__(self):
        self.session = None
        self._stack = None

    @abstractmethod
    def _create_context(self):
        """Create the connection context based on the connection type."""

    async def __aenter__(self):
        """Initialize the MCP server connection."""
        self._stack = AsyncExitStack()
        await self._stack.__aenter__()

        try:
            ctx = self._create_context()
            result = await self._stack.enter_async_context(ctx)

            # stdio and SSE clients yield (read, write); the streamable
            # HTTP client yields a third element (a session-ID callback).
            if len(result) == 2:
                read, write = result
            elif len(result) == 3:
                read, write, _ = result
            else:
                raise ValueError(f"Unexpected context result: {result}")

            session_ctx = ClientSession(read, write)
            self.session = await self._stack.enter_async_context(session_ctx)
            await self.session.initialize()
            return self
        except BaseException:
            await self._stack.__aexit__(None, None, None)
            raise

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Clean up MCP server connection resources."""
        if self._stack:
            await self._stack.__aexit__(exc_type, exc_val, exc_tb)
        self.session = None
        self._stack = None

    async def list_tools(self) -> list[dict[str, Any]]:
        """Retrieve the available tools from the MCP server."""
        response = await self.session.list_tools()
        return [
            {
                "name": tool.name,
                "description": tool.description,
                "input_schema": tool.inputSchema,
            }
            for tool in response.tools
        ]

    async def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
        """Call a tool on the MCP server with the provided arguments."""
        result = await self.session.call_tool(tool_name, arguments=arguments)
        return result.content


class MCPConnectionStdio(MCPConnection):
    """MCP connection using standard input/output."""

    def __init__(self, command: str, args: list[str] | None = None, env: dict[str, str] | None = None):
        super().__init__()
        self.command = command
        self.args = args or []
        self.env = env

    def _create_context(self):
        return stdio_client(
            StdioServerParameters(command=self.command, args=self.args, env=self.env)
        )


class MCPConnectionSSE(MCPConnection):
    """MCP connection using Server-Sent Events."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return sse_client(url=self.url, headers=self.headers)


class MCPConnectionHTTP(MCPConnection):
    """MCP connection using streamable HTTP."""

    def __init__(self, url: str, headers: dict[str, str] | None = None):
        super().__init__()
        self.url = url
        self.headers = headers or {}

    def _create_context(self):
        return streamablehttp_client(url=self.url, headers=self.headers)


def create_connection(
    transport: str,
    command: str | None = None,
    args: list[str] | None = None,
    env: dict[str, str] | None = None,
    url: str | None = None,
    headers: dict[str, str] | None = None,
) -> MCPConnection:
    """Factory function to create the appropriate MCP connection.

    Args:
        transport: Connection type ("stdio", "sse", or "http")
        command: Command to run (stdio only)
        args: Command arguments (stdio only)
        env: Environment variables (stdio only)
        url: Server URL (sse and http only)
        headers: HTTP headers (sse and http only)

    Returns:
        MCPConnection instance
    """
    transport = transport.lower()

    if transport == "stdio":
        if not command:
            raise ValueError("Command is required for stdio transport")
        return MCPConnectionStdio(command=command, args=args, env=env)

    elif transport == "sse":
        if not url:
            raise ValueError("URL is required for sse transport")
        return MCPConnectionSSE(url=url, headers=headers)

    elif transport in ["http", "streamable_http", "streamable-http"]:
        if not url:
            raise ValueError("URL is required for http transport")
        return MCPConnectionHTTP(url=url, headers=headers)

    else:
        raise ValueError(f"Unsupported transport type: {transport}. Use 'stdio', 'sse', or 'http'")
```


# evaluation.py

```python
"""MCP Server Evaluation Harness

This script evaluates MCP servers by running test questions against them using Claude.
"""

import argparse
import asyncio
import json
import re
import sys
import time
import traceback
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Any

from anthropic import Anthropic

from connections import create_connection

EVALUATION_PROMPT = """You are an AI assistant with access to tools.

When given a task, you MUST:
1. Use the available tools to complete the task
2. Provide a summary of each step in your approach, wrapped in <summary> tags
3. Provide feedback on the tools provided, wrapped in <feedback> tags
4. Provide your final response, wrapped in <response> tags

Summary Requirements:
- In your <summary> tags, you must explain:
  - The steps you took to complete the task
  - Which tools you used, in what order, and why
  - The inputs you provided to each tool
  - The outputs you received from each tool
  - A summary of how you arrived at the response

Feedback Requirements:
- In your <feedback> tags, provide constructive feedback on the tools:
  - Comment on tool names: Are they clear and descriptive?
  - Comment on input parameters: Are they well-documented? Are required vs optional parameters clear?
  - Comment on descriptions: Do they accurately describe what the tool does?
  - Comment on any errors encountered during tool usage: Did the tool fail to execute? Did the tool return too many tokens?
  - Identify specific areas for improvement and explain WHY they would help
  - Be specific and actionable in your suggestions

Response Requirements:
- Your response should be concise and directly address what was asked
- Always wrap your final response in <response> tags
- If you cannot solve the task, return <response>NOT_FOUND</response>
- For numeric responses, provide just the number
- For IDs, provide just the ID
- For names or text, provide the exact text requested
- Your response should go last"""


def parse_evaluation_file(file_path: Path) -> list[dict[str, Any]]:
    """Parse an XML evaluation file with qa_pair elements."""
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        evaluations = []

        for qa_pair in root.findall(".//qa_pair"):
            question_elem = qa_pair.find("question")
            answer_elem = qa_pair.find("answer")

            if question_elem is not None and answer_elem is not None:
                evaluations.append({
                    "question": (question_elem.text or "").strip(),
                    "answer": (answer_elem.text or "").strip(),
                })

        return evaluations
    except Exception as e:
        print(f"Error parsing evaluation file {file_path}: {e}")
        return []


def extract_xml_content(text: str, tag: str) -> str | None:
    """Extract content from XML tags."""
    pattern = rf"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches[-1].strip() if matches else None


async def agent_loop(
    client: Anthropic,
    model: str,
    question: str,
    tools: list[dict[str, Any]],
    connection: Any,
) -> tuple[str | None, dict[str, Any]]:
    """Run the agent loop with MCP tools."""
    messages = [{"role": "user", "content": question}]

    response = await asyncio.to_thread(
        client.messages.create,
        model=model,
        max_tokens=4096,
        system=EVALUATION_PROMPT,
        messages=messages,
        tools=tools,
    )

    messages.append({"role": "assistant", "content": response.content})

    tool_metrics = {}

    while response.stop_reason == "tool_use":
        # Answer every tool_use block in the turn: Claude may request
        # several tool calls at once, and each id needs a tool_result.
        tool_results = []
        for tool_use in (block for block in response.content if block.type == "tool_use"):
            tool_name = tool_use.name
            tool_input = tool_use.input

            tool_start_ts = time.time()
            try:
                tool_result = await connection.call_tool(tool_name, tool_input)
                # MCP content blocks are not JSON-serializable by default,
                # so fall back to str() for anything json.dumps can't handle.
                tool_response = (
                    json.dumps(tool_result, default=str)
                    if isinstance(tool_result, (dict, list))
                    else str(tool_result)
                )
            except Exception as e:
                tool_response = f"Error executing tool {tool_name}: {e}\n"
                tool_response += traceback.format_exc()
            tool_duration = time.time() - tool_start_ts

            metrics = tool_metrics.setdefault(tool_name, {"count": 0, "durations": []})
            metrics["count"] += 1
            metrics["durations"].append(tool_duration)

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": tool_response,
            })

        messages.append({"role": "user", "content": tool_results})

        response = await asyncio.to_thread(
            client.messages.create,
            model=model,
            max_tokens=4096,
            system=EVALUATION_PROMPT,
            messages=messages,
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})

    response_text = next(
        (block.text for block in response.content if hasattr(block, "text")),
        None,
    )
    return response_text, tool_metrics


async def evaluate_single_task(
    client: Anthropic,
    model: str,
    qa_pair: dict[str, Any],
    tools: list[dict[str, Any]],
    connection: Any,
    task_index: int,
) -> dict[str, Any]:
    """Evaluate a single QA pair with the given tools."""
    start_time = time.time()

    print(f"Task {task_index + 1}: Running task with question: {qa_pair['question']}")
    response, tool_metrics = await agent_loop(client, model, qa_pair["question"], tools, connection)
    response = response or ""  # guard against a final turn with no text block

    response_value = extract_xml_content(response, "response")
    summary = extract_xml_content(response, "summary")
    feedback = extract_xml_content(response, "feedback")

    duration_seconds = time.time() - start_time

    return {
        "question": qa_pair["question"],
        "expected": qa_pair["answer"],
        "actual": response_value,
        "score": int(response_value == qa_pair["answer"]) if response_value else 0,
        "total_duration": duration_seconds,
        "tool_calls": tool_metrics,
        "num_tool_calls": sum(len(metrics["durations"]) for metrics in tool_metrics.values()),
        "summary": summary,
        "feedback": feedback,
    }


REPORT_HEADER = """
# Evaluation Report

## Summary

- **Accuracy**: {correct}/{total} ({accuracy:.1f}%)
- **Average Task Duration**: {average_duration_s:.2f}s
- **Average Tool Calls per Task**: {average_tool_calls:.2f}
- **Total Tool Calls**: {total_tool_calls}

---
"""

TASK_TEMPLATE = """
### Task {task_num}

**Question**: {question}
**Ground Truth Answer**: `{expected_answer}`
**Actual Answer**: `{actual_answer}`
**Correct**: {correct_indicator}
**Duration**: {total_duration:.2f}s
**Tool Calls**: {tool_calls}

**Summary**
{summary}

**Feedback**
{feedback}

---
"""


async def run_evaluation(
    eval_path: Path,
    connection: Any,
    model: str = "claude-3-7-sonnet-20250219",
) -> str:
    """Run the evaluation with MCP server tools."""
    print("Starting evaluation")

    client = Anthropic()

    tools = await connection.list_tools()
    print(f"Loaded {len(tools)} tools from MCP server")

    qa_pairs = parse_evaluation_file(eval_path)
    print(f"Loaded {len(qa_pairs)} evaluation tasks")

    results = []
    for i, qa_pair in enumerate(qa_pairs):
        print(f"Processing task {i + 1}/{len(qa_pairs)}")
        result = await evaluate_single_task(client, model, qa_pair, tools, connection, i)
        results.append(result)

    correct = sum(r["score"] for r in results)
    accuracy = (correct / len(results)) * 100 if results else 0
    average_duration_s = sum(r["total_duration"] for r in results) / len(results) if results else 0
    average_tool_calls = sum(r["num_tool_calls"] for r in results) / len(results) if results else 0
    total_tool_calls = sum(r["num_tool_calls"] for r in results)

    report = REPORT_HEADER.format(
        correct=correct,
        total=len(results),
        accuracy=accuracy,
        average_duration_s=average_duration_s,
        average_tool_calls=average_tool_calls,
        total_tool_calls=total_tool_calls,
    )

    report += "".join([
        TASK_TEMPLATE.format(
            task_num=i + 1,
            question=qa_pair["question"],
            expected_answer=qa_pair["answer"],
            actual_answer=result["actual"] or "N/A",
            correct_indicator="✓" if result["score"] else "✗",
            total_duration=result["total_duration"],
            tool_calls=json.dumps(result["tool_calls"], indent=2),
            summary=result["summary"] or "N/A",
            feedback=result["feedback"] or "N/A",
        )
        for i, (qa_pair, result) in enumerate(zip(qa_pairs, results))
    ])

    return report


def parse_headers(header_list: list[str]) -> dict[str, str]:
    """Parse header strings in 'Key: Value' format into a dictionary."""
    headers = {}
    if not header_list:
        return headers

    for header in header_list:
        if ":" in header:
            key, value = header.split(":", 1)
            headers[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed header: {header}")
    return headers


def parse_env_vars(env_list: list[str]) -> dict[str, str]:
    """Parse environment variable strings in 'KEY=VALUE' format into a dictionary."""
    env = {}
    if not env_list:
        return env

    for env_var in env_list:
        if "=" in env_var:
            key, value = env_var.split("=", 1)
            env[key.strip()] = value.strip()
        else:
            print(f"Warning: Ignoring malformed environment variable: {env_var}")
    return env

async def main():
    parser = argparse.ArgumentParser(
        description="Evaluate MCP servers using test questions",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Evaluate a local stdio MCP server (the eval file comes first so the
  # variadic -a/--args option cannot swallow it)
  python evaluation.py eval.xml -t stdio -c python -a my_server.py

  # Evaluate an SSE MCP server
  python evaluation.py eval.xml -t sse -u https://example.com/mcp -H "Authorization: Bearer token"

  # Evaluate an HTTP MCP server with a custom model
  python evaluation.py eval.xml -t http -u https://example.com/mcp -m claude-3-5-sonnet-20241022
        """,
    )

    parser.add_argument("eval_file", type=Path, help="Path to evaluation XML file")
    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"], default="stdio", help="Transport type (default: stdio)")
    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219", help="Claude model to use (default: claude-3-7-sonnet-20250219)")

    stdio_group = parser.add_argument_group("stdio options")
    stdio_group.add_argument("-c", "--command", help="Command to run MCP server (stdio only)")
    stdio_group.add_argument("-a", "--args", nargs="+", help="Arguments for the command (stdio only)")
    stdio_group.add_argument("-e", "--env", nargs="+", help="Environment variables in KEY=VALUE format (stdio only)")

    remote_group = parser.add_argument_group("sse/http options")
    remote_group.add_argument("-u", "--url", help="MCP server URL (sse/http only)")
    remote_group.add_argument("-H", "--header", nargs="+", dest="headers", help="HTTP headers in 'Key: Value' format (sse/http only)")

    parser.add_argument("-o", "--output", type=Path, help="Output file for evaluation report (default: stdout)")

    args = parser.parse_args()

    if not args.eval_file.exists():
        print(f"Error: Evaluation file not found: {args.eval_file}")
        sys.exit(1)

    headers = parse_headers(args.headers) if args.headers else None
    env_vars = parse_env_vars(args.env) if args.env else None

    try:
        connection = create_connection(
            transport=args.transport,
            command=args.command,
            args=args.args,
            env=env_vars,
            url=args.url,
            headers=headers,
        )
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)

    print(f"Connecting to MCP server via {args.transport}...")

    async with connection:
        print("Connected successfully")
        report = await run_evaluation(args.eval_file, connection, args.model)

        if args.output:
            args.output.write_text(report)
            print(f"\nReport saved to {args.output}")
        else:
            print("\n" + report)


if __name__ == "__main__":
    asyncio.run(main())
```