llm_prompt_optimization_ab_testing_with_metrics_tracking.py
# SKILL.md

---
name: prompt-engineering-patterns
description: Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability in production. Use when optimizing prompts, improving LLM outputs, or designing production prompt templates.
---

# Prompt Engineering Patterns

Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability.

## Do not use this skill when

- The task is unrelated to prompt engineering patterns
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Designing complex prompts for production LLM applications
- Optimizing prompt performance and consistency
- Implementing structured reasoning patterns (chain-of-thought, tree-of-thought)
- Building few-shot learning systems with dynamic example selection
- Creating reusable prompt templates with variable interpolation
- Debugging and refining prompts that produce inconsistent outputs
- Implementing system prompts for specialized AI assistants

## Core Capabilities

### 1. Few-Shot Learning
- Example selection strategies (semantic similarity, diversity sampling)
- Balancing example count with context window constraints
- Constructing effective demonstrations with input-output pairs
- Dynamic example retrieval from knowledge bases
- Handling edge cases through strategic example selection

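The selection strategies above can be sketched with a toy similarity function. This is an illustrative stand-in: production systems would score similarity with embedding models rather than word overlap, and the helper names (`similarity`, `select_examples`) are hypothetical.

```python
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts -- a crude stand-in for embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_examples(query: str, examples: list, k: int = 2) -> list:
    """Pick the k stored examples most similar to the incoming query."""
    return sorted(examples, key=lambda ex: similarity(query, ex["input"]), reverse=True)[:k]

examples = [
    {"input": "find users registered last week", "output": "SELECT ..."},
    {"input": "total revenue by month", "output": "SELECT ..."},
    {"input": "users who never logged in", "output": "SELECT ..."},
]
picked = select_examples("users registered in the last 30 days", examples, k=2)
```

Swapping `similarity` for an embedding-based scorer changes nothing else in the selection loop, which is what makes the strategy pluggable.
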
### 2. Chain-of-Thought Prompting
- Step-by-step reasoning elicitation
- Zero-shot CoT with "Let's think step by step"
- Few-shot CoT with reasoning traces
- Self-consistency techniques (sampling multiple reasoning paths)
- Verification and validation steps

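Self-consistency can be sketched in a few lines: sample several reasoning paths at nonzero temperature, extract each path's final answer, and majority-vote. Here `sample_fn` is a placeholder for a real LLM call.

```python
from collections import Counter

def self_consistency(sample_fn, prompt: str, n: int = 5) -> str:
    """Sample n reasoning paths and return the majority final answer."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a temperature > 0 LLM call that returns one path's final answer
_fake_samples = iter(["42", "41", "42", "42", "17"])
answer = self_consistency(lambda p: next(_fake_samples), "Q: ...", n=5)
```
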
### 3. Prompt Optimization
- Iterative refinement workflows
- A/B testing prompt variations
- Measuring prompt performance metrics (accuracy, consistency, latency)
- Reducing token usage while maintaining quality
- Handling edge cases and failure modes

### 4. Template Systems
- Variable interpolation and formatting
- Conditional prompt sections
- Multi-turn conversation templates
- Role-based prompt composition
- Modular prompt components

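A minimal sketch of conditional sections and variable interpolation, assuming plain Python string formatting rather than any particular templating library:

```python
def render_prompt(task: str, *, context: str = "", examples=None) -> str:
    """Assemble a prompt from optional sections; empty sections are skipped entirely."""
    sections = []
    if context:
        sections.append(f"Context:\n{context}")
    if examples:
        sections.append("Examples:\n" + "\n".join(examples))
    sections.append(f"Task: {task}")
    return "\n\n".join(sections)

p = render_prompt("Summarize the report", examples=["Input: ... -> Output: ..."])
```

Skipping empty sections, rather than interpolating empty strings, keeps the rendered prompt free of dangling headers.
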
### 5. System Prompt Design
- Setting model behavior and constraints
- Defining output formats and structure
- Establishing role and expertise
- Safety guidelines and content policies
- Context setting and background information

## Quick Start

```python
from prompt_optimizer import PromptTemplate, FewShotSelector

# Define a structured prompt template
template = PromptTemplate(
    system="You are an expert SQL developer. Generate efficient, secure SQL queries.",
    instruction="Convert the following natural language query to SQL:\n{query}",
    few_shot_examples=True,
    output_format="SQL code block with explanatory comments"
)

# Configure few-shot learning
selector = FewShotSelector(
    examples_db="sql_examples.jsonl",
    selection_strategy="semantic_similarity",
    max_examples=3
)

# Generate optimized prompt
prompt = template.render(
    query="Find all users who registered in the last 30 days",
    examples=selector.select(query="user registration date filter")
)
```

## Key Patterns

### Progressive Disclosure
Start with simple prompts, add complexity only when needed:

1. **Level 1**: Direct instruction
   - "Summarize this article"

2. **Level 2**: Add constraints
   - "Summarize this article in 3 bullet points, focusing on key findings"

3. **Level 3**: Add reasoning
   - "Read this article, identify the main findings, then summarize in 3 bullet points"

4. **Level 4**: Add examples
   - Include 2-3 example summaries with input-output pairs

### Instruction Hierarchy
```
[System Context] → [Task Instruction] → [Examples] → [Input Data] → [Output Format]
```

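The hierarchy can be applied mechanically: each layer becomes one block, joined in order. A minimal sketch (the `build_prompt` helper is illustrative, not a library API):

```python
def build_prompt(system: str, instruction: str, examples: str,
                 input_data: str, output_format: str) -> str:
    """Concatenate the five layers in hierarchy order, separated by blank lines."""
    return "\n\n".join([
        system,
        instruction,
        examples,
        f"Input:\n{input_data}",
        f"Output format: {output_format}",
    ])

prompt = build_prompt(
    system="You are a careful data analyst.",
    instruction="Classify the sentiment of the input.",
    examples="Input: great! -> Positive",
    input_data="The service was slow.",
    output_format="One word: Positive, Negative, or Neutral",
)
```
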
### Error Recovery
Build prompts that gracefully handle failures:
- Include fallback instructions
- Request confidence scores
- Ask for alternative interpretations when uncertain
- Specify how to indicate missing information

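One way to bake these recovery behaviors into a prompt is a reusable suffix; the exact field names here (`status`, `confidence`, `missing`) are illustrative, not a standard.

```python
# Reusable suffix encoding the fallback, confidence, and missing-information rules
RECOVERY_SUFFIX = """
If you cannot complete the task, respond with {"status": "failed", "reason": "..."}.
Include a confidence score from 0.0 to 1.0 in a "confidence" field.
If required information is missing, list the missing fields under "missing".
""".strip()

def with_error_recovery(task_prompt: str) -> str:
    return f"{task_prompt}\n\n{RECOVERY_SUFFIX}"

p = with_error_recovery("Extract the invoice total from the text below.")
```
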
## Best Practices

1. **Be Specific**: Vague prompts produce inconsistent results
2. **Show, Don't Tell**: Examples are more effective than descriptions
3. **Test Extensively**: Evaluate on diverse, representative inputs
4. **Iterate Rapidly**: Small changes can have large impacts
5. **Monitor Performance**: Track metrics in production
6. **Version Control**: Treat prompts as code with proper versioning
7. **Document Intent**: Explain why prompts are structured as they are

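Practices 6 and 7 can be combined in a small registry that versions prompt text alongside the rationale for each change. This is a minimal in-memory sketch; a real setup would keep prompts in version control or a database.

```python
import hashlib

class PromptRegistry:
    """Minimal in-memory prompt registry keyed by (name, version)."""

    def __init__(self):
        self._store = {}

    def register(self, name: str, version: str, text: str, rationale: str = ""):
        self._store[(name, version)] = {
            "text": text,
            "rationale": rationale,  # document *why* the prompt is shaped this way
            "checksum": hashlib.sha256(text.encode()).hexdigest()[:8],
        }

    def get(self, name: str, version: str) -> str:
        return self._store[(name, version)]["text"]

reg = PromptRegistry()
reg.register("classify", "v1", "Classify the sentiment of: {text}",
             rationale="baseline")
reg.register("classify", "v2", "Classify the sentiment of: {text}\nAnswer with one word.",
             rationale="v1 produced full sentences")
```
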
## Common Pitfalls

- **Over-engineering**: Starting with complex prompts before trying simple ones
- **Example pollution**: Using examples that don't match the target task
- **Context overflow**: Exceeding token limits with excessive examples
- **Ambiguous instructions**: Leaving room for multiple interpretations
- **Ignoring edge cases**: Not testing on unusual or boundary inputs

## Integration Patterns

### With RAG Systems
```python
# Combine retrieved context with prompt engineering
prompt = f"""Given the following context:
{retrieved_context}

{few_shot_examples}

Question: {user_question}

Provide a detailed answer based solely on the context above. If the context doesn't contain enough information, explicitly state what's missing."""
```

### With Validation
```python
# Add self-verification step
prompt = f"""{main_task_prompt}

After generating your response, verify it meets these criteria:
1. Answers the question directly
2. Uses only information from provided context
3. Cites specific sources
4. Acknowledges any uncertainty

If verification fails, revise your response."""
```

## Performance Optimization

### Token Efficiency
- Remove redundant words and phrases
- Use abbreviations consistently after first definition
- Consolidate similar instructions
- Move stable content to system prompts

### Latency Reduction
- Minimize prompt length without sacrificing quality
- Use streaming for long-form outputs
- Cache common prompt prefixes
- Batch similar requests when possible

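Caching common prompt prefixes can be as simple as memoizing the stable portion of the rendered prompt; keeping that prefix byte-identical across requests also helps if your provider offers server-side prompt caching. A sketch using `functools.lru_cache` (the prompt strings are illustrative):

```python
from functools import lru_cache

SYSTEM_PROMPT = "You are an expert SQL developer."  # stable prefix, identical across requests

@lru_cache(maxsize=128)
def render_prefix(system: str, instruction: str) -> str:
    """Cache the stable prefix so only the per-request suffix is built each call."""
    return f"{system}\n\n{instruction}\n\n"

def build_request(user_input: str) -> str:
    prefix = render_prefix(SYSTEM_PROMPT, "Convert the question to SQL.")
    return prefix + f"Question: {user_input}"

r1 = build_request("list active users")
r2 = build_request("count orders per day")
```
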
## Resources

- **references/few-shot-learning.md**: Deep dive on example selection and construction
- **references/chain-of-thought.md**: Advanced reasoning elicitation techniques
- **references/prompt-optimization.md**: Systematic refinement workflows
- **references/prompt-templates.md**: Reusable template patterns
- **references/system-prompts.md**: System-level prompt design
- **assets/prompt-template-library.md**: Battle-tested prompt templates
- **assets/few-shot-examples.json**: Curated example datasets
- **scripts/optimize-prompt.py**: Automated prompt optimization tool

## Success Metrics

Track these KPIs for your prompts:
- **Accuracy**: Correctness of outputs
- **Consistency**: Reproducibility across similar inputs
- **Latency**: Response time (P50, P95, P99)
- **Token Usage**: Average tokens per request
- **Success Rate**: Percentage of valid outputs
- **User Satisfaction**: Ratings and feedback

## Next Steps

1. Review the prompt template library for common patterns
2. Experiment with few-shot learning for your specific use case
3. Implement prompt versioning and A/B testing
4. Set up automated evaluation pipelines
5. Document your prompt engineering decisions and learnings

# optimize-prompt.py

```python
#!/usr/bin/env python3
"""
Prompt Optimization Script

Automatically test and optimize prompts using A/B testing and metrics tracking.
"""

import json
import time
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor
import numpy as np


@dataclass
class TestCase:
    input: Dict[str, Any]
    expected_output: str
    metadata: Optional[Dict[str, Any]] = None


class PromptOptimizer:
    def __init__(self, llm_client, test_suite: List[TestCase]):
        self.client = llm_client
        self.test_suite = test_suite
        self.results_history = []
        self.executor = ThreadPoolExecutor()

    def shutdown(self):
        """Shut down the thread pool executor."""
        self.executor.shutdown(wait=True)

    def evaluate_prompt(self, prompt_template: str, test_cases: List[TestCase] = None) -> Dict[str, float]:
        """Evaluate a prompt template against test cases in parallel."""
        if test_cases is None:
            test_cases = self.test_suite

        metrics = {
            'accuracy': [],
            'latency': [],
            'token_count': [],
            'success_rate': []
        }

        def process_test_case(test_case):
            start_time = time.time()

            # Render prompt with test case inputs
            prompt = prompt_template.format(**test_case.input)

            # Get LLM response
            response = self.client.complete(prompt)

            # Measure latency
            latency = time.time() - start_time

            # Calculate individual metrics (whitespace split approximates token count)
            token_count = len(prompt.split()) + len(response.split())
            success = 1 if response else 0
            accuracy = self.calculate_accuracy(response, test_case.expected_output)

            return {
                'latency': latency,
                'token_count': token_count,
                'success_rate': success,
                'accuracy': accuracy
            }

        # Run test cases in parallel
        results = list(self.executor.map(process_test_case, test_cases))

        # Aggregate metrics
        for result in results:
            metrics['latency'].append(result['latency'])
            metrics['token_count'].append(result['token_count'])
            metrics['success_rate'].append(result['success_rate'])
            metrics['accuracy'].append(result['accuracy'])

        return {
            'avg_accuracy': np.mean(metrics['accuracy']),
            'avg_latency': np.mean(metrics['latency']),
            'p95_latency': np.percentile(metrics['latency'], 95),
            'avg_tokens': np.mean(metrics['token_count']),
            'success_rate': np.mean(metrics['success_rate'])
        }

    def calculate_accuracy(self, response: str, expected: str) -> float:
        """Calculate accuracy score between response and expected output."""
        # Simple exact match
        if response.strip().lower() == expected.strip().lower():
            return 1.0

        # Partial match using word overlap
        response_words = set(response.lower().split())
        expected_words = set(expected.lower().split())

        if not expected_words:
            return 0.0

        overlap = len(response_words & expected_words)
        return overlap / len(expected_words)

    def optimize(self, base_prompt: str, max_iterations: int = 5) -> Dict[str, Any]:
        """Iteratively optimize a prompt."""
        current_prompt = base_prompt
        best_prompt = base_prompt
        best_score = 0
        current_metrics = None

        for iteration in range(max_iterations):
            print(f"\nIteration {iteration + 1}/{max_iterations}")

            # Reuse the metrics carried over from the previous iteration's winning
            # variation instead of re-evaluating the same prompt
            if current_metrics:
                metrics = current_metrics
            else:
                metrics = self.evaluate_prompt(current_prompt)

            print(f"Accuracy: {metrics['avg_accuracy']:.2f}, Latency: {metrics['avg_latency']:.2f}s")

            # Track results
            self.results_history.append({
                'iteration': iteration,
                'prompt': current_prompt,
                'metrics': metrics
            })

            # Update best if improved
            if metrics['avg_accuracy'] > best_score:
                best_score = metrics['avg_accuracy']
                best_prompt = current_prompt

            # Stop if good enough
            if metrics['avg_accuracy'] > 0.95:
                print("Achieved target accuracy!")
                break

            # Generate variations for next iteration
            variations = self.generate_variations(current_prompt, metrics)

            # Test variations and pick best
            best_variation = current_prompt
            best_variation_score = metrics['avg_accuracy']
            best_variation_metrics = metrics

            for variation in variations:
                var_metrics = self.evaluate_prompt(variation)
                if var_metrics['avg_accuracy'] > best_variation_score:
                    best_variation_score = var_metrics['avg_accuracy']
                    best_variation = variation
                    best_variation_metrics = var_metrics

            current_prompt = best_variation
            current_metrics = best_variation_metrics

        return {
            'best_prompt': best_prompt,
            'best_score': best_score,
            'history': self.results_history
        }

    def generate_variations(self, prompt: str, current_metrics: Dict) -> List[str]:
        """Generate prompt variations to test."""
        variations = []

        # Variation 1: Add explicit format instruction
        variations.append(prompt + "\n\nProvide your answer in a clear, concise format.")

        # Variation 2: Add step-by-step instruction
        variations.append("Let's solve this step by step.\n\n" + prompt)

        # Variation 3: Add verification step
        variations.append(prompt + "\n\nVerify your answer before responding.")

        # Variation 4: Make more concise
        concise = self.make_concise(prompt)
        if concise != prompt:
            variations.append(concise)

        # Variation 5: Add examples (if none present)
        if "example" not in prompt.lower():
            variations.append(self.add_examples(prompt))

        return variations[:3]  # Cap the search at the first 3 variations per iteration

    def make_concise(self, prompt: str) -> str:
        """Remove redundant words to make prompt more concise."""
        replacements = [
            ("in order to", "to"),
            ("due to the fact that", "because"),
            ("at this point in time", "now"),
            ("in the event that", "if"),
        ]

        result = prompt
        for old, new in replacements:
            result = result.replace(old, new)

        return result

    def add_examples(self, prompt: str) -> str:
        """Add example section to prompt."""
        return f"""{prompt}

Example:
Input: Sample input
Output: Sample output
"""

    def compare_prompts(self, prompt_a: str, prompt_b: str) -> Dict[str, Any]:
        """A/B test two prompts."""
        print("Testing Prompt A...")
        metrics_a = self.evaluate_prompt(prompt_a)

        print("Testing Prompt B...")
        metrics_b = self.evaluate_prompt(prompt_b)

        return {
            'prompt_a_metrics': metrics_a,
            'prompt_b_metrics': metrics_b,
            'winner': 'A' if metrics_a['avg_accuracy'] > metrics_b['avg_accuracy'] else 'B',
            'improvement': abs(metrics_a['avg_accuracy'] - metrics_b['avg_accuracy'])
        }

    def export_results(self, filename: str):
        """Export optimization results to JSON."""
        with open(filename, 'w') as f:
            json.dump(self.results_history, f, indent=2)


def main():
    # Example usage
    test_suite = [
        TestCase(
            input={'text': 'This movie was amazing!'},
            expected_output='Positive'
        ),
        TestCase(
            input={'text': 'Worst purchase ever.'},
            expected_output='Negative'
        ),
        TestCase(
            input={'text': 'It was okay, nothing special.'},
            expected_output='Neutral'
        )
    ]

    # Mock LLM client for demonstration
    class MockLLMClient:
        def complete(self, prompt):
            # Simulate LLM response (case-insensitive keyword match)
            if 'amazing' in prompt.lower():
                return 'Positive'
            elif 'worst' in prompt.lower():
                return 'Negative'
            else:
                return 'Neutral'

    optimizer = PromptOptimizer(MockLLMClient(), test_suite)

    try:
        base_prompt = "Classify the sentiment of: {text}\nSentiment:"

        results = optimizer.optimize(base_prompt)

        print("\n" + "=" * 50)
        print("Optimization Complete!")
        print(f"Best Accuracy: {results['best_score']:.2f}")
        print(f"Best Prompt:\n{results['best_prompt']}")

        optimizer.export_results('optimization_results.json')
    finally:
        optimizer.shutdown()


if __name__ == '__main__':
    main()
```