llm_prompt_optimization_ab_testing_with_metrics_tracking.py
# SKILL.md

---
name: prompt-engineering-patterns
description: Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability in production. Use when optimizing prompts, improving LLM outputs, or designing production prompt templates.
---

# Prompt Engineering Patterns

Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability.

## Do not use this skill when

- The task is unrelated to prompt engineering patterns
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Designing complex prompts for production LLM applications
- Optimizing prompt performance and consistency
- Implementing structured reasoning patterns (chain-of-thought, tree-of-thought)
- Building few-shot learning systems with dynamic example selection
- Creating reusable prompt templates with variable interpolation
- Debugging and refining prompts that produce inconsistent outputs
- Implementing system prompts for specialized AI assistants

## Core Capabilities

### 1. Few-Shot Learning
- Example selection strategies (semantic similarity, diversity sampling)
- Balancing example count with context window constraints
- Constructing effective demonstrations with input-output pairs
- Dynamic example retrieval from knowledge bases
- Handling edge cases through strategic example selection
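
The selection strategies above can be sketched with a toy scorer. A production system would rank candidate examples by embedding cosine similarity; the word-level Jaccard overlap below is a dependency-free stand-in, and the example data is invented for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (toy proxy for embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_examples(query: str, examples: list[dict], k: int = 3) -> list[dict]:
    """Return the k examples whose 'input' is most similar to the query."""
    return sorted(examples, key=lambda ex: jaccard(query, ex["input"]), reverse=True)[:k]

# Invented examples database
examples = [
    {"input": "filter users by signup date", "output": "SELECT ..."},
    {"input": "count orders per customer", "output": "SELECT ..."},
    {"input": "find users registered this month", "output": "SELECT ..."},
]
picked = select_examples("users who registered recently", examples, k=2)
```

Swapping `jaccard` for a real embedding similarity changes nothing about the selection logic, which is the point of keeping it behind a single scoring function.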

### 2. Chain-of-Thought Prompting
- Step-by-step reasoning elicitation
- Zero-shot CoT with "Let's think step by step"
- Few-shot CoT with reasoning traces
- Self-consistency techniques (sampling multiple reasoning paths)
- Verification and validation steps
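
Self-consistency can be sketched as sampling several reasoning paths and majority-voting the extracted answers. The `sample_fn` callable and the stubbed sample stream below are placeholders for a real temperature-sampled LLM call:

```python
from collections import Counter
from itertools import cycle

def self_consistent_answer(sample_fn, prompt: str, n: int = 5) -> str:
    """Sample n reasoning paths and return the majority final answer.

    sample_fn(prompt) is assumed to return the final answer extracted from
    one sampled chain-of-thought (temperature > 0 in practice).
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub standing in for a real LLM call; sampled answers disagree.
samples = cycle(["42", "42", "41", "42", "17"])
answer = self_consistent_answer(lambda _: next(samples), "What is 6 * 7?")
```

The majority vote absorbs the two outlier paths, which is why self-consistency tends to beat a single greedy decode on reasoning tasks.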

### 3. Prompt Optimization
- Iterative refinement workflows
- A/B testing prompt variations
- Measuring prompt performance metrics (accuracy, consistency, latency)
- Reducing token usage while maintaining quality
- Handling edge cases and failure modes

### 4. Template Systems
- Variable interpolation and formatting
- Conditional prompt sections
- Multi-turn conversation templates
- Role-based prompt composition
- Modular prompt components
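
A minimal sketch of such a template system, assuming plain `str.format` interpolation and a conditional examples section (the function and its arguments are illustrative, not a specific library's API):

```python
def render_prompt(system: str, instruction: str, examples=None, **variables) -> str:
    """Compose a prompt from modular sections; the examples section is conditional."""
    sections = [system]
    if examples:  # conditional section: omitted entirely when no examples are given
        shots = "\n\n".join(
            f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in examples
        )
        sections.append(f"Examples:\n\n{shots}")
    sections.append(instruction.format(**variables))  # variable interpolation
    return "\n\n".join(sections)

prompt = render_prompt(
    "You are a concise technical writer.",
    "Summarize in one sentence: {text}",
    examples=[{"input": "a long changelog", "output": "one-line release note"}],
    text="The quick brown fox jumps over the lazy dog.",
)
```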

### 5. System Prompt Design
- Setting model behavior and constraints
- Defining output formats and structure
- Establishing role and expertise
- Safety guidelines and content policies
- Context setting and background information
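
As a concrete illustration, a system prompt can address each of these five concerns with one labeled line; the wording below is an invented example, not a prescribed format:

```python
SYSTEM_PROMPT = "\n".join([
    # Role and expertise
    "You are a senior data analyst.",
    # Behavior constraint
    "Answer only questions about the dataset provided by the user.",
    # Output format
    'Respond as JSON: {"answer": "...", "confidence": 0.0}.',
    # Safety guideline
    "Refuse requests for personally identifiable information.",
    # Context and background
    "The dataset covers 2023 retail transactions.",
])
```

Keeping each concern on its own line makes the prompt diffable and lets individual lines be toggled or A/B tested independently.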

## Quick Start

```python
from prompt_optimizer import PromptTemplate, FewShotSelector

# Define a structured prompt template
template = PromptTemplate(
    system="You are an expert SQL developer. Generate efficient, secure SQL queries.",
    instruction="Convert the following natural language query to SQL:\n{query}",
    few_shot_examples=True,
    output_format="SQL code block with explanatory comments"
)

# Configure few-shot learning
selector = FewShotSelector(
    examples_db="sql_examples.jsonl",
    selection_strategy="semantic_similarity",
    max_examples=3
)

# Generate optimized prompt
prompt = template.render(
    query="Find all users who registered in the last 30 days",
    examples=selector.select(query="user registration date filter")
)
```

## Key Patterns

### Progressive Disclosure
Start with simple prompts, add complexity only when needed:

1. **Level 1**: Direct instruction
   - "Summarize this article"

2. **Level 2**: Add constraints
   - "Summarize this article in 3 bullet points, focusing on key findings"

3. **Level 3**: Add reasoning
   - "Read this article, identify the main findings, then summarize in 3 bullet points"

4. **Level 4**: Add examples
   - Include 2-3 example summaries with input-output pairs

### Instruction Hierarchy
```
[System Context][Task Instruction][Examples][Input Data][Output Format]
```
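
A minimal assembler for this hierarchy might join the five sections in order, skipping any that are empty (the function and its inputs are illustrative):

```python
def build_prompt(system: str, task: str, examples: str,
                 input_data: str, output_format: str) -> str:
    """Join the five hierarchy sections in order, skipping empty ones."""
    parts = [system, task, examples, input_data, output_format]
    return "\n\n".join(p for p in parts if p)

p = build_prompt(
    system="You are a precise sentiment classifier.",
    task="Label the review below.",
    examples="Review: great! -> Positive",
    input_data="Review: terrible service",
    output_format="Answer with one word: Positive, Negative, or Neutral.",
)
```

Fixing the section order in code, rather than hand-editing prompt strings, keeps every prompt in the codebase consistent with the hierarchy.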

### Error Recovery
Build prompts that gracefully handle failures:
- Include fallback instructions
- Request confidence scores
- Ask for alternative interpretations when uncertain
- Specify how to indicate missing information
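
One way to sketch these fallbacks is a suffix appended to any task prompt; the sentinel string (`INSUFFICIENT_CONTEXT`) is an arbitrary choice for illustration:

```python
FALLBACK_SUFFIX = (
    "If you are uncertain, state a confidence score from 0 to 100.\n"
    "If the input is ambiguous, list the plausible interpretations first.\n"
    "If required information is missing, reply exactly: INSUFFICIENT_CONTEXT."
)

def with_error_recovery(prompt: str) -> str:
    """Append fallback instructions so failure modes surface explicitly."""
    return prompt.rstrip() + "\n\n" + FALLBACK_SUFFIX
```

A fixed sentinel like this lets downstream code detect missing-information failures with a string match instead of parsing free-form refusals.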

## Best Practices

1. **Be Specific**: Vague prompts produce inconsistent results
2. **Show, Don't Tell**: Examples are more effective than descriptions
3. **Test Extensively**: Evaluate on diverse, representative inputs
4. **Iterate Rapidly**: Small changes can have large impacts
5. **Monitor Performance**: Track metrics in production
6. **Version Control**: Treat prompts as code with proper versioning
7. **Document Intent**: Explain why prompts are structured as they are

## Common Pitfalls

- **Over-engineering**: Starting with complex prompts before trying simple ones
- **Example pollution**: Using examples that don't match the target task
- **Context overflow**: Exceeding token limits with excessive examples
- **Ambiguous instructions**: Leaving room for multiple interpretations
- **Ignoring edge cases**: Not testing on unusual or boundary inputs

## Integration Patterns

### With RAG Systems
```python
# Combine retrieved context with prompt engineering
prompt = f"""Given the following context:
{retrieved_context}

{few_shot_examples}

Question: {user_question}

Provide a detailed answer based solely on the context above. If the context doesn't contain enough information, explicitly state what's missing."""
```

### With Validation
```python
# Add self-verification step
prompt = f"""{main_task_prompt}

After generating your response, verify it meets these criteria:
1. Answers the question directly
2. Uses only information from provided context
3. Cites specific sources
4. Acknowledges any uncertainty

If verification fails, revise your response."""
```

## Performance Optimization

### Token Efficiency
- Remove redundant words and phrases
- Use abbreviations consistently after first definition
- Consolidate similar instructions
- Move stable content to system prompts

### Latency Reduction
- Minimize prompt length without sacrificing quality
- Use streaming for long-form outputs
- Cache common prompt prefixes
- Batch similar requests when possible

## Resources

- **references/few-shot-learning.md**: Deep dive on example selection and construction
- **references/chain-of-thought.md**: Advanced reasoning elicitation techniques
- **references/prompt-optimization.md**: Systematic refinement workflows
- **references/prompt-templates.md**: Reusable template patterns
- **references/system-prompts.md**: System-level prompt design
- **assets/prompt-template-library.md**: Battle-tested prompt templates
- **assets/few-shot-examples.json**: Curated example datasets
- **scripts/optimize-prompt.py**: Automated prompt optimization tool

## Success Metrics

Track these KPIs for your prompts:
- **Accuracy**: Correctness of outputs
- **Consistency**: Reproducibility across similar inputs
- **Latency**: Response time (P50, P95, P99)
- **Token Usage**: Average tokens per request
- **Success Rate**: Percentage of valid outputs
- **User Satisfaction**: Ratings and feedback

## Next Steps

1. Review the prompt template library for common patterns
2. Experiment with few-shot learning for your specific use case
3. Implement prompt versioning and A/B testing
4. Set up automated evaluation pipelines
5. Document your prompt engineering decisions and learnings

# optimize-prompt.py

```python
#!/usr/bin/env python3
"""
Prompt Optimization Script

Automatically test and optimize prompts using A/B testing and metrics tracking.
"""

import json
import time
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor
import numpy as np


@dataclass
class TestCase:
    input: Dict[str, Any]
    expected_output: str
    metadata: Optional[Dict[str, Any]] = None


class PromptOptimizer:
    def __init__(self, llm_client, test_suite: List[TestCase]):
        self.client = llm_client
        self.test_suite = test_suite
        self.results_history = []
        self.executor = ThreadPoolExecutor()

    def shutdown(self):
        """Shut down the thread pool executor."""
        self.executor.shutdown(wait=True)

    def evaluate_prompt(self, prompt_template: str, test_cases: List[TestCase] = None) -> Dict[str, float]:
        """Evaluate a prompt template against test cases in parallel."""
        if test_cases is None:
            test_cases = self.test_suite

        metrics = {
            'accuracy': [],
            'latency': [],
            'token_count': [],
            'success_rate': []
        }

        def process_test_case(test_case):
            start_time = time.time()

            # Render prompt with test case inputs
            prompt = prompt_template.format(**test_case.input)

            # Get LLM response
            response = self.client.complete(prompt)

            # Measure latency
            latency = time.time() - start_time

            # Calculate individual metrics (word count is a rough token proxy)
            token_count = len(prompt.split()) + len(response.split())
            success = 1 if response else 0
            accuracy = self.calculate_accuracy(response, test_case.expected_output)

            return {
                'latency': latency,
                'token_count': token_count,
                'success_rate': success,
                'accuracy': accuracy
            }

        # Run test cases in parallel
        results = list(self.executor.map(process_test_case, test_cases))

        # Aggregate metrics
        for result in results:
            metrics['latency'].append(result['latency'])
            metrics['token_count'].append(result['token_count'])
            metrics['success_rate'].append(result['success_rate'])
            metrics['accuracy'].append(result['accuracy'])

        return {
            'avg_accuracy': np.mean(metrics['accuracy']),
            'avg_latency': np.mean(metrics['latency']),
            'p95_latency': np.percentile(metrics['latency'], 95),
            'avg_tokens': np.mean(metrics['token_count']),
            'success_rate': np.mean(metrics['success_rate'])
        }

    def calculate_accuracy(self, response: str, expected: str) -> float:
        """Calculate accuracy score between response and expected output."""
        # Simple exact match
        if response.strip().lower() == expected.strip().lower():
            return 1.0

        # Partial match using word overlap
        response_words = set(response.lower().split())
        expected_words = set(expected.lower().split())

        if not expected_words:
            return 0.0

        overlap = len(response_words & expected_words)
        return overlap / len(expected_words)

    def optimize(self, base_prompt: str, max_iterations: int = 5) -> Dict[str, Any]:
        """Iteratively optimize a prompt."""
        current_prompt = base_prompt
        best_prompt = base_prompt
        best_score = 0
        current_metrics = None

        for iteration in range(max_iterations):
            print(f"\nIteration {iteration + 1}/{max_iterations}")

            # Reuse metrics computed while testing variations in the previous
            # iteration to avoid re-evaluating the same prompt.
            if current_metrics:
                metrics = current_metrics
            else:
                metrics = self.evaluate_prompt(current_prompt)

            print(f"Accuracy: {metrics['avg_accuracy']:.2f}, Latency: {metrics['avg_latency']:.2f}s")

            # Track results
            self.results_history.append({
                'iteration': iteration,
                'prompt': current_prompt,
                'metrics': metrics
            })

            # Update best if improved
            if metrics['avg_accuracy'] > best_score:
                best_score = metrics['avg_accuracy']
                best_prompt = current_prompt

            # Stop if good enough
            if metrics['avg_accuracy'] > 0.95:
                print("Achieved target accuracy!")
                break

            # Generate variations for next iteration
            variations = self.generate_variations(current_prompt, metrics)

            # Test variations and pick best
            best_variation = current_prompt
            best_variation_score = metrics['avg_accuracy']
            best_variation_metrics = metrics

            for variation in variations:
                var_metrics = self.evaluate_prompt(variation)
                if var_metrics['avg_accuracy'] > best_variation_score:
                    best_variation_score = var_metrics['avg_accuracy']
                    best_variation = variation
                    best_variation_metrics = var_metrics

            current_prompt = best_variation
            current_metrics = best_variation_metrics

        return {
            'best_prompt': best_prompt,
            'best_score': best_score,
            'history': self.results_history
        }

    def generate_variations(self, prompt: str, current_metrics: Dict) -> List[str]:
        """Generate prompt variations to test."""
        variations = []

        # Variation 1: Add explicit format instruction
        variations.append(prompt + "\n\nProvide your answer in a clear, concise format.")

        # Variation 2: Add step-by-step instruction
        variations.append("Let's solve this step by step.\n\n" + prompt)

        # Variation 3: Add verification step
        variations.append(prompt + "\n\nVerify your answer before responding.")

        # Variation 4: Make more concise
        concise = self.make_concise(prompt)
        if concise != prompt:
            variations.append(concise)

        # Variation 5: Add examples (if none present)
        if "example" not in prompt.lower():
            variations.append(self.add_examples(prompt))

        # Cap at the first three variations to bound evaluation cost
        return variations[:3]

    def make_concise(self, prompt: str) -> str:
        """Remove redundant words to make prompt more concise."""
        replacements = [
            ("in order to", "to"),
            ("due to the fact that", "because"),
            ("at this point in time", "now"),
            ("in the event that", "if"),
        ]

        result = prompt
        for old, new in replacements:
            result = result.replace(old, new)

        return result

    def add_examples(self, prompt: str) -> str:
        """Add example section to prompt."""
        return f"""{prompt}

Example:
Input: Sample input
Output: Sample output
"""

    def compare_prompts(self, prompt_a: str, prompt_b: str) -> Dict[str, Any]:
        """A/B test two prompts."""
        print("Testing Prompt A...")
        metrics_a = self.evaluate_prompt(prompt_a)

        print("Testing Prompt B...")
        metrics_b = self.evaluate_prompt(prompt_b)

        return {
            'prompt_a_metrics': metrics_a,
            'prompt_b_metrics': metrics_b,
            'winner': 'A' if metrics_a['avg_accuracy'] > metrics_b['avg_accuracy'] else 'B',
            'improvement': abs(metrics_a['avg_accuracy'] - metrics_b['avg_accuracy'])
        }

    def export_results(self, filename: str):
        """Export optimization results to JSON."""
        with open(filename, 'w') as f:
            json.dump(self.results_history, f, indent=2)


def main():
    # Example usage
    test_suite = [
        TestCase(
            input={'text': 'This movie was amazing!'},
            expected_output='Positive'
        ),
        TestCase(
            input={'text': 'Worst purchase ever.'},
            expected_output='Negative'
        ),
        TestCase(
            input={'text': 'It was okay, nothing special.'},
            expected_output='Neutral'
        )
    ]

    # Mock LLM client for demonstration
    class MockLLMClient:
        def complete(self, prompt):
            # Simulate LLM response (case-insensitive keyword match)
            text = prompt.lower()
            if 'amazing' in text:
                return 'Positive'
            elif 'worst' in text:
                return 'Negative'
            else:
                return 'Neutral'

    optimizer = PromptOptimizer(MockLLMClient(), test_suite)

    try:
        base_prompt = "Classify the sentiment of: {text}\nSentiment:"

        results = optimizer.optimize(base_prompt)

        print("\n" + "="*50)
        print("Optimization Complete!")
        print(f"Best Accuracy: {results['best_score']:.2f}")
        print(f"Best Prompt:\n{results['best_prompt']}")

        optimizer.export_results('optimization_results.json')
    finally:
        optimizer.shutdown()


if __name__ == '__main__':
    main()

```