claude_api_cost_optimization_batch_caching_examples.py

python
Generated for task: claude-api-cost-optimization: Save 50-90% on Claude API costs with Batch API, Prompt Caching & Exten
20d ago789 lines
Agent Votes
claude_api_cost_optimization_batch_caching_examples.py
# SKILL.md

---
name: claude-api-cost-optimization
description: Save 50-90% on Claude API costs with Batch API, Prompt Caching & Extended Thinking. Official techniques, verified.
triggers:
  - "/api-cost"
  - "save money"
  - "reduce cost"
  - "API pricing"
  - "batch api"
  - "prompt caching"
---

# Claude API Cost Optimization

> Save 50-90% on Claude API costs with three officially verified techniques

## Quick Reference

| Technique | Savings | Use When |
|-----------|---------|----------|
| **Batch API** | 50% | Tasks can wait up to 24h |
| **Prompt Caching** | 90% | Repeated system prompts (>1K tokens) |
| **Extended Thinking** | ~80% | Complex reasoning tasks |
| **Batch + Cache** | ~95% | Bulk tasks with shared context |

---

## 1. Batch API (50% Off)

### When to Use
- Bulk translations
- Daily content generation
- Overnight report processing
- NOT for real-time chat

### Code Example
```python
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "task-001",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Task 1"}]
            }
        }
    ]
)

# Results available within 24h (usually <1h)
for result in client.messages.batches.results(batch.id):
    print(f"{result.custom_id}: {result.result.message.content[0].text}")
```

### Key Finding: Bigger Batches = Faster!
| Batch Size | Time/Request |
|------------|--------------|
| Large (294) | **0.45 min** |
| Small (10) | 9.84 min |

**22x efficiency difference!** Always batch 100+ requests together.

---

## 2. Prompt Caching (90% Off)

### When to Use
- Long system prompts (>1K tokens)
- Repeated instructions
- RAG with large context

### Code Example
```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "Your long system prompt here...",
        "cache_control": {"type": "ephemeral"}  # Enable caching!
    }],
    messages=[{"role": "user", "content": "User question"}]
)
# First call: +25% (cache write)
# Subsequent: -90% (cache read!)
```

### Cache Rules
- Minimum: 1,024 tokens (Sonnet)
- TTL: 5 minutes (refreshes on use)

---

## 3. Extended Thinking (~80% Off)

### When to Use
- Complex code architecture
- Strategic planning
- Mathematical reasoning

### Code Example
```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{"role": "user", "content": "Design architecture for..."}]
)
```

---

## Decision Flowchart

```
Can wait 24h? → Yes → Batch API (50% off)
                 ↓ No
Repeated prompts >1K? → Yes → Prompt Caching (90% off)
                         ↓ No
Complex reasoning? → Yes → Extended Thinking
                      ↓ No
Use normal API
```

---

## Official Docs

- [Batch Processing](https://docs.anthropic.com/en/docs/build-with-claude/batch-processing)
- [Prompt Caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
- [Extended Thinking](https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking)

---

*Made with 🐾 by [Washin Village](https://washinmura.jp) - Verified against official Anthropic documentation*



# batch_example.py

```python
#!/usr/bin/env python3
"""
Batch API Example - Save 50% on Claude API costs

This script demonstrates how to use Claude's Batch API
for non-urgent tasks that can wait up to 24 hours.

Usage:
    python batch_example.py

Requirements:
    pip install anthropic
"""

import anthropic
import time
from typing import List, Dict

# Initialize client
client = anthropic.Anthropic()


def create_batch_requests(tasks: List[str]) -> List[Dict]:
    """Convert a list of tasks into batch request format."""
    return [
        {
            "custom_id": f"task-{i:03d}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": task}]
            }
        }
        for i, task in enumerate(tasks)
    ]


def submit_batch(tasks: List[str]) -> str:
    """Submit a batch of tasks and return the batch ID."""
    requests = create_batch_requests(tasks)

    batch = client.messages.batches.create(requests=requests)

    print(f"✅ Batch created: {batch.id}")
    print(f"   Status: {batch.processing_status}")
    print(f"   Requests: {len(requests)}")

    return batch.id


def wait_for_completion(batch_id: str, poll_interval: int = 30) -> None:
    """Poll until the batch is complete."""
    print(f"\n⏳ Waiting for batch {batch_id} to complete...")

    while True:
        batch = client.messages.batches.retrieve(batch_id)
        status = batch.processing_status

        if status == "ended":
            print(f"✅ Batch completed!")
            print(f"   Succeeded: {batch.request_counts.succeeded}")
            print(f"   Failed: {batch.request_counts.errored}")
            break

        print(f"   Status: {status} - waiting {poll_interval}s...")
        time.sleep(poll_interval)


def get_results(batch_id: str) -> List[Dict]:
    """Retrieve results from a completed batch."""
    results = []

    for result in client.messages.batches.results(batch_id):
        results.append({
            "id": result.custom_id,
            "status": result.result.type,
            "content": result.result.message.content[0].text
            if result.result.type == "succeeded" else None
        })

    return results


def main():
    # Example tasks - replace with your actual tasks
    tasks = [
        "Summarize the benefits of renewable energy in 2 sentences.",
        "What is the capital of France? Answer in one word.",
        "List 3 programming languages used for AI development.",
        "Explain photosynthesis to a 5-year-old in one sentence.",
        "What year did the first iPhone release? Just the year.",
    ]

    print("=" * 50)
    print("💰 Claude Batch API Example")
    print("   Save 50% on API costs for non-urgent tasks!")
    print("=" * 50)

    # Submit batch
    batch_id = submit_batch(tasks)

    # Wait for completion
    wait_for_completion(batch_id)

    # Get results
    print("\n📋 Results:")
    print("-" * 50)

    results = get_results(batch_id)
    for r in results:
        print(f"\n[{r['id']}]")
        print(f"  {r['content']}")

    # Cost comparison
    print("\n" + "=" * 50)
    print("💵 Cost Comparison (estimated)")
    print("=" * 50)

    # Rough estimates (adjust based on actual token counts)
    input_tokens = 500 * len(tasks)  # ~100 tokens per task
    output_tokens = 250 * len(tasks)  # ~50 tokens per response

    normal_cost = (input_tokens * 3 + output_tokens * 15) / 1_000_000
    batch_cost = (input_tokens * 1.5 + output_tokens * 7.5) / 1_000_000

    print(f"  Normal API:  ${normal_cost:.4f}")
    print(f"  Batch API:   ${batch_cost:.4f}")
    print(f"  Savings:     ${normal_cost - batch_cost:.4f} ({50}%)")


if __name__ == "__main__":
    main()

```


# cache_example.py

```python
#!/usr/bin/env python3
"""
Prompt Caching Example - Save up to 90% on repeated system prompts

This script demonstrates how to use Claude's Prompt Caching
for workloads with repeated system prompts.

Usage:
    python cache_example.py

Requirements:
    pip install anthropic
"""

import anthropic
from typing import List

# Initialize client
client = anthropic.Anthropic()

# Long system prompt (must be >1024 tokens for Sonnet)
# This gets cached and reused across multiple requests
SYSTEM_PROMPT = """You are an expert AI assistant specializing in code review and software development best practices.

## Your Expertise Areas:
1. Code Quality: Clean code principles, SOLID principles, DRY, KISS
2. Security: OWASP Top 10, input validation, authentication, authorization
3. Performance: Algorithm optimization, database queries, caching strategies
4. Architecture: Design patterns, microservices, monoliths, event-driven systems
5. Testing: Unit tests, integration tests, E2E tests, TDD, BDD

## Review Guidelines:
When reviewing code, you should:
- Identify potential bugs and logic errors
- Suggest performance improvements
- Point out security vulnerabilities
- Recommend better naming conventions
- Suggest refactoring opportunities
- Check for proper error handling
- Verify edge cases are handled
- Ensure code follows language-specific best practices

## Response Format:
Structure your reviews as follows:
1. **Summary**: Brief overview of the code's purpose
2. **Strengths**: What the code does well
3. **Issues**: Problems found (Critical/Major/Minor)
4. **Suggestions**: Specific improvements with code examples
5. **Security Notes**: Any security concerns
6. **Performance Notes**: Any performance considerations

## Code Examples Reference:
Here are examples of common patterns you should recommend:

### Python - Error Handling
```python
# Bad
def get_user(id):
    return db.query(f"SELECT * FROM users WHERE id = {id}")

# Good
def get_user(id: int) -> Optional[User]:
    try:
        return db.query(User).filter(User.id == id).first()
    except SQLAlchemyError as e:
        logger.error(f"Database error fetching user {id}: {e}")
        raise UserNotFoundError(f"Could not fetch user {id}")
```

### JavaScript - Async Handling
```javascript
// Bad
function fetchData(url) {
    fetch(url).then(r => r.json()).then(console.log);
}

// Good
async function fetchData(url) {
    try {
        const response = await fetch(url);
        if (!response.ok) throw new Error(`HTTP ${response.status}`);
        return await response.json();
    } catch (error) {
        console.error(`Fetch failed: ${error.message}`);
        throw error;
    }
}
```

Remember to be constructive and educational in your feedback. Explain WHY something is an issue, not just WHAT the issue is.

You have extensive knowledge of:
- Languages: Python, JavaScript, TypeScript, Go, Rust, Java, C++
- Frameworks: React, Vue, Django, FastAPI, Express, Spring Boot
- Databases: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch
- Cloud: AWS, GCP, Azure, Kubernetes, Docker
- Tools: Git, CI/CD, Terraform, Ansible

Always provide actionable feedback with specific code examples when possible.
""" + "\n" * 100  # Padding to ensure >1024 tokens


def review_code_with_caching(code_snippets: List[str]) -> List[str]:
    """
    Review multiple code snippets using cached system prompt.

    First call: Cache write (+25% cost)
    Subsequent calls: Cache read (-90% cost!)
    """
    results = []

    for i, code in enumerate(code_snippets):
        print(f"\n📝 Reviewing snippet {i + 1}/{len(code_snippets)}...")

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=[
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"}  # ← Enable caching!
                }
            ],
            messages=[
                {
                    "role": "user",
                    "content": f"Please review this code:\n\n```\n{code}\n```"
                }
            ]
        )

        # Check cache status from response headers/metadata if available
        results.append(response.content[0].text)

        # Print usage info
        usage = response.usage
        print(f"   Input tokens: {usage.input_tokens}")
        print(f"   Output tokens: {usage.output_tokens}")
        if hasattr(usage, 'cache_creation_input_tokens'):
            print(f"   Cache created: {usage.cache_creation_input_tokens} tokens")
        if hasattr(usage, 'cache_read_input_tokens'):
            print(f"   Cache read: {usage.cache_read_input_tokens} tokens ✅")

    return results


def main():
    # Example code snippets to review
    code_snippets = [
        """
def get_user(id):
    conn = sqlite3.connect('db.sqlite')
    cursor = conn.execute(f"SELECT * FROM users WHERE id = {id}")
    return cursor.fetchone()
""",
        """
async function login(username, password) {
    const user = await db.findUser(username);
    if (user.password === password) {
        return { token: generateToken(user) };
    }
    return { error: 'Invalid credentials' };
}
""",
        """
def calculate_total(items):
    total = 0
    for item in items:
        total = total + item['price'] * item['quantity']
    return total
""",
    ]

    print("=" * 60)
    print("💰 Claude Prompt Caching Example")
    print("   Save up to 90% on repeated system prompts!")
    print("=" * 60)
    print(f"\n📋 System prompt size: ~{len(SYSTEM_PROMPT.split())} words")
    print(f"   Code snippets to review: {len(code_snippets)}")

    # Run reviews with caching
    results = review_code_with_caching(code_snippets)

    # Print results
    print("\n" + "=" * 60)
    print("📊 Review Results")
    print("=" * 60)

    for i, result in enumerate(results):
        print(f"\n--- Snippet {i + 1} Review ---")
        print(result[:500] + "..." if len(result) > 500 else result)

    # Cost comparison
    print("\n" + "=" * 60)
    print("💵 Cost Comparison (estimated)")
    print("=" * 60)

    system_tokens = 2000  # Approximate system prompt tokens
    input_tokens_per_request = 200  # User message tokens
    output_tokens = 500  # Response tokens
    num_requests = len(code_snippets)

    # Normal pricing
    normal_cost = (
        (system_tokens + input_tokens_per_request) * num_requests * 3 +
        output_tokens * num_requests * 15
    ) / 1_000_000

    # Cached pricing (first request writes, rest read)
    cached_cost = (
        # First request: cache write (+25%)
        system_tokens * 3.75 / 1_000_000 +
        input_tokens_per_request * 3 / 1_000_000 +
        output_tokens * 15 / 1_000_000 +
        # Subsequent requests: cache read (-90%)
        (num_requests - 1) * (
            system_tokens * 0.30 / 1_000_000 +  # 90% off!
            input_tokens_per_request * 3 / 1_000_000 +
            output_tokens * 15 / 1_000_000
        )
    )

    savings_pct = (1 - cached_cost / normal_cost) * 100

    print(f"  Normal API:  ${normal_cost:.4f}")
    print(f"  With Cache:  ${cached_cost:.4f}")
    print(f"  Savings:     ${normal_cost - cached_cost:.4f} ({savings_pct:.1f}%)")
    print(f"\n  💡 More requests = more savings!")
    print(f"     At 100 requests: ~{90}% savings")


if __name__ == "__main__":
    main()

```


# calculate_savings.py

```python
#!/usr/bin/env python3
"""
Cost Savings Calculator for Claude API

Calculate potential savings from Batch API, Prompt Caching,
and Extended Thinking optimizations.

Usage:
    python calculate_savings.py
    python calculate_savings.py --input 10000 --output 5000 --requests 100

Requirements:
    No external dependencies (stdlib only)
"""

import argparse
from dataclasses import dataclass
from typing import Optional


# Pricing as of January 2026 (per million tokens)
@dataclass
class Pricing:
    """Claude API pricing tiers."""
    # Sonnet 4.5
    SONNET_INPUT: float = 3.00
    SONNET_OUTPUT: float = 15.00
    SONNET_BATCH_INPUT: float = 1.50
    SONNET_BATCH_OUTPUT: float = 7.50
    SONNET_CACHE_WRITE: float = 3.75
    SONNET_CACHE_READ: float = 0.30

    # Opus 4.5
    OPUS_INPUT: float = 5.00
    OPUS_OUTPUT: float = 25.00
    OPUS_BATCH_INPUT: float = 2.50
    OPUS_BATCH_OUTPUT: float = 12.50

    # Haiku 4.5
    HAIKU_INPUT: float = 1.00
    HAIKU_OUTPUT: float = 5.00
    HAIKU_BATCH_INPUT: float = 0.50
    HAIKU_BATCH_OUTPUT: float = 2.50


def calculate_normal_cost(
    input_tokens: int,
    output_tokens: int,
    requests: int = 1,
    model: str = "sonnet"
) -> float:
    """Calculate cost without any optimization."""
    p = Pricing()

    if model == "sonnet":
        input_price = p.SONNET_INPUT
        output_price = p.SONNET_OUTPUT
    elif model == "opus":
        input_price = p.OPUS_INPUT
        output_price = p.OPUS_OUTPUT
    else:  # haiku
        input_price = p.HAIKU_INPUT
        output_price = p.HAIKU_OUTPUT

    cost = (
        input_tokens * requests * input_price +
        output_tokens * requests * output_price
    ) / 1_000_000

    return cost


def calculate_batch_cost(
    input_tokens: int,
    output_tokens: int,
    requests: int = 1,
    model: str = "sonnet"
) -> float:
    """Calculate cost with Batch API (50% off)."""
    p = Pricing()

    if model == "sonnet":
        input_price = p.SONNET_BATCH_INPUT
        output_price = p.SONNET_BATCH_OUTPUT
    elif model == "opus":
        input_price = p.OPUS_BATCH_INPUT
        output_price = p.OPUS_BATCH_OUTPUT
    else:  # haiku
        input_price = p.HAIKU_BATCH_INPUT
        output_price = p.HAIKU_BATCH_OUTPUT

    cost = (
        input_tokens * requests * input_price +
        output_tokens * requests * output_price
    ) / 1_000_000

    return cost


def calculate_cached_cost(
    input_tokens: int,
    output_tokens: int,
    system_tokens: int,
    requests: int = 1,
    model: str = "sonnet"
) -> float:
    """Calculate cost with Prompt Caching."""
    p = Pricing()

    if model != "sonnet":
        print("⚠️  Cache pricing shown for Sonnet. Other models may vary.")

    # First request: cache write
    first_request = (
        system_tokens * p.SONNET_CACHE_WRITE +
        input_tokens * p.SONNET_INPUT +
        output_tokens * p.SONNET_OUTPUT
    ) / 1_000_000

    # Subsequent requests: cache read
    subsequent = (requests - 1) * (
        system_tokens * p.SONNET_CACHE_READ +
        input_tokens * p.SONNET_INPUT +
        output_tokens * p.SONNET_OUTPUT
    ) / 1_000_000 if requests > 1 else 0

    return first_request + subsequent


def calculate_combined_cost(
    input_tokens: int,
    output_tokens: int,
    system_tokens: int,
    requests: int = 1
) -> float:
    """Calculate cost with both Batch API and Caching."""
    p = Pricing()

    # First request: cache write + batch pricing
    first_request = (
        system_tokens * p.SONNET_CACHE_WRITE +
        input_tokens * p.SONNET_BATCH_INPUT +
        output_tokens * p.SONNET_BATCH_OUTPUT
    ) / 1_000_000

    # Subsequent requests: cache read + batch pricing
    subsequent = (requests - 1) * (
        system_tokens * p.SONNET_CACHE_READ +
        input_tokens * p.SONNET_BATCH_INPUT +
        output_tokens * p.SONNET_BATCH_OUTPUT
    ) / 1_000_000 if requests > 1 else 0

    return first_request + subsequent


def print_report(
    input_tokens: int,
    output_tokens: int,
    system_tokens: int,
    requests: int,
    model: str = "sonnet"
) -> None:
    """Print a detailed savings report."""

    normal = calculate_normal_cost(input_tokens + system_tokens, output_tokens, requests, model)
    batch = calculate_batch_cost(input_tokens + system_tokens, output_tokens, requests, model)
    cached = calculate_cached_cost(input_tokens, output_tokens, system_tokens, requests, model)
    combined = calculate_combined_cost(input_tokens, output_tokens, system_tokens, requests)

    print("=" * 62)
    print("  💰 CLAUDE API COST SAVINGS REPORT")
    print("  🐾 by washinmura.jp")
    print("=" * 62)

    print(f"""
  📊 Your Usage:
     Model:           {model.capitalize()}
     Requests:        {requests:,}
     Input tokens:    {input_tokens:,} per request
     Output tokens:   {output_tokens:,} per request
     System prompt:   {system_tokens:,} tokens (cacheable)
""")

    print("  📈 Cost Comparison:")
    print("  ┌─────────────────────────────────────────────────────────┐")
    print(f"  │ {'Method':<20} │ {'Cost':>12} │ {'Savings':>10} │ {'%':>6} │")
    print("  ├─────────────────────────────────────────────────────────┤")
    print(f"  │ {'Normal API':<20} │ ${normal:>10.4f} │ {'—':>10} │ {'—':>6} │")
    print(f"  │ {'+ Batch API':<20} │ ${batch:>10.4f} │ ${normal-batch:>9.4f} │ {(1-batch/normal)*100:>5.1f}% │")
    print(f"  │ {'+ Prompt Caching':<20} │ ${cached:>10.4f} │ ${normal-cached:>9.4f} │ {(1-cached/normal)*100:>5.1f}% │")
    print(f"  │ {'+ Both (Maximum)':<20} │ ${combined:>10.4f} │ ${normal-combined:>9.4f} │ {(1-combined/normal)*100:>5.1f}% │")
    print("  └─────────────────────────────────────────────────────────┘")

    # Projections
    daily_requests = requests
    monthly = combined * 30
    yearly = combined * 365
    yearly_normal = normal * 365
    yearly_savings = yearly_normal - yearly

    print(f"""
  📅 Projections (at {daily_requests} requests/day):
     Daily:    ${combined:.2f}
     Monthly:  ${monthly:.2f}
     Yearly:   ${yearly:.2f}

  🎉 Yearly Savings: ${yearly_savings:.2f}
     (compared to ${yearly_normal:.2f} without optimization)
""")

    print("=" * 62)
    print("  💡 Tips:")
    print("     • Batch API: Best for tasks that can wait up to 24h")
    print("     • Caching: Best for repeated system prompts >1K tokens")
    print("     • Combined: Maximum savings for bulk processing")
    print("=" * 62)


def main():
    parser = argparse.ArgumentParser(
        description="Calculate Claude API cost savings"
    )
    parser.add_argument(
        "--input", type=int, default=1000,
        help="Input tokens per request (excluding system prompt)"
    )
    parser.add_argument(
        "--output", type=int, default=500,
        help="Output tokens per request"
    )
    parser.add_argument(
        "--system", type=int, default=2000,
        help="System prompt tokens (cacheable)"
    )
    parser.add_argument(
        "--requests", type=int, default=100,
        help="Number of requests"
    )
    parser.add_argument(
        "--model", choices=["sonnet", "opus", "haiku"], default="sonnet",
        help="Claude model to use"
    )

    args = parser.parse_args()

    print_report(
        input_tokens=args.input,
        output_tokens=args.output,
        system_tokens=args.system,
        requests=args.requests,
        model=args.model
    )


if __name__ == "__main__":
    main()

```