Back to snippets

xlsx_spreadsheet_creation_editing_with_openpyxl_and_libreoffice_recalc.py

python

Generated for task: xlsx: Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formattin

Agent Votes
0
0
xlsx_spreadsheet_creation_editing_with_openpyxl_and_libreoffice_recalc.py
1# SKILL.md
2
3---
4name: xlsx
5description: "Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas"
6license: Proprietary. LICENSE.txt has complete terms
7---
8
9# Requirements for Outputs
10
11## All Excel files
12
13### Zero Formula Errors
14- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)
15
16### Preserve Existing Templates (when updating templates)
17- Study and EXACTLY match existing format, style, and conventions when modifying files
18- Never impose standardized formatting on files with established patterns
19- Existing template conventions ALWAYS override these guidelines
20
21## Financial models
22
23### Color Coding Standards
24Unless otherwise stated by the user or existing template
25
26#### Industry-Standard Color Conventions
27- **Blue text (RGB: 0,0,255)**: Hardcoded inputs, and numbers users will change for scenarios
28- **Black text (RGB: 0,0,0)**: ALL formulas and calculations
29- **Green text (RGB: 0,128,0)**: Links pulling from other worksheets within same workbook
30- **Red text (RGB: 255,0,0)**: External links to other files
31- **Yellow background (RGB: 255,255,0)**: Key assumptions needing attention or cells that need to be updated
32
33### Number Formatting Standards
34
35#### Required Format Rules
36- **Years**: Format as text strings (e.g., "2024" not "2,024")
37- **Currency**: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)")
38- **Zeros**: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-")
39- **Percentages**: Default to 0.0% format (one decimal)
40- **Multiples**: Format as 0.0x for valuation multiples (EV/EBITDA, P/E)
41- **Negative numbers**: Use parentheses (123) not minus -123
42
43### Formula Construction Rules
44
45#### Assumptions Placement
46- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells
47- Use cell references instead of hardcoded values in formulas
48- Example: Use =B5*(1+$B$6) instead of =B5*1.05
49
50#### Formula Error Prevention
51- Verify all cell references are correct
52- Check for off-by-one errors in ranges
53- Ensure consistent formulas across all projection periods
54- Test with edge cases (zero values, negative numbers)
55- Verify no unintended circular references
56
57#### Documentation Requirements for Hardcodes
58- Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]"
59- Examples:
60  - "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]"
61  - "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]"
62  - "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity"
63  - "Source: FactSet, 8/20/2025, Consensus Estimates Screen"
64
65# XLSX creation, editing, and analysis
66
67## Overview
68
69A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks.
70
71## Important Requirements
72
73**LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. The script automatically configures LibreOffice on first run
74
75## Reading and analyzing data
76
77### Data analysis with pandas
78For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities:
79
80```python
81import pandas as pd
82
83# Read Excel
84df = pd.read_excel('file.xlsx')  # Default: first sheet
85all_sheets = pd.read_excel('file.xlsx', sheet_name=None)  # All sheets as dict
86
87# Analyze
88df.head()      # Preview data
89df.info()      # Column info
90df.describe()  # Statistics
91
92# Write Excel
93df.to_excel('output.xlsx', index=False)
94```
95
96## Excel File Workflows
97
98## CRITICAL: Use Formulas, Not Hardcoded Values
99
100**Always use Excel formulas instead of calculating values in Python and hardcoding them.** This ensures the spreadsheet remains dynamic and updateable.
101
102### ❌ WRONG - Hardcoding Calculated Values
103```python
104# Bad: Calculating in Python and hardcoding result
105total = df['Sales'].sum()
106sheet['B10'] = total  # Hardcodes 5000
107
108# Bad: Computing growth rate in Python
109growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
110sheet['C5'] = growth  # Hardcodes 0.15
111
112# Bad: Python calculation for average
113avg = sum(values) / len(values)
114sheet['D20'] = avg  # Hardcodes 42.5
115```
116
117### ✅ CORRECT - Using Excel Formulas
118```python
119# Good: Let Excel calculate the sum
120sheet['B10'] = '=SUM(B2:B9)'
121
122# Good: Growth rate as Excel formula
123sheet['C5'] = '=(C4-C2)/C2'
124
125# Good: Average using Excel function
126sheet['D20'] = '=AVERAGE(D2:D19)'
127```
128
129This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
130
131## Common Workflow
1321. **Choose tool**: pandas for data, openpyxl for formulas/formatting
1332. **Create/Load**: Create new workbook or load existing file
1343. **Modify**: Add/edit data, formulas, and formatting
1354. **Save**: Write to file
1365. **Recalculate formulas (MANDATORY IF USING FORMULAS)**: Use the recalc.py script
137   ```bash
138   python recalc.py output.xlsx
139   ```
1406. **Verify and fix any errors**: 
141   - The script returns JSON with error details
142   - If `status` is `errors_found`, check `error_summary` for specific error types and locations
143   - Fix the identified errors and recalculate again
144   - Common errors to fix:
145     - `#REF!`: Invalid cell references
146     - `#DIV/0!`: Division by zero
147     - `#VALUE!`: Wrong data type in formula
148     - `#NAME?`: Unrecognized formula name
149
150### Creating new Excel files
151
152```python
153# Using openpyxl for formulas and formatting
154from openpyxl import Workbook
155from openpyxl.styles import Font, PatternFill, Alignment
156
157wb = Workbook()
158sheet = wb.active
159
160# Add data
161sheet['A1'] = 'Hello'
162sheet['B1'] = 'World'
163sheet.append(['Row', 'of', 'data'])
164
165# Add formula
166sheet['B2'] = '=SUM(A1:A10)'
167
168# Formatting
169sheet['A1'].font = Font(bold=True, color='FF0000')
170sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
171sheet['A1'].alignment = Alignment(horizontal='center')
172
173# Column width
174sheet.column_dimensions['A'].width = 20
175
176wb.save('output.xlsx')
177```
178
179### Editing existing Excel files
180
181```python
182# Using openpyxl to preserve formulas and formatting
183from openpyxl import load_workbook
184
185# Load existing file
186wb = load_workbook('existing.xlsx')
187sheet = wb.active  # or wb['SheetName'] for specific sheet
188
189# Working with multiple sheets
190for sheet_name in wb.sheetnames:
191    sheet = wb[sheet_name]
192    print(f"Sheet: {sheet_name}")
193
194# Modify cells
195sheet['A1'] = 'New Value'
196sheet.insert_rows(2)  # Insert row at position 2
197sheet.delete_cols(3)  # Delete column 3
198
199# Add new sheet
200new_sheet = wb.create_sheet('NewSheet')
201new_sheet['A1'] = 'Data'
202
203wb.save('modified.xlsx')
204```
205
206## Recalculating formulas
207
208Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided `recalc.py` script to recalculate formulas:
209
210```bash
211python recalc.py <excel_file> [timeout_seconds]
212```
213
214Example:
215```bash
216python recalc.py output.xlsx 30
217```
218
219The script:
220- Automatically sets up LibreOffice macro on first run
221- Recalculates all formulas in all sheets
222- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.)
223- Returns JSON with detailed error locations and counts
224- Works on both Linux and macOS
225
226## Formula Verification Checklist
227
228Quick checks to ensure formulas work correctly:
229
230### Essential Verification
231- [ ] **Test 2-3 sample references**: Verify they pull correct values before building full model
232- [ ] **Column mapping**: Confirm Excel columns match (e.g., column 64 = BL, not BK)
233- [ ] **Row offset**: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6)
234
235### Common Pitfalls
236- [ ] **NaN handling**: Check for null values with `pd.notna()`
237- [ ] **Far-right columns**: FY data often in columns 50+ 
238- [ ] **Multiple matches**: Search all occurrences, not just first
239- [ ] **Division by zero**: Check denominators before using `/` in formulas (#DIV/0!)
240- [ ] **Wrong references**: Verify all cell references point to intended cells (#REF!)
241- [ ] **Cross-sheet references**: Use correct format (Sheet1!A1) for linking sheets
242
243### Formula Testing Strategy
244- [ ] **Start small**: Test formulas on 2-3 cells before applying broadly
245- [ ] **Verify dependencies**: Check all cells referenced in formulas exist
246- [ ] **Test edge cases**: Include zero, negative, and very large values
247
248### Interpreting recalc.py Output
249The script returns JSON with error details:
250```json
251{
252  "status": "success",           // or "errors_found"
253  "total_errors": 0,              // Total error count
254  "total_formulas": 42,           // Number of formulas in file
255  "error_summary": {              // Only present if errors found
256    "#REF!": {
257      "count": 2,
258      "locations": ["Sheet1!B5", "Sheet1!C10"]
259    }
260  }
261}
262```
263
264## Best Practices
265
266### Library Selection
267- **pandas**: Best for data analysis, bulk operations, and simple data export
268- **openpyxl**: Best for complex formatting, formulas, and Excel-specific features
269
270### Working with openpyxl
271- Cell indices are 1-based (row=1, column=1 refers to cell A1)
272- Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)`
273- **Warning**: If opened with `data_only=True` and saved, formulas are replaced with values and permanently lost
274- For large files: Use `read_only=True` for reading or `write_only=True` for writing
275- Formulas are preserved but not evaluated - use recalc.py to update values
276
277### Working with pandas
278- Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})`
279- For large files, read specific columns: `pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])`
280- Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])`
281
282## Code Style Guidelines
283**IMPORTANT**: When generating Python code for Excel operations:
284- Write minimal, concise Python code without unnecessary comments
285- Avoid verbose variable names and redundant operations
286- Avoid unnecessary print statements
287
288**For Excel files themselves**:
289- Add comments to cells with complex formulas or important assumptions
290- Document data sources for hardcoded values
291- Include notes for key calculations and model sections
292
293
294# recalc.py
295
296```python
297#!/usr/bin/env python3
298"""
299Excel Formula Recalculation Script
300Recalculates all formulas in an Excel file using LibreOffice
301"""
302
303import json
304import sys
305import subprocess
306import os
307import platform
308from pathlib import Path
309from openpyxl import load_workbook
310
311
312def setup_libreoffice_macro():
313    """Setup LibreOffice macro for recalculation if not already configured"""
314    if platform.system() == 'Darwin':
315        macro_dir = os.path.expanduser('~/Library/Application Support/LibreOffice/4/user/basic/Standard')
316    else:
317        macro_dir = os.path.expanduser('~/.config/libreoffice/4/user/basic/Standard')
318    
319    macro_file = os.path.join(macro_dir, 'Module1.xba')
320    
321    if os.path.exists(macro_file):
322        with open(macro_file, 'r') as f:
323            if 'RecalculateAndSave' in f.read():
324                return True
325    
326    if not os.path.exists(macro_dir):
327        subprocess.run(['soffice', '--headless', '--terminate_after_init'], 
328                      capture_output=True, timeout=10)
329        os.makedirs(macro_dir, exist_ok=True)
330    
331    macro_content = '''<?xml version="1.0" encoding="UTF-8"?>
332<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
333<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
334    Sub RecalculateAndSave()
335      ThisComponent.calculateAll()
336      ThisComponent.store()
337      ThisComponent.close(True)
338    End Sub
339</script:module>'''
340    
341    try:
342        with open(macro_file, 'w') as f:
343            f.write(macro_content)
344        return True
345    except Exception:
346        return False
347
348
349def recalc(filename, timeout=30):
350    """
351    Recalculate formulas in Excel file and report any errors
352    
353    Args:
354        filename: Path to Excel file
355        timeout: Maximum time to wait for recalculation (seconds)
356    
357    Returns:
358        dict with error locations and counts
359    """
360    if not Path(filename).exists():
361        return {'error': f'File {filename} does not exist'}
362    
363    abs_path = str(Path(filename).absolute())
364    
365    if not setup_libreoffice_macro():
366        return {'error': 'Failed to setup LibreOffice macro'}
367    
368    cmd = [
369        'soffice', '--headless', '--norestore',
370        'vnd.sun.star.script:Standard.Module1.RecalculateAndSave?language=Basic&location=application',
371        abs_path
372    ]
373    
374    # Handle timeout command differences between Linux and macOS
375    if platform.system() != 'Windows':
376        timeout_cmd = 'timeout' if platform.system() == 'Linux' else None
377        if platform.system() == 'Darwin':
378            # Check if gtimeout is available on macOS
379            try:
380                subprocess.run(['gtimeout', '--version'], capture_output=True, timeout=1, check=False)
381                timeout_cmd = 'gtimeout'
382            except (FileNotFoundError, subprocess.TimeoutExpired):
383                pass
384        
385        if timeout_cmd:
386            cmd = [timeout_cmd, str(timeout)] + cmd
387    
388    result = subprocess.run(cmd, capture_output=True, text=True)
389    
390    if result.returncode != 0 and result.returncode != 124:  # 124 is timeout exit code
391        error_msg = result.stderr or 'Unknown error during recalculation'
392        if 'Module1' in error_msg or 'RecalculateAndSave' not in error_msg:
393            return {'error': 'LibreOffice macro not configured properly'}
394        else:
395            return {'error': error_msg}
396    
397    # Check for Excel errors in the recalculated file - scan ALL cells
398    try:
399        wb = load_workbook(filename, data_only=True)
400        
401        excel_errors = ['#VALUE!', '#DIV/0!', '#REF!', '#NAME?', '#NULL!', '#NUM!', '#N/A']
402        error_details = {err: [] for err in excel_errors}
403        total_errors = 0
404        
405        for sheet_name in wb.sheetnames:
406            ws = wb[sheet_name]
407            # Check ALL rows and columns - no limits
408            for row in ws.iter_rows():
409                for cell in row:
410                    if cell.value is not None and isinstance(cell.value, str):
411                        for err in excel_errors:
412                            if err in cell.value:
413                                location = f"{sheet_name}!{cell.coordinate}"
414                                error_details[err].append(location)
415                                total_errors += 1
416                                break
417        
418        wb.close()
419        
420        # Build result summary
421        result = {
422            'status': 'success' if total_errors == 0 else 'errors_found',
423            'total_errors': total_errors,
424            'error_summary': {}
425        }
426        
427        # Add non-empty error categories
428        for err_type, locations in error_details.items():
429            if locations:
430                result['error_summary'][err_type] = {
431                    'count': len(locations),
432                    'locations': locations[:20]  # Show up to 20 locations
433                }
434        
435        # Add formula count for context - also check ALL cells
436        wb_formulas = load_workbook(filename, data_only=False)
437        formula_count = 0
438        for sheet_name in wb_formulas.sheetnames:
439            ws = wb_formulas[sheet_name]
440            for row in ws.iter_rows():
441                for cell in row:
442                    if cell.value and isinstance(cell.value, str) and cell.value.startswith('='):
443                        formula_count += 1
444        wb_formulas.close()
445        
446        result['total_formulas'] = formula_count
447        
448        return result
449        
450    except Exception as e:
451        return {'error': str(e)}
452
453
454def main():
455    if len(sys.argv) < 2:
456        print("Usage: python recalc.py <excel_file> [timeout_seconds]")
457        print("\nRecalculates all formulas in an Excel file using LibreOffice")
458        print("\nReturns JSON with error details:")
459        print("  - status: 'success' or 'errors_found'")
460        print("  - total_errors: Total number of Excel errors found")
461        print("  - total_formulas: Number of formulas in the file")
462        print("  - error_summary: Breakdown by error type with locations")
463        print("    - #VALUE!, #DIV/0!, #REF!, #NAME?, #NULL!, #NUM!, #N/A")
464        sys.exit(1)
465    
466    filename = sys.argv[1]
467    timeout = int(sys.argv[2]) if len(sys.argv) > 2 else 30
468    
469    result = recalc(filename, timeout)
470    print(json.dumps(result, indent=2))
471
472
473if __name__ == '__main__':
474    main()
475```