geo_checker_ai_citation_readiness_audit_for_web_pages.py

Generated for task: geo-fundamentals: Generative Engine Optimization for AI search engines (ChatGPT, Claude, Perplexity)
# SKILL.md

---
name: geo-fundamentals
description: Generative Engine Optimization for AI search engines (ChatGPT, Claude, Perplexity).
allowed-tools: Read, Glob, Grep
---

# GEO Fundamentals

> Optimization for AI-powered search engines.

---

## 1. What is GEO?

**GEO** = Generative Engine Optimization

| Goal | Platform |
|------|----------|
| Be cited in AI responses | ChatGPT, Claude, Perplexity, Gemini |

### SEO vs GEO

| Aspect | SEO | GEO |
|--------|-----|-----|
| Goal | #1 ranking | AI citations |
| Platform | Google | AI engines |
| Metrics | Rankings, CTR | Citation rate |
| Focus | Keywords | Entities, data |

---

## 2. AI Engine Landscape

| Engine | Citation Style | Opportunity |
|--------|----------------|-------------|
| **Perplexity** | Numbered [1][2] | Highest citation rate |
| **ChatGPT** | Inline/footnotes | Custom GPTs |
| **Claude** | Contextual | Long-form content |
| **Gemini** | Sources section | SEO crossover |

---

## 3. RAG Retrieval Factors

How AI engines select content to cite:

| Factor | Weight |
|--------|--------|
| Semantic relevance | ~40% |
| Keyword match | ~20% |
| Authority signals | ~15% |
| Freshness | ~10% |
| Source diversity | ~15% |

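The rough weights above can be folded into a single retrieval score. A minimal sketch as a mental model only: the weights mirror the table's estimates, and the 0.0-1.0 factor values are illustrative assumptions, not any engine's published formula.

```python
# Illustrative RAG retrieval score: a weighted sum of normalized factors.
# Weights follow the rough estimates in the table above; real engines do
# not publish their ranking formulas, so treat this as a mental model.
RAG_WEIGHTS = {
    "semantic_relevance": 0.40,
    "keyword_match": 0.20,
    "authority": 0.15,
    "freshness": 0.10,
    "source_diversity": 0.15,
}

def retrieval_score(factors: dict) -> float:
    """Combine factor scores (each normalized to 0.0-1.0) into one value."""
    return sum(RAG_WEIGHTS[name] * factors.get(name, 0.0) for name in RAG_WEIGHTS)

# Hypothetical page: strong relevance and freshness, weaker diversity.
page = {"semantic_relevance": 0.9, "keyword_match": 0.5,
        "authority": 0.7, "freshness": 1.0, "source_diversity": 0.4}
print(round(retrieval_score(page), 3))  # → 0.725
```

The takeaway matches the table: semantic relevance dominates, so matching the query's meaning buys more than exact-keyword stuffing.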
---

## 4. Content That Gets Cited

| Element | Why It Works |
|---------|--------------|
| **Original statistics** | Unique, citable data |
| **Expert quotes** | Authority transfer |
| **Clear definitions** | Easy to extract |
| **Step-by-step guides** | Actionable value |
| **Comparison tables** | Structured info |
| **FAQ sections** | Direct answers |

---

## 5. GEO Content Checklist

### Content Elements

- [ ] Question-based titles
- [ ] Summary/TL;DR at top
- [ ] Original data with sources
- [ ] Expert quotes (name, title)
- [ ] FAQ section (3-5 Q&A)
- [ ] Clear definitions
- [ ] "Last updated" timestamp
- [ ] Author with credentials

### Technical Elements

- [ ] Article schema with dates
- [ ] Person schema for author
- [ ] FAQPage schema
- [ ] Fast loading (< 2.5s)
- [ ] Clean HTML structure

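The Article, Person, and date items above can live in one JSON-LD block in the page `<head>`. A minimal sketch using standard schema.org types; the headline, names, and dates are placeholders:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "What Is Generative Engine Optimization?",
  "datePublished": "2025-01-15",
  "dateModified": "2025-03-01",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "Head of SEO"
  }
}
</script>
```

An FAQ section gets its own block with `"@type": "FAQPage"` and a `mainEntity` list of `Question`/`Answer` pairs.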
---

## 6. Entity Building

| Action | Purpose |
|--------|---------|
| Google Knowledge Panel | Entity recognition |
| Wikipedia (if notable) | Authority source |
| Consistent info across web | Entity consolidation |
| Industry mentions | Authority signals |

---

## 7. AI Crawler Access

### Key AI User-Agents

| Crawler | Engine |
|---------|--------|
| GPTBot | ChatGPT/OpenAI |
| Claude-Web | Claude |
| PerplexityBot | Perplexity |
| Googlebot | Gemini (shared) |

### Access Decision

| Strategy | When |
|----------|------|
| Allow all | Want AI citations |
| Block GPTBot | Don't want OpenAI training |
| Selective | Allow some, block others |

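The "selective" strategy translates directly into `robots.txt` rules. An illustrative example that blocks OpenAI's crawler while allowing the others; user-agent tokens change over time, so verify each against the vendor's current documentation before deploying:

```text
# robots.txt — selective AI crawler access (illustrative)
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: *
Allow: /
```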
---

## 8. Measurement

| Metric | How to Track |
|--------|--------------|
| AI citations | Manual monitoring |
| "According to [Brand]" mentions | Search in AI |
| Competitor citations | Compare share |
| AI-referred traffic | UTM parameters |

---

## 9. Anti-Patterns

| ❌ Don't | ✅ Do |
|----------|-------|
| Publish without dates | Add timestamps |
| Vague attributions | Name sources |
| Skip author info | Show credentials |
| Thin content | Comprehensive coverage |

---

> **Remember:** AI cites content that's clear, authoritative, and easy to extract. Be the best answer.

---

## Script

| Script | Purpose | Command |
|--------|---------|---------|
| `scripts/geo_checker.py` | GEO audit (AI citation readiness) | `python scripts/geo_checker.py <project_path>` |

# geo_checker.py

```python
#!/usr/bin/env python3
"""
GEO Checker - Generative Engine Optimization Audit
Checks PUBLIC WEB CONTENT for AI citation readiness.

PURPOSE:
    - Analyze pages that will be INDEXED by AI engines (ChatGPT, Perplexity, etc.)
    - Check for structured data, author info, dates, FAQ sections
    - Help content rank in AI-generated answers

WHAT IT CHECKS:
    - HTML files (actual web pages)
    - JSX/TSX files (React page components)
    - NOT markdown files (those are developer docs, not public content)

Usage:
    python geo_checker.py <project_path>
"""
import sys
import re
import json
from pathlib import Path

# Fix Windows console encoding
try:
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')
    sys.stderr.reconfigure(encoding='utf-8', errors='replace')
except AttributeError:
    pass


# Directories to skip (not public content)
SKIP_DIRS = {
    'node_modules', '.next', 'dist', 'build', '.git', '.github',
    '__pycache__', '.vscode', '.idea', 'coverage', 'test', 'tests',
    '__tests__', 'spec', 'docs', 'documentation'
}

# Files to skip (not public pages)
SKIP_FILES = {
    'jest.config', 'webpack.config', 'vite.config', 'tsconfig',
    'package.json', 'package-lock', 'yarn.lock', '.eslintrc',
    'tailwind.config', 'postcss.config', 'next.config'
}


def is_page_file(file_path: Path) -> bool:
    """Check if this file is likely a public-facing page."""
    name = file_path.stem.lower()
    full_name = file_path.name.lower()

    # Skip config/utility files. Match against the full filename:
    # Path.stem drops the last suffix, so an entry like 'package.json'
    # would never match the stem 'package'.
    if any(skip in full_name for skip in SKIP_FILES):
        return False

    # Skip test files
    if name.endswith('.test') or name.endswith('.spec'):
        return False
    if name.startswith('test_') or name.startswith('spec_'):
        return False

    # Likely page indicators
    page_indicators = ['page', 'index', 'home', 'about', 'contact', 'blog',
                       'post', 'article', 'product', 'service', 'landing']

    # Check if it's in a pages/app directory (Next.js, etc.)
    parts = [p.lower() for p in file_path.parts]
    if 'pages' in parts or 'app' in parts or 'routes' in parts:
        return True

    # Check filename indicators
    if any(ind in name for ind in page_indicators):
        return True

    # HTML files are usually pages
    if file_path.suffix.lower() == '.html':
        return True

    return False


def find_web_pages(project_path: Path) -> list:
    """Find public-facing web pages only."""
    patterns = ['**/*.html', '**/*.htm', '**/*.jsx', '**/*.tsx']

    files = []
    for pattern in patterns:
        for f in project_path.glob(pattern):
            # Skip excluded directories
            if any(skip in f.parts for skip in SKIP_DIRS):
                continue

            # Check if it's likely a page
            if is_page_file(f):
                files.append(f)

    return files[:30]  # Limit to 30 pages


def check_page(file_path: Path) -> dict:
    """Check a single web page for GEO elements."""
    try:
        content = file_path.read_text(encoding='utf-8', errors='ignore')
    except Exception as e:
        return {'file': file_path.name, 'passed': [], 'issues': [f"Error: {e}"], 'score': 0}

    issues = []
    passed = []

    # 1. JSON-LD Structured Data (Critical for AI)
    if 'application/ld+json' in content:
        passed.append("JSON-LD structured data found")
        if '"@type"' in content:
            if 'Article' in content:
                passed.append("Article schema present")
            if 'FAQPage' in content:
                passed.append("FAQ schema present")
            if 'Organization' in content or 'Person' in content:
                passed.append("Entity schema present")
    else:
        issues.append("No JSON-LD structured data (AI engines prefer structured content)")

    # 2. Heading Structure
    h1_count = len(re.findall(r'<h1[^>]*>', content, re.I))
    h2_count = len(re.findall(r'<h2[^>]*>', content, re.I))

    if h1_count == 1:
        passed.append("Single H1 heading (clear topic)")
    elif h1_count == 0:
        issues.append("No H1 heading - page topic unclear")
    else:
        issues.append(f"Multiple H1 headings ({h1_count}) - confusing for AI")

    if h2_count >= 2:
        passed.append(f"{h2_count} H2 subheadings (good structure)")
    else:
        issues.append("Add more H2 subheadings for scannable content")

    # 3. Author Attribution (E-E-A-T signal)
    author_patterns = ['author', 'byline', 'written-by', 'contributor', 'rel="author"']
    has_author = any(p in content.lower() for p in author_patterns)
    if has_author:
        passed.append("Author attribution found")
    else:
        issues.append("No author info (AI prefers attributed content)")

    # 4. Publication Date (Freshness signal)
    date_patterns = ['datePublished', 'dateModified', 'datetime=', 'pubdate', 'article:published']
    has_date = any(re.search(p, content, re.I) for p in date_patterns)
    if has_date:
        passed.append("Publication date found")
    else:
        issues.append("No publication date (freshness matters for AI)")

    # 5. FAQ Section (Highly citable)
    faq_patterns = [r'<details', r'faq', r'frequently.?asked', r'"FAQPage"']
    has_faq = any(re.search(p, content, re.I) for p in faq_patterns)
    if has_faq:
        passed.append("FAQ section detected (highly citable)")

    # 6. Lists (Structured content)
    list_count = len(re.findall(r'<(ul|ol)[^>]*>', content, re.I))
    if list_count >= 2:
        passed.append(f"{list_count} lists (structured content)")

    # 7. Tables (Comparison data)
    table_count = len(re.findall(r'<table[^>]*>', content, re.I))
    if table_count >= 1:
        passed.append(f"{table_count} table(s) (comparison data)")

    # 8. Entity Recognition (E-E-A-T signal) - NEW 2025
    entity_patterns = [
        r'"@type"\s*:\s*"Organization"',
        r'"@type"\s*:\s*"LocalBusiness"',
        r'"@type"\s*:\s*"Brand"',
        r'itemtype.*schema\.org/(Organization|Person|Brand)',
        r'rel="author"'
    ]
    has_entity = any(re.search(p, content, re.I) for p in entity_patterns)
    if has_entity:
        passed.append("Entity/Brand recognition (E-E-A-T)")

    # 9. Original Statistics/Data (AI citation magnet) - NEW 2025
    stat_patterns = [
        r'\d+%',                        # Percentages
        r'\$[\d,]+',                    # Dollar amounts
        r'study\s+(shows|found)',       # Research citations
        r'according to',                # Source attribution
        r'data\s+(shows|reveals)',      # Data-backed claims
        r'\d+x\s+(faster|better|more)', # Comparison stats
        r'(million|billion|trillion)',  # Large numbers
    ]
    stat_matches = sum(1 for p in stat_patterns if re.search(p, content, re.I))
    if stat_matches >= 2:
        passed.append("Original statistics/data (citation magnet)")

    # 10. Conversational/Direct answers - NEW 2025
    direct_answer_patterns = [
        r'is defined as',
        r'refers to',
        r'means that',
        r'the answer is',
        r'in short,',
        r'simply put,',
        r'<dfn'
    ]
    has_direct = any(re.search(p, content, re.I) for p in direct_answer_patterns)
    if has_direct:
        passed.append("Direct answer patterns (LLM-friendly)")

    # Calculate score
    total = len(passed) + len(issues)
    score = (len(passed) / total * 100) if total > 0 else 0

    return {
        'file': file_path.name,
        'passed': passed,
        'issues': issues,
        'score': round(score)
    }


def main():
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    target_path = Path(target).resolve()

    print("\n" + "=" * 60)
    print("  GEO CHECKER - AI Citation Readiness Audit")
    print("=" * 60)
    print(f"Project: {target_path}")
    print("-" * 60)

    # Find web pages only
    pages = find_web_pages(target_path)

    if not pages:
        print("\n[!] No public web pages found.")
        print("    Looking for: HTML, JSX, TSX files in pages/app directories")
        print("    Skipping: docs, tests, config files, node_modules")
        output = {"script": "geo_checker", "pages_found": 0, "passed": True}
        print("\n" + json.dumps(output, indent=2))
        sys.exit(0)

    print(f"Found {len(pages)} public pages to analyze\n")

    # Check each page
    results = []
    for page in pages:
        result = check_page(page)
        results.append(result)

    # Print results
    for result in results:
        status = "[OK]" if result['score'] >= 60 else "[!]"
        print(f"{status} {result['file']}: {result['score']}%")
        if result['issues'] and result['score'] < 60:
            for issue in result['issues'][:2]:  # Show max 2 issues
                print(f"    - {issue}")

    # Average score
    avg_score = sum(r['score'] for r in results) / len(results) if results else 0

    print("\n" + "=" * 60)
    print(f"AVERAGE GEO SCORE: {avg_score:.0f}%")
    print("=" * 60)

    if avg_score >= 80:
        print("[OK] Excellent - Content well-optimized for AI citations")
    elif avg_score >= 60:
        print("[OK] Good - Some improvements recommended")
    elif avg_score >= 40:
        print("[!] Needs work - Add structured elements")
    else:
        print("[X] Poor - Content needs GEO optimization")

    # JSON output
    output = {
        "script": "geo_checker",
        "project": str(target_path),
        "pages_checked": len(results),
        "average_score": round(avg_score),
        "passed": avg_score >= 60
    }
    print("\n" + json.dumps(output, indent=2))

    sys.exit(0 if avg_score >= 60 else 1)


if __name__ == "__main__":
    main()

```
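As a quick sanity check of the detection logic, the first two checks can be exercised standalone against an inline HTML snippet (this re-implements the same substring and regex tests the script runs, rather than importing `geo_checker.py`):

```python
import re

# A tiny page with one JSON-LD block, one H1, and two H2 headings.
SAMPLE = """<html><head>
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body>
<h1>What is GEO?</h1>
<h2>Definition</h2><h2>Checklist</h2>
</body></html>"""

# Check 1: JSON-LD structured data is a plain substring test.
has_jsonld = 'application/ld+json' in SAMPLE

# Check 2: heading structure via the same regexes the script uses.
h1_count = len(re.findall(r'<h1[^>]*>', SAMPLE, re.I))
h2_count = len(re.findall(r'<h2[^>]*>', SAMPLE, re.I))

print(has_jsonld, h1_count, h2_count)  # → True 1 2
```

This page would pass checks 1 and 2 outright: structured data present, a single H1, and at least two H2 subheadings.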