Making My Static Blog AI-Discoverable: A Complete AEO Implementation Guide
Search is changing. When someone asks Perplexity "how to build hybrid search for static blogs," or when ChatGPT browses the web to answer a RAG question, they're not using Google's PageRank — they're using AI crawlers that fetch, parse, and synthesize web content directly. If your site isn't optimized for these AI engines, your content might as well not exist in this new discovery paradigm.
After building yuxu.ge as a pure static site (no frameworks, just standards), I realized my content was invisible to AI crawlers for two reasons:
- The homepage renders articles via JavaScript — AI bots that don't execute JS see an empty page
- No machine-readable metadata existed — no sitemap, no structured data, no content feeds
This post documents how I solved both problems with a single Node.js build script that generates all the AEO (AI Engine Optimization) and SEO artifacts automatically on every deployment.
What Gets Generated
One script, `build-aeo.js`, produces everything:
| File | Purpose | Size |
|---|---|---|
| `sitemap.xml` | Standard XML sitemap with hreflang | 48K |
| `robots.txt` | Search engine + AI crawler permissions | <1K |
| `llms.txt` | AI-readable site summary | 28K |
| `llms-full.txt` | Complete content dump | 2.1M |
| `feed.xml` | Atom 1.0 feed (top 20, full content) | 264K |
| `urls.txt` | URL list for Baidu push API | <2K |
| JSON-LD | `BlogPosting` schema in every article | injected |
| SEO meta | OG, Twitter Card, canonical, hreflang | injected |
| `<noscript>` | Static article list on homepage | injected |
The Architecture
The script reads from two sources:
- `blog/posts.json` — 141 posts with metadata (title, date, tags, description, language variants)
- `_content/posts/` — Raw markdown source files
It writes static files to the root directory and injects metadata directly into existing HTML files. The entire process is idempotent — old injections are stripped before new ones are added, so it's safe to run on every build.
Integration into the build pipeline is one line in `build.sh`:

```shell
# Build static HTML (for crawlers)
node $TOOLS_DIR/build.js

# Build AEO & SEO
node $TOOLS_DIR/build-aeo.js
```
1. robots.txt — Rolling Out the Red Carpet
Most sites are blocking AI crawlers. I'm doing the opposite — explicitly welcoming them:
```
User-agent: *
Allow: /

# Search Engine Crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Baiduspider
Allow: /

User-agent: YandexBot
Allow: /

# AI Crawlers - explicitly allowed
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

Sitemap: https://yuxu.ge/sitemap.xml
```
This sends a clear signal to both traditional search engines and AI crawlers: my content is open and available for indexing, training, and synthesis.
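Generating this file is straightforward string assembly. A minimal sketch, where `buildRobots` and the two crawler arrays are hypothetical names of mine, not necessarily the script's:

```javascript
// Crawler lists mirror the robots.txt output shown above.
const SEARCH_CRAWLERS = ['Googlebot', 'Bingbot', 'Baiduspider', 'YandexBot'];
const AI_CRAWLERS = [
  'GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended',
  'Bytespider', 'CCBot', 'Applebot-Extended', 'cohere-ai',
];

// Build the full robots.txt text: a wildcard rule, then per-bot Allow
// blocks, then the sitemap pointer.
function buildRobots(siteUrl) {
  const allow = (bots) =>
    bots.map((b) => `User-agent: ${b}\nAllow: /\n`).join('\n');
  return [
    'User-agent: *\nAllow: /\n',
    '# Search Engine Crawlers',
    allow(SEARCH_CRAWLERS),
    '# AI Crawlers - explicitly allowed',
    allow(AI_CRAWLERS),
    `Sitemap: ${siteUrl}/sitemap.xml`,
  ].join('\n');
}

const robots = buildRobots('https://yuxu.ge');
// e.g. fs.writeFileSync('robots.txt', robots);
```

Keeping the bot names in data rather than in a template string makes it trivial to add the next AI crawler that shows up in the access logs.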
2. llms.txt — A README for AI Crawlers
The llms.txt convention provides a structured, human-and-AI-readable summary of your site. Think of it as a README.md that AI systems read first to understand what your site is about.
```
# Yuxu Ge - Senior AI Architect & Researcher

> Personal technical blog covering RAG systems, search architecture,
> neural networks, AI-assisted programming, and distributed systems.

## Author
- Name: Yuxu Ge
- Role: Senior AI Architect | MSc AI candidate at University of York
- Expertise: Information Retrieval, RAG Systems, Search Infrastructure

## Featured Articles (High Priority)
- [The "Green Trap" in RAG Systems](https://yuxu.ge/blog/...): Deep analysis...
- [Building Hybrid Search for Static Blogs](https://yuxu.ge/blog/...): Production-ready...

## Recent Articles
(all articles listed with URLs and descriptions)

## Blog Index
- Blog Home: https://yuxu.ge/blog/
- RSS/Atom Feed: https://yuxu.ge/feed.xml
- Sitemap: https://yuxu.ge/sitemap.xml
```
I also generate llms-full.txt — a 2.1MB plain text dump of every article's complete content. This lets an AI system ingest my entire blog in a single HTTP request instead of crawling 159 individual pages.
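The full dump is simple concatenation of the raw markdown sources. A hedged sketch, where `buildLlmsFull` and its input shape are my assumptions rather than the script's actual code:

```javascript
// Concatenate every post's raw markdown into one plain-text document,
// with a heading and metadata line per post and dividers between posts.
function buildLlmsFull(posts) {
  // posts: [{ title, url, date, markdown }]
  const header = '# Yuxu Ge - Full Blog Content\n\n';
  const body = posts
    .map((p) => `## ${p.title}\nURL: ${p.url}\nDate: ${p.date}\n\n${p.markdown}`)
    .join('\n\n---\n\n');
  return header + body + '\n';
}

const sample = buildLlmsFull([
  { title: 'Hello', url: 'https://yuxu.ge/blog/hello.html',
    date: '2026-01-01', markdown: 'Body text.' },
]);
```

Including the canonical URL under each heading matters: it lets an AI system cite the exact page even though it ingested everything in one request.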
The generation logic distinguishes between featured (pinned) and recent articles:
```javascript
const featured = uniqueEntries
  .filter(e => e.top !== undefined)
  .sort((a, b) => a.top - b.top);

const recent = uniqueEntries
  .filter(e => e.top === undefined)
  .sort((a, b) => b.date.localeCompare(a.date));
```
3. sitemap.xml — Bilingual Hreflang Support
Many of my articles exist in both English and Chinese. The sitemap uses xhtml:link to declare language alternates, which prevents search engines from treating them as duplicate content:
```xml
<url>
  <loc>https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html</loc>
  <lastmod>2026-02-20</lastmod>
  <priority>0.9</priority>
  <xhtml:link rel="alternate" hreflang="zh"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog-zh.html" />
  <xhtml:link rel="alternate" hreflang="en"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html" />
</url>
```
Pinned articles get priority 0.9, regular articles 0.7, homepage 1.0.
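A sketch of how one `<url>` entry can be assembled under that priority rule; `urlEntry` and its input shape are hypothetical names of mine:

```javascript
// Render one sitemap <url> block, including hreflang alternates and
// the pinned/regular priority scheme described above.
function urlEntry(post) {
  const alternates = (post.alternates || [])
    .map((a) =>
      `    <xhtml:link rel="alternate" hreflang="${a.lang}" href="${a.href}" />`)
    .join('\n');
  return [
    '  <url>',
    `    <loc>${post.url}</loc>`,
    `    <lastmod>${post.date}</lastmod>`,
    `    <priority>${post.pinned ? '0.9' : '0.7'}</priority>`,
    alternates,          // empty string is dropped by filter(Boolean)
    '  </url>',
  ].filter(Boolean).join('\n');
}

const entry = urlEntry({
  url: 'https://yuxu.ge/blog/2026/post.html',
  date: '2026-02-20',
  pinned: true,
  alternates: [{ lang: 'zh', href: 'https://yuxu.ge/blog/2026/post-zh.html' }],
});
```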
4. JSON-LD Structured Data — Telling Machines Exactly What's Here
JSON-LD is the gold standard for machine-readable metadata. Instead of crawlers guessing that a page is a blog post, we explicitly declare it with BlogPosting schema:
```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "The \"Green Trap\" in RAG Systems...",
  "description": "A deep analysis of five energy-saving techniques...",
  "datePublished": "2026-02-20",
  "author": {
    "@type": "Person",
    "name": "Yuxu Ge",
    "url": "https://yuxu.ge",
    "jobTitle": "Senior AI Architect",
    "affiliation": {
      "@type": "Organization",
      "name": "University of York"
    }
  },
  "inLanguage": "en",
  "keywords": ["rag", "energy-efficiency", "llm"],
  "isPartOf": {
    "@type": "Blog",
    "name": "Yuxu Ge's Blog",
    "url": "https://yuxu.ge/blog/"
  }
}
```
The script injects this into every article's <head>, along with Open Graph and Twitter Card meta tags. It also injects WebSite schema on the homepage and Blog schema on the blog index.
The idempotency is handled by stripping old injections first:
```javascript
// Remove old JSON-LD
html = html.replace(
  /<script type="application\/ld\+json">[\s\S]*?<\/script>\n?/g, ''
);
// Remove old SEO meta
html = html.replace(/\s*<meta property="og:[\s\S]*?">\n?/g, '');
html = html.replace(/\s*<meta name="twitter:[\s\S]*?">\n?/g, '');
// ... then inject fresh tags before </head>
```
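The matching inject step can be sketched like this; `injectHead` and its signature are my invention, and the script's actual helper may differ:

```javascript
// After the strip pass, insert a fresh JSON-LD block plus the meta tags
// immediately before </head>, so repeated runs always start clean.
function injectHead(html, jsonLd, metaTags) {
  const block =
    `<script type="application/ld+json">${JSON.stringify(jsonLd, null, 2)}</script>\n` +
    metaTags.join('\n') + '\n';
  return html.replace('</head>', block + '</head>');
}

const page = injectHead(
  '<html><head><title>t</title></head><body></body></html>',
  { '@context': 'https://schema.org', '@type': 'BlogPosting' },
  ['<meta property="og:type" content="article">'],
);
```

Strip-then-inject is what makes the whole pipeline safe to run on every build: the output depends only on the current metadata, never on how many times the script has run before.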
5. SEO Meta Tags — The Classics Still Matter
Every article page gets the full set:
- Canonical URL — prevents duplicate content issues
- hreflang — links English and Chinese versions together
- Open Graph — for social sharing (Facebook, LinkedIn)
- Twitter Card — for Twitter/X previews
- article:published_time — publication date for social platforms
```html
<link rel="canonical" href="https://yuxu.ge/blog/2026/...">
<link rel="alternate" hreflang="zh" href="https://yuxu.ge/blog/2026/...-zh.html">
<link rel="alternate" hreflang="en" href="https://yuxu.ge/blog/2026/....html">

<meta property="og:type" content="article">
<meta property="og:title" content="The Green Trap in RAG Systems...">
<meta property="og:url" content="https://yuxu.ge/blog/2026/...">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@YuxuGe_AI">
```
6. The Noscript Problem — Making the Homepage Crawlable
My homepage uses JavaScript to dynamically render blog posts from posts.json. This works great for browsers but not for AI crawlers that don't execute JS. The solution: inject a <noscript> block with a static HTML list of all articles.
```html
<div id="dynamic-sections"></div>
<noscript id="aeo-noscript">
  <section>
    <h2>Blog Articles</h2>
    <ul>
      <li><a href="/blog/2026/2026-02-20-rag-energy-efficiency-blog.html">
        The "Green Trap" in RAG Systems...</a> <small>(2026-02-20)</small>
        - A deep analysis of five energy-saving techniques...</li>
      <!-- ... all 120+ articles ... -->
    </ul>
  </section>
</noscript>
```
This is invisible to regular users (JS is enabled) but provides full article discovery for simple crawlers.
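The generator for this block is a straight mapping over the post list. A sketch under assumed names (`buildNoscript` and the post fields are mine):

```javascript
// Render the static <noscript> fallback list from posts.json entries.
function buildNoscript(posts) {
  const items = posts
    .map((p) =>
      `      <li><a href="${p.url}">${p.title}</a> ` +
      `<small>(${p.date})</small> - ${p.description}</li>`)
    .join('\n');
  return [
    '<noscript id="aeo-noscript">',
    '  <section>',
    '    <h2>Blog Articles</h2>',
    '    <ul>',
    items,
    '    </ul>',
    '  </section>',
    '</noscript>',
  ].join('\n');
}

const fallback = buildNoscript([
  { url: '/blog/2026/post.html', title: 'Post',
    date: '2026-02-20', description: 'Summary.' },
]);
```

The `id="aeo-noscript"` attribute is what lets the strip pass find and remove the old block before injecting a fresh one.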
7. Atom Feed — Full Content, Not Just Summaries
`feed.xml` provides the top 20 articles with complete content in Atom 1.0 format. Unlike many feeds that include only excerpts, mine ships the full article text:

```xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Yuxu Ge's Blog</title>
  <link href="https://yuxu.ge/feed.xml" rel="self" type="application/atom+xml" />
  <entry>
    <title>The "Green Trap" in RAG Systems...</title>
    <content type="text">(full article content here)</content>
  </entry>
</feed>
```
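One detail worth showing: putting full article text into `<content>` makes XML escaping mandatory. A sketch with hypothetical helpers `escapeXml` and `atomEntry`:

```javascript
// Escape the five XML-significant characters so raw article text
// cannot break the feed document.
function escapeXml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// Render one Atom <entry> with escaped title and full-text content.
function atomEntry(post) {
  return [
    '  <entry>',
    `    <title>${escapeXml(post.title)}</title>`,
    `    <link href="${post.url}" />`,
    `    <id>${post.url}</id>`,
    `    <updated>${post.date}T00:00:00Z</updated>`,
    `    <content type="text">${escapeXml(post.text)}</content>`,
    '  </entry>',
  ].join('\n');
}

const item = atomEntry({
  title: 'A & B', url: 'https://yuxu.ge/x.html',
  date: '2026-02-20', text: '1 < 2',
});
```

Note that `&` must be escaped first, or the other replacements would get double-escaped.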
Verification Checklist
After running `node build-aeo.js`, verify:

- `sitemap.xml` — Valid XML, includes all articles, hreflang alternates present
- `robots.txt` — Search engines + AI crawlers explicitly allowed, sitemap reference included
- `llms.txt` — Complete site summary with author info, featured and recent articles
- `llms-full.txt` — 2.1MB, all articles in plain text
- `feed.xml` — Valid Atom 1.0, top 20 articles with full content
- `urls.txt` — URL list for Baidu push API (excludes legacy articles)
- JSON-LD — `BlogPosting` schema in every article `<head>` (validate with the schema.org validator)
- SEO meta — OG, Twitter Card, canonical, hreflang on all article pages
- Homepage `<noscript>` — Static article list visible when JS disabled
- IndexNow — Key file deployed, Bing/Yandex push via GitHub Actions
- Baidu Push — Token stored as secret, automated via GitHub Actions
- Idempotent — Running twice produces identical output (no duplicate tags)
8. Search Engine Push Automation — Proactive Indexing
Generating AEO artifacts is only half the battle. Search engines also need to be told when new content is available. I set up two push mechanisms:
Baidu URL Push
Baidu provides a URL push API for Chinese search indexing. The build script generates urls.txt containing all non-legacy article URLs, which can be submitted manually:
```shell
curl -H 'Content-Type:text/plain' \
  --data-binary @urls.txt \
  "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=$BAIDU_TOKEN"
```
IndexNow (Bing, Yandex, and more)
IndexNow is an open protocol that lets you instantly notify search engines about new or updated URLs. One API call notifies Bing, Yandex, and all other participating engines simultaneously:
```json
{
  "host": "yuxu.ge",
  "key": "your-indexnow-key",
  "keyLocation": "https://yuxu.ge/your-key-file.txt",
  "urlList": [
    "https://yuxu.ge/blog/2026/new-article.html",
    "https://yuxu.ge/blog/2026/another-article.html"
  ]
}
```
The key file is placed in the site root for verification — search engines fetch it to confirm ownership.
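For completeness, a hedged Node 18+ sketch of the submission itself, using the global `fetch`; `indexNowPayload` and `pushIndexNow` are my names, and the key-file naming follows the common `<key>.txt` convention:

```javascript
// Build the IndexNow JSON payload in the shape shown above.
function indexNowPayload(host, key, urls) {
  return {
    host,
    key,
    keyLocation: `https://${host}/${key}.txt`,
    urlList: urls,
  };
}

// POST the payload to the shared IndexNow endpoint; one call fans out
// to all participating engines (Bing, Yandex, ...).
async function pushIndexNow(payload) {
  const res = await fetch('https://api.indexnow.org/IndexNow', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify(payload),
  });
  return res.status; // 200/202 indicate the submission was accepted
}

const payload = indexNowPayload('yuxu.ge', 'example-key', [
  'https://yuxu.ge/blog/2026/new-article.html',
]);
```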
GitHub Actions for Automated Push
Instead of pushing manually after every deployment, I created a GitHub Actions workflow that triggers automatically when urls.txt changes:
```yaml
name: Search Engine URL Push

on:
  push:
    branches: [gh-pages]
    paths:
      - 'urls.txt'

jobs:
  push-urls:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Wait for GitHub Pages deployment
        run: sleep 30

      - name: Push URLs to Baidu
        run: |
          curl -s -H 'Content-Type:text/plain' \
            --data-binary @urls.txt \
            "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=${{ secrets.BAIDU_PUSH_TOKEN }}"

      - name: Push URLs to IndexNow (Bing/Yandex)
        run: |
          # Build JSON payload from urls.txt and POST to IndexNow API
          curl -s -X POST "https://api.indexnow.org/IndexNow" \
            -H "Content-Type: application/json; charset=utf-8" \
            -d '{"host":"yuxu.ge","key":"...","urlList":[...]}'
```
The Baidu token is stored as a GitHub Actions secret (BAIDU_PUSH_TOKEN), never hardcoded. The workflow waits 30 seconds after push for GitHub Pages deployment to complete before notifying search engines.
What's Next
- Submit `sitemap.xml` to Google Search Console
- Monitor AI citation sources (Perplexity, ChatGPT Browse) for yuxu.ge mentions
- Add `Person` schema with `sameAs` links to strengthen author entity recognition
- Consider adding FAQ schema to Q&A-style articles
The complete `build-aeo.js` is about 350 lines of vanilla Node.js — no dependencies beyond what the blog build system already uses (`fs`, `path`, `marked`). It runs in under 2 seconds and generates everything needed to make a static blog fully discoverable by both traditional search engines and the new generation of AI-powered search.