Making My Static Blog AI-Discoverable: A Complete AEO Implementation Guide
Search is changing. When someone asks Perplexity "how to build hybrid search for static blogs," or when ChatGPT browses the web to answer a RAG question, they're not using Google's PageRank — they're using AI crawlers that fetch, parse, and synthesize web content directly. If your site isn't optimized for these AI engines, your content might as well not exist in this new discovery paradigm.
After building yuxu.ge as a pure static site (no frameworks, just standards), I realized my content was invisible to AI crawlers for two reasons:
- The homepage renders articles via JavaScript — AI bots that don't execute JS see an empty page
- No machine-readable metadata existed — no sitemap, no structured data, no content feeds
This post documents how I solved both problems with a single Node.js build script that generates all the AEO (AI Engine Optimization) and SEO artifacts automatically on every deployment.
What Gets Generated
One script, `build-aeo.js`, produces everything:
| File | Purpose | Size |
|---|---|---|
| `sitemap.xml` | Standard XML sitemap with hreflang | 48K |
| `robots.txt` | Search engine + AI crawler permissions | <1K |
| `llms.txt` | AI-readable site summary | 28K |
| `llms-full.txt` | Complete content dump | 2.1M |
| `feed.xml` | Atom 1.0 feed (top 20, full content) | 264K |
| `urls.txt` | URL list for Baidu push API | <2K |
| JSON-LD | `BlogPosting` schema in every article | injected |
| SEO meta | OG, Twitter Card, canonical, hreflang | injected |
| `<noscript>` | Static article list on homepage | injected |
The Architecture
The script reads from two sources:
- `blog/posts.json` — 141 posts with metadata (title, date, tags, description, language variants)
- `_content/posts/` — Raw markdown source files
It writes static files to the root directory and injects metadata directly into existing HTML files. The entire process is idempotent — old injections are stripped before new ones are added, so it's safe to run on every build.
Integration into the build pipeline is one line in `build.sh`:

```shell
# Build static HTML (for crawlers)
node $TOOLS_DIR/build.js

# Build AEO & SEO
node $TOOLS_DIR/build-aeo.js
```
1. robots.txt — Rolling Out the Red Carpet
Most sites are blocking AI crawlers. I'm doing the opposite — explicitly welcoming them:
```
User-agent: *
Allow: /

# Search Engine Crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Baiduspider
Allow: /

User-agent: YandexBot
Allow: /

# AI Crawlers - explicitly allowed
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

Sitemap: https://yuxu.ge/sitemap.xml
```
This sends a clear signal to both traditional search engines and AI crawlers: my content is open and available for indexing, training, and synthesis.
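Generating this file is straightforward string assembly. A minimal sketch, where `buildRobots` and the two crawler arrays are hypothetical names of mine, not necessarily the script's:

```javascript
// Crawler lists mirror the robots.txt output shown above.
const SEARCH_CRAWLERS = ['Googlebot', 'Bingbot', 'Baiduspider', 'YandexBot'];
const AI_CRAWLERS = [
  'GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended',
  'Bytespider', 'CCBot', 'Applebot-Extended', 'cohere-ai',
];

// Build the full robots.txt text: a wildcard rule, then per-bot Allow
// blocks, then the sitemap pointer.
function buildRobots(siteUrl) {
  const allow = (bots) =>
    bots.map((b) => `User-agent: ${b}\nAllow: /\n`).join('\n');
  return [
    'User-agent: *\nAllow: /\n',
    '# Search Engine Crawlers',
    allow(SEARCH_CRAWLERS),
    '# AI Crawlers - explicitly allowed',
    allow(AI_CRAWLERS),
    `Sitemap: ${siteUrl}/sitemap.xml`,
  ].join('\n');
}

const robots = buildRobots('https://yuxu.ge');
// e.g. fs.writeFileSync('robots.txt', robots);
```

Keeping the bot names in data rather than in a template string makes it trivial to add the next AI crawler that shows up in the access logs.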
2. llms.txt — A README for AI Crawlers
The llms.txt convention provides a structured, human-and-AI-readable summary of your site. Think of it as a README.md that AI systems read first to understand what your site is about.
```
# Yuxu Ge - Senior AI Architect & Researcher

> Personal technical blog covering RAG systems, search architecture,
> neural networks, AI-assisted programming, and distributed systems.

## Author
- Name: Yuxu Ge
- Role: Senior AI Architect | MSc AI candidate at University of York
- Expertise: Information Retrieval, RAG Systems, Search Infrastructure

## Featured Articles (High Priority)
- [The "Green Trap" in RAG Systems](https://yuxu.ge/blog/...): Deep analysis...
- [Building Hybrid Search for Static Blogs](https://yuxu.ge/blog/...): Production-ready...

## Recent Articles
(all articles listed with URLs and descriptions)

## Blog Index
- Blog Home: https://yuxu.ge/blog/
- RSS/Atom Feed: https://yuxu.ge/feed.xml
- Sitemap: https://yuxu.ge/sitemap.xml
```
I also generate llms-full.txt — a 2.1MB plain text dump of every article's complete content. This lets an AI system ingest my entire blog in a single HTTP request instead of crawling 159 individual pages.
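The full dump is simple concatenation of the raw markdown sources. A hedged sketch, where `buildLlmsFull` and its input shape are my assumptions rather than the script's actual code:

```javascript
// Concatenate every post's raw markdown into one plain-text document,
// with a heading and metadata line per post and dividers between posts.
function buildLlmsFull(posts) {
  // posts: [{ title, url, date, markdown }]
  const header = '# Yuxu Ge - Full Blog Content\n\n';
  const body = posts
    .map((p) => `## ${p.title}\nURL: ${p.url}\nDate: ${p.date}\n\n${p.markdown}`)
    .join('\n\n---\n\n');
  return header + body + '\n';
}

const sample = buildLlmsFull([
  { title: 'Hello', url: 'https://yuxu.ge/blog/hello.html',
    date: '2026-01-01', markdown: 'Body text.' },
]);
```

Including the canonical URL under each heading matters: it lets an AI system cite the exact page even though it ingested everything in one request.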
The generation logic distinguishes between featured (pinned) and recent articles:
```javascript
const featured = uniqueEntries
  .filter(e => e.top !== undefined)
  .sort((a, b) => a.top - b.top);

const recent = uniqueEntries
  .filter(e => e.top === undefined)
  .sort((a, b) => b.date.localeCompare(a.date));
```
3. sitemap.xml — Bilingual Hreflang Support
Many of my articles exist in both English and Chinese. The sitemap uses xhtml:link to declare language alternates, which prevents search engines from treating them as duplicate content:
```xml
<url>
  <loc>https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html</loc>
  <lastmod>2026-02-20</lastmod>
  <priority>0.9</priority>
  <xhtml:link rel="alternate" hreflang="zh"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog-zh.html" />
  <xhtml:link rel="alternate" hreflang="en"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html" />
</url>
```
Pinned articles get priority 0.9, regular articles 0.7, homepage 1.0.
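A sketch of how one `<url>` entry can be assembled under that priority rule; `urlEntry` and its input shape are hypothetical names of mine:

```javascript
// Render one sitemap <url> block, including hreflang alternates and
// the pinned/regular priority scheme described above.
function urlEntry(post) {
  const alternates = (post.alternates || [])
    .map((a) =>
      `    <xhtml:link rel="alternate" hreflang="${a.lang}" href="${a.href}" />`)
    .join('\n');
  return [
    '  <url>',
    `    <loc>${post.url}</loc>`,
    `    <lastmod>${post.date}</lastmod>`,
    `    <priority>${post.pinned ? '0.9' : '0.7'}</priority>`,
    alternates,          // empty string is dropped by filter(Boolean)
    '  </url>',
  ].filter(Boolean).join('\n');
}

const entry = urlEntry({
  url: 'https://yuxu.ge/blog/2026/post.html',
  date: '2026-02-20',
  pinned: true,
  alternates: [{ lang: 'zh', href: 'https://yuxu.ge/blog/2026/post-zh.html' }],
});
```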
4. JSON-LD Structured Data — Telling Machines Exactly What's Here
JSON-LD is the gold standard for machine-readable metadata. Instead of crawlers guessing that a page is a blog post, we explicitly declare it with BlogPosting schema:
```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "The \"Green Trap\" in RAG Systems...",
  "description": "A deep analysis of five energy-saving techniques...",
  "datePublished": "2026-02-20",
  "author": {
    "@type": "Person",
    "name": "Yuxu Ge",
    "url": "https://yuxu.ge",
    "jobTitle": "Senior AI Architect",
    "affiliation": {
      "@type": "Organization",
      "name": "University of York"
    }
  },
  "inLanguage": "en",
  "keywords": ["rag", "energy-efficiency", "llm"],
  "isPartOf": {
    "@type": "Blog",
    "name": "Yuxu Ge's Blog",
    "url": "https://yuxu.ge/blog/"
  }
}
```
The script injects this into every article's <head>, along with Open Graph and Twitter Card meta tags. It also injects WebSite schema on the homepage and Blog schema on the blog index.
The idempotency is handled by stripping old injections first:
```javascript
// Remove old JSON-LD
html = html.replace(
  /<script type="application\/ld\+json">[\s\S]*?<\/script>\n?/g, ''
);
// Remove old SEO meta
html = html.replace(/\s*<meta property="og:[\s\S]*?">\n?/g, '');
html = html.replace(/\s*<meta name="twitter:[\s\S]*?">\n?/g, '');
// ... then inject fresh tags before </head>
```
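The matching inject step can be sketched like this; `injectHead` and its signature are my invention, and the script's actual helper may differ:

```javascript
// After the strip pass, insert a fresh JSON-LD block plus the meta tags
// immediately before </head>, so repeated runs always start clean.
function injectHead(html, jsonLd, metaTags) {
  const block =
    `<script type="application/ld+json">${JSON.stringify(jsonLd, null, 2)}</script>\n` +
    metaTags.join('\n') + '\n';
  return html.replace('</head>', block + '</head>');
}

const page = injectHead(
  '<html><head><title>t</title></head><body></body></html>',
  { '@context': 'https://schema.org', '@type': 'BlogPosting' },
  ['<meta property="og:type" content="article">'],
);
```

Strip-then-inject is what makes the whole pipeline safe to run on every build: the output depends only on the current metadata, never on how many times the script has run before.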
5. SEO Meta Tags — The Classics Still Matter
Every article page gets the full set:
- Canonical URL — prevents duplicate content issues
- hreflang — links English and Chinese versions together
- Open Graph — for social sharing (Facebook, LinkedIn)
- Twitter Card — for Twitter/X previews
- article:published_time — publication date for social platforms
```html
<link rel="canonical" href="https://yuxu.ge/blog/2026/...">
<link rel="alternate" hreflang="zh" href="https://yuxu.ge/blog/2026/...-zh.html">
<link rel="alternate" hreflang="en" href="https://yuxu.ge/blog/2026/....html">

<meta property="og:type" content="article">
<meta property="og:title" content="The Green Trap in RAG Systems...">
<meta property="og:url" content="https://yuxu.ge/blog/2026/...">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@YuxuGe_AI">
```
6. The Noscript Problem — Making the Homepage Crawlable
My homepage uses JavaScript to dynamically render blog posts from posts.json. This works great for browsers but not for AI crawlers that don't execute JS. The solution: inject a <noscript> block with a static HTML list of all articles.
```html
<div id="dynamic-sections"></div>
<noscript id="aeo-noscript">
  <section>
    <h2>Blog Articles</h2>
    <ul>
      <li><a href="/blog/2026/2026-02-20-rag-energy-efficiency-blog.html">
        The "Green Trap" in RAG Systems...</a> <small>(2026-02-20)</small>
        - A deep analysis of five energy-saving techniques...</li>
      <!-- ... all 120+ articles ... -->
    </ul>
  </section>
</noscript>
```
This is invisible to regular users (JS is enabled) but provides full article discovery for simple crawlers.
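The generator for this block is a straight mapping over the post list. A sketch under assumed names (`buildNoscript` and the post fields are mine):

```javascript
// Render the static <noscript> fallback list from posts.json entries.
function buildNoscript(posts) {
  const items = posts
    .map((p) =>
      `      <li><a href="${p.url}">${p.title}</a> ` +
      `<small>(${p.date})</small> - ${p.description}</li>`)
    .join('\n');
  return [
    '<noscript id="aeo-noscript">',
    '  <section>',
    '    <h2>Blog Articles</h2>',
    '    <ul>',
    items,
    '    </ul>',
    '  </section>',
    '</noscript>',
  ].join('\n');
}

const fallback = buildNoscript([
  { url: '/blog/2026/post.html', title: 'Post',
    date: '2026-02-20', description: 'Summary.' },
]);
```

The `id="aeo-noscript"` attribute is what lets the strip pass find and remove the old block before injecting a fresh one.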
7. Atom Feed — Full Content, Not Just Summaries
`feed.xml` provides the top 20 articles with complete content in Atom 1.0 format. Unlike many feeds that include only excerpts, mine ships the full article text:

```xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Yuxu Ge's Blog</title>
  <link href="https://yuxu.ge/feed.xml" rel="self" type="application/atom+xml" />
  <entry>
    <title>The "Green Trap" in RAG Systems...</title>
    <content type="text">(full article content here)</content>
  </entry>
</feed>
```
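One detail worth showing: putting full article text into `<content>` makes XML escaping mandatory. A sketch with hypothetical helpers `escapeXml` and `atomEntry`:

```javascript
// Escape the five XML-significant characters so raw article text
// cannot break the feed document.
function escapeXml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// Render one Atom <entry> with escaped title and full-text content.
function atomEntry(post) {
  return [
    '  <entry>',
    `    <title>${escapeXml(post.title)}</title>`,
    `    <link href="${post.url}" />`,
    `    <id>${post.url}</id>`,
    `    <updated>${post.date}T00:00:00Z</updated>`,
    `    <content type="text">${escapeXml(post.text)}</content>`,
    '  </entry>',
  ].join('\n');
}

const item = atomEntry({
  title: 'A & B', url: 'https://yuxu.ge/x.html',
  date: '2026-02-20', text: '1 < 2',
});
```

Note that `&` must be escaped first, or the other replacements would get double-escaped.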
Verification Checklist
After running `node build-aeo.js`, verify:

- `sitemap.xml` — Valid XML, includes all articles, hreflang alternates present
- `robots.txt` — Search engines + AI crawlers explicitly allowed, sitemap reference included
- `llms.txt` — Complete site summary with author info, featured and recent articles
- `llms-full.txt` — 2.1MB, all articles in plain text
- `feed.xml` — Valid Atom 1.0, top 20 articles with full content
- `urls.txt` — URL list for Baidu push API (excludes legacy articles)
- JSON-LD — `BlogPosting` schema in every article `<head>` (validate with the schema.org validator)
- SEO meta — OG, Twitter Card, canonical, hreflang on all article pages
- Homepage `<noscript>` — Static article list visible when JS disabled
- IndexNow — Key file deployed, Bing/Yandex push via GitHub Actions
- Baidu Push — Token stored as secret, automated via GitHub Actions
- Idempotent — Running twice produces identical output (no duplicate tags)
8. Search Engine Push Automation — Proactive Indexing
Generating AEO artifacts is only half the battle. Search engines also need to be told when new content is available. I set up two push mechanisms:
Baidu URL Push
Baidu provides a URL push API for Chinese search indexing. The build script generates urls.txt containing all non-legacy article URLs, which can be submitted manually:
```shell
curl -H 'Content-Type:text/plain' \
  --data-binary @urls.txt \
  "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=$BAIDU_TOKEN"
```
IndexNow (Bing, Yandex, and more)
IndexNow is an open protocol that lets you instantly notify search engines about new or updated URLs. One API call notifies Bing, Yandex, and all other participating engines simultaneously:
```json
{
  "host": "yuxu.ge",
  "key": "your-indexnow-key",
  "keyLocation": "https://yuxu.ge/your-key-file.txt",
  "urlList": [
    "https://yuxu.ge/blog/2026/new-article.html",
    "https://yuxu.ge/blog/2026/another-article.html"
  ]
}
```
The key file is placed in the site root for verification — search engines fetch it to confirm ownership.
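For completeness, a hedged Node 18+ sketch of the submission itself, using the global `fetch`; `indexNowPayload` and `pushIndexNow` are my names, and the key-file naming follows the common `<key>.txt` convention:

```javascript
// Build the IndexNow JSON payload in the shape shown above.
function indexNowPayload(host, key, urls) {
  return {
    host,
    key,
    keyLocation: `https://${host}/${key}.txt`,
    urlList: urls,
  };
}

// POST the payload to the shared IndexNow endpoint; one call fans out
// to all participating engines (Bing, Yandex, ...).
async function pushIndexNow(payload) {
  const res = await fetch('https://api.indexnow.org/IndexNow', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify(payload),
  });
  return res.status; // 200/202 indicate the submission was accepted
}

const payload = indexNowPayload('yuxu.ge', 'example-key', [
  'https://yuxu.ge/blog/2026/new-article.html',
]);
```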
GitHub Actions for Automated Push
Instead of pushing manually after every deployment, I created a GitHub Actions workflow that triggers automatically when urls.txt changes:
```yaml
name: Search Engine URL Push

on:
  push:
    branches: [gh-pages]
    paths:
      - 'urls.txt'

jobs:
  push-urls:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Wait for GitHub Pages deployment
        run: sleep 30

      - name: Push URLs to Baidu
        run: |
          curl -s -H 'Content-Type:text/plain' \
            --data-binary @urls.txt \
            "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=${{ secrets.BAIDU_PUSH_TOKEN }}"

      - name: Push URLs to IndexNow (Bing/Yandex)
        run: |
          # Build JSON payload from urls.txt and POST to IndexNow API
          curl -s -X POST "https://api.indexnow.org/IndexNow" \
            -H "Content-Type: application/json; charset=utf-8" \
            -d '{"host":"yuxu.ge","key":"...","urlList":[...]}'
```
The Baidu token is stored as a GitHub Actions secret (BAIDU_PUSH_TOKEN), never hardcoded. The workflow waits 30 seconds after push for GitHub Pages deployment to complete before notifying search engines.
What's Next
- Submit `sitemap.xml` to Google Search Console
- Monitor AI citation sources (Perplexity, ChatGPT Browse) for yuxu.ge mentions
- Add `Person` schema with `sameAs` links to strengthen author entity recognition
- Consider adding FAQ schema to Q&A-style articles
The complete `build-aeo.js` is about 350 lines of vanilla Node.js — no dependencies beyond what the blog build system already uses (`fs`, `path`, `marked`). It runs in under 2 seconds and generates everything needed to make a static blog fully discoverable by both traditional search engines and the new generation of AI-powered search.