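Making My Static Blog Discoverable by AI: A Complete AEO Practice Guide

Search is changing. When someone asks Perplexity "how to build hybrid search for a static blog", or when ChatGPT browses the web to answer a RAG question, neither relies on Google's PageRank algorithm. They use AI crawlers that fetch, parse, and synthesize web content directly. If your site is not optimized for these AI engines, your content may sink without a trace in this new content-discovery paradigm.

After building yuxu.ge as a purely static site (no frameworks, only standards), I realized my content was nearly invisible to AI crawlers, for two reasons:

The homepage rendered its article list with JavaScript, so AI bots that do not execute JS saw an empty page
The site had no machine-readable metadata: no sitemap, no structured data, no content feed

This post documents how I fixed both problems with a single Node.js build script. The script automatically generates every AEO (AI Engine Optimization) and SEO artifact on each deployment.

What the Script Generates

One script, build-aeo.js, handles everything: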

File	Purpose	Size
`sitemap.xml`	Standard XML sitemap with hreflang	48K
`robots.txt`	Search engine + AI crawler access control	<1K
`llms.txt`	AI-readable site summary	28K
`llms-full.txt`	Plain-text dump of the full site content	2.1M
`feed.xml`	Atom 1.0 feed (latest 20 posts, full text)	264K
`urls.txt`	URL list for the Baidu push API	<2K
JSON-LD	`BlogPosting` schema injected into every article	Injected
SEO meta	OG, Twitter Card, canonical, hreflang	Injected
`<noscript>`	Static article list on the homepage for crawlers	Injected

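Architecture Overview

The script reads from two data sources:

blog/posts.json: metadata for 141 articles (title, date, tags, description, language versions)
_content/posts/: the original Markdown source files

It writes the static files to the site root and injects metadata directly into the existing HTML files. The whole process is idempotent: old injections are stripped before new ones are added, so it is safe to run on every build.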

Integrating it into the build pipeline takes one extra line in build.sh:

# Build static HTML (for crawlers)
node "$TOOLS_DIR/build.js"

# Build AEO & SEO artifacts
node "$TOOLS_DIR/build-aeo.js"

1. robots.txt: Rolling Out the Red Carpet for AI Crawlers

Many sites block AI crawlers. I do the opposite and explicitly welcome them:

User-agent: *
Allow: /

# Search Engine Crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Baiduspider
Allow: /
User-agent: YandexBot
Allow: /

# AI Crawlers - explicitly allowed
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /

Sitemap: https://yuxu.ge/sitemap.xml

This sends a clear signal to both traditional search engines and AI crawlers: my content is fully open to indexing, training, and synthesis.
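
A file like this is easy to generate from two lists of crawler names. The following is a minimal sketch, not the code from build-aeo.js; the names `SEARCH_BOTS`, `AI_BOTS`, `allowGroup`, and `buildRobotsTxt` are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper names; the crawler lists mirror the robots.txt shown above.
const SEARCH_BOTS = ['Googlebot', 'Bingbot', 'Baiduspider', 'YandexBot'];
const AI_BOTS = [
  'GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended',
  'Bytespider', 'CCBot', 'Applebot-Extended', 'cohere-ai',
];

// One "User-agent / Allow" group per crawler.
function allowGroup(agent) {
  return `User-agent: ${agent}\nAllow: /`;
}

function buildRobotsTxt(siteUrl) {
  const lines = [
    allowGroup('*'),
    '',
    '# Search Engine Crawlers',
    ...SEARCH_BOTS.map(allowGroup),
    '',
    '# AI Crawlers - explicitly allowed',
    ...AI_BOTS.map(allowGroup),
    '',
    `Sitemap: ${siteUrl}/sitemap.xml`,
  ];
  return lines.join('\n') + '\n';
}
```

Keeping the crawler names in arrays means welcoming a new bot is a one-line change, and the output can be written with `fs.writeFileSync('robots.txt', buildRobotsTxt('https://yuxu.ge'))`.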

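2. llms.txt: A README for AI Crawlers

The llms.txt proposal gives your site a structured summary that both humans and AI can read. Think of it as a README.md that an AI system can read first to learn what your site is about.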

# Yuxu Ge - Senior AI Architect & Researcher

> Personal technical blog covering RAG systems, search architecture,
> neural networks, AI-assisted programming, and distributed systems.

## Author
- Name: Yuxu Ge
- Role: Senior AI Architect | MSc AI candidate at University of York
- Expertise: Information Retrieval, RAG Systems, Search Infrastructure

## Featured Articles (High Priority)
- [The "Green Trap" in RAG Systems](https://yuxu.ge/blog/...): Deep analysis...
- [Building Hybrid Search for Static Blogs](https://yuxu.ge/blog/...): Production-ready...

## Recent Articles
(All articles listed in reverse chronological order, with URL and description)

## Blog Index
- Blog Home: https://yuxu.ge/blog/
- RSS/Atom Feed: https://yuxu.ge/feed.xml
- Sitemap: https://yuxu.ge/sitemap.xml

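I also generate llms-full.txt, a 2.1MB plain-text file containing the full content of every article. This lets an AI system fetch my entire blog in a single HTTP request instead of crawling 159 individual pages.

The generation logic separates featured (pinned) articles from recent ones: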

const featured = uniqueEntries
    .filter(e => e.top !== undefined)
    .sort((a, b) => a.top - b.top);
const recent = uniqueEntries
    .filter(e => e.top === undefined)
    .sort((a, b) => b.date.localeCompare(a.date));
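
To show how that split could feed the two article sections of llms.txt, here is a small sketch. It reuses the filter and sort logic above; the field names `title`, `url`, and `description`, and the helper names, are assumptions made for illustration and may not match the real posts.json.

```javascript
// Same split as above: pinned entries carry a numeric `top`, the rest sort by date.
function splitEntries(entries) {
  const featured = entries
    .filter(e => e.top !== undefined)
    .sort((a, b) => a.top - b.top);
  const recent = entries
    .filter(e => e.top === undefined)
    .sort((a, b) => b.date.localeCompare(a.date));
  return { featured, recent };
}

// Assumed fields: title, url, description.
function renderLink(e) {
  return `- [${e.title}](${e.url}): ${e.description}`;
}

function renderArticleSections(entries) {
  const { featured, recent } = splitEntries(entries);
  return [
    '## Featured Articles (High Priority)',
    ...featured.map(renderLink),
    '',
    '## Recent Articles',
    ...recent.map(renderLink),
  ].join('\n');
}
```

Because `filter` returns a new array, the in-place `sort` calls never reorder the caller's original list.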

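3. sitemap.xml: Bilingual Hreflang Support

Many of my articles exist in both Chinese and English. The sitemap uses xhtml:link to declare the alternate-language versions so that search engines do not treat them as duplicate content: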

<url>
  <loc>https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html</loc>
  <lastmod>2026-02-20</lastmod>
  <priority>0.9</priority>
  <xhtml:link rel="alternate" hreflang="zh"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog-zh.html" />
  <xhtml:link rel="alternate" hreflang="en"
    href="https://yuxu.ge/blog/2026/2026-02-20-rag-energy-efficiency-blog.html" />
</url>

Pinned articles get priority 0.9, regular articles 0.7, and the homepage 1.0.
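
One way to produce such an entry is sketched below. The field names `url`, `date`, `top`, and `alternates` (a map from language code to URL) are assumptions for illustration, as are the function names; only the priority rule comes from the text.

```javascript
// Escape the characters that are unsafe in XML text and attribute values.
function escapeXml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Priority rule from the text: pinned articles 0.9, regular articles 0.7.
function priorityFor(entry) {
  return entry.top !== undefined ? '0.9' : '0.7';
}

// Assumed fields: entry.url, entry.date, entry.alternates ({ lang: url }).
function sitemapUrl(entry) {
  const links = Object.entries(entry.alternates || {}).map(
    ([lang, href]) =>
      `  <xhtml:link rel="alternate" hreflang="${escapeXml(lang)}"\n` +
      `    href="${escapeXml(href)}" />`
  );
  return [
    '<url>',
    `  <loc>${escapeXml(entry.url)}</loc>`,
    `  <lastmod>${escapeXml(entry.date)}</lastmod>`,
    `  <priority>${priorityFor(entry)}</priority>`,
    ...links,
    '</url>',
  ].join('\n');
}
```

For the xhtml:link elements to be valid, the enclosing `<urlset>` element must also declare `xmlns:xhtml="http://www.w3.org/1999/xhtml"`.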

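4. JSON-LD Structured Data: Telling Machines Exactly What Is Here

JSON-LD is a widely supported format for machine-readable metadata. Instead of leaving crawlers to guess that a page is a blog post, the BlogPosting schema declares it explicitly: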

{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "The \"Green Trap\" in RAG Systems...",
  "description": "A deep analysis of five energy-saving techniques...",
  "datePublished": "2026-02-20",
  "author": {
    "@type": "Person",
    "name": "Yuxu Ge",
    "url": "https://yuxu.ge",
    "jobTitle": "Senior AI Architect",
    "affiliation": {
      "@type": "Organization",
      "name": "University of York"
    }
  },
  "inLanguage": "en",
  "keywords": ["rag", "energy-efficiency", "llm"],
  "isPartOf": {
    "@type": "Blog",
    "name": "Yuxu Ge's Blog",
    "url": "https://yuxu.ge/blog/"
  }
}

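The script injects this JSON-LD, along with the Open Graph and Twitter Card meta tags, into the <head> of every article. It also injects a WebSite schema on the homepage and a Blog schema on the blog index page.

Idempotency comes from stripping the old injections first:

// Remove old JSON-LD (this pattern matches every ld+json script on the page)
html = html.replace(
    /<script type="application\/ld\+json">[\s\S]*?<\/script>\n?/g, ''
);
// Remove old SEO meta
html = html.replace(/\s*<meta property="og:[\s\S]*?">\n?/g, '');
html = html.replace(/\s*<meta name="twitter:[\s\S]*?">\n?/g, '');
// ... then inject the new tags before </head>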
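
The strip-then-inject cycle can be sketched end to end as follows. The three patterns are the ones from the excerpt; the function name `injectHead` and the whitespace normalization before `</head>` are my own assumptions, added because the meta patterns also consume leading whitespace and the output would otherwise differ between the first and second run.

```javascript
// Strip patterns taken from the excerpt above.
const STRIP_PATTERNS = [
  /<script type="application\/ld\+json">[\s\S]*?<\/script>\n?/g,
  /\s*<meta property="og:[\s\S]*?">\n?/g,
  /\s*<meta name="twitter:[\s\S]*?">\n?/g,
];

// Hypothetical helper: remove earlier injections, then insert `tags` before </head>.
function injectHead(html, tags) {
  let out = html;
  for (const re of STRIP_PATTERNS) out = out.replace(re, '');
  const block = tags.join('\n') + '\n';
  // Collapse whitespace before </head> so every run produces the same bytes.
  // A replacer function avoids "$" sequences in the tags being interpreted.
  return out.replace(/\s*<\/head>/, () => '\n' + block + '</head>');
}
```

Running `injectHead(injectHead(html, tags), tags)` should then give the same string as a single call, which is the property the build relies on.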

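5. SEO Meta Tags: The Classics Never Go Out of Style

Every article page gets the full set of tags:

Canonical URL: prevents duplicate-content issues
hreflang: links the Chinese and English versions together
Open Graph: for social sharing (Facebook, LinkedIn)
Twitter Card: for previews on Twitter/X
article:published_time: gives social platforms the publication date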

<link rel="canonical" href="https://yuxu.ge/blog/2026/...">
<link rel="alternate" hreflang="zh" href="https://yuxu.ge/blog/2026/...-zh.html">
<link rel="alternate" hreflang="en" href="https://yuxu.ge/blog/2026/....html">
<meta property="og:type" content="article">
<meta property="og:title" content="The Green Trap in RAG Systems...">
<meta property="og:url" content="https://yuxu.ge/blog/2026/...">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@YuxuGe_AI">

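6. `<noscript>`: Making the Homepage Crawlable

My homepage uses JavaScript to render the blog posts dynamically from posts.json. That works well in a browser, but not for AI crawlers that do not execute JS. The fix is to inject a <noscript> block containing a static HTML list of every article.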

<div id="dynamic-sections"></div>
<noscript id="aeo-noscript">
  <section>
    <h2>Blog Articles</h2>
    <ul>
      <li><a href="/blog/2026/2026-02-20-rag-energy-efficiency-blog.html">
        The "Green Trap" in RAG Systems...</a> <small>(2026-02-20)</small>
        - A deep analysis of five energy-saving techniques...</li>
      <!-- ... all 120+ articles ... -->
    </ul>
  </section>
</noscript>

This is completely invisible to regular users with JS enabled, while giving simple crawlers a complete path to discover every article.
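
A sketch of how such a block might be generated and injected follows. It assumes the block goes directly after the `<div id="dynamic-sections"></div>` placeholder, as in the snippet above, and that posts expose `url`, `title`, `date`, and `description`; the function names are hypothetical.

```javascript
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumed fields: p.url, p.title, p.date, p.description.
function buildNoscript(posts) {
  const items = posts.map(p =>
    `      <li><a href="${escapeHtml(p.url)}">${escapeHtml(p.title)}</a> ` +
    `<small>(${escapeHtml(p.date)})</small> - ${escapeHtml(p.description)}</li>`
  );
  return [
    '<noscript id="aeo-noscript">',
    '  <section>',
    '    <h2>Blog Articles</h2>',
    '    <ul>',
    ...items,
    '    </ul>',
    '  </section>',
    '</noscript>',
  ].join('\n');
}

// Remove any earlier block first so repeated builds do not stack lists.
function injectNoscript(html, posts) {
  const stripped = html.replace(
    /\s*<noscript id="aeo-noscript">[\s\S]*?<\/noscript>/g, ''
  );
  const anchor = '<div id="dynamic-sections"></div>';
  return stripped.replace(anchor, () => anchor + '\n' + buildNoscript(posts));
}
```

The `id` on the `<noscript>` element is what lets the strip pattern target only the generated block and leave any hand-written `<noscript>` content alone.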

7. Atom Feed: Full Text, Not Summaries

feed.xml serves the latest 20 articles in Atom 1.0 format. Unlike the many feeds that carry only summaries, mine includes the full text of each article:

<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Yuxu Ge's Blog</title>
  <link href="https://yuxu.ge/feed.xml" rel="self" type="application/atom+xml" />
  <entry>
    <title>The "Green Trap" in RAG Systems...</title>
    <content type="text">(full article content)</content>
  </entry>
</feed>
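
Below is a sketch of how the entries could be produced. The field names `title`, `url`, `date` (as YYYY-MM-DD), and `text` are assumptions, as are the function names. The excerpt above is abbreviated; Atom 1.0 (RFC 4287) also requires an `<id>` and an `<updated>` element in every entry, so the sketch emits them.

```javascript
function escapeXml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumed fields: post.title, post.url, post.date (YYYY-MM-DD), post.text.
function atomEntry(post) {
  // Treat the publication date as midnight UTC to get an RFC 3339 timestamp.
  const updated = new Date(`${post.date}T00:00:00Z`).toISOString();
  return [
    '  <entry>',
    `    <title>${escapeXml(post.title)}</title>`,
    `    <id>${escapeXml(post.url)}</id>`,
    `    <link href="${escapeXml(post.url)}" />`,
    `    <updated>${updated}</updated>`,
    `    <content type="text">${escapeXml(post.text)}</content>`,
    '  </entry>',
  ].join('\n');
}

// Newest first, capped at `limit` entries (20 in the post).
function latestEntries(posts, limit = 20) {
  return [...posts]
    .sort((a, b) => b.date.localeCompare(a.date))
    .slice(0, limit)
    .map(atomEntry)
    .join('\n');
}
```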

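Verification Checklist

After running node build-aeo.js, verify:

sitemap.xml: valid XML, contains every article, hreflang alternate links present
robots.txt: explicitly allows search engines + AI crawlers, includes the sitemap reference
llms.txt: complete site summary with author info, featured and recent articles
llms-full.txt: 2.1MB, plain text of every article
feed.xml: valid Atom 1.0, latest 20 articles with full text
urls.txt: URL list for the Baidu push API (legacy articles excluded)
JSON-LD: BlogPosting schema in the <head> of every article (can be checked with the Schema.org validator)
SEO meta: every article page has OG, Twitter Card, canonical, hreflang
Homepage <noscript>: static article list visible when JS is disabled
IndexNow: verification key file deployed, Bing/Yandex pushed automatically via GitHub Actions
Baidu push: token stored as a Secret, automated via GitHub Actions
Idempotency: running the script twice produces identical output (no duplicate tags)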
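
Part of this checklist can be automated with simple string checks. The sketch below is a hypothetical helper, not part of build-aeo.js, and its regular expressions assume the tag shapes shown earlier in this post. It is a quick smoke test, not a substitute for a real schema validator.

```javascript
// Returns the names of checklist items missing from one article's HTML.
function missingSeoItems(html) {
  const checks = {
    'JSON-LD BlogPosting':
      /<script type="application\/ld\+json">[\s\S]*?"@type":\s*"BlogPosting"[\s\S]*?<\/script>/,
    'canonical': /<link rel="canonical" href="[^"]+">/,
    'Open Graph': /<meta property="og:title" content="[^"]*">/,
    'Twitter Card': /<meta name="twitter:card" content="[^"]*">/,
  };
  return Object.entries(checks)
    .filter(([, re]) => !re.test(html))
    .map(([name]) => name);
}
```

An empty array means the page passed every check; anything else names what to look at.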

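8. Search Engine Push Automation: Proactively Notifying Indexes

Generating the AEO artifacts is only half the job. Search engines also need to be told proactively that new content is available. I set up two push mechanisms:

Baidu URL Push

Baidu provides a URL push API for Chinese-language search indexing. The build script generates urls.txt, which contains the URLs of all non-legacy articles and can be submitted manually: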

curl -H 'Content-Type:text/plain' \
  --data-binary @urls.txt \
  "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=$BAIDU_TOKEN"

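IndexNow (Bing, Yandex, and others)

IndexNow is an open protocol for notifying search engines immediately about new or updated URLs. A single API call notifies Bing, Yandex, and every other participating search engine: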

{
  "host": "yuxu.ge",
  "key": "your-indexnow-key",
  "keyLocation": "https://yuxu.ge/your-key-file.txt",
  "urlList": [
    "https://yuxu.ge/blog/2026/new-article.html",
    "https://yuxu.ge/blog/2026/another-article.html"
  ]
}

The verification key file lives in the site root; search engines fetch it to confirm site ownership.
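
The request body can be assembled straight from urls.txt. The sketch below assumes that file holds one URL per line; the function name `buildIndexNowPayload` is hypothetical. It drops blank lines and any URL that is not under the given host.

```javascript
// Builds the IndexNow request body from the text of urls.txt (one URL per line).
function buildIndexNowPayload(urlsText, host, key, keyLocation) {
  const urlList = urlsText
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.startsWith(`https://${host}/`));
  return { host, key, keyLocation, urlList };
}
```

A caller would read the file with `fs.readFileSync('urls.txt', 'utf8')`, pass the result through `JSON.stringify`, and send it as the POST body with a `Content-Type: application/json; charset=utf-8` header.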

GitHub Actions Automated Push

Rather than pushing manually after every deployment, I created a GitHub Actions workflow that triggers automatically whenever urls.txt changes:

name: Search Engine URL Push

on:
  push:
    branches: [gh-pages]
    paths:
      - 'urls.txt'

jobs:
  push-urls:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Wait for GitHub Pages deployment
        run: sleep 30

      - name: Push URLs to Baidu
        run: |
          curl -s -H 'Content-Type:text/plain' \
            --data-binary @urls.txt \
            "http://data.zz.baidu.com/urls?site=https://yuxu.ge&token=${{ secrets.BAIDU_PUSH_TOKEN }}"

      - name: Push URLs to IndexNow (Bing/Yandex)
        run: |
          # Build the JSON from urls.txt and POST it to the IndexNow API
          curl -s -X POST "https://api.indexnow.org/IndexNow" \
            -H "Content-Type: application/json; charset=utf-8" \
            -d '{"host":"yuxu.ge","key":"...","urlList":[...]}'

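The Baidu push token is stored as a GitHub Actions Secret (BAIDU_PUSH_TOKEN) and is never hard-coded. After the push to gh-pages, the workflow waits 30 seconds to give the GitHub Pages deployment time to finish before notifying the search engines.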

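Next Steps

Submit sitemap.xml to Google Search Console
Monitor AI citation sources (Perplexity, ChatGPT Browse) for appearances of yuxu.ge
Add a Person schema with sameAs links to strengthen author entity recognition
Consider adding an FAQ schema to Q&A-style articles

The complete build-aeo.js is about 350 lines of plain Node.js, with no dependencies beyond what the blog build system already uses (fs, path, marked). It runs in under 2 seconds and generates everything a static blog needs to be fully discoverable by both traditional search engines and the new generation of AI-powered search.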