
Hephaestus: Building an AI Agent Pipeline to Dissect Industrial C++ Code

If you're anything like me, you have a Gitea instance (or GitHub bookmarks) full of fascinating industrial C/C++ code — distributed systems, high-performance networking libraries, anti-bot engines — and never enough time to sit down and really read them line by line.

Manual code analysis is a craft. It's rewarding, but it doesn't scale. So I set out to build a fully automated "code archaeologist": an AI Agent system that hunts for interesting code on Gitea every day, reads source files in depth, distills the design wisdom, retells the core ideas in a different programming language, and compiles the analysis into structured technical reference documents that I can later draw on when writing blog posts.

I named it Hephaestus, after the Greek god of craftsmanship.

In a previous post, I covered setting up the Docker + OpenClaw runtime environment. Today we skip the deployment basics and go straight to the interesting part: the multi-agent architecture behind Hephaestus, and the debugging war stories that kept me up at night.

The Constitution: Carving Rules in Stone

Before writing a single line of agent configuration, I wrote a SOUL.md file — not just a system prompt, but a constitution for the entire project. It contains seven inviolable rules. The four most critical:

  1. Clean Room Isolation: Absolutely no leaking of original code snippets, function names, namespaces, or project names. All analysis must be based on understanding, and only self-written demo code may be shown.
  2. Retelling, Not Improving: Demo code exists to help readers understand the original design's intent and trade-offs. Never imply the demo is "better" or "safer" than the original.
  3. No Language Bashing: You cannot say "Rust solves C++'s flaws." Every language's design represents a reasonable trade-off under specific constraints.
  4. Language Rotation: Each article randomly picks a demo language (Rust/Go/Zig/C++20/Python/Java/C#), and no two consecutive articles may use the same one.

These rules look like a few lines of text. They turned out to be the source of the most interesting challenges in the entire project.

The Three-Agent Architecture

A single agent can't handle this kind of complex, multi-stage workflow. Hephaestus uses a three-agent pipeline where each agent has a distinct role:

┌──────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Scanner    │     │     Analyzer     │     │      Writer      │
│ Gemini Flash │────▶│   Gemini Flash   │────▶│    Gemini Pro    │
│              │     │                  │     │                  │
│ Gitea API    │     │ Sparse Checkout  │     │ Bilingual Docs   │
│ Dir + README │     │ Deep Analysis    │     │ Self-Audit       │
│ → Topic Index│     │ Clean Room Demo  │     │ → ref-output     │
└──────────────┘     └──────────────────┘     └──────────────────┘

Scanner (the scout, Gemini Flash) performs lightweight scanning of repository directory structures and READMEs via the Gitea REST API — no cloning. It produces a topic index (TOPIC_INDEX.md). On its first run, it scanned 3 repositories and generated 35 candidate topics. Why no cloning? Some of the target C++ repositories are tens of gigabytes. A full clone would bankrupt the system at step one. The Scanner brings back the map, not the mountain.
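A "map, not the mountain" scan can be done with a single Gitea API call. The `git/trees` endpoint with `recursive=true` is part of Gitea's real REST API; the function names and error handling below are a simplified sketch, not the Scanner's actual code:

```python
import json
import urllib.request

def tree_url(base: str, owner: str, repo: str, ref: str = "main") -> str:
    # Gitea's git/trees endpoint returns the full file listing in one call,
    # so the Scanner never needs to clone anything.
    return f"{base}/api/v1/repos/{owner}/{repo}/git/trees/{ref}?recursive=true"

def scan_repo(base: str, owner: str, repo: str, token: str) -> list[str]:
    """Return every file path in the repository without cloning it."""
    req = urllib.request.Request(
        tree_url(base, owner, repo),
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        tree = json.load(resp)
    return [entry["path"] for entry in tree.get("tree", [])]
```

The path list plus a README fetch is all the Scanner needs to write TOPIC_INDEX.md; file contents are deferred to the Analyzer.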

Analyzer (the scholar, Gemini Flash) is the pipeline's core. Triggered every 24 hours by a heartbeat, it picks 2 topics from the index and uses the Gitea API or git sparse-checkout to pull only the target subdirectory. It follows a "Four Questions Checklist" to dissect the code:

  1. What problem does this code solve?
  2. Why did the designer choose this approach?
  3. What are the trade-offs?
  4. How would you express the same design intent in a different language?

If it can't answer all four, it skips the topic and picks another. No shallow articles. After analysis, it writes clean-room demo code and verifies it compiles — cargo clippy for Rust, go vet for Go. Ten consecutive compilation failures trigger a "failure retrospective" article instead.
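The "ten failures, then a retrospective" rule is just a bounded retry loop around the compile check. In this sketch, `rebuild_and_check` stands in for "regenerate the demo and run cargo clippy / go vet"; the structure is mine, not Hephaestus's actual code:

```python
from typing import Callable

MAX_FAILURES = 10  # ten consecutive failures trigger a retrospective article

def compile_gate(rebuild_and_check: Callable[[], bool]) -> str:
    """Retry the demo build; if it never passes within MAX_FAILURES
    attempts, switch to writing a failure-retrospective article."""
    for _attempt in range(MAX_FAILURES):
        if rebuild_and_check():
            return "publish-demo"
    return "failure-retrospective"
```

Turning a hard failure into a different kind of article means the pipeline always produces something, even on a bad day.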

Writer (the scribe, Gemini Pro) is delegated to by the Analyzer via OpenClaw's sessions_spawn mechanism. Using the more capable Gemini Pro for long-form generation, it transforms the analysis into bilingual (Chinese + English) technical reference documents, runs a self-audit (checking clean-room compliance, tone neutrality, and technical depth), then git pushes to the ref-output repository.

War Stories from the Debugging Trenches

This is the real meat. With any automation system, debugging takes far longer than building. Hephaestus was no exception.

Bug 1: The Vanishing SOUL.md

Symptom: The first article came out entirely in English, with wrong formatting, and zero adherence to any of the seven carefully crafted rules. The constitution was completely ignored.

Investigation: The logs showed the Agent claimed to have loaded SOUL.md. I exec'd into the container and cat'd the file it was actually reading:

You're not a chatbot. You're becoming someone.
Your mission is to...

That wasn't my rules file. That was OpenClaw's default template.

Root cause: OpenClaw instructs agents to read configuration from the workspace directory (/home/node/.openclaw/workspace/). My custom SOUL.md was sitting in the .openclaw config directory, but the workspace contained a framework-generated default. The agent faithfully followed instructions — just not my instructions.

Fix: Ensure the custom SOUL.md gets synced to the workspace directory. One cp command. But this bug cost me half a day.

Lesson: Never assume the agent is reading the config you think it's reading. Get into the container and look with your own eyes.

Bug 2: Taming the Tone — 5 Iterations

Symptom: After fixing the SOUL.md location, article quality jumped dramatically — Chinese text, correct format, working code. But between the lines, there was still an unmistakable "Rust saves the world" undertone.

Problem: When asked to "retell a design in another language," AI models naturally tend to rank languages. "Rust's ownership system fundamentally prevents the dangling pointer issues of C++" — technically correct, but a violation of the "no language bashing" principle.

5 rounds of iteration:

| Version | Strategy | Result |
|---------|----------|--------|
| V1 | "Don't say Rust is better." | Almost no effect |
| V2 | "Focus on trade-offs." | Slightly better, still biased |
| V3 | Banned-word blacklist: "trap," "pitfall," "elegant," "superior" explicitly forbidden | Notable improvement |
| V4 | Positive examples: "C++ manages resources through RAII; Rust embeds checks in the compiler. Two different design philosophies." | Close to ideal |
| V5 | Reframe the entire analysis from "language comparison" to "design trade-offs": all discussion must center on why the design was reasonable under the constraints of the time, and what it cost | Nailed it |
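The V3 blacklist is the one step that is trivially mechanizable, so it can run as a hard check in the self-audit rather than relying on the model's judgment. A minimal sketch, using only the four words the table names (the real blacklist may well have been longer):

```python
import re

# The four banned words called out in the V3 iteration.
BANNED = {"trap", "pitfall", "elegant", "superior"}

def audit_tone(draft: str) -> list[str]:
    """Return the banned words a draft contains (case-insensitive, whole words)."""
    words = set(re.findall(r"[a-z]+", draft.lower()))
    return sorted(words & BANNED)
```

A non-empty return sends the draft back for another pass before anything gets pushed.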

The Writer finally learned to discuss technical decisions the way a historian discusses past events: with scholarly neutrality, not as a referee in a language war. This was probably the single largest prompt-engineering investment in the entire project.

Bug 3: The Triple Gate of sessions_spawn

Symptom: The Analyzer called sessions_spawn to delegate to the Writer. Logs showed the call succeeded. But the Writer never started. No error messages whatsoever. Silent failure. This was the most maddening bug of them all.

sessions_spawn requires three conditions to be met simultaneously:

  1. Device Pairing: A paired.json file must exist and be valid. Container rebuilds can invalidate pairing, requiring regeneration.
  2. Agent Allowlist: The caller must explicitly declare which sub-agents it's allowed to spawn:
    "subagents": {
      "allowAgents": ["writer"]
    }
  3. Sandbox Disabled: OpenClaw enables sandboxing (Docker-in-Docker) by default. Running containers inside containers causes all sorts of networking and mounting failures. Must be explicitly disabled:
    "sandbox": { "mode": "off" }

If any one of these three gates is closed, sessions_spawn fails silently. The third one was the worst — it's hard to imagine a "security feature" being the reason your system doesn't work, especially when it gives you zero indication of the problem.
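Because the failure is silent, the cheapest defense is a preflight check that inspects all three gates before calling sessions_spawn at all. The paths and config keys below follow the snippets above; treat this as a diagnostic sketch, not OpenClaw's actual validation logic:

```python
import json
from pathlib import Path

def spawn_preflight(config_path: str, paired_path: str, target: str) -> list[str]:
    """Report which of the three sessions_spawn gates are closed."""
    problems = []
    # Gate 1: device pairing file must exist (container rebuilds can wipe it).
    if not Path(paired_path).exists():
        problems.append("paired.json missing: re-pair the device")
    cfg = json.loads(Path(config_path).read_text()) if Path(config_path).exists() else {}
    # Gate 2: the target sub-agent must be on the caller's allowlist.
    if target not in cfg.get("subagents", {}).get("allowAgents", []):
        problems.append(f"'{target}' not in subagents.allowAgents")
    # Gate 3: sandboxing must be explicitly off.
    if cfg.get("sandbox", {}).get("mode") != "off":
        problems.append("sandbox not disabled (sandbox.mode != 'off')")
    return problems
```

An empty list means all three gates are open; anything else is the error message the framework never gave me.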

Bug 4: Surviving Giant Repositories

Initial design: The Analyzer clones the target repository and analyzes locally.

Reality: Some large C++ repositories weigh in at tens of gigabytes. The first attempt saturated both disk and network, and the task timed out.

Final approach: A tiered fetching strategy.

  • Scanning phase (Scanner): Strictly uses the Gitea API — only directory trees and descriptive files
  • Analysis phase (Analyzer): Gets the specific file path from TOPIC_INDEX.md, then uses sparse checkout to pull only the target directory
git clone --depth 1 --filter=blob:none --sparse \
  ssh://git@server:2222/owner/repo.git /tmp/repo
cd /tmp/repo
git sparse-checkout set src/target_module

The --filter=blob:none flag is key — it tells Git to fetch file contents only on demand, skipping the upfront download of gigabytes of blob data. What used to require tens of GB now takes tens of MB. After analysis, rm -rf /tmp/repo — leave no trace.

The Topic System and Automation Loop

Hephaestus runs on three phases:

  • Phase A (replenish): When TOPIC_INDEX.md has fewer than 5 pending topics, automatically triggers Scanner for a full scan
  • Phase B (execute): Daily, picks 2 topics from the index. Selection rules ensure different source repositories, minimal keyword overlap, and no repeated demo languages
  • Phase C (maintain): Automatically updates topic status (pending / written / skipped)
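The Phase B selection rules (distinct source repositories, minimal keyword overlap) reduce to a greedy filter over the pending topics. The topic schema below is my guess at what a TOPIC_INDEX.md entry carries, and the 0.3 overlap threshold is an arbitrary illustration:

```python
def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two keyword sets."""
    return len(a & b) / max(len(a | b), 1)

def pick_topics(pending: list[dict], k: int = 2, max_overlap: float = 0.3) -> list[dict]:
    """Greedily pick k topics from different repos with barely overlapping keywords."""
    chosen: list[dict] = []
    for topic in pending:
        # Rule: no two picks from the same source repository.
        if any(c["repo"] == topic["repo"] for c in chosen):
            continue
        # Rule: keyword sets must barely overlap with already-chosen topics.
        kw = set(topic["keywords"])
        if any(overlap(kw, set(c["keywords"])) > max_overlap for c in chosen):
            continue
        chosen.append(topic)
        if len(chosen) == k:
            break
    return chosen
```

The no-repeated-demo-language rule is enforced separately, at generation time, by excluding the previous article's language from the rotation pool.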

The entire pipeline is triggered by OpenClaw's built-in 24-hour heartbeat mechanism — no external cron needed. After documents are pushed to the Git repository, a notification email is sent via the Resend API. Every morning, there's a fresh batch of reference material from the "code archaeologist" waiting in my inbox, ready for me to review and use as the basis for writing blog posts.

"heartbeat": {
  "every": "24h",
  "model": "google/gemini-3-flash",
  "prompt": "Read SOUL.md and HEARTBEAT.md, execute Phase B..."
}

Results

| Metric | Number |
|--------|--------|
| First full pipeline run | 1 minute 34 seconds |
| Token consumption (all 3 agents) | 177.9k |
| Scanner output | 3 repositories scanned, 35 candidate topics |
| Per heartbeat output | 2 bilingual technical reference documents |

Closing Thoughts

Building Hephaestus was far more than writing a few prompts. From the "paranormal" config file bug to iteratively sculpting the agent's "personality," from silent sub-agent delegation failures to survival strategies for giant repositories — every step reshaped my understanding of what it takes to build practical, reliable AI Agent systems.

It's not meant to replace human thinking. It turns the tedious "read code → extract insights → compile documentation" pipeline into a fully automated assembly line, freeing me to spend my time on review, reflection, and the final act of content creation instead of repetitive labor.

The god of craftsmanship works around the clock, delivering high-quality reference material on schedule every day. The final blog posts are still written and curated by a human — and that's perhaps the ideal form of human-AI collaboration.