Beyond Manual Configuration: Engineering AI Agent Cluster Setup

I'm running a local cluster of 5 AI agents — a main orchestrator, a blog writer, a development assistant, a smart home hub, and a communications assistant — all managed by the OpenClaw platform. Initially, all configurations were manually edited directly in the runtime directory ~/.openclaw/. The "source configuration" in the repository was merely a remnant from my Docker days. It wasn't until today, when I wanted to batch-switch models, that I realized this manual approach had become completely unmaintainable.

How Bad Was the Problem?

Let's look at how many places needed changing for the "simple" task of switching models:

  • agents.defaults.model.primary — global default
  • Each of the 5 agents' model.primary
  • The hephaestus agent's heartbeat.model
  • The hardcoded payload.model within the daily-articles cron job

That's 8 places spread across 2 JSON files. Missing even one leads to inconsistencies — which is exactly what happened. I missed the cron job entry, and the daily writing task kept running on the old model.

The larger issue was configuration drift: the config/openclaw.json in the repo still listed three non-existent agents (Analyzer, Scanner, Writer) and an outdated model name like google/gemini-3-pro. The repository had completely lost its role as the source of truth.

The Approach: Declarative Templates + Automated Deployment

The core goal was simple: the Git repository should be the sole source of truth, and configurations should be safely synchronized to the runtime with a single command.

The refactored project structure:

Hephaestus/
├── config/
│   ├── openclaw.json        # Config template (secrets use placeholders)
│   ├── cron-jobs.json       # Cron job declarations (no runtime state)
│   ├── SOUL.md              # Agent persona definitions
│   └── HEARTBEAT.md         # Daily workflow
├── secrets/
│   └── .env                 # All secrets (gitignored)
└── scripts/
    ├── deploy.sh            # One-click deployment
    └── ctl.sh               # Service management

Configuration Templating

Transforming the runtime openclaw.json into a template involved one key change: replacing sensitive information with environment variable placeholders:

{
  "channels": {
    "discord": {
      "enabled": true,
      "token": "${DISCORD_BOT_TOKEN}"
    }
  },
  "gateway": {
    "auth": {
      "mode": "token",
      "token": "${OPENCLAW_GATEWAY_TOKEN}"
    }
  }
}

Two types of fields were also stripped:

  • Runtime-generated fields: meta (version timestamps), wizard (onboarding state) — maintained by OpenClaw itself
  • Runtime state within cron jobs: state.nextRunAtMs, state.lastRunAtMs, etc.

The template now only describes "what I want the configuration to be" — clean and environment-agnostic.
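The stripping step is mechanical enough to automate. Here is a combined sketch (the helper name `strip_runtime_fields` is hypothetical, and in reality the agent config and cron file are separate; the field names match the list above):

```python
import json

def strip_runtime_fields(config: dict) -> dict:
    """Return a copy of a runtime config with runtime-only fields removed."""
    template = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    for key in ("meta", "wizard"):             # maintained by OpenClaw itself
        template.pop(key, None)
    for job in template.get("jobs", []):       # per-job runtime state
        job.pop("state", None)
    return template

runtime = {
    "meta": {"lastTouchedAtMs": 1750000000000},
    "wizard": {"completed": True},
    "jobs": [{"id": "daily-articles", "state": {"nextRunAtMs": 1750000000000}}],
    "agents": {"defaults": {"model": {"primary": "some/model"}}},
}
print(json.dumps(strip_runtime_fields(runtime), indent=2))
```

The original runtime dict is left untouched, so the same function can be run as a read-only check before committing.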

Cron Jobs: Removing Hardcoded Models

This was a pitfall I ran into. The daily-articles cron job had a hardcoded model in its payload:

{
  "payload": {
    "kind": "agentTurn",
    "message": "Read SOUL.md...",
    "model": "google-gemini-cli/gemini-3-pro-preview"
  }
}

This meant that even after changing the agent's default model, this cron job would still use the old one. The fix was simple — remove the model field from the template and let it inherit from the agent's own configuration.
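To keep this from regressing, a tiny lint over the cron template can flag any payload that reintroduces a pinned model. A minimal sketch (the helper `find_hardcoded_models` is hypothetical; the `payload.model` path matches the structure above):

```python
def find_hardcoded_models(cron_config: dict) -> list[str]:
    """Return ids of cron jobs whose payload pins a specific model."""
    return [
        job["id"]
        for job in cron_config.get("jobs", [])
        if "model" in job.get("payload", {})
    ]

cron = {"jobs": [
    {"id": "daily-articles", "payload": {"kind": "agentTurn", "model": "old/model"}},
    {"id": "hacker-news-daily", "payload": {"kind": "agentTurn"}},
]}
print(find_hardcoded_models(cron))  # → ['daily-articles']
```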

Isolated Secret Management

All values replaced by placeholders live in secrets/.env, excluded via .gitignore:

OPENCLAW_GATEWAY_TOKEN=7fd8cd94...
DISCORD_BOT_TOKEN=MTQ3Njc3...
ANTHROPIC_API_KEY=sk-ant-api03-...

The Deploy Script: Merge, Not Overwrite

deploy.sh is the core of the workflow. It doesn't simply copy the template over — that would destroy runtime state. Instead, it performs an intelligent merge.

Key steps:

# 1. Load secrets
source "$REPO_DIR/secrets/.env"

# 2. Placeholder substitution
GENERATED=$(sed \
  -e "s|\${DISCORD_BOT_TOKEN}|${DISCORD_BOT_TOKEN}|g" \
  -e "s|\${OPENCLAW_GATEWAY_TOKEN}|${OPENCLAW_GATEWAY_TOKEN}|g" \
  "$TEMPLATE")

# 3. Preserve runtime meta/wizard fields
FINAL_CONFIG=$(python3 -c "
import sys, json
with open('$RUNTIME_CONFIG') as f:
    runtime = json.load(f)
generated = json.loads(sys.stdin.read())
for key in ('meta', 'wizard'):
    if key in runtime:
        generated[key] = runtime[key]
json.dump(generated, sys.stdout, indent=2, ensure_ascii=False)
" <<< "$GENERATED")

The cron job merge is more refined — it reads declarative definitions from the template, and state plus timestamps from the runtime, matching by id:

FINAL_CRON=$(python3 -c "
import sys, json
with open('$CRON_TEMPLATE') as f:
    template = json.load(f)
with open('$RUNTIME_CRON') as f:
    runtime = json.load(f)

runtime_state = {}
for job in runtime.get('jobs', []):
    if 'state' in job:
        runtime_state[job['id']] = job['state']

for job in template.get('jobs', []):
    if job['id'] in runtime_state:
        job['state'] = runtime_state[job['id']]

json.dump(template, sys.stdout, indent=2, ensure_ascii=False)
")

The benefit: deployment doesn't reset cron job timers. If a task is scheduled to run in 3 hours, it will still run in 3 hours after deployment — no countdown reset.
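Distilled out of the deploy script, the merge-by-id logic is just a dictionary join. The same shape as the snippet above, with sample data in place of the real files:

```python
def merge_cron(template: dict, runtime: dict) -> dict:
    """Job definitions come from the template; runtime state survives by id."""
    runtime_state = {
        job["id"]: job["state"]
        for job in runtime.get("jobs", [])
        if "state" in job
    }
    for job in template.get("jobs", []):
        if job["id"] in runtime_state:
            job["state"] = runtime_state[job["id"]]
    return template

template = {"jobs": [{"id": "daily-articles", "schedule": "0 9 * * *"}]}
runtime = {"jobs": [{"id": "daily-articles", "state": {"nextRunAtMs": 1750000000000}}]}
merged = merge_cron(template, runtime)
print(merged["jobs"][0]["state"])  # → {'nextRunAtMs': 1750000000000}
```

A job that exists only in the template simply starts fresh, and a job deleted from the template disappears on deploy, which is exactly the declarative behavior I wanted.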

The final two steps are restart and health check:

# Restart gateway
launchctl kickstart -k "gui/$(id -u)/ai.openclaw.gateway"

# Wait for the service to come up
for i in $(seq 1 15); do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 \
    "http://127.0.0.1:18789/__openclaw__/canvas/" 2>/dev/null || echo "000")
  if [[ "$HTTP_CODE" != "000" ]]; then
    echo "Gateway is up (HTTP $HTTP_CODE)"
    break
  fi
  sleep 2
done

Two handy flags round out the script:

  • --dry-run: outputs the diff without writing anything — for pre-deployment review
  • --set-model <model>: batch-replaces all agent models in one shot
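Under the hood, --set-model is nothing magic. Assuming the field layout from earlier (agents.defaults.model.primary, per-agent model.primary, heartbeat.model) and a hypothetical "list" key for the agent entries, a sketch of the batch replacement might look like:

```python
def set_model(config: dict, model: str) -> dict:
    """Point every model reference in the config template at `model`."""
    config["agents"]["defaults"]["model"]["primary"] = model
    for agent in config["agents"].get("list", []):   # "list" key is assumed
        agent.setdefault("model", {})["primary"] = model
        if "heartbeat" in agent:                     # e.g. hephaestus
            agent["heartbeat"]["model"] = model
    return config

config = {"agents": {
    "defaults": {"model": {"primary": "old/model"}},
    "list": [
        {"id": "hephaestus", "model": {"primary": "old/model"},
         "heartbeat": {"model": "old/model"}},
        {"id": "dev", "model": {"primary": "old/model"}},
    ],
}}
set_model(config, "minimax-portal/MiniMax-M2.5-highspeed")
print(config["agents"]["list"][0]["heartbeat"]["model"])
```

Combined with the cron fix above (no hardcoded payload.model), this one function really does cover every model reference.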

Service Management: Unified Entry Point

Previously, every operation required typing verbose launchctl commands. Now, ctl.sh wraps all common operations:

./scripts/ctl.sh status    # Full overview
./scripts/ctl.sh restart   # Restart gateway
./scripts/ctl.sh tail      # Real-time logs
./scripts/ctl.sh cron      # Cron job overview
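The wrapper itself can be as small as a case dispatch. A dry-run sketch that echoes the underlying command for each verb (the launchd label matches the deploy script; the log path is assumed, and the cron command is elided):

```shell
#!/usr/bin/env bash
set -euo pipefail

# ctl.sh dispatch sketch: each verb maps to one underlying command.
ctl() {
  local label="gui/$(id -u)/ai.openclaw.gateway"
  case "${1:-}" in
    status)  echo "launchctl print $label" ;;
    restart) echo "launchctl kickstart -k $label" ;;
    tail)    echo "tail -f \$HOME/.openclaw/logs/gateway.log" ;;  # assumed path
    cron)    echo "(cron overview: implementation elided in this sketch)" ;;
    *)       echo "usage: ctl.sh {status|restart|tail|cron}" >&2; return 1 ;;
  esac
}

ctl restart
```

The real script runs the commands instead of echoing them, but the shape is the same: one verb, one operation, no launchctl incantations to remember.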

The status command outputs a comprehensive dashboard:

=== OpenClaw Status ===

Gateway:
  Service:  LOADED (pid=36397)
  HTTP health: OK (HTTP 200)
  Port:     18789

Agents:
  main            model=minimax-portal/MiniMax-M2.5-highspeed [DEFAULT]
  hephaestus      model=minimax-portal/MiniMax-M2.5-highspeed
  dev             model=minimax-portal/MiniMax-M2.5-highspeed
  home            model=minimax-portal/MiniMax-M2.5-highspeed
  comms           model=minimax-portal/MiniMax-M2.5-highspeed

Cron Jobs:
  ✓ daily-articles         agent=hephaestus   last=ok   next=in 5h38min
  ✓ hacker-news-daily      agent=main         last=-    next=in 9h58min
  ✓ quant-auto-evolve      agent=main         last=-    next=in 14min

All agent models and cron job schedules, visible at a glance.

Results

After the refactoring, today's requirement — "switch all models from Gemini to MiniMax M2.5 Highspeed" — became:

  1. Edit config/openclaw.json to change the model name
  2. Run ./scripts/deploy.sh

Done in 10 seconds, with every change tracked in Git.

Or even simpler:

./scripts/deploy.sh --set-model minimax-portal/MiniMax-M2.5-highspeed

One command. All agents, all heartbeats, everything updated.

Looking back, this was really about applying well-established Infrastructure as Code principles to AI Agent management. Configuration templating, secret separation, declarative deployment, intelligent merging — none of these are new ideas. But when combined, they deliver a real improvement in operational experience.