Beyond Manual Configuration: Engineering AI Agent Cluster Setup
I'm running a local cluster of 5 AI agents — a main orchestrator, a blog writer, a development assistant, a smart home hub, and a communications assistant — all managed by the OpenClaw platform. Initially, all configuration was edited by hand directly in the runtime directory `~/.openclaw/`; the "source configuration" in the repository was merely a remnant from my Docker days. It wasn't until today, when I wanted to batch-switch models, that I realized this manual approach had become completely unmaintainable.
How Bad Was the Problem?
Let's look at how many places needed changing for the "simple" task of switching models:
- `agents.defaults.model.primary` (the global default)
- Each of the 5 agents' `model.primary`
- The `hephaestus` agent's `heartbeat.model`
- The hardcoded `payload.model` within the `daily-articles` cron job
That's 8 places spread across 2 JSON files. Missing even one leads to inconsistencies — which is exactly what happened. I missed the cron job entry, and the daily writing task kept running on the old model.
The larger issue was configuration drift: the `config/openclaw.json` in the repo still listed three non-existent agents (Analyzer, Scanner, Writer) and an outdated model name (`google/gemini-3-pro`). The repository had completely lost its role as the source of truth.
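Drift like this is easy to catch mechanically once you diff the two files. A minimal sketch of such a check, assuming a simplified schema where agents live under an `agents` key with a `model.primary` field (the real OpenClaw schema may differ):

```python
def find_drift(repo: dict, runtime: dict) -> list:
    """List agents present in only one config, or whose primary model differs.
    Assumes a simplified schema: {"agents": {"<name>": {"model": {"primary": ...}}}}."""
    repo_agents = repo.get("agents", {})
    rt_agents = runtime.get("agents", {})
    issues = []
    for name in sorted(set(repo_agents) | set(rt_agents)):
        if name not in rt_agents:
            issues.append(f"{name}: declared in repo but missing at runtime")
        elif name not in repo_agents:
            issues.append(f"{name}: running but undeclared in repo")
        else:
            want = repo_agents[name].get("model", {}).get("primary")
            have = rt_agents[name].get("model", {}).get("primary")
            if want != have:
                issues.append(f"{name}: model drift (repo={want}, runtime={have})")
    return issues

# Hypothetical configs mirroring the drift described above
repo = {"agents": {"analyzer": {"model": {"primary": "google/gemini-3-pro"}}}}
runtime = {"agents": {"main": {"model": {"primary": "minimax-portal/MiniMax-M2.5-highspeed"}}}}
for issue in find_drift(repo, runtime):
    print(issue)
```

Running a check like this in CI (or as a `--dry-run` style preflight) turns silent drift into a visible failure.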
The Approach: Declarative Templates + Automated Deployment
The core goal was simple: the Git repository should be the sole source of truth, and configurations should be safely synchronized to the runtime with a single command.
The refactored project structure:
```
Hephaestus/
├── config/
│   ├── openclaw.json    # Config template (secrets use placeholders)
│   ├── cron-jobs.json   # Cron job declarations (no runtime state)
│   ├── SOUL.md          # Agent persona definitions
│   └── HEARTBEAT.md     # Daily workflow
├── secrets/
│   └── .env             # All secrets (gitignored)
├── scripts/
│   ├── deploy.sh        # One-click deployment
│   └── ctl.sh           # Service management
```
Configuration Templating
Transforming the runtime `openclaw.json` into a template involved one key change: replacing sensitive information with environment-variable placeholders:
```json
{
  "channels": {
    "discord": {
      "enabled": true,
      "token": "${DISCORD_BOT_TOKEN}"
    }
  },
  "gateway": {
    "auth": {
      "mode": "token",
      "token": "${OPENCLAW_GATEWAY_TOKEN}"
    }
  }
}
```
Two types of fields were also stripped:
- Runtime-generated fields: `meta` (version timestamps) and `wizard` (onboarding state), maintained by OpenClaw itself
- Runtime state within cron jobs: `state.nextRunAtMs`, `state.lastRunAtMs`, etc.
The template now only describes "what I want the configuration to be" — clean and environment-agnostic.
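If the template ever needs to be (re)bootstrapped from a captured runtime config, the stripping step is mechanical. A minimal sketch, assuming the field names above (`meta`, `wizard`, per-job `state`) and, for illustration, that cron jobs sit under a `jobs` key:

```python
import copy
import json

RUNTIME_ONLY_KEYS = ("meta", "wizard")  # maintained by OpenClaw at runtime

def to_template(runtime_cfg: dict) -> dict:
    """Drop runtime-generated fields so only the declarative intent remains."""
    tpl = copy.deepcopy(runtime_cfg)
    for key in RUNTIME_ONLY_KEYS:
        tpl.pop(key, None)
    for job in tpl.get("jobs", []):  # volatile cron state (nextRunAtMs, ...)
        job.pop("state", None)
    return tpl

# Hypothetical captured runtime config
captured = {
    "meta": {"lastWriteMs": 1700000000000},
    "wizard": {"completed": True},
    "gateway": {"auth": {"mode": "token"}},
    "jobs": [{"id": "daily-articles", "state": {"nextRunAtMs": 123}}],
}
print(json.dumps(to_template(captured), indent=2))
```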
Cron Jobs: Removing Hardcoded Models
This was a pitfall I ran into. The `daily-articles` cron job had a hardcoded model in its payload:
```json
{
  "payload": {
    "kind": "agentTurn",
    "message": "Read SOUL.md...",
    "model": "google-gemini-cli/gemini-3-pro-preview"
  }
}
```
This meant that even after changing the agent's default model, this cron job would still use the old one. The fix was simple — remove the model field from the template and let it inherit from the agent's own configuration.
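After the fix, the template entry declares no model at all and simply inherits the agent's default (the trimmed payload, sketched):

```json
{
  "payload": {
    "kind": "agentTurn",
    "message": "Read SOUL.md..."
  }
}
```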
Isolated Secret Management
All values replaced by placeholders live in `secrets/.env`, which is excluded via `.gitignore`:
```
OPENCLAW_GATEWAY_TOKEN=7fd8cd94...
DISCORD_BOT_TOKEN=MTQ3Njc3...
ANTHROPIC_API_KEY=sk-ant-api03-...
```
The Deploy Script: Merge, Not Overwrite
deploy.sh is the core of the workflow. It doesn't simply copy the template over — that would destroy runtime state. Instead, it performs an intelligent merge.
Key steps:
```bash
# 1. Load secrets
source "$REPO_DIR/secrets/.env"

# 2. Placeholder substitution
GENERATED=$(sed \
    -e "s|\${DISCORD_BOT_TOKEN}|${DISCORD_BOT_TOKEN}|g" \
    -e "s|\${OPENCLAW_GATEWAY_TOKEN}|${OPENCLAW_GATEWAY_TOKEN}|g" \
    "$TEMPLATE")

# 3. Preserve runtime meta/wizard fields
FINAL_CONFIG=$(python3 -c "
import sys, json

with open('$RUNTIME_CONFIG') as f:
    runtime = json.load(f)
generated = json.loads(sys.stdin.read())

for key in ('meta', 'wizard'):
    if key in runtime:
        generated[key] = runtime[key]

json.dump(generated, sys.stdout, indent=2, ensure_ascii=False)
" <<< "$GENERATED")
```
The cron job merge is more refined — it reads declarative definitions from the template, and state plus timestamps from the runtime, matching by id:
```bash
FINAL_CRON=$(python3 -c "
import sys, json

with open('$CRON_TEMPLATE') as f:
    template = json.load(f)
with open('$RUNTIME_CRON') as f:
    runtime = json.load(f)

# Collect per-job runtime state, keyed by id
runtime_state = {}
for job in runtime.get('jobs', []):
    if 'state' in job:
        runtime_state[job['id']] = job['state']

# Graft runtime state onto the declarative template
for job in template.get('jobs', []):
    if job['id'] in runtime_state:
        job['state'] = runtime_state[job['id']]

json.dump(template, sys.stdout, indent=2, ensure_ascii=False)
")
```
The benefit: deployment doesn't reset cron job timers. If a task is scheduled to run in 3 hours, it will still run in 3 hours after deployment — no countdown reset.
The final two steps are restart and health check:
```bash
# Restart gateway
launchctl kickstart -k "gui/$(id -u)/ai.openclaw.gateway"

# Wait for the service to come up
for i in $(seq 1 15); do
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 \
        "http://127.0.0.1:18789/__openclaw__/canvas/" 2>/dev/null || echo "000")
    if [[ "$HTTP_CODE" != "000" ]]; then
        echo "Gateway is up (HTTP $HTTP_CODE)"
        break
    fi
    sleep 2
done
```
Two handy parameters round out the script:
- `--dry-run`: outputs the diff without writing anything, for pre-deployment review
- `--set-model <model>`: batch-replaces all agent models in one shot
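The core of `--set-model` can be as small as a single walk over the config. An illustrative sketch, again assuming the simplified schema from earlier (`agents.defaults` plus named agents, each with `model.primary` and an optional `heartbeat`):

```python
import json

def set_model_everywhere(cfg: dict, model: str) -> dict:
    """Point every model.primary (defaults included) and heartbeat.model at `model`."""
    for entry in cfg.get("agents", {}).values():
        entry.setdefault("model", {})["primary"] = model
        if "heartbeat" in entry:
            entry["heartbeat"]["model"] = model
    return cfg

# Hypothetical config with a global default plus one agent with a heartbeat
cfg = {"agents": {
    "defaults":   {"model": {"primary": "google/gemini-3-pro"}},
    "hephaestus": {"model": {"primary": "google/gemini-3-pro"},
                   "heartbeat": {"model": "google/gemini-3-pro"}},
}}
updated = set_model_everywhere(cfg, "minimax-portal/MiniMax-M2.5-highspeed")
print(json.dumps(updated, indent=2))
```

The cron job's `payload.model` needs no special handling here, since it was removed from the template and inherits the agent's model.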
Service Management: Unified Entry Point
Previously, every operation required typing verbose launchctl commands. Now, ctl.sh wraps all common operations:
```bash
./scripts/ctl.sh status    # Full overview
./scripts/ctl.sh restart   # Restart gateway
./scripts/ctl.sh tail      # Real-time logs
./scripts/ctl.sh cron      # Cron job overview
```
The status command outputs a comprehensive dashboard:
```
=== OpenClaw Status ===

Gateway:
  Service: LOADED (pid=36397)
  HTTP health: OK (HTTP 200)
  Port: 18789

Agents:
  main        model=minimax-portal/MiniMax-M2.5-highspeed  [DEFAULT]
  hephaestus  model=minimax-portal/MiniMax-M2.5-highspeed
  dev         model=minimax-portal/MiniMax-M2.5-highspeed
  home        model=minimax-portal/MiniMax-M2.5-highspeed
  comms       model=minimax-portal/MiniMax-M2.5-highspeed

Cron Jobs:
  ✓ daily-articles     agent=hephaestus  last=ok  next=in 5h38min
  ✓ hacker-news-daily  agent=main        last=-   next=in 9h58min
  ✓ quant-auto-evolve  agent=main        last=-   next=in 14min
```
All agent models and cron job schedules, visible at a glance.
Results
After the refactoring, today's requirement — "switch all models from Gemini to MiniMax M2.5 Highspeed" — became:
- Edit `config/openclaw.json` to change the model name
- Run `./scripts/deploy.sh`
Done in 10 seconds, with every change tracked in Git.
Or even simpler:
```bash
./scripts/deploy.sh --set-model minimax-portal/MiniMax-M2.5-highspeed
```
One command. All agents, all heartbeats, everything updated.
Looking back, this was really about applying well-established Infrastructure as Code principles to AI Agent management. Configuration templating, secret separation, declarative deployment, intelligent merging — none of these are new ideas. But when combined, they deliver a real improvement in operational experience.