What if your server could fix itself? Not just alert you when something goes wrong, but actually diagnose the problem, apply the right fix, and tell you about it afterward. With OpenClaw, you can build exactly that.
This tutorial walks you through creating a self-healing server setup where your AI agent monitors system health, detects common issues, applies fixes automatically, and sends you a report via email.
Self-healing server architecture
What We Are Building
A system where OpenClaw:
Monitors server metrics every 5 minutes (disk, memory, CPU, services)
Detects common problems (disk full, service crashed, high load, SSL expiry)
Diagnoses the root cause using AI reasoning
Fixes the issue automatically when safe
Notifies you via email with a full report of what happened
The key principle: fix what is safe to fix, alert on everything else.
Prerequisites
OpenClaw installed and running on the server you want to monitor
Basic familiarity with Linux system administration
Step 1: Configure the SOUL.md
Add a server operations section to your SOUL.md:
## Server Operations Role
You are responsible for monitoring and maintaining this server. Your priorities:
1. Keep services running
2. Prevent disk and memory exhaustion
3. Maintain security (SSL certificates, updates)
4. Log everything you do
### Auto-Fix Rules (SAFE to do without asking)
- Clear apt/package manager cache when disk > 85%
- Delete old log files (> 30 days) when disk > 85%
- Restart a crashed service (max 3 times, then escalate)
- Renew SSL certificates via certbot
- Clear /tmp files older than 7 days
### Escalate Rules (DO NOT auto-fix, notify admin)
- Disk usage above 95% after cleanup
- Service crashes more than 3 times in 1 hour
- Unusual processes or network connections
- Failed security updates
- Any issue you are not confident about
### Notification Rules
- Auto-fixes: Send summary email via Inbounter
- Escalations: Send urgent email via Inbounter with "URGENT" in subject
- Weekly: Send a health summary every Sunday at 8 AM
### Safety Rules
- NEVER delete user data to free disk space
- NEVER modify firewall rules automatically
- NEVER update the kernel without admin approval
- NEVER restart the server (only restart individual services)
- Always log commands to /var/log/openclaw-ops.log before executing
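The last rule can be enforced mechanically by wrapping every command in a small logging helper the agent calls instead of running commands directly. This is a sketch, not an OpenClaw feature: the `log_and_run` name and the `OPS_LOG` override are assumptions for illustration.

```shell
#!/bin/bash
# log_and_run: write an ACTION entry before executing, and a RESULT entry after.
# OPS_LOG can override the log path (useful for testing without root).
log_and_run() {
  local log="${OPS_LOG:-/var/log/openclaw-ops.log}"
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] ACTION: $*" >> "$log"
  if "$@"; then
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] RESULT: success" >> "$log"
  else
    local rc=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] RESULT: failed (exit $rc)" >> "$log"
    return "$rc"
  fi
}
```

Usage: `log_and_run sudo apt clean` produces a timestamped ACTION line before the command runs, so the log is written even if the command hangs or kills the session.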
Step 2: Create Health Check Scripts
Create a set of scripts that the agent can run to gather system information.
System Health Script
#!/bin/bash
# /home/openclaw/scripts/health-check.sh
echo "=== System Health Report ==="
echo "Date: $(date)"
echo ""
echo "=== Disk Usage ==="
df -h / /home /var /tmp 2>/dev/null | grep -v tmpfs
echo ""
echo "=== Memory ==="
free -h
echo ""
echo "=== CPU Load ==="
uptime
echo ""
echo "=== Top Processes (by memory) ==="
ps aux --sort=-%mem | head -6
echo ""
echo "=== Top Processes (by CPU) ==="
ps aux --sort=-%cpu | head -6
echo ""
echo "=== Services Status ==="
for svc in nginx postgresql redis openclaw; do
  if systemctl is-active --quiet "$svc" 2>/dev/null; then
    echo "  $svc: RUNNING"
  else
    echo "  $svc: STOPPED"
  fi
done
echo ""
echo "=== SSL Certificates ==="
for domain in $(ls /etc/letsencrypt/live/ 2>/dev/null); do
  expiry=$(openssl x509 -enddate -noout -in "/etc/letsencrypt/live/$domain/cert.pem" 2>/dev/null | cut -d= -f2)
  if [ -n "$expiry" ]; then
    echo "  $domain: expires $expiry"
  fi
done
echo ""
echo "=== Recent Errors (last 30 min) ==="
journalctl --since "30 minutes ago" --priority=err --no-pager | tail -10
echo ""
echo "=== Open Connections ==="
echo "$(ss -tuln | grep -c LISTEN) listening ports"
echo ""
echo "=== Uptime ==="
uptime -p
Make it executable:
chmod +x /home/openclaw/scripts/health-check.sh
Health check script output example
Step 3: Set Up the Monitoring Cron Job
Schedule OpenClaw to run health checks automatically:
# Every 5 minutes: quick check
openclaw cron add --name "quick-check" "*/5 * * * *" \
"Run /home/openclaw/scripts/health-check.sh and analyze the output.
CHECK FOR:
1. Disk usage above 85% on any partition
2. Memory usage above 90%
3. CPU load average above 4.0 (for a 4-core server)
4. Any service in STOPPED state
5. SSL certificates expiring within 14 days
IF everything is normal: Do nothing. No notification needed.
IF an issue is detected:
- Apply auto-fix if it matches the Auto-Fix Rules in SOUL.md
- Log the fix to /var/log/openclaw-ops.log
- Send a summary email via Inbounter to admin@company.com
IF the issue requires escalation:
- Send an URGENT email via Inbounter to admin@company.com
- Include the full health check output and your diagnosis"
# Every hour: deeper analysis
openclaw cron add --name "hourly-analysis" "0 * * * *" \
"Run a deeper analysis:
1. Check /var/log/syslog for unusual patterns
2. Check for failed SSH login attempts
3. Verify all critical ports are responding
4. Check if any process is consuming excessive resources
5. Verify backup job completed (check /var/log/backup.log)
Only notify if something unusual is found."
# Weekly report
openclaw cron add --name "weekly-health-report" "0 8 * * 0" \
"Generate a weekly server health report:
- Uptime statistics
- Average resource usage
- Auto-fixes performed this week
- Escalations raised
- Disk usage trend
- Top 5 resource-consuming processes (average)
Send the report via Inbounter to admin@company.com with subject
'Weekly Server Health Report - [date]'"
Step 4: Define Auto-Fix Procedures
Instruct the agent on specific remediation steps for common issues.
Disk Space Recovery
# Add to SOUL.md
### Disk Space Recovery Procedure
When disk usage exceeds 85%:
1. Check what is using space:
`du -sh /var/log/* /tmp/* /var/cache/* | sort -rh | head -20`
2. Safe cleanup actions (in order):
a. Clear apt cache: `sudo apt clean`
b. Remove old kernels: `sudo apt autoremove -y`
c. Clear journal logs: `sudo journalctl --vacuum-time=7d`
d. Delete old log files: `sudo find /var/log -name "*.gz" -mtime +30 -delete`
e. Clear /tmp: `sudo find /tmp -type f -mtime +7 -delete`
f. Clear Docker unused images: `docker system prune -f` (if Docker is installed)
3. After cleanup, re-check disk usage.
4. If still above 90%, escalate to admin.
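The numbered procedure above can be sketched as a single script. This is an illustrative implementation, not part of OpenClaw: the `cleanup_disk` name and the `DRY_RUN` switch are this sketch's own conventions.

```shell
#!/bin/bash
# Sketch of the disk recovery procedure. DRY_RUN=1 prints each command instead
# of executing it, so the sequence can be reviewed safely before going live.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "WOULD RUN: $*"; else "$@"; fi
}

usage_pct() {  # integer disk usage for a mount point, e.g. 42
  df --output=pcent "${1:-/}" | tail -1 | tr -dc '0-9'
}

cleanup_disk() {  # usage: cleanup_disk MOUNT [THRESHOLD]
  local mount="${1:-/}" threshold="${2:-85}" before after
  before=$(usage_pct "$mount")
  if [ "$before" -le "$threshold" ]; then
    echo "Disk at ${before}% on $mount - below ${threshold}%, nothing to do."
    return 0
  fi
  # Safe cleanup actions, in the order given above
  run sudo apt clean
  run sudo apt autoremove -y
  run sudo journalctl --vacuum-time=7d
  run sudo find /var/log -name '*.gz' -mtime +30 -delete
  run sudo find /tmp -type f -mtime +7 -delete
  after=$(usage_pct "$mount")
  echo "Disk usage on $mount: ${before}% -> ${after}%"
  if [ "$after" -gt 90 ]; then
    echo "ESCALATE: still above 90% after cleanup"
    return 1
  fi
}
```

Run `DRY_RUN=1 cleanup_disk /var` first to confirm the command sequence before letting the agent execute it for real.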
Service Restart
### Service Restart Procedure
When a service is in STOPPED state:
1. Check why it stopped: `journalctl -u [service] --since "1 hour ago" --no-pager | tail -30`
2. Attempt restart: `sudo systemctl restart [service]`
3. Wait 10 seconds, check status: `sudo systemctl status [service]`
4. If restart succeeds, log it and send notification
5. If restart fails:
- Try once more after 30 seconds
- If still failing, escalate to admin with the error logs
6. Track restart count: if a service restarts 3+ times in 1 hour, escalate
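The restart-with-retry steps translate directly into a shell function. This is a sketch under a few assumptions: it runs as a user with `systemctl` rights (add `sudo` if yours does not), and the `WAIT`/`RETRY_WAIT` variables exist only so the timings can be shortened when testing.

```shell
#!/bin/bash
# Sketch of the service restart procedure: log, capture context, restart with
# one retry, escalate on failure. OPS_LOG overrides the log path for testing.

restart_service() {  # usage: restart_service SERVICE
  local svc="$1" log="${OPS_LOG:-/var/log/openclaw-ops.log}"
  local wait="${WAIT:-10}" retry_wait="${RETRY_WAIT:-30}" attempt
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] ACTION: restart $svc" >> "$log"
  # Capture why it stopped before touching it
  journalctl -u "$svc" --since "1 hour ago" --no-pager 2>/dev/null | tail -30
  for attempt in 1 2; do
    if systemctl restart "$svc" 2>/dev/null; then
      sleep "$wait"
      if systemctl is-active --quiet "$svc" 2>/dev/null; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] RESULT: $svc running (attempt $attempt)" >> "$log"
        return 0
      fi
    fi
    if [ "$attempt" -eq 1 ]; then sleep "$retry_wait"; fi
  done
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] RESULT: $svc restart FAILED - escalate" >> "$log"
  return 1
}
```

The 3-restarts-per-hour escalation rule is not handled here; that belongs in the rate-limit logic described under Safety Guardrails.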
SSL Certificate Renewal
### SSL Certificate Renewal
When a certificate expires within 14 days:
1. Attempt renewal: `sudo certbot renew --cert-name [domain]`
2. If successful, reload nginx: `sudo systemctl reload nginx`
3. Verify: `echo | openssl s_client -servername [domain] -connect [domain]:443 2>/dev/null | openssl x509 -noout -enddate`
4. If renewal fails, escalate with the certbot error output
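The 14-day window check itself can be scripted with openssl's `-checkend` flag, which avoids parsing date strings. The function name here is illustrative:

```shell
#!/bin/bash
# Exit 0 when the certificate expires within the given number of days.
cert_expires_within() {  # usage: cert_expires_within /path/to/cert.pem DAYS
  # -checkend N succeeds if the cert is still valid N seconds from now,
  # so we negate it to answer "is it expiring?"
  ! openssl x509 -checkend $(( $2 * 86400 )) -noout -in "$1" >/dev/null
}

# Example:
#   cert_expires_within /etc/letsencrypt/live/example.com/cert.pem 14 \
#     && echo "renewal needed"
```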
Auto-fix decision tree
Step 5: Set Up Email Notifications via Inbounter
The agent uses its email skill to send notifications through Inbounter. Here are examples of the emails it will generate:
Auto-fix notification:
Subject: [Auto-Fix] Disk cleanup performed on production server
At 14:35 UTC, disk usage on /var reached 87%.
Actions taken:
- Cleared apt cache: freed 1.2 GB
- Removed old journals: freed 800 MB
- Deleted old log archives: freed 400 MB
Current disk usage: 62%
No further action needed.
Escalation notification:
Subject: [URGENT] PostgreSQL service crashed on production server
At 14:35 UTC, PostgreSQL was found in STOPPED state.
Diagnosis:
- Error log shows: "FATAL: could not map anonymous shared memory: Cannot allocate memory"
- System memory is at 94% usage
- Top process: java (PID 1234) using 6.2 GB RAM
Actions taken:
- Attempted restart: FAILED (same error)
- Second attempt after 30s: FAILED
Recommended actions:
1. Investigate the Java process (PID 1234) consuming excessive memory
2. Consider increasing server RAM or adding swap
3. Restart PostgreSQL after memory issue is resolved
Full health check output attached.
Step 6: Review the Ops Log
Every action the agent takes is written to /var/log/openclaw-ops.log, per the Safety Rules in SOUL.md. A typical excerpt after a disk cleanup looks like this:
[2026-04-05 14:35:12] ACTION: Disk cleanup - clearing apt cache | CMD: sudo apt clean
[2026-04-05 14:35:15] RESULT: Success - freed 1.2 GB
[2026-04-05 14:35:16] ACTION: Disk cleanup - clearing old journals | CMD: sudo journalctl --vacuum-time=7d
[2026-04-05 14:35:18] RESULT: Success - freed 800 MB
[2026-04-05 14:35:20] ACTION: Post-cleanup disk check | CMD: df -h /var
[2026-04-05 14:35:20] RESULT: /var at 62% - within acceptable range
Step 7: Testing Your Setup
Simulate Disk Full
# Create a large temporary file
dd if=/dev/zero of=/tmp/test-fill bs=1M count=5000
# Wait for the next health check (or trigger manually)
openclaw cron run --name "quick-check"
# Verify the agent detected and cleaned up
tail -10 /var/log/openclaw-ops.log
# Clean up test file
rm /tmp/test-fill
Simulate Service Crash
# Stop a non-critical service
sudo systemctl stop redis
# Trigger health check
openclaw cron run --name "quick-check"
# Check if it was restarted
sudo systemctl status redis
Verify Email Notifications
# Trigger a test notification
openclaw run "Send a test notification email via Inbounter to admin@company.com
with subject 'Test: Self-Healing Server Alert' and body
'This is a test notification from your self-healing server setup.'"
Advanced: Multi-Server Monitoring
If you manage multiple servers, your OpenClaw agent can monitor them remotely via SSH:
#!/bin/bash
# /home/openclaw/scripts/remote-health.sh
SERVERS=("web1:192.168.1.10" "web2:192.168.1.11" "db1:192.168.1.20")
for entry in "${SERVERS[@]}"; do
name="${entry%%:*}"
ip="${entry##*:}"
echo "=== $name ($ip) ==="
ssh -o ConnectTimeout=5 "openclaw@$ip" '/home/openclaw/scripts/health-check.sh' 2>/dev/null
if [ $? -ne 0 ]; then
echo " CONNECTION FAILED"
fi
echo ""
done
openclaw cron add --name "multi-server-check" "*/10 * * * *" \
"Run /home/openclaw/scripts/remote-health.sh and analyze all servers.
Report any issues found, specifying which server has the problem."
Multi-server monitoring setup
Safety Guardrails
Self-healing is powerful but dangerous if misconfigured. Implement these safeguards:
1. Rate Limit Auto-Fixes
# SOUL.md
### Rate Limits
- Maximum 5 auto-fix actions per hour
- Maximum 3 service restarts per service per hour
- If limits exceeded, stop auto-fixing and escalate everything
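One way to enforce these limits is to count recent ACTION entries in the ops log before allowing another fix. This sketch assumes GNU `date` and the `[YYYY-MM-DD HH:MM:SS] ACTION: ...` log format shown in the ops-log excerpt earlier:

```shell
#!/bin/bash
# Count auto-fix ACTION entries from the last hour in the ops log.
recent_fix_count() {
  local log="${OPS_LOG:-/var/log/openclaw-ops.log}" cutoff
  cutoff=$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')
  # Timestamps in this format compare correctly as plain strings
  awk -v cutoff="$cutoff" -F'[][]' '/ACTION:/ && $2 >= cutoff' "$log" | wc -l
}

# Usage: refuse further fixes past the limit
#   if [ "$(recent_fix_count)" -ge 5 ]; then echo "Rate limit hit - escalating"; fi
```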
2. Dry Run Mode
Test your setup in dry-run mode first:
# config.yaml
ops:
  dry_run: true  # Log what would be done without executing
3. Kill Switch
If the agent starts causing problems, stop it immediately:
openclaw cron pause --all
openclaw stop
4. Undo Log
Log enough information to undo each action if needed:
# SOUL.md
### Undo Logging
For each auto-fix action, also log the undo command:
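For instance, mirroring the ops-log format shown earlier (the exact format is illustrative):

```
[2026-04-05 14:40:02] ACTION: sudo systemctl stop redis | UNDO: sudo systemctl start redis
[2026-04-05 14:41:10] ACTION: mv /var/log/app.log.1 /var/quarantine/ | UNDO: mv /var/quarantine/app.log.1 /var/log/
[2026-04-05 14:42:30] ACTION: sudo journalctl --vacuum-time=7d | UNDO: none (irreversible)
```

Actions with no undo (deletions, vacuumed journals) should be flagged as irreversible so the admin knows they cannot be rolled back.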
FAQ
How do I prevent the agent from making things worse?
Strict SOUL.md rules, rate limits, and a conservative auto-fix list. Start with only disk cleanup and service restarts. Add more fixes gradually as you build confidence.
Can I use this for production servers?
Yes, with caution. Start with monitoring-only (no auto-fix) for 2-4 weeks. Review the alerts. Then enable auto-fix for the most common, safest fixes.
How much does this cost in API tokens?
A health check every 5 minutes is roughly 8,600 checks per month. At approximately 3,000 tokens per check, that works out to around 26 million tokens per month, so budget tens of dollars per month at typical Claude Sonnet pricing rather than a few dollars. Most checks find nothing wrong and can use fewer tokens, and routing routine checks to a cheaper model reduces the cost substantially.
Can I combine this with existing monitoring (Datadog, Grafana)?
Yes. Use OpenClaw as the "intelligent response layer" that receives alerts from your existing monitoring and decides what to do. Forward Grafana alerts to OpenClaw via webhook.
What if the agent itself goes down?
Use an external health check service (UptimeRobot, Healthchecks.io) to monitor the OpenClaw health endpoint. If it goes down, you get notified independently.