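Running OpenClaw in production requires solid observability: logs, health checks, and metrics are the foundation of operational stability.
Logging configuration
Basic log settings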
json
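{
  "logging": {
    "level": "info",
    "format": "json",
    "file": "/var/log/openclaw/gateway.log"
  }
}
Log levels (from most to least verbose):
- debug: all debugging information (for development; do not enable in production)
- info: normal operational messages (recommended default for production)
- warn: warnings (potential problems that do not affect operation)
- error: errors (problems that affect functionality)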
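To make the ordering concrete, here is a minimal Python sketch of threshold-based level filtering. The numeric ranks in `LEVELS` and the `should_emit` helper are illustrative assumptions, not OpenClaw's actual implementation.

```python
# Hypothetical ranking of the four levels listed above, most verbose first.
LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def should_emit(message_level: str, configured_level: str) -> bool:
    """Return True if a message at message_level passes the configured threshold."""
    return LEVELS[message_level] >= LEVELS[configured_level]


if __name__ == "__main__":
    # With "level": "info", debug messages are dropped and the rest are kept.
    for level in LEVELS:
        print(level, should_emit(level, "info"))
```

Under this reading, setting the level to `info` keeps `info`, `warn`, and `error` messages and suppresses `debug` output.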
JSON-format logs (recommended for production)
json
{
  "logging": {
    "level": "info",
    "format": "json",
    "file": "/var/log/openclaw/gateway.log",
    "rotate": {
      "maxSize": "100mb",
      "maxFiles": 7
    }
  }
}
Example JSON output:
json
{"level":"info","ts":"2026-03-25T14:23:01Z","channel":"telegram","userId":"@alice","tokens":{"input":1234,"output":456},"latencyMs":1823}
Structured output like this is easy for log systems such as Elasticsearch or Loki to parse.
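Per-component log levels
json
{
  "logging": {
    "level": "info",
    "components": {
      "gateway": "info",
      "channels.telegram": "debug",
      "channels.slack": "warn",
      "providers.anthropic": "info",
      "exec": "debug"
    }
  }
}
This lets you enable debug for one specific channel while the rest stay at info, so you can pinpoint a problem without flooding the logs.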
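The source does not spell out how a component name is matched against the `components` map. As a sketch only, the Python below assumes the most specific dotted prefix wins and that anything unlisted falls back to the top-level `level`. Both the lookup rule and the `effective_level` helper are assumptions, not documented OpenClaw behavior.

```python
# Same settings as the per-component config above, as a Python dict.
CONFIG = {
    "logging": {
        "level": "info",
        "components": {
            "gateway": "info",
            "channels.telegram": "debug",
            "channels.slack": "warn",
            "providers.anthropic": "info",
            "exec": "debug",
        },
    }
}


def effective_level(component: str, config: dict = CONFIG) -> str:
    """Resolve a component's level (assumed rule: longest dotted prefix wins)."""
    logging_cfg = config["logging"]
    overrides = logging_cfg.get("components", {})
    parts = component.split(".")
    # Try the full name first, then progressively shorter dotted prefixes.
    while parts:
        name = ".".join(parts)
        if name in overrides:
            return overrides[name]
        parts.pop()
    # No override matched: fall back to the global default.
    return logging_cfg["level"]


if __name__ == "__main__":
    for name in ("channels.telegram", "channels.matrix", "exec"):
        print(name, "->", effective_level(name))
```

Under these assumptions `channels.telegram` resolves to `debug`, while an unlisted component such as `channels.matrix` inherits the global `info`.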
Viewing logs from the command line
bash
# Live log stream
openclaw logs --follow
# Last 100 lines
openclaw logs --tail 100
# Filter by level
openclaw logs --level error
openclaw logs --level warn
# Filter by channel
openclaw logs --channel telegram
openclaw logs --channel slack
# Search for a keyword
openclaw logs --grep "rate_limit"
openclaw logs --grep "403"
# Time range
openclaw logs --since 30m
openclaw logs --since "2026-03-25 14:00"
# Export
openclaw logs --since 1h > /tmp/debug.log
Health check endpoint
The Gateway exposes a standard HTTP health check endpoint:
bash
# Basic health check
curl http://127.0.0.1:18789/health
# Healthy response (200 OK)
{
  "status": "ok",
  "version": "1.5.0",
  "uptime": 86400,
  "channels": {
    "telegram": "connected",
    "slack": "connected",
    "matrix": "disconnected"
  },
  "providers": {
    "anthropic": "ok",
    "deepseek": "ok"
  }
}
# Unhealthy response (503 Service Unavailable)
{
  "status": "degraded",
  "errors": ["anthropic: api key invalid"]
}
Kubernetes probe configuration
yaml
# deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 18789
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 10
- /health: liveness check (is the process running)
- /health/ready: readiness check (have all channels connected successfully)
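Outside Kubernetes, the same endpoint can be polled from a script. The Python sketch below shows one way to interpret the response. It assumes the JSON shapes shown in the health check examples; `fetch_health` and `evaluate_health` are illustrative helper names, not part of OpenClaw.

```python
import json
import urllib.error
import urllib.request

# Response bodies copied from the health check examples.
OK_BODY = {
    "status": "ok",
    "channels": {"telegram": "connected", "slack": "connected", "matrix": "disconnected"},
    "providers": {"anthropic": "ok", "deepseek": "ok"},
}
DEGRADED_BODY = {"status": "degraded", "errors": ["anthropic: api key invalid"]}


def fetch_health(url="http://127.0.0.1:18789/health", timeout=5.0):
    """Return (status_code, parsed_body). urllib raises HTTPError on 503, so catch it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        return err.code, json.loads(err.read().decode("utf-8"))


def evaluate_health(status_code, body):
    """Return (healthy, problems) for one health response."""
    problems = list(body.get("errors", []))
    for name, state in body.get("channels", {}).items():
        if state != "connected":
            problems.append(f"channel {name}: {state}")
    for name, state in body.get("providers", {}).items():
        if state != "ok":
            problems.append(f"provider {name}: {state}")
    healthy = status_code == 200 and body.get("status") == "ok"
    return healthy, problems


if __name__ == "__main__":
    # Uses the canned bodies; call fetch_health() instead to query a live Gateway.
    print(evaluate_health(200, OK_BODY))
    print(evaluate_health(503, DEGRADED_BODY))
```

Note that in the sample 200 response one channel is still reported as `disconnected`, so the sketch returns per-channel problems separately from the overall healthy flag. A connection failure (Gateway not running) surfaces as `urllib.error.URLError` from `fetch_health` and is left to the caller.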
Prometheus metrics
json
{
  "metrics": {
    "enabled": true,
    "path": "/metrics",
    "port": 9090
  }
}
Key metrics:
# Request count (by channel/agent/status)
openclaw_requests_total{channel="telegram",agent="default",status="success"} 1234
# Response latency (P50/P95/P99)
openclaw_latency_seconds{quantile="0.95"} 2.3
# Token usage
openclaw_tokens_total{provider="anthropic",type="input"} 234567
# Active sessions
openclaw_sessions_active 42
# Error count
openclaw_errors_total{type="rate_limit"} 5
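For quick checks without a Prometheus server, these lines can be read directly. The Python sketch below parses the sample lines above. It is a deliberately simplified reader, not a full parser for the Prometheus text format: it ignores timestamps and escaped quotes in label values. `parse_metrics` and `total` are illustrative helpers.

```python
import re

# Sample lines copied from the key metrics listing.
SAMPLE = """\
# Request count
openclaw_requests_total{channel="telegram",agent="default",status="success"} 1234
openclaw_latency_seconds{quantile="0.95"} 2.3
openclaw_tokens_total{provider="anthropic",type="input"} 234567
openclaw_sessions_active 42
openclaw_errors_total{type="rate_limit"} 5
"""

LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples; comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum the values of every sample with the given metric name."""
    return sum(value for metric, _labels, value in samples if metric == name)


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    errors = total(parsed, "openclaw_errors_total")
    requests = total(parsed, "openclaw_requests_total")
    print(f"errors per request: {errors / requests:.4f}")
```

In practice the text would come from the configured `/metrics` path rather than the `SAMPLE` constant.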
Prometheus configuration:
yaml
# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 30s
Grafana Dashboard
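After configuring the Prometheus data source, import the official OpenClaw dashboard:
Grafana → + → Import → enter the Dashboard ID (see the official website)
The dashboard includes these panels:
- Real-time request volume (by channel)
- Token consumption trend (daily/weekly/monthly)
- P95 response latency trend
- Error-rate alert graph
- Model usage distribution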
Alerting configuration
Prometheus Alertmanager
yaml
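# alerts.yml
groups:
  - name: openclaw
    rules:
      - alert: HighErrorRate
        expr: rate(openclaw_errors_total[5m]) > 0.1
        annotations:
          summary: "OpenClaw error rate is too high"
      # Compares the raw counter, so once an auth error has been counted
      # the alert keeps firing until the counter resets.
      - alert: APIKeyExpired
        expr: openclaw_errors_total{type="auth_error"} > 0
        annotations:
          summary: "API key authentication failed; check the key"
Simple alerts (built into OpenClaw)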
json
{
  "alerts": {
    "errorRate": {
      "threshold": 10,
      "window": "5m",
      "channel": "telegram"
    },
    "providerDown": {
      "channel": "telegram",
      "message": "AI provider connection failed; automatically switched to the backup model"
    }
  }
}
Source: OpenClaw official documentation - docs.openclaw.ai/gateway/logging