Server Cluster Monitoring
The monitoring module provides a comprehensive observability stack for the server cluster using Prometheus (metrics), Loki (logs), and Grafana (visualization). All components are configured as reusable NixOS modules with automatic cross-host discovery.
Overview
The system consists of three layers:
- Exporters (run on all servers)
  - node_exporter for system-level metrics (CPU, memory, disk, network)
  - Promtail for shipping journald logs to Loki
  - Application-specific exporters (Caddy, PostgreSQL, Redis) enabled automatically
- Collectors (run on the monitoring primary host)
  - Prometheus for metrics aggregation with 90-day retention
  - Loki for log aggregation with 90-day retention
  - Alertmanager for alert routing and notifications
- Visualization (runs on the monitoring primary host)
  - Grafana with provisioned datasources and dashboards
  - Native Kanidm OAuth2 authentication
Architecture
┌───────────────────────────────────────────────────────────┐
│                nixmon (Monitoring Primary)                │
│  ┌────────────┐  ┌────────┐  ┌─────────┐  ┌──────────┐    │
│  │ Prometheus │  │  Loki  │  │ Grafana │  │ Alertmgr │    │
│  │   :9090    │  │ :3100  │  │  :3000  │  │  :9093   │    │
│  └───┬────┬───┘  └───┬────┘  └─────────┘  └────┬─────┘    │
│      │    │ scrape   │ push                    │ webhooks │
├──────┼────┼──────────┼─────────────────────────┼──────────┤
│      ▼    ▼          ▼                         ▼          │
│ All servers:                   Home Assistant / Nextcloud │
│ - node_exporter :9100                                     │
│ - promtail → Loki                                         │
│ - caddy metrics :2019 (if proxy configured)               │
│ - postgres_exporter :9187 (if postgres configured)        │
│ - redis_exporter :9121 (if redis configured)              │
│ - pve_exporter :9221 (nixmon only, Proxmox API)           │
└───────────────────────────────────────────────────────────┘
Configuration
Enabling Monitoring
Monitoring is enabled by default on all servers (server.monitoring.enable = true).
The monitoring primary host is configured via the allocations.server.monitoringPrimaryHost
option, currently set to nixmon.
Options Reference
All options live under server.monitoring:
| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | true | Enable monitoring for this server |
| retention.metrics | string | "90d" | Prometheus TSDB retention period |
| retention.logs | string | "90d" | Loki log retention period |
| exporters.node.enable | bool | true | Enable node_exporter |
| exporters.caddy.enable | bool | auto | Enable Caddy metrics (auto if proxy configured) |
| exporters.postgres.enable | bool | auto | Enable PostgreSQL exporter (auto on IO host) |
| exporters.redis.enable | bool | auto | Enable Redis exporter (auto on IO host) |
| logs.enable | bool | true | Enable Promtail log shipping |
| collector.enable | bool | auto | Enable collectors (auto on monitoring host) |
| collector.grafana.kanidm.enable | bool | true | Enable Kanidm OAuth2 for Grafana |
| collector.alerting.enable | bool | true | Enable Alertmanager |
| collector.alerting.homeAssistant.enable | bool | false | Enable Home Assistant webhook alerting |
| collector.alerting.nextcloudTalk.enable | bool | false | Enable Nextcloud Talk webhook alerting |
| collector.proxmox.enable | bool | true | Enable Proxmox VE metrics collection |
Auto-Detection
The module automatically detects and enables exporters based on host role:
- Caddy exporter: Enabled when server.proxy.virtualHosts is non-empty
- PostgreSQL exporter: Enabled on the IO primary host when postgres databases are configured
- Redis exporter: Enabled on the IO primary host when redis instances are configured
- Collector services: Enabled only on the monitoring primary host
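The rules above can be modelled as a small pure function. This is only an illustrative sketch in Python, not the module's Nix implementation: the Host class, its field names, and the auto_enabled function are hypothetical stand-ins for the host's configuration, and the host names passed in are whatever the allocations define.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for the parts of a host's config the rules look at."""
    name: str
    virtual_hosts: dict = field(default_factory=dict)       # models server.proxy.virtualHosts
    postgres_databases: list = field(default_factory=list)  # assumed shape, for illustration
    redis_instances: list = field(default_factory=list)     # assumed shape, for illustration


def auto_enabled(host: Host, io_primary: str, monitoring_primary: str) -> dict:
    """Return the defaults the auto-detection rules would choose for this host."""
    on_io_primary = host.name == io_primary
    return {
        "caddy": bool(host.virtual_hosts),
        "postgres": on_io_primary and bool(host.postgres_databases),
        "redis": on_io_primary and bool(host.redis_instances),
        "collector": host.name == monitoring_primary,
    }
```

Because these are defaults, an explicit exporters.*.enable or collector.enable setting on a host would presumably still take precedence over the detected value.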
Secrets
The monitoring module requires the following secrets in hosts/server/nixmon/secrets.yaml:
MONITORING:
  GRAFANA:
    SECRET_KEY: <random-secret-key>
    OAUTH_SECRET: <kanidm-oauth2-secret>
  HOME_ASSISTANT:
    WEBHOOK_URL: <ha-webhook-url>
  NEXTCLOUD_TALK:
    WEBHOOK_URL: <nc-talk-webhook-url>
  PROXMOX:
    USER: <proxmox-user-at-realm>
    TOKEN_ID: <proxmox-token-name>
    TOKEN_SECRET: <proxmox-token-secret>
Generating Secrets
Generate the Grafana secret key:
cat /dev/urandom | tr -dc 'A-Za-z0-9' | head -c 48
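An equivalent key can also be produced with Python's standard library, which may be convenient when scripting secret rotation. A minimal sketch, assuming the same 48-character alphanumeric format as the command above; the function name generate_secret_key is illustrative only.

```python
import secrets
import string

# Same character set as tr -dc 'A-Za-z0-9'
ALPHABET = string.ascii_letters + string.digits


def generate_secret_key(length: int = 48) -> str:
    """Build a random alphanumeric string from the OS's cryptographic RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Calling print(generate_secret_key()) emits a value that can be pasted into the SECRET_KEY entry.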
The MONITORING/GRAFANA/OAUTH_SECRET must match the value in hosts/server/nixcloud/secrets.yaml
under KANIDM/OAUTH2/GRAFANA_SECRET (the Kanidm provisioning side).
Caddy Virtual Hosts
The module configures three virtual hosts on nixmon:
| Service | Subdomain | Access |
|---|---|---|
| Grafana | grafana.<domain> | Public |
| Prometheus | prometheus.<domain> | LAN |
| Loki | loki.<domain> | LAN |
These are defined in hosts/server/nixmon/default.nix and collected by the IO
primary host’s Caddy configuration.
Alert Rules
The following alerts are configured by default:
| Alert | Condition | Severity |
|---|---|---|
| HostDown | up{job="node"} == 0 for 2 minutes | Critical |
| DiskSpaceCritical | Root filesystem < 10% free for 5 minutes | Critical |
| HighCPUUsage | CPU usage > 90% for 5 minutes | Warning |
| HighMemoryUsage | Memory usage > 90% for 5 minutes | Warning |
| ServiceDown | up{job!="node"} == 0 for 2 minutes | Critical |
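Each condition in the table carries a hold duration ("for 2 minutes", "for 5 minutes"): in Prometheus an alert only fires once its expression has stayed true for that long, so a single failed scrape does not page anyone. The sketch below approximates that behaviour over a list of samples. It is a simplified model, not Prometheus's evaluation loop, and the is_firing helper is hypothetical.

```python
from datetime import datetime, timedelta


def is_firing(samples: list[tuple[datetime, bool]], hold: timedelta) -> bool:
    """samples: (timestamp, condition_is_true) pairs in chronological order.

    Returns True when the condition has been true without interruption
    for at least `hold`, measured up to the most recent sample.
    """
    if not samples or not samples[-1][1]:
        return False
    true_since = samples[-1][0]
    # Walk backwards until the condition was last false.
    for timestamp, condition in reversed(samples):
        if not condition:
            break
        true_since = timestamp
    return samples[-1][0] - true_since >= hold
```

With 30-second samples, HostDown would need five consecutive failing samples (spanning two minutes) before this model reports it as firing; one successful sample in between resets the clock.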
Alerts are routed to:
- Home Assistant: All critical and warning alerts via webhook (requires collector.alerting.homeAssistant.enable = true)
- Nextcloud Talk: Critical alerts only via webhook (requires collector.alerting.nextcloudTalk.enable = true)
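The routing described above amounts to a small decision table. The following sketch expresses it in Python for clarity. It is not the Alertmanager configuration the module generates, and the receiver names "home-assistant" and "nextcloud-talk" are assumptions made for the example.

```python
def receivers_for(severity: str, home_assistant: bool, nextcloud_talk: bool) -> list[str]:
    """Return the receivers an alert of the given severity would be sent to.

    home_assistant / nextcloud_talk mirror the two
    collector.alerting.*.enable options.
    """
    targets = []
    if home_assistant and severity in {"critical", "warning"}:
        targets.append("home-assistant")
    if nextcloud_talk and severity == "critical":
        targets.append("nextcloud-talk")
    return targets
```

Note that both webhook options default to false, so with an unmodified configuration this function returns an empty list for every severity.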
Module Structure
modules/nixos/server/monitoring/
├── default.nix # Entry point, imports sub-modules
├── options.nix # All server.monitoring.* options
├── collector/
│ ├── default.nix # Imports collector sub-modules
│ ├── prometheus.nix # Prometheus server + scrape targets
│ ├── loki.nix # Loki server + storage config
│ ├── grafana.nix # Grafana + Kanidm OAuth2
│ ├── alerting.nix # Alertmanager + alert rules
│ └── dashboards.nix # Dashboard provisioning
├── exporters/
│ ├── default.nix # Imports exporter sub-modules
│ ├── node.nix # node_exporter
│ ├── caddy.nix # Caddy metrics
│ ├── postgres.nix # PostgreSQL exporter
│ └── redis.nix # Redis exporter
├── logs/
│ └── promtail.nix # Promtail log shipping
└── integrations/
└── proxmox.nix # PVE exporter for Proxmox API
Troubleshooting
Checking Service Status
On the monitoring host (nixmon):
systemctl status prometheus.service
systemctl status loki.service
systemctl status grafana.service
systemctl status prometheus-alertmanager.service
systemctl status prometheus-pve-exporter.service
On any server:
systemctl status prometheus-node-exporter.service
systemctl status promtail.service
Verifying Metrics Collection
Check Prometheus targets are up:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'
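When many targets are configured it can help to list only the unhealthy ones together with their last scrape error. A minimal sketch using the same /api/v1/targets endpoint; the function names are illustrative, and the base URL assumes the script runs on the monitoring host.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list[dict]:
    """Return job, instance, and last error for every active target that is not 'up'."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            rows.append({
                "job": labels.get("job"),
                "instance": labels.get("instance"),
                "error": target.get("lastError", ""),
            })
    return rows


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets document from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

On nixmon, unhealthy_targets(fetch_targets()) returns an empty list when every scrape target is healthy.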
Verifying Log Collection
Check Promtail is shipping logs:
journalctl -u promtail.service -f
Query Loki directly:
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq
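The labels endpoint returns a JSON document with a status field and a data list of label names. The sketch below checks such a response for labels you expect Promtail to attach. Which labels those are depends on the Promtail configuration, so the expected set is left to the caller, and missing_labels is a hypothetical helper.

```python
def missing_labels(payload: dict, expected: set[str]) -> set[str]:
    """Return the expected label names that Loki does not currently know about."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Loki response status: {payload.get('status')!r}")
    # "data" may be absent or null when no streams have been ingested yet.
    return expected - set(payload.get("data") or [])
```

An empty result means every expected label has been seen; a non-empty one usually points at Promtail not shipping logs or using different relabeling than assumed.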
Common Issues
Grafana OAuth login fails:
- Verify MONITORING/GRAFANA/OAUTH_SECRET in nixmon matches KANIDM/OAUTH2/GRAFANA_SECRET in nixcloud
- Check Kanidm provisioning has the grafana OAuth2 client configured
- Verify DNS resolves auth.<domain> correctly
Prometheus targets showing as down:
- Check firewall rules allow traffic on exporter ports from the monitoring host
- Verify the exporter service is running on the target host
- Check network connectivity between nixmon and the target host
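The firewall and connectivity checks above can be scripted as a plain TCP connection test run from nixmon. A minimal sketch: the port numbers come from the architecture diagram, while port_open and the target host name passed to it are placeholders for illustration.

```python
import socket

# Exporter ports as listed in the architecture diagram.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "caddy": 2019,
    "postgres_exporter": 9187,
    "redis_exporter": 9121,
}


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_open("<target-host>", EXPORTER_PORTS["node_exporter"]) returning False while the exporter service is running on the target suggests a firewall or routing problem rather than a failed service.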
Proxmox metrics missing:
- Verify proxmox/token_id and proxmox/token_secret are valid
- Check PVE API is accessible from nixmon: curl -k https://pve.<domain>/api2/json
- Review PVE exporter logs: journalctl -u prometheus-pve-exporter.service