Server Cluster Monitoring

The monitoring module provides a comprehensive observability stack for the server cluster using Prometheus (metrics), Loki (logs), and Grafana (visualization). All components are configured as reusable NixOS modules with automatic cross-host discovery.

Overview

The system consists of three layers:

  1. Exporters (run on all servers)

    • node_exporter for system-level metrics (CPU, memory, disk, network)
    • Promtail for shipping journald logs to Loki
    • Application-specific exporters (Caddy, PostgreSQL, Redis) enabled automatically
  2. Collectors (run on the monitoring primary host)

    • Prometheus for metrics aggregation with 90-day retention
    • Loki for log aggregation with 90-day retention
    • Alertmanager for alert routing and notifications
  3. Visualization (runs on the monitoring primary host)

    • Grafana with provisioned datasources and dashboards
    • Native Kanidm OAuth2 authentication

Architecture

┌─────────────────────────────────────────────────────┐
│             nixmon (Monitoring Primary)             │
│  ┌───────────┐  ┌──────┐  ┌─────────┐  ┌──────────┐ │
│  │Prometheus │  │ Loki │  │ Grafana │  │Alertmgr  │ │
│  │  :9090    │  │:3100 │  │  :3000  │  │  :9093   │ │
│  └────┬──┬───┘  └─┬────┘  └─────────┘  └────┬─────┘ │
│       │  │        │                         │       │
│  ┌────┘  │   ┌────┘        ┌────────────────┘       │
│  │ scrape│   │ push        │ webhooks               │
├──┼───────┼───┼─────────────┼────────────────────────┤
│  ▼       ▼   ▼             ▼                        │
│  All servers:          Home Assistant / Nextcloud   │
│  - node_exporter :9100                              │
│  - promtail → Loki                                  │
│  - caddy metrics :2019 (if proxy configured)        │
│  - postgres_exporter :9187 (if postgres configured) │
│  - redis_exporter :9121 (if redis configured)       │
│  - pve_exporter :9221 (nixmon only, Proxmox API)    │
└─────────────────────────────────────────────────────┘

Configuration

Enabling Monitoring

Monitoring is enabled by default on all servers (server.monitoring.enable = true). The monitoring primary host is configured via the allocations.server.monitoringPrimaryHost option, currently set to nixmon.

Options Reference

All options live under server.monitoring:

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | true | Enable monitoring for this server |
| retention.metrics | string | "90d" | Prometheus TSDB retention period |
| retention.logs | string | "90d" | Loki log retention period |
| exporters.node.enable | bool | true | Enable node_exporter |
| exporters.caddy.enable | bool | auto | Enable Caddy metrics (auto if proxy configured) |
| exporters.postgres.enable | bool | auto | Enable PostgreSQL exporter (auto on IO host) |
| exporters.redis.enable | bool | auto | Enable Redis exporter (auto on IO host) |
| logs.enable | bool | true | Enable Promtail log shipping |
| collector.enable | bool | auto | Enable collectors (auto on monitoring host) |
| collector.grafana.kanidm.enable | bool | true | Enable Kanidm OAuth2 for Grafana |
| collector.alerting.enable | bool | true | Enable Alertmanager |
| collector.alerting.homeAssistant.enable | bool | false | Enable Home Assistant webhook alerting |
| collector.alerting.nextcloudTalk.enable | bool | false | Enable Nextcloud Talk webhook alerting |
| collector.proxmox.enable | bool | true | Enable Proxmox VE metrics collection |
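
As an illustration of how these options combine, here is a minimal sketch of a per-host override. The option names come from the table above; the chosen values are arbitrary examples. It assumes the module sets its "auto" values at a lower priority (for example with lib.mkDefault), so that a plain assignment wins. If that assumption does not hold, lib.mkForce may be needed instead.

```nix
# Sketch of a host-level override, e.g. in a host's default.nix.
# Values are examples only; option names are taken from the reference table.
{
  server.monitoring = {
    # Retention applies to Prometheus and Loki, which run on the
    # monitoring primary host, so these matter only there.
    retention = {
      metrics = "30d";
      logs = "14d";
    };

    # Turn an exporter off even where auto-detection would enable it
    # (assumes the auto value is set with a lower priority).
    exporters.redis.enable = false;

    # Stop shipping journald logs from this host via Promtail.
    logs.enable = false;
  };
}
```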

Auto-Detection

The module automatically detects and enables exporters based on host role:

  • Caddy exporter: Enabled when server.proxy.virtualHosts is non-empty
  • PostgreSQL exporter: Enabled on the IO primary host when postgres databases are configured
  • Redis exporter: Enabled on the IO primary host when redis instances are configured
  • Collector services: Enabled only on the monitoring primary host
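
The detection rules above could be expressed roughly as follows. This is a hypothetical sketch, not the module's actual code: it assumes server.proxy.virtualHosts is an attribute set, that the primary host name is readable as config.allocations.server.monitoringPrimaryHost, and that the host name comparison uses networking.hostName.

```nix
# Hypothetical sketch of role-based defaults; the real module may differ.
{ config, lib, ... }:
let
  # Assumption: the allocation is exposed through `config` and holds a
  # plain host name such as "nixmon".
  isMonitoringPrimary =
    config.networking.hostName
    == config.allocations.server.monitoringPrimaryHost;
in
{
  server.monitoring = {
    # Enabled when at least one proxy virtual host is defined
    # (assumes virtualHosts is an attribute set).
    exporters.caddy.enable =
      lib.mkDefault (config.server.proxy.virtualHosts != { });

    # Collectors run only on the monitoring primary host.
    collector.enable = lib.mkDefault isMonitoringPrimary;
  };
}
```

Using lib.mkDefault keeps the detected value overridable: an explicit assignment in a host's configuration takes precedence over it.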

Secrets

The monitoring module requires the following secrets in hosts/server/nixmon/secrets.yaml:

MONITORING:
    GRAFANA:
        SECRET_KEY: <random-secret-key>
        OAUTH_SECRET: <kanidm-oauth2-secret>
    HOME_ASSISTANT:
        WEBHOOK_URL: <ha-webhook-url>
    NEXTCLOUD_TALK:
        WEBHOOK_URL: <nc-talk-webhook-url>
PROXMOX:
    USER: <proxmox-user-at-realm>
    TOKEN_ID: <proxmox-token-name>
    TOKEN_SECRET: <proxmox-token-secret>

Generating Secrets

Generate the Grafana secret key:

LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 48; echo

The MONITORING/GRAFANA/OAUTH_SECRET must match the value in hosts/server/nixcloud/secrets.yaml under KANIDM/OAUTH2/GRAFANA_SECRET (the Kanidm provisioning side).

Caddy Virtual Hosts

The module configures three virtual hosts on nixmon:

| Service | Subdomain | Access |
|---|---|---|
| Grafana | grafana.<domain> | Public |
| Prometheus | prometheus.<domain> | LAN |
| Loki | loki.<domain> | LAN |

These are defined in hosts/server/nixmon/default.nix and collected by the IO primary host’s Caddy configuration.

Alert Rules

The following alerts are configured by default:

| Alert | Condition | Severity |
|---|---|---|
| HostDown | up{job="node"} == 0 for 2 minutes | Critical |
| DiskSpaceCritical | Root filesystem < 10% free for 5 minutes | Critical |
| HighCPUUsage | CPU usage > 90% for 5 minutes | Warning |
| HighMemoryUsage | Memory usage > 90% for 5 minutes | Warning |
| ServiceDown | up{job!="node"} == 0 for 2 minutes | Critical |
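
For reference, the two availability alerts in the table map onto Prometheus rules roughly as shown below. This is a sketch under assumptions, not the contents of alerting.nix: it uses the upstream NixOS option services.prometheus.rules (a list of rule-file strings) and relies on JSON being valid YAML. The group name, the lowercase severity label values, and the summary annotations are illustrative choices.

```nix
# Sketch: HostDown and ServiceDown as a Prometheus rule group.
# Group name, label values, and annotations are assumptions.
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "availability";
          rules = [
            {
              alert = "HostDown";
              # node_exporter scrape has failed for 2 minutes.
              expr = ''up{job="node"} == 0'';
              for = "2m";
              labels.severity = "critical";
              annotations.summary = "{{ $labels.instance }} is unreachable";
            }
            {
              alert = "ServiceDown";
              # Any non-node scrape target has failed for 2 minutes.
              expr = ''up{job!="node"} == 0'';
              for = "2m";
              labels.severity = "critical";
              annotations.summary = "{{ $labels.job }} on {{ $labels.instance }} is down";
            }
          ];
        }
      ];
    })
  ];
}
```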

Alerts are routed to:

  • Home Assistant: All critical and warning alerts via webhook (requires collector.alerting.homeAssistant.enable = true)
  • Nextcloud Talk: Critical alerts only via webhook (requires collector.alerting.nextcloudTalk.enable = true)
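
Both webhook routes are disabled by default, so a configuration along these lines is needed on the monitoring primary host to receive notifications. The sketch uses only options from the reference table and assumes the matching WEBHOOK_URL secrets are present in the host's secrets.yaml.

```nix
# Sketch: enable both alert routes on the monitoring primary host.
# Requires MONITORING/HOME_ASSISTANT/WEBHOOK_URL and
# MONITORING/NEXTCLOUD_TALK/WEBHOOK_URL to be set in secrets.yaml.
{
  server.monitoring.collector.alerting = {
    enable = true; # already the default, shown for clarity
    homeAssistant.enable = true; # critical and warning alerts
    nextcloudTalk.enable = true; # critical alerts only
  };
}
```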

Module Structure

modules/nixos/server/monitoring/
├── default.nix              # Entry point, imports sub-modules
├── options.nix              # All server.monitoring.* options
├── collector/
│   ├── default.nix          # Imports collector sub-modules
│   ├── prometheus.nix       # Prometheus server + scrape targets
│   ├── loki.nix             # Loki server + storage config
│   ├── grafana.nix          # Grafana + Kanidm OAuth2
│   ├── alerting.nix         # Alertmanager + alert rules
│   └── dashboards.nix       # Dashboard provisioning
├── exporters/
│   ├── default.nix          # Imports exporter sub-modules
│   ├── node.nix             # node_exporter
│   ├── caddy.nix            # Caddy metrics
│   ├── postgres.nix         # PostgreSQL exporter
│   └── redis.nix            # Redis exporter
├── logs/
│   └── promtail.nix         # Promtail log shipping
└── integrations/
    └── proxmox.nix          # PVE exporter for Proxmox API

Troubleshooting

Checking Service Status

On the monitoring host (nixmon):

systemctl status prometheus.service
systemctl status loki.service
systemctl status grafana.service
systemctl status prometheus-alertmanager.service
systemctl status prometheus-pve-exporter.service

On any server:

systemctl status prometheus-node-exporter.service
systemctl status promtail.service

Verifying Metrics Collection

Check that all Prometheus targets are up:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'

Verifying Log Collection

Check that Promtail is shipping logs by following its journal:

journalctl -u promtail.service -f

Query Loki directly to confirm it has indexed labels:

curl -s 'http://localhost:3100/loki/api/v1/labels' | jq .

Common Issues

Grafana OAuth login fails:

  • Verify MONITORING/GRAFANA/OAUTH_SECRET in nixmon matches KANIDM/OAUTH2/GRAFANA_SECRET in nixcloud
  • Check Kanidm provisioning has the grafana OAuth2 client configured
  • Verify DNS resolves auth.<domain> correctly

Prometheus targets showing as down:

  • Check firewall rules allow traffic on exporter ports from the monitoring host
  • Verify the exporter service is running on the target host
  • Check network connectivity between nixmon and the target host

Proxmox metrics missing:

  • Verify PROXMOX/USER, PROXMOX/TOKEN_ID, and PROXMOX/TOKEN_SECRET in hosts/server/nixmon/secrets.yaml are valid
  • Check PVE API is accessible from nixmon: curl -k https://pve.<domain>/api2/json
  • Review PVE exporter logs: journalctl -u prometheus-pve-exporter.service