Monitoring

The monitoring stack provides metrics collection, visualization, a service dashboard, and external uptime monitoring. Prometheus, Grafana, and Homepage run as Docker containers on a dedicated VM in each environment; Uptime Kuma runs on an external VPS.

Deployment order

Monitoring deploys after Networking, Step-CA, and NTP. It is the last infrastructure service to deploy before applications.

Architecture

graph TD
    subgraph Monitoring VM
        Prometheus[Prometheus]
        Grafana[Grafana]
        Homepage[Homepage]
    end

    subgraph Targets
        NE1[Node Exporter\nmonitoring VM]
        NE2[Node Exporter\nmedia-stack]
        NE3[Node Exporter\nnetworking]
        NE4[Node Exporter\nwebsite]
    end

    subgraph External VPS
        UptimeKuma[Uptime Kuma]
    end

    NE1 & NE2 & NE3 & NE4 -->|metrics| Prometheus
    Prometheus -->|datasource| Grafana
    UptimeKuma -->|health checks| Prometheus

Prometheus scrapes node exporters across all VMs for system metrics
Grafana visualizes metrics from Prometheus with dashboards
Homepage provides a service dashboard with status widgets and quick links
Uptime Kuma runs on an external VPS for independent uptime monitoring and reaches internal services over Tailscale

Components

Component	Image	Port	Purpose
Prometheus	`prom/prometheus`	9090	Metrics collection and storage
Grafana	`grafana/grafana`	3000	Metrics visualization
Homepage	`ghcr.io/gethomepage/homepage`	3002	Service dashboard
Uptime Kuma	`louislam/uptime-kuma`	3001	External uptime monitoring

Hosts

Environment	VM	IP
WIL	Monitoring	`10.2.20.30`
LDN	Monitoring	`10.3.20.30`
External	VPS	`178.156.190.134`

File Locations

Monitoring Stack

File	Purpose
`playbooks/infrastructure/monitoring/deploy.yml`	Main playbook
`playbooks/infrastructure/monitoring/tasks/monitoring-stack.yml`	Deployment task
`playbooks/infrastructure/monitoring/templates/compose.yaml.j2`	Docker Compose definition
`playbooks/infrastructure/monitoring/templates/prometheus.yml.j2`	Prometheus scrape config
`playbooks/infrastructure/monitoring/templates/grafana-datasources.yml.j2`	Grafana datasource provisioning
`playbooks/infrastructure/monitoring/templates/services.yaml.j2`	Homepage services dashboard
`playbooks/infrastructure/monitoring/templates/bookmarks.yaml.j2`	Homepage bookmarks
`playbooks/infrastructure/monitoring/templates/settings.yaml.j2`	Homepage theme and layout
`playbooks/infrastructure/monitoring/templates/widgets.yaml.j2`	Homepage widgets
`playbooks/infrastructure/monitoring/handlers/main.yml`	Container lifecycle handlers
`environments/<env>/group_vars/infra_monitoring/`	Per-environment variables

External Monitoring

File	Purpose
`playbooks/infrastructure/external-monitoring/deploy.yml`	Main playbook
`playbooks/infrastructure/external-monitoring/templates/compose.yaml.j2`	Docker Compose definition
`environments/external/group_vars/infra_externalmonitoring/vars.yml`	External monitoring variables

Deployment

# Deploy internal monitoring stack
task ansible:deploy-monitoring ENV=wil

# Deploy external uptime monitoring
task ansible:deploy-external-monitoring ENV=external

Monitoring Stack Deployment

The task file:

Creates directory structure under /opt/monitoring/
Sets ownership to monitoring_uid:monitoring_gid
Deploys Prometheus configuration
Deploys Grafana datasource provisioning
Deploys Homepage configuration files (services, bookmarks, settings, widgets)
Deploys Docker Compose file
Starts all containers

External Monitoring Deployment

The external monitoring playbook:

Runs the common role (timezone, apt cache)
Installs and configures Tailscale as a client (to reach internal services)
Deploys Uptime Kuma via the docker_service role

Prometheus

Prometheus scrapes metrics from node exporters running on infrastructure and application VMs.

Scrape Configuration

The prometheus.yml.j2 template generates the scrape config:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'
          - '10.2.0.5:9100'
          # ... all node targets
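
The mapping from a target list to this scrape config can be sketched in Python. This is an illustration rather than the template itself: `build_scrape_config` is a hypothetical helper, and it assumes the template places every entry of `prometheus_node_targets` under the `node-exporter` job.

```python
import json


def build_scrape_config(node_targets, interval="15s"):
    """Build the scrape config structure for a list of host:port targets.

    Hypothetical helper: mirrors the rendered output shown above, not the
    Jinja2 template's actual logic.
    """
    return {
        "global": {
            "scrape_interval": interval,
            "evaluation_interval": interval,
        },
        "scrape_configs": [
            {
                # Prometheus scraping its own metrics endpoint.
                "job_name": "prometheus",
                "static_configs": [{"targets": ["localhost:9090"]}],
            },
            {
                # One job holding every node exporter endpoint.
                "job_name": "node-exporter",
                "static_configs": [{"targets": list(node_targets)}],
            },
        ],
    }


if __name__ == "__main__":
    # Example targets only; the real list lives in prometheus_node_targets.
    config = build_scrape_config(["localhost:9100", "10.2.0.5:9100"])
    print(json.dumps(config, indent=2))
```

The output is printed as JSON only to keep the sketch free of third-party dependencies; the deployed file is YAML.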

Container Configuration

Memory: 2GB limit, 1GB reservation
Retention: configurable via prometheus_retention
Storage: persistent volume at /prometheus
User: runs as monitoring_uid:monitoring_gid

Source: ansible/playbooks/infrastructure/monitoring/templates/prometheus.yml.j2

Configuration Reference

Parameter	Type	Description	Default
`prometheus_image`	`string`	Docker image for Prometheus	(per-env)
`prometheus_retention`	`string`	Metrics data retention period	`"30d"`
`prometheus_node_targets`	`list[string]`	Node exporter endpoints to scrape (`host:port`)	(per-env)
`node_exporter_image`	`string`	Docker image for Node Exporter sidecar	(per-env)

Sources: ansible/environments/<env>/group_vars/infra_monitoring/prometheus.yml · ansible/environments/<env>/group_vars/infra_monitoring/containers.yml
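
Because `prometheus_node_targets` entries must follow the `host:port` form, a small pre-deploy check can catch malformed entries. The sketch below is one possible check; `validate_target` and `invalid_targets` are hypothetical names and are not part of the playbook.

```python
def validate_target(target: str) -> bool:
    """Return True if target looks like host:port with a port in 1..65535."""
    host, sep, port = target.rpartition(":")
    if not sep or not host:
        # No colon at all, or nothing before the colon.
        return False
    if not (port.isascii() and port.isdigit()):
        return False
    return 1 <= int(port) <= 65535


def invalid_targets(targets):
    """Return the entries that fail validate_target, in their original order."""
    return [t for t in targets if not validate_target(t)]


if __name__ == "__main__":
    # Illustrative list; the second entry is missing its port.
    candidates = ["localhost:9100", "10.2.20.60", "10.2.20.60:9100"]
    print(invalid_targets(candidates))
```

This only checks the shape of each entry. It does not resolve hostnames or confirm that an exporter is listening.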

Grafana

Grafana connects to Prometheus as its default datasource and provides metric visualization dashboards.

Datasource Provisioning

Grafana auto-provisions a Prometheus datasource on startup via grafana-datasources.yml.j2:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Container Configuration

Memory: 2GB limit, 1GB reservation
Depends on: Prometheus
Provisioning: mounted read-only at /etc/grafana/provisioning
Sign-up: disabled (GF_USERS_ALLOW_SIGN_UP=false)
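
After a deploy, Grafana's `/api/health` endpoint offers a quick way to confirm the container is serving requests. The sketch below assumes Grafana is reachable over plain HTTP at a base URL you supply; `grafana_is_healthy` and `fetch_health` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def grafana_is_healthy(payload: dict) -> bool:
    """Return True when the health payload reports a working database."""
    return payload.get("database") == "ok"


def fetch_health(base_url: str) -> dict:
    """Fetch and decode the /api/health response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so an
    unhealthy instance may surface as an exception rather than a payload.
    """
    with urlopen(f"{base_url}/api/health", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written sample shaped like a health response; no network call.
    sample = {"database": "ok"}
    print(grafana_is_healthy(sample))
```

To query a live instance, pass its base URL to `fetch_health`, for example the monitoring VM address with port 3000. The scheme (HTTP or HTTPS) depends on how the environment is fronted and is an assumption here.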

Configuration Reference

Parameter	Type	Description	Default
`grafana_image`	`string`	Docker image for Grafana	(per-env)
`grafana_admin_user`	`string`	Admin username for the web UI	`"admin"`
`grafana_admin_password`	`string`	Admin password (SOPS-encrypted)	(required)
`grafana_url`	`string`	Root URL for links in notifications and dashboards	(per-env)

Sources: ansible/environments/<env>/group_vars/infra_monitoring/grafana.yml · ansible/playbooks/infrastructure/monitoring/templates/grafana-datasources.yml.j2

Homepage

Homepage provides a service dashboard with status widgets, service health indicators, and quick links to all homelab services.

Container Configuration

Memory: 512MB limit, 256MB reservation
Docker socket: mounted read-only for container status widgets
Environment: receives API keys for service widgets (Sonarr, Radarr, Bazarr, Prowlarr, SABnzbd, Plex)

Configuration Reference

Parameter	Type	Description	Default
`homepage_image`	`string`	Docker image for Homepage	(per-env)
`homepage_allowed_hosts`	`string`	Comma-separated hostnames Homepage responds to	(per-env)

Dashboard Configuration

Homepage is configured via four YAML template files:

Template	Purpose
`services.yaml.j2`	Service cards with status widgets (media, monitoring, node exporters)
`bookmarks.yaml.j2`	Quick links (TrueNAS, Proxmox)
`settings.yaml.j2`	Theme (dark, slate), layout (row style, column counts)
`widgets.yaml.j2`	Global widgets (search bar, datetime)

Sources: ansible/environments/<env>/group_vars/infra_monitoring/containers.yml · ansible/playbooks/infrastructure/monitoring/templates/services.yaml.j2

External Monitoring (Uptime Kuma)

Uptime Kuma runs on an external VPS to provide independent uptime monitoring. It connects to the internal network via Tailscale to monitor services that are not publicly exposed.

Configuration Reference

Parameter	Type	Description	Default
`uptime_kuma_image`	`string`	Docker image for Uptime Kuma	(per-env)
`external_monitoring_uptime_kuma_listen`	`string`	Listen port for Uptime Kuma	`"3001"`

Tailscale Integration

The external VPS runs Tailscale as a client with route acceptance enabled:

tailscale_mode: "client"
tailscale_hostname: "external-monitor"
tailscale_accept_routes: true

This allows Uptime Kuma to reach internal services (e.g., 10.2.20.53) through the Tailscale mesh without exposing them publicly. See VPN (Tailscale) for details.
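
Whether a given internal address is covered by the routes the VPS accepts can be reasoned about with a simple CIDR check. In the sketch below the route `10.2.0.0/16` is a hypothetical example; the routes actually advertised are defined in the Tailscale configuration, not here.

```python
import ipaddress


def reachable_via_routes(address: str, routes) -> bool:
    """Return True if address falls inside any of the given CIDR routes."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(route) for route in routes)


if __name__ == "__main__":
    # Hypothetical advertised route; substitute the real subnet routes.
    accepted_routes = ["10.2.0.0/16"]
    print(reachable_via_routes("10.2.20.53", accepted_routes))
```

A check like this confirms only that an address is inside an accepted route. Reachability also depends on Tailscale ACLs and on the subnet router being online.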

Sources: ansible/environments/external/group_vars/infra_externalmonitoring/vars.yml · ansible/playbooks/infrastructure/external-monitoring/templates/compose.yaml.j2

Shared Configuration

These variables apply to all monitoring containers:

Parameter	Type	Description	Default
`monitoring_uid`	`string`	Container file ownership UID	`"1000"`
`monitoring_gid`	`string`	Container file ownership GID	`"1000"`
`backup_targets`	`list[string]`	Service directories under `/opt/monitoring/` to back up	(per-env)

Common Tasks

Add a new Prometheus scrape target

Edit ansible/environments/<env>/group_vars/infra_monitoring/prometheus.yml:

prometheus_node_targets:
  # ... existing targets
  - "10.2.20.60:9100"   # new VM

Deploy:
```
task ansible:deploy-monitoring ENV=wil
```
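
Once the deploy finishes, the Prometheus HTTP API endpoint `/api/v1/targets` reports the health of each active target, which is a convenient way to confirm the new exporter is being scraped. The sketch below is a minimal check under the assumption that Prometheus is reachable over HTTP; `unhealthy_targets` and `fetch_targets` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return scrape URLs of active targets whose health is not 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [t.get("scrapeUrl", "") for t in active if t.get("health") != "up"]


def fetch_targets(base_url: str) -> dict:
    """Fetch and decode the /api/v1/targets response from Prometheus."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written sample shaped like the API response; no network call.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"scrapeUrl": "http://10.2.20.60:9100/metrics", "health": "up"},
            ]
        },
    }
    print(unhealthy_targets(sample))
```

Against a live instance, `unhealthy_targets(fetch_targets(base_url))` returns an empty list when every target is up. A freshly added target can report `unknown` until its first scrape completes.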

Change metrics retention

Edit ansible/environments/<env>/group_vars/infra_monitoring/prometheus.yml:
```
prometheus_retention: "90d"
```
Deploy:
```
task ansible:deploy-monitoring ENV=wil
```
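
Longer retention increases disk usage roughly linearly. The Prometheus documentation gives the rule of thumb `retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with about 1 to 2 bytes per sample. The sketch below applies that formula; the ingestion rate of 1000 samples per second is an assumed figure for illustration, not a measurement from this stack, and the helper names are hypothetical.

```python
import re

# Seconds per unit for Prometheus-style durations (w = 7 days, y = 365 days).
_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 604800, "y": 31536000,
}
_PART_RE = re.compile(r"(\d+)(ms|[smhdwy])")


def retention_seconds(value: str) -> float:
    """Convert a duration such as '30d' or '1h30m' to seconds."""
    pos = 0
    total = 0.0
    for match in _PART_RE.finditer(value):
        if match.start() != pos:
            # Unparseable text between two duration parts.
            raise ValueError(f"invalid duration: {value!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(value):
        raise ValueError(f"invalid duration: {value!r}")
    return total


def estimate_disk_bytes(retention: str, samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Estimate TSDB disk usage with the Prometheus rule-of-thumb formula."""
    return retention_seconds(retention) * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed ingestion rate; check the real rate before sizing the volume.
    for retention in ("30d", "90d"):
        gib = estimate_disk_bytes(retention, samples_per_second=1000) / 2**30
        print(f"{retention}: about {gib:.1f} GiB")
```

Treat the result as an order-of-magnitude estimate when sizing the persistent volume at /prometheus.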

Add a service to the Homepage dashboard

Edit ansible/playbooks/infrastructure/monitoring/templates/services.yaml.j2
Add a new service entry under the appropriate section
If the service requires an API key, add the key to secrets.sops.yml and pass it to the Homepage container as an environment variable in compose.yaml.j2
Deploy:
```
task ansible:deploy-monitoring ENV=wil
```