# Monitoring
The homelab uses kube-prometheus-stack for cluster monitoring: Prometheus for metrics collection and alerting, Grafana for dashboards and visualization, plus node-exporter and kube-state-metrics for comprehensive cluster observability.
## Architecture

```mermaid
flowchart TD
    subgraph tailnet["Tailscale Network"]
        Clients["Devices on tailnet"]
        TServe["tailscale serve\nHTTPS :8444 → localhost:30090"]
    end
    subgraph orb["OrbStack Kubernetes"]
        subgraph monNs["monitoring namespace"]
            Grafana["Grafana\nNodePort :30090\nDashboards + Visualization"]
            Prom["Prometheus\n15d retention, 10Gi storage\nMetrics + Alerting"]
            AM["Alertmanager\n2Gi storage"]
            NE["node-exporter\nHost metrics"]
            KSM["kube-state-metrics\nK8s object metrics"]
            PromOp["Prometheus Operator\nManages CRDs"]
        end
        subgraph targets["Scrape Targets"]
            Kubelet["kubelet"]
            CoreDNS["CoreDNS"]
            APIServer["API Server"]
        end
    end
    Clients -- "WireGuard" --> TServe
    TServe --> Grafana
    Grafana --> Prom
    Prom --> NE & KSM & Kubelet & CoreDNS & APIServer
    Prom --> AM
```
## Access

| Interface | URL | Auth |
|---|---|---|
| Grafana | https://holdens-mac-mini.story-larch.ts.net:8444 | SSO via Authentik (auto-redirects) |
| Grafana (local) | http://localhost:30090 | SSO via Authentik |

Grafana is configured with SSO-only access — the local login form is disabled and users are auto-redirected to Authentik. The admin password in `grafana-secret` is retained for API and break-glass access.
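A minimal sketch of the Helm values behind this behavior, using Grafana's standard `auth` settings (the authoritative values live in `monitoring-app.yaml`; the specifics here are illustrative, not copied from it):

```yaml
grafana:
  grafana.ini:
    auth:
      disable_login_form: true    # hide the local username/password form
    auth.generic_oauth:
      enabled: true
      auto_login: true            # redirect straight to Authentik
      name: Authentik
      scopes: openid email profile
      # auth_url / token_url / api_url point at the Authentik OIDC provider;
      # the client secret is injected from grafana-secret.
```

The admin account still works through the API even with the login form disabled, which is what makes break-glass access possible.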
## Directory Contents

| File | Purpose |
|---|---|
| `kustomization.yaml` | Lists resources for Kustomize/ArgoCD rendering |
| `external-secret.yaml` | ExternalSecret that pulls the Grafana admin password and OAuth secret from Infisical → `grafana-secret` |
Note: The monitoring stack is deployed via the Helm chart source defined in `k8s/apps/argocd/applications/monitoring-app.yaml`. This directory only contains the ExternalSecret that provides credentials to the Helm release. The `monitoring-config` ArgoCD Application syncs this directory, while the `monitoring` Application syncs the upstream Helm chart.
## Security
The monitoring stack components are configured to run as non-root users:
- Grafana: runs as UID 472 (fsGroup 472).
- Prometheus: runs as UID 65534 (nobody) with fsGroup 65534.
- Alertmanager: runs as UID 1000 and GID 2000 with fsGroup 2000.
- node-exporter and kube-state-metrics run as non-root by default in the upstream chart.
The monitoring namespace enforces the baseline Pod Security Standard (with restricted audit/warn) because node-exporter requires host namespaces and hostPort.
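With Pod Security Admission, that policy mix is expressed as labels on the namespace; a sketch of what the `monitoring` namespace carries under it (the actual labels are applied wherever the namespace is defined):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/enforce: baseline   # permits hostNetwork/hostPort for node-exporter
    pod-security.kubernetes.io/audit: restricted   # record violations of the stricter profile
    pod-security.kubernetes.io/warn: restricted    # surface warnings on apply
```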
## What's Included
The kube-prometheus-stack Helm chart deploys:
| Component | Purpose |
|---|---|
| Prometheus | Time-series metrics collection, PromQL queries, alerting rules |
| Grafana | Dashboards and visualization with 30+ pre-built K8s dashboards |
| Alertmanager | Alert routing and notification |
| node-exporter | Host-level metrics (CPU, memory, disk, network) |
| kube-state-metrics | Kubernetes object state metrics (pods, deployments, nodes) |
| Prometheus Operator | Manages Prometheus/Alertmanager CRDs declaratively |
### Pre-built Dashboards

Grafana ships with dashboards for:

- Cluster overview (CPU, memory, network, disk)
- Node metrics
- Pod/container resource usage
- Namespace resource quotas
- Persistent volume usage
- CoreDNS performance
- API server request rates and latency
## Custom Dashboards and Alerting Rules

This directory (`k8s/apps/monitoring/`) contains custom dashboards and alerting rules managed as GitOps resources.
### Directory Structure

```
k8s/apps/monitoring/
├── kustomization.yaml
├── external-secret.yaml
├── dashboards/
│   ├── dashboard-homelab-overview.yaml   # 🔴 Critical - Main monitoring view
│   ├── dashboard-node-health.yaml        # 🔴 Critical - Infrastructure health
│   ├── dashboard-network.yaml            # 🟡 Important - Network monitoring
│   ├── dashboard-argocd.yaml             # 🟡 Important - GitOps monitoring
│   ├── dashboard-authentik.yaml          # 🟢 Service - Authentication
│   └── dashboard-infisical.yaml          # 🟢 Service - Secrets management
└── rules/
    ├── recording-namespace.yaml          # Recording rules for fast queries
    ├── recording-requests.yaml           # HTTP request rate recording
    ├── alerts-pods.yaml                  # Pod crash/OOM alerts
    ├── alerts-pvc.yaml                   # Storage usage alerts
    ├── alerts-nodes.yaml                 # Node resource alerts
    ├── alerts-prometheus.yaml            # Prometheus health alerts
    ├── alerts-certificates.yaml          # Certificate expiry alerts
    └── alerts-argocd.yaml                # GitOps sync alerts
```
### Dashboard Priority System

🔴 Critical Dashboards (Daily Monitoring):

- Homelab Overview: Single-pane view of cluster health, resource usage, and active alerts
- Node Health: M4 Mac mini infrastructure monitoring (CPU, memory, disk, temperature)

🟡 Important Dashboards (Weekly/As-Needed):

- Network: Traffic patterns, policy rules, potential network issues
- ArgoCD: GitOps sync status, deployment operations

🟢 Service Dashboards (On-Demand):

- Authentik: SSO service health, authentication flows, user metrics
- Infisical: Secrets management service health, API usage
### Monitoring Workflow
- Daily Check: Start with Homelab Overview for cluster status
- Infrastructure: Check Node Health for hardware issues
- Investigate: Use specific dashboards when alerts trigger or issues arise
- Trends: Review Network dashboard for traffic patterns and policy effectiveness
### Dashboard Provisioning

Custom dashboards are deployed as ConfigMaps with the label `grafana_dashboard: "1"`. The Grafana sidecar (configured in the Helm chart) automatically picks up these ConfigMaps and loads them into Grafana without requiring manual UI changes.
Each dashboard is a separate ConfigMap file in `dashboards/`, following the naming convention:

```
dashboard-<name>.yaml
```

The file contains a ConfigMap with the dashboard JSON in `data.dashboard.json`. Dashboard UIDs should be unique across all dashboards.
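Since a duplicate UID means one dashboard silently overwrites another, a quick pre-commit check can catch collisions. This is a convenience sketch, not part of the repo, and it assumes each dashboard JSON contains a top-level `"uid"` field:

```shell
# Print any UID that appears in more than one dashboard ConfigMap.
# Empty output means all UIDs are unique.
grep -h '"uid"' dashboards/dashboard-*.yaml \
  | sed 's/.*"uid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/' \
  | sort | uniq -d
```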
### Prometheus Rules

Prometheus recording and alerting rules are defined as PrometheusRule CRDs in `rules/`. These are picked up by the Prometheus Operator automatically.

Two types of rules:

- Recording rules (`recording-*.yaml`) — pre-compute expensive queries for faster dashboard rendering
- Alerting rules (`alerts-*.yaml`) — generate alerts when conditions match
Naming conventions:

```
recording-<domain>.yaml   # e.g., recording-namespace.yaml, recording-requests.yaml
alerts-<domain>.yaml      # e.g., alerts-pods.yaml, alerts-nodes.yaml
```
### Adding a New Dashboard

1. Create the dashboard JSON in Grafana (export as JSON)
2. Create a new file `dashboards/dashboard-<name>.yaml`:

   ```yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: dashboard-<name>
     namespace: monitoring
     labels:
       grafana_dashboard: "1"
       app.kubernetes.io/name: monitoring
       app.kubernetes.io/instance: monitoring
   data:
     dashboard.json: |
       { ... }
   ```

3. Commit and push — ArgoCD will sync automatically; the sidecar will load the dashboard.
### Adding a New Alerting or Recording Rule

1. Create a new file `rules/<type>-<domain>.yaml`:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: <name>
     namespace: monitoring
     labels:
       prometheus: kube-prometheus
       role: alert-rules
       app.kubernetes.io/name: monitoring
       app.kubernetes.io/instance: monitoring
   spec:
     groups:
       - name: <group-name>
         rules:
           - alert: <alert-name>
             expr: <promql expression>
             for: <duration>
             labels:
               severity: <severity>
             annotations:
               summary: "..."
               description: "..."
   ```

2. Commit and push — the Prometheus Operator will reload rules automatically.
### Testing Alerting Rules

To verify alert rules work correctly:

1. Generate a synthetic condition (e.g., kill a pod repeatedly to trigger `PodCrashLooping`, fill a PVC to trigger `PVCUsageTooHigh`)
2. Check alerts in Grafana or via the Prometheus API:

   ```bash
   kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
   # Open http://localhost:9090/alerts
   ```

3. Verify Alertmanager receives the alert:

   ```bash
   kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093
   # Open http://localhost:9093/
   ```

4. Silence or resolve the condition, and ensure the alert clears.
### Notes for OpenClaw Agents

- All metrics are scraped by the Prometheus instance deployed with the `kube-prometheus-stack` Helm chart.
- Service-specific metrics (e.g., `http_requests_total`) come from the Prometheus instrumentation of those services. Ensure the services are correctly instrumented.
- For metrics not currently available (e.g., PostgreSQL-specific metrics), consider adding a PostgreSQL exporter or instrumenting the application.
- Dashboard UIDs should remain stable across updates. Changing a dashboard's UID will create a new dashboard in Grafana instead of updating the existing one.
- To modify an existing dashboard, edit the JSON in place and keep the same UID.
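For a service that already exposes a `/metrics` endpoint, scraping is typically wired up with a ServiceMonitor. A hypothetical example follows: the `example-app` name, labels, and port are placeholders, and the `release: monitoring` label reflects kube-prometheus-stack's default of selecting monitors labeled with the Helm release name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: monitoring          # matched by the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app           # labels on the target Service
  endpoints:
    - port: http                 # named port on the Service
      path: /metrics
      interval: 60s
```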
## Configuration

The monitoring stack is deployed via ArgoCD using the Helm chart source directly (no local manifests). All configuration is in the Application CR at `k8s/apps/argocd/applications/monitoring-app.yaml`.

Key settings:

- Prometheus retention: 15 days
- Prometheus storage: 10Gi PVC
- Scrape interval: 60s (explicitly configured)
- Evaluation interval: 60s (explicitly configured)
- Alertmanager storage: 2Gi PVC
- Grafana storage: 2Gi PVC (dashboard persistence)
- Disabled scrapers: kubeProxy, kubeEtcd, kubeScheduler, kubeControllerManager (not applicable to OrbStack single-node)
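These settings map onto kube-prometheus-stack values roughly as follows. This is an illustrative excerpt, not the authoritative file; see `monitoring-app.yaml` for the real values:

```yaml
prometheus:
  prometheusSpec:
    retention: 15d
    scrapeInterval: 60s
    evaluationInterval: 60s
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi
# Control-plane scrapers that don't apply to a single-node OrbStack cluster:
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
```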
### Secrets in Infisical

| Key | Purpose |
|---|---|
| `GRAFANA_ADMIN_PASSWORD` | Break-glass admin access (SSO is primary auth) |
| `GRAFANA_OAUTH_CLIENT_SECRET` | OIDC client secret for Authentik SSO integration |
### Modifying Configuration

Edit the `helm.valuesObject` in `k8s/apps/argocd/applications/monitoring-app.yaml`, then push to `main`. ArgoCD will sync the changes.
### Upgrading the Chart

Update `targetRevision` in the Application CR to the desired chart version, then push to `main`.
## Networking
| Layer | Value |
|---|---|
| Grafana container port | 3000 |
| NodePort | 30090 |
| Tailscale HTTPS | 8444 |
| URL | https://holdens-mac-mini.story-larch.ts.net:8444 |
One-time Tailscale Serve setup:

```bash
tailscale serve --bg --https 8444 http://localhost:30090
```
## Operational Commands

```bash
# Check monitoring pods
kubectl get pods -n monitoring

# Check Prometheus targets
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Then open http://localhost:9090/targets

# Check Alertmanager
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093

# View Prometheus storage usage
kubectl exec -n monitoring prometheus-monitoring-kube-prometheus-prometheus-0 -- df -h /prometheus

# Check PVCs
kubectl get pvc -n monitoring

# Check ExternalSecret status
kubectl get externalsecret -n monitoring

# Force secret re-sync
kubectl annotate externalsecret grafana-secret -n monitoring \
  force-sync=$(date +%s) --overwrite

# Check ArgoCD application status
kubectl get application monitoring monitoring-config -n argocd
```
## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Grafana login fails | SSO misconfigured | Check Authentik OIDC provider has `openid`, `email`, `profile` scope mappings; verify `api_url` has no trailing slash |
| No metrics in dashboards | Prometheus targets down | Check `kubectl get pods -n monitoring`; verify targets via port-forward |
| Dashboard panels show "No data" | Service decommissioned or metrics not available | Check if the service is running: `kubectl get deployments -A`; verify the metrics exist in Prometheus |
| High memory usage | Retention too long or too many metrics | Reduce retention or add a `retentionSize` limit in the Prometheus spec |
| PVC pending | No storage provisioner | Verify the local-path provisioner is running in `kube-system` |
| Grafana unreachable via Tailscale | Serve not configured | `tailscale serve --bg --https 8444 http://localhost:30090` |
| Grafana 404 on `/userinfo/emails` | Authentik provider missing scope mappings | Assign `openid`, `email`, `profile` scope mappings to the Grafana provider |
| Dashboard not appearing | ArgoCD sync issue | Check `kubectl get application monitoring-config -n argocd`; force refresh if needed |
### Dashboard-Specific Issues

Homelab Overview:

- "No data" on pod counts: Ensure kube-state-metrics is running
- Missing network metrics: Check node-exporter pod status

Node Health:

- Temperature data missing: May not be available on all systems
- Disk I/O shows no data: Check node-exporter configuration

Network Dashboard:

- "Pod Network via Cilium" empty: Cilium metrics not enabled (expected)
- Packet drops show no data: Normal if no network issues

Service Dashboards (Authentik/Infisical):

- Metrics missing: Check if the service is running and exposing metrics
- HTTP request data empty: Verify Prometheus is scraping the service endpoints
### Alert Testing

To test that alerting rules work correctly:

```bash
# Test PodCrashLooping alert
kubectl run test-crash --image=busybox --restart=Always -- sh -c "sleep 10 && exit 1"

# Check alerts in Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Visit http://localhost:9090/alerts

# Clean up test pod
kubectl delete pod test-crash
```