Skip to content

Monitoring & Observability ​

Enterprise-grade monitoring stack providing comprehensive observability across the entire 28-container infrastructure with real-time alerting and centralized logging.

📊 Monitoring Architecture ​

The monitoring stack consists of 9 specialized containers working together to provide complete infrastructure visibility:

📈
Metrics Collection
Prometheus, PVE Exporter
📊
Visualization
Grafana, Glance Dashboard
📝
Centralized Logging
Loki, Promtail
🚨
Intelligent Alerting
AlertManager, ntfy
🌐
External Monitoring
Blackbox Exporter, Uptime Kuma

đŸ—ī¸ Core Components ​

Metrics & Time Series ​

  • Prometheus (109): Primary metrics collection and storage
  • PVE Exporter (106): Proxmox infrastructure metrics
  • Blackbox Exporter (132): External endpoint monitoring
  • Node Exporters: System-level metrics from all containers

Visualization & Dashboards ​

  • Grafana (110): Primary dashboard and visualization platform
  • Glance (119): Lightweight status dashboard
  • Custom Dashboards: Tailored views for each service category

Centralized Logging ​

  • Loki (130): Log aggregation and storage (31-day retention)
  • Promtail (133): Log collection and shipping agent
  • Log Correlation: Unified logging across all 28 containers

Alerting & Notifications ​

  • AlertManager (131): Intelligent alert routing and management
  • Uptime Kuma (123): Service uptime monitoring
  • ntfy (124): Mobile push notifications to iPhone

📱 Real-Time Alerting ​

Alert Categories ​

🔴
Critical
Container failures, storage full, service down
🟡
Warning
90%+ resource usage, service degradation
â„šī¸
Info
Deployment status, backup completion
✅
Success
Service recovery, successful operations

Mobile Notifications ​

  • Instant Push: Real-time alerts to iPhone via ntfy
  • Alert Grouping: Intelligent bundling to prevent spam
  • Priority Routing: Critical alerts bypass Do Not Disturb
  • Rich Content: Detailed alert context with quick actions

📈 Key Metrics Tracked ​

Infrastructure Metrics ​

yaml
Host Resources:
  - CPU utilization across all cores
  - Memory usage and availability  
  - Disk space and I/O performance
  - Network throughput and latency

Container Health:
  - Container status and restarts
  - Resource consumption per container
  - Service response times
  - Application-specific metrics

Application Metrics ​

yaml
Media Stack:
  - Plex transcoding sessions
  - Download/upload rates
  - Storage usage trends
  - User activity patterns

Business Applications:
  - Database performance
  - User sessions
  - Document processing rates
  - Photo analysis progress

Security Services:
  - VPN connection status
  - Authentication attempts
  - Certificate expiration
  - Firewall rule matches

🔍 Advanced Monitoring Features ​

Log Correlation ​

Intelligent Alerting ​

  • Threshold-based Alerts: CPU, memory, disk usage warnings
  • Service-specific Rules: Application health checks
  • Trend Analysis: Predictive alerting based on patterns
  • Alert Correlation: Related events grouped together

External Monitoring ​

  • Blackbox Exporter: HTTP/HTTPS endpoint monitoring
  • DNS Resolution: External DNS health checks
  • Certificate Monitoring: SSL certificate expiration tracking
  • Network Connectivity: Latency and availability testing

📊 Dashboard Categories ​

Infrastructure Overview ​

  • Host Resource Usage: CPU, memory, disk, network
  • Container Status: Health, resource consumption, restarts
  • Network Performance: Bandwidth utilization, connection counts
  • Storage Analysis: Disk usage, I/O performance, growth trends

Service-Specific Dashboards ​

  • Media Stack: Plex performance, download statistics, library metrics
  • Security Services: VPN usage, authentication logs, certificate status
  • Business Applications: ERP usage, document processing, photo analysis
  • GitOps Metrics: Deployment success rates, pipeline performance

Operational Dashboards ​

  • Alert Summary: Current alerts, resolution times, trends
  • Backup Status: Completion rates, storage usage, failure tracking
  • Update Management: Package versions, security patches, system updates
  • Capacity Planning: Growth projections, resource forecasting

đŸŽ¯ Monitoring Best Practices ​

Data Retention ​

  • Metrics: 90 days of high-resolution data
  • Logs: 31 days of centralized log retention
  • Alerts: 30 days of alert history
  • Dashboards: Automatic cleanup of old snapshots

Performance Optimization ​

  • Efficient Queries: Optimized PromQL for fast dashboards
  • Data Compression: Loki compression for log storage
  • Resource Allocation: Right-sized monitoring containers
  • Network Efficiency: Local metric collection to minimize bandwidth

Security & Access ​

  • Role-based Access: Different dashboard views for different users
  • Secure Communication: TLS encryption for all monitoring traffic
  • Audit Logging: Complete access log for monitoring systems
  • Data Privacy: Sensitive data filtering in logs and metrics

📱 Mobile Experience ​

ntfy Integration ​

yaml
Notification Channels:
  - homelab-critical: 🔴 Critical system alerts
  - homelab-warnings: 🟡 Performance warnings
  - homelab-info: â„šī¸ Status updates
  - homelab-success: ✅ Operation confirmations

Features:
  - Push notifications with custom sounds
  - Rich text formatting and emojis
  - Action buttons for quick responses
  - Offline message queuing

Real-world Examples ​

  • Container Failure: "🔴 Plex container stopped - restarting automatically"
  • Storage Warning: "🟡 Disk usage 85% on media storage - cleanup recommended"
  • Backup Success: "✅ All 28 containers backed up successfully"
  • Update Available: "â„šī¸ Security updates available for 3 containers"

🔧 Maintenance & Operations ​

Automated Maintenance ​

  • Log Rotation: Automatic cleanup of old log files
  • Metrics Cleanup: Removal of stale time series data
  • Dashboard Updates: Automatic refresh of dynamic content
  • Alert Rule Validation: Regular testing of alert conditions

Health Checks ​

  • Self-Monitoring: Monitoring stack monitors itself
  • Dependency Checks: Validation of service dependencies
  • Data Quality: Metrics and log ingestion validation
  • Performance Monitoring: Monitoring system performance tracking

This comprehensive monitoring implementation provides enterprise-grade observability that ensures the 28-container homelab operates reliably with proactive issue detection and resolution capabilities.

Enterprise-Grade Homelab Infrastructure