Monitoring & Observability
The enterprise-grade monitoring stack provides observability across the entire 28-container infrastructure, with real-time alerting and centralized logging.
Monitoring Architecture
The monitoring stack consists of 9 specialized containers working together to provide complete infrastructure visibility:
- Metrics Collection: Prometheus, PVE Exporter
- Visualization: Grafana, Glance Dashboard
- Centralized Logging: Loki, Promtail
- Intelligent Alerting: AlertManager, ntfy
- External Monitoring: Blackbox Exporter, Uptime Kuma
Core Components
Metrics & Time Series
- Prometheus (109): Primary metrics collection and storage
- PVE Exporter (106): Proxmox infrastructure metrics
- Blackbox Exporter (132): External endpoint monitoring
- Node Exporters: System-level metrics from all containers
Visualization & Dashboards
- Grafana (110): Primary dashboard and visualization platform
- Glance (119): Lightweight status dashboard
- Custom Dashboards: Tailored views for each service category
Centralized Logging
- Loki (130): Log aggregation and storage (31-day retention)
- Promtail (133): Log collection and shipping agent
- Log Correlation: Unified logging across all 28 containers
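The Promtail-to-Loki flow listed above can be illustrated with a minimal sketch of the JSON body that Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts. The host name and the label names `job` and `container` are assumptions for illustration; the source does not show the actual Promtail configuration.

```python
import json
import time
import urllib.request

# Hypothetical address; the source only identifies Loki as container 130.
LOKI_PUSH_URL = "http://loki.example.internal:3100/loki/api/v1/push"


def build_push_payload(labels, lines, now_ns=None):
    """Build a push body: one stream with a label set and [timestamp_ns, line] pairs."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki expects each timestamp as a string of nanoseconds since the epoch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def build_push_request(payload, url=LOKI_PUSH_URL):
    """Wrap the payload in a POST request; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_push_payload(
    {"job": "homelab", "container": "plex"},  # hypothetical label names
    ["transcoder started", "transcoder finished"],
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

Labels identify the stream, so logs from different containers can be queried together when they share a common label such as `job`.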
Alerting & Notifications
- AlertManager (131): Intelligent alert routing and management
- Uptime Kuma (123): Service uptime monitoring
- ntfy (124): Mobile push notifications to iPhone
Real-Time Alerting
Alert Categories
- Critical: Container failures, storage full, service down
- Warning: 90%+ resource usage, service degradation
- Info: Deployment status, backup completion
- Success: Service recovery, successful operations
Mobile Notifications
- Instant Push: Real-time alerts to iPhone via ntfy
- Alert Grouping: Intelligent bundling to prevent spam
- Priority Routing: Critical alerts bypass Do Not Disturb
- Rich Content: Detailed alert context with quick actions
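The alert grouping idea above can be sketched in plain Python: alerts that share a severity and a service are bundled so that one notification covers the whole group. This is only an illustration of the concept, not the stack's actual AlertManager configuration; the field names `name`, `severity`, and `service` are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("severity", "service")):
    """Bundle alerts that share the same values for `keys` into one group."""
    groups = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(k, "") for k in keys)
        groups[group_key].append(alert)
    return dict(groups)


def summarize(group_key, members):
    """Render one notification line for a whole group instead of one per alert."""
    severity, service = group_key
    names = ", ".join(sorted({a["name"] for a in members}))
    return f"[{severity}] {service}: {len(members)} alert(s) ({names})"


# Illustrative alerts; the field names are hypothetical.
alerts = [
    {"name": "HighCPU", "severity": "warning", "service": "plex"},
    {"name": "HighMemory", "severity": "warning", "service": "plex"},
    {"name": "ContainerDown", "severity": "critical", "service": "grafana"},
]
for key, members in group_alerts(alerts).items():
    print(summarize(key, members))
```

Three alerts collapse into two notifications here, which is the effect that keeps a burst of related warnings from flooding the phone.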
Key Metrics Tracked
Infrastructure Metrics
Host Resources:
- CPU utilization across all cores
- Memory usage and availability
- Disk space and I/O performance
- Network throughput and latency
Container Health:
- Container status and restarts
- Resource consumption per container
- Service response times
- Application-specific metrics
Application Metrics
Media Stack:
- Plex transcoding sessions
- Download/upload rates
- Storage usage trends
- User activity patterns
Business Applications:
- Database performance
- User sessions
- Document processing rates
- Photo analysis progress
Security Services:
- VPN connection status
- Authentication attempts
- Certificate expiration
- Firewall rule matches
Advanced Monitoring Features
Log Correlation
Intelligent Alerting
- Threshold-based Alerts: CPU, memory, disk usage warnings
- Service-specific Rules: Application health checks
- Trend Analysis: Predictive alerting based on patterns
- Alert Correlation: Related events grouped together
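Trend-based alerting of the kind listed above can be sketched as a least-squares line fitted to recent samples and extrapolated forward, which is the same idea as PromQL's `predict_linear`. The sample values below are made up, and the source does not show the stack's real alert rules; the 90% threshold is borrowed from the Warning category.

```python
def linear_fit(samples):
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict(samples, seconds_ahead):
    """Extrapolate the fitted line `seconds_ahead` past the latest sample."""
    slope, intercept = linear_fit(samples)
    latest_t = max(t for t, _ in samples)
    return slope * (latest_t + seconds_ahead) + intercept


# Disk usage in percent, sampled hourly (illustrative numbers).
usage = [(0, 70.0), (3600, 71.0), (7200, 72.0), (10800, 73.0)]
projected = predict(usage, seconds_ahead=24 * 3600)
print(f"projected usage in 24h: {projected:.1f}%")
if projected >= 90.0:
    print("warning: disk projected to cross 90% within 24h")
```

The benefit over a fixed threshold is lead time: the warning fires while usage is still at 73%, because the trend says the limit will be crossed within a day.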
External Monitoring
- Blackbox Exporter: HTTP/HTTPS endpoint monitoring
- DNS Resolution: External DNS health checks
- Certificate Monitoring: SSL certificate expiration tracking
- Network Connectivity: Latency and availability testing
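Certificate expiration tracking, as listed above, comes down to comparing a certificate's `notAfter` date with the current time. The sketch below uses only the Python standard library and is not how the stack does it; in a Prometheus setup this is more commonly read from the Blackbox Exporter metric `probe_ssl_earliest_cert_expiry`. The dates are made up for the example.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days between `now` and a notAfter string such as 'Jan  5 09:34:43 2030 GMT'."""
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a fixed "now" so the arithmetic is reproducible.
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
remaining = days_until_expiry("Jan 15 00:00:00 2030 GMT", now=now)
print(f"certificate expires in {remaining:.0f} days")
```

For a live check, `days_until_expiry(fetch_not_after("example.com"))` returns the remaining days for that host; the warning threshold applied to the result is a policy choice the source does not specify.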
Dashboard Categories
Infrastructure Overview
- Host Resource Usage: CPU, memory, disk, network
- Container Status: Health, resource consumption, restarts
- Network Performance: Bandwidth utilization, connection counts
- Storage Analysis: Disk usage, I/O performance, growth trends
Service-Specific Dashboards
- Media Stack: Plex performance, download statistics, library metrics
- Security Services: VPN usage, authentication logs, certificate status
- Business Applications: ERP usage, document processing, photo analysis
- GitOps Metrics: Deployment success rates, pipeline performance
Operational Dashboards
- Alert Summary: Current alerts, resolution times, trends
- Backup Status: Completion rates, storage usage, failure tracking
- Update Management: Package versions, security patches, system updates
- Capacity Planning: Growth projections, resource forecasting
Monitoring Best Practices
Data Retention
- Metrics: 90 days of high-resolution data
- Logs: 31 days of centralized log retention
- Alerts: 30 days of alert history
- Dashboards: Automatic cleanup of old snapshots
Performance Optimization
- Efficient Queries: Optimized PromQL for fast dashboards
- Data Compression: Loki compression for log storage
- Resource Allocation: Right-sized monitoring containers
- Network Efficiency: Local metric collection to minimize bandwidth
Security & Access
- Role-based Access: Different dashboard views for different users
- Secure Communication: TLS encryption for all monitoring traffic
- Audit Logging: Complete access log for monitoring systems
- Data Privacy: Sensitive data filtering in logs and metrics
Mobile Experience
ntfy Integration
Notification Channels:
- homelab-critical: 🔴 Critical system alerts
- homelab-warnings: 🟡 Performance warnings
- homelab-info: ℹ️ Status updates
- homelab-success: ✅ Operation confirmations
Features:
- Push notifications with custom sounds
- Rich text formatting and emojis
- Action buttons for quick responses
- Offline message queuing
Real-world Examples
- Container Failure: "🔴 Plex container stopped - restarting automatically"
- Storage Warning: "🟡 Disk usage 85% on media storage - cleanup recommended"
- Backup Success: "✅ All 28 containers backed up successfully"
- Update Available: "ℹ️ Security updates available for 3 containers"
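A message like the ones above could be published to ntfy with a plain HTTP POST, where the topic is the URL path and the `Title` and `Priority` headers carry the metadata. The sketch below uses the topic names from the channel list; the base URL and the priority assigned to each level are assumptions, not taken from the source.

```python
import urllib.request

# Hypothetical base URL; the source only identifies ntfy as container 124.
NTFY_BASE_URL = "http://ntfy.example.internal"

# Topic names come from the channel list above. The priority chosen for each
# level is an assumption (ntfy accepts min, low, default, high, and urgent).
CHANNELS = {
    "critical": ("homelab-critical", "urgent"),
    "warning": ("homelab-warnings", "high"),
    "info": ("homelab-info", "default"),
    "success": ("homelab-success", "low"),
}


def build_notification(level, title, message, base_url=NTFY_BASE_URL):
    """Build a POST request for the topic and priority mapped to `level`."""
    topic, priority = CHANNELS[level]
    return urllib.request.Request(
        f"{base_url}/{topic}",
        data=message.encode("utf-8"),  # the request body is the notification text
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


request = build_notification(
    "warning",
    "Storage Warning",
    "Disk usage 85% on media storage - cleanup recommended",
)
print(request.full_url, dict(request.header_items()))
# To send it: urllib.request.urlopen(request, timeout=5)
```

Keeping one topic per severity lets the phone apply different notification settings to each channel, which is one plausible way to get critical alerts through while quieter channels stay muted.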
Maintenance & Operations
Automated Maintenance
- Log Rotation: Automatic cleanup of old log files
- Metrics Cleanup: Removal of stale time series data
- Dashboard Updates: Automatic refresh of dynamic content
- Alert Rule Validation: Regular testing of alert conditions
Health Checks
- Self-Monitoring: Monitoring stack monitors itself
- Dependency Checks: Validation of service dependencies
- Data Quality: Metrics and log ingestion validation
- Performance Monitoring: Tracks the performance of the monitoring stack itself
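Self-monitoring can be sketched as a query for Prometheus's built-in `up` metric, which is 1 for every scrape target that responded and 0 for every target that did not. The example parses a canned response in the shape returned by the HTTP API's instant-query endpoint (`/api/v1/query`); the host name and the job and instance labels are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address; the source only identifies Prometheus as container 109.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def down_targets(query_response):
    """Return the (job, instance) pairs whose `up` sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for sample in query_response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0.0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)


def query_up(base_url=PROMETHEUS_URL, timeout=5.0):
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Canned response so the example runs offline; labels are hypothetical.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "loki", "instance": "loki:3100"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "job": "grafana", "instance": "grafana:3000"},
             "value": [1700000000, "0"]},
        ],
    },
}
print(down_targets(sample_response))
```

Against a live server, `down_targets(query_up())` lists every target that failed its last scrape, including the monitoring components themselves when they are configured as scrape targets.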
Together, these components give the 28-container homelab enterprise-grade observability, so that issues are detected and resolved proactively.