§Health Check System
Provides comprehensive health monitoring for Air daemon services, ensuring VSCode stability and security through multi-level health checks, dependency validation, and automatic recovery mechanisms.
§Responsibilities
- Monitor critical Air services (authentication, updates, downloader, indexing, gRPC, connections)
- Implement multi-level health checks (Alive, Responsive, Functional)
- Provide automatic recovery actions when services fail
- Track health history and performance metrics
- Integrate with VSCode’s stability patterns for service health monitoring
§VSCode Stability References
This health check system aligns with VSCode’s health monitoring patterns:
- Service health tracking similar to VSCode’s workbench service health
- Dependency validation matching VSCode’s extension host health checks
- Recovery patterns inspired by VSCode’s crash recovery mechanisms
- Performance monitoring patterns from VSCode’s telemetry system
Referenced from: vs/workbench/services/telemetry
§Mountain Monitoring Integration
Health check results are integrated with the Mountain monitoring system:
- Health status updates flow to Mountain’s monitoring dashboards
- Critical health events trigger alerts in Mountain’s alerting system
- Health metrics are aggregated for system-wide health assessment
- Recovery actions are coordinated with Mountain’s service management
§Monitoring Patterns
§Multi-Level Health Checks
- Alive: Basic service process check
- Responsive: Service responds to health check queries
- Functional: Service performs its core operations correctly
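The three levels can be read as an ordered ladder in which each deeper check is attempted only once the shallower one passes. The sketch below illustrates that reading. `CheckLevel`, `Probe`, `FixedProbe`, and `deepest_passing_level` are hypothetical names introduced for this example and are not taken from the crate, whose own `HealthCheckLevel` enum may be shaped differently.

```rust
/// Hypothetical check levels, ordered from shallowest to deepest.
/// The crate's real `HealthCheckLevel` enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CheckLevel {
    Alive,
    Responsive,
    Functional,
}

/// Hypothetical probe interface: one method per check level.
pub trait Probe {
    fn is_alive(&self) -> bool;
    fn is_responsive(&self) -> bool;
    fn is_functional(&self) -> bool;
}

/// Runs the checks in order and returns the deepest level that passed,
/// or `None` if even the liveness check failed.
pub fn deepest_passing_level<P: Probe>(probe: &P) -> Option<CheckLevel> {
    if !probe.is_alive() {
        return None;
    }
    if !probe.is_responsive() {
        return Some(CheckLevel::Alive);
    }
    if !probe.is_functional() {
        return Some(CheckLevel::Responsive);
    }
    Some(CheckLevel::Functional)
}

/// Probe with fixed answers, for illustration only.
pub struct FixedProbe {
    pub alive: bool,
    pub responsive: bool,
    pub functional: bool,
}

impl Probe for FixedProbe {
    fn is_alive(&self) -> bool {
        self.alive
    }
    fn is_responsive(&self) -> bool {
        self.responsive
    }
    fn is_functional(&self) -> bool {
        self.functional
    }
}
```

Stopping at the first failing level keeps expensive functional checks from running against a service that is not even answering queries.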
§Circuit Breaking
- Services are temporarily marked as unhealthy after consecutive failures
- Circuit breaker prevents cascading failures
- Automatic circuit breaker reset after cool-down period
- Manual circuit breaker reset available for administrative overrides
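A minimal sketch of the circuit-breaking behavior listed above, assuming a simple two-state breaker (closed or open) with a fixed cool-down. The `CircuitBreaker` type and its methods are hypothetical and written for this example; the crate's actual implementation is not shown on this page and may track state differently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical breaker: opens after `failure_threshold` consecutive
/// failures and closes again once `cool_down` has elapsed.
#[derive(Debug)]
pub struct CircuitBreaker {
    failure_threshold: u32,
    cool_down: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cool_down: Duration) -> Self {
        Self {
            failure_threshold,
            cool_down,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    /// A success clears the failure streak and closes the breaker.
    pub fn record_success(&mut self) {
        self.reset();
    }

    /// A failure extends the streak; reaching the threshold opens the breaker.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold && self.opened_at.is_none() {
            self.opened_at = Some(now);
        }
    }

    /// Returns true while the service should be treated as unhealthy.
    /// Once the cool-down has elapsed, the breaker resets automatically.
    pub fn is_open(&mut self, now: Instant) -> bool {
        match self.opened_at {
            Some(opened) if now.saturating_duration_since(opened) >= self.cool_down => {
                self.reset();
                false
            }
            Some(_) => true,
            None => false,
        }
    }

    /// Manual reset, for administrative overrides.
    pub fn reset(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` internally, is a design choice made here so the cool-down logic can be exercised without sleeping.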
§Timeout Handling
- Each health check has a configurable timeout
- Timeout events trigger immediate recovery actions
- Timeout history tracked to identify performance degradation
- Adaptive timeout adjustment based on observed performance
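The following sketch shows one way a per-check timeout and an adaptive adjustment could be expressed using only the standard library. It is an illustration under stated assumptions: `run_with_timeout`, `CheckOutcome`, and `adaptive_timeout` are hypothetical names, and the adjustment policy (twice the mean of recent durations, clamped to a range) is chosen for the example because the page does not describe the policy the crate actually uses.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result of a single timed health check.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckOutcome {
    Passed,
    Failed,
    TimedOut,
}

/// Runs `check` on a worker thread and waits at most `timeout` for its result.
/// A check that panics is reported as `Failed`.
pub fn run_with_timeout<F>(check: F, timeout: Duration) -> CheckOutcome
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(check());
    });
    match rx.recv_timeout(timeout) {
        Ok(true) => CheckOutcome::Passed,
        Ok(false) => CheckOutcome::Failed,
        Err(mpsc::RecvTimeoutError::Timeout) => CheckOutcome::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => CheckOutcome::Failed,
    }
}

/// Assumed policy: twice the mean of recent check durations, clamped to
/// `[min, max]`. With no history, falls back to `max`.
/// Panics if `min > max` (behavior of `Ord::clamp`).
pub fn adaptive_timeout(recent: &[Duration], min: Duration, max: Duration) -> Duration {
    if recent.is_empty() {
        return max;
    }
    let total: Duration = recent.iter().sum();
    let mean = total / recent.len() as u32;
    (mean * 2).clamp(min, max)
}
```

Note that the worker thread in this sketch is not cancelled on timeout; it simply runs to completion in the background. An async runtime with cancellable tasks would be the more likely choice in a daemon, but that is outside what this page confirms.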
§Recovery Mechanisms
Recovery actions are triggered based on:
- Consecutive failure count exceeding threshold
- Response time exceeding configured threshold
- Service unresponsiveness detected
- Manually triggered recovery
Recovery actions include:
- Service restart (graceful shutdown and restart)
- Connection reset (re-establish network connections)
- Cache clearing (remove stale or corrupted cache)
- Configuration reload (refresh service configuration)
- Escalation (notify administrators for manual intervention)
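The trigger conditions above can be sketched as a single evaluation function. Everything in this example is hypothetical: `Trigger`, `Observation`, `Thresholds`, and `evaluate` are illustrative names, and the priority order (manual request first, then unresponsiveness, slow response, and failure count) is an assumption, since the page does not say how the crate's `RecoveryTrigger` conditions are ranked or which recovery action each one selects.

```rust
/// Hypothetical trigger kinds mirroring the four conditions listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    ConsecutiveFailures,
    SlowResponse,
    Unresponsive,
    Manual,
}

/// Hypothetical snapshot of what the monitor last observed for one service.
pub struct Observation {
    pub consecutive_failures: u32,
    /// `None` means the service did not respond at all.
    pub last_response_time_ms: Option<u64>,
    pub manual_request: bool,
}

/// Threshold names follow the configuration fields documented on this page;
/// the integer types are assumptions.
pub struct Thresholds {
    pub consecutive_failures_threshold: u32,
    pub response_time_threshold_ms: u64,
}

/// Returns the first trigger that applies, or `None` if no recovery is needed.
/// The priority order is an assumption made for this sketch.
pub fn evaluate(obs: &Observation, th: &Thresholds) -> Option<Trigger> {
    if obs.manual_request {
        return Some(Trigger::Manual);
    }
    match obs.last_response_time_ms {
        None => return Some(Trigger::Unresponsive),
        Some(ms) if ms > th.response_time_threshold_ms => {
            return Some(Trigger::SlowResponse);
        }
        Some(_) => {}
    }
    // "Exceeding" is read strictly here: the count must be above the threshold.
    if obs.consecutive_failures > th.consecutive_failures_threshold {
        return Some(Trigger::ConsecutiveFailures);
    }
    None
}
```

Keeping trigger evaluation separate from the choice of recovery action lets the mapping from trigger to action (restart, connection reset, cache clearing, configuration reload, escalation) be changed without touching the detection logic.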
§TODO: Advanced Features
- Implement advanced metrics collection (latency percentiles, error rates)
- Add health check scheduling automation (cron-like scheduling)
- Implement predictive health analysis (machine learning-based)
- Add security compliance checks (PCI-DSS, GDPR, etc.)
- Implement distributed health checks for clustered deployments
- Add health check export formats (Prometheus, Grafana, etc.)
- Implement health check alerting through multiple channels (email, Slack, etc.)
- Add health check simulation for testing and validation
§Configuration
Health check behavior is configurable through HealthCheckConfig:
- default_check_interval: Time between automatic health checks
- history_retention: Number of health check records to keep
- consecutive_failures_threshold: Failures before triggering recovery
- response_time_threshold_ms: Response time threshold for recovery
- enable_auto_recovery: Enable/disable automatic recovery
- recovery_timeout_sec: Maximum time for recovery actions
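To make the shape of these settings concrete, here is a stand-in struct with the same field names. It is a sketch only: `ExampleHealthCheckConfig` is a hypothetical type, and the field types and default values are assumptions picked for illustration, not the real `HealthCheckConfig` definition or its defaults.

```rust
use std::time::Duration;

/// Hypothetical stand-in for `HealthCheckConfig`. Field names come from the
/// list above; types and defaults are assumed for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct ExampleHealthCheckConfig {
    /// Time between automatic health checks.
    pub default_check_interval: Duration,
    /// Number of health check records to keep.
    pub history_retention: usize,
    /// Failures before triggering recovery.
    pub consecutive_failures_threshold: u32,
    /// Response time threshold for recovery, in milliseconds.
    pub response_time_threshold_ms: u64,
    /// Enable or disable automatic recovery.
    pub enable_auto_recovery: bool,
    /// Maximum time for recovery actions, in seconds.
    pub recovery_timeout_sec: u64,
}

impl Default for ExampleHealthCheckConfig {
    /// Illustrative values only; the crate's real defaults are not shown here.
    fn default() -> Self {
        Self {
            default_check_interval: Duration::from_secs(30),
            history_retention: 100,
            consecutive_failures_threshold: 3,
            response_time_threshold_ms: 5_000,
            enable_auto_recovery: true,
            recovery_timeout_sec: 60,
        }
    }
}
```

With a `Default` implementation in place, a caller could override a single field using struct update syntax, for example `ExampleHealthCheckConfig { enable_auto_recovery: false, ..Default::default() }`.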
Structs§
- HealthCheckConfig - Health check configuration
- HealthCheckManager - Health check manager
- HealthCheckRecord - Health check record for history tracking
- HealthCheckResponse - Health check response for gRPC
- HealthStatistics - Health statistics
- PerformanceIndicators - Performance degradation indicators
- RecoveryAction - Recovery action configuration
- ResourceWarning - Resource warning types
- ServiceHealth - Service health information
Enums§
- DegradationLevel - Degradation levels
- HealthCheckLevel - Health check level
- HealthStatus - Health status enum
- RecoveryActionType - Recovery action types
- RecoveryTrigger - Recovery trigger conditions
- ResourceWarningType - Resource warning types
- WarningSeverity - Warning severity levels