Module HealthCheck

Expand description

§Health Check System

Provides comprehensive health monitoring for Air daemon services, ensuring VSCode stability and security through multi-level health checks, dependency validation, and automatic recovery mechanisms.

§Responsibilities

Monitor critical Air services (authentication, updates, downloader, indexing, gRPC, connections)
Implement multi-level health checks (Alive, Responsive, Functional)
Provide automatic recovery actions when services fail
Track health history and performance metrics
Integrate with VSCode’s stability patterns for service health monitoring

§VSCode Stability References

This health check system aligns with VSCode’s health monitoring patterns:

Service health tracking similar to VSCode’s workbench service health
Dependency validation matching VSCode’s extension host health checks
Recovery patterns inspired by VSCode’s crash recovery mechanisms
Performance monitoring patterns from VSCode’s telemetry system

Referenced from: vs/workbench/services/telemetry

§Mountain Monitoring Integration

Health check results are integrated with Mountain monitoring system:

Health status updates flow to Mountain’s monitoring dashboards
Critical health events trigger alerts in Mountain’s alerting system
Health metrics are aggregated for system-wide health assessment
Recovery actions are coordinated with Mountain’s service management

§Monitoring Patterns

§Multi-Level Health Checks

Alive: Basic service process check
Responsive: Service responds to health check queries
Functional: Service performs its core operations correctly

§Circuit Breaking

Services are temporarily marked as unhealthy after consecutive failures
Circuit breaker prevents cascading failures
Automatic circuit breaker reset after cool-down period
Manual circuit breaker reset available for administrative overrides

§Timeout Handling

Each health check has a configurable timeout
Timeout events trigger immediate recovery actions
Timeout history tracked to identify performance degradation
Adaptive timeout adjustment based on observed performance

§Recovery Mechanisms

Recovery actions are triggered based on:

Consecutive failure count exceeding threshold
Response time exceeding configured threshold
Service unresponsiveness detected
Manual-triggered recovery

Recovery actions include:

Service restart (graceful shutdown and restart)
Connection reset (re-establish network connections)
Cache clearing (remove stale or corrupted cache)
Configuration reload (refresh service configuration)
Escalation (notify administrators for manual intervention)

§TODO: Advanced Features

Implement advanced metrics collection (latency percentiles, error rates)
Add health check scheduling automation (cron-like scheduling)
Implement predictive health analysis (machine learning-based)
Add security compliance checks (PCI-DSS, GDPR, etc.)
Implement distributed health checks for clustered deployments
Add health check export formats (Prometheus, Grafana, etc.)
Implement health check alerting through multiple channels (email, Slack, etc.)
Add health check simulation for testing and validation

§Configuration

Health check behavior is configurable through HealthCheckConfig:

default_check_interval: Time between automatic health checks
history_retention: Number of health check records to keep
consecutive_failures_threshold: Failures before triggering recovery
response_time_threshold_ms: Response time threshold for recovery
enable_auto_recovery: Enable/disable automatic recovery
recovery_timeout_sec: Maximum time for recovery actions

Structs§

HealthCheckConfig: Health check configuration
HealthCheckManager: Health check manager
HealthCheckRecord: Health check record for history tracking
HealthCheckResponse: Health check response for gRPC
HealthStatistics: Health statistics
PerformanceIndicators: Performance degradation indicators
RecoveryAction: Recovery action configuration
ResourceWarning: Resource warning types
ServiceHealth: Service health information

Enums§

DegradationLevel: Degradation levels
HealthCheckLevel: Health check level
HealthStatus: Health status enum
RecoveryActionType: Recovery action types
RecoveryTrigger: Recovery trigger conditions
ResourceWarningType: Resource warning types
WarningSeverity: Warning severity levels