Health Checks

Cbox Init monitors process health with TCP, HTTP, and exec-based health checks. Health checks prevent restart loops, enable dependency verification, and support both readiness and liveness probes.

Overview

Features:

🌐 TCP Health Checks - Port connectivity testing
📡 HTTP Health Checks - Endpoint validation with status codes
⚙️ Exec Health Checks - Custom command validation
🎯 Success Thresholds - Prevent flapping with consecutive success requirements
🔄 Configurable Retries - Automatic retry with timeouts
📊 Prometheus Metrics - Health check duration and status tracking

Quick Start

processes:
  nginx:
    enabled: true
    command: ["nginx", "-g", "daemon off;"]
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"
      interval: 10
      timeout: 5
      retries: 3
      success_threshold: 2

Health Check Types

1. TCP Health Check

Tests port connectivity - Useful for databases, caches, and services without HTTP endpoints.

processes:
  redis:
    enabled: true
    command: ["redis-server", "/etc/redis/redis.conf"]
    health_check:
      type: tcp
      address: "127.0.0.1:6379"
      interval: 5
      timeout: 2
      retries: 3
      success_threshold: 1

Configuration:

type: tcp - Required
address - Format: host:port (e.g., 127.0.0.1:6379, localhost:3306)
Tests if port accepts connections

Use cases:

MySQL: 127.0.0.1:3306
PostgreSQL: 127.0.0.1:5432
Redis: 127.0.0.1:6379
PHP-FPM: 127.0.0.1:9000
Memcached: 127.0.0.1:11211
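
For illustration, a TCP check amounts to opening a connection to the address and closing it again. The Python sketch below shows that idea only; it is not Cbox Init's implementation, and the function name tcp_check and its default timeout are hypothetical.

```python
import socket


def tcp_check(address: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to "host:port" opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of Cbox Init.
    """
    host, _, port = address.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 literals such as [::1]:6379
    try:
        # create_connection resolves the host, connects, and raises OSError
        # on refusal, an unreachable host, or a timeout.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError, OverflowError):
        # ValueError / OverflowError: the port is missing or not a valid number.
        return False
```

Under these assumptions, tcp_check("127.0.0.1:6379") returns True only while something is accepting connections on that port. No data is exchanged, so a listening but hung service would still pass.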

2. HTTP Health Check

Tests HTTP endpoints - Validates the HTTP status code of the response.

processes:
  nginx:
    enabled: true
    command: ["nginx", "-g", "daemon off;"]
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"
      interval: 10
      timeout: 5
      retries: 3
      success_threshold: 2

Configuration:

type: http - Required
address - Full URL (e.g., http://localhost:80/health)
Expects: HTTP status 200-299 (success)
Fails: HTTP status ≥300, connection error, or timeout

Use cases:

Nginx health endpoint: http://127.0.0.1:80/health
Application health check: http://127.0.0.1:80/api/health
Custom health route: http://127.0.0.1:8080/status
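
The pass/fail rule for HTTP checks can be sketched in Python as follows. This is an illustration under assumptions, not Cbox Init's code: the name http_check is hypothetical, and the sketch chooses not to follow redirects so that a 3xx response counts as a failure, in line with the status rule above. Whether Cbox Init itself follows redirects is not stated here.

```python
import http.client
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so a 3xx response surfaces as an HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


# ProxyHandler({}) ignores proxy environment variables for these local checks.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}), _NoRedirect)


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True only for a 2xx response received within `timeout` seconds.

    Hypothetical helper for illustration; not part of Cbox Init.
    """
    try:
        with _opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTPError (a URLError subclass) covers 3xx/4xx/5xx responses; the
        # others cover connection errors, timeouts, and malformed URLs.
        return False
```

With this sketch, http_check("http://127.0.0.1:80/health") is True only when the endpoint answers with a 2xx status before the timeout expires.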

PHP health endpoint example:

<?php
// Simple health endpoint (works with any framework)
http_response_code(200);
header('Content-Type: application/json');
echo json_encode(['status' => 'healthy']);

3. Exec Health Check

Runs custom commands - Maximum flexibility for complex health validation.

processes:
  horizon:
    enabled: true
    command: ["php", "artisan", "horizon"]
    health_check:
      type: exec
      command: ["php", "artisan", "horizon:status"]
      interval: 30
      timeout: 10
      retries: 2
      success_threshold: 1

Configuration:

type: exec - Required
command - Array of command and arguments
Expects: Exit code 0 (success)
Fails: Exit code ≠0 or timeout

Use cases:

Process checks: ["pgrep", "-f", "queue:work"]
Horizon status: ["php", "artisan", "horizon:status"]
Database ping: ["mysql", "-u", "root", "-e", "SELECT 1"]
Custom scripts: ["./scripts/healthcheck.sh"]

Custom health check script:

#!/bin/bash
# scripts/healthcheck.sh

# Check if process is running
if pgrep -f "queue:work" > /dev/null; then
  exit 0  # Healthy
else
  exit 1  # Unhealthy
fi
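
The exec rule (exit code 0 passes; a non-zero exit code or a timeout fails) can be sketched in Python like this. The function name exec_check is hypothetical and the code only illustrates the rule; it is not how Cbox Init runs commands internally.

```python
import subprocess


def exec_check(command: list[str], timeout: float = 10.0) -> bool:
    """Run `command` and return True only if it exits with code 0 in time.

    Hypothetical helper for illustration; not part of Cbox Init.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # only the exit code matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,            # the child is killed when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        # OSError: the command is missing or not executable.
        return False
    return result.returncode == 0
```

For example, exec_check(["pgrep", "-f", "queue:work"]) mirrors the script above: pgrep exits with 0 when at least one process matches.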

Configuration Parameters

Core Settings

health_check:
  type: http                  # Required: tcp, http, exec
  address: "..."              # Required for tcp/http
  command: [...]              # Required for exec
  interval: 10                # Seconds between checks (default: 30)
  timeout: 5                  # Max wait time per check (default: 30)
  retries: 3                  # Failed attempts before unhealthy (default: 3)
  success_threshold: 2        # Consecutive successes to mark healthy (default: 1)

Parameter Details

interval - Time between health checks

Default: 30 seconds
Range: 1-300 seconds
Recommendation: 10-30s for most applications
Too low: High overhead, noise
Too high: Slow failure detection

timeout - Maximum wait time per check

Default: 30 seconds
Range: 1-60 seconds
Recommendation: 5-10s for HTTP/TCP, 10-30s for exec
Too low: False positives during load spikes
Too high: Delayed failure detection

retries - Failed attempts before marking unhealthy

Default: 3
Range: 1-10
Recommendation: 3-5 retries for transient failures
Higher: More tolerance for temporary issues
Lower: Faster reaction to real failures

success_threshold - Consecutive successes to mark healthy

Default: 1
Range: 1-10
Recommendation: 1-2 for stable services, 2-3 for flaky services
Use case: Prevent restart flapping after recovery
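
The defaults and ranges listed above can be collected into a small validation sketch. The Python below is illustrative only: the function name resolve_health_check is hypothetical, and whether Cbox Init rejects out-of-range values (rather than clamping or ignoring them) is an assumption.

```python
# Defaults, ranges, and required fields as documented in this section.
DEFAULTS = {"interval": 30, "timeout": 30, "retries": 3, "success_threshold": 1}
RANGES = {
    "interval": (1, 300),
    "timeout": (1, 60),
    "retries": (1, 10),
    "success_threshold": (1, 10),
}
REQUIRED_FIELD = {"tcp": "address", "http": "address", "exec": "command"}


def resolve_health_check(config: dict) -> dict:
    """Apply the documented defaults and reject values outside the documented ranges.

    Hypothetical helper; raising ValueError is an assumption of this sketch.
    """
    resolved = {**DEFAULTS, **config}
    check_type = resolved.get("type")
    if check_type not in REQUIRED_FIELD:
        raise ValueError(f"type must be one of {sorted(REQUIRED_FIELD)}, got {check_type!r}")
    field = REQUIRED_FIELD[check_type]
    if not resolved.get(field):
        raise ValueError(f"{field} is required for type {check_type}")
    for name, (low, high) in RANGES.items():
        if not low <= resolved[name] <= high:
            raise ValueError(f"{name}={resolved[name]} is outside {low}-{high}")
    return resolved
```

Passing {"type": "tcp", "address": "127.0.0.1:6379", "interval": 5} keeps the interval of 5 and fills in timeout 30, retries 3, and success_threshold 1.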

Health Check Lifecycle

State Machine

[Starting] → [Checking] → [Healthy] → [Checking] → ...
                 ↓            ↓
              [Unhealthy] ← [Failed]
                 ↓
              [Restart]

States:

Starting - Process just started, health checks not yet active
Checking - Health check in progress
Healthy - Check passed, process operational
Failed - Single check failed, retries remaining
Unhealthy - All retries exhausted; a restart is triggered unless the restart policy is never

Example Timeline

Time  | State       | Check Result | Retries | Action
------|-------------|--------------|---------|--------
0s    | Starting    | -            | -       | Process starts
10s   | Checking    | Success      | 0       | Mark healthy (threshold=1)
20s   | Healthy     | Success      | 0       | Continue
30s   | Healthy     | Failed       | 1/3     | Retry
35s   | Checking    | Failed       | 2/3     | Retry
40s   | Checking    | Failed       | 3/3     | Mark unhealthy
40s   | Unhealthy   | -            | -       | Trigger restart
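
The counting behind this timeline can be sketched in a few lines of Python. This is a simplified model under stated assumptions: failures and successes are counted consecutively, any success resets the failure count, and the state is recorded after each check completes. The class name HealthTracker is hypothetical and the code is not taken from Cbox Init.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Simplified model of the retry and success-threshold counters (assumption)."""

    retries: int = 3
    success_threshold: int = 1
    state: str = "starting"
    failures: int = 0   # consecutive failed checks
    successes: int = 0  # consecutive passed checks

    def record(self, passed: bool) -> str:
        """Feed one check result and return the resulting state."""
        if passed:
            self.failures = 0
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "healthy"
            else:
                self.state = "checking"  # passing, but threshold not yet met
        else:
            self.successes = 0
            self.failures += 1
            # "failed" means retries remain; "unhealthy" means they are exhausted.
            self.state = "unhealthy" if self.failures >= self.retries else "failed"
        return self.state


tracker = HealthTracker(retries=3, success_threshold=1)
timeline = [tracker.record(result) for result in (True, True, False, False, False)]
```

Replaying the five check results from the table leaves timeline as healthy, healthy, failed, failed, unhealthy: the third consecutive failure exhausts the retries.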

Success Threshold Pattern

Prevents restart flapping by requiring multiple consecutive successes after failure.

Without Success Threshold (threshold=1)

Check: ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ...
State: H H U R U R U R ...  (Flapping - many restarts)

With Success Threshold (threshold=3)

Check: ✓ ✓ ✗ ✓ ✓ ✓ ✗ ✓ ✓ ✓ ...
State: H H U ? ? H U ? ? H ...  (Stable - fewer restarts)

Configuration:

health_check:
  type: http
  address: "http://127.0.0.1:80/health"
  retries: 3
  success_threshold: 3  # Require 3 consecutive successes

Use cases:

Services with warm-up period
Apps with transient initialization issues
Prevent restart storms during load spikes
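
To make the effect of the threshold concrete, the Python sketch below replays a sequence of check results and prints one state letter per check. It is a model under assumptions, not Cbox Init's logic: the process starts healthy, the threshold only applies while recovering from an unhealthy state, and the function names are hypothetical. It uses "+" for a passed check and "-" for a failed one.

```python
def replay(checks: str, retries: int, success_threshold: int) -> str:
    """Replay check results ('+' pass, '-' fail), starting from a healthy state.

    Returns one letter per check: H = healthy, U = unhealthy,
    ? = passing again but still below success_threshold.
    """
    healthy = True
    passes = fails = 0
    states = []
    for result in checks:
        if result == "+":
            passes, fails = passes + 1, 0
            if not healthy and passes >= success_threshold:
                healthy = True
            states.append("H" if healthy else "?")
        else:
            passes, fails = 0, fails + 1
            if healthy and fails >= retries:
                healthy = False
            states.append("H" if healthy else "U")
    return "".join(states)


def unhealthy_transitions(states: str) -> int:
    """Count how often the state drops from healthy (H) to unhealthy (U)."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev == "H" and cur == "U")
```

With retries=1, replay("++-+++-+++", 1, 3) yields HHU??HU??H, the same shape as the threshold=3 diagram above. For an alternating sequence such as "+-+-+-+++", a threshold of 1 gives HUHUHUHHH (three drops to unhealthy), while a threshold of 3 gives HU?U?U??H (one drop).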

Integration with Restart Policies

Health checks work with restart policies to control process lifecycle:

processes:
  worker:
    enabled: true
    command: ["php", "artisan", "queue:work"]
    restart: on-failure     # Only restart on failures
    health_check:
      type: exec
      command: ["pgrep", "-f", "queue:work"]
      interval: 30
      retries: 3

Restart policy behaviors:

always - Restart on health check failure
on-failure - Restart on health check failure (same as always)
never - Health checks still run, but no restart triggered
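
The three behaviors listed above reduce to a single decision, sketched here in Python. The function name is hypothetical, and the sketch covers only the health-check case described in this section, not how each policy treats ordinary process exits.

```python
def restarts_on_unhealthy(policy: str) -> bool:
    """Return True if a process marked unhealthy is restarted under `policy`."""
    if policy in ("always", "on-failure"):
        return True
    if policy == "never":
        return False  # checks still run, but no restart follows
    raise ValueError(f"unknown restart policy: {policy!r}")
```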

Prometheus Metrics

Health check metrics are exported when metrics_enabled: true:

# Health check status (1=healthy, 0=unhealthy)
cbox_init_health_check_status{process="nginx", type="http"}

# Health check duration in seconds
cbox_init_health_check_duration_seconds{process="nginx", type="http"}

# Total health check failures
cbox_init_health_check_failures_total{process="nginx", type="http"}

Alerting rules (Prometheus rule format):

- alert: HealthCheckFailing
  expr: cbox_init_health_check_status == 0
  for: 5m
  annotations:
    summary: "Process {{$labels.process}} health check failing"

- alert: SlowHealthCheck
  expr: cbox_init_health_check_duration_seconds > 5
  for: 5m
  annotations:
    summary: "Slow health check for {{$labels.process}}"
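
Outside of an alerting stack, the status metric can also be read directly from scraped metrics text. The Python sketch below parses lines in the Prometheus text exposition format and lists processes whose status is 0. The sample input is made up for illustration and was not captured from a real Cbox Init instance; the function name is hypothetical.

```python
import re

# Illustrative input only; real output depends on the configured processes.
SAMPLE = """\
# Illustrative sample, not captured from a real instance.
cbox_init_health_check_status{process="nginx",type="http"} 1
cbox_init_health_check_status{process="redis",type="tcp"} 0
"""

_STATUS_LINE = re.compile(
    r"^cbox_init_health_check_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def unhealthy_processes(metrics_text: str) -> list[str]:
    """Return the `process` label of every status sample whose value is 0."""
    names = []
    for line in metrics_text.splitlines():
        match = _STATUS_LINE.match(line)
        if match and float(match["value"]) == 0:
            labels = dict(_LABEL.findall(match["labels"]))
            names.append(labels.get("process", "?"))
    return names
```

For the sample above, unhealthy_processes(SAMPLE) returns ["redis"].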

Best Practices

1. Match Check Type to Service

# ✅ TCP for database
redis:
  health_check:
    type: tcp
    address: "127.0.0.1:6379"

# ✅ HTTP for web server
nginx:
  health_check:
    type: http
    address: "http://127.0.0.1:80/health"

# ✅ Exec for queue worker
worker:
  health_check:
    type: exec
    command: ["pgrep", "-f", "queue:work"]

2. Set Appropriate Timeouts

# ✅ Fast services - low timeout
health_check:
  type: tcp
  address: "127.0.0.1:6379"
  timeout: 2  # Redis is fast

# ✅ Slow services - higher timeout
health_check:
  type: exec
  command: ["./complex-health-check.sh"]
  timeout: 30  # Complex checks need time

3. Use Success Thresholds for Flaky Services

# ✅ Prevent restart flapping
health_check:
  type: http
  address: "http://127.0.0.1:80/health"
  success_threshold: 3  # Require 3 consecutive successes

4. Combine with Dependencies

processes:
  php-fpm:
    enabled: true
    command: ["php-fpm", "-F"]
    health_check:
      type: tcp
      address: "127.0.0.1:9000"

  nginx:
    enabled: true
    command: ["nginx", "-g", "daemon off;"]
    depends_on: [php-fpm]  # Wait for PHP-FPM
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"

Troubleshooting

Health Checks Always Failing

Issue: Process keeps restarting due to failed health checks

Solutions:

Verify endpoint is reachable:

# HTTP
curl -v http://127.0.0.1:80/health

# TCP
telnet 127.0.0.1 6379

# Exec
./scripts/healthcheck.sh && echo "Success" || echo "Failed"

Increase timeout:

health_check:
  timeout: 10  # Increase from 5

Check process startup time:

health_check:
  interval: 30  # Give more time to start
  retries: 5    # More tolerance

Add success threshold:

health_check:
  success_threshold: 2  # Require 2 successes

Process Not Restarting on Failure

Issue: Health check fails but process doesn't restart

Solutions:

Check restart policy:
```
restart: always  # Or on-failure
```

Verify health check is configured:

health_check:
  type: tcp
  address: "..."  # Must be present

Check logs:

# View Cbox logs
journalctl -u cbox-init -f | grep "health"

False Positives During Load

Issue: Health checks fail during high load

Solutions:

Increase timeout:

health_check:
  timeout: 10  # More tolerance for load

Reduce check frequency:

health_check:
  interval: 60  # Less frequent checks

Add success threshold:

health_check:
  success_threshold: 3  # Prevent flapping

Examples

PHP Application

processes:
  php-fpm:
    enabled: true
    command: ["php-fpm", "-F", "-R"]
    health_check:
      type: tcp
      address: "127.0.0.1:9000"
      interval: 10
      timeout: 5

  nginx:
    enabled: true
    command: ["nginx", "-g", "daemon off;"]
    depends_on: [php-fpm]
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"
      interval: 10
      timeout: 5
      retries: 3
      success_threshold: 2

  # Laravel Horizon example
  horizon:
    enabled: true
    command: ["php", "artisan", "horizon"]
    health_check:
      type: exec
      command: ["php", "artisan", "horizon:status"]
      interval: 30
      timeout: 10
      retries: 2

Database Services

processes:
  redis:
    enabled: true
    command: ["redis-server"]
    health_check:
      type: tcp
      address: "127.0.0.1:6379"
      interval: 5
      timeout: 2

  mysql:
    enabled: true
    command: ["mysqld"]
    health_check:
      type: tcp
      address: "127.0.0.1:3306"
      interval: 10
      timeout: 5

Queue Workers

processes:
  queue-default:
    enabled: true
    command: ["php", "artisan", "queue:work", "--tries=3"]
    scale: 3
    health_check:
      type: exec
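      # pgrep -f matches any process whose command line contains "queue:work",
      # so with scale: 3 this check passes while at least one worker is running
      command: ["pgrep", "-f", "queue:work"]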
      interval: 30
      timeout: 5
      retries: 3
      success_threshold: 1

Next Steps

Configuration Reference - Complete configuration options
Restart Policies - Process restart strategies
Prometheus Metrics - Metrics and monitoring
Examples - Real-world configurations