Health Checks Configuration

Configure health monitoring for your processes to ensure reliability and enable proper dependency management.

Overview

Health checks monitor process health and enable:

✅ Automatic restart on failure
✅ Dependency waiting (processes wait for healthy dependencies)
✅ Health status reporting via metrics and API
✅ Graceful degradation patterns

Basic Configuration

processes:
  nginx:
    command: ["nginx", "-g", "daemon off;"]
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"
      interval: 10
      timeout: 5
      retries: 3
      success_threshold: 2

Health Check Types

HTTP Health Check

health_check:
  type: http
  address: "http://127.0.0.1:80/health"
  interval: 10
  timeout: 5
  retries: 3
  success_threshold: 2
  expected_status: 200
  expected_body: "OK"

Settings:

address - Full HTTP URL to check
expected_status - Expected HTTP status code (default: 200)
expected_body - Optional body content match

Best for:

Web servers (Nginx, Apache)
HTTP APIs
Services with health endpoints

Example Endpoint:

// Simple health endpoint (any PHP framework)
<?php
http_response_code(200);
header('Content-Type: application/json');
echo json_encode(['status' => 'healthy']);

TCP Health Check

health_check:
  type: tcp
  address: "127.0.0.1:9000"
  interval: 10
  timeout: 3
  retries: 3

Settings:

address - TCP address in format host:port

Best for:

PHP-FPM (port 9000)
Redis (port 6379)
MySQL (port 3306)
Services that listen on TCP ports

Example:

processes:
  php-fpm:
    command: ["php-fpm", "-F", "-R"]
    health_check:
      type: tcp
      address: "127.0.0.1:9000"
      interval: 5

Exec Health Check

health_check:
  type: exec
  command: ["php", "artisan", "health:check"]
  interval: 30
  timeout: 10
  retries: 2

Settings:

command - Command to execute (array format)
Process is healthy if exit code is 0

Best for:

Custom health logic
Database connectivity checks
Application-specific validation
Multi-service health aggregation

Example Health Check Script:

<?php
// health-check.php - works with any PHP framework/app

// Check database (PDO)
try {
    $pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');
    $pdo->query('SELECT 1');
} catch (PDOException $e) {
    echo "Database connection failed\n";
    exit(1);
}

// Check Redis (if using)
$redis = new Redis();
if (!$redis->connect('localhost')) {
    echo "Redis connection failed\n";
    exit(1);
}

echo "All systems healthy\n";
exit(0);  // Success

Common Settings

interval

Type: integer (seconds) Default: 10 Description: Time between health checks.

health_check:
  interval: 30  # Check every 30 seconds

Recommendations:

Critical services: 5-10 seconds
Standard services: 10-30 seconds
Heavy checks: 30-60 seconds
Exec commands: 30-120 seconds (depending on execution time)

timeout

Type: integer (seconds) Default: 5 Description: Maximum time to wait for health check response.

health_check:
  timeout: 10  # Wait up to 10 seconds

Guidelines:

Should be less than interval
HTTP checks: 3-10 seconds
TCP checks: 1-5 seconds
Exec checks: 5-30 seconds (match command execution time)

retries

Type: integer Default: 3 Description: Number of consecutive failures before marking unhealthy.

health_check:
  retries: 5  # Allow 5 failures before marking unhealthy

Recommendations:

Stable services: 2-3 retries
Flaky services: 5-10 retries
Critical services: 1-2 retries (fail fast)

success_threshold

Type: integer Default: 1 Description: Number of consecutive successes before marking healthy.

health_check:
  success_threshold: 3  # Require 3 successes to become healthy

Use cases:

Services with slow startup: Require multiple successes (2-5)
Fast services: Single success (1)
Flaky services: Require sustained success (3-5)

Health Check Lifecycle

[Process Starts]
      ↓
  [Starting] ← Health checks not yet running
      ↓
  [Healthy] ← success_threshold consecutive successes
      ↓
  [Unhealthy] ← retries consecutive failures
      ↓
  [Restart] ← If restart policy allows

State Transitions:

Process starts → Starting state
After success_threshold successes → Healthy
After retries failures → Unhealthy
Unhealthy process restarts (if restart: always)

Advanced Patterns

Dependency Waiting

processes:
  php-fpm:
    health_check:
      type: tcp
      address: "127.0.0.1:9000"

  nginx:
    depends_on: [php-fpm]  # Waits for PHP-FPM to be healthy
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"

Behavior:

Nginx waits for PHP-FPM to reach Healthy state
If PHP-FPM becomes unhealthy, Nginx continues running
On restart, Nginx waits again for PHP-FPM

Multi-Layer Health Checks

processes:
  app:
    command: ["./my-app"]
    health_check:
      type: exec
      command: ["/health-check.sh"]
      interval: 30

health-check.sh:

#!/bin/bash
set -e

# Check HTTP endpoint
curl -f http://localhost:9180/health || exit 1

# Check database
psql -U user -d db -c "SELECT 1" || exit 1

# Check disk space
df -h / | awk 'NR==2 {if ($5+0 > 90) exit 1}'

echo "Health check passed"
exit 0

Graceful Degradation

processes:
  primary-service:
    command: ["./primary"]
    health_check:
      type: http
      address: "http://127.0.0.1:8080/health"
      retries: 10  # Tolerate more failures
      success_threshold: 3  # Require sustained success

  fallback-service:
    enabled: false  # Enable manually when primary fails
    command: ["./fallback"]

Complete Examples

PHP Application

processes:
  # PHP-FPM with TCP check
  php-fpm:
    command: ["php-fpm", "-F", "-R"]
    health_check:
      type: tcp
      address: "127.0.0.1:9000"
      interval: 5
      timeout: 2
      retries: 3

  # Nginx with HTTP check
  nginx:
    command: ["nginx", "-g", "daemon off;"]
    depends_on: [php-fpm]
    health_check:
      type: http
      address: "http://127.0.0.1:80/health"
      interval: 10
      timeout: 5
      retries: 3
      expected_status: 200

  # Queue worker with exec check (Laravel example)
  queue-worker:
    command: ["php", "artisan", "queue:work"]
    health_check:
      type: exec
      command: ["pgrep", "-f", "queue:work"]
      interval: 60
      timeout: 10
      retries: 2

Microservices Stack

processes:
  database:
    command: ["postgres"]
    health_check:
      type: tcp
      address: "127.0.0.1:5432"
      interval: 5

  cache:
    command: ["redis-server"]
    health_check:
      type: tcp
      address: "127.0.0.1:6379"
      interval: 5

  api:
    command: ["./api-server"]
    depends_on: [database, cache]
    health_check:
      type: http
      address: "http://127.0.0.1:8080/health"
      interval: 10
      expected_body: '{"status":"ok"}'

  worker:
    command: ["./worker"]
    depends_on: [database, cache]
    health_check:
      type: exec
      command: ["./worker-health-check"]
      interval: 30

Troubleshooting

Health Check Always Failing

# Increase timeout
health_check:
  timeout: 15  # Was 5, too short

# Allow more retries
health_check:
  retries: 5  # Was 3

# Check less frequently
health_check:
  interval: 30  # Was 10

Health Check Too Sensitive

# Require sustained success
health_check:
  success_threshold: 5  # Was 1

# Tolerate transient failures
health_check:
  retries: 10  # Was 3

Slow Startup Detection

# Wait longer for initial health
health_check:
  success_threshold: 3  # Require 3 successes
  interval: 10
  timeout: 10
  # First success after 10s, healthy after 30s (3 × 10s)

Monitoring Health Status

Via Metrics

# Prometheus metrics
curl http://localhost:9090/metrics | grep health

# Example output
cbox_init_process_health_status{process="nginx"} 1  # 1 = healthy

Via Management API

# Get process status
curl http://localhost:9180/api/v1/processes | jq '.[] | {name, health_status}'

# Example output
{
  "name": "nginx",
  "health_status": "healthy"
}

Health Checks Configuration

Health Checks Configuration

Overview

Basic Configuration

Health Check Types

HTTP Health Check

TCP Health Check

Exec Health Check

Common Settings

interval

timeout

retries

success_threshold

Health Check Lifecycle

Advanced Patterns

Dependency Waiting

Multi-Layer Health Checks

Graceful Degradation

Complete Examples

PHP Application

Microservices Stack

Troubleshooting

Health Check Always Failing

Health Check Too Sensitive

Slow Startup Detection

Monitoring Health Status

Via Metrics

Via Management API

See Also