Resource Monitoring
Resource Monitoring
Cbox Init tracks detailed resource usage (CPU, memory, threads, file descriptors) for all managed process instances. Historical data is stored in time series ring buffers and exposed via both REST API and Prometheus metrics.
Overview
Features:
- 📊 Per-instance resource tracking
- ⏱️ Configurable collection interval
- 💾 Time series ring buffer with history
- 🔌 REST API for querying historical data
- 📈 Prometheus exposition for dashboards
- 🎯 Zero overhead when disabled
Quick Start
Enable resource monitoring:
global:
# Enable resource tracking
resource_metrics_enabled: true
resource_metrics_interval: 5 # Collection interval in seconds
resource_metrics_max_samples: 720 # Max samples per instance (1 hour at 5s)
# Enable Prometheus export (optional)
metrics_enabled: true
metrics_port: 9090
Query via REST API:
# Get last 20 samples for php-fpm instance 0
curl "http://localhost:9180/api/v1/metrics/history?process=php-fpm&instance=php-fpm-0&limit=20"
Query via Prometheus:
# View all resource metrics
curl http://localhost:9090/metrics | grep cbox_init_process
Configuration
Resource Metrics Settings
global:
# Enable/disable resource tracking
resource_metrics_enabled: true # Default: false
# Collection interval in seconds
resource_metrics_interval: 5 # Default: 5, Range: 1-300
# Maximum samples per instance (ring buffer size)
resource_metrics_max_samples: 720 # Default: 720 (1 hour at 5s interval)
# Enable Prometheus HTTP server
metrics_enabled: true # Default: false
metrics_port: 9090 # Default: 9090
metrics_path: /metrics # Default: /metrics
Memory Usage Calculation
Per-instance ring buffer memory:
Sample size: ~96 bytes
Total memory = sample_size × max_samples × instance_count
Example:
720 samples × 96 bytes × 10 instances = ~691 KB
Recommended limits:
- Development: 360 samples (30 min at 5s interval) = ~346 KB per 10 instances
- Production: 720 samples (1 hour at 5s interval) = ~691 KB per 10 instances
- High-scale: 1440 samples (2 hours at 5s interval) = ~1.38 MB per 10 instances
Collected Metrics
CPU Usage
cpu_percent - CPU usage percentage
- Unit: Percent (0-100 per core)
- Meaning: Total CPU time consumed by process
- Can exceed 100%: Yes (multi-threaded processes on multi-core systems)
- Example: 250% = using 2.5 cores fully
Memory Usage
memory_rss_bytes - Resident Set Size (Physical Memory)
- Unit: Bytes
- Meaning: Physical RAM currently used by process
- Includes: Code, data, shared libraries actively in memory
- Most useful for: Detecting memory leaks
memory_vms_bytes - Virtual Memory Size
- Unit: Bytes
- Meaning: Total virtual memory allocated (may not be resident)
- Includes: All memory mappings, swap, etc.
- Note: Can be much larger than RSS
memory_percent - Memory as % of Total System Memory
- Unit: Percent (0-100)
- Meaning: RSS as percentage of total system RAM
- Useful for: Capacity planning
Threads
threads - Number of Threads
- Unit: Count
- Meaning: Total threads (or processes if not multi-threaded)
- For PHP-FPM: Shows number of child processes
- For multi-threaded apps: Shows all threads
File Descriptors
file_descriptors - Open File Descriptors
- Unit: Count
- Meaning: Number of open files, sockets, pipes
- Platform: Linux only (not available on macOS)
- Useful for: Detecting fd leaks
REST API
Query Historical Metrics
Endpoint: GET /api/v1/metrics/history
Parameters:
| Parameter | Required | Description | Default |
|---|---|---|---|
process |
✅ Yes | Process name | - |
instance |
✅ Yes | Instance ID (e.g., php-fpm-0) |
- |
since |
No | Start time (RFC3339 or Unix timestamp) | 1 hour ago |
limit |
No | Max samples to return (1-10000) | 100 |
Example requests:
# Get last 20 samples
curl "http://localhost:9180/api/v1/metrics/history?process=php-fpm&instance=php-fpm-0&limit=20"
# Get samples since specific time (RFC3339)
curl "http://localhost:9180/api/v1/metrics/history?process=nginx&instance=nginx-0&since=2025-11-23T08:00:00Z&limit=50"
# Get samples since Unix timestamp
curl "http://localhost:9180/api/v1/metrics/history?process=horizon&instance=horizon-0&since=1732348800&limit=100"
# With authentication
curl -H "Authorization: Bearer your-token" \
"http://localhost:9180/api/v1/metrics/history?process=queue-worker&instance=queue-worker-0&limit=10"
Response Format
{
"process": "php-fpm",
"instance": "php-fpm-0",
"since": "2025-11-23T08:00:00Z",
"limit": 20,
"samples": 20,
"data": [
{
"timestamp": "2025-11-23T09:27:02.263Z",
"cpu_percent": 12.5,
"memory_rss_bytes": 134217728,
"memory_vms_bytes": 445747003392,
"memory_percent": 1.95,
"threads": 8,
"file_descriptors": 42
},
{
"timestamp": "2025-11-23T09:27:07.315Z",
"cpu_percent": 15.2,
"memory_rss_bytes": 135266304,
"memory_vms_bytes": 445747003392,
"memory_percent": 1.97,
"threads": 8,
"file_descriptors": 43
}
]
}
Field descriptions:
process- Process name from configurationinstance- Instance ID (format:{process}-{index})since- Start time for query (RFC3339 format)limit- Maximum samples requestedsamples- Actual number of samples returneddata- Array of metric samples (chronological order)
Error Responses
404 Not Found:
{
"error": "Process 'unknown' not found"
}
400 Bad Request:
{
"error": "Invalid 'since' parameter: must be RFC3339 or Unix timestamp"
}
400 Bad Request:
{
"error": "Invalid 'limit': must be between 1 and 10000"
}
Prometheus Metrics
When both resource_metrics_enabled and metrics_enabled are true, resource metrics are exposed on the Prometheus endpoint.
Metric Names
All metrics use the prefix cbox_init_process_ and include labels:
process- Process nameinstance- Instance ID
Available metrics:
# CPU usage (percentage, can exceed 100%)
cbox_init_process_cpu_percent{process="php-fpm", instance="php-fpm-0"}
# Memory RSS in bytes
cbox_init_process_memory_bytes{process="php-fpm", instance="php-fpm-0", type="rss"}
# Memory VMS in bytes
cbox_init_process_memory_bytes{process="php-fpm", instance="php-fpm-0", type="vms"}
# Memory as % of total system RAM
cbox_init_process_memory_percent{process="php-fpm", instance="php-fpm-0"}
# Number of threads/processes
cbox_init_process_threads{process="php-fpm", instance="php-fpm-0"}
# Open file descriptors (Linux only)
cbox_init_process_file_descriptors{process="php-fpm", instance="php-fpm-0"}
Collection Metadata
Error tracking:
# Total collection errors per instance
cbox_init_resource_collection_errors_total{process="...", instance="..."}
Performance:
# Collection duration in seconds
cbox_init_resource_collection_duration_seconds
Query Examples
Average CPU across all PHP-FPM instances:
avg(cbox_init_process_cpu_percent{process="php-fpm"})
Memory usage trend for specific instance:
cbox_init_process_memory_bytes{process="nginx", instance="nginx-0", type="rss"}
Total threads across all processes:
sum(cbox_init_process_threads)
Top 5 processes by memory:
topk(5, cbox_init_process_memory_bytes{type="rss"})
Memory usage rate of change (MB/hour):
rate(cbox_init_process_memory_bytes{type="rss"}[1h]) * 3600 / 1024 / 1024
Grafana Integration
Dashboard Setup
1. Add Prometheus Data Source:
- URL:
http://localhost:9090 - Scrape interval: Match
resource_metrics_interval(e.g., 5s)
2. Create Dashboard Panels:
CPU Usage Panel
{
"title": "CPU Usage by Process",
"targets": [
{
"expr": "cbox_init_process_cpu_percent",
"legendFormat": "{{process}}-{{instance}}"
}
],
"yaxis": {
"label": "CPU %",
"format": "percent"
}
}
Memory Usage Panel
{
"title": "Memory Usage (RSS)",
"targets": [
{
"expr": "cbox_init_process_memory_bytes{type=\"rss\"}",
"legendFormat": "{{process}}-{{instance}}"
}
],
"yaxis": {
"label": "Memory",
"format": "bytes"
}
}
Thread Count Panel
{
"title": "Thread Count",
"targets": [
{
"expr": "cbox_init_process_threads",
"legendFormat": "{{process}}-{{instance}}"
}
]
}
File Descriptors Panel
{
"title": "Open File Descriptors",
"targets": [
{
"expr": "cbox_init_process_file_descriptors",
"legendFormat": "{{process}}-{{instance}}"
}
]
}
Alert Rules
High CPU usage:
- alert: HighCPUUsage
expr: cbox_init_process_cpu_percent > 80
for: 5m
annotations:
summary: "High CPU usage detected"
description: "{{$labels.process}}-{{$labels.instance}} using {{$value}}% CPU"
Memory leak detection:
- alert: MemoryLeak
expr: rate(cbox_init_process_memory_bytes{type="rss"}[1h]) > 10485760 # 10MB/hour
for: 3h
annotations:
summary: "Potential memory leak"
description: "{{$labels.process}}-{{$labels.instance}} growing {{$value | humanize}}B/s"
File descriptor exhaustion:
- alert: FileDescriptorLeak
expr: cbox_init_process_file_descriptors > 1000
for: 5m
annotations:
summary: "High file descriptor count"
description: "{{$labels.process}}-{{$labels.instance}} has {{$value}} open FDs"
Performance Considerations
CPU Overhead
Collection cost: ~1ms per process instance per collection cycle
Example:
- 10 instances × 1ms = 10ms per collection
- At 5s interval = 10ms / 5000ms = 0.2% CPU overhead
- Negligible for most workloads
Memory Usage
Ring buffer memory:
- Each sample: ~96 bytes
- Per instance:
96 bytes × max_samples - Total:
96 bytes × max_samples × instance_count
Examples:
- 720 samples × 10 instances = ~691 KB
- 1440 samples × 20 instances = ~2.7 MB
- 360 samples × 50 instances = ~1.7 MB
Recommended Intervals
| Environment | Interval | Max Samples | History | Overhead |
|---|---|---|---|---|
| Development | 2-5s | 360 | 30 min | Very low |
| Production | 10-30s | 720 | 2-6 hours | Low |
| High-Scale | 60s+ | 1440 | 24 hours | Minimal |
Tuning guidelines:
- Lower interval = more granular data, higher overhead
- Higher interval = less overhead, coarser granularity
- Adjust
max_samplesto control history retention
Troubleshooting
Metrics Not Collected
Issue: Resource metrics not appearing in API or Prometheus
Solutions:
-
Check configuration:
global: resource_metrics_enabled: true # Must be true metrics_enabled: true # For Prometheus -
Verify processes are running:
curl http://localhost:9180/api/v1/processes -
Check for collection errors:
# View Cbox logs journalctl -u cbox-init -f | grep "resource collection"
Missing Prometheus Metrics
Issue: cbox_init_process_* metrics not in Prometheus
Solutions:
-
Ensure both flags enabled:
global: resource_metrics_enabled: true # Collect metrics metrics_enabled: true # Expose via HTTP -
Check Prometheus scrape config:
scrape_configs: - job_name: 'cbox-init' static_configs: - targets: ['localhost:9090'] -
Test endpoint manually:
curl http://localhost:9090/metrics | grep cbox_init_process
File Descriptors Always Zero
Issue: file_descriptors metric is always 0
Reason: File descriptor tracking only available on Linux
macOS behavior:
- CPU, memory, threads: ✅ Available
- File descriptors: ❌ Not available (OS limitation)
High Memory Usage
Issue: Cbox Init using more memory than expected
Solutions:
-
Reduce max samples:
global: resource_metrics_max_samples: 360 # Reduce from 720 -
Increase interval:
global: resource_metrics_interval: 15 # Increase from 5s -
Calculate expected memory:
Memory = 96 bytes × max_samples × instance_count
Collection Errors in Logs
Issue: Failed to collect resource metrics in logs
Causes:
- Process PID no longer exists
- Permission denied (rare)
- System resource temporarily unavailable
Solution:
- Errors are logged but don't crash processes
- Graceful degradation - collection skipped for that cycle
- Check if processes are healthy with health checks
Use Cases
Memory Leak Detection
Monitor memory growth over time:
# Query last hour of data
curl "http://localhost:9180/api/v1/metrics/history?process=horizon&instance=horizon-0&limit=720" \
| jq '.data[] | {timestamp, memory_rss_mb: (.memory_rss_bytes / 1024 / 1024)}'
Grafana alert:
- alert: MemoryGrowth
expr: rate(cbox_init_process_memory_bytes{type="rss"}[6h]) > 1048576 # 1MB/hour
for: 12h
Performance Profiling
Identify CPU-intensive instances:
# Top 3 CPU consumers
topk(3, avg_over_time(cbox_init_process_cpu_percent[5m]))
Capacity Planning
Aggregate resource usage:
# Total memory across all instances
sum(cbox_init_process_memory_bytes{type="rss"})
# Average CPU across all processes
avg(cbox_init_process_cpu_percent)
Thread Monitoring
Detect thread pool exhaustion:
# PHP-FPM child process count
cbox_init_process_threads{process="php-fpm"}
Alert on low spare workers:
- alert: LowSpareWorkers
expr: cbox_init_process_threads{process="php-fpm"} > 50 # Max children threshold
for: 5m
Next Steps
- Prometheus Metrics - Complete metrics documentation
- Management API - REST API reference
- Examples - Real-world monitoring configuration