Distributed Tracing
Cbox Init supports distributed tracing using OpenTelemetry for deep observability into process lifecycle operations. Integrate with Jaeger, Grafana Tempo, Honeycomb, or any OpenTelemetry-compatible backend.
Overview
Features:
- 🔭 OpenTelemetry Protocol (OTLP) - Industry-standard tracing
- 📡 gRPC Export - Efficient binary protocol
- 🔍 Process Lifecycle Spans - Trace start, stop, restart operations
- 🎯 Contextual Attributes - Process names, instance IDs, errors
- 📊 Sampling Control - Configurable sample rates
- 🚀 Low Overhead - <1ms per span with batched export
Quick Start
Enable tracing:
global:
  # Enable distributed tracing
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: localhost:4317
  tracing_sample_rate: 1.0
  tracing_service_name: cbox-init
With Jaeger:
# Start Jaeger all-in-one
docker run -d --name jaeger \
-p 6831:6831/udp \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:latest
# Configure Cbox Init
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: localhost:4317

# View traces: http://localhost:16686
Configuration
Tracing Settings
global:
  # Enable/disable distributed tracing
  tracing_enabled: true             # Default: false

  # Exporter type
  tracing_exporter: otlp-grpc       # Options: otlp-grpc, stdout

  # Exporter endpoint
  tracing_endpoint: localhost:4317  # Default depends on exporter

  # Sampling rate (0.0-1.0)
  tracing_sample_rate: 1.0          # Default: 1.0 (100%)

  # Service name in traces
  tracing_service_name: cbox-init   # Default: cbox-init
Exporter Types
1. OTLP gRPC (Production)
Best for: Production deployments with Jaeger, Grafana Tempo, etc.
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: tempo:4317      # Grafana Tempo
  # OR
  tracing_endpoint: jaeger:4317     # Jaeger
  # OR
  tracing_endpoint: collector:4317  # OpenTelemetry Collector
Advantages:
- ✅ Efficient binary protocol (Protocol Buffers)
- ✅ Batched export for performance
- ✅ Industry-standard compatibility
- ✅ Low overhead (~0.5ms per span)
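This setting maps naturally onto an OpenTelemetry SDK tracer provider with a batch span processor. The sketch below (Go, using the OpenTelemetry Go SDK) shows how such a pipeline is typically assembled; the package layout, function name, and option values are illustrative assumptions, not Cbox Init's actual internals:
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Setup builds a tracer provider that samples, batches, and exports spans
// over OTLP gRPC. The returned function flushes and shuts down the provider.
func Setup(ctx context.Context, endpoint, serviceName string, sampleRate float64) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "tempo:4317" (host:port, no scheme)
		otlptracegrpc.WithInsecure(),         // plaintext; TLS support is listed as a future enhancement
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // batched export keeps per-span overhead low
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(sampleRate))),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName), // appears as the service name in Jaeger/Tempo
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
The batcher is what keeps per-span cost low: spans are queued in memory and flushed in batches rather than exported one at a time.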
2. Stdout (Development/Debugging)
Best for: Local development and debugging
global:
  tracing_enabled: true
  tracing_exporter: stdout
  tracing_sample_rate: 1.0
Output format:
{
  "Name": "process_manager.start_process",
  "SpanContext": {
    "TraceID": "4bf92f3577b34da6a3ce929d0e0e4736",
    "SpanID": "00f067aa0ba902b7",
    "TraceFlags": "01"
  },
  "Parent": {
    "TraceID": "4bf92f3577b34da6a3ce929d0e0e4736",
    "SpanID": "b5b4e0c8e1a2b3d4"
  },
  "SpanKind": "Internal",
  "StartTime": "2025-11-23T10:30:15.123456Z",
  "EndTime": "2025-11-23T10:30:15.234567Z",
  "Attributes": [
    {"Key": "process.name", "Value": {"Type": "STRING", "Value": "php-fpm"}},
    {"Key": "process.scale", "Value": {"Type": "INT64", "Value": 2}}
  ],
  "Status": {"Code": "Ok"}
}
Advantages:
- ✅ No external dependencies
- ✅ Immediate visibility
- ✅ Pretty-printed JSON
- ❌ Not suitable for production (high overhead)
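The output above closely resembles what the OpenTelemetry Go SDK's stdouttrace exporter emits with pretty-printing enabled. A minimal sketch of wiring that exporter, assuming a Go implementation (illustrative, not Cbox Init's actual code):
package tracing

import (
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newStdoutProvider prints every finished span as indented JSON on stdout.
// Useful for local debugging; the per-span serialization cost is why this
// mode is not recommended for production.
func newStdoutProvider() (*sdktrace.TracerProvider, error) {
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		return nil, err
	}
	// A synchronous span processor is fine for debugging output.
	return sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter)), nil
}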
Sampling Rates
Control what percentage of operations are traced:
global:
  # Choose one, depending on environment:
  tracing_sample_rate: 1.0   # 100% - trace everything (development)
  tracing_sample_rate: 0.1   # 10% - production sampling
  tracing_sample_rate: 0.01  # 1% - high-traffic production
Guidelines:
- Development: 1.0 (100%) - trace all operations
- Staging: 0.5 (50%) - balance coverage and overhead
- Production (low traffic): 0.1 (10%) - sufficient sampling
- Production (high traffic): 0.01 (1%) - minimize overhead
Instrumented Operations
Cbox Init automatically creates spans for key process lifecycle operations:
Process Manager Operations
1. process_manager.start (Root Span)
Triggered: Overall Cbox Init startup
Attributes:
- process.count - Number of processes being started
Example:
process_manager.start [123.45ms]
├─ process_manager.start_process (php-fpm) [45.23ms]
├─ process_manager.start_process (nginx) [12.34ms]
└─ process_manager.start_process (horizon) [65.88ms]
2. process_manager.start_process (Child Span)
Triggered: Starting individual process
Attributes:
- process.name - Process name (e.g., "php-fpm")
- process.scale - Number of instances
- error - Error message (if the start failed)
Example span:
{
  "name": "process_manager.start_process",
  "attributes": {
    "process.name": "php-fpm",
    "process.scale": 2
  },
  "duration_ms": 45.23,
  "status": "OK"
}
3. process_manager.shutdown (Root Span)
Triggered: Graceful shutdown initiated
Attributes:
- process.count - Number of processes being stopped
- shutdown.reason - Why shutdown was triggered (e.g., "SIGTERM", "user request")
Example:
process_manager.shutdown [3.5s]
├─ stop horizon [1.2s]
├─ stop queue-workers [0.8s]
├─ stop nginx [0.5s]
└─ stop php-fpm [1.0s]
Span Hierarchy
Typical trace structure:
process_manager.start (root)
│
├─ process_manager.start_process (php-fpm)
│ └─ [attributes: name=php-fpm, scale=2]
│
├─ process_manager.start_process (nginx)
│ └─ [attributes: name=nginx, scale=1, depends_on=[php-fpm]]
│
└─ process_manager.start_process (horizon)
└─ [attributes: name=horizon, scale=1]
On error:
process_manager.start (root)
│
└─ process_manager.start_process (queue-worker)
└─ [attributes: name=queue-worker, error="command not found: nonexistent"]
└─ [status: ERROR]
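This hierarchy is what standard OpenTelemetry instrumentation produces when each process start is traced from the root span's context. The sketch below, assuming a Go implementation, illustrates the pattern; the type and method names (ProcessManager, spawn, StartAll) and the tracer name are placeholders, not Cbox Init's actual code:
package manager

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// ProcessConfig and ProcessManager are placeholders for illustration.
type ProcessConfig struct {
	Name  string
	Scale int
}

type ProcessManager struct {
	processes []ProcessConfig
}

// spawn is a stand-in for the real process launch logic.
func (pm *ProcessManager) spawn(ctx context.Context, p ProcessConfig) error { return nil }

// StartAll produces the process_manager.start root span with one
// process_manager.start_process child span per process.
func (pm *ProcessManager) StartAll(ctx context.Context) error {
	tracer := otel.Tracer("cbox-init")

	ctx, root := tracer.Start(ctx, "process_manager.start")
	defer root.End()
	root.SetAttributes(attribute.Int("process.count", len(pm.processes)))

	for _, p := range pm.processes {
		// Starting from the root span's context makes this a child span.
		childCtx, span := tracer.Start(ctx, "process_manager.start_process")
		span.SetAttributes(
			attribute.String("process.name", p.Name),
			attribute.Int("process.scale", p.Scale),
		)

		if err := pm.spawn(childCtx, p); err != nil {
			// Failed starts carry the error attribute and an ERROR status.
			span.SetAttributes(attribute.String("error", err.Error()))
			span.SetStatus(codes.Error, err.Error())
			span.End()
			return err
		}
		span.SetStatus(codes.Ok, "")
		span.End()
	}
	return nil
}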
Integration Examples
Jaeger
Deploy Jaeger:
# Docker (all-in-one)
docker run -d \
--name jaeger \
-p 6831:6831/udp \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Configure Cbox Init:
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: localhost:4317
  tracing_sample_rate: 1.0
  tracing_service_name: cbox-init-production
View traces:
- Open: http://localhost:16686
- Service: cbox-init-production
- Operations: process_manager.start, process_manager.start_process
Grafana Tempo
Deploy Tempo:
# docker-compose.yml
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "4317:4317"  # OTLP gRPC
      - "3200:3200"  # Tempo HTTP
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
Configure Cbox Init:
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: tempo:4317
  tracing_sample_rate: 0.1
  tracing_service_name: cbox-init
Query in Grafana:
- Add Tempo data source
- Explore → Tempo
- Search for service: cbox-init
Honeycomb
Configure Cbox Init:
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: api.honeycomb.io:443
  tracing_sample_rate: 1.0
  tracing_service_name: cbox-init
Set API key:
# Via environment variable
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"
# Run Cbox Init
./cbox-init
OpenTelemetry Collector
Deploy Collector:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  # Both Jaeger and Tempo accept OTLP gRPC, so named otlp exporters are used
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/tempo]
Configure Cbox Init:
global:
  tracing_enabled: true
  tracing_exporter: otlp-grpc
  tracing_endpoint: otel-collector:4317
  tracing_sample_rate: 1.0
Docker Compose Example
Complete stack with Jaeger:
version: '3.8'

services:
  app:
    build: .
    environment:
      - CBOX_INIT_GLOBAL_TRACING_ENABLED=true
      - CBOX_INIT_GLOBAL_TRACING_EXPORTER=otlp-grpc
      - CBOX_INIT_GLOBAL_TRACING_ENDPOINT=jaeger:4317
      - CBOX_INIT_GLOBAL_TRACING_SAMPLE_RATE=1.0
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
    environment:
      - COLLECTOR_OTLP_ENABLED=true
Access Jaeger UI: http://localhost:16686
Use Cases
1. Startup Performance Analysis
Trace process startup times:
- Enable tracing with 100% sampling
- Start Cbox Init
- Query for the process_manager.start span
- View child spans for each process
- Identify slow-starting processes
Example insights:
- php-fpm starts in 45ms
- nginx starts in 12ms
- horizon starts in 66ms (slow - investigate)
2. Dependency Chain Visualization
Understand startup order:
- View the process_manager.start trace
- Observe that the span hierarchy matches depends_on
- Verify the correct startup sequence
- Identify parallelization opportunities
3. Error Debugging
Trace failed process starts:
- Process fails to start
- Query for spans with status=ERROR
- View the error attribute for the failure reason
- Correlate with logs using timestamps
Example error span:
{
  "name": "process_manager.start_process",
  "attributes": {
    "process.name": "queue-worker",
    "process.scale": 3,
    "error": "exec: \"nonexistent\": executable file not found in $PATH"
  },
  "status": "ERROR"
}
4. Shutdown Analysis
Trace graceful shutdown:
- Trigger shutdown (SIGTERM)
- Query for the process_manager.shutdown span
- View shutdown duration per process
- Identify processes slow to terminate
- Adjust timeouts if needed
Performance Impact
Overhead Measurements
With OTLP gRPC exporter:
- Span creation: ~0.1ms
- Span export (batched): ~0.5ms per batch
- Total overhead: <1ms per operation
- Negligible for most workloads
With stdout exporter:
- Span creation: ~0.1ms
- Pretty-print JSON: ~5-10ms per span
- Total overhead: ~10ms per operation
- Not recommended for production
Sampling Strategy
Adaptive sampling:
# Development: Trace everything
tracing_sample_rate: 1.0
# Production (low traffic): 10% sampling
tracing_sample_rate: 0.1
# Production (high traffic): 1% sampling
tracing_sample_rate: 0.01
Cost-benefit analysis:
- 100% sampling: Full visibility, higher overhead
- 10% sampling: Good balance for most workloads
- 1% sampling: Minimal overhead, sufficient for trends
Troubleshooting
Traces Not Appearing
Issue: No traces in backend (Jaeger, Tempo, etc.)
Solutions:
- Verify tracing is enabled:
  global:
    tracing_enabled: true
- Check endpoint connectivity:
  # Test connection to Jaeger
  telnet jaeger 4317
  # Test connection to Tempo
  telnet tempo 4317
- Check the sampling rate:
  global:
    tracing_sample_rate: 1.0  # Ensure not 0.0
- Use the stdout exporter for debugging:
  global:
    tracing_exporter: stdout  # Verify spans are created
Connection Refused
Issue: connection refused error in logs
Solutions:
- Verify the endpoint format:
  # Correct
  tracing_endpoint: localhost:4317
  # Wrong - don't include the protocol
  tracing_endpoint: http://localhost:4317
- Check that the backend is running:
  docker ps | grep jaeger
  # OR
  docker ps | grep tempo
- Verify the port mapping:
  # docker-compose.yml
  services:
    jaeger:
      ports:
        - "4317:4317"  # Must match the tracing_endpoint port
High Overhead
Issue: Performance degradation with tracing enabled
Solutions:
- Reduce the sampling rate:
  global:
    tracing_sample_rate: 0.1  # Reduce from 1.0
- Switch from stdout to OTLP:
  global:
    tracing_exporter: otlp-grpc  # Much more efficient than stdout
- Disable tracing if it is not needed:
  global:
    tracing_enabled: false
Incomplete Traces
Issue: Missing child spans or broken traces
Cause: Sampling decision made at root span
Solution:
- OpenTelemetry samples at trace level (root span)
- If root is sampled, all children are included
- If root is not sampled, entire trace is dropped
- This is expected behavior for any sampling rate below 100% (see the sampler sketch below)
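The tracing_sample_rate setting presumably maps onto a parent-based, ratio sampler along these lines (a minimal Go sketch, not Cbox Init's actual code):
package tracing

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// newSampler applies the ratio only to root spans; child spans inherit
// their parent's decision, which is why a trace is either recorded in
// full or dropped in full.
func newSampler(rate float64) sdktrace.Sampler {
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(rate))
}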
Best Practices
1. Use OTLP gRPC in Production
# ✅ Production
global:
  tracing_exporter: otlp-grpc
  tracing_endpoint: tempo:4317

# ❌ Not for production
global:
  tracing_exporter: stdout  # High overhead
2. Adjust Sampling by Environment
# Development
tracing_sample_rate: 1.0
# Staging
tracing_sample_rate: 0.5
# Production (low traffic)
tracing_sample_rate: 0.1
# Production (high traffic)
tracing_sample_rate: 0.01
3. Use Meaningful Service Names
# ✅ Good (includes environment)
global:
  tracing_service_name: cbox-init-production

# ❌ Generic
global:
  tracing_service_name: cbox-init
4. Combine with Metrics
Correlate traces with metrics:
- Use traces for deep debugging
- Use metrics for trends and alerts
- Both provide complementary insights
5. Set Up Alerts
Alert on high error rates:
# Prometheus alert
- alert: HighTraceErrorRate
  expr: rate(trace_errors_total[5m]) > 0.1
Future Enhancements
Planned features:
- TLS support for OTLP gRPC
- Additional exporters (Jaeger native, Zipkin)
- HTTP health check span instrumentation
- Custom span events for process state changes
- Trace context propagation to child processes
Next Steps
- Prometheus Metrics - Complementary observability
- Resource Monitoring - Resource usage tracking
- Management API - Runtime process control