Backlog Drain
SLA-focused algorithm for preventing service level agreement breaches.
Overview
The backlog drain algorithm provides SLA protection by:
- Monitoring oldest job age
- Detecting imminent SLA breaches
- Aggressively scaling to meet deadlines
- Prioritizing SLA compliance over cost
Goal: Ensure no job waits longer than the configured max_pickup_time_seconds.
Mathematical Foundation
Time-to-Breach Calculation
Calculate how much time remains before SLA breach:
Time to Breach = SLA Target - Oldest Job Age
If Time to Breach ≤ 0: Already breaching!
If Time to Breach ≤ Buffer: Imminent breach!
Required Throughput
Calculate the throughput needed to clear the backlog in the time remaining, then convert it to a worker count:
Required Throughput = Pending Jobs / Time Remaining
Workers Needed = Required Throughput / Processing Rate per Worker
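The two formulas can be combined into a single expression. The symbols are introduced here for illustration only: $N$ is the number of pending jobs, $T_{\text{SLA}}$ the SLA target, $A$ the oldest job age, and $r$ the per-worker processing rate. The expression only applies while $T_{\text{SLA}} - A > 0$ and $r > 0$.

```latex
W = \left\lceil \frac{N}{(T_{\text{SLA}} - A)\, r} \right\rceil
```

For instance, with $N = 500$, $T_{\text{SLA}} = 60$, $A = 55$ and $r = 8$, this gives $W = \lceil 500 / (5 \times 8) \rceil = \lceil 12.5 \rceil = 13$ workers before any limits or safety margin are applied.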
Implementation
Basic Backlog Drain
public function calculateBacklogDrainWorkers(object $metrics, QueueConfiguration $config): int
{
    $pendingJobs = $metrics->depth->pending ?? 0;
    $oldestJobAge = $metrics->depth->oldestJobAgeSeconds ?? 0;
    $slaTarget = $config->maxPickupTimeSeconds;
    $processingRate = $metrics->processingRate ?? 0.0;

    if ($pendingJobs === 0) {
        return $config->minWorkers;
    }

    // Calculate time remaining before SLA breach
    $timeRemaining = $slaTarget - $oldestJobAge;

    // If already breaching, scale to maximum immediately
    if ($timeRemaining <= 0) {
        return $config->maxWorkers;
    }

    // If processing rate is unknown or zero, scale conservatively.
    // The rate is a float, so a strict `=== 0` check would never match 0.0
    // and the division below would fail.
    if ($processingRate <= 0.0) {
        return (int) ceil($config->maxWorkers * 0.8);
    }

    // Calculate workers needed to clear backlog in remaining time
    $requiredThroughput = $pendingJobs / $timeRemaining; // jobs/second
    $workers = (int) ceil($requiredThroughput / $processingRate);

    // Apply limits
    return max($config->minWorkers, min($config->maxWorkers, $workers));
}
With Safety Buffer
Add warning threshold before actual breach:
public function calculateWithBuffer(object $metrics, QueueConfiguration $config): int
{
    $oldestJobAge = $metrics->depth->oldestJobAgeSeconds ?? 0;
    $slaTarget = $config->maxPickupTimeSeconds;

    // Define warning threshold at 80% of SLA
    $warningThreshold = $slaTarget * 0.8;

    if ($oldestJobAge < $warningThreshold) {
        // Normal operation - use Little's Law
        return $this->littlesLawCalculation($metrics, $config);
    }

    // Approaching SLA limit - use backlog drain
    $baseWorkers = $this->calculateBacklogDrainWorkers($metrics, $config);

    // Add safety margin based on proximity to breach
    $slaUsage = $oldestJobAge / $slaTarget;
    $safetyMargin = 1.0 + (($slaUsage - 0.8) * 2); // 1.0x at 80%, 1.2x at 90%, 1.3x at 95%
    $workers = (int) ceil($baseWorkers * $safetyMargin);

    return max($config->minWorkers, min($config->maxWorkers, $workers));
}
Examples
Example 1: Normal Operation
Scenario:
- Pending: 100 jobs
- Oldest age: 10 seconds
- SLA target: 60 seconds
- Processing rate: 5 jobs/sec per worker
Calculation:
Time remaining = 60 - 10 = 50 seconds
SLA usage = 10/60 = 16.7%
Status: Normal (< 80% threshold)
Action: Use Little's Law instead of backlog drain
Example 2: Approaching SLA
Scenario:
- Pending: 200 jobs
- Oldest age: 48 seconds
- SLA target: 60 seconds
- Processing rate: 10 jobs/sec per worker
Calculation:
Time remaining = 60 - 48 = 12 seconds
SLA usage = 48/60 = 80%
Required throughput = 200 / 12 = 16.67 jobs/sec
Workers needed = 16.67 / 10 = 1.67 ≈ 2 workers
Safety margin = 1.0 + ((0.8 - 0.8) × 2) = 1.0x
Final workers = 2 × 1.0 = 2 workers
Example 3: Imminent Breach
Scenario:
- Pending: 500 jobs
- Oldest age: 55 seconds
- SLA target: 60 seconds
- Processing rate: 8 jobs/sec per worker
Calculation:
Time remaining = 60 - 55 = 5 seconds
SLA usage = 55/60 = 91.7%
Required throughput = 500 / 5 = 100 jobs/sec
Workers needed = 100 / 8 = 12.5 ≈ 13 workers
Safety margin = 1.0 + ((0.917 - 0.8) × 2) = 1.233x
Final workers = ceil(13 × 1.233) = ceil(16.03) = 17 workers
Example 4: Active Breach
Scenario:
- Pending: 300 jobs
- Oldest age: 65 seconds
- SLA target: 60 seconds
- Max workers: 20
Calculation:
Time remaining = 60 - 65 = -5 seconds
Status: BREACH!
Action: Scale to maximum immediately
Workers = 20 (max_workers)
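The arithmetic in Examples 2 through 4 can be checked with a short script. This is a minimal sketch, not the library's implementation: `drainWorkersWithBuffer` and its flat parameter list are hypothetical stand-ins for the `$metrics` and `$config` objects, and `min_workers = 1` and `max_workers = 20` are assumed wherever an example does not state them. Example 1 is left out because it stays below the 80% threshold and is handled by Little's Law.

```php
<?php

/**
 * Hypothetical standalone version of the backlog drain calculation
 * with the proximity-based safety margin applied.
 */
function drainWorkersWithBuffer(
    int $pending,
    float $oldestAge,
    float $slaTarget,
    float $ratePerWorker,
    int $minWorkers,
    int $maxWorkers
): int {
    if ($pending === 0) {
        return $minWorkers;
    }

    $timeRemaining = $slaTarget - $oldestAge;
    if ($timeRemaining <= 0) {
        return $maxWorkers; // Already breaching
    }
    if ($ratePerWorker <= 0.0) {
        return (int) ceil($maxWorkers * 0.8); // Rate unknown
    }

    // Base drain calculation, clamped to the configured limits
    $base = (int) ceil(($pending / $timeRemaining) / $ratePerWorker);
    $base = max($minWorkers, min($maxWorkers, $base));

    // Safety margin grows with SLA usage above 80%
    $margin = 1.0 + max(0.0, ($oldestAge / $slaTarget - 0.8) * 2);
    $workers = (int) ceil($base * $margin);

    return max($minWorkers, min($maxWorkers, $workers));
}

echo drainWorkersWithBuffer(200, 48.0, 60.0, 10.0, 1, 20), "\n"; // Example 2: 2
echo drainWorkersWithBuffer(500, 55.0, 60.0, 8.0, 1, 20), "\n";  // Example 3: 17
echo drainWorkersWithBuffer(300, 65.0, 60.0, 5.0, 1, 20), "\n";  // Example 4: 20 (rate is irrelevant once breached)
```

Example 3 yields 17 because the unrounded margin of 1.2333 gives 13 × 1.2333 ≈ 16.03, which `ceil` rounds up.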
SLA Urgency Levels
Classify urgency and response:
public function determineSlaUrgency(float $oldestAge, int $slaTarget): string
{
    $usage = $oldestAge / $slaTarget;

    if ($usage >= 1.0) {
        return 'BREACH'; // Already violating SLA
    } elseif ($usage >= 0.9) {
        return 'CRITICAL'; // At most 10% of the SLA window remaining
    } elseif ($usage >= 0.8) {
        return 'WARNING'; // At most 20% remaining
    } elseif ($usage >= 0.6) {
        return 'ELEVATED'; // At most 40% remaining
    } else {
        return 'NORMAL'; // More than 40% remaining
    }
}

public function getScalingResponse(string $urgency, object $metrics, QueueConfiguration $config): int
{
    return match($urgency) {
        'BREACH' => $config->maxWorkers, // Maximum immediately
        'CRITICAL' => $this->aggressiveScale($metrics, $config), // Very aggressive
        'WARNING' => $this->backlogDrain($metrics, $config), // Backlog drain
        'ELEVATED' => $this->cautiousScale($metrics, $config), // Slightly elevated
        'NORMAL' => $this->littlesLaw($metrics, $config), // Standard calculation
    };
}
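To see where the boundaries fall, the classification can be run against a few sample ages. The sketch below is a standalone restatement of the same thresholds; the function name `slaUrgency` and the 60-second SLA are assumptions made for the illustration.

```php
<?php

// Hypothetical free-function version of the urgency classification.
function slaUrgency(float $oldestAge, int $slaTarget): string
{
    $usage = $oldestAge / $slaTarget;

    return match (true) {
        $usage >= 1.0 => 'BREACH',
        $usage >= 0.9 => 'CRITICAL',
        $usage >= 0.8 => 'WARNING',
        $usage >= 0.6 => 'ELEVATED',
        default       => 'NORMAL',
    };
}

// Classify a range of oldest-job ages against a 60-second SLA.
foreach ([30.0, 40.0, 50.0, 55.0, 65.0] as $age) {
    printf("%2.0fs -> %s\n", $age, slaUrgency($age, 60));
}
// 30s -> NORMAL
// 40s -> ELEVATED
// 50s -> WARNING
// 55s -> CRITICAL
// 65s -> BREACH
```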
Advanced Features
Drain Rate Tracking
Monitor how fast the backlog is being cleared:
public function calculateDrainRate(array $historicalDepth): float
{
    if (count($historicalDepth) < 2) {
        return 0.0;
    }

    $recent = array_slice($historicalDepth, -5); // Last 5 samples
    $firstDepth = reset($recent);
    $lastDepth = end($recent);

    // n samples span n - 1 sampling intervals
    $timeSpan = (count($recent) - 1) * $this->sampleInterval; // seconds
    $drainRate = ($firstDepth - $lastDepth) / $timeSpan; // jobs/second

    return max(0.0, $drainRate); // A growing backlog counts as zero drain
}
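A worked example helps make the units concrete. The sketch below is a standalone variant with hypothetical names (`drainRate`, an explicit `$sampleInterval` argument); it takes the time span as the number of intervals between samples, which is one less than the sample count, and it assumes depth samples are ordered oldest first.

```php
<?php

/**
 * Hypothetical standalone drain-rate calculation.
 *
 * @param list<int> $depths Queue depth samples, oldest first.
 */
function drainRate(array $depths, float $sampleInterval): float
{
    if (count($depths) < 2 || $sampleInterval <= 0.0) {
        return 0.0;
    }

    $recent = array_slice($depths, -5);  // Last 5 samples, reindexed from 0
    $intervals = count($recent) - 1;     // n samples span n - 1 intervals
    $rate = ($recent[0] - $recent[$intervals]) / ($intervals * $sampleInterval);

    return max(0.0, $rate);              // Growing backlog counts as zero drain
}

// Depth sampled every 10 seconds; only the last five samples are used.
echo drainRate([500, 460, 430, 390, 350, 300], 10.0), "\n"; // (460 - 300) / 40 = 4
echo drainRate([100, 120, 150], 10.0), "\n";                // backlog growing: 0
```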
Predicted Breach Time
Forecast when SLA breach will occur:
public function predictBreachTime(object $metrics, QueueConfiguration $config): ?int
{
    $pendingJobs = $metrics->depth->pending ?? 0;
    $oldestAge = $metrics->depth->oldestJobAgeSeconds ?? 0;
    $drainRate = $this->calculateDrainRate($this->historicalDepth);

    if ($drainRate <= 0 || $pendingJobs === 0) {
        return null; // Can't predict
    }

    // Time until queue is empty at current drain rate
    $timeToEmpty = $pendingJobs / $drainRate;

    // Time until SLA breach
    $timeToBreach = $config->maxPickupTimeSeconds - $oldestAge;

    // If queue will empty before breach, no breach predicted
    if ($timeToEmpty < $timeToBreach) {
        return null;
    }

    // Breach predicted in X seconds (0 means the SLA is already breached)
    return max(0, (int) ceil($timeToBreach));
}
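The prediction compares two durations: how long the queue takes to empty and how long the oldest job has left. The sketch below isolates that comparison in a hypothetical free function, `secondsUntilBreach`, which takes the drain rate as an argument rather than reading stored history.

```php
<?php

// Hypothetical standalone version of the breach prediction.
// Returns seconds until the predicted breach, or null if none is predicted.
function secondsUntilBreach(int $pending, float $oldestAge, float $slaTarget, float $drainRate): ?int
{
    if ($drainRate <= 0.0 || $pending === 0) {
        return null; // Cannot predict without a positive drain rate and a backlog
    }

    $timeToEmpty = $pending / $drainRate;    // seconds until the queue is empty
    $timeToBreach = $slaTarget - $oldestAge; // seconds until the oldest job breaches

    if ($timeToEmpty < $timeToBreach) {
        return null; // Queue drains before the deadline
    }

    return max(0, (int) ceil($timeToBreach));
}

// 300 jobs at 4 jobs/sec need 75 s, but only 20 s remain: breach in 20 s.
var_dump(secondsUntilBreach(300, 40.0, 60.0, 4.0)); // int(20)

// 40 jobs at 4 jobs/sec need 10 s, well inside the 20 s remaining.
var_dump(secondsUntilBreach(40, 40.0, 60.0, 4.0));  // NULL
```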
Cascading Breach Prevention
Prevent single breach from causing cascade:
public function preventCascade(object $metrics, QueueConfiguration $config): int
{
    $breachTime = $this->predictBreachTime($metrics, $config);

    if ($breachTime === null || $breachTime > 120) {
        // No imminent breach, normal operation
        return $this->littlesLaw($metrics, $config);
    }

    // Breach predicted soon - scale aggressively
    $baseWorkers = $this->backlogDrain($metrics, $config);

    // Add extra capacity to prevent cascade
    $cascadeBuffer = 1.5; // 50% extra capacity
    $workers = (int) ceil($baseWorkers * $cascadeBuffer);

    // The buffer must not push the result past the configured maximum
    return min($config->maxWorkers, $workers);
}
Integration with Hybrid Strategy
class HybridPredictiveStrategy
{
    public function calculateTargetWorkers(object $metrics, QueueConfiguration $config): int
    {
        $oldestAge = $metrics->depth->oldestJobAgeSeconds ?? 0;
        $slaTarget = $config->maxPickupTimeSeconds;
        $slaUsage = $oldestAge / $slaTarget;

        // Priority 1: SLA breach protection
        if ($slaUsage >= 0.8) {
            return $this->backlogDrain($metrics, $config);
        }

        // Priority 2: Trend-based proactive scaling
        if (($metrics->trend->direction ?? 'stable') === 'up') {
            return $this->trendBased($metrics, $config);
        }

        // Priority 3: Steady-state Little's Law
        return $this->littlesLaw($metrics, $config);
    }

    private function backlogDrain(object $metrics, QueueConfiguration $config): int
    {
        $pending = $metrics->depth->pending ?? 0;
        $oldestAge = $metrics->depth->oldestJobAgeSeconds ?? 0;
        $slaTarget = $config->maxPickupTimeSeconds;
        $rate = $metrics->processingRate ?? 1.0;
        if ($rate <= 0.0) {
            $rate = 1.0; // Treat a zero rate like a missing one to avoid division by zero
        }

        $timeRemaining = max(1, $slaTarget - $oldestAge); // At least 1 second
        $requiredThroughput = $pending / $timeRemaining;
        $workers = (int) ceil($requiredThroughput / $rate);

        // Safety margin based on proximity to breach
        $slaUsage = $oldestAge / $slaTarget;
        $safetyMargin = 1.0 + max(0, ($slaUsage - 0.8) * 2);
        $workers = (int) ceil($workers * $safetyMargin);

        return max($config->minWorkers, min($config->maxWorkers, $workers));
    }
}
Performance Characteristics
Time Complexity
- O(1): Constant time calculation
- Very fast, suitable for high-frequency evaluation
Space Complexity
- O(1): No historical data required
- Can enhance with history (O(n) for n samples)
Accuracy
- Very High for SLA protection
- May overprovision slightly for safety
- Excellent at preventing breaches
Best Practices
- Set appropriate thresholds: Typically 80% SLA usage for activation
- Use safety margins: Buffer for uncertainty and delays
- Monitor drain rate: Track how fast backlog is clearing
- Combine with other algorithms: Use as override, not primary
- Alert on activation: Log when backlog drain is triggered
- Track breach rate: Measure actual SLA compliance
Common Patterns
Pattern 1: Tiered Response
Different responses at different urgency levels:
$slaUsage = $oldestAge / $slaTarget;

$workers = match(true) {
    $slaUsage >= 1.0 => $config->maxWorkers, // Breach: max immediately
    $slaUsage >= 0.95 => (int) ceil($baseWorkers * 1.5), // Critical: 50% buffer
    $slaUsage >= 0.9 => (int) ceil($baseWorkers * 1.3), // Warning: 30% buffer
    $slaUsage >= 0.8 => (int) ceil($baseWorkers * 1.2), // Elevated: 20% buffer
    default => $this->littlesLaw($metrics, $config), // Normal: standard calc
};

// Buffered results can exceed the configured limits, so clamp them
$workers = max($config->minWorkers, min($config->maxWorkers, $workers));
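The tier multipliers are easier to compare with concrete numbers. In this minimal sketch, `tieredWorkers` is a hypothetical helper that receives the base drain result and the normal (Little's Law) result as plain integers, and it additionally caps the buffered value at the maximum worker count.

```php
<?php

// Hypothetical helper: applies the tiered buffer to an already computed base.
function tieredWorkers(float $slaUsage, int $baseWorkers, int $normalWorkers, int $maxWorkers): int
{
    $workers = match (true) {
        $slaUsage >= 1.0  => $maxWorkers,                     // Breach
        $slaUsage >= 0.95 => (int) ceil($baseWorkers * 1.5),  // 50% buffer
        $slaUsage >= 0.9  => (int) ceil($baseWorkers * 1.3),  // 30% buffer
        $slaUsage >= 0.8  => (int) ceil($baseWorkers * 1.2),  // 20% buffer
        default           => $normalWorkers,                  // Below threshold
    };

    return min($maxWorkers, $workers); // Never exceed the configured maximum
}

// Base of 7 workers, normal calculation of 3, maximum of 20.
echo tieredWorkers(0.50, 7, 3, 20), "\n"; // 3  (normal calculation)
echo tieredWorkers(0.85, 7, 3, 20), "\n"; // 9  (ceil of 8.4)
echo tieredWorkers(0.92, 7, 3, 20), "\n"; // 10 (ceil of 9.1)
echo tieredWorkers(0.96, 7, 3, 20), "\n"; // 11 (ceil of 10.5)
echo tieredWorkers(1.05, 7, 3, 20), "\n"; // 20 (maximum)
```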
Pattern 2: Gradual Escalation
Increase urgency over time:
private int $consecutiveHighUsage = 0;

public function escalate(float $slaUsage): float
{
    if ($slaUsage >= 0.8) {
        $this->consecutiveHighUsage++;
    } else {
        $this->consecutiveHighUsage = 0;
    }

    // Escalate safety margin with consecutive high usage
    $escalationFactor = min(2.0, 1.0 + ($this->consecutiveHighUsage * 0.1));

    return $escalationFactor;
}
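How the factor evolves over successive evaluations is shown in the sketch below. The wrapper class `EscalationTracker` is a hypothetical container for the counter; the escalation logic itself matches the pattern: 0.1 is added per consecutive high-usage cycle, the factor is capped at 2.0, and it resets as soon as usage drops below 80%.

```php
<?php

// Hypothetical wrapper that holds the consecutive-high-usage counter.
final class EscalationTracker
{
    private int $consecutiveHighUsage = 0;

    public function escalate(float $slaUsage): float
    {
        if ($slaUsage >= 0.8) {
            $this->consecutiveHighUsage++;
        } else {
            $this->consecutiveHighUsage = 0; // Pressure relieved: start over
        }

        // +0.1 per consecutive high-usage cycle, capped at 2.0
        return min(2.0, 1.0 + ($this->consecutiveHighUsage * 0.1));
    }
}

$tracker = new EscalationTracker();
foreach ([0.85, 0.90, 0.95, 0.50] as $usage) {
    printf("usage %.2f -> factor %.1f\n", $usage, $tracker->escalate($usage));
}
// usage 0.85 -> factor 1.1
// usage 0.90 -> factor 1.2
// usage 0.95 -> factor 1.3
// usage 0.50 -> factor 1.0
```

The returned factor is meant to multiply a base worker count, so the caller still needs to clamp the result to the configured limits.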
Pattern 3: Post-Breach Recovery
Aggressive scaling after breach to prevent recurrence:
public function postBreachRecovery(object $metrics, QueueConfiguration $config): int
{
    if (!$this->wasRecentlyBreached()) {
        return $this->normalCalculation($metrics, $config);
    }

    // Stay at elevated capacity for recovery period
    $recoveryWorkers = (int) ceil($config->maxWorkers * 0.8);
    $normalWorkers = $this->normalCalculation($metrics, $config);

    return max($recoveryWorkers, $normalWorkers);
}
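The pattern relies on `wasRecentlyBreached()`, which is not defined here. One possible way to back it is sketched below, purely as an assumption: the class `BreachRecoveryWindow`, the 300-second default window, and the explicit `$now` timestamp arguments are all hypothetical and chosen to keep the example deterministic.

```php
<?php

// Hypothetical tracker for the post-breach recovery period.
final class BreachRecoveryWindow
{
    private ?int $lastBreachAt = null;

    public function __construct(private int $recoverySeconds = 300)
    {
    }

    // Record the Unix timestamp at which a breach was observed.
    public function recordBreach(int $now): void
    {
        $this->lastBreachAt = $now;
    }

    // True while fewer than $recoverySeconds have passed since the last breach.
    public function wasRecentlyBreached(int $now): bool
    {
        return $this->lastBreachAt !== null
            && ($now - $this->lastBreachAt) < $this->recoverySeconds;
    }
}

$window = new BreachRecoveryWindow(300);
$window->recordBreach(1000);

var_dump($window->wasRecentlyBreached(1200)); // bool(true): 200 s after the breach
var_dump($window->wasRecentlyBreached(1300)); // bool(false): window has elapsed
```

In production code the timestamps would typically come from `time()`; passing them in makes the behaviour easy to test.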
Monitoring and Alerts
Track SLA performance:
// Log SLA usage
logger()->info('SLA usage', [
    'queue' => $config->queue,
    'oldest_age' => $oldestAge,
    'sla_target' => $slaTarget,
    'sla_usage_percent' => ($oldestAge / $slaTarget) * 100,
    'urgency' => $this->determineSlaUrgency($oldestAge, $slaTarget),
]);

// Alert on high usage
if ($oldestAge / $slaTarget >= 0.9) {
    $this->alert->send([
        'severity' => 'critical',
        'message' => "SLA breach imminent for queue {$config->queue}",
        'details' => [
            'oldest_job_age' => $oldestAge,
            'sla_target' => $slaTarget,
            'time_remaining' => $slaTarget - $oldestAge,
        ],
    ]);
}
See Also
- Little's Law - Steady-state calculation
- Trend Prediction - Predictive scaling
- Resource Constraints - Resource management