Why I Built a Collector-Free Telemetry Package for Laravel
Sentry and Nightwatch showed green while customers saw errors. Nothing ever threw, and sampling dropped exactly the requests that mattered. That incident, collector fatigue on Kubernetes, and a family of packages that all needed metrics became "Telemetry for Laravel".
The worst incidents are the quiet ones. A while back we had errors in production for the better part of a day. Customers wrote in to tell us. The load balancer graphs showed a steady band of 5xx responses. The PHP logs on the nodes had the stack traces. And both Sentry and Nightwatch, two tools we paid to catch exactly this, showed green.
The uncomfortable part is that nothing was broken in the way error trackers define broken. Our code was solid. Nothing threw. The failures were internal error states: results we computed, decided were wrong, and rendered to customers anyway. Error tracking keys on exceptions, and there were none to report. Tracing could have shown it, but sampling meant exactly the affected requests were the ones missing from the data. We were blind precisely where it mattered, and I spent the day doing archaeology instead: grepping logs node by node, squinting at load balancer dashboards, and trying to match timestamps across systems that had never heard of each other.
You can wire around this by hand, of course. Every tool has hooks for reporting custom errors. But an application has hundreds of places where something can go wrong without throwing, and instrumenting them one by one is a cat-and-mouse game you lose the day someone adds a new code path. Unless a real exception is thrown, you only ever catch the failures you already predicted.
What I wanted was embarrassingly simple. One id that follows a request from the edge, through the app, into the queue job it dispatched, down to the query that failed. Signals that cover every request, not a sample of them. One place where the metrics from every node land as a single series instead of a per-server guessing game. I did not have it, and the tooling I was paying for did not give it to me.
Three itches
That incident was the first itch. The second was operational. The standard answer to PHP observability is OpenTelemetry, and the standard answer to OpenTelemetry in PHP is "run a collector". On Kubernetes that means a sidecar or a DaemonSet, plus config for it, memory limits for it, upgrades for it, and alerts for when it falls over. The thing that watches your app becomes another thing you watch. For a Go service with a persistent runtime that trade can be worth it. For a fleet of PHP apps it always felt like buying a truck to carry a letter.
The third itch came from my own packages. Queue Autoscale makes scaling decisions every few seconds. Queue Monitor tracks every job. Both of them produce data that deserves to end up in a dashboard, and I kept almost writing a metrics layer inside each of them. What I actually wanted was one shared telemetry layer that packages can publish into without depending on each other.
The problem with PHP metrics
PHP is shared-nothing. Every request gets a fresh process state, and whatever you accumulated in memory dies when the response is sent. A counter that lives in the process is worthless the moment the request ends. The official OpenTelemetry SDK assumes the opposite: it aggregates metrics in-process, which works beautifully in a long-running runtime and falls apart under FPM. Each worker exports its own private stream of numbers, and the official fix is to ship them to a collector that re-aggregates them.
The collector is not an optimization. It is load-bearing.
But Laravel apps already have a piece of shared infrastructure that is fast, atomic, and sitting right there: Redis. If every worker on every node increments the same Redis hash field, there is nothing to re-aggregate. The scrape endpoint renders the store, Prometheus reads the merged truth, and OTLP metrics can be exported as cumulative values, which sidesteps the per-worker delta problem entirely. That one decision, aggregate in shared storage instead of in the process, is the whole package. Everything else is careful plumbing.
The official SDK under FPM          This package

fpm-1 ─┐                            fpm-1 ─┐
fpm-2 ─┼─ per-worker streams        fpm-2 ─┼─► Redis, one series
fpm-N ─┘          │                 fpm-N ─┘          │
                  ▼                           ┌───────┴───────┐
          collector sidecar                   ▼               ▼
           (re-aggregates)               Prometheus      OTLP backend
                  │                       (scrape)     (telemetry:flush)
                  ▼
               backend
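The difference between in-process and shared aggregation can be sketched in a few lines. This is a minimal illustration, not the package's code: `SharedStore` and `handleRequest` are hypothetical names, and a plain PHP array stands in for the Redis hash. Against real Redis the increment would be an atomic server-side operation such as `HINCRBY`.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a Redis hash shared by every worker on every node.
// With real Redis, incrementBy() would map to an atomic command such as HINCRBY.
final class SharedStore
{
    /** @var array<string, int> */
    private array $fields = [];

    public function incrementBy(string $field, int $by = 1): int
    {
        $this->fields[$field] = ($this->fields[$field] ?? 0) + $by;

        return $this->fields[$field];
    }

    /** @return array<string, int> */
    public function all(): array
    {
        return $this->fields;
    }
}

// Simulates one FPM request: process memory starts empty and is thrown away
// when the function returns, while the shared store keeps accumulating.
function handleRequest(SharedStore $store): int
{
    $inProcessCounter = 0;      // fresh on every request
    $inProcessCounter++;        // lost as soon as the response is sent

    $store->incrementBy('orders.created');

    return $inProcessCounter;
}

$store = new SharedStore();
$perRequestValues = [];

for ($i = 0; $i < 5; $i++) {
    $perRequestValues[] = handleRequest($store);
}

// Each request only ever saw its own private count of 1;
// the shared store holds the merged series.
$merged = $store->all()['orders.created'];
```

Every simulated request reports a private count of 1, while the shared field holds the total across all of them. That merged value is what a scrape endpoint can render directly, with nothing left to re-aggregate.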
Learnings from the graveyard
Before writing any code I spent time on prior art, and the PHP telemetry ecosystem turned out to be a graveyard with excellent documentation. The learnings shaped the design more than any feature idea did.
First: every abandoned Laravel Prometheus exporter died the same death. Not from bugs in its own code, but from churn in the client library underneath it, which was forked from maintainer to maintainer while the wrappers rotted. The lesson was uncomfortable but clear: own the core. It costs more upfront and it is the only position that survives a decade.
Second: the most popular Redis-backed Prometheus library for PHP has a cautionary tale baked into its storage layer. Summaries forced it into one Redis key per observation, collected with a wildcard KEYS call at scrape time. That is a production outage waiting for traffic. So laravel-telemetry has no summary instrument at all, because histograms cover the need. Every write is a single atomic Lua call, and the scrape path never scans the keyspace. Some of the best design decisions are the features you refuse to build.
# the cautionary tale               # laravel-telemetry
KEYS prometheus:*                   SMEMBERS index
(scans the whole keyspace           HGETALL per metric family
on every scrape, blocks             (bounded, index-driven,
Redis while it runs)                one atomic Lua call per write)
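The index-driven pattern on the right can be sketched as follows. This is an illustration under stated assumptions, not the package's storage layer: `FakeRedis` is a hypothetical in-memory stand-in whose methods mirror the Redis commands `SADD`, `HINCRBY`, `SMEMBERS`, and `HGETALL`, the key names are invented, and the two write steps shown separately here are what the post describes as a single atomic Lua call.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for the four Redis commands the pattern needs.
final class FakeRedis
{
    private array $sets = [];
    private array $hashes = [];

    public function sAdd(string $key, string $member): void
    {
        $this->sets[$key][$member] = true;
    }

    public function hIncrBy(string $key, string $field, int $by): void
    {
        $this->hashes[$key][$field] = ($this->hashes[$key][$field] ?? 0) + $by;
    }

    public function sMembers(string $key): array
    {
        return array_keys($this->sets[$key] ?? []);
    }

    public function hGetAll(string $key): array
    {
        return $this->hashes[$key] ?? [];
    }
}

// Write path: register the metric family in the index, then bump the series.
// Key names are assumptions made for this sketch.
function increment(FakeRedis $redis, string $family, string $labels, int $by = 1): void
{
    $redis->sAdd('telemetry:index', $family);
    $redis->hIncrBy('telemetry:metric:' . $family, $labels, $by);
}

// Scrape path: walk the index and read one hash per family.
// The keyspace is never scanned, so cost is bounded by the number of families.
function scrape(FakeRedis $redis): array
{
    $lines = [];

    foreach ($redis->sMembers('telemetry:index') as $family) {
        foreach ($redis->hGetAll('telemetry:metric:' . $family) as $labels => $value) {
            $lines[] = sprintf('%s{%s} %d', $family, $labels, $value);
        }
    }

    sort($lines);

    return $lines;
}

$redis = new FakeRedis();
increment($redis, 'orders_created_total', 'status="paid"');
increment($redis, 'orders_created_total', 'status="paid"');
increment($redis, 'orders_created_total', 'status="failed"');

$exposition = scrape($redis);
```

The scrape touches exactly one set and one hash per metric family, however many other keys live in the same Redis instance.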
Third: names lie. There is a well-known Laravel package with OpenTelemetry in its name that emits Zipkin JSON on the wire. Meanwhile a tiny proof-of-concept library quietly demonstrated that you can speak spec-stable OTLP over plain HTTP JSON from PHP with no SDK, no protobuf, and no gRPC extension. The wire format is the contract. Everything above it is replaceable.
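To make "the wire format is the contract" concrete, here is a rough sketch of an OTLP trace payload built with nothing but arrays and `json_encode`. It is not the package's exporter. The function name, the scope name, and the sample span values are hypothetical, and only a minimal subset of the span fields is shown. In the OTLP JSON encoding, trace and span ids are hex strings, and the body is typically POSTed to a `/v1/traces` endpoint with `Content-Type: application/json`.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: wraps a single span in the OTLP resourceSpans envelope.
function otlpTracePayload(string $serviceName, array $span): array
{
    return [
        'resourceSpans' => [[
            'resource' => [
                'attributes' => [
                    ['key' => 'service.name', 'value' => ['stringValue' => $serviceName]],
                ],
            ],
            'scopeSpans' => [[
                'scope' => ['name' => 'example-instrumentation'], // invented scope name
                'spans' => [[
                    'traceId' => $span['trace_id'],   // 32 hex characters
                    'spanId' => $span['span_id'],     // 16 hex characters
                    'name' => $span['name'],
                    'kind' => 2,                      // SPAN_KIND_SERVER
                    // Nanosecond timestamps are sent as strings here so they
                    // survive JSON parsers that lack 64-bit integers.
                    'startTimeUnixNano' => (string) $span['start_ns'],
                    'endTimeUnixNano' => (string) $span['end_ns'],
                ]],
            ]],
        ]],
    ];
}

$body = json_encode(
    otlpTracePayload('shop', [
        'trace_id' => '4bf92f3577b34da6a3ce929d0e0e4736',
        'span_id' => '00f067aa0ba902b7',
        'name' => 'GET /checkout',
        'start_ns' => 1700000000000000000,
        'end_ns' => 1700000000250000000,
    ]),
    JSON_THROW_ON_ERROR
);
```

No SDK, protobuf, or gRPC extension is involved: the payload is an ordinary nested array, which is why the layers above the wire format stay replaceable.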
And fourth: Laravel Pulse gets the operational details right, and I copied what worked without shame. Capped in-memory buffers so a long-running worker cannot grow unbounded. A Redis stream recipe for high-traffic ingest. And a rule I promoted to a design principle: telemetry must never throw into your app. A broken exporter degrades to silence, not to a 500.
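Two of those rules are small enough to sketch. The code below is an illustration, not Pulse's or this package's implementation: `safely` and `CappedBuffer` are hypothetical names, and dropping the oldest entry when the buffer is full is an assumption made for the example (a real buffer might flush instead).

```php
<?php
declare(strict_types=1);

// Hypothetical guard: run a telemetry side effect and swallow any failure,
// so a broken exporter never surfaces as an application error.
function safely(callable $telemetry): void
{
    try {
        $telemetry();
    } catch (\Throwable $e) {
        // Degrade to silence. A real implementation might count the drop.
    }
}

// Hypothetical capped buffer: memory use stays bounded in a long-running
// worker because the oldest entry is discarded once the limit is exceeded.
final class CappedBuffer
{
    private array $items = [];

    public function __construct(private int $limit)
    {
    }

    public function push(mixed $item): void
    {
        $this->items[] = $item;

        if (count($this->items) > $this->limit) {
            array_shift($this->items);
        }
    }

    public function items(): array
    {
        return $this->items;
    }
}

$buffer = new CappedBuffer(3);

foreach (range(1, 10) as $n) {
    $buffer->push($n);
}

$reachedEnd = false;

safely(function (): void {
    throw new \RuntimeException('exporter is down');
});

// Execution continues: the exception never reached the caller.
$reachedEnd = true;
```

After ten pushes the buffer still holds only three items, and the failing exporter call leaves the surrounding code untouched.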
The shape it took
Telemetry::counter('orders.created')->inc();
Telemetry::gauge('queue.depth', fn () => Queue::size());
Telemetry::histogram('checkout.duration', unit: 'ms')->record($ms);
Telemetry::span('import.customers', function () {
    // traced work: exceptions recorded, duration measured
});
Telemetry::event('autoscale.decision', ['workers' => 7]);
That is the whole public API surface for day-to-day use. Underneath it, requests, queue jobs, database queries, scheduled tasks, mail, and outgoing HTTP calls are instrumented automatically. The part I care most about is trace propagation: dispatched jobs carry the full W3C traceparent, trace id and parent span id, so a job span shows up as a child of the request that dispatched it instead of a detached root. The response carries an X-Trace-Id header, which means a support ticket can contain the exact key that opens the right trace waterfall.
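The propagation step can be sketched with the W3C `traceparent` format, which is `version-traceid-parentid-flags` in lowercase hex. This is a minimal illustration that only handles version `00`. The function names and the job payload shape are hypothetical and not how the package serializes jobs.

```php
<?php
declare(strict_types=1);

// Builds a version-00 traceparent value from a trace id, a span id, and the sampled flag.
function formatTraceparent(string $traceId, string $spanId, bool $sampled): string
{
    return sprintf('00-%s-%s-%s', $traceId, $spanId, $sampled ? '01' : '00');
}

// Parses a version-00 traceparent value; returns null for anything malformed.
function parseTraceparent(string $header): ?array
{
    if (!preg_match('/^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/', $header, $m)) {
        return null;
    }

    // All-zero ids are invalid per the Trace Context specification.
    if ($m[1] === str_repeat('0', 32) || $m[2] === str_repeat('0', 16)) {
        return null;
    }

    return [
        'trace_id' => $m[1],
        'parent_span_id' => $m[2],
        'sampled' => (hexdec($m[3]) & 1) === 1,
    ];
}

// Request side: attach the current span's context to the job payload.
$traceId = bin2hex(random_bytes(16));
$requestSpanId = bin2hex(random_bytes(8));

$jobPayload = [
    'job' => 'ImportCustomers', // hypothetical job name
    'traceparent' => formatTraceparent($traceId, $requestSpanId, true),
];

// Worker side: continue the same trace, with the request span as parent.
$parent = parseTraceparent($jobPayload['traceparent']);

$jobSpan = [
    'trace_id' => $parent['trace_id'],
    'span_id' => bin2hex(random_bytes(8)),
    'parent_span_id' => $parent['parent_span_id'],
];
```

Because the job span reuses the trace id and points at the request span as its parent, a trace viewer can nest it under the request instead of showing a detached root.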
And sampling is designed around the failure mode from that incident. Metrics are never sampled: every request on every node lands in the shared store, whatever the trace sampler decides, so a band of errors is visible in the aggregates even when no trace captured it. Error spans are exported even from traces the sampler skipped, so an app that samples 10 percent of its traces still surfaces every failing span. The one class of problem this cannot fully solve is the silent kind, where nothing throws and the response looks fine on paper. But because every request is measured and every response carries a trace id, the moment a customer says "this looked wrong", you hold the key to the exact request instead of a day of archaeology.
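The two rules (metrics always, error spans regardless of the sampler) can be sketched like this. The code shows the decision logic only. The function names are hypothetical, and the ratio sampler that maps the first 32 bits of the trace id onto a threshold is an assumption for the example, not necessarily what the package does.

```php
<?php
declare(strict_types=1);

// Hypothetical ratio sampler: maps the first 8 hex characters of the trace id
// onto the range 0..2^32 and keeps the trace if it falls below the ratio.
function isTraceSampled(string $traceId, float $ratio): bool
{
    return hexdec(substr($traceId, 0, 8)) < $ratio * 0x100000000;
}

// A span is exported when its trace was sampled OR the span recorded an error.
function shouldExportSpan(bool $traceSampled, bool $spanHasError): bool
{
    return $traceSampled || $spanHasError;
}

$requests = [
    ['name' => 'a', 'trace_id' => '0000000a' . str_repeat('0', 24), 'error' => false], // sampled
    ['name' => 'b', 'trace_id' => 'ffffffff' . str_repeat('0', 24), 'error' => false], // skipped
    ['name' => 'c', 'trace_id' => 'fffffffe' . str_repeat('0', 24), 'error' => true],  // skipped, but failed
];

$metrics = ['requests' => 0, 'errors' => 0];
$exportedSpans = [];

foreach ($requests as $request) {
    // Metrics are recorded for every request, independent of the sampler.
    $metrics['requests']++;

    if ($request['error']) {
        $metrics['errors']++;
    }

    $sampled = isTraceSampled($request['trace_id'], 0.10);

    if (shouldExportSpan($sampled, $request['error'])) {
        $exportedSpans[] = $request['name'];
    }
}
```

Request `b` disappears from the traces but still counts in the metrics, and request `c` is exported despite being skipped by the sampler because it failed.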
Two details I want to call out because they rarely get prioritized. Telemetry::fake() ships with assertions for counters, gauges, histograms, spans, and events, so instrumentation is testable the same way mail is. And when telemetry is disabled, every instrument resolves to a no-op and no event listeners are registered at all. Observability you cannot afford to leave on in every environment is observability you will not have during the incident.
What's next
The provider contract is the foundation for the rest of the family: Queue Autoscale and Queue Monitor will publish their own telemetry through it, guarded by a class_exists check, so neither package ever depends on the other. The Statamic addon, statamic-telemetry, tagged its first alpha alongside this post: content-aware trace names, Stache and static cache instrumentation, and site context on every span. Statamic routes everything through one catch-all route, so without the addon every page on your site collapses into one span name and one latency series. With it:
# without the addon                 # with statamic-telemetry
GET /{segments?}                    GET entry:blog.article
GET /{segments?}                    GET entry:docs.page
GET /{segments?}                    GET term:topics
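The class_exists guard mentioned above for Queue Autoscale and Queue Monitor might look roughly like this. It is a sketch under assumptions: the fully qualified class name and the `publishDecision` helper are placeholders, and only the `event()` call shape comes from the API shown earlier.

```php
<?php
declare(strict_types=1);

// Placeholder class name for this sketch; the real facade's namespace may differ.
const TELEMETRY_FACADE = 'Vendor\\Telemetry\\Facades\\Telemetry';

// Publishes a scaling decision if the telemetry package is installed,
// and quietly does nothing otherwise. Returns whether anything was published.
function publishDecision(int $workers): bool
{
    if (!class_exists(TELEMETRY_FACADE)) {
        return false; // telemetry is optional, so there is no hard dependency
    }

    $telemetry = TELEMETRY_FACADE;
    $telemetry::event('autoscale.decision', ['workers' => $workers]);

    return true;
}

$published = publishDecision(7);
```

Referring to the class by string keeps the publishing package installable on its own: without the telemetry package present, the call is a cheap no-op.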
And I am working on a UI package for the cases where Grafana is too generic: visualizations rendered inside your own app and tailored to what the data actually is (queue timelines, trace waterfalls, scaling decisions) instead of another wall of general-purpose panels. The core package ships four Grafana dashboards today, and Grafana remains the right home for fleet-level monitoring, but some views deserve to live closer to the app.
Try it
composer require cboxdk/laravel-telemetry:^0.1.0-alpha.1
The first alpha is tagged. It is an alpha in the honest sense: the core works, the docs are real, and the API can still move where feedback pushes it. If you run Laravel under FPM and you have ever wanted Prometheus metrics and OTLP traces without operating a collector, this was built for you. Try it, break it, and tell me what you find in the GitHub issues. The documentation covers everything from the five-minute quickstart to the architecture decisions, including the graveyard research this post is based on.