Distributed Tracing & Observability
In a microservices system a single user click can fan out across a dozen services, queues, and databases. When something is slow or broken, logs from one service rarely tell the whole story. Distributed tracing stitches those hops into a single, end-to-end narrative so you can see exactly where a request spent its time and where it failed. OpenTelemetry (OTel) is the vendor-neutral standard for collecting that data, and its Node.js SDK works with a wide range of open-source and commercial tracing backends.
The three pillars of observability
Observability is the ability to ask arbitrary questions about your system’s behavior from the outside. It rests on three complementary signal types, all of which OpenTelemetry can emit.
| Pillar | Question it answers | Example |
|---|---|---|
| Logs | What happened at this moment? | “Payment declined for order A-1001” |
| Metrics | How much / how often, over time? | “p99 latency = 420ms, 30 req/s” |
| Traces | Where did this single request go? | “checkout → inventory → payment took 1.2s” |
Logs are discrete events, metrics are aggregated numbers (recorded through instruments such as counters, histograms, and gauges), and traces follow one request across service boundaries. Together they let you detect a problem (metrics), drill into the offending request (traces), and read the detail (logs).
Spans and traces
A trace represents the full journey of a request. It is composed of spans, where each span is a single unit of work — an HTTP handler, a database query, an outbound call. Spans nest: a parent span (the incoming request) contains child spans for the work it triggers. Every span carries a name, start and end timestamps, attributes (key/value tags), and a status.
Each span belongs to a trace identified by a 16-byte trace ID (32 hex characters), and each span has its own 8-byte span ID (16 hex characters). The combination of traceId, spanId, and trace flags (which carry the sampling decision) forms the trace context that is propagated between services.
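To make the parent/child relationship concrete, here is a minimal sketch of how a flat list of span records could be turned back into a tree. The record shape (parentSpanId, startMs, endMs) and the sample spans are assumptions made for illustration, not an OpenTelemetry schema or the logic of any particular backend.

```javascript
// Hypothetical flat span records. Field names and values are illustrative only.
const TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736";
const spans = [
  { traceId: TRACE_ID, spanId: "00f067aa0ba902b7", parentSpanId: null, name: "POST /checkout", startMs: 0, endMs: 1200 },
  { traceId: TRACE_ID, spanId: "a1b2c3d4e5f60718", parentSpanId: "00f067aa0ba902b7", name: "reserveInventory", startMs: 50, endMs: 400 },
  { traceId: TRACE_ID, spanId: "0f1e2d3c4b5a6978", parentSpanId: "a1b2c3d4e5f60718", name: "SELECT stock", startMs: 80, endMs: 300 },
  { traceId: TRACE_ID, spanId: "1122334455667788", parentSpanId: "00f067aa0ba902b7", name: "chargePayment", startMs: 420, endMs: 1150 },
];

// Group spans by parent ID, then attach children recursively from the root.
function buildTree(records) {
  const childrenOf = new Map();
  let root = null;
  for (const span of records) {
    if (span.parentSpanId === null) {
      root = span;
      continue;
    }
    const siblings = childrenOf.get(span.parentSpanId) ?? [];
    siblings.push(span);
    childrenOf.set(span.parentSpanId, siblings);
  }
  const attach = (span) => ({
    ...span,
    children: (childrenOf.get(span.spanId) ?? []).map(attach),
  });
  return root ? attach(root) : null;
}

// Render the tree as indented lines with each span's duration.
function render(node, depth = 0) {
  const line = `${"  ".repeat(depth)}${node.name} (${node.endMs - node.startMs}ms)`;
  return [line, ...node.children.flatMap((child) => render(child, depth + 1))];
}

console.log(render(buildTree(spans)).join("\n"));
// POST /checkout (1200ms)
//   reserveInventory (350ms)
//     SELECT stock (220ms)
//   chargePayment (730ms)
```

Every record shares one trace ID, and the only link between spans is the parent span ID, which is why that ID has to travel with the request.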
Setting up the SDK
The @opentelemetry/sdk-node package wires everything together, and the auto-instrumentations package patches popular libraries (HTTP, Express, fetch, database drivers) so you get spans without touching business code.
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
Create a tracing.js that initializes OTel before any other module loads:
// tracing.js: must run before any instrumented library is loaded.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  // service.name identifies this service in the tracing UI.
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "checkout-service",
  }),
  // Send spans over OTLP/HTTP to a local collector or backend.
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before exiting; exit even if the flush fails.
process.on("SIGTERM", () => {
  sdk
    .shutdown()
    .catch((err) => console.error("OpenTelemetry shutdown failed", err))
    .finally(() => process.exit(0));
});
Load it first with Node’s --import flag so instrumentation hooks register before your app imports anything:
node --import ./tracing.js server.js
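The order matters: if your HTTP or database libraries are imported before the SDK starts, the auto-instrumentation cannot patch them and their spans will be missing from your traces. Always use --import (or --require ./tracing.cjs for CommonJS). If the application itself is written as ES modules, check OpenTelemetry's ESM support notes, which describe an additional loader hook that may be needed for patching to take effect.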
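Why load order matters can be shown with a deliberately simplified model of patching. This is not how OpenTelemetry hooks module loading internally; httpLib, instrument, and recordedSpans are hypothetical names used only to illustrate that code holding a reference obtained before the patch never sees the wrapper.

```javascript
// A stand-in for a library module (hypothetical, for illustration only).
const httpLib = {
  request(url) {
    return `response from ${url}`;
  },
};

const recordedSpans = [];

// Simplified "instrumentation": wrap the function so each call records a span name.
function instrument(lib) {
  const original = lib.request;
  lib.request = function patchedRequest(url) {
    recordedSpans.push(`HTTP ${url}`);
    return original.call(this, url);
  };
}

// App code that captured the function BEFORE instrumentation ran.
const earlyRequest = httpLib.request;

instrument(httpLib);

// App code that looks the function up AFTER instrumentation ran.
const lateRequest = httpLib.request;

earlyRequest("/inventory"); // bypasses the wrapper, so no span is recorded
lateRequest("/payment"); // goes through the wrapper

console.log(recordedSpans.join(", "));
// HTTP /payment
```

Loading tracing.js first puts every library lookup in the position of lateRequest rather than earlyRequest.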
Creating manual spans
Auto-instrumentation covers the framework boundaries, but you often want spans around your own business logic. Use the tracer API to wrap a block of work.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

// callInventoryService is your own function (not shown here).
export async function reserveInventory(orderId, items) {
  return tracer.startActiveSpan("reserveInventory", async (span) => {
    span.setAttribute("order.id", orderId);
    span.setAttribute("items.count", items.length);
    try {
      const result = await callInventoryService(items);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Attach the error to the span and mark it failed, then rethrow.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      // A span is only exported once it has been ended.
      span.end();
    }
  });
}
startActiveSpan makes the span the active context for everything inside the callback, so any child spans (including auto-instrumented HTTP calls) automatically attach to it.
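The way an "active" span follows asynchronous work can be modelled with Node's AsyncLocalStorage, which is also the primitive the Node SDK's default context manager builds on. The sketch below is a toy model, not OpenTelemetry's implementation; startActiveSpanSketch, the span object shape, and the finished array are assumptions made for the example.

```javascript
import { AsyncLocalStorage } from "node:async_hooks";

// Holds the "current span" for the running async call chain.
const activeSpanStore = new AsyncLocalStorage();
const finished = [];

// Toy stand-in for startActiveSpan: the parent is whatever span is active now,
// and the new span becomes the active one inside the callback.
function startActiveSpanSketch(name, fn) {
  const parent = activeSpanStore.getStore();
  const span = {
    name,
    parentName: parent ? parent.name : null,
    end() {
      finished.push(this);
    },
  };
  return activeSpanStore.run(span, () => fn(span));
}

async function queryStock() {
  return startActiveSpanSketch("queryStock", async (span) => {
    try {
      await new Promise((resolve) => setTimeout(resolve, 5));
      return 3;
    } finally {
      span.end();
    }
  });
}

await startActiveSpanSketch("reserveInventory", async (span) => {
  try {
    await queryStock(); // no span is passed in, yet the parent is still found
  } finally {
    span.end();
  }
});

console.log(finished.map((s) => `${s.name} <- ${s.parentName}`).join("\n"));
// queryStock <- reserveInventory
// reserveInventory <- null
```

The inner function never receives the outer span as an argument; the store carries it across the await, which is the behaviour that lets auto-instrumented calls attach to your manual spans.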
Trace context propagation and correlation IDs
For a trace to span services, the trace context must travel with each outbound request. OpenTelemetry uses the W3C Trace Context standard, injecting a traceparent HTTP header automatically when you make instrumented calls.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace-id                         parent span-id   flags
             version
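The header layout above can be decoded by hand, which is useful when debugging propagation with curl or a proxy log. The following is a minimal sketch for version 00 headers only; parseTraceparent is a hypothetical helper, and in an instrumented service the OpenTelemetry propagator does this for you.

```javascript
// Version 00 layout: version-traceId-parentId-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version ff and all-zero IDs are invalid under W3C Trace Context.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return {
    version,
    traceId,
    parentSpanId,
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

const parsed = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parsed.traceId, parsed.sampled);
// 4bf92f3577b34da6a3ce929d0e0e4736 true
```

A flags value of 00 would yield sampled: false, meaning the caller did not record this trace.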
The receiving service reads that header, continues the same trace, and the backend reconstructs the full tree. The trace ID is effectively your correlation ID: the value you log alongside every message to tie logs back to a trace. Attach it to your logs explicitly:
import { trace } from "@opentelemetry/api";

function logWithTrace(message) {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  console.log(JSON.stringify({
    message,
    traceId: ctx?.traceId ?? "none",
    spanId: ctx?.spanId ?? "none",
  }));
}

logWithTrace("inventory reserved");
Example output when called inside an active span (the IDs will differ per request):
{"message":"inventory reserved","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}
Now a developer can copy the traceId from a log line and open the exact trace in the tracing UI.
Exporting to Jaeger or Zipkin
Spans are useless until they reach a backend you can query. The OTLP exporter above sends to any OTLP-compatible endpoint, such as an OpenTelemetry Collector. Jaeger and Zipkin are two widely used open-source backends. Jaeger accepts OTLP natively; Zipkin ingests its own span format, so it needs a dedicated exporter.
Run Jaeger locally with the all-in-one image, which exposes the OTLP endpoint on port 4318 and a UI on 16686:
docker run --rm -p 16686:16686 -p 4318:4318 \
jaegertracing/all-in-one:latest
For Zipkin, swap the exporter:
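// Requires: npm install @opentelemetry/exporter-zipkin
import { ZipkinExporter } from "@opentelemetry/exporter-zipkin";

// Pass this as the traceExporter option of NodeSDK in place of OTLPTraceExporter.
const traceExporter = new ZipkinExporter({
  url: "http://localhost:9411/api/v2/spans",
});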
| Backend | Ingest endpoint | Protocol | UI port | Strength |
|---|---|---|---|---|
| Jaeger | :4318/v1/traces | OTLP/HTTP | 16686 | Rich trace graph, service map |
| Zipkin | :9411/api/v2/spans | Zipkin JSON | 9411 | Lightweight, simple setup |
| OTel Collector | :4318 | OTLP/HTTP | n/a | Routes to any backend, sampling |
In production, point your services at an OpenTelemetry Collector rather than a backend directly. The collector batches, samples, and can fan traces out to multiple destinations, decoupling your apps from any single vendor.
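The idea behind sampling at the collector can be sketched as a decision made once a trace is complete: keep everything that failed or was slow, and only a small share of the rest. The function below is an illustrative model, not the Collector's tail sampling processor (which is configured in the Collector, not in application code); the span shape ({ durationMs, error }), the thresholds, and the injectable random source are all assumptions.

```javascript
// Decide whether to keep a finished trace. Defaults are illustrative, not recommendations.
function shouldKeepTrace(spans, { slowMs = 1000, baselineRate = 0.05, random = Math.random } = {}) {
  // Always keep traces that contain a failed span.
  if (spans.some((span) => span.error)) return true;

  // Always keep traces whose slowest span crossed the latency threshold.
  const slowest = Math.max(...spans.map((span) => span.durationMs));
  if (slowest >= slowMs) return true;

  // Keep only a small random share of healthy traces.
  return random() < baselineRate;
}

const failed = [{ durationMs: 40, error: true }];
const slow = [{ durationMs: 1200, error: false }, { durationMs: 30, error: false }];
const healthy = [{ durationMs: 80, error: false }];

console.log(shouldKeepTrace(failed)); // true
console.log(shouldKeepTrace(slow)); // true
console.log(shouldKeepTrace(healthy, { random: () => 0.5 })); // false
console.log(shouldKeepTrace(healthy, { random: () => 0.01 })); // true
```

Because the decision needs the whole trace, it has to happen somewhere that sees all spans, which is one reason to route traffic through a collector.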
Best practices
- Initialize the SDK with --import before any other module so auto-instrumentation can patch your libraries.
- Set a clear service.name on every service; without it traces are unattributable in the UI.
- Use startActiveSpan so child spans inherit context automatically instead of becoming orphans.
- Record exceptions and set SpanStatusCode.ERROR on failure so errors surface visually in the trace.
- Send the trace ID into your structured logs to correlate logs, metrics, and traces from one identifier.
- Apply tail-based sampling at the collector in high-traffic systems to control cost while keeping slow and failed traces.
- Avoid putting secrets or high-cardinality data (raw payloads, tokens) into span attributes.
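The last point can be enforced in code rather than by convention. Here is a minimal sketch of a helper that redacts sensitive keys and caps long string values before they are attached to a span; sanitizeAttributes, the deny-list pattern, and the 128-character limit are assumptions chosen for the example and should be adapted to your own data.

```javascript
// Keys treated as sensitive. This deny-list is an assumption; extend it for your data.
const SENSITIVE_KEY = /(password|secret|token|authorization|card)/i;
const MAX_VALUE_LENGTH = 128;

function sanitizeAttributes(attributes) {
  const safe = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (SENSITIVE_KEY.test(key)) {
      // Never record the value of a sensitive key.
      safe[key] = "[REDACTED]";
    } else if (typeof value === "string" && value.length > MAX_VALUE_LENGTH) {
      // Truncate raw payloads and other oversized strings.
      safe[key] = `${value.slice(0, MAX_VALUE_LENGTH)}...`;
    } else {
      safe[key] = value;
    }
  }
  return safe;
}

const attrs = sanitizeAttributes({
  "order.id": "A-1001",
  "payment.card_number": "4111111111111111",
  "http.request.body": "x".repeat(500),
});

console.log(attrs["order.id"]); // A-1001
console.log(attrs["payment.card_number"]); // [REDACTED]
console.log(attrs["http.request.body"].length); // 131
```

The result can then be passed to span.setAttributes(...) in place of the raw object.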