Distributed Tracing & Observability

In a microservices system a single user click can fan out across a dozen services, queues, and databases. When something is slow or broken, logs from one service rarely tell the whole story. Distributed tracing stitches those hops into a single, end-to-end narrative so you can see exactly where a request spent its time and where it failed. OpenTelemetry (OTel) is the vendor-neutral standard for collecting that data and is widely adopted across observability tools; this section uses its Node.js SDK.

The three pillars of observability

Observability is the ability to ask arbitrary questions about your system’s behavior from the outside. It rests on three complementary signal types, all of which OpenTelemetry can emit.

Pillar	Question it answers	Example
Logs	What happened at this moment?	“Payment declined for order A-1001”
Metrics	How much / how often, over time?	“p99 latency = 420ms, 30 req/s”
Traces	Where did this single request go?	“checkout → inventory → payment took 1.2s”

Logs are discrete events, metrics are aggregated numbers (counters, histograms, gauges), and traces follow one request across service boundaries. Together they let you detect a problem (metrics), drill into the offending request (traces), and read the detail (logs).
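To make the split between logs and metrics concrete, here is a small sketch using made-up latencies and short placeholder trace IDs (real ones are 32 hex characters). It collapses ten request events into a metric with a nearest-rank percentile, then uses the trace ID on the outlier to pick the request worth drilling into. The helper names are hypothetical; real metric SDKs typically aggregate into histograms rather than sorting raw samples.

```javascript
// Hypothetical request records from one service over a short window.
// Kept as logs, each is its own event; the short trace IDs are placeholders.
const requests = [
  { traceId: "a01", ms: 35 }, { traceId: "a02", ms: 42 },
  { traceId: "a03", ms: 40 }, { traceId: "a04", ms: 51 },
  { traceId: "a05", ms: 38 }, { traceId: "a06", ms: 420 },
  { traceId: "a07", ms: 45 }, { traceId: "a08", ms: 39 },
  { traceId: "a09", ms: 47 }, { traceId: "a10", ms: 44 },
];

// Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// As a metric, the ten events collapse into a few numbers.
const latencies = requests.map((r) => r.ms);
const metric = {
  count: latencies.length,
  p50: percentile(latencies, 50),
  p99: percentile(latencies, 99),
};
console.log(metric); // { count: 10, p50: 42, p99: 420 }

// The metric says "something is slow"; the trace ID says which request to open.
const slowest = requests.find((r) => r.ms === metric.p99);
console.log(slowest.traceId); // a06
```

With only ten samples the p99 is simply the maximum; over real traffic it separates the slow tail from the typical request.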

Spans and traces

A trace represents the full journey of a request. It is composed of spans, where each span is a single unit of work — an HTTP handler, a database query, an outbound call. Spans nest: a parent span (the incoming request) contains child spans for the work it triggers. Every span carries a name, start and end timestamps, attributes (key/value tags), and a status.

Each span belongs to a trace identified by a 16-byte trace ID, and each span has its own 8-byte span ID. The combination of traceId, spanId, and sampling flags forms the trace context that is propagated between services.
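The ID sizes translate directly into hex strings: 16 bytes become 32 hex characters and 8 bytes become 16. The sketch below uses a hypothetical, simplified span record (not the SDK's internal representation) to show a child inheriting its parent's trace ID while getting a fresh span ID. W3C Trace Context treats an all-zero ID as invalid, so the generator retries in that vanishingly unlikely case.

```javascript
import { randomBytes } from "node:crypto";

// Random lowercase-hex ID; all-zero IDs are invalid in W3C Trace Context, so retry.
function randomHexId(byteLength) {
  let id;
  do {
    id = randomBytes(byteLength).toString("hex");
  } while (/^0+$/.test(id));
  return id;
}

// Hypothetical, simplified span record (not the SDK's internal type).
function createSpan(name, parent) {
  return {
    name,
    traceId: parent ? parent.traceId : randomHexId(16), // 16 bytes -> 32 hex chars
    spanId: randomHexId(8), // 8 bytes -> 16 hex chars
    parentSpanId: parent ? parent.spanId : undefined, // root spans have no parent
    attributes: {},
    startTime: Date.now(),
    endTime: undefined,
    status: "UNSET",
  };
}

const root = createSpan("POST /checkout");
const child = createSpan("SELECT inventory", root);
console.log(root.traceId.length, child.spanId.length); // 32 16
console.log(child.traceId === root.traceId, child.parentSpanId === root.spanId); // true true
```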

Setting up the SDK

The @opentelemetry/sdk-node package wires everything together, and the auto-instrumentations package patches popular libraries (HTTP, Express, fetch, database drivers) so you get spans without touching business code.

npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Create a tracing.js that initializes OTel before any other module loads:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "checkout-service",
  }),
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

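process.on("SIGTERM", () => {
  // Flush buffered spans before exiting; exit even if the flush fails.
  sdk
    .shutdown()
    .catch((err) => console.error("Error shutting down tracing", err))
    .finally(() => process.exit(0));
});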

Load it first with Node’s --import flag so instrumentation hooks register before your app imports anything:

node --import ./tracing.js server.js

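The order matters: if your HTTP or database libraries are imported before the SDK starts, auto-instrumentation cannot patch them and their spans will be missing from your traces. Always preload with --import (or --require ./tracing.cjs for a CommonJS app). If your app is itself an ES module, instrumenting libraries loaded via import also requires registering OpenTelemetry's ESM loader hook; check the OpenTelemetry JS documentation for the step that matches your Node version.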
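Roughly speaking, Node.js instrumentations hook into module loading and patch a library at the moment it is first required or imported. The toy loader below is an analogy for why preload order matters, not the SDK's actual mechanism; load, onLoadHooks, and the module names are all hypothetical.

```javascript
// Toy module loader: a cache plus "on load" hooks (analogy only).
const moduleCache = new Map();
const onLoadHooks = [];

function load(name, factory) {
  if (moduleCache.has(name)) return moduleCache.get(name); // cached: hooks are skipped
  let exports = factory();
  for (const hook of onLoadHooks) exports = hook(name, exports);
  moduleCache.set(name, exports);
  return exports;
}

const makeClient = () => ({ send: (req) => `sent ${req}` });
const recordedSpans = [];

// 1. The app loads the database driver BEFORE instrumentation is registered.
const dbEarly = load("db-driver", makeClient);

// 2. "SDK start": register a hook that wraps send() to record a span.
onLoadHooks.push((name, exports) => ({
  ...exports,
  send: (req) => {
    recordedSpans.push(`${name}.send`);
    return exports.send(req);
  },
}));

// 3. Loading the driver again returns the cached, unpatched copy...
const dbLate = load("db-driver", makeClient);
dbLate.send("SELECT 1");

// ...while a module first loaded after registration is patched.
const httpClient = load("http-client", makeClient);
httpClient.send("GET /inventory");

console.log(dbEarly === dbLate, recordedSpans); // true [ 'http-client.send' ]
```

The database call produces no span at all, which is exactly the "missing spans" symptom of loading tracing.js too late.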

Creating manual spans

Auto-instrumentation covers the framework boundaries, but you often want spans around your own business logic. Use the tracer API to wrap a block of work.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

export async function reserveInventory(orderId, items) {
  return tracer.startActiveSpan("reserveInventory", async (span) => {
    span.setAttribute("order.id", orderId);
    span.setAttribute("items.count", items.length);
    try {
      const result = await callInventoryService(items);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

startActiveSpan makes the span the active context for everything inside the callback, so any child spans (including auto-instrumented HTTP calls) automatically attach to it.
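How does a child span find its parent without being passed one? In Node.js, OpenTelemetry's default context manager builds on AsyncLocalStorage. The toy tracer below mimics that idea with built-in modules only; its startActiveSpan is a hypothetical stand-in with a much smaller surface than the real @opentelemetry/api method, and the finished array stands in for an exporter.

```javascript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomBytes } from "node:crypto";

const activeSpan = new AsyncLocalStorage();
const finished = []; // stands in for an exporter

// Hypothetical toy startActiveSpan: not the OTel API, same core idea.
function startActiveSpan(name, fn) {
  const parent = activeSpan.getStore(); // whatever span is active right now
  const span = {
    name,
    traceId: parent?.traceId ?? randomBytes(16).toString("hex"),
    spanId: randomBytes(8).toString("hex"),
    parentSpanId: parent?.spanId,
    end() {
      finished.push(this);
    },
  };
  // Everything called inside fn, including after awaits, sees `span` as active.
  return activeSpan.run(span, () => fn(span));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

await startActiveSpan("POST /checkout", async (root) => {
  try {
    await sleep(5);
    // No parent is passed explicitly; it is picked up from async context.
    await startActiveSpan("reserveInventory", async (child) => {
      await sleep(5);
      child.end();
    });
  } finally {
    root.end();
  }
});

for (const s of finished) {
  console.log(s.name, "parent:", s.parentSpanId ?? "(root)");
}
```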

Trace context propagation and correlation IDs

For a trace to span services, the trace context must travel with each outbound request. OpenTelemetry uses the W3C Trace Context standard, injecting a traceparent HTTP header automatically when you make instrumented calls.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^                                ^                ^
          version  trace-id                  parent span-id    flags
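A receiving service does roughly the following with that header. This sketch handles only the version-00 layout shown above and only the sampled bit of the flags; parseTraceparent and childTraceparent are hypothetical helpers, and in practice OTel's W3CTraceContextPropagator does this work for you.

```javascript
import { randomBytes } from "node:crypto";

// version-traceid-parentid-flags in lowercase hex (version-00 layout only).
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return null; // version ff is reserved as invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Header for an outbound call: same trace ID, this service's span ID as the new parent.
function childTraceparent(ctx, spanId = randomBytes(8).toString("hex")) {
  return `00-${ctx.traceId}-${spanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(incoming.traceId, incoming.sampled); // 4bf92f3577b34da6a3ce929d0e0e4736 true
console.log(parseTraceparent("not-a-header")); // null
console.log(childTraceparent(incoming));
```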

The receiving service reads that header, continues the same trace, and the backend reconstructs the full tree. The trace ID therefore doubles as your correlation ID, the value you log alongside every message to tie logs back to a trace. Attach it to your logs explicitly:

import { trace } from "@opentelemetry/api";

function logWithTrace(message) {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  console.log(JSON.stringify({
    message,
    traceId: ctx?.traceId ?? "none",
    spanId: ctx?.spanId ?? "none",
  }));
}

logWithTrace("inventory reserved");

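Called inside an active span (for example, from within reserveInventory), this prints something like the following; outside any span, both IDs are "none":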

{"message":"inventory reserved","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}

Now a developer can copy the traceId from a log line and open the exact trace in their UI.

Exporting to Jaeger or Zipkin

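Spans are useless until they reach a backend you can query. The OTLP exporter above works with any backend or collector that speaks OTLP. Jaeger and Zipkin are the two most common open-source backends: recent Jaeger versions accept OTLP natively, while Zipkin uses its own span format and needs the Zipkin exporter (or an OpenTelemetry Collector translating for it).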

Run Jaeger locally with the all-in-one image, which exposes the OTLP endpoint on port 4318 and a UI on 16686:

docker run --rm -p 16686:16686 -p 4318:4318 \
  jaegertracing/all-in-one:latest

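For Zipkin, install @opentelemetry/exporter-zipkin and swap the exporter in tracing.js:

import { ZipkinExporter } from "@opentelemetry/exporter-zipkin";

// Pass this to NodeSDK in place of the OTLPTraceExporter:
// new NodeSDK({ ..., traceExporter })
const traceExporter = new ZipkinExporter({
  url: "http://localhost:9411/api/v2/spans",
});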

Backend	Ingest endpoint	UI port	Strength
Jaeger	`:4318/v1/traces` (OTLP/HTTP)	16686	Rich trace graph, service map
Zipkin	`:9411/api/v2/spans` (Zipkin format)	9411	Lightweight, simple setup
OTel Collector	`:4318` (OTLP/HTTP)	n/a	Routes to any backend, sampling

In production, point your services at an OpenTelemetry Collector rather than a backend directly. The collector batches, samples, and can fan traces out to multiple destinations, decoupling your apps from any single vendor.
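The batching step can be sketched as follows. The SDK's BatchSpanProcessor and the collector's batch processor both build on this idea, but they also flush on a timer and cap their queue size; this hypothetical BatchingExporter flushes only when a batch fills or when flush() is called explicitly, which is why a shutdown hook like the SIGTERM handler in tracing.js matters.

```javascript
// Hypothetical toy batcher: buffers finished spans and exports them in groups.
class BatchingExporter {
  constructor(exportBatch, maxBatchSize = 3) {
    this.exportBatch = exportBatch; // called with an array of spans
    this.maxBatchSize = maxBatchSize;
    this.buffer = [];
  }

  onEnd(span) {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exportBatch(batch);
  }
}

const sent = [];
const exporter = new BatchingExporter((batch) => sent.push(batch.map((s) => s.name)));
for (const name of ["a", "b", "c", "d"]) exporter.onEnd({ name });
exporter.flush(); // e.g. on shutdown, so the partial batch is not lost
console.log(sent); // [ [ 'a', 'b', 'c' ], [ 'd' ] ]
```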

Best practices

Initialize the SDK with --import before any other module so auto-instrumentation can patch your libraries.
Set a clear service.name on every service; without it traces are unattributable in the UI.
Use startActiveSpan so child spans inherit context automatically instead of becoming orphans.
Record exceptions and set SpanStatusCode.ERROR on failure so errors surface visually in the trace.
Send the trace ID into your structured logs so a single identifier links each log line to its trace.
Apply tail-based sampling at the collector in high-traffic systems to control cost while keeping slow and failed traces.
Avoid putting secrets or high-cardinality data (raw payloads, tokens) into span attributes.
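Tail-based sampling decides per trace after its spans have arrived, which is what lets it keep slow and failed traces that an up-front coin flip would drop. The function below is a simplified sketch with hypothetical thresholds and span fields; a real collector also has to wait a configurable time before deciding, because it cannot know when a trace is complete.

```javascript
// Hypothetical tail sampler: keep a trace if it failed, was slow, or wins a random draw.
function tailSample(spans, { slowMs = 1000, keepRate = 0.1, random = Math.random } = {}) {
  // Group spans by trace so each decision sees the whole trace.
  const byTrace = new Map();
  for (const span of spans) {
    if (!byTrace.has(span.traceId)) byTrace.set(span.traceId, []);
    byTrace.get(span.traceId).push(span);
  }

  const kept = [];
  for (const [traceId, traceSpans] of byTrace) {
    const hasError = traceSpans.some((s) => s.status === "ERROR");
    const start = Math.min(...traceSpans.map((s) => s.startMs));
    const end = Math.max(...traceSpans.map((s) => s.endMs));
    const isSlow = end - start >= slowMs;
    if (hasError || isSlow || random() < keepRate) kept.push(traceId);
  }
  return kept;
}

const spans = [
  { traceId: "t1", status: "OK", startMs: 0, endMs: 80 },
  { traceId: "t2", status: "ERROR", startMs: 0, endMs: 50 },
  { traceId: "t3", status: "OK", startMs: 0, endMs: 700 },
  { traceId: "t3", status: "OK", startMs: 650, endMs: 1300 },
];
// A fixed "random" value makes the demo deterministic: t1 loses the draw.
const kept = tailSample(spans, { random: () => 0.5 });
console.log(kept); // [ 't2', 't3' ]
```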