Production Readiness Checklist

Shipping a NestJS service to production is less about new features and more about closing the gaps that only surface under real traffic, real crashes, and real attackers. The list below walks through the concerns every service should address before go-live: structured logging, health checks, graceful shutdown, security headers, rate limiting, observability, and a CI/CD pipeline that can roll back. Treat it as a gate — each item is cheap to add now and expensive to retrofit after the first 3 a.m. incident.

Structured logging

Default console.log output is unparseable at scale. Emit JSON so your log aggregator (Loki, Datadog, CloudWatch) can index fields like traceId, level, and context. A widely used approach is nestjs-pino, which wires Pino into the framework and attaches a request-scoped logger automatically.

npm install nestjs-pino pino-http pino-pretty

// src/app.module.ts
import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.LOG_LEVEL ?? 'info',
        autoLogging: true,
        redact: ['req.headers.authorization', 'req.headers.cookie'],
        transport:
          process.env.NODE_ENV !== 'production'
            ? { target: 'pino-pretty' }
            : undefined,
      },
    }),
  ],
})
export class AppModule {}

In main.ts, replace the default logger so framework messages also flow through Pino:

// src/main.ts
import { NestFactory } from '@nestjs/core';
import { Logger } from 'nestjs-pino';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule, { bufferLogs: true });
  app.useLogger(app.get(Logger));
  await app.listen(3000);
}
bootstrap();

Sample output (abridged; real pino-http lines also carry fields such as pid, hostname, and res):

{"level":30,"time":1718323200000,"req":{"id":"f3a1","method":"GET","url":"/orders"},"msg":"request completed","responseTime":12}

Redact secrets at the logger, not in application code. A single un-redacted Authorization header in your logs is a credential leak that survives in cold storage for months.
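To make the effect of the redact option concrete, here is a minimal sketch of path-based censoring applied to a log record. It is not Pino's implementation (Pino handles redaction internally); the redact function and LogRecord type are hypothetical names used only for illustration.

```typescript
// Illustrative sketch only: shows what path-based redaction does to a record.
type LogRecord = Record<string, unknown>;

const CENSOR = '[Redacted]'; // same placeholder string Pino uses by default

function redact(record: LogRecord, paths: string[]): LogRecord {
  // Deep-copy so the caller's object is never mutated.
  const copy = JSON.parse(JSON.stringify(record)) as LogRecord;
  for (const path of paths) {
    const keys = path.split('.');
    let node: unknown = copy;
    // Walk to the parent of the final key; give up if the path is absent.
    for (const key of keys.slice(0, -1)) {
      if (typeof node !== 'object' || node === null) {
        node = undefined;
        break;
      }
      node = (node as LogRecord)[key];
    }
    const last = keys[keys.length - 1];
    // Only censor keys that actually exist; never add new ones.
    if (typeof node === 'object' && node !== null && last in (node as LogRecord)) {
      (node as LogRecord)[last] = CENSOR;
    }
  }
  return copy;
}
```

Called with the two paths from the configuration above, a record whose req.headers.authorization is a bearer token comes back with that field replaced by the placeholder, while other headers are left untouched.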

Health checks

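Orchestrators (Kubernetes, ECS, load balancers) decide whether to route traffic based on health endpoints. Use @nestjs/terminus to expose separate liveness and readiness probes, and make the readiness probe verify downstream dependencies instead of returning a hardcoded 200. The controller below must be registered in a module that imports TerminusModule, and TypeOrmHealthIndicator additionally requires a configured @nestjs/typeorm connection.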

npm install @nestjs/terminus

// src/health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
  MemoryHealthIndicator,
} from '@nestjs/terminus';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
    private readonly memory: MemoryHealthIndicator,
  ) {}

  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', { timeout: 1500 }),
      () => this.memory.checkHeap('memory_heap', 256 * 1024 * 1024),
    ]);
  }

  @Get('live')
  @HealthCheck()
  liveness() {
    return this.health.check([]);
  }
}

Keep live cheap (process is up) and ready thorough (dependencies reachable). A failing readiness probe should pull the pod out of rotation without killing it.
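The aggregation behind a readiness check can be sketched in plain TypeScript. This is not how Terminus is implemented; checkReadiness, withTimeout, and the Indicator type are hypothetical names, and the sketch only illustrates the rule that one failed or timed-out dependency makes the whole check fail.

```typescript
// Illustrative sketch only: every indicator must succeed within the timeout.
type Indicator = () => Promise<void>;
type HealthResult = {
  status: 'ok' | 'error';
  details: Record<string, 'up' | 'down'>;
};

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function checkReadiness(
  indicators: Record<string, Indicator>,
  timeoutMs: number,
): Promise<HealthResult> {
  const details: Record<string, 'up' | 'down'> = {};
  // Run all indicators concurrently so one slow dependency does not delay the rest.
  await Promise.all(
    Object.keys(indicators).map(async (name) => {
      try {
        await withTimeout(indicators[name](), timeoutMs);
        details[name] = 'up';
      } catch {
        details[name] = 'down';
      }
    }),
  );
  const allUp = Object.keys(details).every((name) => details[name] === 'up');
  return { status: allUp ? 'ok' : 'error', details };
}
```

An empty indicator map yields status ok, which mirrors the cheap liveness probe: it only proves the process can answer.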

Graceful shutdown

When a pod is terminated, in-flight requests must finish and connections must close cleanly. Enable Nest’s shutdown hooks and react to lifecycle events so you drain work instead of dropping it.

// src/main.ts (excerpt)
const app = await NestFactory.create(AppModule);
app.enableShutdownHooks();

// src/queue/queue.service.ts
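import { Injectable, Logger, OnApplicationShutdown } from '@nestjs/common';

@Injectable()
export class QueueService implements OnApplicationShutdown {
  // Use Nest's Logger so shutdown messages go through the same structured pipeline.
  private readonly logger = new Logger(QueueService.name);

  // Called by Nest after a termination signal, once shutdown hooks are enabled.
  async onApplicationShutdown(signal?: string) {
    this.logger.log(`Draining queue, received ${signal ?? 'no signal'}`);
    await this.closeConnections();
  }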

  private async closeConnections(): Promise<void> {
    // close DB pools, flush buffers, ack pending messages
  }
}

Set the pod's terminationGracePeriodSeconds (a pod-level field in Kubernetes, not a per-container one) longer than your slowest request plus the time needed to drain. If the platform sends SIGKILL before draining finishes, graceful shutdown is meaningless.
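For work that is not tied to an HTTP request, such as queue jobs, draining means waiting for in-flight tasks with a deadline. The sketch below shows one way to do that; InFlightTracker is a hypothetical helper, not part of NestJS, and the deadline would normally sit a little below the grace period.

```typescript
// Illustrative sketch only: count in-flight tasks and wait for them with a deadline.
class InFlightTracker {
  private active = 0;
  private waiters: Array<() => void> = [];

  // Wrap a unit of work so the tracker knows when it starts and ends.
  async track<T>(work: () => Promise<T>): Promise<T> {
    this.active++;
    try {
      return await work();
    } finally {
      this.active--;
      if (this.active === 0) {
        // Wake everyone waiting in drain() and clear the list.
        this.waiters.splice(0).forEach((wake) => wake());
      }
    }
  }

  // Resolves true once nothing is in flight, or false if the deadline passes first.
  drain(deadlineMs: number): Promise<boolean> {
    if (this.active === 0) return Promise.resolve(true);
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), deadlineMs);
      this.waiters.push(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
  }
}
```

A service could call drain from onApplicationShutdown and log a warning when it returns false, since that means work was still running when the deadline expired.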

Security headers and rate limiting

Add helmet for sensible default headers (HSTS, X-Content-Type-Options, CSP) and @nestjs/throttler to blunt brute-force and scraping. Enable CORS explicitly rather than relying on permissive defaults.

npm install helmet @nestjs/throttler

// src/main.ts (excerpt)
import helmet from 'helmet';

app.use(helmet());
app.enableCors({ origin: ['https://app.example.com'], credentials: true });

// src/app.module.ts (excerpt)
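import { Module } from '@nestjs/common';
import { APP_GUARD } from '@nestjs/core';
import { ThrottlerModule, ThrottlerGuard } from '@nestjs/throttler';

@Module({
  imports: [
    // @nestjs/throttler v5+: ttl is in milliseconds, so this allows 100 requests per minute per client.
    ThrottlerModule.forRoot([{ ttl: 60_000, limit: 100 }]),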
  ],
  providers: [{ provide: APP_GUARD, useClass: ThrottlerGuard }],
})
export class AppModule {}
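The meaning of ttl and limit is easiest to see in a simplified fixed-window counter. The sketch below is not the throttler's actual storage logic, which differs in detail; FixedWindowLimiter is a hypothetical class that only demonstrates the counting rule.

```typescript
// Illustrative sketch only: allow at most `limit` hits per key within each ttl window.
class FixedWindowLimiter {
  private readonly hits = new Map<string, { count: number; resetAt: number }>();
  private readonly ttlMs: number;
  private readonly limit: number;

  constructor(ttlMs: number, limit: number) {
    this.ttlMs = ttlMs;
    this.limit = limit;
  }

  // `now` is injectable so the behaviour can be checked without waiting.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now >= entry.resetAt) {
      // First hit for this key, or the previous window has expired: start a new window.
      this.hits.set(key, { count: 1, resetAt: now + this.ttlMs });
      return true;
    }
    entry.count++;
    return entry.count <= this.limit;
  }
}
```

With ttl 60_000 and limit 100, the 101st call for the same key inside one minute returns false; a real guard would answer that request with HTTP 429.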

Observability

Logs answer “what happened”; metrics and traces answer “where and why”. Export OpenTelemetry traces and a Prometheus metrics endpoint so you can correlate a slow request across services.

Signal	Tool	Endpoint / sink
Logs	nestjs-pino	stdout to Loki / Datadog
Metrics	prom-client + /metrics	Prometheus scrape
Traces	@opentelemetry/sdk-node	OTLP collector
Errors	Sentry SDK	Sentry project
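For the metrics row, it helps to know what a /metrics endpoint actually serves. The sketch below renders a counter in the Prometheus text exposition format; in a real service prom-client would do this for you, and the Counter class here is a hypothetical stand-in that skips label-value escaping.

```typescript
// Illustrative sketch only: a labelled counter rendered as Prometheus text.
class Counter {
  private readonly values = new Map<string, number>();
  private readonly name: string;
  private readonly help: string;

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set.
  inc(labels: Record<string, string>): void {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.keys(labels)
      .sort()
      .map((label) => `${label}="${labels[label]}"`)
      .join(',');
    this.values.set(key, (this.values.get(key) ?? 0) + 1);
  }

  // Produce the body a scrape of /metrics would return for this counter.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, labels) => {
      lines.push(`${this.name}{${labels}} ${value}`);
    });
    return lines.join('\n') + '\n';
  }
}
```

Each distinct label combination becomes its own line in the output, which is why high-cardinality labels such as user IDs should stay out of metrics.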

// src/tracing.ts — imported first in main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
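Correlating a request across services relies on a trace identifier travelling with it, typically in the W3C traceparent header. The instrumentation handles this automatically; the sketch below only shows what the header carries, so the same traceId can be attached to log lines. parseTraceparent and TraceContext are hypothetical names, and the sketch accepts version 00 only.

```typescript
// Illustrative sketch only: extract the trace ID from a W3C traceparent header.
interface TraceContext {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  const match = header ? TRACEPARENT.exec(header.trim()) : null;
  if (!match) return null;
  const traceId = match[1];
  const parentId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest bit of the flags byte marks the trace as sampled.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A malformed or missing header returns null, in which case a service would start a new trace rather than guess at an identifier.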

CI/CD with rollbacks

A deploy you cannot reverse is a liability. Build an immutable image, run the full test suite, deploy with a strategy that keeps the previous version available, and verify health before shifting traffic.

# .github/workflows/deploy.yml (excerpt)
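- run: npm ci
- run: npm run test
- run: npm run build
- run: docker build -t app:${{ github.sha }} .
# (pushing the image to a registry the cluster can pull from is omitted in this excerpt)
- run: kubectl set image deployment/app app=app:${{ github.sha }}
- id: rollout
  run: kubectl rollout status deployment/app --timeout=120s
# Roll back only when the rollout itself failed, not when tests or the build did.
- if: failure() && steps.rollout.conclusion == 'failure'
  run: kubectl rollout undo deployment/app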

Tag images by commit SHA (never latest) so a rollback is a one-command pointer change to a known-good build.
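Verifying health before shifting traffic usually comes down to polling the readiness endpoint a bounded number of times. The sketch below shows that loop with the probe injected as a function; waitUntilReady and Probe are hypothetical names, and the attempt count and delay are placeholders to tune for your service.

```typescript
// Illustrative sketch only: poll a readiness probe until it passes or attempts run out.
type Probe = () => Promise<boolean>;

async function waitUntilReady(
  probe: Probe,
  attempts: number,
  delayMs: number,
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // A thrown error (for example, connection refused) counts as "not ready yet".
    }
    // Wait between attempts, but not after the last one.
    if (attempt < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

In a smoke-test script the probe could wrap an HTTP GET against the new version's /health/ready route and return whether the response was successful; a false result is the signal to trigger the rollback.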

Best Practices

Emit structured JSON logs with secrets redacted at the logger, and route framework logs through the same pipeline.
Expose separate liveness and readiness probes via Terminus; readiness must verify real dependencies.
Call enableShutdownHooks() and drain in-flight work in onApplicationShutdown, with a grace period long enough to finish it.
Apply helmet, explicit CORS, and a ThrottlerGuard as global defaults before exposing any public route.
Instrument metrics and distributed traces, not just logs, so incidents are diagnosable across services.
Deploy immutable SHA-tagged images through CI that runs tests, gates on rollout status, and can roll back with kubectl rollout undo.
Treat this checklist as a release gate and re-run it whenever you add a new external dependency.