When your Node.js app crashes at 3am, there are two scenarios: you find out when a user complains, or your monitoring finds out first and restarts the process before anyone notices.

This guide covers the monitoring stack that matters — from basic process-level metrics to application-level observability.

What to Monitor

Not all monitoring is equal. Start with what actually causes incidents:

Process-level:

Is the process alive?
CPU usage (sustained high = bug, spike = expected load)
Memory usage (gradual growth = memory leak)
Restart count (frequent restarts = crash loop)
Uptime (low uptime = instability)

Application-level:

HTTP response time (p50, p95, p99)
Error rate (5xx responses)
Request throughput
Database query latency
Queue depth (if applicable)

Infrastructure-level:

Disk space (logs can fill it)
Network I/O
File descriptor count (open connections)

Start with process-level. It’s free and catches most incidents.

Process-Level Monitoring with Oxmgr

Oxmgr exposes live process metrics via the CLI:

oxmgr status

NAME      PID    STATUS    CPU    MEM      RESTARTS  UPTIME
api       14892  running   2.1%   128 MB   0         3d 14h
worker    14901  running   0.3%   64 MB    2         2d 8h
scheduler 14910  running   0.0%   48 MB    0         3d 14h

For continuous monitoring, pipe to watch:

watch -n 2 oxmgr status

For scripted alerting, use oxmgr status --json:

oxmgr status --json | jq '.[] | select(.restarts > 5)'

Configure automatic restarts and restart limits in oxfile.toml:

[processes.api]
command = "node dist/server.js"
restart_on_exit = true
restart_delay_ms = 1000   # wait 1s before restarting
max_restarts = 10          # stop after 10 restarts (crash loop protection)

[processes.api.health_check]
endpoint = "http://localhost:3000/health"
interval_secs = 30
timeout_secs = 5
unhealthy_threshold = 3   # restart after 3 consecutive failures

When max_restarts is hit, Oxmgr stops restarting and alerts you via the status command rather than thrashing the system.

Exposing Metrics from Your App

Your application should expose its own metrics. The most portable format is Prometheus — a pull-based metrics system that almost every monitoring tool understands.

Install prom-client:

npm install prom-client

Add a /metrics endpoint:

import express from 'express';
import { register, collectDefaultMetrics, Counter, Histogram, Gauge } from 'prom-client';

const app = express();

// Collect default Node.js metrics (event loop lag, GC, heap, etc.)
collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});

const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections'
});

// Middleware to record metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestDuration.observe(
      { method: req.method, route, status_code: res.statusCode },
      duration
    );
    httpRequestTotal.inc(
      { method: req.method, route, status_code: res.statusCode }
    );
    activeConnections.dec();
  });

  next();
});

// Metrics endpoint (restrict access in production)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

The default metrics alone give you: heap size, GC pause times, event loop lag, open file descriptors, and active handles — everything you need to spot memory leaks and event loop blocking.

Health Check Endpoint

Every monitored app needs a /health endpoint. This is different from /metrics — health is a binary ready/not-ready signal; metrics are time-series data.

app.get('/health', async (req, res) => {
  const health = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: process.env.npm_package_version,
    checks: {}
  };

  // Check database connection
  try {
    await db.raw('SELECT 1');
    health.checks.database = { status: 'ok' };
  } catch (err) {
    health.checks.database = { status: 'error', message: err.message };
    health.status = 'degraded';
  }

  // Check Redis
  try {
    await redis.ping();
    health.checks.redis = { status: 'ok' };
  } catch (err) {
    health.checks.redis = { status: 'error', message: err.message };
    // Decide if Redis failure makes the whole app unhealthy
  }

  // Check memory (warn if >80% of limit)
  const used = process.memoryUsage().heapUsed;
  const total = process.memoryUsage().heapTotal;
  const ratio = used / total;
  health.checks.memory = {
    status: ratio > 0.9 ? 'warning' : 'ok',
    usedMB: Math.round(used / 1024 / 1024),
    totalMB: Math.round(total / 1024 / 1024),
    ratio: ratio.toFixed(2)
  };

  const statusCode = health.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(health);
});

Alerting Without a Full Stack

If you don’t have Prometheus + Grafana + Alertmanager yet, you can get process-level alerts with a simple bash script and a cron job.

Restart count alert:

#!/bin/bash
# /usr/local/bin/check-processes.sh

WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
RESTART_THRESHOLD=5

oxmgr status --json | jq -c '.[]' | while read -r process; do
  name=$(echo "$process" | jq -r '.name')
  restarts=$(echo "$process" | jq -r '.restarts')
  status=$(echo "$process" | jq -r '.status')

  if [ "$status" != "running" ]; then
    curl -s -X POST "$WEBHOOK_URL" 
      -H 'Content-type: application/json' 
      -d "{"text": "🔴 Process *$name* is *$status* — check immediately"}"
  elif [ "$restarts" -gt "$RESTART_THRESHOLD" ]; then
    curl -s -X POST "$WEBHOOK_URL" 
      -H 'Content-type: application/json' 
      -d "{"text": "⚠️ Process *$name* has restarted *$restarts* times"}"
  fi
done

Add to cron to run every minute:

* * * * * /usr/local/bin/check-processes.sh

Monitoring Memory Leaks

Memory leaks in Node.js are subtle — the heap grows slowly, performance degrades, eventually the process crashes with FATAL ERROR: Reached heap limit.

Watch the trend, not the absolute number:

# Log memory every minute
while true; do
  echo "$(date) $(oxmgr status --json | jq '.[] | select(.name=="api") | .memory')"
  sleep 60
done >> /var/log/api-memory.log

If memory grows by more than 10-20 MB/hour without traffic growth, you have a leak.

Node.js heap snapshot for investigation:

import { writeHeapSnapshot } from 'node:v8';
import { createServer } from 'node:http';

// Add a diagnostic endpoint (protect with auth in production)
app.get('/debug/heap', (req, res) => {
  const filename = writeHeapSnapshot();
  res.json({ filename, size: fs.statSync(filename).size });
});

Load the snapshot in Chrome DevTools → Memory → Load to find what’s holding references.

Event Loop Lag Monitoring

A blocked event loop means Node.js can’t process new requests. Requests pile up, latency spikes, and the app appears frozen while technically running.

import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Log every 30 seconds
setInterval(() => {
  const lag = histogram.mean / 1e6; // convert nanoseconds to milliseconds

  if (lag > 100) {
    console.error(`Event loop lag: ${lag.toFixed(1)}ms — INVESTIGATE`);
  } else if (lag > 10) {
    console.warn(`Event loop lag: ${lag.toFixed(1)}ms`);
  }

  histogram.reset();
}, 30_000);

Event loop lag above 100ms usually means:

Synchronous work in request handlers (JSON.parse on huge payloads, regex on long strings)
Blocking I/O on the main thread
CPU-intensive computation that should be in a worker thread

Log-Based Monitoring

Structured logs are cheap, searchable, and work everywhere. Use JSON:

const log = {
  info: (msg, data = {}) => console.log(JSON.stringify({ level: 'info', msg, ...data, ts: Date.now() })),
  error: (msg, data = {}) => console.error(JSON.stringify({ level: 'error', msg, ...data, ts: Date.now() })),
  warn: (msg, data = {}) => console.warn(JSON.stringify({ level: 'warn', msg, ...data, ts: Date.now() }))
};

// In request handler
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    log.info('request', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      durationMs: Date.now() - start,
      ip: req.ip
    });
  });
  next();
});

Pipe logs to journald via Oxmgr:

[processes.api]
command = "node dist/server.js"
log_file = "/var/log/api/app.log"
error_log_file = "/var/log/api/error.log"

Then query with:

# Last 100 errors
grep '"level":"error"' /var/log/api/app.log | tail -100 | jq .

# Requests slower than 500ms
cat /var/log/api/app.log | jq 'select(.durationMs > 500)'

Summary

A minimum viable monitoring setup for a production Node.js app:

Process health — Oxmgr status, restart limits, max_restarts
Health endpoint — /health that checks real dependencies
Prometheus metrics — prom-client with default metrics + custom request metrics
Structured JSON logs — searchable, parseable, cheap
Restart alerts — cron script or process manager webhook

This baseline catches 90% of production incidents before users do. Add Grafana dashboards and proper alerting when traffic justifies the operational overhead.

See the Oxmgr health check docs for the full health check configuration reference.