Auth observability: the metrics and alerts that catch incidents before users do

An authentication system without observability is a black box: you learn about attacks and failures only when users report problems or when you review logs after the fact. Proper instrumentation (the right metrics, dashboards, and alerts) means your team learns about anomalies in minutes instead of hours and has the context to understand what happened. The metrics that matter for auth differ from those for a typical application API.

Core rate metrics

Track these counters per endpoint, broken down by outcome:

Login attempts / login successes / login failures: the login failure rate is your most important security signal. A sudden spike in failures (for example, from a 5% baseline to 40%) often indicates credential stuffing. Track failures both as an absolute rate and as a ratio of total attempts.
MFA challenge rate: what percentage of logins trigger an MFA challenge. Unexpected drops (MFA being bypassed?) or spikes (something triggering elevated risk scores?) both warrant investigation.
Token issuance rate: access tokens and refresh tokens issued per minute. A sudden spike can indicate a compromised client issuing tokens rapidly.
Token validation failures: invalid signature, expired, revoked. A spike in expired token validations might indicate a client-side bug; a spike in revoked token validations might indicate an incident response in progress.
Password reset requests: a sudden spike can indicate an account takeover campaign, or a credential stuffing attack moving into its next phase.

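// Prometheus metrics for auth system (Node.js)
import express from 'express';
import { Counter, Histogram, Gauge, register } from 'prom-client';

const app = express();
app.use(express.json());  // populate req.body for JSON requests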

const loginAttempts = new Counter({
  name: 'auth_login_attempts_total',
  help: 'Total login attempts',
  labelNames: ['outcome', 'method']  // outcome: success|failure|locked|challenge|error
});

const tokenIssuances = new Counter({
  name: 'auth_token_issuances_total',
  help: 'Total tokens issued',
  labelNames: ['grant_type', 'client_type']
});

const tokenValidations = new Counter({
  name: 'auth_token_validations_total',
  help: 'Total token validation attempts',
  labelNames: ['outcome']  // valid|expired|revoked|invalid_signature
});

const loginLatency = new Histogram({
  name: 'auth_login_duration_seconds',
  help: 'Login request duration',
  labelNames: ['outcome'],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5]
});

const activeSessions = new Gauge({
  name: 'auth_active_sessions',  // gauge, so no _total suffix (by convention reserved for counters)
  help: 'Currently active user sessions',
  labelNames: ['session_type']
});

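// Instrument your login handler (handleLogin is your application's own login logic)
app.post('/auth/login', async (req, res, next) => {
  const end = loginLatency.startTimer();
  try {
    const result = await handleLogin(req.body);
    loginAttempts.inc({ outcome: result.outcome, method: result.method });
    end({ outcome: result.outcome });
    res.json(result.response);
  } catch (err) {
    loginAttempts.inc({ outcome: 'error', method: 'unknown' });
    end({ outcome: 'error' });
    next(err);  // pass to Express error middleware; a throw in an async handler is not caught by Express 4
  }
});

// Expose the default registry so Prometheus can scrape it
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});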

Alert thresholds

Configure alerts that trigger when metrics deviate significantly from baseline. The exact thresholds depend on your traffic volume, but here are the signals worth alerting on:

# Prometheus alerting rules (evaluated by Prometheus; firing alerts are routed by Alertmanager)
groups:
- name: auth_security
  rules:
  - alert: LoginFailureRateHigh
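    # sum() aggregates away the outcome and method labels; without it the two
    # sides only match on identical label sets and the ratio is always 1
    expr: |
      sum(rate(auth_login_attempts_total{outcome="failure"}[5m]))
      /
      sum(rate(auth_login_attempts_total[5m])) > 0.20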
    for: 2m
    annotations:
      summary: "Login failure rate above 20% for 2 minutes"
      description: "Possible credential stuffing attack. Current rate: {{ $value | humanizePercentage }}"

  - alert: TokenValidationErrorSpike
    expr: |
      rate(auth_token_validations_total{outcome="invalid_signature"}[5m]) > 10
    for: 1m
    annotations:
      summary: "Token signature validation failures spiking"
      description: "May indicate key rotation issue or token forgery attempt"

  - alert: PasswordResetSpike
    expr: |
      rate(auth_password_resets_total[15m]) > 3 * avg_over_time(rate(auth_password_resets_total[15m])[24h:15m])
    for: 5m
    annotations:
      summary: "Password reset rate 3x above 24-hour average"

  - alert: MfaBypassDetected
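    # sum() on both sides: the two metrics are unlikely to share identical
    # label sets, so an unaggregated division would match few or no series
    expr: |
      sum(rate(auth_mfa_challenges_total[10m]))
      /
      sum(rate(auth_login_attempts_total{outcome="success"}[10m])) < 0.5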
    for: 10m
    annotations:
      summary: "MFA challenge rate unexpectedly low — possible bypass"
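
The rules above compare a ratio against a fixed threshold or a rate against a multiple of its trailing average. As a rough sketch of the same two checks outside Prometheus (for example, when replaying historical counts to pick thresholds), the logic could look like the following. The names `failureRatio` and `exceedsBaseline` are hypothetical helpers, not part of any library.

```typescript
// Counts observed in one evaluation window (hypothetical shape).
interface WindowCounts {
  failures: number;
  total: number;
}

// Failure ratio for a window; an empty window is treated as 0 rather than NaN.
function failureRatio(w: WindowCounts): number {
  return w.total === 0 ? 0 : w.failures / w.total;
}

// True when the current value is more than `multiplier` times the mean of the
// historical samples. With no history there is no baseline, so nothing fires.
function exceedsBaseline(current: number, history: number[], multiplier = 3): boolean {
  if (history.length === 0) return false;
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  return current > multiplier * mean;
}
```

Replaying a few weeks of real counts through checks like these is one way to see how often a candidate threshold would have fired before it goes into an alerting rule.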

Geographic and behavioral anomalies

Aggregate metrics tell you about volume anomalies. Per-user behavioral signals tell you about account-level anomalies. The most valuable signals:

Login from new country: first login from a country this account has never used. Emit an event; send a verification email for high-risk country pairs.
Impossible travel: a login from City A followed by a login from City B sooner than anyone could travel between them. This points to session sharing, a VPN or proxy, or account compromise.
Credential reuse after breach: check passwords against the Have I Been Pwned (HIBP) breached-password list at login time and flag matches for a forced reset.
Unusual access time: login at 3am for a user whose historical pattern is business hours. Lower severity, but worth logging.
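
Two of these signals, new country and unusual access time, need only a small per-user history. The sketch below is one way they might be computed. The `LoginHistory` shape, the `behavioralSignals` function, and the 20-login and 2% cutoffs are all assumptions for illustration, not values from any standard.

```typescript
// Hypothetical per-user history, built from past successful logins.
interface LoginHistory {
  countries: Set<string>;  // country codes seen before
  hourCounts: number[];    // 24 entries: past logins per UTC hour of day
}

interface Signal {
  type: 'new_country' | 'unusual_hour';
  detail: string;
}

function behavioralSignals(
  history: LoginHistory,
  country: string,
  at: Date,
  minShare = 0.02  // assumed cutoff: hours holding under 2% of past logins
): Signal[] {
  const signals: Signal[] = [];

  // New country: skip brand-new accounts that have no history yet.
  if (history.countries.size > 0 && !history.countries.has(country)) {
    signals.push({ type: 'new_country', detail: country });
  }

  // Unusual hour: only judge once there is enough history to show a pattern.
  const total = history.hourCounts.reduce((sum, n) => sum + n, 0);
  const hour = at.getUTCHours();
  const share = total === 0 ? 0 : (history.hourCounts[hour] ?? 0) / total;
  if (total >= 20 && share < minShare) {
    signals.push({ type: 'unusual_hour', detail: `${hour}:00 UTC` });
  }

  return signals;
}
```

A production version would likely bucket hours in the user's local time zone rather than UTC, and would weight each signal differently when feeding a risk score.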

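// Impossible travel detection
// db, geoip, and haversineDistance are assumed to be provided by your application.
interface TravelRisk {
  risk: 'low' | 'high' | 'unknown';
  reason?: string;
  details?: { from: string; to: string; distanceKm: number; timeDeltaHours: number };
}

async function checkImpossibleTravel(
  userId: string,
  currentIp: string,
  currentTime: Date
): Promise<TravelRisk> {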
  const lastLogin = await db.loginEvents.findLast(userId, { success: true });
  if (!lastLogin) return { risk: 'low' };

  const timeDeltaHours = (currentTime.getTime() - lastLogin.occurred_at.getTime()) / 3600000;
  if (timeDeltaHours > 24) return { risk: 'low' };  // more than a day ago

  const lastGeo = await geoip.lookup(lastLogin.ip_address);
  const currentGeo = await geoip.lookup(currentIp);

  if (!lastGeo || !currentGeo) return { risk: 'unknown' };

  const distanceKm = haversineDistance(
    lastGeo.latitude, lastGeo.longitude,
    currentGeo.latitude, currentGeo.longitude
  );

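  const maxPossibleSpeed = 900;  // km/h, roughly commercial aviation cruise speed
  const maxPossibleDistance = maxPossibleSpeed * timeDeltaHours;
  const minDistanceKm = 100;  // assumed tolerance for geo-IP error; tune for your data

  if (distanceKm > minDistanceKm && distanceKm > maxPossibleDistance) {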
    return {
      risk: 'high',
      reason: 'impossible_travel',
      details: {
        from: lastGeo.city,
        to: currentGeo.city,
        distanceKm: Math.round(distanceKm),
        timeDeltaHours: Math.round(timeDeltaHours * 10) / 10
      }
    };
  }

  return { risk: 'low' };
}
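
The function above relies on a `haversineDistance` helper that is not shown. A minimal sketch of one, assuming coordinates in decimal degrees and a spherical Earth with a mean radius of 6371 km, might look like this:

```typescript
const EARTH_RADIUS_KM = 6371;  // mean Earth radius; the sphere model is an approximation

// Great-circle distance in kilometres between two points given in decimal degrees.
function haversineDistance(
  lat1: number, lon1: number,
  lat2: number, lon2: number
): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);

  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;

  // Clamp to 1 to guard against floating-point overshoot near antipodal points.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(a)));
}
```

The spherical model is off by a fraction of a percent compared with an ellipsoidal one, which is far smaller than the error in typical geo-IP data, so it is usually accurate enough for this check.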

SIEM integration

Enterprise customers often require auth events to flow into their SIEM (Splunk, Microsoft Sentinel, Elastic, etc.) for correlation with other security events. The standard integration patterns are webhook delivery, syslog forwarding, and direct API export.

Normalize your events to a common schema before export. CEF (Common Event Format) and OCSF (Open Cybersecurity Schema Framework) are two widely supported schemas. Key fields for auth events: timestamp, event type, actor, target, outcome, source IP, user agent, and a correlation ID that ties the auth event to downstream application events.

// OCSF-formatted auth event for SIEM export
{
  "class_uid": 3002,          // Authentication Activity
  "category_uid": 3,          // Identity & Access Management
  "activity_id": 1,           // Logon
  "time": 1644832800000,      // epoch milliseconds
  "severity_id": 1,           // Informational
  "status_id": 1,             // Success
  "user": {
    "uid": "user_abc123",
    "email_addr": "alice@example.com",
    "name": "Alice Smith"
  },
  "auth_protocol_id": 6,      // OAuth 2.0
  "logon_type_id": 12,        // Cached Remote Interactive
  "src_endpoint": {
    "ip": "203.0.113.5",
    "location": { "country": "US", "city": "San Francisco" }
  },
  "metadata": {
    "version": "1.0.0",
    "product": { "name": "Bastionary", "vendor_name": "SummitFlux LLC" },
    "log_name": "auth.login"
  }
}
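
For SIEMs that ingest CEF instead, the same key fields can be flattened into a single pipe-delimited line. The sketch below is an illustration only: the `AuthEvent` shape, the `toCef` function, the severity mapping, and the use of `externalId` for the correlation ID are assumptions. The header layout and the extension keys (`rt`, `src`, `suser`, `outcome`, `requestClientApplication`, `externalId`) follow the CEF format.

```typescript
// Hypothetical internal event shape carrying the key fields listed above.
interface AuthEvent {
  timestamp: Date;
  eventType: string;  // e.g. "auth.login"
  actor: string;
  outcome: 'success' | 'failure';
  sourceIp: string;
  userAgent: string;
  correlationId: string;
}

// CEF header fields: escape backslash and pipe.
const escHeader = (s: string): string =>
  s.replace(/\\/g, '\\\\').replace(/\|/g, '\\|');

// CEF extension values: escape backslash and equals, encode newlines.
const escExt = (s: string): string =>
  s.replace(/\\/g, '\\\\').replace(/=/g, '\\=').replace(/\r?\n/g, '\\n');

function toCef(e: AuthEvent, vendor: string, product: string, version: string): string {
  const severity = e.outcome === 'failure' ? 5 : 1;  // assumed mapping on the 0-10 scale

  // Header: CEF:0|Vendor|Product|Version|Event Class ID|Name|Severity
  const header = [vendor, product, version, e.eventType, e.eventType, String(severity)]
    .map(escHeader)
    .join('|');

  const fields: Array<[string, string]> = [
    ['rt', String(e.timestamp.getTime())],  // receipt time, epoch milliseconds
    ['src', e.sourceIp],
    ['suser', e.actor],
    ['outcome', e.outcome],
    ['requestClientApplication', e.userAgent],
    ['externalId', e.correlationId],        // assumed home for the correlation ID
  ];
  const extension = fields.map(([k, v]) => `${k}=${escExt(v)}`).join(' ');

  return `CEF:0|${header}|${extension}`;
}
```

Escaping matters more than it looks: user agents and usernames are attacker-controlled, and an unescaped pipe or equals sign can shift fields in the SIEM's parser.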
