Webhook fan-out at scale: delivering millions of events without falling over

Delivering webhooks at scale is harder than it looks. At 100 customers with a few events per day, you can get away with synchronous HTTP calls in your request handlers. At 10,000 customers with authentication events happening continuously, that approach will take your application down. Slow or unreachable webhook endpoints create blocking threads, pile up in your connection pool, and eventually cascade into failures in your core application. The right architecture decouples event production from delivery entirely.

Queue-based delivery architecture

The fundamental pattern: when an event occurs, write it to a durable queue and return immediately. A pool of dedicated webhook workers reads from the queue and makes the HTTP delivery attempts. Your application's request handling path is never blocked by a slow endpoint.

// Event producer — writes to queue, returns immediately
async function emitWebhookEvent(
  orgId: string,
  eventType: string,
  payload: object
): Promise<void> {
  const event = {
    id: `evt_${crypto.randomUUID().replace(/-/g, '')}`,
    type: eventType,
    created: Math.floor(Date.now() / 1000),
    data: payload
  };

  // Find all webhook endpoints for this org subscribed to this event type
  const endpoints = await db.webhookEndpoints.findActive(orgId, eventType);

  // Enqueue one delivery job per endpoint
  const jobs = endpoints.map(endpoint => ({
    webhook_endpoint_id: endpoint.id,
    event_id: event.id,
    event_type: eventType,
    payload: JSON.stringify(event),
    attempts: 0,
    next_attempt_at: new Date(),
    created_at: new Date()
  }));

  if (jobs.length > 0) {
    await db.webhookDeliveries.insertMany(jobs);
    // Signal workers that new jobs are available
    await queue.publish('webhook:pending', { count: jobs.length });
  }
}

// Worker — processes delivery jobs from the queue
async function processWebhookDelivery(jobId: string): Promise<void> {
  const job = await db.webhookDeliveries.findById(jobId);
  const endpoint = await db.webhookEndpoints.findById(job.webhook_endpoint_id);

  if (!endpoint.active) {
    await db.webhookDeliveries.update(jobId, { status: 'cancelled' });
    return;
  }

  const signature = computeWebhookSignature(job.payload, endpoint.secret);
  const startTime = Date.now();

  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Webhook-Signature': `t=${startTime},v1=${signature}`,
        'Webhook-ID': job.event_id,
        'User-Agent': 'YourProduct-Webhooks/1.0'
      },
      body: job.payload,
      signal: AbortSignal.timeout(30000)  // 30-second timeout
    });

    const durationMs = Date.now() - startTime;
    const success = response.status >= 200 && response.status < 300;

    await recordDeliveryAttempt(job, response.status, durationMs, success);

    if (!success) {
      await scheduleRetry(job, response.status);
    }
  } catch (err) {
    const durationMs = Date.now() - startTime;
    await recordDeliveryAttempt(job, 0, durationMs, false);
    await scheduleRetry(job, 0);
  }
}

Retry schedule with exponential backoff

Webhook endpoints go down. When they do, your delivery system must retry without creating a thundering herd when the endpoint comes back up. Exponential backoff with jitter is the standard approach: each retry waits 2^attempt seconds, plus a random jitter to prevent synchronized retry storms.

// Retry schedule: 5s, 30s, 2m, 10m, 30m, 2h, 5h — then give up
const RETRY_DELAYS_SECONDS = [5, 30, 120, 600, 1800, 7200, 18000];

async function scheduleRetry(
  job: WebhookDelivery,
  statusCode: number
): Promise<void> {
  const nextAttemptIndex = job.attempts;

  if (nextAttemptIndex >= RETRY_DELAYS_SECONDS.length) {
    // Max retries reached — mark as failed
    await db.webhookDeliveries.update(job.id, {
      status: 'failed',
      failed_at: new Date()
    });
    return;
  }

  const baseDelay = RETRY_DELAYS_SECONDS[nextAttemptIndex];
  const jitter = Math.random() * baseDelay * 0.1;  // 10% jitter
  const delaySeconds = Math.floor(baseDelay + jitter);

  await db.webhookDeliveries.update(job.id, {
    attempts: job.attempts + 1,
    next_attempt_at: new Date(Date.now() + delaySeconds * 1000),
    last_status_code: statusCode,
    status: 'pending'
  });
}

Endpoint health tracking and circuit breakers

When an endpoint has been failing for hours, continuing to attempt delivery puts load on your delivery infrastructure and floods that endpoint's logs when it comes back up with a backlog of deliveries. A circuit breaker pauses delivery to unhealthy endpoints.

// Endpoint health tracking
async function updateEndpointHealth(
  endpointId: string,
  success: boolean,
  statusCode: number
): Promise<void> {
  const endpoint = await db.webhookEndpoints.findById(endpointId);

  if (success) {
    await db.webhookEndpoints.update(endpointId, {
      consecutive_failures: 0,
      last_success_at: new Date(),
      circuit_breaker_open: false,
      circuit_breaker_open_at: null
    });
    return;
  }

  const newFailureCount = (endpoint.consecutive_failures || 0) + 1;
  const updates: Partial<WebhookEndpoint> = {
    consecutive_failures: newFailureCount,
    last_failure_at: new Date(),
    last_failure_status: statusCode
  };

  // Open circuit breaker after 5 consecutive failures
  if (newFailureCount >= 5 && !endpoint.circuit_breaker_open) {
    updates.circuit_breaker_open = true;
    updates.circuit_breaker_open_at = new Date();
    // Notify the endpoint owner
    await notifyEndpointUnhealthy(endpoint);
  }

  await db.webhookEndpoints.update(endpointId, updates);
}

// Check circuit breaker before attempting delivery
async function isEndpointDeliverable(endpointId: string): Promise<boolean> {
  const endpoint = await db.webhookEndpoints.findById(endpointId);

  if (!endpoint.circuit_breaker_open) return true;

  // Auto-reset after 1 hour — allow one probe attempt
  const openDuration = Date.now() - endpoint.circuit_breaker_open_at.getTime();
  if (openDuration > 3600 * 1000) {
    await db.webhookEndpoints.update(endpointId, {
      circuit_breaker_open: false,
      circuit_breaker_open_at: null
    });
    return true;  // allow probe attempt
  }

  return false;
}

Fan-out architecture for high event volume

When a single event needs to be delivered to thousands of endpoints — for example, a platform-level event that many developer customers subscribe to — the naive approach of queuing one job per endpoint in the request handler can itself become a bottleneck. A two-stage fan-out handles this: the first stage writes a single event record and publishes to a fan-out queue. Workers in the fan-out stage read the event, look up all subscribed endpoints, and write individual delivery jobs. This separates the write amplification from the event production path.

Always include a stable event ID in webhook payloads and document that endpoints must handle duplicate deliveries idempotently. Delivery exactly-once is not achievable over HTTP with retries. Endpoints that are down when you deliver will receive retries; endpoints that return 5xx may have processed the event before the error was returned. Your customers' endpoint handlers must use the event ID to detect and ignore duplicates.

Webhook signature verification

Webhook consumers need to verify that deliveries come from you and have not been tampered with. The standard approach is an HMAC-SHA256 signature over the request body, using a per-endpoint secret. Include a timestamp in the signed payload to prevent replay attacks.

// Computing the webhook signature (matches Stripe's scheme)
function computeWebhookSignature(payload: string, secret: string): string {
  const timestamp = Math.floor(Date.now() / 1000);
  const signedPayload = `${timestamp}.${payload}`;
  return crypto.createHmac('sha256', secret).update(signedPayload).digest('hex');
}

// Consumer-side verification
function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  toleranceSeconds = 300
): void {
  const parts = Object.fromEntries(
    signatureHeader.split(',').map(p => p.split('='))
  );
  const timestamp = parseInt(parts['t']);
  const signature = parts['v1'];

  if (Math.abs(Date.now() / 1000 - timestamp) > toleranceSeconds) {
    throw new Error('Webhook timestamp too old — possible replay attack');
  }

  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');

  if (!crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected))) {
    throw new Error('Webhook signature verification failed');
  }
}
← Back to blog Try Bastionary free →