Building reliable webhook delivery: retries, signatures, and fan-out

Webhooks are the backbone of modern SaaS integrations. When your auth system fires a user.created or session.revoked event, the downstream systems that depend on it need to receive that event reliably, even if the receiving server is temporarily down, overloaded, or returning errors. Getting this right requires more than a simple HTTP POST in a background job. This post covers signatures, retries, idempotency, fan-out, ordering, dead letter queues, and endpoint health monitoring.

HMAC-SHA256 signatures

The first thing a webhook consumer needs to verify is that the payload came from you and hasn't been tampered with. The standard approach is to include an HMAC-SHA256 signature in a request header:

import { createHmac, timingSafeEqual } from 'crypto';

function signWebhookPayload(
  payload: string,  // raw JSON string, not parsed object
  secret: string,
  timestamp: number = Math.floor(Date.now() / 1000)
): string {
  // Sign timestamp + payload to prevent replay attacks
  const signedContent = `${timestamp}.${payload}`;
  const signature = createHmac('sha256', secret)
    .update(signedContent)
    .digest('hex');
  return `t=${timestamp},v1=${signature}`;
}

// In your webhook dispatcher:
const payload = JSON.stringify(event);
const signature = signWebhookPayload(payload, endpoint.signingSecret);

await fetch(endpoint.url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Bastionary-Signature': signature,
    'X-Bastionary-Event': event.type,
    'X-Bastionary-Delivery': deliveryId,
  },
  body: payload,
});

The receiver validates the signature:

function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  toleranceSeconds = 300  // 5-minute replay window
): boolean {
  const parts = Object.fromEntries(
    signatureHeader.split(',').map(p => p.split('=') as [string, string])
  );
  const timestamp = parseInt(parts['t'], 10);
  const receivedSig = parts['v1'];

  if (!timestamp || !receivedSig) return false;

  // Check timestamp is within tolerance
  const age = Math.floor(Date.now() / 1000) - timestamp;
  if (Math.abs(age) > toleranceSeconds) return false;

  // Recompute expected signature
  const signedContent = `${timestamp}.${rawBody}`;
  const expectedSig = createHmac('sha256', secret)
    .update(signedContent)
    .digest('hex');

  // Use timing-safe comparison
  const a = Buffer.from(expectedSig, 'hex');
  const b = Buffer.from(receivedSig, 'hex');
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}

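Always use timingSafeEqual for signature comparison. A standard string equality check can return as soon as it finds a mismatch, so response times can leak how much of a guessed signature is correct. This is the same vulnerability class as classic timing attacks on password checks. Note that timingSafeEqual throws if the two buffers differ in length, which is why the code above compares lengths first.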
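The signer's note about using the raw JSON string matters on the receiving side too. The sketch below assumes Node's built-in crypto module; the helper name, secret, and timestamp are made up for illustration. It shows that re-serializing a parsed body can change the bytes and therefore the HMAC, so receivers should verify against the body exactly as it arrived.

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical helper for illustration: hex HMAC-SHA256 over "<timestamp>.<body>".
function hmacHex(secret: string, timestamp: number, body: string): string {
  return createHmac('sha256', secret).update(`${timestamp}.${body}`).digest('hex');
}

const secret = 'whsec_example';   // made-up secret
const timestamp = 1_700_000_000;  // fixed so the example is deterministic

// Body as sent on the wire (note the spaces after ':' and ',').
const rawBody = '{"type": "user.created", "id": "usr_abc"}';

// Same data after a parse/stringify round trip: the whitespace is gone.
const reserialized = JSON.stringify(JSON.parse(rawBody));

const sigOverRaw = hmacHex(secret, timestamp, rawBody);
const sigOverReserialized = hmacHex(secret, timestamp, reserialized);

console.log(reserialized === rawBody);            // false
console.log(sigOverRaw === sigOverReserialized);  // false
```

This is why the Express receiver later in this post reads the body with express.raw rather than a JSON body parser.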

Retry with exponential backoff and jitter

HTTP requests fail. The receiving server might be deploying, rate-limiting you, or experiencing an outage. Your webhook system needs to retry gracefully without hammering a struggling endpoint.

Exponential backoff doubles the delay between retries. Jitter adds randomness to prevent all retrying webhooks from slamming the endpoint at the same moment after a recovery:

function calculateRetryDelay(attempt: number): number {
  // Ceiling before jitter: 10s, 20s, 40s, 80s, 160s, 320s, ... (capped at 1h)
  const base = 10_000; // 10 seconds
  const cap = 3_600_000; // 1 hour
  const exponential = Math.min(base * Math.pow(2, attempt), cap);

  // Full jitter: random value in [0, exponential)
  // A common choice for high-concurrency systems (see the AWS Architecture Blog
  // post on exponential backoff and jitter)
  return Math.floor(Math.random() * exponential);
}

// 8 retries: worst case ~42 minutes in total (10s * (2^8 - 1)),
// about half that on average with full jitter
const MAX_ATTEMPTS = 8;

async function scheduleRetry(delivery: WebhookDelivery, db: DB): Promise<void> {
  if (delivery.attempts >= MAX_ATTEMPTS) {
    await moveToDeadLetter(delivery, db);
    return;
  }

  const delay = calculateRetryDelay(delivery.attempts);
  const nextAttempt = new Date(Date.now() + delay);

  await db.webhookDeliveries.update(delivery.id, {
    status: 'pending',
    nextAttemptAt: nextAttempt,
    attempts: delivery.attempts + 1,
  });
}
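It is worth checking how long a given configuration actually keeps retrying before a delivery is dead-lettered. A rough sketch, assuming the same 10-second base and 1-hour cap as above (the helper name is hypothetical):

```typescript
// Hypothetical helper: upper bound on the time spent waiting between retries, in ms.
// Mirrors the backoff above: the delay for attempt n is a random value below
// min(base * 2^n, cap), so the worst case is the sum of those ceilings.
function maxRetryWindowMs(
  maxAttempts: number,
  baseMs: number = 10_000,
  capMs: number = 3_600_000
): number {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += Math.min(baseMs * Math.pow(2, attempt), capMs);
  }
  return total;
}

const worstCaseMs = maxRetryWindowMs(8);
console.log(worstCaseMs / 60_000);      // 42.5 (minutes)

// With full jitter the expected delay is about half the ceiling,
// so the average window is roughly half the worst case.
console.log(worstCaseMs / 2 / 60_000);  // 21.25 (minutes)
```

Under the same assumptions, 12 attempts give a worst case of roughly 4.4 hours, because the later delays hit the 1-hour cap. Whatever values you pick, the receiver's deduplication TTL should be longer than this window.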

Idempotency: handling duplicate deliveries

Even with the best retry logic, a webhook may be delivered more than once. The POST might succeed on the server but time out before the 200 response reaches you. In that case, your system will retry — and the receiver will see a duplicate. Consumers need a way to detect and discard duplicates.

Include a stable X-Bastionary-Delivery ID that's the same across retries of the same event. Receivers use this to implement idempotency:

// Receiver-side idempotency check (Express)
app.post('/webhooks', express.raw({ type: 'application/json' }), async (req, res) => {
  // Verify the signature first, so unauthenticated callers learn nothing
  // about which delivery IDs have already been processed
  const signatureHeader = req.headers['x-bastionary-signature'];
  const rawBody = req.body.toString('utf8');
  if (typeof signatureHeader !== 'string' ||
      !verifyWebhookSignature(rawBody, signatureHeader, WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'invalid_signature' });
  }

  const deliveryId = req.headers['x-bastionary-delivery'];
  if (typeof deliveryId !== 'string') {
    return res.status(400).json({ error: 'missing_delivery_id' });
  }

  // Check if we've already processed this delivery
  const processed = await redis.get(`webhook:processed:${deliveryId}`);
  if (processed) {
    // Acknowledge receipt; it's fine to 200 here
    return res.status(200).json({ status: 'already_processed' });
  }

  // Process the event
  const event = JSON.parse(rawBody);
  await processEvent(event);

  // Mark as processed with 24-hour TTL (longer than your retry window)
  await redis.setex(`webhook:processed:${deliveryId}`, 86400, '1');
  res.status(200).json({ status: 'ok' });
});
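The handler above reads the key and writes it in two separate steps, so two copies of the same delivery arriving at nearly the same moment could both pass the check. One common way to close that gap is to claim the ID atomically before processing, which in Redis corresponds to the SET command with the NX and EX options. The sketch below uses a hypothetical in-memory class purely to show those semantics; a real receiver would need a store shared across its instances.

```typescript
// In-memory stand-in for an atomic "set if absent, with TTL" (hypothetical class).
class ProcessedStore {
  private expiries = new Map<string, number>(); // deliveryId -> expiry time (ms)

  // Returns true if this caller claimed the ID, false if it was already claimed.
  claim(deliveryId: string, ttlMs: number, now: number = Date.now()): boolean {
    const expiry = this.expiries.get(deliveryId);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.expiries.set(deliveryId, now + ttlMs);
    return true;
  }

  // Release a claim so a later retry can be processed (e.g. after a processing error).
  release(deliveryId: string): void {
    this.expiries.delete(deliveryId);
  }
}

const store = new ProcessedStore();
const DAY_MS = 24 * 60 * 60 * 1000;

console.log(store.claim('dlv_1', DAY_MS, 0));           // true  (first delivery)
console.log(store.claim('dlv_1', DAY_MS, 1_000));       // false (retry of the same delivery)
console.log(store.claim('dlv_1', DAY_MS, DAY_MS + 1));  // true  (earlier claim has expired)
```

If processing fails after a successful claim, releasing the claim lets the sender's next retry through instead of being acknowledged as a duplicate.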

Fan-out: delivering to multiple endpoints

When an event fires, it may need to be delivered to multiple registered endpoints for the same organization. Don't serialize these deliveries — fan them out in parallel, and track them independently:

async function fanOutEvent(event: WebhookEvent, db: DB, queue: Queue): Promise<void> {
  // Find all active endpoints for this org that subscribe to this event type
  const endpoints = await db.webhookEndpoints.findActiveForEvent(
    event.orgId,
    event.type
  );

  // Create a delivery record for each endpoint
  const deliveries = await Promise.all(
    endpoints.map(endpoint =>
      db.webhookDeliveries.create({
        id: crypto.randomUUID(),
        eventId: event.id,
        endpointId: endpoint.id,
        payload: JSON.stringify(event),
        status: 'pending',
        attempts: 0,
        nextAttemptAt: new Date(),
      })
    )
  );

  // Enqueue each delivery independently
  await queue.bulkEnqueue(
    deliveries.map(d => ({ type: 'webhook.deliver', deliveryId: d.id }))
  );
}

Each delivery record tracks its own retry state. If endpoint A fails repeatedly but endpoint B succeeds, endpoint A's failures shouldn't affect B.
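The isolation property can also be illustrated in-process, without a queue. This is a minimal sketch rather than the queue-based design above: `deliverToAll`, the `Endpoint` and `Outcome` types, and the `send` callback are all hypothetical names. The point is only that each attempt handles its own failure, so one rejecting endpoint never cancels or hides the others.

```typescript
type Endpoint = { id: string; url: string };
type Outcome = { endpointId: string; ok: boolean; error?: string };

// `send` is a hypothetical function that performs one delivery attempt.
async function deliverToAll(
  endpoints: Endpoint[],
  send: (endpoint: Endpoint) => Promise<void>
): Promise<Outcome[]> {
  // Each attempt gets its own rejection handler, so Promise.all never rejects
  // and every endpoint ends up with its own outcome.
  return Promise.all(
    endpoints.map(endpoint =>
      send(endpoint).then(
        (): Outcome => ({ endpointId: endpoint.id, ok: true }),
        (err: unknown): Outcome => ({
          endpointId: endpoint.id,
          ok: false,
          error: err instanceof Error ? err.message : String(err),
        })
      )
    )
  );
}

// Usage with a fake sender: endpoint "a" always fails, "b" succeeds.
const fakeSend = async (endpoint: Endpoint): Promise<void> => {
  if (endpoint.id === 'a') throw new Error('503 from receiver');
};

const outcomesPromise = deliverToAll(
  [
    { id: 'a', url: 'https://a.example/hooks' },
    { id: 'b', url: 'https://b.example/hooks' },
  ],
  fakeSend
);
outcomesPromise.then(outcomes => console.log(outcomes));
```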

Ordering guarantees

Webhooks are inherently at-least-once and unordered. Your retry logic means event B might arrive before event A if A needed retries. Design your webhook schema accordingly:

Include a created_at timestamp on every event
Include a sequence number or version if ordering matters for a resource
Include the full resource state in the payload, not just the delta — if the consumer processes event B before A, they'll still end up with the correct final state

{
  "id": "evt_01J3X8K...",
  "type": "user.updated",
  "created_at": "2025-05-05T14:23:11Z",
  "livemode": true,
  "delivery_id": "dlv_01J3X8L...",
  "data": {
    "object": {
      "id": "usr_abc",
      "email": "james@example.com",
      "updated_at": "2025-05-05T14:23:11Z",
      "version": 14
    },
    "previous_attributes": {
      "email": "james.old@example.com"
    }
  }
}

Including previous_attributes is a Stripe pattern worth copying. Consumers that need to react to specific field changes can check it without querying your API for the previous state.
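On the consumer side, the version field plus full resource state makes out-of-order delivery easy to tolerate. A minimal sketch, assuming an in-memory map as the consumer's store and a trimmed-down event type (both hypothetical): an event is applied only if its version is newer than what is already stored.

```typescript
type UserState = { id: string; email: string; version: number };
type UserUpdatedEvent = { type: 'user.updated'; data: { object: UserState } };

// Hypothetical consumer-side store keyed by resource ID.
const users = new Map<string, UserState>();

// Applies the event only if it is newer than the stored state.
// Returns true if the state changed, false if the event was stale.
function applyUserUpdated(event: UserUpdatedEvent): boolean {
  const incoming = event.data.object;
  const current = users.get(incoming.id);
  if (current && current.version >= incoming.version) return false;
  users.set(incoming.id, incoming);
  return true;
}

const v13: UserUpdatedEvent = {
  type: 'user.updated',
  data: { object: { id: 'usr_abc', email: 'james.old@example.com', version: 13 } },
};
const v14: UserUpdatedEvent = {
  type: 'user.updated',
  data: { object: { id: 'usr_abc', email: 'james@example.com', version: 14 } },
};

// Events arrive out of order: version 14 first, then 13.
console.log(applyUserUpdated(v14));        // true
console.log(applyUserUpdated(v13));        // false (stale, ignored)
console.log(users.get('usr_abc')?.email);  // james@example.com
```

Because a duplicate of an already-applied event has an equal version, the same check also discards redelivered events for this resource.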

Dead letter queues

After exhausting all retries, move the delivery to a dead letter queue (DLQ) rather than discarding it. This gives you:

A complete audit trail of all delivery failures
The ability to replay failed deliveries after the customer fixes their endpoint
Data for your support team to diagnose integration issues

async function moveToDeadLetter(delivery: WebhookDelivery, db: DB): Promise<void> {
  await db.transaction(async (trx) => {
    await trx.webhookDeliveries.update(delivery.id, { status: 'dead' });
    await trx.webhookDeadLetters.create({
      deliveryId: delivery.id,
      endpointId: delivery.endpointId,
      lastError: delivery.lastError,
      failedAt: new Date(),
      expiresAt: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000), // 30-day retention
    });
  });

  // Alert the endpoint owner
  await sendDeliveryFailureEmail(delivery);
}

Expose a "Replay" button in your dashboard that re-enqueues dead-lettered deliveries. In practice, this tends to be one of the most heavily used developer experience features of a webhook system.
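A replay action mostly comes down to resetting the delivery's retry state and enqueueing it again. The sketch below covers only the first half, as a pure function; the `Delivery` type, the 'delivered' status, and the `prepareReplay` name are assumptions made for illustration, and writing the record back and enqueueing it are left to your own storage and queue layers.

```typescript
type DeliveryStatus = 'pending' | 'delivered' | 'dead';
type Delivery = {
  id: string;
  status: DeliveryStatus;
  attempts: number;
  nextAttemptAt: Date;
};

// Hypothetical replay helper: returns the record to write back before re-enqueueing.
function prepareReplay(delivery: Delivery, now: Date = new Date()): Delivery {
  if (delivery.status !== 'dead') {
    throw new Error(`delivery ${delivery.id} is not dead-lettered`);
  }
  // Keep the same ID so a receiver that did process an earlier attempt
  // can still recognize the replay as a duplicate.
  return { ...delivery, status: 'pending', attempts: 0, nextAttemptAt: now };
}

const dead: Delivery = {
  id: 'dlv_1',
  status: 'dead',
  attempts: 8,
  nextAttemptAt: new Date(0),
};

const replayed = prepareReplay(dead, new Date(1_000));
console.log(replayed.status, replayed.attempts); // pending 0
```

Resetting attempts to zero gives the replayed delivery a full retry budget; keeping the count instead would send it straight back to the DLQ on the first failure.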

Endpoint health monitoring

Track the failure rate of each endpoint over time. If an endpoint is failing more than 80% of deliveries over a 24-hour window, mark it as degraded and alert the owner. If it reaches 100% failure for 72 hours, consider auto-disabling it to stop wasting retry cycles. Always send an email warning before auto-disabling — a silent disable without notification is how you earn angry support tickets.
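The thresholds above translate into a small classification function. This is a sketch under stated assumptions: `WindowStats` is a hypothetical aggregate of delivery outcomes over a trailing window, and the 72-hour check assumes the endpoint received traffic throughout that window. A production check would likely also confirm that the oldest failure is at least 72 hours old.

```typescript
type WindowStats = { delivered: number; failed: number };
type Health = 'healthy' | 'degraded' | 'disable_candidate';

function failureRate(stats: WindowStats): number {
  const total = stats.delivered + stats.failed;
  return total === 0 ? 0 : stats.failed / total;
}

// last24h / last72h are hypothetical aggregates over trailing windows.
function classifyEndpoint(last24h: WindowStats, last72h: WindowStats): Health {
  // 100% failure across the 72-hour window, with at least one attempt made.
  if (last72h.failed > 0 && last72h.delivered === 0) return 'disable_candidate';
  // More than 80% of deliveries failing over the last 24 hours.
  if (failureRate(last24h) > 0.8) return 'degraded';
  return 'healthy';
}

console.log(classifyEndpoint({ delivered: 10, failed: 0 }, { delivered: 30, failed: 0 }));  // healthy
console.log(classifyEndpoint({ delivered: 1, failed: 9 }, { delivered: 20, failed: 15 }));  // degraded
console.log(classifyEndpoint({ delivered: 0, failed: 5 }, { delivered: 0, failed: 40 }));   // disable_candidate
```

Returning 'disable_candidate' rather than disabling directly leaves room for the warning email to go out first.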