Building a Resilient Webhook Architecture
March 13, 2026
10 min read
Get posts like this in your inbox
Bi-weekly engineering deep dives on auth, notifications, and developer infrastructure. No spam.
Published by Zyphr
It’s 3:14 AM on a Tuesday. Your downstream customer’s API just returned a 503 Service Unavailable for exactly 120 seconds during a routine container deployment. In that window, your system attempted to fire 5,000 "payment.succeeded" webhook events. Because your producer was using a simple "fire and forget" POST request, those 5,000 events are now gone. They aren't in a database, they aren't in a log—they simply ceased to exist when the HTTP request timed out.
This isn't a hypothetical edge case; it's the default state of most home-grown webhook implementations. Most developers treat webhooks as a side effect—a notification that happens after the real work is done. But if you’re building a SaaS platform where webhooks trigger shipping labels, provision user accounts, or update ledger balances, a dropped hook is a data integrity failure. A resilient webhook architecture ensures that even during these transient failures, data remains consistent across your distributed systems.
At Zyphr, we believe webhooks shouldn't be an afterthought. We’ve built our system on the principle that a webhook delivery should be treated with the same ACID-compliance mindset as a local database transaction. If the event is triggered, the system guarantees delivery attempts until the downstream service acknowledges it or we’ve exhausted a multi-day retry window.
When you call zyphr.webhooks.send() or trigger an event through our auth system, we don't just open an HTTP connection to your customer's URL. That approach is brittle and couples your application’s performance to the latency of a third-party server. Instead, Zyphr uses a multi-stage pipeline designed for high throughput and extreme reliability.
The pipeline follows a strict flow: the event is accepted and immediately persisted, enqueued onto a message queue, picked up by a dedicated delivery worker, and then either acknowledged, scheduled for retry, or routed to the Dead Letter Queue.
By separating the "trigger" from the "delivery," we ensure that a spike in outbound traffic doesn't saturate your application's resources. More importantly, it allows us to implement "Delivery State Persistence." You can view the full lifecycle of a message in our dashboard, which is part of the broader Zyphr platform features designed to give you total visibility into your stack.
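Separating the "trigger" from the "delivery" is commonly implemented with a transactional-outbox pattern: the event row is committed alongside the business write, and a separate dispatcher drains it onto the queue, so no delivery depends on an in-flight HTTP call. A minimal sketch of that idea, using in-memory stand-ins (the names and shapes here are illustrative, not Zyphr's actual schema):

```typescript
type OutboxEvent = {
  id: string;
  type: string;
  payload: unknown;
  status: 'PENDING' | 'ENQUEUED';
};

// In-memory stand-ins for a database table and a message queue.
const outbox: OutboxEvent[] = [];
const queue: OutboxEvent[] = [];

// Step 1: the trigger only persists the event; no network I/O happens here.
function recordEvent(type: string, payload: unknown): OutboxEvent {
  const event: OutboxEvent = {
    id: `evt_${outbox.length + 1}`,
    type,
    payload,
    status: 'PENDING',
  };
  outbox.push(event); // in production: same DB transaction as the business write
  return event;
}

// Step 2: a dispatcher drains pending events onto the delivery queue.
function drainOutbox(): number {
  const pending = outbox.filter((e) => e.status === 'PENDING');
  for (const event of pending) {
    queue.push(event);
    event.status = 'ENQUEUED';
  }
  return pending.length;
}

recordEvent('payment.succeeded', { amount: 4200 });
recordEvent('user.created', { id: 'usr_1' });
console.log(drainOutbox()); // 2
```

Because the event is durably recorded before any delivery attempt, a crash of the dispatcher loses nothing: it simply picks up the pending rows on restart.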
This webhook architecture relies on an "at-least-once" delivery guarantee. While "exactly-once" is the holy grail of distributed systems, it is practically impossible in the face of network partitions. If a customer's server processes a request but the "200 OK" response is lost due to a TCP reset, we will retry. This shifts the burden of idempotency to the receiver, but it guarantees that no data is ever lost.
If a delivery fails, the immediate instinct for many developers is to "retry in 30 seconds." If the downstream server is struggling with a memory leak or a database lock, hitting it again 30 seconds later with 10,000 pending requests is the fastest way to turn a minor hiccup into a total outage. This is known as the "thundering herd" problem.
Zyphr implements a retry strategy that spans more than 12 hours. We use an exponential backoff algorithm combined with Decorrelated Jitter. Instead of retrying at fixed intervals, we introduce randomness to the delay. This ensures that even if a thousand events fail at the same second, their retries will be spread across a wide window, giving the downstream service room to breathe.
Here is how we calculate the delay for the next attempt:
function calculateNextRetry(attempt: number, baseDelay: number = 1000): number {
  // Max delay capped at 4 hours to ensure events eventually clear
  const cap = 4 * 60 * 60 * 1000;
  // Exponential backoff: 2^attempt * baseDelay
  const exponentialDelay = Math.min(cap, baseDelay * Math.pow(2, attempt));
  // Apply decorrelated jitter: random value between baseDelay and exponentialDelay * 3
  // This prevents synchronization of retry attempts across the fleet.
  const sleep = Math.random() * (exponentialDelay * 3 - baseDelay) + baseDelay;
  return Math.min(cap, sleep);
}
By the time an event hits its fifth or sixth retry, the gap between attempts is measured in hours. This accounts for scenarios where a customer's site might be down for an extended period due to a failed migration or a DNS issue. This logic is a core component of a high-scale webhook architecture because it prioritizes the health of the entire ecosystem over the immediate delivery of a single packet.
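To see the shape of this schedule, you can drive the delay function across successive attempts. The exact values vary per run because of the jitter, but every delay is guaranteed to fall between the base delay and the 4-hour cap (a standalone sketch restating the function above so it runs on its own):

```typescript
function calculateNextRetry(attempt: number, baseDelay: number = 1000): number {
  const cap = 4 * 60 * 60 * 1000; // 4 hours
  const exponentialDelay = Math.min(cap, baseDelay * Math.pow(2, attempt));
  // Jittered sleep in [baseDelay, exponentialDelay * 3)
  const sleep = Math.random() * (exponentialDelay * 3 - baseDelay) + baseDelay;
  return Math.min(cap, sleep);
}

// Print a sample schedule for seven attempts; values differ on every run.
for (let attempt = 0; attempt < 7; attempt++) {
  const delayMs = calculateNextRetry(attempt);
  console.log(`attempt ${attempt + 1}: retry in ~${Math.round(delayMs / 1000)}s`);
}
```

In a real delivery worker, the computed delay would be persisted as a `nextAttemptAt` timestamp on the queued event rather than held in memory, so retries survive worker restarts.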
Because our architecture guarantees at-least-once delivery, your endpoints must be prepared to receive the same event multiple times. This happens most often when a request succeeds on your end but the network connection times out before our delivery engine receives your response.
We recommend using a unique event ID, which we provide in the X-Zyphr-Event-Id header. On your backend, you should track these IDs in a table with a unique constraint.
async function handleWebhook(req: Request, res: Response) {
  const eventId = req.headers['x-zyphr-event-id'];
  // Check if we have already processed this specific event
  const alreadyProcessed = await db.processedEvents.findUnique({
    where: { id: eventId }
  });
  if (alreadyProcessed) {
    return res.status(200).send('Already processed');
  }
  // Execute your business logic here
  await processOrder(req.body);
  // Mark as processed; the unique constraint rejects a concurrent duplicate
  await db.processedEvents.create({ data: { id: eventId } });
  return res.status(200).send('OK');
}
This pattern ensures that even if our retry logic sends the same "payment_processed" hook three times, your user is only credited once. It turns a potential double-billing bug into a non-event.
Retries alone are a blunt instrument. If we detect that 90% of requests to a specific endpoint are failing with 5xx errors, continuing to hit that endpoint is irresponsible. This is where we implement the Circuit Breaker pattern within our webhook architecture.
Our delivery engine tracks the health of every webhook destination across three states: Closed (the endpoint is healthy and events are delivered immediately), Open (the endpoint is failing and deliveries are paused), and Half-Open (a single probe request is allowed through to test whether the endpoint has recovered).
The Circuit Breaker is a self-healing mechanism. It prevents Zyphr from accidentally DDoS-ing your customers when they are already in a vulnerable state. While the circuit is Open, we maintain event ordering in the queue, ensuring that once the service recovers, the events are delivered in the sequence they were generated.
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class WebhookCircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failureCount: number = 0;
  private lastFailureTime: number | null = null;

  async execute(deliveryFn: () => Promise<void>) {
    if (this.state === 'OPEN') {
      // 5 minute cooldown before entering HALF_OPEN
      if (Date.now() - this.lastFailureTime! > 300000) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit is open: Queueing event for later');
      }
    }
    try {
      await deliveryFn();
      this.reset();
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordFailure() {
    this.failureCount++;
    // A single failure while probing in HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= 5) {
      this.state = 'OPEN';
      this.lastFailureTime = Date.now();
    }
  }

  private reset() {
    this.state = 'CLOSED';
    this.failureCount = 0;
  }
}
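In practice you keep one breaker instance per destination endpoint, so one customer's outage never blocks deliveries to anyone else. A minimal sketch of that per-endpoint registry (the `WebhookCircuitBreaker` here is a trimmed stand-in for the class above, and the function names are illustrative):

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Trimmed stand-in for the breaker class shown above.
class WebhookCircuitBreaker {
  state: CircuitState = 'CLOSED';
  private failures = 0;

  async execute(deliveryFn: () => Promise<void>): Promise<void> {
    if (this.state === 'OPEN') throw new Error('Circuit is open');
    try {
      await deliveryFn();
      this.state = 'CLOSED';
      this.failures = 0;
    } catch (err) {
      if (++this.failures >= 5) this.state = 'OPEN';
      throw err;
    }
  }
}

// One breaker per destination URL, created lazily on first delivery.
const breakers = new Map<string, WebhookCircuitBreaker>();

function breakerFor(url: string): WebhookCircuitBreaker {
  let breaker = breakers.get(url);
  if (!breaker) {
    breaker = new WebhookCircuitBreaker();
    breakers.set(url, breaker);
  }
  return breaker;
}

async function deliver(url: string, send: () => Promise<void>): Promise<boolean> {
  try {
    await breakerFor(url).execute(send);
    return true;
  } catch {
    return false; // event stays queued for the retry scheduler
  }
}
```

Keying the map by destination URL (or subscription ID) is what makes the breaker an isolation mechanism rather than a global kill switch.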
Sometimes, even 12 hours and 7 retries aren't enough. Perhaps a customer changed their API endpoint without updating their settings, or their SSL certificate expired and wasn't renewed for a weekend. In a standard system, these messages would eventually be purged from the queue.
In Zyphr, these "exhausted" events move to the Dead Letter Queue (DLQ). The DLQ is a staging area for manual intervention. Through our webhook management dashboard, you can inspect the exact payload that failed, see the final HTTP status code (e.g., a 403 Forbidden because of a misconfigured WAF), and trigger a manual replay once the customer fixes their endpoint.
Crucially, we provide "Event Filtering." Not every event is mission-critical. You might want payment.succeeded to go to the DLQ on failure, but you might decide that user.logged_in isn't worth the manual effort. You can configure your subscriptions to automatically drop non-essential events after the retry window, keeping your DLQ focused on the data that impacts your business operations.
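A per-event-type policy like this can be modeled as plain configuration that the delivery engine consults once the retry window is exhausted. The shape below is an illustrative sketch, not Zyphr's actual config format:

```typescript
type FailurePolicy = 'dead_letter' | 'drop';

// Hypothetical per-event-type routing after all retries are exhausted.
const subscriptionConfig: Record<string, FailurePolicy> = {
  'payment.succeeded': 'dead_letter', // money moved: always keep for replay
  'invoice.finalized': 'dead_letter',
  'user.logged_in': 'drop',           // telemetry: not worth manual triage
};

function onRetriesExhausted(eventType: string): FailurePolicy {
  // Default to the safe choice when an event type is not explicitly listed.
  return subscriptionConfig[eventType] ?? 'dead_letter';
}
```

Defaulting unknown event types to the DLQ errs on the side of data retention; dropping is an explicit opt-in per event type.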
If you’re building a system that acts on webhook data—like upgrading a user's subscription—you must be certain the request came from Zyphr. Relying on IP allowlisting is a recipe for maintenance pain: provider IP ranges change over time, and an allowlist only tells you where a request came from, not whether its payload is authentic and untampered.
We sign every outgoing webhook with an HMAC-SHA256 signature. This signature is generated using a secret key unique to each webhook endpoint. We also include a X-Zyphr-Timestamp header. When you verify the signature, you should also check that the timestamp is within a 5-minute window of the current time. This prevents "replay attacks" where an attacker intercepts a valid hook and tries to resend it later.
We also support "Zero-Downtime Secret Rotation." When you rotate a webhook secret, we provide an Overlap Window. For a configurable period (usually 24 hours), Zyphr will sign the payload with both the old secret and the new secret, sending two signature headers. This allows your team to update the new secret in your environment without dropping a single incoming hook.
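Under the hood, this kind of verification is plain HMAC-SHA256 over the timestamp and body, compared with a timing-safe equality check, and checking against a list of secrets is what makes the rotation overlap window work. A sketch using Node's built-in crypto module (the `timestamp.payload` signing scheme and function names are assumptions for illustration, not Zyphr's documented wire format):

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

const TOLERANCE_SECONDS = 5 * 60; // reject timestamps outside a 5-minute window

function sign(secret: string, timestamp: string, payload: string): string {
  return createHmac('sha256', secret)
    .update(`${timestamp}.${payload}`)
    .digest('hex');
}

function verifySignature(
  payload: string,
  signature: string,
  timestamp: string,
  secrets: string[], // old + new secret during a rotation overlap window
  nowSeconds: number = Math.floor(Date.now() / 1000)
): boolean {
  // Reject stale timestamps to block replay attacks.
  if (Math.abs(nowSeconds - Number(timestamp)) > TOLERANCE_SECONDS) return false;

  return secrets.some((secret) => {
    const expected = sign(secret, timestamp, payload);
    const a = Buffer.from(expected, 'hex');
    const b = Buffer.from(signature, 'hex');
    // timingSafeEqual throws on length mismatch, so guard the lengths first.
    return a.length === b.length && timingSafeEqual(a, b);
  });
}
```

Accepting any secret in the list means a receiver that has already rolled to the new secret, and one that hasn't yet, both keep verifying successfully during the overlap period.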
Here is how you verify a signature using our Node.js SDK:
import { Zyphr } from '@zyphr/sdk';

const zyphr = new Zyphr(process.env.ZYPHR_API_KEY);

app.post('/webhooks/zyphr', (req, res) => {
  const signature = req.headers['x-zyphr-signature'] as string;
  const timestamp = req.headers['x-zyphr-timestamp'] as string;
  // Verify against the raw request body where possible: re-serializing a
  // parsed body is not guaranteed to byte-match the payload that was signed.
  const payload = JSON.stringify(req.body);
  try {
    // This helper validates the HMAC and the timestamp window.
    // It uses a timing-safe comparison to prevent side-channel attacks.
    const isValid = zyphr.webhooks.verifySignature(
      payload,
      signature,
      timestamp,
      process.env.ZYPHR_WEBHOOK_SECRET
    );
    if (!isValid) return res.status(401).send('Invalid signature');
    // Process the event...
    res.status(200).send('OK');
  } catch (err) {
    res.status(400).send('Webhook verification failed');
  }
});
The biggest frustration with webhooks is the "Black Box" problem. You know you sent it, they say they didn't get it, and neither of you has proof. We solve this by providing raw delivery logs for every single attempt. This includes the full request headers, the request body, and the exact response returned by the customer's server.
Beyond raw logs, we calculate Webhook Health Scores. This is a proactive metric based on the success rate and latency of a specific endpoint over the last hour. If a score starts to dip, it’s often an early warning sign of a memory leak or a struggling database on the customer's end—allowing you to reach out to them before the circuit breaker actually opens. It turns "something is wrong" into "your /api/webhooks endpoint has seen a 40% increase in 500 errors over the last 15 minutes."
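One simple way to compute such a score is a weighted blend of success rate and latency over a sliding window. The weights and thresholds below are illustrative assumptions, not Zyphr's actual formula:

```typescript
type DeliveryAttempt = {
  timestampMs: number;
  success: boolean;
  latencyMs: number;
};

// Score from 0 (failing) to 100 (healthy) over the last hour of attempts.
function healthScore(
  attempts: DeliveryAttempt[],
  nowMs: number = Date.now(),
  windowMs: number = 60 * 60 * 1000
): number {
  const recent = attempts.filter((a) => nowMs - a.timestampMs <= windowMs);
  if (recent.length === 0) return 100; // no recent traffic: assume healthy

  const successRate = recent.filter((a) => a.success).length / recent.length;
  const avgLatency =
    recent.reduce((sum, a) => sum + a.latencyMs, 0) / recent.length;

  // Penalize average latency above a 500ms target, capping the penalty at 1.
  const latencyPenalty = Math.min(1, Math.max(0, (avgLatency - 500) / 2000));

  // 80% weight on success rate, 20% on latency headroom.
  return Math.round(100 * (0.8 * successRate + 0.2 * (1 - latencyPenalty)));
}
```

Because the score degrades continuously rather than flipping at a threshold, a slow downward drift becomes visible well before the circuit breaker's hard failure count is reached.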
If your current webhook logic consists of a simple axios.post wrapped in a basic try/catch block, you are essentially gambling with your data integrity. At some point, the internet will fail, and you will lose events.
Audit your current event pipeline. Does it handle exponential backoff? Does it have a jitter strategy to prevent thundering herds? Do you have a mechanism to pause delivery if a customer's site goes down, or are you just wasting compute resources retrying against a dead server?
If you aren’t satisfied with the answers, it’s time to move your hooks to a system built for 99.99% delivery. Sign up for a free Zyphr account and use our Webhook simulator to test how your app handles circuit breakers and signature verification. Install the @zyphr/sdk today and build an architecture that survives the 3:00 AM outages.