Webhooks feel magical at first. Something happens in one system, and another system immediately reacts. But as soon as a workflow becomes important, the “magic” turns into operational work: missed events, duplicate deliveries, unexplained failures, and questions like “what actually ran and why?”
Good webhook hygiene is the set of small design choices that keep event-driven automations stable and understandable over time. It is not about heavy infrastructure. It is about clear contracts, predictable behavior, and a path to debug and recover when something inevitably goes wrong.
This post walks through a practical, small-team-friendly approach. The goal is that your webhook-based automations continue to work when volume increases, team members change, or the sending system evolves.
What webhook hygiene means
Webhook hygiene means treating each webhook like a tiny public API. Even if you are the only consumer, you will eventually forget assumptions you made. If another teammate consumes it, they will make different assumptions. Hygiene reduces ambiguity.
At a minimum, “clean” webhooks provide:
- A stable event contract: clear fields, consistent types, and versioning.
- Authenticity checks: confidence that the sender is who they claim to be.
- Reliability behaviors: duplicates can happen, ordering is not guaranteed, and delays are normal.
- Debuggability: the ability to answer “which events were received, processed, and why?”
- A recovery path: replay or reprocess without manual detective work.
Think of webhooks as messages, not commands. The sender is reporting that something happened, and the receiver decides what to do, safely and repeatedly.
Define a stable event contract
The most common webhook failure mode is semantic drift. The sender changes a field name, adds a new “optional” field that is actually required in one edge case, or starts emitting a new event type without notice. A minimal contract prevents most of this.
Minimal envelope fields (the part that should almost never change)
Start with a small “envelope” that wraps the payload. Keep the envelope consistent across all event types. The payload can vary, but the envelope is what makes processing reliable and observable.
```json
{
  "event_id": "evt_01H...",
  "event_type": "invoice.paid",
  "event_version": 1,
  "occurred_at": "2026-02-06T12:34:56Z",
  "source": "billing-system",
  "data": { "...payload..." }
}
```
Why these fields matter:
- `event_id`: lets you detect duplicates and tie logs together.
- `event_type`: makes routing explicit (no guessing based on payload shape).
- `event_version`: enables change without breaking existing consumers.
- `occurred_at`: separates business time from delivery time.
- `source`: helps when multiple senders exist (or will exist later).
Versioning without pain
A practical approach for small systems is to version per event type and increment only for breaking changes. Additive changes (new optional fields) usually do not need a new version if you keep consumers tolerant of unknown fields.
If you do need a breaking change, keep the old version emitting for a while. Your receiver can support both versions and gradually migrate. This “dual support” period is often cheaper than emergency fixes after a silent break.
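The dual-support period can be sketched as a dispatch table keyed by event type and version. The handler functions and the field rename below (`invoice_id` becoming `id` in v2) are hypothetical, purely to illustrate the pattern:

```python
# Route events by (event_type, event_version) so old and new payload
# shapes can coexist while consumers migrate.
# Both handlers normalize to the same internal shape.
HANDLERS = {
    ("invoice.paid", 1): lambda data: {"invoice_id": data["invoice_id"]},
    # v2 renamed invoice_id -> id (a breaking change, hence a new version)
    ("invoice.paid", 2): lambda data: {"invoice_id": data["id"]},
}

def dispatch(event: dict) -> dict:
    """Look up a handler for this event's type and version and apply it."""
    key = (event["event_type"], event.get("event_version", 1))
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"unsupported event/version: {key}")
    return handler(event["data"])
```

Because both versions normalize to one internal shape, downstream code never needs to know which version arrived.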
Secure, validate, and limit blast radius
Webhook endpoints are public by design. Even if the URL is “secret,” it will eventually leak in logs, browser history, or forwarded emails. Assume attackers can find it and send requests to it.
A copyable webhook receiver checklist
- Verify authenticity: use a signature header (HMAC) or another cryptographic mechanism shared with the sender.
- Validate shape: check required fields, types, and acceptable `event_type` values.
- Enforce size limits: cap request body size to reduce abuse and parsing risk.
- Require HTTPS: never accept plain HTTP for webhook ingestion.
- Use a dedicated endpoint: do not reuse a general API route with broader permissions.
- Least privilege downstream: the automation should have only the permissions it needs, especially for write operations.
- Fail closed: if signature verification fails or required fields are missing, return an error and do not process.
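The authenticity check at the top of this list can be sketched with Python's standard `hmac` module. The hex encoding and the idea of a single signature header are assumptions; real senders vary (some prefix the scheme, some include a timestamp), so check your sender's documentation:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the sender's hex-encoded signature header."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest is constant-time, which avoids leaking information
    # about how many leading characters of the signature matched
    return hmac.compare_digest(expected, signature_header)
```

Note that verification must run on the raw request bytes, before any JSON parsing or re-serialization, since even whitespace changes alter the digest.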
Also decide what your endpoint returns. Many senders interpret any 2xx status as “delivered.” If you return success before you have safely recorded the event, you may lose it forever. A good default is: record first, then acknowledge.
Design for duplicates and delays
Webhooks are not guaranteed to arrive exactly once. You may receive the same event multiple times due to retries, network timeouts, or sender bugs. You may also receive events out of order. Your design must assume this.
Three practical patterns keep you safe without overengineering:
- Deduplicate by `event_id`: store processed IDs with a retention window (for example, 7 to 30 days). If an event repeats, skip side effects.
- Make side effects idempotent: when writing to another system, use stable keys. For example, "create or update customer by `customer_id`" instead of "create customer" each time.
- Separate receipt from processing: capture the event quickly, then process asynchronously. Even a simple internal queue table works. The key is that the incoming request returns fast and does not depend on slow downstream APIs.
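The deduplication and "record first" ideas can be sketched together with a tiny event store. This uses an in-memory SQLite table for brevity; a real receiver would use a durable database:

```python
import sqlite3

# The primary key on event_id makes duplicate detection a single insert:
# a repeat delivery violates the constraint instead of creating a new row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, status TEXT)")

def receive(event: dict) -> str:
    """Record an incoming event; duplicates are acknowledged but not re-queued."""
    try:
        db.execute(
            "INSERT INTO events VALUES (?, 'received')", (event["event_id"],)
        )
        db.commit()
        return "recorded"
    except sqlite3.IntegrityError:
        # already seen this event_id: acknowledge, but trigger no side effects
        return "duplicate"
```

Either way the handler can return 2xx immediately, because the event is safely stored before the sender is acknowledged.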
Retries are normal, but uncontrolled retries can become a loop. Use backoff, cap attempts, and record the last error. If an event fails repeatedly, route it to a “needs review” state so humans can intervene without blocking the entire pipeline.
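A minimal sketch of capped retries with exponential backoff and a "needs review" terminal state might look like this (the `process` callable and the status fields are illustrative, not a prescribed schema):

```python
import time

def process_with_backoff(process, event, max_attempts=5, base_delay=1.0):
    """Try processing an event up to max_attempts times, doubling the
    delay between attempts. On final failure, park it for human review."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return process(event)
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # 1s, 2s, 4s, ... keeps a struggling dependency from
                # being hammered in a tight loop
                time.sleep(base_delay * 2 ** (attempt - 1))
    event["status"] = "needs_review"
    event["last_error"] = str(last_error)
    return None
```

The important property is that a permanently failing event ends up in a visible parked state instead of retrying forever or silently disappearing.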
Observe, replay, and evolve safely
Most webhook pain is not the failure itself. It is the time spent figuring out what happened. Add observability on day one, even if it is just structured logs and a simple event table.
At minimum, you want to answer these questions:
- Did we receive the event? If yes, when and from which source IP or sender ID?
- Did signature verification pass?
- Did we validate and accept the payload shape?
- Did we process it successfully? If not, what was the error and how many attempts were made?
- Which downstream side effects happened (records created, emails queued, tickets opened)?
Replay is the other half of observability. If you cannot easily reprocess a single event, small failures turn into manual work. Practical replay options include:
- Re-run from your event store: requeue a single event by ID.
- Time window re-run: requeue all events in a time range after a dependency outage.
- Dry-run mode: validate and simulate side effects, but do not actually write to external systems.
Evolving safely means making changes without breaking older consumers or historical replay. Keeping the raw payload (or a normalized representation) alongside your processed record helps when you need to reinterpret older events after a logic update.
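The first two replay options reduce to small status updates on an event store. A minimal sketch, with an in-memory SQLite table and illustrative column names (a real system would use a durable database and a worker that polls for `received` rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY,"
    " received_at TEXT, status TEXT)"
)

def requeue_event(event_id: str) -> None:
    """Re-run a single event by ID: reset it so the worker picks it up again."""
    db.execute(
        "UPDATE events SET status = 'received' WHERE event_id = ?", (event_id,)
    )
    db.commit()

def requeue_window(start: str, end: str) -> int:
    """Requeue every failed event in a time range, e.g. after a dependency
    outage. Returns how many events were requeued."""
    cur = db.execute(
        "UPDATE events SET status = 'received' "
        "WHERE status = 'failed' AND received_at BETWEEN ? AND ?",
        (start, end),
    )
    db.commit()
    return cur.rowcount
```

Because only `failed` rows are touched in the window re-run, already-processed events are not accidentally replayed.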
Example: a small ecommerce automation
Imagine a small ecommerce operation that wants a simple automation:
- When an order is paid, create or update the customer in a CRM.
- Create a shipping task in an internal tracker.
- Send a message to an internal chat channel.
A “quick” implementation might call three APIs directly inside the webhook handler and return 200 if everything finishes. This works until one downstream API is slow, times out, or returns a transient error. Then the sender retries, your handler runs twice, and you get duplicate CRM entries and duplicate tasks.
A hygienic implementation would instead:
- Verify the signature and validate the envelope.
- Store the event row: `event_id`, raw payload, status `received`.
- Return `200` quickly (or after storing), so the sender stops retrying.
- Process asynchronously:
  - Check if `event_id` was already processed. If yes, stop.
  - Upsert the CRM customer by `customer_id`.
  - Create the shipping task using a stable id like `order_id` as an idempotency key.
  - Send the chat message, but guard against duplicates (for example, store "message sent" state keyed by `order_id`).
- Mark the event as `processed` with a timestamp and a compact summary of side effects.
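The asynchronous processing step can be sketched as below. The CRM, shipping tracker, and chat channel are stand-in dictionaries here, not real API clients, and the payload field names are hypothetical:

```python
# Stand-ins for external systems, keyed by the stable IDs that make
# each side effect idempotent.
processed_ids = set()   # event_ids whose side effects already ran
crm = {}                # customer_id -> customer record (upsert target)
shipping_tasks = {}     # order_id -> task; order_id is the idempotency key
messages_sent = set()   # order_ids already announced in chat

def process_order_paid(event: dict) -> str:
    """Run the three side effects exactly once per event, safely re-runnable."""
    if event["event_id"] in processed_ids:
        return "skipped"  # duplicate delivery: all side effects already done
    data = event["data"]
    # upsert by customer_id instead of blindly creating a new record
    crm[data["customer_id"]] = data["customer"]
    # setdefault only creates the task the first time this order_id appears
    shipping_tasks.setdefault(data["order_id"], {"status": "todo"})
    if data["order_id"] not in messages_sent:
        messages_sent.add(data["order_id"])  # guard against duplicate messages
    processed_ids.add(event["event_id"])
    return "processed"
```

Running the same event twice produces the same end state, which is exactly what makes sender retries harmless.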
Now, if the CRM API is down for an hour, you do not lose events. You see a backlog, retries are controlled, and replay is as simple as requeuing failed events when the CRM comes back.
Common mistakes
These are the mistakes that most often turn “simple webhooks” into recurring fire drills:
- No event ID: you cannot deduplicate, correlate logs, or replay reliably.
- Doing heavy work in the request: slow downstream calls cause timeouts and duplicates.
- Assuming order: processing “invoice.paid” before “invoice.created” might happen; design defensively.
- Implicit routing: inferring event type from payload structure instead of a clear `event_type`.
- Skipping signature verification: the endpoint becomes a write-anything door to your systems.
- Logging too little: you only discover gaps when a human asks why something did not happen.
- Changing payloads without versioning: tiny changes break downstream parsing, often silently.
Key takeaways
- Use a stable event envelope with `event_id`, `event_type`, and `event_version`.
- Record first, then acknowledge; do heavy work asynchronously.
- Assume duplicates, delays, and out-of-order delivery; design for safe reprocessing.
- Secure the endpoint with signature verification and strict validation.
- Make replay a first-class feature so failures do not require manual reconstruction.
When not to use webhooks
Webhooks are great for near-real-time reactions, but they are not always the best trigger. Consider alternatives when:
- You need guaranteed delivery but cannot store events: if you cannot record and replay, a periodic pull (polling) may be safer.
- The sender cannot sign requests: without authenticity checks, you may be exposing sensitive operations.
- Your workflow is batch oriented: if you only need daily totals, a scheduled job can be simpler and easier to monitor.
- The downstream action is high risk: for irreversible operations, you may want a human approval step rather than a push trigger.
A useful rule: if a failure would require a person to reconcile records, prioritize designs that make replay and audit straightforward, regardless of whether you choose push or pull.
Conclusion
Webhook hygiene is not about perfection. It is about building the smallest set of guarantees that keep automations safe: stable contracts, strong validation, resilience to duplicates, and a clear trail from receipt to side effects.
If you adopt only one practice, make it this: store every incoming event with an ID before doing anything else. That single choice turns a fragile trigger into an automation you can trust, debug, and evolve.
FAQ
Should I return 200 before I finish processing?
Usually yes, but only after you have safely recorded the event (for example, written to a database). If you acknowledge before recording, you risk losing events permanently. If you wait to acknowledge until after all downstream work, you increase timeouts and duplicates.
How long should I keep event IDs for deduplication?
Long enough to cover the sender’s retry window plus your own backlog risk. Many teams start with 14 to 30 days. If storage is a concern, keep the IDs and minimal metadata longer than the full raw payload.
What if the sender does not provide event IDs?
Create your own derived identifier using stable fields and a hash (for example, event type plus a source record ID plus occurred timestamp). It is not perfect, but it is better than nothing. Also consider asking the sender to add a unique ID; it is a high leverage improvement.
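A derived identifier can be sketched as a hash over the stable fields, as suggested above (the separator and field choice are conventions to pick once and keep):

```python
import hashlib

def derived_event_id(event_type: str, source_record_id: str, occurred_at: str) -> str:
    """Build a stable fallback ID from fields that uniquely identify the
    event. The same inputs always produce the same ID, so retries of the
    same delivery hash to the same value and can be deduplicated."""
    raw = f"{event_type}|{source_record_id}|{occurred_at}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

The caveat is real: if the sender emits two genuinely distinct events with identical type, record ID, and timestamp, they will collide, which is why a sender-provided unique ID is still the better option.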
Do I need a message queue to do this well?
No. A queue helps, but a small “events” table plus a background worker can be enough. The key idea is decoupling receipt from processing and tracking status, attempts, and errors in a durable way.
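The "events table plus background worker" idea can be sketched as follows; the schema and worker loop are illustrative (in-memory SQLite here), not a prescribed design:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE events (
    event_id   TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,               -- raw payload as received
    status     TEXT NOT NULL,               -- received | processed | failed
    attempts   INTEGER NOT NULL DEFAULT 0,
    last_error TEXT
)
""")

def work_once(handle) -> bool:
    """Pick one pending event, process it, and durably record the outcome.
    Returns False when there is nothing to do."""
    row = db.execute(
        "SELECT event_id, payload FROM events WHERE status = 'received' LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    event_id, payload = row
    try:
        handle(payload)
        db.execute(
            "UPDATE events SET status = 'processed',"
            " attempts = attempts + 1 WHERE event_id = ?",
            (event_id,),
        )
    except Exception as exc:
        db.execute(
            "UPDATE events SET status = 'failed', attempts = attempts + 1,"
            " last_error = ? WHERE event_id = ?",
            (str(exc), event_id),
        )
    db.commit()
    return True
```

A loop calling `work_once` on a timer is already a functioning worker; status, attempts, and last_error give you the audit trail the rest of this post argues for.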