Automation jobs fail for unglamorous reasons: a network blip, a restarted server, a vendor API timeout, an expired credential, or a partial deploy. The expensive part is rarely the failure itself. It is what happens next, when someone hits “run again” and the job creates duplicates, overwrites the wrong data, or charges customers twice.
Idempotency is the property that makes re-runs safe. If a job is idempotent, you can execute it multiple times with the same inputs and end up with the same result as running it once. That does not mean a re-run does nothing. It means a re-run adds nothing beyond what the first successful run already did.
This post shows a practical pattern for building idempotent batch jobs and automations. The goal is not academic perfection. The goal is to make retries and restarts routine, so operations feel calm even when the environment is not.
Why idempotency is the foundation of reliable automation
In real workflows, “exactly once” execution is often a myth. Your scheduler might retry, a worker might crash mid-run, or an API request might succeed but the response never reaches you. If you design as if every step happens once, you are building a trap for your future self.
Idempotent jobs give you three operational superpowers:
- Safe retries: you can retry on timeouts and transient errors without fearing duplicates.
- Resumability: you can restart after partial progress and continue without manual cleanup.
- Recoverability: you can replay historical ranges to repair data without creating chaos.
Even if your job touches several systems, you can usually make each side effect take effect only once, however many times the step is attempted, by attaching a stable identity to the change and recording what you already applied.
Define your unit of work (so re-runs have meaning)
The most important design decision is not the retry logic. It is defining what you are doing, in a way that can be repeated. That definition becomes your unit of work.
A unit of work should be:
- Small enough that you can re-run it without reprocessing everything.
- Stable so you can identify it across retries (for example, a vendor invoice ID).
- Observable so you can log status and debug outcomes.
Common units of work include “one customer record,” “one invoice,” “one webhook event,” or “one page of API results.” Pick one that matches how your source system already identifies objects.
Once you have a unit of work, the question becomes: how do you ensure each unit is applied once, even if the overall job runs multiple times?
Practical idempotency strategies (that work in real systems)
Idempotency is usually achieved through a small set of techniques. You often combine two or three to get reliable behavior under failure.
1) Use a stable idempotency key and write with upsert semantics
If you store results in a database, prefer operations that are naturally idempotent. Instead of “insert a row,” do “insert if missing, otherwise update.” The key is choosing an identifier that does not change between retries, such as a source record ID plus a namespace.
For example, if you import invoices, store them with a unique constraint on (source_system, source_invoice_id). Then a second run becomes an update, not a duplicate insert.
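The invoice example above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a prescribed schema: the table layout, the amount_cents and status columns, and the function names are assumptions made for the example, and the ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create a hypothetical invoices table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS invoices (
            source_system     TEXT NOT NULL,
            source_invoice_id TEXT NOT NULL,
            amount_cents      INTEGER NOT NULL,
            status            TEXT NOT NULL,
            UNIQUE (source_system, source_invoice_id)
        )
        """
    )
    return conn


def upsert_invoice(conn, source_system, source_invoice_id, amount_cents, status):
    """Insert the invoice, or update it if the stable key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO invoices (source_system, source_invoice_id, amount_cents, status)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (source_system, source_invoice_id)
            DO UPDATE SET amount_cents = excluded.amount_cents,
                          status = excluded.status
            """,
            (source_system, source_invoice_id, amount_cents, status),
        )


if __name__ == "__main__":
    conn = open_db()
    upsert_invoice(conn, "acme", "INV-10493", 1200, "open")
    upsert_invoice(conn, "acme", "INV-10493", 1200, "paid")  # re-run: update, not duplicate
    print(conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0])
```

The unique constraint is what does the real work here: even if the application code mistakenly issued a plain insert, the database would reject the duplicate.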
2) Keep a run ledger (a lightweight record of what happened)
When side effects go beyond your database (sending an email, creating a ticket, charging a card), you need a way to prove “did we already do this?” A small ledger table or collection can do that.
At minimum, track the unit of work key, the status, and when it was last attempted. Many teams also include an idempotency key per external call.
{
  "job": "invoice-import",
  "unitKey": "acme:invoice:INV-10493",
  "status": "applied",
  "appliedAt": "2026-06-01T02:14:05Z",
  "externalIdempotencyKey": "inv-import:INV-10493:v1"
}
Here status is one of “applied,” “skipped,” or “failed.”
This is not heavy observability. It is a pragmatic memory, so your system does not rely on a human remembering what happened at 2 AM.
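As a rough sketch of how such a ledger might gate a side effect, the following uses sqlite3 as the store. The run_ledger table, its column names, and the apply_once helper are hypothetical names chosen for this illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_ledger(path=":memory:"):
    """Create a hypothetical ledger table keyed by (job, unit_key)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_ledger (
            job             TEXT NOT NULL,
            unit_key        TEXT NOT NULL,
            status          TEXT NOT NULL,
            last_attempt_at TEXT NOT NULL,
            PRIMARY KEY (job, unit_key)
        )
        """
    )
    return conn


def record(conn, job, unit_key, status):
    """Write or overwrite the ledger entry for one unit of work."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:
        conn.execute(
            """
            INSERT INTO run_ledger (job, unit_key, status, last_attempt_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (job, unit_key)
            DO UPDATE SET status = excluded.status,
                          last_attempt_at = excluded.last_attempt_at
            """,
            (job, unit_key, status, now),
        )


def apply_once(conn, job, unit_key, action):
    """Run action() unless the ledger already says this unit was applied."""
    row = conn.execute(
        "SELECT status FROM run_ledger WHERE job = ? AND unit_key = ?",
        (job, unit_key),
    ).fetchone()
    if row is not None and row[0] == "applied":
        return "skipped"
    try:
        action()
    except Exception:
        record(conn, job, unit_key, "failed")
        raise
    record(conn, job, unit_key, "applied")
    return "applied"
```

Note the gap this sketch leaves open: if the process dies after action() succeeds but before the ledger write, the next run will repeat the action. That window is the reason to also pass an idempotency key to the external system when it supports one.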
3) Use durable checkpoints for pagination and large backfills
For “read many, process many” jobs, failures often happen mid-stream. A checkpoint stores where you are in the source data so you can resume. That checkpoint must be written durably and updated in a careful order.
Two common checkpoint shapes:
- Cursor based: store the last seen cursor token or item ID.
- Time window based: store a high-water mark timestamp, plus a small overlap to handle out-of-order updates.
Importantly, do not treat “checkpoint advanced” as meaning “everything before it applied.” Advance only after you have recorded application for each unit of work, or you will skip data on a crash.
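The ordering rule can be illustrated with a small cursor-based loop. This is a sketch under assumptions: fetch_page and apply_unit are hypothetical callables supplied by the caller, fetch_page(cursor) is assumed to return a pair of (items, next_cursor) with next_cursor set to None on the last page, and the checkpoint is a JSON file written atomically on the local filesystem.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved cursor, or None if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["cursor"]
    except FileNotFoundError:
        return None


def save_checkpoint(path, cursor):
    """Write the cursor to a temp file, fsync it, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"cursor": cursor}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run(fetch_page, apply_unit, checkpoint_path):
    """Process pages from the saved cursor; advance only after a page is applied."""
    cursor = load_checkpoint(checkpoint_path)
    while True:
        items, next_cursor = fetch_page(cursor)
        for item in items:
            apply_unit(item)  # must itself be idempotent
        if next_cursor is None:
            break  # last page: nothing further to point at
        # Every unit in this page is applied, so moving forward is now safe.
        save_checkpoint(checkpoint_path, next_cursor)
        cursor = next_cursor
```

If apply_unit raises partway through a page, the checkpoint still points at the start of that page, so the next run repeats the page rather than skipping it. That is only safe because each unit is applied idempotently.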
4) Make side effects conditional and reversible where possible
Some side effects are inherently non-idempotent, like sending a physical letter. Others can be made idempotent by creating a record first and acting only once, or by switching to an update-style action.
- Instead of “send email,” consider “ensure notification record exists; send only if not already sent.”
- Instead of “create external object,” consider “create or update external object using a stable external key.”
- If you must perform a one-way action, add a manual approval step or human review.
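The first bullet, "ensure notification record exists; send only if not already sent," might look like the following outbox-style sketch. The notifications table, its columns, and the send_email callable are hypothetical; send_email is assumed to raise if the email provider does not accept the message.

```python
import sqlite3


def open_outbox(path=":memory:"):
    """Create a hypothetical notifications table keyed by a stable key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS notifications (
            notification_key TEXT PRIMARY KEY,
            recipient        TEXT NOT NULL,
            sent             INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    return conn


def ensure_notification(conn, notification_key, recipient):
    """Record the intent to notify; a repeat call with the same key is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO notifications (notification_key, recipient) "
            "VALUES (?, ?)",
            (notification_key, recipient),
        )


def send_pending(conn, send_email):
    """Send every notification not yet marked as sent; return how many were sent."""
    pending = conn.execute(
        "SELECT notification_key, recipient FROM notifications WHERE sent = 0"
    ).fetchall()
    for notification_key, recipient in pending:
        send_email(recipient, notification_key)  # raises if not accepted
        with conn:  # mark as sent only after the provider accepted it
            conn.execute(
                "UPDATE notifications SET sent = 1 WHERE notification_key = ?",
                (notification_key,),
            )
    return len(pending)
```

Separating "decide to notify" from "deliver" means the job can call ensure_notification on every run without worrying about how many times it has run before.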
Example: a nightly invoice import that never double-bills
Imagine a small subscription business that pulls invoices from a billing provider and syncs them into a CRM and an internal analytics database. The job runs nightly, but sometimes the provider API times out. The business wants to retry automatically without creating duplicate invoices or duplicate “payment reminder” emails.
A robust design might look like this:
- Define the unit of work: one invoice identified by provider_invoice_id.
- Upsert the invoice row: write to the CRM integration table with a unique constraint on that ID.
- Ledger the email side effect: create a record like email:reminder:INV-10493 with status “sent” only after the email provider confirms acceptance.
- Checkpoint pagination: store the provider API cursor after each page, but only after processing all invoices in that page.
Now consider a failure mode: the job processes invoices 1 to 200 across the first two pages, sends 5 reminders, then crashes before it saves the checkpoint for page 2. When it restarts, it re-reads page 2 because the cursor did not advance. That is fine:
- The invoice upserts do not create duplicates.
- The email ledger prevents sending the same reminders again.
- The job continues and eventually advances the checkpoint.
The result is boring reliability. Operators stop worrying about hitting “retry,” because retry is part of normal operation.
Copyable checklist: make your job safe to re-run
Use this checklist when you design or refactor an automation. You can paste it into a ticket template or runbook.
- Define the unit of work: What is “one thing” and what is its stable key?
- Choose write semantics: Can the destination operation be an upsert or a replace instead of an insert?
- Add uniqueness: Is there a constraint (database unique index or equivalent) that prevents duplicates even if your code is wrong?
- Track progress durably: Do you have a checkpoint that survives restarts, and does it advance only after successful application?
- Record side effects: For each external action (email, SMS, ticket, charge), do you record “done” with an idempotency key?
- Handle partial success: If a batch contains 100 items and item 37 fails, can you retry item 37 without redoing the other 99?
- Plan for replay: If you need to reprocess last week, can you do it without duplicates and without manual cleanup?
- Log with keys: Do your logs include unit keys so you can trace a single record through retries?
- Test a failure: Have you simulated a crash mid-run and verified the second run converges to the correct result?
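For the partial-success and logging items in the checklist, one possible shape is a loop that isolates each unit, logs its key, and returns only the failures for retry. The function and parameter names are hypothetical, and the in-memory done_keys set stands in for a durable ledger.

```python
import logging

logger = logging.getLogger("batch")


def process_batch(items, key_of, apply_unit, done_keys):
    """Apply each item once; return the items that failed so they can be retried.

    done_keys is a set of unit keys already applied. In a real job this would
    be backed by durable storage; a plain set is used here for illustration.
    """
    failed = []
    for item in items:
        key = key_of(item)
        if key in done_keys:
            continue  # already applied on an earlier attempt
        try:
            apply_unit(item)
        except Exception:
            # Log with the unit key so one record can be traced across retries.
            logger.exception("unit failed key=%s", key)
            failed.append(item)
        else:
            done_keys.add(key)
            logger.info("unit applied key=%s", key)
    return failed
```

Calling process_batch again with the returned list, or with the whole original batch, touches only the units that have not yet succeeded.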
Common mistakes (and how to avoid them)
Most idempotency failures come from a few patterns. Catch these early and your automation will feel dramatically more dependable.
- Using timestamps as identity: “Process everything since last run” is not a stable unit of work. Prefer stable IDs, or combine a high-water mark with overlap and per-item deduplication.
- Advancing checkpoints too early: If you save the cursor before work is applied, a crash can permanently skip data. Advance only after you have persisted results or ledger entries.
- Assuming API retries are free: A timeout does not mean the request failed. If you retry a “create” call, you might create duplicates unless you use an idempotency key supported by the destination, or your own ledger.
- Idempotency only in code, not in storage: Without uniqueness constraints, a bug or race condition can still create duplicates. Make the database enforce rules whenever possible.
- One giant transaction: For long-running jobs, a single transaction can hold locks for the whole run and roll back all progress on one late failure. Prefer per-unit commits and a ledger for overall progress.
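The "API retries are not free" mistake is easiest to see with a simulated destination that applies a request but loses the response. Everything here is a made-up stand-in: FakeVendor is not a real client library, and it simply assumes the destination deduplicates on an idempotency key.

```python
import time


class FakeVendor:
    """Hypothetical destination API that deduplicates on an idempotency key."""

    def __init__(self):
        self.objects = {}
        self.drop_next_response = True

    def create(self, idempotency_key, payload):
        # The request is applied first...
        if idempotency_key not in self.objects:
            self.objects[idempotency_key] = dict(payload)
        # ...and then the response is lost once, to mimic an ambiguous timeout.
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the request was applied")
        return self.objects[idempotency_key]


def create_with_retry(client, idempotency_key, payload, attempts=3, base_delay=0.0):
    """Retry on timeout, reusing the same key so a repeat cannot create a duplicate."""
    for attempt in range(attempts):
        try:
            return client.create(idempotency_key, payload)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff


if __name__ == "__main__":
    vendor = FakeVendor()
    create_with_retry(vendor, "inv-import:INV-10493:v1", {"amount_cents": 1200})
    print(len(vendor.objects))
```

The important detail is that the key is computed once, outside the retry loop. Generating a fresh key per attempt would defeat the destination's deduplication.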
When not to pursue strict idempotency
Idempotency is a great default, but you should be honest about costs. Some workflows do not warrant strict guarantees.
- Low-impact, easily reversible actions: If a duplicate update is harmless and easy to correct, you might accept “best effort” behavior.
- High-complexity distributed transactions: If making every side effect perfectly idempotent adds lots of infrastructure, consider narrowing scope or introducing a manual step for the risky portion.
- Truly one-way actions: If the side effect cannot be repeated or reversed (for example, physical fulfillment), prioritize human review and explicit approvals over automated retries.
A good compromise is to make the data synchronization idempotent and to isolate non-idempotent side effects behind a separate, auditable step.
Key Takeaways
- Define a stable unit of work first. Idempotency becomes much easier once “one thing” is clear.
- Prefer upserts and uniqueness constraints to prevent duplicates at the storage layer.
- Use a small ledger for side effects like emails and external creates, especially when timeouts are possible.
- Checkpoints are only safe when they advance after successful application, not before.
- Test re-runs by simulating crashes. A job that converges after failure is the goal.
Conclusion: aim for boring re-runs
The most mature automation systems are not the ones that never fail. They are the ones where failure is routine and recovery is uneventful. Idempotent job design turns “retry” from a scary button into a normal control.
If you take only one step, pick a unit of work and enforce uniqueness on it. That single constraint often eliminates the worst outcomes, and it gives you a solid base to add ledgers and checkpoints as your workflows grow.
FAQ
Is idempotency only relevant for APIs?
No. It applies to any automation with side effects: database imports, file processing, scheduled reports, queue consumers, and integrations. If a process can be re-run, idempotency is worth considering.
What is the difference between deduplication and idempotency?
Deduplication removes or ignores repeats after they happen. Idempotency prevents repeats from causing a different outcome in the first place. Deduplication can be part of an idempotent design, but it is usually not enough on its own.
How do I pick a good idempotency key?
Start with a stable identifier from the source system (like an invoice ID). If needed, namespace it by system and action (for example, billing:invoice:INV-10493 and email:reminder:INV-10493). The key should not depend on time-of-run.
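A small helper can keep key construction consistent across a codebase. This is one possible convention, not a standard: the function name, the colon separator, and the optional version suffix are assumptions for illustration.

```python
def idempotency_key(system, action, source_id, version=None):
    """Build a namespaced key from stable parts only (never the time of the run)."""
    parts = [system, action, str(source_id)]
    if version is not None:
        parts.append(f"v{version}")
    if any(not part for part in parts):
        raise ValueError("every key part must be non-empty")
    return ":".join(parts)


if __name__ == "__main__":
    print(idempotency_key("billing", "invoice", "INV-10493"))
    print(idempotency_key("email", "reminder", "INV-10493"))
```

An explicit version part is one way to deliberately re-apply a unit later, for example after fixing a bug, without deleting ledger entries.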
Do I need both checkpoints and a ledger?
Not always. For small jobs, an upsert plus a unique constraint may be enough. Add checkpoints when you paginate or process large ranges, and add a ledger when you perform non-idempotent side effects or interact with systems that can accept a request without reliably returning a response.