Scheduled automations look simple: run something every hour, pull new data, push updates, done. In real systems, “every hour” turns into missed runs, duplicate records, angry APIs, and a backlog of manual fixes.
The good news is that most failures come from a few predictable gaps: unclear time boundaries, unsafe reruns, and a lack of respect for rate limits. You can fix these with a small set of design choices that are easy to document and test.
This post gives you an evergreen framework for building scheduled API jobs that behave well under real conditions: clock drift, retries, partial failures, and provider throttling.
Why scheduled jobs fail in practice
Scheduled jobs fail less because of “bugs” and more because the job has no explicit contract for what it is responsible for. Without that contract, operators (including future you) cannot answer basic questions like: Which records should have been processed? If we rerun, what happens? If we skip a day, can we catch up safely?
Here are the failure modes that show up repeatedly:
- Gaps: data created during a missed window never gets processed.
- Overlaps: the job reprocesses the same items and creates duplicates downstream.
- Rate limit blowups: retries amplify traffic until the provider blocks you.
- Silent partials: the job “succeeds” but some items were skipped due to pagination, timeouts, or mid-run errors.
- Unbounded catch-up: after downtime, the job tries to process weeks of data in one run and collapses.
Durability comes from defining boundaries, persisting progress, and treating retries as a first-class feature.
Define the time window contract
A durable schedule begins with a clear statement: “This run processes items in time window [start, end) according to a chosen timestamp field.” That bracket style is intentional: inclusive start, exclusive end. It prevents double-processing items that land exactly on a boundary.
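As a quick illustration, a half-open membership check keeps a boundary item in exactly one window. This is a minimal Python sketch with made-up boundary values:

```python
from datetime import datetime, timezone

def in_window(ts: datetime, start: datetime, end: datetime) -> bool:
    # Half-open [start, end): include the start boundary, exclude the end.
    return start <= ts < end

start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)

# An item stamped exactly at 11:00 belongs to the next window,
# so adjacent windows never both claim it.
print(in_window(start, start, end))  # True
print(in_window(end, start, end))    # False
```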
Choose your watermark (cursor) carefully
Most APIs offer at least one timestamp: created_at, updated_at, or an event time. Decide which one governs your job. If your goal is to keep a mirror up to date, updated_at is usually the right watermark. If you only care about new records, created_at may be enough.
Then decide what you persist between runs:
- High-water mark: the maximum processed timestamp (plus tie-breaker) from the last successful run.
- Cursor token: an API-provided pagination cursor, if the provider supports it.
- Explicit window: store window_start and window_end per run so you can rerun the same window deterministically.
When timestamps are not unique, add a tie-breaker. A common pattern is “(timestamp, id)” ordering: process in ascending timestamp and use the record ID to break ties. This avoids missing records that share the same timestamp.
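For example, sorting by the (timestamp, id) pair gives a stable order and lets you resume strictly after the last processed pair. A small Python sketch with hypothetical records:

```python
# Sort by (timestamp, id) so records sharing a timestamp have a stable,
# deterministic order. ISO 8601 strings sort correctly as plain strings.
records = [
    {"id": "c", "updated_at": "2024-01-01T10:00:00Z"},
    {"id": "a", "updated_at": "2024-01-01T10:00:00Z"},
    {"id": "b", "updated_at": "2024-01-01T09:59:00Z"},
]

ordered = sorted(records, key=lambda r: (r["updated_at"], r["id"]))

# Resume strictly after the last processed (timestamp, id) pair; records
# with an equal timestamp but a later ID are still picked up.
last_seen = ("2024-01-01T10:00:00Z", "a")
remaining = [r for r in ordered if (r["updated_at"], r["id"]) > last_seen]
print([r["id"] for r in remaining])  # ['c']
```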
A useful mental model is to make each run auditable. Even if your scheduler triggers every hour, your automation should be able to say, “I processed from 10:00 to 11:00 UTC, and here is the count of items I touched.”
Key Takeaways
- Make the job’s responsibility explicit: process [start, end) for a chosen watermark field.
- Persist progress as data, not as assumptions in code.
- Design reruns and backfills up front, not as emergency procedures.
- Throttle intentionally so retries do not become traffic spikes.
Backfills and reruns without drama
Backfills are not exceptional. They are a normal operation: onboarding a new destination, fixing a bug, recovering from downtime, or correcting bad data. If your job cannot backfill safely, it is fragile.
There are two distinct actions to support:
- Rerun: reprocess the exact same window because something failed or you want to validate.
- Backfill: process historical windows that were never processed (or are being reprocessed intentionally).
The safest approach is to store a run record with its window boundaries and status. Conceptually, you can think of the scheduler as creating “work units” and the worker as consuming them.
```text
{
  run_id,
  window_start,     // inclusive
  window_end,       // exclusive
  watermark_field,  // e.g. updated_at
  status,           // pending | running | succeeded | failed
  item_counts: {fetched, processed, failed},
  started_at,
  finished_at
}
```
This record becomes your control plane. It lets you:
- Retry only failed runs, without guessing what was included.
- Limit catch-up by creating a bounded sequence of windows.
- Prove completeness: you can show that windows cover a timeline without gaps.
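Creating that bounded sequence of windows is only a few lines of code. A minimal sketch (the `plan_windows` helper and the dates are illustrative, not part of any library):

```python
from datetime import datetime, timedelta, timezone

def plan_windows(start: datetime, end: datetime, size: timedelta):
    """Split a catch-up range into bounded [start, end) work units."""
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, min(cursor + size, end)))
        cursor += size
    return windows

# After two days of downtime, enqueue 48 hourly windows, not one giant run.
downtime_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
units = plan_windows(downtime_start, now, timedelta(hours=1))
print(len(units))  # 48
```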
To make reruns safe, the downstream write must be stable under duplication. Prefer upserts (insert or update) keyed by a durable external ID. If the destination does not support upsert, you can approximate it by writing to a staging area and merging, or by checking for existence before insert. The goal is the same: rerunning should not create extra rows or extra side effects.
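As one concrete pattern, SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available since SQLite 3.24; most databases have an equivalent) makes the write stable under duplication. The table and IDs below are hypothetical:

```python
import sqlite3

# Upsert keyed by the durable external ID: rerunning the same window
# updates rows in place instead of inserting duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (crm_id TEXT PRIMARY KEY, email TEXT)")

def upsert_contact(crm_id: str, email: str) -> None:
    conn.execute(
        """INSERT INTO contacts (crm_id, email) VALUES (?, ?)
           ON CONFLICT(crm_id) DO UPDATE SET email = excluded.email""",
        (crm_id, email),
    )

upsert_contact("crm-123", "old@example.com")
upsert_contact("crm-123", "new@example.com")  # rerun: update, not duplicate
rows = conn.execute("SELECT crm_id, email FROM contacts").fetchall()
print(rows)  # [('crm-123', 'new@example.com')]
```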
Rate limits and polite throughput
Rate limits are not just a “performance” concern. They are a reliability concern. A job that frequently hits rate limits will produce partial windows, long runtimes, and inconsistent retries.
A simple throttle model: steady pacing plus burst control
You do not need advanced infrastructure to behave well. You need predictable pacing:
- Steady pace: a target request rate (for example, N requests per second) that stays below the provider’s limits.
- Burst control: avoid spikes caused by parallelism, retries, or page boundaries.
- Backoff: when the provider signals throttling, slow down and try again after waiting.
A practical rule: retries should consume the same budget as normal requests. If you retry immediately and in parallel, you can turn one intermittent timeout into a sustained throttle.
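A minimal sketch of that model in Python: steady pacing, jittered exponential backoff, and retries flowing through the same pacing gate as first attempts. `ThrottledError`, `PacedClient`, and the rate numbers are assumptions for illustration, not a real client library:

```python
import random
import time

class ThrottledError(Exception):
    """Raised when the provider signals rate limiting (e.g. HTTP 429)."""

class PacedClient:
    """Steady pacing plus backoff; retries consume the same request budget."""

    def __init__(self, requests_per_second: float = 2.0):
        self.min_interval = 1.0 / requests_per_second
        self._last = 0.0

    def _pace(self) -> None:
        # Enforce a minimum gap between requests, including retries.
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()

    def call(self, fn, max_attempts: int = 5, base_backoff: float = 1.0):
        for attempt in range(max_attempts):
            self._pace()
            try:
                return fn()
            except ThrottledError:
                # Full jitter prevents synchronized retry spikes.
                time.sleep(random.uniform(0, base_backoff * 2 ** attempt))
        raise RuntimeError("still throttled after retries")
```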
Also budget time. If your job runs hourly, but a full window takes 55 minutes under normal conditions, you have no slack for retries or backoff. Aim for a typical runtime of 20 to 40 percent of the schedule interval, or increase the schedule interval and compensate with larger windows.
A concrete example: nightly CRM to warehouse sync
Consider a small business that wants a nightly sync from a CRM to a reporting database. The business relies on these reports each morning to plan staffing and follow-ups. The CRM API is rate-limited and occasionally returns intermittent errors.
A durable design looks like this:
- Schedule: run every night, but define windows by UTC day: [00:00, 24:00).
- Watermark: use updated_at so edits to existing contacts and deals are included.
- Work units: create one run record per day. If a run fails, rerun the same day window.
- Pagination: process pages in a deterministic order; store counts and last-seen keys for debugging.
- Writes: upsert into the warehouse keyed by the CRM’s object ID.
- Backfill: if you change the mapping logic, enqueue the last 30 days as separate windows, not one giant run.
If the CRM API rate-limits the job on a given night, the run record shows it failed, and the next attempt reruns the same window. Because writes are upserts, rerunning is safe. Because the window is bounded, catch-up is controlled and predictable.
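Sketched as code, the worker for one window might look like the following; `fetch_page` and `upsert` are hypothetical stand-ins for the CRM client and the warehouse writer:

```python
def sync_window(fetch_page, upsert, window_start, window_end):
    """Process one [start, end) window: deterministic page order,
    per-item counts for the run record, upsert writes."""
    counts = {"fetched": 0, "processed": 0, "failed": 0}
    cursor = None  # provider pagination cursor; None means first page
    while True:
        page, cursor = fetch_page(window_start, window_end, cursor)
        for item in page:
            counts["fetched"] += 1
            try:
                upsert(item)
                counts["processed"] += 1
            except Exception:
                counts["failed"] += 1  # record the failure, keep the run going
        if cursor is None:
            return counts  # stored on the run record for auditing
```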
Checklist: ship a durable scheduler
Copy this checklist into your project and treat it as the definition of “ready to operate”:
- Window contract: Define [start, end) and document the watermark field.
- Progress persistence: Store run records with window boundaries and status.
- Deterministic selection: Ensure the query order is stable; add a tie-breaker when timestamps collide.
- Safe writes: Use upsert or an equivalent pattern so reruns do not duplicate side effects.
- Bounded retries: Cap retry attempts and total run time; record failures with enough context to debug.
- Rate limit plan: Set a target request rate and implement backoff on throttle responses.
- Catch-up strategy: On backlog, create multiple small windows instead of one huge job.
- Validation: Record counts (fetched, processed, failed) and alert on anomalies.
- Operator workflow: Make it easy to rerun a specific window by run ID.
Common mistakes
- Using “last run time” as the only state: if the system clock changes or a run overlaps, you create gaps or duplicates. Store windows explicitly.
- Assuming timestamps are unique: many systems update multiple records at the same second. Without a tie-breaker, you will eventually miss records.
- Equating “job finished” with “job succeeded”: partial failures should be first-class. Track item-level failures and counts.
- Letting retries flood the API: treat retries as traffic, not as “free.” Apply the same throttle and add backoff.
- Doing big-bang backfills: a three-month catch-up run is hard to monitor, hard to rerun, and easy to time out. Slice by windows.
When not to do this
Scheduled jobs are great for batchable work, but they are not always the right tool. Consider alternatives if:
- You need near-real-time updates: a schedule introduces inherent delay. Event-driven patterns (like webhooks) are often better.
- The source API has weak filtering: if you cannot request changes by time or cursor, every run becomes a full scan, which is costly and error-prone.
- Side effects are irreversible: if your job triggers actions (emails, account changes), reruns must be extremely careful. You may need a review queue or a two-step publish.
- The data requires strong transactional consistency: batch windows can see partial snapshots. If that matters, you need a different integration approach.
If you still choose a schedule in these situations, acknowledge the tradeoffs and add compensating controls.
Conclusion
Durable scheduled automations are less about the cron expression and more about the contract: explicit windows, persisted progress, safe reruns, and intentional throttling. Once those are in place, backfills become routine work instead of emergency surgery.
If you publish your job’s “run records” as the source of truth, you give yourself an operational handle: you can explain what happened, rerun what matters, and keep external APIs happy.
FAQ
How big should my time windows be?
Choose windows small enough to finish comfortably within your schedule interval, with room for retries and backoff. For many small business integrations, 15 to 60 minutes works well for frequent jobs, and one day works well for nightly reporting. If you are unsure, start smaller and scale up once you observe stable runtimes.
What if records arrive late or timestamps change after the fact?
Use updated_at where possible, and consider a small overlap buffer (for example, always start 5 minutes earlier than the last window end) while relying on upsert to avoid duplicates. The overlap should be a deliberate, documented design choice.
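A sketch of that buffer, assuming a five-minute overlap and hourly windows (both values are illustrative choices, not recommendations for every API):

```python
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(minutes=5)  # deliberate, documented safety margin

def next_window(last_end: datetime, interval: timedelta):
    """Start slightly before the last window end so late or retroactively
    updated records are re-fetched; upserts absorb the duplicates."""
    return (last_end - OVERLAP, last_end + interval)

last_end = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
start, end = next_window(last_end, timedelta(hours=1))
print(start.isoformat())  # 2024-01-01T10:55:00+00:00
```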
How do I handle deletes?
Deletes often require a different feed (like a “deleted records” endpoint) or periodic reconciliation. If your source does not provide delete events, plan for a periodic full comparison of keys, or accept that your destination is an append-only history.
What should I log for each run?
At minimum: window boundaries, counts (fetched, processed, failed), the reason for failure, and the slowest or most common API error. Logs are helpful, but the run record is what makes the system operable.