“Sync our data to that other system” sounds simple until you ship it. Then you discover missing records, duplicates, jobs that run forever, and a support inbox full of “why is this customer not there?” questions.
The good news is that most API sync failures come from a handful of predictable issues: pagination that shifts under you, rate limits that you only hit in production, and retries that accidentally create duplicates.
This post lays out a dependable pattern you can reuse: iterate through pages using a stable cursor, record a checkpoint as you progress, and design retries to be safe. It is intentionally tool-agnostic so you can apply it whether you run jobs via a server, a scheduler, or a workflow tool.
Why API data syncs fail in practice
API documentation often shows a happy-path loop: fetch page 1, fetch page 2, and so on. Real systems are less cooperative. Here are the failure modes that show up repeatedly:
- Unstable pagination: offset pagination (?page=3 or ?offset=200) can shift when new records are created during your run, causing skips or duplicates.
- Ambiguous ordering: if the API does not guarantee sort order, the same record might move between pages from one request to the next.
- Rate limiting: after you scale from hundreds to tens of thousands of records, you start getting 429 responses or throttling headers you never noticed.
- Retries that are not safe: a timeout after writing to the destination can prompt a retry that writes again, creating duplicates or conflicting updates.
- No progress tracking: if the job crashes mid-run, you do not know where to resume, so you rerun everything and hope it “mostly works.”
A reliable sync is less about raw speed and more about correctness under interruptions. That means designing for partial completion, restarts, and repeated runs.
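To make the unstable pagination failure mode concrete, here is a small simulation. The `fetch_page` function and the record names are hypothetical stand-ins for an offset-paginated API that returns newest records first; no real endpoint is implied.

```python
def fetch_page(records, offset, limit):
    """Simulate an offset-paginated endpoint (hypothetical, newest records first)."""
    return records[offset:offset + limit]

# Illustrative source data, newest first.
records = ["c5", "c4", "c3", "c2", "c1"]

page1 = fetch_page(records, offset=0, limit=2)  # ["c5", "c4"]

# A new customer is created between the two requests.
records.insert(0, "c6")

# Every existing record has shifted one position, so offset 2 now starts at "c4".
page2 = fetch_page(records, offset=2, limit=2)  # ["c4", "c3"]

seen = page1 + page2
duplicates = {r for r in seen if seen.count(r) > 1}
print(duplicates)  # {'c4'}
```

A deletion between requests shifts records the other way, which turns the duplicate into a silent skip.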
The core pattern: cursor pagination + checkpoints
The backbone of a robust sync is a monotonic traversal of the source, paired with a durable checkpoint you can resume from. If the API supports a cursor-based approach (sometimes called “next token”), use it. If it supports filtering by updated time plus a stable tie-breaker (like ID), you can emulate a cursor.
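A cursor traversal can be sketched as a small generator. The response shape (`items` and `next_cursor` keys) and the `fake_fetch` stand-in are assumptions for illustration; real APIs name these fields differently.

```python
from typing import Callable, Iterator, List, Optional, Tuple

def iterate_pages(
    fetch: Callable[[Optional[str]], dict],
    start_cursor: Optional[str] = None,
) -> Iterator[Tuple[List, Optional[str]]]:
    """Yield (items, next_cursor) per page, starting from a saved cursor if given."""
    cursor = start_cursor
    while True:
        page = fetch(cursor)
        # Assumed response shape: {"items": [...], "next_cursor": str or None}.
        next_cursor = page.get("next_cursor")
        yield page["items"], next_cursor
        if next_cursor is None:
            break
        cursor = next_cursor

def fake_fetch(cursor):
    """Hypothetical three-page source used only to exercise the loop."""
    pages = {
        None: {"items": [1, 2], "next_cursor": "p2"},
        "p2": {"items": [3, 4], "next_cursor": "p3"},
        "p3": {"items": [5], "next_cursor": None},
    }
    return pages[cursor]

# Resuming from a saved cursor skips the pages already processed.
for items, next_cursor in iterate_pages(fake_fetch, start_cursor="p2"):
    print(items, next_cursor)
```

Yielding the next cursor alongside the items lets the caller decide when to persist it, which matters once destination writes enter the picture.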
What a “checkpoint” really is
A checkpoint is a small piece of state that captures “how far we got” in a way that is safe to reuse. The simplest checkpoints are:
- Cursor token: store the API’s next cursor after each page.
- High-water mark: store the last processed updated_at timestamp plus a tie-breaker ID.
- Destination mapping: store a mapping of source IDs to destination IDs so you update instead of re-create.
Persist checkpoints somewhere durable: a database table, a key-value store, or even a small file in object storage. The important part is that it survives process restarts.
{
"syncName": "customers_to_crm",
"source": { "cursor": "next_page_token_or_null" },
"watermark": { "updatedAt": "2026-01-01T00:00:00Z", "id": "cust_000123" },
"stats": { "processed": 1200, "errors": 3 }
}
If the API provides a cursor, you usually do not need a watermark; the example above shows both fields only for illustration. If it does not, a watermark is the usual fallback. In both cases, the intent is the same: resume without guessing.
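As a minimal sketch of durable persistence, the functions below keep the checkpoint in a local JSON file. The function names and the file-based store are assumptions; a database row or key-value entry would serve the same purpose.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except Exception:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise

def load_checkpoint(path, default=None):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Writing to a temporary file and renaming means a crash mid-write leaves the previous checkpoint intact rather than a truncated one.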
Make ordering explicit
If you use a time-based watermark, define a deterministic ordering. A practical rule is: sort by updated_at ascending, then by id ascending. Your “resume” filter becomes:
- records where updated_at is greater than the saved timestamp, or
- records where updated_at equals the saved timestamp and id is greater than the saved ID
This prevents you from re-processing large swaths while still allowing safe resume when multiple records share the same timestamp.
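The two-part resume filter above collapses into a single tuple comparison. This sketch assumes timestamps are ISO 8601 UTC strings in one consistent format, so string order matches chronological order; the sorting is done locally here only because there is no real API to ask for ordered results.

```python
def is_after_watermark(record, watermark):
    """True if the record sorts strictly after the saved (updated_at, id) pair."""
    # Tuple comparison checks updated_at first and falls back to id on a tie.
    return (record["updated_at"], record["id"]) > (
        watermark["updated_at"],
        watermark["id"],
    )

def records_to_resume(records, watermark):
    """Return the records still to process, in deterministic order."""
    ordered = sorted(records, key=lambda r: (r["updated_at"], r["id"]))
    return [r for r in ordered if is_after_watermark(r, watermark)]

# Illustrative data: two records share the watermark's timestamp.
records = [
    {"id": "cust_3", "updated_at": "2026-01-01T00:00:00Z"},
    {"id": "cust_1", "updated_at": "2026-01-01T00:00:00Z"},
    {"id": "cust_2", "updated_at": "2026-01-01T00:00:05Z"},
    {"id": "cust_0", "updated_at": "2025-12-31T23:59:59Z"},
]
watermark = {"updated_at": "2026-01-01T00:00:00Z", "id": "cust_1"}

remaining = [r["id"] for r in records_to_resume(records, watermark)]
print(remaining)  # ['cust_3', 'cust_2']
```

Note that cust_3 is kept even though its timestamp equals the watermark, because its ID sorts after the saved one.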
Handling rate limits and transient errors
Once your sync becomes reliable, the next pain point is resiliency under load. You will see throttling, occasional 500s, timeouts, and network hiccups. Treat these as normal conditions.
Backoff and jitter: be polite and effective
A simple approach that works well is exponential backoff with jitter for transient errors (429, some 5xx, and timeouts). “Jitter” means adding randomness so multiple workers do not retry in lockstep.
- Start with a short delay (for example 1 to 2 seconds).
- Double the delay after each failure up to a cap (for example 60 seconds).
- Add randomness (for example, wait a random 50 to 100 percent of the computed delay).
If the API provides a Retry-After header, prefer it. It is the closest thing you get to “official” pacing.
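The backoff rules above fit in a few lines. This is a sketch under stated assumptions: the function name and defaults are illustrative, and `retry_after` is assumed to arrive as a number of seconds (the header can also carry an HTTP date, which is not handled here).

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Prefer the server's own pacing when it provides one.
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))  # 1, 2, 4, ... capped at `cap`
    return delay * random.uniform(0.5, 1.0)  # jitter: 50 to 100 percent

for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

Because of the jitter, two workers that fail at the same moment will usually wake up at different times instead of retrying in lockstep.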
Retry safely by separating fetch and write concerns
The most expensive bug in sync jobs is “retry equals duplicate.” To avoid that, make writes idempotent, meaning multiple identical attempts produce the same final state.
Common idempotency techniques include:
- Upsert by stable key: create or update the destination record based on the source’s immutable ID (stored in a dedicated field).
- Write with a version check: only update when the source updated_at is newer than what you last wrote.
- Use a dedupe key: for “create” endpoints that support it, include a unique request key tied to the source record.
Finally, checkpoint after your destination writes are confirmed. If you checkpoint before writing, you risk skipping records during a crash.
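An upsert combined with a version check can be sketched as below. A plain dict stands in for the destination, and the field names (`external_customer_id`, `source_updated_at`, `email`) are assumptions for illustration.

```python
def upsert_contact(destination, customer):
    """Create or update a contact keyed by source ID; skip stale or repeated writes."""
    key = customer["id"]
    existing = destination.get(key)
    # Version check: a retry or an older payload must not overwrite newer data.
    if existing is not None and existing["source_updated_at"] >= customer["updated_at"]:
        return "skipped"
    destination[key] = {
        "external_customer_id": key,
        "email": customer["email"],
        "source_updated_at": customer["updated_at"],
    }
    return "created" if existing is None else "updated"

crm = {}
customer = {"id": "cust_1", "email": "a@example.com", "updated_at": "2026-01-01T00:00:00Z"}

print(upsert_contact(crm, customer))  # created
print(upsert_contact(crm, customer))  # skipped (a retry of the same write)
```

The second call is what a retry after a timeout looks like: the final state is unchanged, which is the property the checkpoint logic relies on.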
A concrete example: syncing customers to a CRM
Imagine an e-commerce system that exposes customers via an API and a CRM where you want a matching contact record. The job runs hourly and should:
- create new contacts for new customers,
- update contacts when customer details change,
- avoid duplicates if the job is retried or rerun.
A practical implementation looks like this:
- Choose your traversal: fetch customers ordered by updated_at ascending, then id.
- Store a watermark: start with a conservative timestamp (for first run) and save (updated_at, id) as you make progress.
- Map source to destination: in the CRM contact, store external_customer_id to support upserts.
- Write idempotently: for each customer, “find by external ID” then update, else create with external ID.
- Checkpoint in small increments: update the watermark every page or every N records, not only at the end.
What happens if the CRM API times out after you send an update request? Your job retries that customer. Because you upsert by external ID and only apply updates when newer, the retry is safe.
What happens if new customers arrive while you are syncing? Because you traverse by updated_at ascending and save a watermark, they sort after your current position: either this run reaches them at the tail, or the next run picks them up from the saved watermark. You do not need to “chase” a moving tail during the current run.
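Putting the steps together, a whole run might look like the sketch below. Everything here is a stand-in: `fetch_customers` simulates the source API with a list, a dict plays the CRM, and the checkpoint is an in-memory dict rather than durable storage.

```python
def fetch_customers(source, watermark, limit):
    """Stand-in for the source API: customers after the watermark, in stable order."""
    ordered = sorted(source, key=lambda c: (c["updated_at"], c["id"]))
    after = [c for c in ordered if (c["updated_at"], c["id"]) > watermark]
    return after[:limit]

def run_sync(source, crm, checkpoint, page_size=2):
    """One run: page through the source, upsert into the CRM, checkpoint per page."""
    while True:
        watermark = (checkpoint["updated_at"], checkpoint["id"])
        page = fetch_customers(source, watermark, page_size)
        if not page:
            return
        for customer in page:
            existing = crm.get(customer["id"])  # "find by external ID"
            if existing is None or existing["updated_at"] < customer["updated_at"]:
                crm[customer["id"]] = {
                    "external_customer_id": customer["id"],
                    "email": customer["email"],
                    "updated_at": customer["updated_at"],
                }
        # Advance the watermark only after every write in the page succeeded.
        last = page[-1]
        checkpoint["updated_at"], checkpoint["id"] = last["updated_at"], last["id"]

source = [
    {"id": "cust_2", "updated_at": "2026-01-01T00:00:00Z", "email": "b@example.com"},
    {"id": "cust_1", "updated_at": "2026-01-01T00:00:00Z", "email": "a@example.com"},
    {"id": "cust_3", "updated_at": "2026-01-02T00:00:00Z", "email": "c@example.com"},
]
crm = {}
checkpoint = {"updated_at": "1970-01-01T00:00:00Z", "id": ""}  # conservative start

run_sync(source, crm, checkpoint)
run_sync(source, crm, checkpoint)  # a rerun finds nothing new and changes nothing
print(len(crm), checkpoint)
```

If the process dies partway through a page, the watermark still points at the previous page, so the next run replays that page and the version check makes the replay harmless.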
Checklist: a copyable implementation plan
Use this as a build checklist for your next API-to-API sync. If you can tick every box, you are usually in good shape.
- Define the contract: what objects sync, what fields map, and what “source of truth” means for conflicts.
- Pick a stable traversal: cursor pagination if available; otherwise updated_at plus id ordering.
- Persist a checkpoint: cursor or watermark in durable storage.
- Make writes idempotent: upsert by stable external ID; avoid blind “create” calls.
- Separate errors: treat 4xx validation errors (other than 429) as record-level problems; treat 429/5xx/timeouts as retryable.
- Backoff correctly: exponential backoff with jitter; obey Retry-After if provided.
- Checkpoint frequently: per page or per batch, after successful writes.
- Log for support: include sync name, source ID, destination ID (if known), and a short error reason.
- Set limits: maximum retries per record and maximum runtime per job run.
- Plan for replays: assume the job might rerun the same range and ensure outcomes remain correct.
- Prefer cursor pagination. If you cannot, use a deterministic updated_at plus id ordering and store a watermark.
- Checkpoint progress in durable storage and update it only after confirmed destination writes.
- Design retries to be safe by using upserts and version checks, not “best effort” creates.
- Rate limits and timeouts are normal. Backoff with jitter and respect Retry-After.
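The “separate errors” and “set limits” items from the checklist can be sketched together. The function names and outcome labels are hypothetical, and treating every 5xx as retryable is a simplifying assumption; some APIs return 5xx codes that will never succeed on retry.

```python
import time

def classify_status(status_code):
    """Map an HTTP status code to 'ok', 'retry', or 'record_error'."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"  # throttling or a server-side failure
    return "record_error"  # everything else, including other 4xx codes

def call_with_retries(send, max_attempts=3, sleep=time.sleep):
    """Call `send()` (which returns a status code) until it stops being retryable."""
    for attempt in range(max_attempts):
        outcome = classify_status(send())
        if outcome != "retry":
            return outcome
        sleep(min(60, 2 ** attempt))  # simple capped backoff, no jitter here
    return "gave_up"

# Simulated responses: throttled, then a server error, then success.
responses = iter([429, 503, 200])
print(call_with_retries(lambda: next(responses), sleep=lambda seconds: None))  # ok
```

A `record_error` result returns immediately, so one invalid record never triggers backoff for the rest of the job.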
Common mistakes to avoid
Most broken syncs are not broken because they are complex. They are broken because they assume the world is simpler than it is.
- Using offset pagination on mutable datasets: offsets are fine for static reports, but risky for ongoing syncs where new records are added.
- Checkpointing too late: saving state only at the end turns a crash into a full rerun. Save progress incrementally.
- Retrying everything the same way: a validation error on one record should not trigger exponential backoff for the whole job.
- Not capturing a stable external ID: if you do not store source IDs in the destination, you will eventually create duplicates.
- Over-optimizing throughput: parallelism can amplify rate limits and make debugging harder. Start with correctness, then scale carefully.
If you want to move fast, build a job that can fail without creating data damage. That is what checkpoints and idempotency buy you.
When not to build your own sync
This pattern is a good default, but there are times when a custom sync is the wrong tool:
- Near real-time requirements: if you truly need sub-minute latency, a scheduled pull sync may not fit. Event-driven integration might be more appropriate.
- Complex reconciliation rules: if you need bi-directional conflict resolution across many object types, you may want a dedicated integration layer.
- Regulated audit needs: if you must retain full change history with strict traceability, ensure your storage and logging meet the requirement before rolling your own.
- Very high volume: if you are processing millions of records per run, you will need more careful partitioning, capacity planning, and load testing.
Even then, the same concepts still apply: stable traversal, durable checkpoints, and safe retries. The difference is mainly the scale and operational rigor.
FAQ
Should I sync “all records” every time, or only changes?
Prefer incremental syncs based on cursor or watermark. Full re-syncs are useful as an occasional repair tool, but they are slower, more expensive, and more likely to hit rate limits. If you do run full re-syncs, idempotent writes become even more important.
How frequently should I update the checkpoint?
Update it often enough that a crash does not cause painful rework, but not so often that checkpoint writes dominate runtime. Per page is a good default. For large pages, consider checkpointing every N records instead. In either case, advance the checkpoint only after the destination writes it covers have succeeded.
What should I do with records that always fail (bad data)?
Do not let one bad record stall the entire sync. Log it with enough context to fix later, count it as a record-level failure, and continue. Optionally store a small “failed items” list for review, but keep the main flow moving.
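One way to keep the main flow moving is to collect record-level failures instead of raising them. In this sketch, `write` is a hypothetical callable, and `ValueError` is assumed to be how it signals bad data for a single record.

```python
def process_batch(records, write):
    """Write each record; collect record-level failures instead of aborting."""
    failed = []
    processed = 0
    for record in records:
        try:
            write(record)
            processed += 1
        except ValueError as exc:  # assumed signal for one bad record
            failed.append({"source_id": record.get("id"), "reason": str(exc)})
    return processed, failed

def write_contact(record):
    """Hypothetical destination write that rejects records without an email."""
    if not record.get("email"):
        raise ValueError("missing email")

batch = [
    {"id": "cust_1", "email": "a@example.com"},
    {"id": "cust_2", "email": ""},
    {"id": "cust_3", "email": "c@example.com"},
]
processed, failed = process_batch(batch, write_contact)
print(processed, failed)
```

Only the narrow exception type is caught on purpose: a network failure should still propagate to the retry logic rather than being filed as bad data.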
How do I prevent duplicates if the destination API does not support upserts?
Use a lookup step: search by a stored external ID field, then update if found, else create. If the destination cannot store an external ID at all, you may need a separate mapping store that tracks source ID to destination ID.
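The separate mapping store can be sketched as follows. `FakeCrm` is a hypothetical destination with plain create and update calls, and a dict stands in for the mapping store, which would need to be durable in practice.

```python
import itertools

class FakeCrm:
    """Stand-in for a destination API with create and update calls but no upsert."""

    def __init__(self):
        self.contacts = {}
        self._ids = itertools.count(1)

    def create(self, fields):
        contact_id = f"contact_{next(self._ids)}"
        self.contacts[contact_id] = dict(fields)
        return contact_id

    def update(self, contact_id, fields):
        self.contacts[contact_id].update(fields)

def sync_record(crm, id_map, source_id, fields):
    """Update the mapped contact if one exists; otherwise create it and record the mapping."""
    contact_id = id_map.get(source_id)
    if contact_id is None:
        contact_id = crm.create(fields)
        id_map[source_id] = contact_id  # remember the pairing for later runs
    else:
        crm.update(contact_id, fields)
    return contact_id

crm = FakeCrm()
id_map = {}
first = sync_record(crm, id_map, "cust_1", {"email": "a@example.com"})
second = sync_record(crm, id_map, "cust_1", {"email": "b@example.com"})
print(first == second, len(crm.contacts))  # True 1
```

A crash between the create call and saving the mapping can still leave an unmapped contact behind, so this approach narrows the duplicate window rather than closing it completely.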
Conclusion
Reliable API data syncs are built, not wished into existence. The winning pattern is consistent: traverse the source in a stable way, persist progress as a checkpoint, and make every write safe to retry.
Once you have that foundation, you can improve performance and add features confidently, because a restart or a rerun becomes a normal, boring event instead of a disaster.