Pagination looks simple: fetch page 1, then page 2, and repeat until you are done. In production sync jobs, that “repeat until done” is where reliability problems hide. Records get inserted while you are reading, APIs apply filters differently across endpoints, rate limits slow you down, and retries can create duplicates if you do not design for them.
This post gives you an evergreen pagination plan you can apply to most REST-style APIs and many GraphQL list queries. The goal is not maximum speed. The goal is a sync you can re-run safely, monitor confidently, and evolve as the dataset grows from hundreds to millions of rows.
Even if you already have a working import, a few adjustments to how you page, store state, and validate completeness can eliminate the most common failure modes without a major rewrite.
Why pagination fails in real sync jobs
Most pagination bugs are not “off by one” mistakes. They come from implicit assumptions that stop being true as your data and traffic grow. Here are the big ones:
- Unstable ordering: if the API does not guarantee a consistent sort order, items can move between pages while you read. You will miss some and duplicate others.
- Offset drift: offset-based pagination (skip N, take K) breaks when new records are inserted or deleted during the read window. Page boundaries shift.
- Retry duplication: if your job times out and restarts, it may reprocess pages unless you persist a checkpoint and write idempotently.
- Hidden filters: “status=active” may change mid-sync, or the endpoint may have default filters that differ from what you expect.
- Rate limits and time windows: you fetch slower than records change. If you also rely on “last updated” timestamps without guardrails, you can create gaps.
The fix is to make ordering, state, and boundaries explicit, then treat pagination as a resumable process rather than a single request loop.
Choose the right pagination method
APIs usually offer one (or more) of these patterns. Your job is to pick the method that supports a stable walk through data and makes recovery simple.
Cursor-based pagination (preferred)
Cursor pagination returns a token (cursor) that represents “where to continue.” This is the most reliable choice because it is usually tied to a stable ordering on the server side. You request the next page with the cursor, not an offset. If the API provides cursors, use them.
- Pros: stable, resilient to inserts, easy to resume, often faster at scale.
- Cons: cursors can expire; you may need to store additional context to resume after long pauses.
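The shape of a cursor walk is the same regardless of API. A minimal sketch, assuming a hypothetical paged response of the form `{"items": [...], "next_cursor": "..." or None}` (an in-memory list stands in for the HTTP call; all names here are illustrative):

```python
# Minimal cursor-walk sketch. fetch_page is a stand-in for an HTTP call
# such as GET /items?cursor=...; here it slices an in-memory list.

def fetch_page(api, cursor=None, limit=2):
    """Return one page and an opaque continuation token (or None at the end)."""
    start = int(cursor) if cursor else 0
    items = api[start:start + limit]
    next_cursor = str(start + limit) if start + limit < len(api) else None
    return {"items": items, "next_cursor": next_cursor}

def walk(api):
    seen, cursor = [], None
    while True:
        page = fetch_page(api, cursor)
        seen.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:  # the server, not the client, decides when we are done
            break
    return seen

records = walk(["a", "b", "c", "d", "e"])
```

The key property: the client never computes "where to continue" itself; it only hands back the token the server issued, which is what makes the walk stable under concurrent inserts.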
Keyset pagination (great if you can sort)
Keyset pagination is “give me the next page where id > last_id” (or where updated_at > last_updated with a tie-breaker). It is cursor-like, but you build the cursor from fields you control. This works well if the API lets you filter and sort by a monotonic key.
- Pros: checkpoint is human-readable, resume is straightforward, no cursor expiration.
- Cons: requires strict ordering guarantees and a tie-breaker when timestamps match.
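In Python, tuple comparison gives you the tie-breaker for free. A sketch of the keyset walk, assuming rows with `updated_at` and `id` fields (an in-memory list plays the API; a real request would translate the comparison into `updated_at > X OR (updated_at = X AND id > Y)`):

```python
# Keyset sketch: resume strictly after the last-seen (updated_at, id) pair.
ROWS = sorted(
    [{"id": 3, "updated_at": 10}, {"id": 1, "updated_at": 10},
     {"id": 2, "updated_at": 12}, {"id": 4, "updated_at": 15}],
    key=lambda r: (r["updated_at"], r["id"]),
)

def next_page(last_key, limit=2):
    """Return rows strictly after last_key = (updated_at, id), in key order."""
    matching = [r for r in ROWS if (r["updated_at"], r["id"]) > last_key]
    return matching[:limit]

def keyset_walk():
    out, key = [], (float("-inf"), float("-inf"))
    while True:
        page = next_page(key)
        if not page:
            break
        out.extend(page)
        last = page[-1]
        key = (last["updated_at"], last["id"])  # human-readable checkpoint
    return out

ids = [r["id"] for r in keyset_walk()]
```

Note that two rows share `updated_at = 10`; without the `id` tie-breaker the checkpoint would be ambiguous at exactly that point.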
Offset pagination (use only with extra caution)
Offset pagination is “page=5” or “offset=200&limit=50.” It can be fine for small, static datasets, but it is brittle for live systems where rows are added or removed during the run.
- Pros: simplest to understand, easy to implement.
- Cons: drift and gaps during concurrent updates, slow on large datasets.
If your API only supports offset, you can still make it work by adding a stable sort (if supported), limiting the time window, and using reconciliation checks after the run.
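The drift failure mode is easy to reproduce in a few lines. This toy sketch shows a row inserted mid-read shifting the page boundary so one record comes back twice (and the new row is missed entirely):

```python
# Demonstration of offset drift: a concurrent insert shifts page
# boundaries, so "r2" is returned on both page 1 and page 2.

def offset_page(rows, offset, limit):
    return rows[offset:offset + limit]

rows = ["r1", "r2", "r3", "r4"]
first = offset_page(rows, 0, 2)    # ["r1", "r2"]
rows.insert(0, "r0")               # another writer inserts at the front
second = offset_page(rows, 2, 2)   # ["r2", "r3"] -- r2 repeats, r0 is skipped

duplicated = set(first) & set(second)
```

Idempotent upserts absorb the duplicate, but nothing in the loop itself notices that `r0` was never fetched, which is why reconciliation checks matter for offset-only APIs.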
Design a sync loop with state and safety
A reliable pagination job has two layers: a reader that walks pages, and a writer that upserts records in an idempotent way. The reader should be resumable, and the writer should tolerate repeats.
A minimal state model
At minimum, store a checkpoint that lets you resume the read. Depending on the pagination type, that might be a cursor token, a last-seen ID, or a compound “(updated_at, id)” pair. Store it somewhere durable, not only in memory.
- Checkpoint: “continue from here” marker.
- Run boundary: a cap that prevents chasing a moving target forever, such as a snapshot timestamp.
- Watermark: the last fully processed “updated_at” (plus tie-breaker) for incremental syncs.
A useful mental model is: “I want to read everything up to a known boundary, in a stable order, and I want to be able to restart from my last confirmed position.”
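One way to model that state is a small record with the three pieces named above; the field names and the dict standing in for durable storage are assumptions for illustration:

```python
# Sketch of a sync-state record; in practice this would live in a
# database row or key-value store, not an in-memory dict.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SyncState:
    cursor: Optional[str] = None           # opaque checkpoint token, if the API issues one
    last_updated_at: Optional[str] = None  # watermark for incremental syncs
    last_id: Optional[str] = None          # tie-breaker alongside the watermark
    run_boundary: Optional[str] = None     # snapshot cap for this run

def save_checkpoint(state: SyncState, store: dict) -> None:
    """Persist the state durably; a dict stands in for real storage."""
    store["sync_state"] = state

store = {}
save_checkpoint(
    SyncState(last_updated_at="2024-01-01T00:00:00Z", last_id="inv_42"), store
)
```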
Pseudo-structure for a resilient paginator
The following is intentionally conceptual. It shows the moving parts you should have, even if your stack and API differ:
```
run_boundary = capture_boundary()   # e.g., now() or an API-provided snapshot marker
checkpoint   = load_checkpoint()    # cursor token or (updated_at, id)

while true:
    page = fetch_page(checkpoint, run_boundary)
    if page.items is empty:
        break
    for item in page.items:
        upsert_item_idempotently(item)   # safe if repeated
    checkpoint = page.next_checkpoint
    save_checkpoint(checkpoint)

record_run_metrics()
reconcile_if_needed()
```
Two details matter more than anything else:
- Upserts must be idempotent: if the same record arrives twice, the second write should be harmless. Use a stable external ID and update-by-ID semantics.
- Checkpoints must be saved after durable writes: do not advance the checkpoint before you know the data is committed.
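If your store is a SQL database, an upsert keyed on the external ID gives you the idempotency for free. A sketch using SQLite's `INSERT ... ON CONFLICT` (table and column names are illustrative; Postgres has the same construct):

```python
# Idempotent upsert sketch: replaying the same record leaves exactly one row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

def upsert_invoice(conn, inv):
    conn.execute(
        """INSERT INTO invoices (id, status, updated_at)
           VALUES (:id, :status, :updated_at)
           ON CONFLICT(id) DO UPDATE SET
             status = excluded.status,
             updated_at = excluded.updated_at""",
        inv,
    )

record = {"id": "inv_1", "status": "paid", "updated_at": "2024-01-02"}
upsert_invoice(conn, record)
upsert_invoice(conn, record)  # a retry replays the same record harmlessly
count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

With writes shaped like this, "save checkpoint after commit" is the only ordering constraint left: a crash between the commit and the checkpoint save just replays a page you can already absorb.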
Finally, instrument the loop. Count items fetched, items written, pages processed, and how many retries occurred. These are your early warning signals.
Example: syncing invoices from a billing API
Imagine a small SaaS business that needs nightly invoice syncs into its internal database for customer support and reporting. The billing API provides invoices with fields like id, updated_at, and status. The dataset is large enough that the API requires pagination.
A practical plan is to do an incremental sync using keyset pagination on updated_at with a tie-breaker on id:
- Ordering: sort by `updated_at` ascending, then by `id` ascending.
- Checkpoint: store `last_updated_at` and `last_id` (the last record you successfully wrote).
- Boundary: capture a `run_boundary_time` at job start, then only request invoices with `updated_at <= run_boundary_time`.
Why the boundary? Without it, you can get stuck chasing fresh updates. A customer pays, an invoice updates, it moves to the end, and the job keeps extending. With a boundary, you finish the snapshot cleanly and pick up new changes in the next run.
For writes, you upsert by invoice id. If the job crashes after writing 3,000 invoices, it restarts from the stored checkpoint and repeats the tail. Because your upsert is idempotent, repeats are safe, and you do not need fragile “did I already import this page” bookkeeping.
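Putting the pieces together, the invoice sync can be sketched end to end. An in-memory list stands in for the billing API, and all field and function names here are assumptions, not a real billing SDK:

```python
# End-to-end sketch of the nightly invoice sync: keyset pagination on
# (updated_at, id), a run boundary, and upsert-by-id writes.

INVOICES = [  # pretend server-side data, already in (updated_at, id) order
    {"id": "a", "updated_at": 1}, {"id": "b", "updated_at": 2},
    {"id": "c", "updated_at": 2}, {"id": "d", "updated_at": 9},
]

def fetch_invoices(after_key, boundary, limit=2):
    """Stand-in for the API: rows strictly after the checkpoint, capped at the boundary."""
    rows = [r for r in INVOICES
            if (r["updated_at"], r["id"]) > after_key and r["updated_at"] <= boundary]
    return rows[:limit]

def sync(db, checkpoint, boundary):
    while True:
        page = fetch_invoices(checkpoint, boundary)
        if not page:
            return checkpoint
        for inv in page:
            db[inv["id"]] = inv  # idempotent upsert by invoice id
        last = page[-1]
        checkpoint = (last["updated_at"], last["id"])  # advance only after the writes

db = {}
final = sync(db, checkpoint=(0, ""), boundary=5)
# invoice "d" (updated_at=9) is past the boundary and waits for the next run
```

Re-running `sync` from any earlier checkpoint repeats some upserts but produces the same `db`, which is exactly the restart behavior described above.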
Common mistakes (and safer alternatives)
- Mistake: paginating without an explicit sort.
  Safer alternative: always request a documented sort order, ideally by a stable key or `(updated_at, id)`.
- Mistake: using `updated_at` alone as a checkpoint.
  Safer alternative: store a tie-breaker (like ID) so you can resume deterministically when many records share the same timestamp.
- Mistake: saving the checkpoint before database commits.
  Safer alternative: only advance the checkpoint after writes succeed, and ensure the state save is also durable.
- Mistake: assuming retries only happen on network errors.
  Safer alternative: design for retries at every layer, including timeouts, partial failures, and restarts. Make writes idempotent.
- Mistake: ignoring deletion handling.
  Safer alternative: if the API supports "deleted" events or tombstones, ingest them. Otherwise, plan periodic reconciliation to detect missing rows.
These mistakes usually appear when pagination logic is built as a quick loop inside a bigger job. Treating pagination as a first-class component is what keeps “quick” from turning into “fragile.”
When not to do this
Not every integration needs a sophisticated paginator. Consider simpler options when they fit:
- Small, bounded datasets: if the API supports fetching all items reliably and the total is always small, one request may be enough.
- Event-driven sources: if the system offers webhooks or a change feed, an incremental event stream can be simpler than repeated full reads.
- One-off migrations: for a single controlled import, you can sometimes accept a manual reconciliation step instead of building full resume logic.
The rule of thumb: if the data is business-critical or the job will run repeatedly, invest in checkpoints and idempotency. If it is a one-time, low-risk copy, keep it simple and validate manually.
A copyable checklist for your next sync
Use this as a build and review checklist before you ship a paginated sync job:
- Pagination type chosen: cursor or keyset preferred; offset only with a stable sort and reconciliation.
- Stable ordering: the API request includes an explicit sort order.
- Checkpoint persisted: cursor token or (field, tie-breaker) saved in durable storage.
- Run boundary defined: a snapshot time or marker prevents chasing live updates forever.
- Idempotent writes: upsert by external ID; repeated pages do not create duplicates.
- Retry policy: backoff and max retries, with clear logging for failures.
- Rate limit handling: honor “retry-after” or implement pacing to avoid bans.
- Metrics captured: pages, items fetched, items written, duration, retry counts.
- Reconciliation plan: periodic full scan, count checks, or spot checks for missing data.
Key takeaways
- Prefer cursor or keyset pagination with an explicit, stable sort order.
- Make the job resumable with durable checkpoints saved only after commits.
- Use a run boundary (snapshot marker) so the job finishes cleanly.
- Assume retries and restarts will happen, then make repeats harmless via idempotent upserts.
- Instrument and reconcile, because “no errors” does not always mean “no gaps.”
Conclusion
Reliable pagination is less about clever looping and more about explicit guarantees: stable ordering, durable state, safe retries, and clear boundaries. If you adopt those four ideas, your sync jobs become easier to operate and much harder to break as data grows.
The next time you add a paginated integration, treat the paginator as a small subsystem with its own invariants and metrics. Your future self, and anyone on call, will notice the difference.
FAQ
How do I handle records that change while I am paging?
Use a run boundary so you read a consistent snapshot window, and use incremental checkpoints like (updated_at, id). Changes after the boundary are picked up on the next run, which prevents gaps and endless chasing.
What if the API does not let me sort or filter?
If the API is “pagination only” with no stable sort guarantees, you should assume duplicates and gaps are possible. In that case, lean harder on idempotent upserts and add reconciliation, such as periodic full compares or record counts by time bucket.
Do I need to store every page response for recovery?
Usually no. Store the minimal checkpoint needed to resume, plus enough logs and metrics to diagnose issues. Saving raw pages can help during debugging, but it increases storage and can create sensitive data handling problems.
How can I tell if my pagination is missing data?
Build at least one independent validation signal: compare counts across runs, reconcile a sample of IDs, or periodically run a slower “audit” sync that scans a known time window and checks for gaps.
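A count comparison by time bucket is often the cheapest such signal. A toy sketch, assuming records carry an ISO-8601 `updated_at` field (the bucket granularity and field names are illustrative):

```python
# Toy reconciliation signal: bucket counts by day on both sides and flag
# source buckets whose counts differ in our copy.
from collections import Counter

def counts_by_day(records):
    return Counter(r["updated_at"][:10] for r in records)  # "YYYY-MM-DD" prefix

source = [{"id": 1, "updated_at": "2024-05-01T10:00:00Z"},
          {"id": 2, "updated_at": "2024-05-01T11:00:00Z"},
          {"id": 3, "updated_at": "2024-05-02T09:00:00Z"}]
mirror = source[:2]  # our copy is silently missing one record

src_counts, mir_counts = counts_by_day(source), counts_by_day(mirror)
gaps = {day for day, n in src_counts.items() if mir_counts.get(day, 0) != n}
```

A gap in a bucket tells you where to point a targeted re-sync, without rescanning the whole dataset.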