“Refactor vs rewrite” is rarely a purely technical choice. For a small team, it is a bet on time, risk, and focus. A rewrite can feel like a clean slate, while refactoring can feel like slow progress through legacy constraints.
The real cost is not measured in lines of code. It is measured in missed product opportunities, reliability incidents, team burnout, and the length of time you carry two systems in your head.
This post offers a practical framework you can run in a meeting: clarify what problem you are solving, score the options, and choose a plan that protects delivery while improving the codebase.
What you are really deciding
Most teams frame the question as “Is the code bad enough to rewrite?” A better frame is: “What outcome do we need, and which path gets us there with controlled risk?” The outcomes typically fall into a few buckets:
- Change speed: features take too long because the code is hard to understand or safely modify.
- Reliability: incidents repeat because behavior is unpredictable or hard to observe.
- Cost: infrastructure or vendor constraints make the current approach expensive.
- Capability: a new product direction requires patterns the current architecture resists.
- People: onboarding is slow, and only one person feels safe touching critical parts.
Once you name the outcome, you can evaluate options beyond two extremes. In practice you often choose one of these:
- Targeted refactor: restructure a few modules, keep interfaces stable, ship continuously.
- Replatform slice: rebuild a bounded capability behind the same API or UI surface.
- Full rewrite: rebuild most of the system and migrate users and data.
- Stabilize only: invest in tests, monitoring, and documentation, with minimal design change.
A practical scoring rubric
Scoring rubrics are helpful because they force the team to write down assumptions. The goal is not a perfect number; it is a shared decision record.
Step 1: Define a “decision unit”
A rewrite is only sensible if the boundary is clear. Decide what you are choosing about:
- A single service or module?
- An internal admin tool?
- The customer-facing app?
If you cannot name the boundary, start with discovery. Ambiguous scope is a leading indicator of runaway rewrite projects.
Step 2: Score both paths on six factors
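Use a simple 1 to 5 scale where 1 is low and 5 is high. Score incremental refactor and rewrite separately. A high score is not always good news: a 5 on safety net is reassuring, while a 5 on unknowns or parallel run cost signals more risk.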
Scorecard (1-5 each):
1) Business urgency (need value soon)
2) Scope clarity (boundaries and requirements)
3) Unknowns (data, edge cases, integrations)
4) Safety net (tests, observability, rollback)
5) Parallel run cost (can you afford two systems?)
6) Talent continuity (will key maintainers stay available?)
Interpretation hint:
- Refactor wins when urgency and unknowns are high and the safety net is limited.
- Rewrite wins when scope is clear, unknowns are low, and parallel run cost is manageable.
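If it helps to keep the decision record in a repository, the scorecard and the interpretation hint can be sketched in a few lines of Python. This is one possible reading, not a formula from the rubric itself: the thresholds (`HIGH`, `LOW`, `MANAGEABLE`), the field names, and the `lean` function are all assumptions made for illustration. The sample values are the scores from the checkout example in this post.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scorecard:
    """One score per factor, each on the 1 to 5 scale."""
    business_urgency: int
    scope_clarity: int
    unknowns: int
    safety_net: int
    parallel_run_cost: int
    talent_continuity: int  # recorded for the decision log; the hint does not use it

    def __post_init__(self) -> None:
        # Reject scores outside the 1 to 5 scale early.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")


# Assumed thresholds; tune them to how your team uses the scale.
HIGH = 4        # 4 or 5 counts as "high"
LOW = 2         # 1 or 2 counts as "low" or "limited"
MANAGEABLE = 3  # 3 or below counts as a "manageable" parallel run cost


def lean(card: Scorecard) -> str:
    """Return which way the interpretation hint leans for this scorecard."""
    favors_refactor = (
        card.business_urgency >= HIGH
        and card.unknowns >= HIGH
        and card.safety_net <= LOW
    )
    favors_rewrite = (
        card.scope_clarity >= HIGH
        and card.unknowns <= LOW
        and card.parallel_run_cost <= MANAGEABLE
    )
    if favors_refactor:
        return "refactor"
    if favors_rewrite:
        return "rewrite"
    return "discuss"  # no clear signal: revisit the assumptions as a team


checkout = Scorecard(
    business_urgency=5,
    scope_clarity=3,
    unknowns=4,
    safety_net=2,
    parallel_run_cost=4,
    talent_continuity=3,
)
print(lean(checkout))  # prints "refactor"
```

The output is a conversation starter, not a verdict. A "discuss" result usually means the team disagrees on a score, and that disagreement is the thing worth writing down.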
Then ask two tie-breaker questions:
- Can we ship customer-visible improvements every 1 to 2 weeks? If not, risk increases sharply.
- Can we migrate incrementally? If you can move one workflow, one tenant, or one endpoint at a time, many “rewrites” become controlled slices.
Key takeaways
- Choose an outcome (speed, reliability, cost, capability), not a vibe about “old code.”
- Small teams should default to incremental moves unless scope is clear and unknowns are low.
- Measure the cost of running two systems. It is often the real constraint.
- Invest in a safety net (tests, logging, rollback) before any major restructure.
The incremental refactor plan (that still ships)
Incremental refactoring is not “do nothing.” Done well, it is a sequence of small structural changes that reduce risk while the product continues to move. The pattern is: stabilize, isolate, reshape, and then accelerate.
A copyable checklist for a 4 to 8 week push
- Write the problem statement: one paragraph that explains the pain and the desired outcome (for example, “Reduce checkout incident rate and cut change lead time from 5 days to 2 days”).
- Choose one north-star metric: deployment frequency, lead time, incident rate, or support volume. Pick one to prevent fuzzy success criteria.
- Map the boundary: list inputs, outputs, data stores, and external integrations. Keep it to one page.
- Build the safety net: add a few high-value tests around the boundary plus basic logging and dashboards. You do not need 100 percent coverage; you need confidence in core paths.
- Introduce an adapter layer: create stable interfaces around the messy core so you can change internals without changing callers.
- Refactor in thin slices: replace one component at a time behind the adapter. Ship after each slice.
- Use a rollback plan: decide how to revert quickly if a slice misbehaves (config toggle, routing switch, or deployment rollback).
- Close the loop: compare your north-star metric before and after, and write a short decision log for future maintainers.
This plan works because it turns “legacy modernization” into a series of safe, reviewable changes. It also creates optionality. If priorities change, you can stop after a few slices and still keep the improvements you shipped.
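The adapter, thin-slice, and rollback steps from the checklist can be sketched together. The example below is a minimal illustration under stated assumptions: `ShippingAdapter`, `legacy_shipping`, `new_shipping`, and the pricing formulas are hypothetical names and numbers, and the "config toggle" is modeled as a plain boolean attribute rather than a real feature-flag system.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("checkout.shipping")

Order = Dict[str, int]
ShippingFn = Callable[[Order], int]  # returns a cost in cents


def legacy_shipping(order: Order) -> int:
    """Stand-in for the existing, messy implementation (hypothetical formula)."""
    return 500 + 100 * order["items"]


def new_shipping(order: Order) -> int:
    """Stand-in for the refactored slice (hypothetical formula)."""
    return 400 + 150 * order["weight_kg"]  # raises KeyError if weight is missing


class ShippingAdapter:
    """Stable interface for callers; the internals can change behind it."""

    def __init__(
        self,
        legacy: ShippingFn,
        replacement: Optional[ShippingFn] = None,
        use_replacement: bool = False,
    ) -> None:
        self._legacy = legacy
        self._replacement = replacement
        self.use_replacement = use_replacement  # the rollback toggle

    def cost_cents(self, order: Order) -> int:
        if self.use_replacement and self._replacement is not None:
            try:
                return self._replacement(order)
            except Exception:
                # Log the failure, then fall back so the customer is not blocked.
                logger.exception("replacement shipping failed; using legacy path")
        return self._legacy(order)


adapter = ShippingAdapter(legacy_shipping, new_shipping)
order = {"items": 3, "weight_kg": 2}
print(adapter.cost_cents(order))  # legacy path: 800
adapter.use_replacement = True    # ship the slice
print(adapter.cost_cents(order))  # new path: 700
```

Callers only ever see `cost_cents`, so flipping `use_replacement` back to `False` is the rollback. Catching every exception is a deliberately blunt choice for a sketch; in a real system you may prefer to fall back only on specific errors so that genuine bugs in the new path stay visible.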
Common mistakes (and what to do instead)
- Mistake: Starting with a new architecture diagram.
Instead: start with the boundary map and the top five failure modes. Architecture should follow constraints you can name.
- Mistake: Rewriting because onboarding is painful.
Instead: document the “golden paths,” add runbooks, and reduce tribal knowledge with tests around critical flows. Onboarding pain is often an observability and documentation issue.
- Mistake: Treating refactoring as a separate track that competes with features.
Instead: bundle refactor slices with feature work in the same area. If you are touching checkout for a feature, improve the checkout boundary while you are there.
- Mistake: Building a perfect test suite before moving anything.
Instead: add targeted tests for core flows and failure cases, then iterate. The best tests are the ones that protect the next risky change.
- Mistake: Ignoring data migration until late.
Instead: treat data as part of the boundary from day one. If rewriting means new schemas, prototype migration early and plan how you will validate correctness.
When not to do this
Both refactoring and rewriting can be the wrong move. Here are cases where the highest leverage is elsewhere:
- Product direction is unclear: if requirements will change weekly, a rewrite locks you into guesses. Focus on learning and keep changes local.
- Incidents are caused by operations, not code structure: if outages come from capacity, missing alerts, or manual deploy steps, fix the operational basics first.
- The system is stable and low-change: if a module rarely changes and works, rewriting it is often pure risk. Document it and move on.
- You cannot afford parallel run: if you cannot run and support two systems, avoid big-bang migrations. Choose a slice you can replace without dual ownership.
The point is not to avoid improvement. It is to avoid spending months “modernizing” while the real bottleneck stays in place.
A concrete example: the checkout service dilemma
Consider a small e-commerce team with a monolithic app. Checkout changes take a week, and production incidents cluster around promotions and shipping rules. The team proposes rewriting checkout as a new service.
They run the scorecard:
- Business urgency: 5. Revenue-impacting issues need improvement quickly.
- Scope clarity: 3. Checkout touches tax, shipping, inventory, fraud checks, and email receipts.
- Unknowns: 4. Many edge cases exist for international orders and discount stacking.
- Safety net: 2. Few tests exist and logging is inconsistent.
- Parallel run cost: 4. Supporting two checkouts would complicate customer support and accounting.
- Talent continuity: 3. The person who knows the promo engine might rotate to another project.
Rewrite looks risky: urgency and unknowns are high, and the safety net is weak. The team chooses an incremental plan:
- They add “contract tests” around the checkout API responses and key calculations (totals, taxes, shipping cost).
- They introduce an internal adapter that centralizes pricing logic calls, so callers stop reaching into internal state.
- They extract promotion evaluation into a single module with clear inputs and outputs, and add logging for rule decisions.
- They migrate one slice: shipping calculation becomes a replaceable component with a fallback to the old path.
After several weeks, checkout changes are smaller and safer. Incident rate drops because logs reveal which promotion rules misfire. Only then do they consider rebuilding a portion of checkout logic as a separate service, but now they can do it slice by slice, with validation and rollback.
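To make the first step of that plan concrete, here is a rough sketch of what a contract test around checkout totals could look like. Everything in it is assumed for illustration: the `checkout_total` signature, the integer-cent arithmetic, and the three recorded cases are hypothetical, not taken from a real system. The idea is that the same cases run against the old implementation and against any replacement slice.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    """A recorded input and the total callers currently rely on."""
    name: str
    subtotal_cents: int
    tax_rate_percent: int
    shipping_cents: int
    expected_total_cents: int


TotalFn = Callable[[int, int, int], int]


def checkout_total(subtotal_cents: int, tax_rate_percent: int, shipping_cents: int) -> int:
    """Stand-in for the current checkout calculation (hypothetical)."""
    tax_cents = subtotal_cents * tax_rate_percent // 100
    return subtotal_cents + tax_cents + shipping_cents


# Hypothetical cases; in practice, record these from real orders.
CASES: List[Case] = [
    Case("no tax", 1000, 0, 500, 1500),
    Case("10 percent tax", 2000, 10, 500, 2700),
    Case("free shipping", 5000, 8, 0, 5400),
]


def contract_failures(impl: TotalFn, cases: List[Case]) -> List[str]:
    """Run every case against an implementation and describe any mismatches."""
    failures = []
    for case in cases:
        actual = impl(case.subtotal_cents, case.tax_rate_percent, case.shipping_cents)
        if actual != case.expected_total_cents:
            failures.append(
                f"{case.name}: expected {case.expected_total_cents}, got {actual}"
            )
    return failures


# The current implementation should satisfy its own contract.
assert contract_failures(checkout_total, CASES) == []
```

Before a replacement slice takes traffic, pass it to `contract_failures` with the same cases. An empty list means the new code agrees with the old behavior on every recorded case; it says nothing about cases you have not recorded, which is why the edge cases for international orders and discount stacking are worth adding early.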
Conclusion
A rewrite is not a moral victory, and refactoring is not settling. The best choice is the one that delivers your outcome with the least uncontrolled risk. For most small teams, that means: make boundaries explicit, build a safety net, ship in slices, and keep the option to stop or pivot.
If you want one simple rule: only choose a full rewrite when scope is genuinely clear, unknowns are low, and you can afford parallel run and migration work without starving the product.
FAQ
How do we know if our scope is “clear enough” for a rewrite?
Scope is clear when you can list the system’s inputs, outputs, data, and integrations on one page, and you can name the top edge cases. If “we will discover that later” applies to core flows, prefer incremental work until the unknowns shrink.
Is a rewrite ever faster than refactoring?
Yes, when the old system is small, tightly bounded, and well understood, and the migration is straightforward. In practice, rewrites are fastest when you can replace one unit cleanly and verify correctness with real data and contract tests.
What if the team is blocked by a single “haunted” module?
Start by isolating it. Put a stable interface in front of it, add targeted tests around that interface, and reduce the number of callers. Once it is contained, you can refactor or replace the internals with far less blast radius.
How much testing is “enough” before making major changes?
Enough means you can detect and roll back the most expensive failures. Cover the golden path, the top few edge cases, and any calculations that impact customers. Add observability so you can see behavior changes quickly after release.