
Migrating a .NET Payments Platform to AWS with Zero Downtime

"Zero downtime" is the most over-claimed phrase in cloud migration. Here's what it actually means for a payments platform, the five categories of risk most teams skip, and the rollback gate that saved us at 3 AM.

I've watched three teams in two companies claim "zero-downtime migration" and all three quietly dropped transactions during cutover. Not many — single digits per minute for a few minutes — but in payments, even one dropped authorization is a customer who blames you, a chargeback risk, and a reconciliation headache that follows you for weeks.

The phrase has become marketing. This is what it actually requires, drawn from leading the migration of a mission-critical .NET payments platform from on-prem to AWS.

1. Define "zero" before you claim it

"Zero downtime" needs a measurable definition before the cutover plan can be written. Pick one of the following and write it down. If you can't, you're not ready to migrate.

  • Availability zero — no HTTP 5xx returned to clients during the cutover window. Achievable.
  • Transaction zero — no payment authorization fails or is duplicated due to the cutover itself. Hard. Requires idempotency end-to-end.
  • Reconciliation zero — by end of cutover day, every transaction is in exactly one ledger, with the same status everywhere. Hardest. Most teams skip this and pay the price the next week.

Reconciliation deserves the most attention. If your post-cutover ledger has a transaction that exists in the old system but not the new (or vice versa, or with a different status), you've migrated the application but corrupted the data. Finance counts that as downtime even if every customer saw an instant 200 OK.
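A reconciliation check can be sketched in a few lines. The example below is illustrative only: `LedgerEntry`, `reconcile`, and the status strings are hypothetical names, not taken from the platform described here, and it assumes both ledgers can be exported as lists keyed by a shared transaction ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    txn_id: str
    status: str        # hypothetical values: "AUTHORIZED", "CAPTURED", ...
    amount_minor: int  # amount in minor units (cents) to avoid float drift


def reconcile(old_entries, new_entries):
    """Return the transactions that break reconciliation zero."""
    old = {e.txn_id: e for e in old_entries}
    new = {e.txn_id: e for e in new_entries}
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "missing_in_old": sorted(new.keys() - old.keys()),
        # Present on both sides, but status or amount disagrees.
        "mismatched": sorted(
            t for t in old.keys() & new.keys() if old[t] != new[t]
        ),
    }


old_ledger = [
    LedgerEntry("t1", "CAPTURED", 1000),
    LedgerEntry("t2", "AUTHORIZED", 2500),
    LedgerEntry("t3", "CAPTURED", 400),
]
new_ledger = [
    LedgerEntry("t1", "CAPTURED", 1000),
    LedgerEntry("t2", "CAPTURED", 2500),   # status drifted during cutover
    LedgerEntry("t4", "AUTHORIZED", 900),  # only the new system saw this one
]

report = reconcile(old_ledger, new_ledger)
print(report)
```

Three empty buckets is what reconciliation zero looks like. The sketch does not detect duplicates inside a single ledger; that would need a separate count per transaction ID.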

If "zero downtime" isn't defined as availability zero, transaction zero, or reconciliation zero, it's not a goal. It's a vibe.

2. Map the five risk categories

Most cutover failures come from one of five risk categories. Map each one explicitly before writing the cutover playbook.

2.1 Stateful coupling

Any process that holds state in memory — session caches, in-process queues, retry queues that never made it to disk — will lose that state when you cut over. For each component, ask: "What is held in process memory right now, and what happens to it if this process dies in 60 seconds?" If the answer is unclear, fix it before cutover.
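One way to make the "what if this process dies in 60 seconds?" question answerable is to persist state before accepting it. The sketch below is a minimal illustration, not the platform's actual design: `DurableRetryQueue` is a hypothetical name, and a local JSON-lines file stands in for whatever durable store both the old and new systems could reach.

```python
import json
import os
import tempfile


class DurableRetryQueue:
    """Retry queue that writes each item to disk before accepting it."""

    def __init__(self, path):
        self.path = path
        self.pending = []
        # On startup, recover whatever a previous process left behind.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.pending = [json.loads(line) for line in f if line.strip()]

    def enqueue(self, item):
        # Append and fsync first, so a crash after this call loses nothing.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.pending.append(item)


# Simulate a process dying and a replacement starting on the same journal.
journal = os.path.join(tempfile.mkdtemp(), "retries.jsonl")
first = DurableRetryQueue(journal)
first.enqueue({"txn_id": "t1", "attempt": 2})
first.enqueue({"txn_id": "t2", "attempt": 1})
del first  # the in-memory list is gone

recovered = DurableRetryQueue(journal)
print(recovered.pending)
```

Dequeue and journal compaction are left out to keep the sketch short; a real queue needs both.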

2.2 Shared resources

Database, cache, message broker, file storage. If you're keeping any shared resource (often the database, during the first cutover phase), you must verify that the new system reads from and writes to it in a way that is byte-identical to the old system. Locale settings, timezone handling, decimal precision, a BOM in encoded fields: these are the silent killers.
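To show how small these differences are, the sketch below compares the bytes two writers would produce for the same logical value. The two scenarios (different rounding modes, BOM versus no BOM) are assumptions chosen for illustration, and `first_difference` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def first_difference(a: bytes, b: bytes):
    """Return the index of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One value is a prefix of the other: they differ where the shorter ends.
    return None if len(a) == len(b) else min(len(a), len(b))


CENT = Decimal("0.01")

# Assumption: the old writer rounds half up, the new one rounds half to even.
amount = Decimal("10.125")
old_amount = str(amount.quantize(CENT, rounding=ROUND_HALF_UP)).encode("ascii")
new_amount = str(amount.quantize(CENT, rounding=ROUND_HALF_EVEN)).encode("ascii")

# Assumption: the old writer emits a UTF-8 BOM, the new one does not.
old_name = "José".encode("utf-8-sig")
new_name = "José".encode("utf-8")

print(old_amount, new_amount, first_difference(old_amount, new_amount))
print(old_name, new_name, first_difference(old_name, new_name))
```

Both pairs represent "the same" value to a human reader, and both fail a byte comparison, which is the level at which a shared store has to agree.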

2.3 Downstream contracts

Card networks, banks, processors (TSYS, Fiserv, Stripe), KYC providers, fraud services. Every integration has a contract — explicit or assumed — about latency, idempotency, retries, and error handling. The new system must honor every assumed contract plus survive the increased latency from running in AWS vs. on-prem during the transition.
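The idempotency part of a downstream contract can be illustrated with a small wrapper that replays a stored result when the same key arrives twice. This is a sketch under stated assumptions: `IdempotentAuthorizer`, the key format, and `fake_processor` are all hypothetical, and the dict stands in for a durable store shared by both systems.

```python
class IdempotentAuthorizer:
    """Replays the stored result when the same key is seen again."""

    def __init__(self, send_to_processor):
        self._send = send_to_processor
        # Assumption: a dict stands in for a durable, shared store.
        self._results = {}

    def authorize(self, idempotency_key, request):
        if idempotency_key in self._results:
            # A retry: return the original outcome, do not call downstream.
            return self._results[idempotency_key]
        result = self._send(request)
        self._results[idempotency_key] = result
        return result


calls = []


def fake_processor(request):
    # Stand-in for a real processor call; records how often it is hit.
    calls.append(request)
    return {"status": "APPROVED", "amount_minor": request["amount_minor"]}


auth = IdempotentAuthorizer(fake_processor)
first = auth.authorize("order-42-attempt-1", {"amount_minor": 1999})
retry = auth.authorize("order-42-attempt-1", {"amount_minor": 1999})
print(first == retry, len(calls))
```

The sketch is single-threaded. Two concurrent requests with the same key would both reach the processor unless the check-and-store step is made atomic in the backing store.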

2.4 Cryptographic boundaries

Anywhere you encrypt, sign, or hash with a secret. PCI DSS, in particular, cares about HSM and key-management boundaries. Migrating the application without migrating the key custody story is a compliance incident waiting to happen. KMS keys must be provisioned, rotated, and proven equivalent before any production traffic.
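"Proven equivalent" can be made concrete with a known-answer test: sign the same vectors on both sides and require identical output. The sketch below uses Python's standard `hmac` module with in-memory keys purely for illustration. In a real deployment the two signers would be calls into the old HSM and the new KMS, and the keys would never appear in application code. `make_signer` and `signers_equivalent` are hypothetical names.

```python
import hashlib
import hmac


def make_signer(key: bytes):
    """Return a function that signs messages with HMAC-SHA256 under key."""
    def sign(message: bytes) -> str:
        return hmac.new(key, message, hashlib.sha256).hexdigest()
    return sign


def signers_equivalent(sign_old, sign_new, vectors) -> bool:
    """True only if both signers agree on every test vector."""
    return all(
        hmac.compare_digest(sign_old(m), sign_new(m)) for m in vectors
    )


# Illustrative test vectors and keys; none of these are real values.
vectors = [b"", b"auth|t1|1000", b"capture|t1|1000", b"\x00\xff" * 16]

old_hsm = make_signer(b"example-key-material")
new_kms_ok = make_signer(b"example-key-material")
new_kms_bad = make_signer(b"example-key-material\n")  # trailing newline crept in

print(signers_equivalent(old_hsm, new_kms_ok, vectors))
print(signers_equivalent(old_hsm, new_kms_bad, vectors))
```

The second case mirrors a common provisioning slip, a key imported with stray whitespace, which only a byte-level comparison of outputs will catch.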

2.5 Observability gap

The most common failure: the new system has different metrics, different log formats, different dashboards. During the 4-hour cutover window, the on-call engineer can't correlate "the spike at 03:14" between the old graph and the new. Build a unified dashboard before cutover, or you will be flying blind during the exact period you most need vision.

3. Pick the right strategy: strangler-fig, blue-green, or dual-write

These three are not interchangeable. They suit different risk profiles:

  • Strangler-fig — route X% of traffic to the new system, ramp slowly. Works when both systems can independently process traffic and reconcile asynchronously. Best for stateless, idempotent operations.
  • Blue-green — both systems run in parallel; cutover is an atomic switch. Works when downstream cannot tolerate any duplicate processing. Requires a clean handoff moment.
  • Dual-write — both systems receive every write. Works for the database-migration sub-problem within a broader strategy. Painful, but provides the safest correctness guarantee for the data layer.

For payments specifically, I favor strangler-fig at the API gateway for read traffic, blue-green for the authorization path, and dual-write for the ledger. The strategy isn't one-size — it varies per surface.
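For the percentage-based routing that strangler-fig relies on, one common approach is to hash a stable key into a bucket, so the same caller always lands on the same side and stays there as the percentage grows. This is a minimal sketch, assuming a merchant ID as the routing key; `route` and the key format are hypothetical.

```python
import hashlib


def route(key: str, percent_new: int) -> str:
    """Deterministically send percent_new percent of keys to the new system."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the key to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < percent_new else "old"


merchants = [f"merchant-{i}" for i in range(10_000)]
for percent in (1, 5, 25, 50, 100):
    on_new = sum(route(m, percent) == "new" for m in merchants)
    print(percent, on_new)
```

Because the bucket depends only on the key, a merchant routed to the new system at 5% is still there at 25%, which keeps each caller's experience consistent during the ramp.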

4. The cutover playbook (real one, not the wishful one)

Here's a structure that has survived three production payments-platform cutovers without losing a transaction:

  1. T-7 days: Dark traffic. Mirror 100% of production traffic to the new system. Discard responses. Compare outputs offline. Goal: surface diffs you didn't know existed.
  2. T-3 days: Shadow comparison. Same as above, but actively diff every response. A diff rate above the pre-agreed threshold aborts the migration.
  3. T-1 day: Synthetic traffic at scale. Synthetic load that mimics peak production through the new path end-to-end, including downstream calls in a test mode.
  4. T-0: Gradual ramp. 1% → 5% → 25% → 50% → 100% over a controlled window. Each step has explicit success criteria — error rate, latency p95, downstream-contract success — and an explicit abort condition.
  5. T+1 day: Soak. 100% traffic on the new system, old system kept warm. Reconcile every transaction. Verify ledger integrity.
  6. T+7 days: Old-system decommissioning. Not before. The old system is your rollback. Don't burn it the day after cutover, no matter how confident you feel.
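The comparison in steps 1 and 2 can be sketched as a field-level diff over paired responses, with fields that legitimately differ excluded. Everything here is illustrative: the response shape, the ignored field, the sample codes, and the 1% threshold are assumptions, not values from the migration described above.

```python
_MISSING = object()  # distinguishes "field absent" from "field is None"


def diff_fields(old: dict, new: dict, ignore=()):
    """Return the sorted field names whose values differ between responses."""
    keys = (old.keys() | new.keys()) - set(ignore)
    return sorted(
        k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)
    )


def diff_rate(pairs, ignore=()):
    """Fraction of (old, new) response pairs with at least one differing field."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    differing = sum(1 for old, new in pairs if diff_fields(old, new, ignore))
    return differing / len(pairs)


IGNORED = ("trace_id",)  # expected to differ between systems
THRESHOLD = 0.01         # hypothetical abort threshold

pairs = [
    ({"status": "APPROVED", "code": "00", "trace_id": "a1"},
     {"status": "APPROVED", "code": "00", "trace_id": "b7"}),
    ({"status": "DECLINED", "code": "05", "trace_id": "a2"},
     {"status": "DECLINED", "code": "51", "trace_id": "b8"}),
    ({"status": "APPROVED", "code": "00", "trace_id": "a3"},
     {"status": "APPROVED", "code": "00", "trace_id": "b9"}),
    ({"status": "APPROVED", "code": "00", "trace_id": "a4"},
     {"status": "APPROVED", "code": "00", "trace_id": "c0"}),
]

rate = diff_rate(pairs, IGNORED)
abort = rate > THRESHOLD
print(rate, abort, diff_fields(*pairs[1], ignore=IGNORED))
```

Keeping the ignore list explicit and short matters: every field added to it is a class of difference the shadow phase will no longer surface.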

5. The rollback gate

The single most important artifact in any cutover playbook is the rollback gate: a written, pre-agreed criterion under which you halt the ramp and route traffic back. Three rules:

  • The gate is defined before the cutover, not during. (3 AM is the worst time to decide what counts as "bad enough to roll back.")
  • The gate is measurable — a specific error rate, latency target, or downstream-failure count. Not "if it feels wrong."
  • The gate is owned by one person. Distributed accountability at 3 AM equals no accountability.

On our migration, the rollback gate fired at 03:14 — at the 50% ramp. A downstream processor was returning a soft error our new system handled differently. We rolled back in eleven minutes. Without the gate, we'd have escalated for an hour before deciding.
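The three rules translate naturally into data plus one function: thresholds fixed ahead of time, a named owner, and a check that returns which criteria were breached. The sketch below wires that into the ramp percentages from the playbook. The threshold values, the metric names, and `fake_metrics` are hypothetical and exist only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackGate:
    owner: str                    # the one person who makes the call
    max_error_rate: float         # fraction of requests
    max_p95_ms: float
    max_downstream_failures: int

    def breaches(self, error_rate, p95_ms, downstream_failures):
        """Return the breached criteria; an empty list means keep ramping."""
        reasons = []
        if error_rate > self.max_error_rate:
            reasons.append("error_rate")
        if p95_ms > self.max_p95_ms:
            reasons.append("p95_ms")
        if downstream_failures > self.max_downstream_failures:
            reasons.append("downstream_failures")
        return reasons


RAMP_STEPS = (1, 5, 25, 50, 100)  # percentages from the playbook


def run_ramp(gate, read_metrics, set_traffic_percent):
    """Walk the ramp; on any breach, route everything back and stop."""
    for percent in RAMP_STEPS:
        set_traffic_percent(percent)
        reasons = gate.breaches(**read_metrics(percent))
        if reasons:
            set_traffic_percent(0)  # all traffic back to the old system
            return percent, reasons
    return None, []


# Hypothetical thresholds and fake metrics, for illustration only.
gate = RollbackGate("on-call lead", 0.001, 250.0, 5)
history = []


def fake_metrics(percent):
    failures = 12 if percent >= 50 else 0
    return {"error_rate": 0.0002, "p95_ms": 180.0, "downstream_failures": failures}


halted_at, reasons = run_ramp(gate, fake_metrics, history.append)
print(halted_at, reasons, history)
```

A real ramp would hold each step for an observation window before reading metrics; the sketch checks once per step to keep the control flow visible.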

6. Common mistakes

  • Migrating during low traffic. Tempting, but you don't surface real problems until peak. Better to migrate at moderate traffic with the full team on call.
  • Skipping the soak. "We're at 100%, decommission the old system!" — no. The old system is your insurance. Pay for it for another week.
  • Underestimating cross-AZ latency. If the new system places a previously-co-located dependency in a different AZ, p95 latency can rise sharply with no warning. Profile latency end-to-end before cutover, not after.
  • Forgetting the batch jobs. Hourly settlement runs, nightly reconciliation, weekly reports — these have to be migrated too. Most teams remember the API; many forget the cron.
  • Treating "PCI DSS" as a checkbox. Compliance scope changes when infrastructure changes. The AWS environment has its own AOC requirements, key-management story, and network-segmentation rules. Re-scope your audit boundary before cutover, not during the next audit cycle.
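For the cross-AZ latency point above, a pre-cutover profile can be as simple as comparing p95 on the old and new paths against an agreed ratio. The sketch below uses Python's `statistics.quantiles`. The samples are synthetic, not measurements, and the 1.2 ratio and the function names are assumptions.

```python
import statistics


def p95(samples_ms):
    """95th percentile of the samples (needs at least two samples)."""
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def latency_regressed(old_ms, new_ms, max_ratio=1.2):
    """True if the new path's p95 exceeds the old p95 by more than max_ratio."""
    return p95(new_ms) > p95(old_ms) * max_ratio


# Synthetic samples, not measurements: the new path adds a fixed 1.5 ms hop.
old_samples = [2.0 + (i % 10) * 0.1 for i in range(200)]
new_samples = [s + 1.5 for s in old_samples]

print(p95(old_samples), p95(new_samples))
print(latency_regressed(old_samples, new_samples))
```

A fixed extra hop that looks negligible in isolation can still trip the ratio when the original call was very fast, which is why the comparison is relative rather than absolute.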

7. What I'd do differently next time

Three concrete changes:

  1. Invest more in offline diff infrastructure. The dark-traffic / shadow-comparison phase is where you catch the silent bugs. Better tooling pays compound interest.
  2. Test the rollback path in production before cutover. A practice rollback against synthetic-but-realistic traffic. Most teams test the migration path obsessively and never test the un-migration path until they need it under stress.
  3. Bring finance into cutover planning earlier. Reconciliation requirements drove three changes to the application layer; we'd have made them more cheaply if finance had been in the room from week one.

Takeaways

  1. Define "zero" before you claim it — availability, transaction, or reconciliation.
  2. Map the five risk categories: stateful coupling, shared resources, downstream contracts, cryptographic boundaries, observability.
  3. Pick the strategy per surface, not for the system as a whole.
  4. The rollback gate must be defined before cutover, be measurable, and have a single owner.
  5. Don't decommission the old system the day after cutover. Soak first.

Fady Massoud, Engineering Manager (Hands-On) at Kort Payments, formerly Lead Software Engineer at Paysafe. 18+ years building FinTech payments platforms.