Decomposing a 1.5M-LOC Monolith Without Stopping the World

The honest playbook for splitting a large payments monolith into 20+ microservices. Bounded contexts without dogma, strangler-fig sequencing, the contract-test layer that buys you speed, and the outbox-pattern mistake we made.

Most "we decomposed our monolith" stories skip the first question: should you? Decomposition is not free. It's not even cheap. It trades single-process simplicity for distributed-system complexity — network partitions, partial failures, schema versioning, deployment coordination, the entire menagerie of problems that arrive the moment a service call leaves your process boundary.

Do it anyway when the cost of not decomposing is larger. Here's how to make that call, and how to execute when you decide to proceed.

1. When NOT to decompose

Three signals that say "leave the monolith alone, or fix what's actually broken first":

  • Your deploy pipeline is the bottleneck, not your code. If deploys take 90 minutes because tests are slow, fix the tests. Twenty microservices with the same slow tests just give you 20 slow pipelines.
  • Your team is small and co-located. Conway's law is real. If three engineers ship every PR, splitting the code doesn't help; it adds coordination cost.
  • The business doesn't have separable failure-isolation needs. If every feature shares the same uptime SLA and the same blast-radius tolerance, the microservice boundary buys you nothing on the operational side.

I've seen all three signals get ignored. The result was a "distributed monolith" — one logical system spread across 20 deployable artifacts, all changing together, all failing together, with all the operational complexity of distribution and none of the benefits.

The right question isn't "should we use microservices?" It's "what specific problem will we no longer have after this?" If you can't name the problem, don't start.

2. Identifying bounded contexts (DDD without dogma)

Domain-Driven Design has a vocabulary problem. "Bounded context" is one of those terms that gets repeated in architecture reviews until it loses meaning. In practice, here's what a bounded context is:

A bounded context is a chunk of code in which the same business term means the same thing. If "customer" in the billing area means something different from "customer" in the fraud area, those are two bounded contexts. The boundary is the place where the term changes meaning.

For a payments platform, the typical bounded contexts are:

  • Authorization — "transaction" means a card-network message in flight
  • Capture / settlement — "transaction" means a financial event with downstream ledger impact
  • Risk / fraud — "transaction" means a feature vector being scored
  • Ledger — "transaction" means a double-entry record
  • Disputes / chargebacks — "transaction" means a historical record being contested
  • Reporting — "transaction" means a row in an analytics warehouse

Six different things, all called "transaction." Each one is a candidate service boundary. The boundary is where the term changes meaning, not where the code happens to be organized today.
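
As a rough illustration of a term changing meaning at a boundary, the sketch below models "transaction" separately in two of the contexts listed above and translates between them explicitly. The class names, field names, and account labels are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class AuthorizationTransaction:
    """Authorization context: a card-network message in flight."""
    network_reference: str  # hypothetical field
    amount: Decimal
    currency: str


@dataclass(frozen=True)
class LedgerTransaction:
    """Ledger context: a double-entry record."""
    debit_account: str   # hypothetical field
    credit_account: str  # hypothetical field
    amount: Decimal
    currency: str


def to_ledger_entry(auth: AuthorizationTransaction,
                    debit_account: str,
                    credit_account: str) -> LedgerTransaction:
    """Translate at the boundary instead of sharing one Transaction class."""
    return LedgerTransaction(debit_account, credit_account,
                             auth.amount, auth.currency)
```

The ledger type carries no card-network fields at all. If one shared class needs fields from both contexts to stay useful, that is usually a hint that two contexts are being forced into one model.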

3. Strangler-fig sequencing — which service first?

The single biggest decision in a decomposition project is the order of extraction. The wrong order makes every subsequent step harder. The right order makes each step cheaper than the last.

Rank candidates by three axes:

  1. Stability — how often does this code change? Extract stable code first. Volatile code stays in the monolith longer because each iteration is cheaper there.
  2. Coupling — how many threads of data and control reach into and out of this area? Extract least-coupled code first.
  3. Business risk — what's the blast radius if this service has a bad day? Extract lower-risk code first to build operational muscle before you bet the company on a riskier service.

For us, the first extraction was reporting (stable, low coupling, low risk). The last was the authorization path (volatile, deeply coupled, highest risk). Each extraction in between built the pattern, the tooling, the team's confidence.
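
One way to make the ranking concrete is a simple additive score. This is a minimal sketch, not the scoring we actually used: the 1-to-5 scale, the equal weighting, and the numbers assigned to each candidate are all assumptions for illustration. Stability is expressed as volatility so that a lower number is better on every axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    volatility: int  # 1 = rarely changes, 5 = changes constantly
    coupling: int    # 1 = few dependencies in or out, 5 = deeply entangled
    risk: int        # 1 = small blast radius, 5 = bet-the-company


def extraction_order(candidates):
    """Lowest combined score first; ties are broken by name for determinism."""
    return sorted(
        candidates,
        key=lambda c: (c.volatility + c.coupling + c.risk, c.name),
    )


# Hypothetical scores for three of the contexts named earlier.
candidates = [
    Candidate("authorization", volatility=5, coupling=5, risk=5),
    Candidate("reporting", volatility=1, coupling=1, risk=1),
    Candidate("disputes", volatility=2, coupling=3, risk=2),
]
order = [c.name for c in extraction_order(candidates)]
```

The value of the exercise is less the final number than the argument the team has while agreeing on each score.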

4. The contract-test layer

The single most valuable engineering investment in a decomposition project is contract testing. Not unit tests, not integration tests — contracts.

A contract test verifies: "When service A sends this shape, service B accepts it; when service B sends this response shape, service A understands it." The contract is the shape, not the behavior. The contract lives in version control. Either side can break it — and either side can detect the break — without spinning up the other side.

We used Pact-style consumer-driven contracts. The consumer publishes its expectations. The producer's CI verifies it still meets them. A breaking change to the producer fails the producer's PR, not the consumer's production.

Without contract tests, every extraction requires you to spin up both sides — locally, in CI, in staging — and you spend more time fighting the test environment than building anything. With contract tests, each service can be developed against a mock of the contract and the mock is verified to match reality continuously.
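
To show the idea without depending on any particular tool, here is a minimal hand-rolled sketch of a shape-only contract check. It is not the Pact API. The contract fields, the function names, and the stand-in producer handler are all hypothetical.

```python
# Hypothetical contract published by the consumer: field name -> expected type.
CAPTURE_RESPONSE_CONTRACT = {
    "transaction_id": str,
    "status": str,
    "amount_minor": int,
}


def contract_violations(contract, payload):
    """Return a list of shape problems; an empty list means the contract holds.

    Only shape is checked (presence and type), not behavior. Extra fields in
    the payload are allowed so the producer can add data without breaking
    existing consumers.
    """
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems


def producer_capture_response():
    """Stand-in for the producer's real handler output."""
    return {
        "transaction_id": "txn_123",
        "status": "captured",
        "amount_minor": 1250,
        "processor": "extra field, ignored by the contract",
    }
```

Run in the producer's CI, a check like this fails the producer's PR as soon as a field is renamed or its type changes, and it needs no running consumer to do so.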

Contract tests are the single highest-ROI investment in a decomposition. Build them in week one, not week twenty.

5. The outbox-pattern mistake we made

The outbox pattern is the canonical answer to "how do I write to my database and publish a message reliably?" The idea: don't publish directly to the message broker; write the message into an "outbox" table in the same database transaction as the business write. A separate process reads the outbox and publishes. If the publish fails, the outbox still has the record; you retry.
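
A minimal sketch of that write path and relay, using Python's built-in sqlite3 module purely so the example is self-contained. The table names, column names, topic name, and the publish callback are assumptions for illustration; a production system would use its own database and broker client.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE payments (id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " sent_at TEXT)"  # NULL until the relay has published the row
    )


def record_payment(conn, payment_id, amount_minor):
    """Business write and outbox write in one database transaction."""
    payload = json.dumps({"payment_id": payment_id, "amount_minor": amount_minor})
    # "with conn" commits both inserts on success and rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, amount_minor) VALUES (?, ?)",
            (payment_id, amount_minor),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment.recorded", payload),
        )


def relay_once(conn, publish):
    """Publish unsent rows in order; returns how many rows were published."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays unsent for retry
        with conn:
            conn.execute(
                "UPDATE outbox SET sent_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)
```

Note that this relay is only safe with a single relay process. A crash between publish and the UPDATE also republishes the row on the next pass, which is why the pattern gives at-least-once delivery rather than exactly-once.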

Beautiful pattern. Here's what we got wrong the first time:

We had the outbox processor running on every service instance. Each instance polled the outbox, published, and marked rows as sent. The problem: nothing claimed a row exclusively, so under load two instances would read the same unsent row and both would publish it before either had marked it as sent. We were producing duplicate messages routinely, not just in the rare crash-and-retry case that an at-least-once pattern allows for.

Three things needed fixing:

  1. Atomic row claim. Use SELECT ... FOR UPDATE SKIP LOCKED (Postgres) or equivalent so only one process claims each row. Easy fix; we'd missed it.
  2. Idempotent consumers. Even with atomic claims, network or process failure can cause re-publish. Consumers must be idempotent. Always.
  3. Outbox row TTL. Without it, the table grows unbounded. After 90 days a "lightweight outbox" was 200M rows. Add archival from day one.
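
The three fixes can be sketched as follows, under stated assumptions. The two SQL strings are PostgreSQL and are shown only as text, not executed here; the outbox table, its sent_at column, the batch size of 100, and the 7-day retention window are hypothetical. The idempotent consumer uses sqlite3 only to keep the example self-contained, and the processed_messages table is likewise an assumption.

```python
import sqlite3

# Fix 1 (PostgreSQL, not executed in this sketch): claim a batch atomically.
# Rows locked by another relay are skipped rather than waited on. The locks
# last until the transaction ends, so publish and mark-as-sent belong inside it.
CLAIM_BATCH_SQL = """
SELECT id, topic, payload
FROM outbox
WHERE sent_at IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED
"""

# Fix 3 (PostgreSQL, not executed in this sketch): purge rows sent long ago.
PURGE_SENT_SQL = """
DELETE FROM outbox
WHERE sent_at < now() - interval '7 days'
"""


def handle_once(conn, message_id, apply_effect):
    """Fix 2: apply a message's effect at most once per message_id.

    The dedupe marker and the effect share one transaction, so a crash
    cannot leave the marker written without the effect, or the reverse.
    Returns True if the effect ran, False for a duplicate delivery.
    """
    with conn:
        try:
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
        except sqlite3.IntegrityError:
            return False  # primary key already present: seen before
        apply_effect(conn)
    return True
```

This assumes every message carries a stable identifier that survives a republish, for example the outbox row id.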

6. Common pitfalls

  • Decomposing across the wrong axis. Splitting "user service / order service / payment service" by noun looks tidy but often cuts across the actual business workflows. Split along the workflow (the use case), not the noun.
  • Shared database for too long. Two services that read each other's tables are one service in disguise. The database is the boundary; share it and you haven't separated anything.
  • Synchronous everything. If service A calls service B calls service C synchronously, your latency is the sum of the hops and your availability is the product of their availabilities. The monolith was a 99.95% system; a chain of three services that are each 99.95% available is about 99.85% at best. Push asynchronous communication wherever the business workflow allows it.
  • Distributed transactions. Two-phase commit across services is a footgun. Use sagas or compensating transactions instead. Idempotency at every step.
  • Versioning by deployment timing. "Don't worry, we'll deploy both services at the same time" is not a versioning strategy. Every service must tolerate at least one previous version of every contract it speaks.
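
The arithmetic behind the synchronous-chain pitfall can be checked in a few lines. This sketch assumes each hop fails independently and that every hop must succeed for the request to succeed; the function names and the latency figures in the usage note are made up for illustration.

```python
import math


def chain_availability(availabilities):
    """Availability of a synchronous chain: the product of every hop."""
    return math.prod(availabilities)


def chain_latency_ms(latencies_ms):
    """Latency of a synchronous chain: the sum of every hop."""
    return sum(latencies_ms)


# Three services, each as available as the 99.95% monolith.
three_hops = chain_availability([0.9995, 0.9995, 0.9995])
```

Here three_hops comes out at roughly 0.9985, which is the 99.85% figure, and chain_latency_ms([20, 35, 50]) returns 105. Each additional synchronous hop multiplies in another factor below 1, so the ceiling only moves down.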

7. What I'd do differently

  1. Invest in observability first, code second. The first month should produce a unified tracing story, structured logs, and standardized metrics. Without it, every new service is opaque from day one.
  2. Build the deployment platform before the services. If onboarding a new service to your platform takes a week, you'll have 5 services in a year. If it takes an hour, you'll have 30. The platform is the multiplier.
  3. Be more aggressive about pulling out the database. We kept a shared database for too long. The day each service got its own data store was the day we actually got the benefits we'd been promising for two years.
  4. Stop earlier. Not every monolith needs to become 20 services. We had two areas that should have stayed in the monolith. Knowing when to stop is a craft skill.
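
For the observability point, a minimal sketch of what "structured logs" can mean in practice: every service emits JSON with the same keys, including a trace identifier that follows the request across services. It uses only Python's standard logging and json modules; the key names and the log_event helper are assumptions, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


def log_event(logger, service, trace_id, message):
    """Hypothetical helper: attach the shared fields to every log line."""
    logger.info(message, extra={"service": service, "trace_id": trace_id})
```

With the same trace_id in every service's logs, one search reconstructs a request's path across the whole system.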

Takeaways

  1. Don't decompose unless you can name the specific problem you'll no longer have.
  2. Bounded contexts = where business terms change meaning. Find them first.
  3. Sequence by stability + low coupling + low risk. Build muscle on the easy ones.
  4. Contract tests are the single highest-ROI investment. Build them in week one.
  5. Outbox pattern: atomic row claim, idempotent consumers, TTL on the table.
  6. Observability and deployment platform before the second service. The platform is the multiplier.

Fady Massoud, Engineering Manager (Hands-On) at Kort Payments, formerly Lead Software Engineer at Paysafe. 18+ years building FinTech payments platforms.