# Fady Massoud — Full site content

> Engineering Manager (Hands-On) and Software Architect with two decades scaling FinTech payments platforms on .NET, Azure, and AWS. Currently leading a 12-engineer team through a zero-downtime AWS migration at Kort Payments. Open to Engineering Manager, Staff / Lead / Principal Engineer, and Software Architect roles in payments, FinTech, and platform infrastructure.

This file follows the [llms.txt](https://llmstxt.org/) convention and provides the concatenated substantive content of the site so an LLM can ingest the full corpus in one fetch. For a curated index, see [llms.txt](/llms.txt).

---

## Currently

- **role** — Engineering Manager (Hands-On) · Kort Payments
- **building** — Zero-downtime AWS migration · CI/CD modernization · On-call program
- **stack** — .NET 8 · AWS (EC2, RDS, SQS) · Kafka · SQL Server
- **learning** — AI-augmented engineering — Claude Code, Claude Cowork, Cursor, Gemini, OpenAI
- **location** — California · Open to remote

---

## About

I run engineering teams the way I wish someone had run mine when I was a Senior IC — clear goals, clear standards, a real on-call rotation, and a manager who can still pair on the hard refactor. **Hands-on Engineering Manager** isn't a hedge; it's a deliberate operating model.

Over the better part of a decade in FinTech payments at Paysafe and Kort, I've done two big platform modernizations end-to-end: decomposing a 1.5M-LOC .NET monolith into 20+ event-driven microservices on Azure, and lifting a mission-critical .NET payments platform to AWS with zero downtime and full PCI DSS compliance. I've owned audit cycles, processor recertifications (TSYS, Fiserv), bank-onboarding integrations, and the secure-SDLC standards that make those things possible.

**What I look for next:** Engineering-led companies in payments, FinTech, or platform infrastructure — places where leadership is technical, not just managerial, and where the on-call pager goes to the person who can read the stack trace.

---

## By the numbers (career totals)

- **20+** .NET Core microservices architected at Paysafe
- **2** platform modernizations end-to-end
- **+40%** release-velocity lift, team-wide
- **12** engineers currently led at Kort
- **2,500 TPS** peak throughput across regulated FinTech payment systems
- **99.95%** production availability across multi-market regulated payment infrastructure
- **150,000** Kafka events / minute with sub-second latency
- **30%** reduction in high-severity Veracode findings via secure-SDLC and dependency hardening
- **Multiple consecutive** clean PCI DSS audit passes

---

## Experience

### Mar 2025 — Present · Engineering Manager (Hands-On) · Lead Software Engineer
**Kort Payments (acquired by Paysafe)** · FinTech / Payments · Remote

- Leading the **zero-downtime AWS migration** of a mission-critical .NET payments platform (EC2, RDS, SQS, CloudWatch, IAM) with full PCI DSS compliance.
- Managing a **12-engineer team**; designed the on-call rotation, blameless post-mortem process, design-doc review practice, and code-review standards.
- Refactored core services with idempotency keys, retry-with-backoff, and circuit breakers — payment-processing latency down 25%.
- Modernized CI/CD: trunk-based, feature flags, blue-green deploys — daily releases instead of weekly.

### Jun 2019 — Feb 2025 · Lead Software Engineer · Solution Architect
**Paysafe Group** · FinTech / Payments · Remote

- Architected enterprise-scale FinTech payments platforms serving millions of users across multiple markets.
- Decomposed a 1.5M-LOC .NET Framework monolith into 20+ .NET Core microservices on Azure (Azure VMs, Event Grid) — DDD bounded contexts and event sourcing.
- Designed the CI/CD platform (Azure DevOps, GitHub Actions, canary deploys) — +40% release velocity team-wide.
- Authored REST API and platform standards (OpenAPI 3.0, idempotency-key conventions, error envelopes) adopted org-wide.
- Owned PCI DSS audit readiness across services; clean audit passes across multiple consecutive cycles.
- Mentored 10+ engineers; led the cross-team architecture forum.

### Jul 2017 — Jun 2019 · Senior Software Engineer
**Paysafe Group** · FinTech / Payments · Irvine, CA

- Designed Kafka producer/consumer services for payments event flows; modernized legacy WCF endpoints to REST.
- Led TSYS and Fiserv processor recertifications and onboarded new banking partners.
- Cut high-severity Veracode findings by 30% via dependency upgrades, input-validation hardening, and secure-coding training.
- Delivered 2FA and platform-wide security enhancements; strengthened PCI DSS compliance.

### Jun 2014 — Jun 2017 · Senior Software Engineer
**CraneMorley** · E-Learning · Long Beach, CA

- Modernized legacy CMS, LMS, and Microsoft Dynamics platforms — refactored ASP, VB, and DotNetNuke for scalability.
- Cut median report-generation time by 85% via SQL Server query rewrites and indexing strategy.
- Led development of client-specific SPAs, certification engines, and SCORM API integrations.

### 2008 — 2014 · Earlier experience

- Senior Software Engineer (Part-Time), The Dependable Companies — Logistics (LA, 2016).
- Software Engineer, Upwork (CA, 2012–2014).
- IT / Junior Software Engineer, Credit Agricole Bank — Banking (Cairo, 2009–2011).
- .NET Software Engineer, Travel Solutions Egypt (Cairo, 2008–2009).

---

## Skills (full)

### Leadership & Org
Engineering management · Hiring & interview loops · Performance calibration · On-call program design · Blameless post-mortems · Design docs & decision records · Code-review standards · Tech-debt budgeting · OKRs & roadmapping

### Architecture & Distributed Systems
Microservices · Event-Driven Architecture · DDD & CQRS · Event Sourcing · Idempotency keys · Circuit Breaker / Retry / Backoff · REST / OpenAPI · Rate limiting · Secure SDLC

### Backend & Languages
C# / .NET 8 / .NET Core · ASP.NET / Web API · WCF (modernizing to REST) · TypeScript / JavaScript · Angular / React · SQL Server · MongoDB / Cosmos DB · Kafka

### AWS
EC2 · RDS · DynamoDB · SQS · S3 · CloudWatch · IAM · KMS

### Azure
Azure VMs · Event Grid · SQL Server · Azure DevOps

### DevOps & Reliability
Feature flags · Blue-green / canary deploys · Docker / containers · CI/CD pipelines (Azure DevOps, GitHub Actions) · SLO / SLI / error budgets · Structured logging

### Security & Compliance
PCI DSS (multi-cycle audit) · OWASP Top-10 · Veracode (SAST) · DAST · TLS 1.2+ · Secrets management · Secure SDLC · Audit remediation

### AI-Augmented Engineering
AI pair-programming (Cursor) · Prompt-engineering for backend scaffolding · Cursor / Claude / ChatGPT / Gemini · Team-level AI workflow rollout · Usage guidelines

### Tools
Visual Studio · VS Code · Git · Jira · Bitbucket · Confluence · SonarQube

---

## Open to

- **Roles:** Engineering Manager (Hands-On) · Staff / Lead / Principal Engineer · Software Architect
- **Domains:** Payments · FinTech · Platform infrastructure · Distributed systems
- **Mode:** Remote (US-based) · Select hybrid
- **Auth:** Authorized to work in the USA · Green Card

---

# Articles

## Article 1 — Migrating a .NET Payments Platform to AWS with Zero Downtime

*Tag: Architecture · 12 min read · 2026*

**Dek:** "Zero downtime" is the most over-claimed phrase in cloud migration. Here's what it actually means for a payments platform, the five categories of risk most teams skip, and the rollback gate that saved us at 3 AM.

I've watched three teams in two companies claim "zero-downtime migration" and all three quietly dropped transactions during cutover. Not many — single digits per minute for a few minutes — but in payments, even one dropped authorization is a customer who blames you, a chargeback risk, and a reconciliation headache that follows you for weeks. The phrase has become marketing. This is what it actually requires, drawn from leading the migration of a mission-critical .NET payments platform from on-prem to AWS.

### 1. Define "zero" before you claim it

"Zero downtime" needs a measurable definition before the cutover plan can be written. Pick one of the following and write it down. If you can't, you're not ready to migrate.

- **Availability zero** — no HTTP 5xx returned to clients during the cutover window. Achievable.
- **Transaction zero** — no payment authorization fails or is duplicated due to the cutover itself. Hard. Requires idempotency end-to-end.
- **Reconciliation zero** — by end of cutover day, every transaction is in exactly one ledger, with the same status everywhere. Hardest. Most teams skip this and pay the price the next week.

Reconciliation is the hardest because the failure is silent. If your post-cutover ledger has a transaction that exists in the old system and not the new (or vice versa, or with a different status), you've migrated the application but corrupted the data. That counts as downtime in finance even if every customer saw an instant `200 OK`.

> If "zero downtime" isn't defined as one of: availability, transaction, or reconciliation — it's not a goal. It's a vibe.
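
One way to make reconciliation zero checkable is to diff the two ledgers by transaction ID. This is a minimal sketch, assuming both ledgers are expected to hold every transaction for the window being compared (as they would during a dual-write phase). `LedgerEntry`, `reconcile`, and the status strings are hypothetical names for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    transaction_id: str
    status: str  # hypothetical values, e.g. "AUTHORIZED", "CAPTURED"


def reconcile(
    old_ledger: list[LedgerEntry], new_ledger: list[LedgerEntry]
) -> dict[str, list[str]]:
    """Return the transaction IDs that violate reconciliation zero."""
    old = {e.transaction_id: e.status for e in old_ledger}
    new = {e.transaction_id: e.status for e in new_ledger}
    shared = old.keys() & new.keys()
    return {
        # Present in the old system but never written to the new one.
        "missing_in_new": sorted(old.keys() - new.keys()),
        # Present in the new system but unknown to the old one.
        "missing_in_old": sorted(new.keys() - old.keys()),
        # Present in both, but the two systems disagree on status.
        "status_mismatch": sorted(t for t in shared if old[t] != new[t]),
    }
```

Three empty lists for the cutover window is the measurable form of reconciliation zero; anything else is a list of transactions to investigate before the old system is touched.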

### 2. Map the five risk categories

Most cutover failures come from one of five risk categories. Map each one explicitly before writing the cutover playbook.

**2.1 Stateful coupling.** Any process that holds state in memory — session caches, in-process queues, retry queues that never made it to disk — will lose that state when you cut over. For each component, ask: *"What is held in process memory right now, and what happens to it if this process dies in 60 seconds?"* If the answer is unclear, fix it before cutover.

**2.2 Shared resources.** Database, cache, message broker, file storage. If you're keeping any shared resource (often the database, during the first cutover phase), you must verify that the new system reads from and writes to it in a *byte-identical* way to the old one. Locale settings, timezone handling, decimal precision, BOM in encoded fields: these are the silent killers.

**2.3 Downstream contracts.** Card networks, banks, processors (TSYS, Fiserv, Stripe), KYC providers, fraud services. Every integration has a contract — explicit or assumed — about latency, idempotency, retries, and error handling. The new system must honor every assumed contract *plus* survive the increased latency from running in AWS vs. on-prem during the transition.

**2.4 Cryptographic boundaries.** Anywhere you encrypt, sign, or hash with a secret. PCI DSS, in particular, cares about HSM and key-management boundaries. Migrating the application without migrating the key custody story is a compliance incident waiting to happen. KMS keys must be provisioned, rotated, and proven equivalent *before* any production traffic.

**2.5 Observability gap.** The most common failure: the new system has different metrics, different log formats, different dashboards. During the 4-hour cutover window, the on-call engineer can't correlate "the spike at 03:14" between the old graph and the new. Either build a unified dashboard before cutover, or you are flying blind during the exact period you most need vision.

### 3. Pick the right strategy: strangler-fig, blue-green, or dual-write

These three are not interchangeable. They suit different risk profiles:

- **Strangler-fig** — route X% of traffic to the new system, ramp slowly. Works when both systems can independently process traffic and reconcile asynchronously. Best for stateless, idempotent operations.
- **Blue-green** — both systems run in parallel; cutover is an atomic switch. Works when downstream cannot tolerate any duplicate processing. Requires a clean handoff moment.
- **Dual-write** — both systems receive every write. Works for the database-migration sub-problem within a broader strategy. Painful, but provides the safest correctness guarantee for the data layer.

For payments specifically, I favor **strangler-fig at the API gateway** for read traffic, **blue-green** for the authorization path, and **dual-write** for the ledger. The strategy isn't one-size — it varies per surface.
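
For the strangler-fig slice, "route X% of traffic" works best when the split is deterministic, so the same caller keeps landing on the same side as the percentage ramps. A minimal sketch, assuming a stable routing key such as a merchant ID is available at the gateway; `bucket` and `route` are hypothetical helper names, not part of any gateway product.

```python
import hashlib


def bucket(routing_key: str) -> int:
    """Map a routing key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(routing_key: str, new_system_percent: int) -> str:
    """Return "new" for a stable slice of keys, "old" for the rest."""
    if not 0 <= new_system_percent <= 100:
        raise ValueError("new_system_percent must be between 0 and 100")
    # A key whose bucket is below the percentage goes to the new system.
    # Raising the percentage only ever moves keys from "old" to "new".
    return "new" if bucket(routing_key) < new_system_percent else "old"
```

Because the bucket depends only on the key, a merchant routed to the new system at 5% stays there at 25% and beyond, which keeps per-merchant behavior consistent during the ramp.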

### 4. The cutover playbook

A structure that has survived three production payments-platform cutovers without losing a transaction:

1. **T-7 days: Dark traffic.** Mirror 100% of production traffic to the new system. Discard responses. Compare outputs offline. Goal: surface diffs you didn't know existed.
2. **T-3 days: Shadow comparison.** Same mirrored traffic as above, but actively diff every response. A diff rate above the pre-agreed threshold aborts the migration.
3. **T-1 day: Synthetic traffic at scale.** Synthetic load that mimics peak production through the new path end-to-end, including downstream calls in a test mode.
4. **T-0: Gradual ramp.** 1% → 5% → 25% → 50% → 100% over a controlled window. Each step has explicit success criteria — error rate, latency p95, downstream-contract success — and an explicit abort condition.
5. **T+1 day: Soak.** 100% traffic on the new system, old system kept warm. Reconcile every transaction. Verify ledger integrity.
6. **T+7 days: Old-system decommissioning.** Not before. The old system is your rollback. Don't burn it the day after cutover, no matter how confident you feel.
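
The shadow-comparison step can be sketched as a diff over paired responses. This is an illustration only: the ignored field names and the threshold values are assumptions, and a real comparison would likely need per-endpoint normalization rules.

```python
from typing import Any

# Fields expected to differ between systems (hypothetical names).
IGNORED_FIELDS = frozenset({"timestamp", "trace_id"})


def normalize(response: dict[str, Any]) -> dict[str, Any]:
    """Drop fields that legitimately differ between old and new."""
    return {k: v for k, v in response.items() if k not in IGNORED_FIELDS}


def shadow_diff_rate(pairs: list[tuple[dict[str, Any], dict[str, Any]]]) -> float:
    """Fraction of (old, new) response pairs that differ after normalization."""
    if not pairs:
        return 0.0
    mismatches = sum(1 for old, new in pairs if normalize(old) != normalize(new))
    return mismatches / len(pairs)


def should_abort(
    pairs: list[tuple[dict[str, Any], dict[str, Any]]], threshold: float
) -> bool:
    """True when the diff rate exceeds the pre-agreed threshold."""
    return shadow_diff_rate(pairs) > threshold
```

The threshold belongs in the written playbook, not in someone's head; the function only makes the abort decision mechanical.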

### 5. The rollback gate

The single most important artifact in any cutover playbook is the **rollback gate**: a written, pre-agreed criterion under which you halt the ramp and route traffic back. Three rules:

- The gate is defined **before** the cutover, not during. (3 AM is the worst time to decide "is this bad enough to roll back?")
- The gate is **measurable** — a specific error rate, latency target, or downstream-failure count. Not "if it feels wrong."
- The gate is owned by **one person**. Distributed accountability at 3 AM equals no accountability.

> On our migration, the rollback gate fired at 03:14 — at the 50% ramp. A downstream processor was returning a soft error our new system handled differently. We rolled back in eleven minutes. Without the gate, we'd have escalated for an hour before deciding.
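
The three rules translate naturally into a small data structure: fixed thresholds, a named owner, and a check that returns reasons rather than opinions. A sketch under stated assumptions: `RollbackGate`, its field names, and every threshold value are hypothetical, not the numbers used on the migration described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackGate:
    owner: str                    # the one person who makes the call
    max_error_rate: float         # fraction of requests, e.g. 0.001 = 0.1%
    max_p95_latency_ms: float
    max_downstream_failures: int

    def breached(
        self, error_rate: float, p95_latency_ms: float, downstream_failures: int
    ) -> list[str]:
        """Return the names of every threshold currently exceeded."""
        reasons = []
        if error_rate > self.max_error_rate:
            reasons.append("error_rate")
        if p95_latency_ms > self.max_p95_latency_ms:
            reasons.append("p95_latency_ms")
        if downstream_failures > self.max_downstream_failures:
            reasons.append("downstream_failures")
        return reasons


# Hypothetical thresholds, agreed and written down before the cutover.
gate = RollbackGate(
    owner="on-call lead",
    max_error_rate=0.001,
    max_p95_latency_ms=800.0,
    max_downstream_failures=5,
)
```

A non-empty list from `gate.breached(...)` at any ramp step means the owner halts the ramp and routes traffic back; there is nothing left to debate at 3 AM.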

### 6. Common mistakes

- **Migrating during low traffic.** Tempting, but you don't surface real problems until peak. Better to migrate at moderate traffic with full team on-call.
- **Skipping the soak.** "We're at 100%, decommission the old system!" — no. The old system is your insurance. Pay for it for another week.
- **Underestimating cross-AZ latency.** If the new system places a previously-co-located dependency in a different AZ, p95 latency can rise sharply with no warning. Profile latency end-to-end before cutover, not after.
- **Forgetting the batch jobs.** Hourly settlement runs, nightly reconciliation, weekly reports — these have to be migrated too. Most teams remember the API; many forget the cron.
- **Treating "PCI DSS" as a checkbox.** Compliance scope changes when infrastructure changes. The AWS environment has its own AOC requirements, key-management story, and network-segmentation rules. Re-scope your audit boundary *before* cutover, not during the next audit cycle.

### 7. What I'd do differently next time

1. **Invest more in offline diff infrastructure.** The dark-traffic / shadow-comparison phase is where you catch the silent bugs. Better tooling pays compound interest.
2. **Test the rollback path in production before cutover.** A practice rollback against synthetic-but-realistic traffic. Most teams test the migration path obsessively and never test the un-migration path until they need it under stress.
3. **Bring finance into cutover planning earlier.** Reconciliation requirements drove three changes to the application layer; we'd have made them more cheaply if finance had been in the room from week one.

### Takeaways

1. Define "zero" before you claim it — availability, transaction, or reconciliation.
2. Map the five risk categories: stateful coupling, shared resources, downstream contracts, cryptographic boundaries, observability.
3. Pick the strategy per surface, not for the system as a whole.
4. The rollback gate must be defined before cutover, be measurable, and have a single owner.
5. Don't decommission the old system the day after cutover. Soak first.

---

## Article 2 — Decomposing a 1.5M-LOC Monolith Without Stopping the World

*Tag: Architecture · 14 min read · 2026*

**Dek:** The honest playbook for splitting a large payments monolith into 20+ microservices. Bounded contexts without dogma, strangler-fig sequencing, the contract-test layer that buys you speed, and the outbox-pattern mistake we made.

Most "we decomposed our monolith" stories skip the first question: *should you?* Decomposition is not free. It's not even cheap. It trades single-process simplicity for distributed-system complexity — network partitions, partial failures, schema versioning, deployment coordination, the entire menagerie of problems that arrive the moment a service call leaves your process boundary. Do it anyway when the cost of *not* decomposing is larger.

### 1. When NOT to decompose

Three signals that say "leave the monolith alone, or fix what's actually broken first":

- **Your deploy pipeline is the bottleneck, not your code.** If deploys take 90 minutes because tests are slow, fix the tests. A microservice with the same slow tests just gives you 20 slow pipelines.
- **Your team is small and co-located.** Conway's law is real. If three engineers ship every PR, splitting the code doesn't help; it adds coordination cost.
- **The business doesn't have separable failure-isolation needs.** If every feature shares the same uptime SLA and the same blast-radius tolerance, the microservice boundary buys you nothing on the operational side.

I've seen all three signals get ignored. The result was a "distributed monolith" — one logical system spread across 20 deployable artifacts, all changing together, all failing together, with all the operational complexity of distribution and none of the benefits.

> The right question isn't "should we use microservices?" It's "what specific problem will we no longer have after this?" If you can't name the problem, don't start.

### 2. Identifying bounded contexts (DDD without dogma)

**A bounded context is a chunk of code in which the same business term means the same thing.** If "customer" in the billing area means something different from "customer" in the fraud area, those are two bounded contexts. The boundary is the place where the term changes meaning.

For a payments platform, the typical bounded contexts are:

- **Authorization** — "transaction" means a card-network message in flight
- **Capture / settlement** — "transaction" means a financial event with downstream ledger impact
- **Risk / fraud** — "transaction" means a feature vector being scored
- **Ledger** — "transaction" means a double-entry record
- **Disputes / chargebacks** — "transaction" means a historical record being contested
- **Reporting** — "transaction" means a row in an analytics warehouse

Six different things, all called "transaction." Each one is a candidate service boundary. The boundary is where the term changes meaning, not where the code happens to be organized today.

### 3. Strangler-fig sequencing — which service first?

The single biggest decision in a decomposition project is the *order* of extraction. The wrong order makes every subsequent step harder. The right order makes each step cheaper than the last. Rank candidates by three axes:

1. **Stability** — how often does this code change? Extract *stable* code first. Volatile code stays in the monolith longer because each iteration is cheaper there.
2. **Coupling** — how many threads of data and control reach into and out of this area? Extract *least-coupled* code first.
3. **Business risk** — what's the blast radius if this service has a bad day? Extract *lower-risk* code first to build operational muscle before you bet the company on a riskier service.

For us, the first extraction was reporting (stable, low coupling, low risk). The last was the authorization path (volatile, deeply coupled, highest risk). Each extraction in between built the pattern, the tooling, the team's confidence.
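
The three-axis ranking can be made explicit with a simple score. This is only a sketch: the 1 to 5 scales, the equal weighting, and the sample scores are illustrative assumptions, not measurements from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    stability: int  # 1 (changes constantly) .. 5 (rarely changes)
    coupling: int   # 1 (isolated) .. 5 (deeply entangled)
    risk: int       # 1 (small blast radius) .. 5 (bet-the-company)


def extraction_order(candidates: list[Candidate]) -> list[str]:
    """Order candidates so stable, loosely coupled, low-risk code comes first."""
    def score(c: Candidate) -> int:
        # Equal weights are an assumption; tune them to your own context.
        return c.stability - c.coupling - c.risk

    return [c.name for c in sorted(candidates, key=score, reverse=True)]


# Hypothetical scores, chosen only to illustrate the ranking.
candidates = [
    Candidate("authorization", stability=1, coupling=5, risk=5),
    Candidate("reporting", stability=5, coupling=1, risk=1),
    Candidate("disputes", stability=3, coupling=3, risk=2),
]
order = extraction_order(candidates)
```

The value of writing the scores down is less the arithmetic than the argument it forces: the team has to agree on why one area is "more coupled" than another before the first extraction starts.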

### 4. The contract-test layer

The single most valuable engineering investment in a decomposition project is contract testing. Not unit tests, not integration tests — contracts.

A contract test verifies: *"When service A sends this shape, service B accepts it; when service B sends this response shape, service A understands it."* The contract is the shape, not the behavior. The contract lives in version control. Either side can break it — and either side can detect the break — without spinning up the other side.

We used **Pact**-style consumer-driven contracts. The consumer publishes its expectations. The producer's CI verifies it still meets them. A breaking change to the producer fails the producer's PR, not the consumer's production.

> Contract tests are the single highest-ROI investment in a decomposition. Build them in week one, not week twenty.
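
To show the idea without tying it to a tool, here is a hand-rolled consumer-driven check. It is not Pact's API; the contract format, the field names, and `contract_violations` are hypothetical and exist only for this illustration.

```python
from typing import Any

# The consumer's published expectation: field name -> required type.
CONSUMER_CONTRACT: dict[str, type] = {
    "transaction_id": str,
    "amount_minor": int,  # amount in minor units, e.g. cents
    "currency": str,
    "status": str,
}


def contract_violations(
    contract: dict[str, type], response: dict[str, Any]
) -> list[str]:
    """List every way a producer response breaks the consumer's contract."""
    violations = []
    for field, expected in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            violations.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    # Extra fields are tolerated: the producer may add, but not remove or retype.
    return violations
```

Run in the producer's CI against a sample response, a non-empty result fails the producer's PR, which is the property that matters: the break is caught before the consumer ever sees it.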

### 5. The outbox-pattern mistake we made

The outbox pattern is the canonical answer to "how do I write to my database and publish a message reliably?" The idea: don't publish directly to the message broker; write the message into an "outbox" table in the same database transaction as the business write. A separate process reads the outbox and publishes. If the publish fails, the outbox still has the record; you retry.

Beautiful pattern. Here's what we got wrong the first time:

We had the outbox processor running on every service instance. Each instance polled the outbox, published, and marked rows as sent. **The problem:** under load, two instances would read the same unsent row, one would publish, and the other would publish a duplicate before realizing the row was taken. We were producing duplicate messages, exactly the failure mode the outbox is supposed to prevent.

Three things needed fixing:

1. **Atomic row claim.** Use `SELECT ... FOR UPDATE SKIP LOCKED` (Postgres) or an equivalent, such as the `UPDLOCK, READPAST` table hints on SQL Server, so only one process claims each row.
2. **Idempotent consumers.** Even with atomic claims, network or process failure can cause re-publish. Consumers must be idempotent. Always.
3. **Outbox row TTL.** Without it, the table grows unbounded. After 90 days a "lightweight outbox" was 200M rows. Add archival from day one.
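
The first two fixes can be sketched in a few lines. To keep the example self-contained it uses SQLite, which has no `SKIP LOCKED`, so the atomic claim is swapped for a conditional `UPDATE` whose row count tells each worker whether it won. The table layout, `try_claim`, `claim_next`, and `IdempotentConsumer` are hypothetical names, not the code we shipped.

```python
import sqlite3
from typing import Optional


def create_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " claimed_by TEXT)"
    )
    conn.commit()


def try_claim(conn: sqlite3.Connection, row_id: int, worker: str) -> bool:
    """Atomically claim one row; True only for the worker that wins."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE outbox SET status = 'claimed', claimed_by = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row_id),
        )
    return cur.rowcount == 1


def claim_next(conn: sqlite3.Connection, worker: str) -> Optional[tuple[int, str]]:
    """Claim the oldest pending row, skipping rows another worker took."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        if try_claim(conn, row_id, worker):
            return row_id, payload
    return None


class IdempotentConsumer:
    """Applies each message ID at most once, even if it is delivered twice."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.applied: list[str] = []

    def handle(self, message_id: str, payload: str) -> bool:
        if message_id in self.seen:
            return False  # duplicate delivery, safely ignored
        self.seen.add(message_id)
        self.applied.append(payload)
        return True
```

In production the `seen` set would live in durable storage next to the consumer's own data; the in-memory set here only demonstrates the contract.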

### 6. Common pitfalls

- **Decomposing across the wrong axis.** Splitting "user service / order service / payment service" by noun looks tidy but often cuts across the actual business workflows. Split along the workflow (the use case), not the noun.
- **Shared database for too long.** Two services that read each other's tables are one service in disguise. The database is the boundary; share it and you've made nothing.
- **Synchronous everything.** If service A calls service B calls service C synchronously, your latency is the sum and your availability is the product. The monolith was a 99.95% system; the chain of three is 99.85% at best. Push asynchronous communication wherever the business workflow allows it.
- **Distributed transactions.** Two-phase commit across services is a footgun. Use sagas or compensating transactions instead. Idempotency at every step.
- **Versioning by deployment timing.** "Don't worry, we'll deploy both services at the same time" is not a versioning strategy. Every service must tolerate at least one previous version of every contract it speaks.
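
The arithmetic behind the "synchronous everything" pitfall is short enough to check directly. The sketch assumes each hop fails independently and is as available as the monolith was; the latency figures are made up for illustration.

```python
import math


def chain_availability(availabilities: list[float]) -> float:
    """Availability of a synchronous chain: the product of each hop."""
    return math.prod(availabilities)


def chain_latency_ms(latencies_ms: list[float]) -> float:
    """Latency of a synchronous chain: the sum of each hop."""
    return sum(latencies_ms)


# Three services in series, each at the monolith's 99.95%.
three_hops = chain_availability([0.9995, 0.9995, 0.9995])  # about 0.9985

# Hypothetical per-hop latencies in milliseconds.
total_ms = chain_latency_ms([40, 25, 60])  # 125
```

Every synchronous hop added to the chain multiplies in another number below one, which is why the asynchronous path is worth the extra design effort wherever the workflow allows it.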

### 7. What I'd do differently

1. **Invest in observability first, code second.** The first month should produce a unified tracing story, structured logs, and standardized metrics. Without it, every new service is opaque from day one.
2. **Build the deployment platform before the services.** If onboarding a new service to your platform takes a week, you'll have 5 services in a year. If it takes an hour, you'll have 30. The platform is the multiplier.
3. **Be more aggressive about pulling out the database.** We kept a shared database for too long. The day each service got its own data store was the day we actually got the benefits we'd been promising for two years.
4. **Stop earlier.** Not every monolith needs to become 20 services. We had two areas that should have stayed in the monolith. Knowing when to stop is a craft skill.

### Takeaways

1. Don't decompose unless you can name the specific problem you'll no longer have.
2. Bounded contexts = where business terms change meaning. Find them first.
3. Sequence by stability + low coupling + low risk. Build muscle on the easy ones.
4. Contract tests are the single highest-ROI investment. Build them in week one.
5. Outbox-pattern: atomic row claim, idempotent consumers, TTL on the table.
6. Observability and deployment platform *before* the second service. The platform is the multiplier.

---

## Article 3 — What "Hands-On Engineering Manager" Actually Means

*Tag: Engineering Leadership · 10 min read · 2026*

**Dek:** The math behind 30% time in code. Where it pays compound interest. Where it backfires. And the one question I ask before every IC task I take on.

"Hands-on engineering manager" is one of the most over-used phrases in tech-hiring listings. Half the time it means "we don't have headcount to hire a tech lead and an EM, so we want one person." The other half it means something real and important — but most candidates and most managers struggle to articulate which version they're living in. Here's what I've learned over five years of running engineering teams while still writing code.

### 1. The math: why 30%?

The number I aim for is **30% of my week in code**. Not 50%. Not 10%. Thirty.

- **~25%** on 1:1s with reports, peers, and leadership.
- **~15%** on broader org work — calibration, hiring, performance reviews, planning cycles.
- **~10%** on the team's roadmap and dependencies — partners in Product, DevOps, Compliance, Security.
- **~10%** on async written work — status updates, RFCs, internal docs, code review comments.
- **~10%** on unblocking — the surprise meetings, the urgent escalations.
- **~30%** for IC work — coding, debugging, architecture.

The 30% is the residual. It's what's left after the management work that *only the manager can do*. If management eats more than 70% of your week, you are left with less than 30% for code. If management eats less than 50%, you're probably under-managing and the team is paying for it.

> 30% is the sweet spot only if the other 70% is genuinely manager-only work. The trap is filling the 70% with things ICs could do.

### 2. Where hands-on pays compound interest

Five places where IC time as an EM is worth disproportionately more than the same hours as a pure IC:

**2.1 Code review on cross-cutting concerns.** Reviewing PRs that span multiple sub-teams or that touch shared infrastructure. As EM you have the cross-team context an IC reviewer often doesn't.

**2.2 The hardest 5% of the hardest project.** Every project has a 5% that's genuinely difficult. Pairing on that 5%, or owning it outright, frees the team to ship the other 95% confidently. Don't take the easy 5%. Take the hard one.

**2.3 Bootstrap work for new patterns.** When the team is adopting a new pattern — first microservice extraction, first use of a new database, first event-driven flow — writing the reference implementation yourself sets the bar.

**2.4 The week before an audit.** PCI DSS prep, security review, compliance audit — these need someone who can read code *and* talk to auditors. Often that someone has to be you.

**2.5 Production incidents.** I don't take primary on-call. But I'm in every Sev-1, and I'm the person who can read the trace, the code, and the deploy history simultaneously without 30 minutes of context-rebuilding.

### 3. Where hands-on backfires

Three traps. I've fallen into each at least once.

**3.1 Taking the critical path.** You write the code on the dependency that the team's release depends on. You also have a 1:1 cycle, an interview loop, and a leadership update. Predictable: the code slips. The release slips with it. Six engineers idle. **Never own the critical path unless you are willing to drop the management work to ship it.** And dropping management work is usually wrong.

**3.2 The PR you never finish.** You open a draft. You don't touch it for three weeks. You eventually close it. Either ship the PR within a sprint or close it and hand it off cleanly.

**3.3 The architecture you authored.** You design the system. You wrote the original code. Now nobody on the team is willing to push back on your assumptions. The team's collective judgment gets quietly worse over months.

### 4. The question I ask before every IC task

> "If I take this task, what management work am I dropping — and is that drop worth less than what someone else on the team would drop to take this task instead?"

If the answer is "no one would drop anything important; this would just be slower without me" — take it. If the answer is "I'd drop performance prep" or "I'd skip my 1:1s this week" — almost always wrong. The team will accept slower IC work. The team will not accept skipped 1:1s.

### 5. Anti-patterns to watch for

- **The "EM who's really a Staff Engineer with a team."** They write 60% of the code, never run a real 1:1, and the team is functionally manager-less.
- **The "EM who never codes."** Loses technical credibility within six months. Stops being able to call out bad design in reviews.
- **The "EM who codes on weekends."** The math doesn't work — they're paying with their off-hours for IC work they should have delegated.
- **The "EM who codes only on greenfield."** Always the new shiny thing; never the unglamorous maintenance.

### 6. What to ask in an interview

If you're a candidate evaluating an EM role advertised as "hands-on":

- "What percentage of a typical week does the current EM spend in code?" If >50% or <10%, ask what gives.
- "Has the current EM authored a production-merged PR in the last month? Last quarter?" Concrete signal.
- "How does the team handle code review when the EM is the author? Who can block?"
- "What's the IC work the EM owns vs. delegates? Who decided that split?"

### Takeaways

1. 30% in code is the target. The 70% must genuinely be manager-only work.
2. Spend IC time on cross-cutting reviews, the hardest 5%, bootstrap work, audit prep, and Sev-1s.
3. Don't take the critical path. Don't open PRs you won't finish. Don't be the architect the team won't push back on.
4. Before each IC task: "what management work am I dropping, and is that drop worth it?"
5. If you're interviewing for a "hands-on" EM role, ask concrete questions: percentage of time in code, last merged PR, code-review handling.

---

## Article 4 — Idempotency Keys in Payments: Five Mistakes I've Watched Get Shipped

*Tag: Payments · 11 min read · 2026*

**Dek:** The "obviously correct" idempotency implementations that ship duplicate charges anyway. Composite keys, retry windows, what to actually store, and which layer is the right one.

Every payments engineer has read the same idempotency-key blog post. The pattern looks simple: client sends a key, server stores the result against that key, repeat requests return the stored result. What could go wrong? A lot. Here are the five mistakes I've watched ship to production over the years, including one I shipped myself.

### Mistake 1 — Composite keys missing the tenant

You're a multi-tenant payments platform. Two merchants both happen to choose the same idempotency key. Merchant A sends key `abc-123`. Merchant B sends the same key. Your server returns Merchant A's response to Merchant B. Each merchant sees a different bug.

**Fix:** the idempotency key in your data layer is always `(tenant_id, key)`, not just `key`. Always. Even if you "control the key generation," you don't, because tomorrow you'll integrate a partner who doesn't.

```sql
-- WRONG
CREATE UNIQUE INDEX idx_idemp ON idempotency_records (idempotency_key);

-- RIGHT
CREATE UNIQUE INDEX idx_idemp ON idempotency_records (tenant_id, idempotency_key);
```

> Idempotency is always scoped. Decide what the scope is (tenant, account, API key, route) and make the scope part of the index — not a runtime check.
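
The effect of the composite index is easy to demonstrate against an in-memory SQLite database. The table and index names follow the SQL above; the `response_body` column and the helper functions are assumptions added for the example.

```python
import sqlite3
from typing import Optional


def create_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE idempotency_records ("
        " tenant_id TEXT NOT NULL,"
        " idempotency_key TEXT NOT NULL,"
        " response_body TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX idx_idemp "
        "ON idempotency_records (tenant_id, idempotency_key)"
    )
    conn.commit()


def record(conn: sqlite3.Connection, tenant_id: str, key: str, body: str) -> bool:
    """Store a result; False if this tenant has already used this key."""
    try:
        with conn:  # commits on success, rolls back on IntegrityError
            conn.execute(
                "INSERT INTO idempotency_records VALUES (?, ?, ?)",
                (tenant_id, key, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def lookup(conn: sqlite3.Connection, tenant_id: str, key: str) -> Optional[str]:
    """Fetch the stored result for this tenant's key, if any."""
    row = conn.execute(
        "SELECT response_body FROM idempotency_records "
        "WHERE tenant_id = ? AND idempotency_key = ?",
        (tenant_id, key),
    ).fetchone()
    return row[0] if row else None
```

With the scope in the index, two merchants sending `abc-123` get two independent records, and the database itself rejects a second insert for the same merchant and key.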

### Mistake 2 — Retry windows too short (or absent)

How long do you store the idempotency record? Many implementations answer with a default: 24 hours, 7 days, "until we delete it." The right answer depends on the client. A point-of-sale terminal on a 2G modem in Yemen may still be retrying 4 hours later. A web client typically gives up in seconds. A batch job might retry the next morning.

**Fix:** match retention to the longest realistic retry window across all client types. For card-present payments, this means *at least* 24 hours and ideally 7 days.

Don't conflate "retention" with "this key is reusable." Even 7 days later, the same key with a different request body must be rejected, not allowed to "create a new charge under an old key." Otherwise you've invented a new vulnerability.
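
One way to make the retention decision explicit is to derive it from a table of client retry windows rather than hard-coding a default. The sketch below is illustrative only: the 4-hour POS figure and the 7-day floor come from the text above, while the web and batch values and all function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retry windows per client type. Only the 4-hour POS figure
# comes from the article; the other two values are placeholders.
RETRY_WINDOWS = {
    "pos_terminal": timedelta(hours=4),
    "web": timedelta(seconds=30),
    "batch_job": timedelta(hours=24),
}
SAFETY_FLOOR = timedelta(days=7)  # the "ideally 7 days" from the fix above


def retention() -> timedelta:
    """Longest realistic retry window, never below the safety floor."""
    return max(max(RETRY_WINDOWS.values()), SAFETY_FLOOR)


def expires_at(created_at: datetime) -> datetime:
    """When the cleanup job may delete a record created at created_at."""
    return created_at + retention()


def check_reuse(stored_fingerprint: str, incoming_fingerprint: str) -> str:
    """Within retention a key can be replayed, but never reused."""
    if stored_fingerprint == incoming_fingerprint:
        return "replay"  # same request: return the stored outcome
    return "reject"  # same key, different body: refuse it


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(expires_at(created).isoformat())  # 2026-01-08T00:00:00+00:00
```

Adding a new client type then forces a conscious change to `RETRY_WINDOWS` instead of silently inheriting a default.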

### Mistake 3 — Storing the response, not the state machine

The naive implementation stores the response body. Client retries → server returns the stored response body. Done.

What if the original request *hadn't finished yet* when the retry came in? The first request is mid-flight to the card network. The retry hits the server. There's no stored response yet. The server begins processing the retry as a new request. Now you have two in-flight authorizations against the network for the same logical operation.

**Fix:** store the *state*, not the response. The idempotency record has a state field: `RECEIVED` → `PROCESSING` → `COMPLETED` | `FAILED`.

- Repeat request, state is `PROCESSING` → either block-and-wait, or return `409 Conflict` with "operation in progress." Both are correct; pick one.
- Repeat request, state is `COMPLETED` → return the stored response.
- Repeat request, state is `FAILED` → depends on whether the failure was retryable. If it was transient, you usually let the retry through as a fresh attempt; if it was terminal, return the stored failure.

The state machine is the real idempotency contract. Stored-response-only implementations are a buggy subset.
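
The three rules above can be sketched as a single decision function. This is an illustration under assumptions, not a reference implementation: the `retryable` flag, the `Record` shape, and the choice of `409` over block-and-wait are all mine, and treating `RECEIVED` the same as `PROCESSING` is a judgment call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    RECEIVED = "RECEIVED"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Record:
    state: State
    response_status: Optional[int] = None
    response_body: Optional[str] = None
    retryable: bool = False  # hypothetical flag, set when marking FAILED


def on_repeat(record: Record) -> Tuple[str, Optional[int], Optional[str]]:
    """Decide what a repeat request gets, based only on the stored state.

    Returns (action, status, body); action is "respond" or "reprocess".
    """
    if record.state in (State.RECEIVED, State.PROCESSING):
        # This sketch picks the 409 option; block-and-wait is equally valid.
        return ("respond", 409, "operation in progress")
    if record.state is State.COMPLETED:
        return ("respond", record.response_status, record.response_body)
    # FAILED: let retryable failures through, replay terminal ones.
    if record.retryable:
        return ("reprocess", None, None)
    return ("respond", record.response_status, record.response_body)


print(on_repeat(Record(State.PROCESSING)))
# ('respond', 409, 'operation in progress')
```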

### Mistake 4 — Idempotency at the wrong layer

Most teams put idempotency at the HTTP layer. The web framework intercepts requests, checks the header, decides whether to process. Clean. Easy to retrofit.

It also doesn't help you for the calls that don't come over HTTP. Kafka consumers re-processing messages after a rebalance. Scheduled jobs that run twice because of a deploy race. Internal service-to-service calls that retry on transient failure. All of those need idempotency too — and the HTTP-layer middleware doesn't reach them.

**Fix:** idempotency lives at the *business operation* layer, not the transport layer. Whatever your "charge" operation is — a method, a handler, a saga step — it accepts a key, checks the record, and decides whether to proceed.

Test for whether your layering is right: *"Can I invoke this operation from an internal job, a Kafka consumer, and an HTTP request, and get idempotent behavior in all three?"* If the answer requires three different middleware stacks, the layering is wrong.

### Mistake 5 — Concurrent requests with the same key

Two requests with the same key arrive at the server within milliseconds. Both check the database: no record exists. Both insert a new record. Both proceed to process the operation. Two charges.

This is the failure mode most code reviews miss because everyone is thinking about *sequential* retries, not concurrent ones. With network jitter, concurrent same-key arrivals are normal.

**Fix:** the insertion of the idempotency record must be *atomic* with the decision to proceed. Two patterns:

1. **Database unique constraint + insert-first.** Try to `INSERT` a new record with status `PROCESSING`. If the insert succeeds, you own the operation; proceed. If it fails on the unique constraint, the operation is already owned by another request; check the state and respond accordingly.
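2. **Distributed lock (Redis SETNX or equivalent).** Acquire a lock keyed on the scoped idempotency key `(tenant_id, key)` before doing anything. Release on completion, and set an expiry on the lock so a crashed worker cannot hold it forever.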

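```
// WRONG (check-then-act, racy)
record = db.find(tenantId, key)
if (record == null) {
  // a concurrent request can pass the same check before this insert runs
  db.insert(tenantId, key, PROCESSING)
  process()
}

// RIGHT (insert-with-unique-constraint)
try {
  db.insert(tenantId, key, PROCESSING) // unique constraint on (tenant_id, key) protects us
} catch (UniqueConstraintViolation) {
  existing = db.find(tenantId, key)
  return respond(existing) // handle based on existing.state
}
process() // only the request whose insert succeeded reaches this line
```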
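
To make the race concrete, here is a runnable sketch of the insert-first pattern against SQLite via Python's built-in `sqlite3`. It replays the dangerous interleaving deterministically: both requests run their existence check before either inserts. The helper names and sample values are hypothetical, and a single in-memory connection stands in for two concurrent database sessions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_records ("
    " tenant_id TEXT NOT NULL,"
    " idempotency_key TEXT NOT NULL,"
    " state TEXT NOT NULL,"
    " UNIQUE (tenant_id, idempotency_key))"
)


def find(tenant_id: str, key: str):
    """Return the stored row as a (state,) tuple, or None if absent."""
    return conn.execute(
        "SELECT state FROM idempotency_records"
        " WHERE tenant_id = ? AND idempotency_key = ?",
        (tenant_id, key),
    ).fetchone()


def try_claim(tenant_id: str, key: str) -> bool:
    """Insert-first: True means this caller owns the operation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_records VALUES (?, ?, 'PROCESSING')",
                (tenant_id, key),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Simulate the race: both requests check before either one inserts.
a_saw_nothing = find("merchant-a", "abc-123") is None
b_saw_nothing = find("merchant-a", "abc-123") is None

# Check-then-act would now process twice. Insert-first lets only one through.
a_owns = try_claim("merchant-a", "abc-123")
b_owns = try_claim("merchant-a", "abc-123")
b_sees_state = find("merchant-a", "abc-123")[0]

print(a_saw_nothing, b_saw_nothing, a_owns, b_owns, b_sees_state)
# True True True False PROCESSING
```

Both checks came back empty, which is exactly the situation in which check-then-act double-charges. The constraint, not the check, decides who proceeds.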

### What you should actually store

Minimum-viable idempotency record:

- `tenant_id` — scope
- `idempotency_key` — the key from the client
- `request_fingerprint` — a hash of the request body. Used to reject "same key, different request" with `422`.
- `state` — `RECEIVED | PROCESSING | COMPLETED | FAILED`
- `response_body` — the response we returned (only set in `COMPLETED`/`FAILED`)
- `response_status` — HTTP status
- `created_at`, `completed_at`, `expires_at`

Index: `UNIQUE (tenant_id, idempotency_key)`. Index on `expires_at` for cleanup jobs.
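
As a sketch, the record and its fingerprint could look like the following. The field names mirror the list above; the `fingerprint` and `status_for_repeat` helpers, and the choice of SHA-256 over canonicalized JSON, are assumptions rather than a prescribed scheme.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


def fingerprint(body: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class IdempotencyRecord:
    tenant_id: str
    idempotency_key: str
    request_fingerprint: str
    state: str  # RECEIVED | PROCESSING | COMPLETED | FAILED
    created_at: datetime
    expires_at: datetime
    response_body: Optional[str] = None  # only set in COMPLETED / FAILED
    response_status: Optional[int] = None
    completed_at: Optional[datetime] = None


def status_for_repeat(record: IdempotencyRecord, body: dict) -> Optional[int]:
    """Return 422 for "same key, different request", else None."""
    if record.request_fingerprint != fingerprint(body):
        return 422
    return None


original = {"amount_cents": 500, "currency": "USD"}
record = IdempotencyRecord(
    tenant_id="merchant-a",
    idempotency_key="abc-123",
    request_fingerprint=fingerprint(original),
    state="PROCESSING",
    created_at=datetime(2026, 1, 1),
    expires_at=datetime(2026, 1, 8),
)
print(status_for_repeat(record, {"currency": "USD", "amount_cents": 500}))  # None
print(status_for_repeat(record, {"currency": "USD", "amount_cents": 900}))  # 422
```

Hashing a canonical form matters because two clients can serialize the same logical request with different key order or whitespace.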

### Testing

Two test cases that catch most of the mistakes above:

1. **Same key, same body, two requests in parallel.** Outcome: one of them processes; the other either waits-and-returns or returns 409. Never both process.
2. **Same key, different body.** Outcome: the second is rejected with 422 (or your equivalent). Never silently overwritten.

Run these in CI as integration tests, not unit tests. The interesting failure modes only show up against a real database with real concurrency.
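
A sketch of a harness for the first test case, using only the Python standard library. `run_in_parallel` is the reusable part; `fake_charge` is a hypothetical in-process stand-in so the example runs on its own. In CI the operation would instead call the real service backed by the real database, as recommended above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_parallel(operation: Callable[[], int], n: int) -> List[int]:
    """Release n callers at the same instant and collect their status codes."""
    barrier = threading.Barrier(n)

    def worker() -> int:
        barrier.wait()  # line everyone up before firing
        return operation()

    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(worker) for _ in range(n)]
        return [f.result() for f in futures]


# Hypothetical stand-in for the system under test.
_lock = threading.Lock()
_owned = set()


def fake_charge() -> int:
    with _lock:  # plays the role of the database unique constraint
        if ("merchant-a", "abc-123") in _owned:
            return 409
        _owned.add(("merchant-a", "abc-123"))
    return 201


statuses = run_in_parallel(fake_charge, 8)
assert statuses.count(201) == 1, "exactly one request may process"
assert statuses.count(409) == 7, "every other request must be turned away"
```

The barrier is what makes the test meaningful: without it, thread start-up delay tends to serialize the calls and the race rarely fires.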

### Takeaways

1. Idempotency is scoped. Make `(tenant_id, key)` the unique constraint.
2. Match retention to the longest realistic retry window. 7 days is a reasonable starting point.
3. Store the state machine, not just the response. `PROCESSING` matters.
4. Implement at the business-operation layer, not just HTTP. Internal jobs and consumers need it too.
5. Use atomic insert with a unique constraint. Never check-then-act.
6. Hash the request body. Same key + different body = 422, not a new operation.

---

# Resume

The resume is available in two variants, each as HTML and PDF. Both cover the same skill set; the emphasis differs.

- **Engineering Manager version** — emphasizes team scope, hiring loops, performance calibration, on-call program design, blameless post-mortems, PCI DSS audit ownership, headcount planning, and cross-team partnerships. HTML: [https://fadymassoud.com/resume/em.html](/resume/em.html) · PDF: [https://fadymassoud.com/resume-em.pdf](/resume-em.pdf)
- **Staff / Lead / Principal Engineer version** — emphasizes architecture and distributed-systems work, scale numbers, technical leadership, design-doc culture, RFC authoring, and cross-team architectural influence. HTML: [https://fadymassoud.com/resume/ic.html](/resume/ic.html) · PDF: [https://fadymassoud.com/resume-ic.pdf](/resume-ic.pdf)

---

# Contact

- **Email:** fady.massoud@live.com
- **LinkedIn:** [https://www.linkedin.com/in/fadymassoud](https://www.linkedin.com/in/fadymassoud)
- **Indeed:** [https://profile.indeed.com/p/Fady_-Massoud](https://profile.indeed.com/p/Fady_-Massoud)
- **Website:** [https://fadymassoud.com](https://fadymassoud.com)

Response time: ~24h on weekdays. Timezone: Pacific Time.