
Idempotency Keys in Payments — Five Mistakes I've Watched Get Shipped

The "obviously correct" idempotency implementations that ship duplicate charges anyway. Composite keys, retry windows, what to actually store, and which layer is the right one.

Every payments engineer has read the same idempotency-key blog post. The same Stripe API doc. The same talk. The pattern looks simple: client sends a key, server stores the result against that key, repeat requests return the stored result. What could go wrong?

A lot. Here are the five mistakes I've watched ship to production over the years, including one I shipped myself.

The textbook version (so we agree on the baseline)

An idempotent endpoint accepts the same request multiple times and produces the same effect as a single request. In payments, the effect is "charge $42.50 to this card." If the client retries because the network dropped, idempotency means the second attempt doesn't double-charge.

The mechanism: client sends a unique key on each logical request. Server stores the key, the request, and the result. Repeat key → return stored result. New key → process normally.
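As a baseline, the mechanism can be sketched in a few lines of Python. This is a deliberately naive, in-memory version: the `_results` dict, `handle`, and `charge` are hypothetical names for illustration, and it has none of the safeguards the rest of this post argues for.

```python
from typing import Callable, Dict

# Hypothetical in-memory store: idempotency key -> stored result.
_results: Dict[str, dict] = {}


def handle(key: str, request: dict, process: Callable[[dict], dict]) -> dict:
    """Textbook idempotency: a repeat key returns the stored result."""
    if key in _results:            # repeat key: replay the stored result
        return _results[key]
    result = process(request)      # new key: process normally
    _results[key] = result
    return result


charges = []  # records every time the "card" is actually charged


def charge(request: dict) -> dict:
    charges.append(request["amount"])
    return {"status": "charged", "amount": request["amount"]}


first = handle("key-1", {"amount": "42.50"}, charge)
retry = handle("key-1", {"amount": "42.50"}, charge)  # replayed, not re-charged
```

The retry gets the same response as the first call and `charge` runs once. Everything below is about the cases where this sketch breaks.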

Simple. Then production gets involved.

Mistake 1: Composite keys missing the tenant

You're a multi-tenant payments platform. Two merchants both happen to choose the same UUID for an idempotency key — astronomically unlikely with random UUIDs, but customers do clever things ("I'll use the order ID as the key" — and another customer uses the same order-ID format).

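Merchant A sends key abc-123. Merchant B sends the same key. Your server returns Merchant A's response to Merchant B. Each merchant sees a different bug: Merchant B gets a success response for a charge that was never made, and Merchant A's response data has leaked to another tenant.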

Fix: the idempotency key in your data layer is always (tenant_id, key), not just key. Always. Even if you "control the key generation," you don't, because tomorrow you'll integrate a partner who doesn't.

-- WRONG
CREATE UNIQUE INDEX idx_idemp ON idempotency_records (idempotency_key);

-- RIGHT
CREATE UNIQUE INDEX idx_idemp ON idempotency_records (tenant_id, idempotency_key);

Idempotency is always scoped. Decide what the scope is (tenant, account, API key, route) and make the scope part of the index — not a runtime check.
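To see the scoped index doing the work, here is a small sketch using Python's built-in sqlite3 module. The table layout and the `record` helper are assumptions for illustration, not a production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency_records (
        tenant_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        response_body   TEXT
    )
""")
# The scope is part of the index, not a runtime check.
conn.execute(
    "CREATE UNIQUE INDEX idx_idemp "
    "ON idempotency_records (tenant_id, idempotency_key)"
)


def record(tenant_id: str, key: str, body: str) -> bool:
    """Insert a record; return False if this tenant already used the key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_records VALUES (?, ?, ?)",
                (tenant_id, key, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False


a_first = record("merchant_a", "abc-123", "A's receipt")  # accepted
b_first = record("merchant_b", "abc-123", "B's receipt")  # accepted: other tenant
a_retry = record("merchant_a", "abc-123", "ignored")      # rejected: same scope
```

The same key is accepted once per tenant and rejected on reuse within a tenant, with no application-level comparison of tenant IDs.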

Mistake 2: Retry windows too short (or absent)

How long do you store the idempotency record? Many implementations answer with a default — 24 hours, 7 days, "until we delete it."

The right answer depends on the client. A point-of-sale terminal on a 2G modem in Yemen may still be retrying 4 hours later. A web client typically gives up in seconds. A batch job might retry the next morning. If your retention is shorter than the longest realistic retry window from any client, the late retry arrives after the record is gone, you process it as a new request, and the customer is charged twice.

Fix: match retention to the longest realistic retry window across all client types. For card-present payments, this means at least 24 hours and ideally 7 days. Run the numbers from your actual retry logs.

Bonus: don't conflate "retention" with "this key is reusable." Even 7 days later, the same key with a different request body must be rejected, not allowed to "create a new charge under an old key." Otherwise you've invented a new vulnerability.
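One way to turn "run the numbers" into code is to derive retention from the longest retry delay you observe per client type. The figures, the safety factor, and the function names below are made-up assumptions; substitute values from your own retry logs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical longest observed retry delay per client type.
# These are placeholders, not measurements.
LONGEST_RETRY = {
    "web": timedelta(seconds=30),
    "pos_terminal": timedelta(hours=4),
    "batch": timedelta(hours=18),
}

MIN_RETENTION = timedelta(hours=24)  # the 24-hour floor, applied to all clients here
SAFETY_FACTOR = 2                    # assumed headroom over the worst observed delay


def retention() -> timedelta:
    """Retention covers the slowest client, with headroom, and never drops below the floor."""
    longest = max(LONGEST_RETRY.values())
    return max(MIN_RETENTION, longest * SAFETY_FACTOR)


def expires_at(created_at: datetime) -> datetime:
    return created_at + retention()


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiry = expires_at(created)  # 36 hours later with the placeholder numbers above
```

With these placeholders the batch client dominates (18 hours doubled to 36), which is the point: the slowest client sets the retention, not the most common one.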

Mistake 3: Storing the response, not the state machine

The naive implementation stores the response body. Client retries → server returns the stored response body. Done.

What if the original request hadn't finished yet when the retry came in? The first request is mid-flight to the card network. The retry hits the server. There's no stored response yet. The server begins processing the retry as a new request. Now you have two in-flight authorizations against the network for the same logical operation.

Fix: store the state, not the response. The idempotency record has a state field: RECEIVED | PROCESSING | COMPLETED | FAILED.

  • Repeat request, state is PROCESSING → either block-and-wait, or return 409 Conflict with "operation in progress." Both are correct; pick one.
  • Repeat request, state is COMPLETED → return the stored response.
  • Repeat request, state is FAILED → depends on whether the failure was retryable. Often you let the retry through.

The state machine is the real idempotency contract. Stored-response-only implementations are a buggy subset.
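The rules above can be sketched as a single decision function. This version picks the 409 option for in-flight requests, treats RECEIVED the same as PROCESSING, and uses a hypothetical `retryable` flag on the record; all three are assumptions, not the only correct choices.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    RECEIVED = "RECEIVED"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Record:
    state: State
    response_status: Optional[int] = None
    response_body: Optional[str] = None
    retryable: bool = False  # hypothetical flag, set when the failure is recorded


def on_repeat(record: Record) -> tuple:
    """Decide what a repeat request with a known key gets back."""
    if record.state in (State.RECEIVED, State.PROCESSING):
        # Assumption: reject rather than block-and-wait.
        return ("reject", 409, "operation in progress")
    if record.state is State.COMPLETED:
        return ("replay", record.response_status, record.response_body)
    # FAILED: let the retry through only if the failure was retryable.
    if record.retryable:
        return ("reprocess", None, None)
    return ("replay", record.response_status, record.response_body)
```

The caller acts on the first element of the tuple: `reject` and `replay` return immediately, while `reprocess` runs the operation again under the same key.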

Mistake 4: Idempotency at the wrong layer

Most teams put idempotency at the HTTP layer. The web framework intercepts requests, checks the header, decides whether to process. Clean. Easy to retrofit.

It also doesn't help you for the calls that don't come over HTTP. Kafka consumers re-processing messages after a rebalance. Scheduled jobs that run twice because of a deploy race. Internal service-to-service calls that retry on transient failure. All of those need idempotency too — and the HTTP-layer middleware doesn't reach them.

Fix: idempotency lives at the business operation layer, not the transport layer. Whatever your "charge" operation is — a method, a handler, a saga step — it accepts a key, checks the record, and decides whether to proceed. The HTTP layer can also enforce idempotency, but it's a convenience, not the source of truth.

The test for whether your layering is right: "Can I invoke this operation from an internal job, a Kafka consumer, and an HTTP request, and get idempotent behavior in all three?" If the answer requires three different middleware stacks, the layering is wrong.
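A minimal sketch of that test passing: one hypothetical `ChargeService` owns the idempotency check, and three entry points are thin adapters over it. The in-memory dict and the check-then-act lookup are simplifications to keep the layering visible; this is single-threaded only and is exactly what Mistake 5 warns against under concurrency.

```python
from typing import Dict, Tuple


class ChargeService:
    """Hypothetical business operation that owns the idempotency check."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], dict] = {}
        self.network_calls = 0  # counts real charges sent to the card network

    def charge(self, tenant_id: str, key: str, amount_cents: int) -> dict:
        scoped = (tenant_id, key)
        if scoped in self._records:      # repeat: replay the stored result
            return self._records[scoped]
        self.network_calls += 1          # stand-in for the card network call
        result = {"status": "charged", "amount_cents": amount_cents}
        self._records[scoped] = result
        return result


service = ChargeService()


# Three transports, one idempotent operation. Each adapter only maps its
# input onto (tenant_id, key, amount); none of them checks keys itself.
def http_handler(headers: dict, body: dict) -> dict:
    return service.charge(body["tenant_id"], headers["Idempotency-Key"], body["amount_cents"])


def kafka_consumer(message: dict) -> dict:
    return service.charge(message["tenant_id"], message["event_id"], message["amount_cents"])


def nightly_job(tenant_id: str, invoice_id: str, amount_cents: int) -> dict:
    return service.charge(tenant_id, f"invoice:{invoice_id}", amount_cents)
```

Calling any adapter twice with the same input produces one charge. Where the key comes from (a header, an event ID, an invoice ID) is the adapter's only job.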

Mistake 5: Concurrent requests with the same key

Two requests with the same key arrive at the server within milliseconds. Both check the database: no record exists. Both insert a new record. Both proceed to process the operation. Two charges.

This is the failure mode most code reviews miss because everyone is thinking about sequential retries, not concurrent ones. With network jitter, concurrent same-key arrivals are normal.

Fix: the insertion of the idempotency record must be atomic with the decision to proceed. Two patterns:

  1. Database unique constraint + insert-first. Try to INSERT a new record with status PROCESSING. If the insert succeeds, you own the operation; proceed. If it fails on the unique constraint, the operation is already owned by another request; check the state and respond accordingly.
  2. Distributed lock (Redis SETNX or equivalent). Acquire a lock keyed on the idempotency key before doing anything. Release on completion. Slower (extra round-trip) but works across systems that don't share a database.

The "check, then insert" pattern is wrong. There is always a window. Make the database tell you whether you own the key — don't ask, then act.

// WRONG (check-then-act, racy)
record = db.find(key)
if (record == null) {
  db.insert(key, PROCESSING)
  process()
}

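// RIGHT (insert-with-unique-constraint)
try {
  db.insert(key, PROCESSING) // unique constraint decides who owns the key
} catch (UniqueConstraintViolation) {
  existing = db.find(key)
  // Handle based on existing.state, then return without processing
  return
}
// Only the request that won the insert reaches this point
process()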
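For a runnable version of the insert-first pattern, here is a sketch against SQLite using Python's built-in sqlite3 module. The table, `try_claim`, and `charge` are hypothetical, and the card network call is replaced by a string. Any database with unique constraints should behave the same way; only the exception type will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency_records (
        tenant_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        state           TEXT NOT NULL,
        response_body   TEXT,
        UNIQUE (tenant_id, idempotency_key)
    )
""")


def try_claim(tenant_id: str, key: str) -> bool:
    """Insert first. The unique constraint decides who owns the key."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO idempotency_records (tenant_id, idempotency_key, state) "
                "VALUES (?, ?, 'PROCESSING')",
                (tenant_id, key),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def charge(tenant_id: str, key: str, amount: str) -> tuple:
    if not try_claim(tenant_id, key):
        # Someone else owns the key: respond from the stored state.
        state, body = conn.execute(
            "SELECT state, response_body FROM idempotency_records "
            "WHERE tenant_id = ? AND idempotency_key = ?",
            (tenant_id, key),
        ).fetchone()
        if state == "PROCESSING":
            return (409, "operation in progress")
        return (200, body)
    body = f"charged {amount}"  # stand-in for the real card network call
    with conn:
        conn.execute(
            "UPDATE idempotency_records SET state = 'COMPLETED', response_body = ? "
            "WHERE tenant_id = ? AND idempotency_key = ?",
            (body, tenant_id, key),
        )
    return (200, body)
```

There is no read before the insert, so there is no window between "ask" and "act". The lookup only happens on the losing path, where it cannot cause a second charge.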

What you should actually store

A minimum-viable idempotency record:

  • tenant_id — scope
  • idempotency_key — the key from the client
  • request_fingerprint — a hash of the request body. Used to reject "same key, different request" requests with 422.
  • state — RECEIVED | PROCESSING | COMPLETED | FAILED
  • response_body — the response we returned (only set in COMPLETED/FAILED)
  • response_status — HTTP status
  • created_at
  • completed_at
  • expires_at — for TTL/cleanup

Index: UNIQUE (tenant_id, idempotency_key). Index on expires_at for cleanup jobs.
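The record above, sketched as a SQLite schema plus a fingerprint helper. The column types, the CHECK constraint, the index name, and the choice of SHA-256 over canonical JSON are assumptions; what matters is that two bodies with the same content hash the same regardless of field order.

```python
import hashlib
import json
import sqlite3

SCHEMA = """
CREATE TABLE idempotency_records (
    tenant_id           TEXT NOT NULL,
    idempotency_key     TEXT NOT NULL,
    request_fingerprint TEXT NOT NULL,
    state               TEXT NOT NULL
        CHECK (state IN ('RECEIVED', 'PROCESSING', 'COMPLETED', 'FAILED')),
    response_body       TEXT,
    response_status     INTEGER,
    created_at          TEXT NOT NULL,
    completed_at        TEXT,
    expires_at          TEXT NOT NULL,
    UNIQUE (tenant_id, idempotency_key)
);
CREATE INDEX idx_idemp_expires ON idempotency_records (expires_at);
"""


def request_fingerprint(body: dict) -> str:
    """Hash a canonical form of the body so key order does not change the hash."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

If clients send bodies that are not JSON, hashing the raw bytes is the simpler option, at the cost of treating cosmetically different encodings of the same request as different requests.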

Testing

Two test cases that catch most of the mistakes above:

  1. Same key, same body, two requests in parallel. Outcome: one of them processes; the other either waits-and-returns or returns 409. Never both process.
  2. Same key, different body. Outcome: the second is rejected with 422 (or your equivalent). Never silently overwritten.

Run these in CI as integration tests, not unit tests. The interesting failure modes only show up against a real database with real concurrency.

Takeaways

  1. Idempotency is scoped. Make (tenant_id, key) the unique constraint.
  2. Match retention to the longest realistic retry window. 7 days is a reasonable starting point.
  3. Store the state machine, not just the response. PROCESSING matters.
  4. Implement at the business-operation layer, not just HTTP. Internal jobs and consumers need it too.
  5. Use atomic insert with a unique constraint. Never check-then-act.
  6. Hash the request body. Same key + different body = 422, not a new operation.

Fady Massoud — Engineering Manager (Hands-On) at Kort Payments, formerly Lead Software Engineer at Paysafe. 18+ years building FinTech payments platforms.