From 500 to 50,000 Concurrent Users: Architecture Decisions That Actually Scale

Black Friday at 12,000 concurrent users, on a platform we had migrated from Magento six months earlier, is a different kind of stress test. Here's the architecture that held, and the decisions that got us there.
Start with the database, not the app servers
Every scaling conversation jumps to horizontal app server scaling, but the database is almost always the first bottleneck. Before Black Friday, we ran EXPLAIN ANALYZE on every query over 50ms. We found three queries doing sequential scans on tables with 2M+ rows. Adding the right composite indexes cut those queries from 800ms to 12ms. No new servers, no new cost.
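To make the audit concrete, here is a minimal sketch of how sequential scans could be flagged automatically. It assumes the plans were captured with EXPLAIN (ANALYZE, FORMAT JSON); the function name, the row threshold, and the sample plan are illustrative assumptions, not our actual tooling or data.

```python
from typing import Any, Dict, List


def find_seq_scans(plan_json: List[Dict[str, Any]], min_rows: int = 100_000) -> List[Dict[str, Any]]:
    """Return Seq Scan nodes that touched at least `min_rows` rows.

    `plan_json` is the parsed output of EXPLAIN (ANALYZE, FORMAT JSON),
    which is a list holding one object with a top-level "Plan" key.
    """
    found: List[Dict[str, Any]] = []

    def walk(node: Dict[str, Any]) -> None:
        if node.get("Node Type") == "Seq Scan":
            # Rows returned plus rows discarded by the filter approximates
            # how many rows the scan had to read.
            rows_read = node.get("Actual Rows", 0) + node.get("Rows Removed by Filter", 0)
            if rows_read >= min_rows:
                found.append({
                    "table": node.get("Relation Name"),
                    "rows_read": rows_read,
                    "ms": node.get("Actual Total Time"),
                })
        for child in node.get("Plans", []):
            walk(child)

    for entry in plan_json:
        walk(entry["Plan"])
    return found


# Hypothetical plan, shaped like PostgreSQL's JSON output (values invented).
SAMPLE_PLAN = [{
    "Plan": {
        "Node Type": "Sort",
        "Actual Total Time": 800.0,
        "Plans": [{
            "Node Type": "Seq Scan",
            "Relation Name": "orders",
            "Actual Rows": 42,
            "Rows Removed by Filter": 2_000_000,
            "Actual Total Time": 790.0,
        }],
    }
}]

slow_scans = find_seq_scans(SAMPLE_PLAN)
```

A scan that reads two million rows to return a few dozen is the classic signature of a missing index on the filtered columns.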
Connection pooling is not optional
At 12,000 concurrent users, your app servers will open thousands of database connections simultaneously. PostgreSQL handles this poorly above ~200 connections, because each connection is served by its own backend process. We ran PgBouncer in transaction pooling mode between our app tier and Postgres. It multiplexed 8,000 app-side connections down to 150 database connections. This single change quadrupled our throughput.
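The multiplexing idea can be illustrated with a toy simulation. This is not PgBouncer and says nothing about real throughput; it only shows, under the assumption that each client holds a server connection just for the length of one transaction, how 8,000 clients can share 150 slots. All names here are illustrative.

```python
import asyncio
from typing import Tuple


async def simulate_pool(clients: int, pool_size: int) -> Tuple[int, int]:
    """Return (peak server connections in use, completed transactions)."""
    server_slots = asyncio.Semaphore(pool_size)  # stand-in for real DB connections
    in_use = 0
    peak = 0
    completed = 0

    async def client_transaction() -> None:
        nonlocal in_use, peak, completed
        # A client borrows a server connection only while its transaction runs.
        async with server_slots:
            in_use += 1
            peak = max(peak, in_use)
            await asyncio.sleep(0)  # stand-in for the query round trip
            in_use -= 1
        completed += 1

    await asyncio.gather(*(client_transaction() for _ in range(clients)))
    return peak, completed


def run_simulation(clients: int = 8000, pool_size: int = 150) -> Tuple[int, int]:
    return asyncio.run(simulate_pool(clients, pool_size))
```

Calling run_simulation() finishes all 8,000 transactions while the number of server slots in use never exceeds the pool size. The trade-off with transaction pooling is that session-level state, such as session-scoped SET values or session advisory locks, cannot be relied on across transactions.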
Redis: the right data and nothing else
Redis is not a general caching layer. Put the wrong data in it and you'll have a distributed consistency nightmare. We cached three things: (1) product page data with a 5-minute TTL, (2) session data, (3) rate limit counters. Product inventory counts were NOT cached: stale stock levels kill conversions and cause overselling. We read inventory from the database every time, with those queries backed by the right indexes.
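A minimal sketch of that split, assuming a cache-aside pattern. The key format, function names, and the in-memory stand-in are assumptions for illustration; the stand-in mimics only the get and setex methods of a redis-py client so the example runs without a Redis server.

```python
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple

PRODUCT_TTL_SECONDS = 300  # the 5-minute TTL for product page data


class InMemoryCache:
    """Stand-in exposing get/setex like a redis-py client (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_product_page(cache: Any, product_id: int, load_from_db: Callable[[int], dict]) -> dict:
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    key = f"product:{product_id}"  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    page = load_from_db(product_id)
    cache.setex(key, PRODUCT_TTL_SECONDS, json.dumps(page))
    return page


def get_inventory(product_id: int, load_stock_from_db: Callable[[int], int]) -> int:
    """Inventory deliberately bypasses the cache and always hits the database."""
    return load_stock_from_db(product_id)
```

Keeping get_inventory free of any cache parameter makes the rule hard to break by accident: there is simply no cache to put stock levels into.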
Serverless for spiky, stateful for baseline
Our checkout pipeline was stateful and latency-sensitive, so we kept it on ECS with pre-warmed instances. Our email notification system, PDF receipt generation, and inventory update events were spiky and async, so we moved those to Lambda. This cut our baseline EC2 bill by 35% while giving us on-demand burst capacity for notifications, bounded only by the account's Lambda concurrency limit.
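For a sense of what the async side can look like, here is a sketch of a Lambda handler for notification events. It assumes the events arrive through an SQS queue, which is one common wiring and not necessarily the one we used; send_notification is a hypothetical placeholder for the real email call.

```python
import json
from typing import Any, Dict, List


def send_notification(message: Dict[str, Any]) -> None:
    """Hypothetical placeholder for the real email/receipt delivery call."""
    if "order_id" not in message:
        raise ValueError("message is missing order_id")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, List[Dict[str, str]]]:
    """Process a batch of SQS records and report only the failed ones."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            send_notification(json.loads(record["body"]))
        except Exception:
            # Reporting the message ID lets SQS retry this record alone
            # instead of redelivering the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The partial-failure response only takes effect when ReportBatchItemFailures is enabled on the event source mapping; without it, any raised exception causes the entire batch to be retried.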
Load test before you need to
We ran k6 load tests simulating 15,000 concurrent users three weeks before Black Friday. They showed that our product image CDN was not sending the right cache headers, so browsers were re-fetching images on every page load. One CloudFront configuration change fixed it. Under actual Black Friday load, our CDN served 94% of traffic from cache.
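A check like the following could catch this class of problem earlier. It is a sketch under assumptions: the one-day threshold, the function names, and the sample paths are invented for illustration, and it looks only at the explicit Cache-Control max-age, ignoring heuristic caching and shared-cache directives such as s-maxage.

```python
from typing import Dict, List, Optional


def explicit_max_age(cache_control: Optional[str]) -> int:
    """Return the explicit browser freshness lifetime in seconds, or 0 if none."""
    if not cache_control:
        return 0
    directives = [d.strip().lower() for d in cache_control.split(",")]
    # no-store forbids caching; no-cache forces revalidation on every use.
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        if directive.startswith("max-age="):
            try:
                return max(int(directive.split("=", 1)[1]), 0)
            except ValueError:
                return 0
    return 0


def find_poorly_cached(responses: Dict[str, Optional[str]], min_seconds: int = 86_400) -> List[str]:
    """Return paths whose Cache-Control gives browsers less than `min_seconds`."""
    return sorted(
        path for path, header in responses.items()
        if explicit_max_age(header) < min_seconds
    )


# Hypothetical image paths mapped to the Cache-Control header each returned.
SAMPLE_RESPONSES = {
    "/img/shoe.jpg": None,
    "/img/hat.jpg": "no-cache",
    "/img/bag.jpg": "public, max-age=31536000, immutable",
    "/img/belt.jpg": "public, max-age=60",
}

problem_paths = find_poorly_cached(SAMPLE_RESPONSES)
```

Feeding it the headers collected during a load test run turns "images feel slow" into a list of specific paths to fix.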