
DevOps Best Practices: What Actually Moves the Needle

June 9, 2026 · 12 min read


Most DevOps content is a rehashed checklist. Use CI/CD. Write tests. Monitor things. Helpful in the same way that "eat less, move more" is helpful for weight loss — technically correct and completely insufficient. This is not that article.

I've been building and running production infrastructure for 14 years. I've founded DietGhar (healthtech, live) and NyayX (legaltech, live), and I've consulted for teams running everything from two-person startups to platforms handling 2M-product catalogues under Black Friday load. What follows are the DevOps best practices that I've seen move real needles, with the specific numbers to prove it.

CI/CD is only as good as your slowest feedback loop

A pipeline that takes 40 minutes to run is a pipeline nobody trusts. Developers will start merging without waiting for it. For DietGhar I use GitHub Actions with a staged pipeline: lint and type-check in under 2 minutes, unit tests in parallel in under 5, integration tests against a real Postgres and Redis container in under 10. Total: 12 minutes from push to green. That speed is intentional. If I'd let it creep to 30 minutes, the pipeline would have become ceremonial.
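One way those stage budgets add up to 12 minutes is if the unit and integration stages both start as soon as lint passes, so wall-clock time is the longest path through the stage graph and not the sum of the stages. The sketch below assumes that layout. The stage names and the dependency graph are illustrative, not taken from the DietGhar pipeline.

```python
from functools import lru_cache

# Hypothetical stage graph: budget in minutes, plus the stages each one waits on.
STAGES = {
    "lint_typecheck": {"minutes": 2, "needs": []},
    "unit_tests": {"minutes": 5, "needs": ["lint_typecheck"]},
    "integration_tests": {"minutes": 10, "needs": ["lint_typecheck"]},
}


def pipeline_minutes(stages: dict) -> int:
    """Wall-clock time of the pipeline: the longest dependency chain."""

    @lru_cache(maxsize=None)
    def finish_time(name: str) -> int:
        stage = stages[name]
        # A stage starts when the slowest of its dependencies has finished.
        start = max((finish_time(dep) for dep in stage["needs"]), default=0)
        return start + stage["minutes"]

    return max(finish_time(name) for name in stages)


print(pipeline_minutes(STAGES))  # 12: lint (2) + integration (10); unit tests finish at 7
```

If the same three stages ran one after another, the total would be 17 minutes, which is why the shape of the graph matters as much as the speed of any single stage.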

The other CI/CD failure mode is treating deployment as a one-way door. Every pipeline I build has an automated rollback trigger: if the health check endpoint returns non-200 for 60 seconds post-deploy, the previous task definition is reactivated on ECS (or PM2 restores the prior release on single-server setups). For DietGhar's zero-downtime PM2 deploys, this means a bad release is live for about a minute at most: the 60-second health check window plus the few seconds the rollback itself takes. Most users never see it.
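A minimal sketch of the decision logic behind such a trigger, assuming a probe function that returns an HTTP status code. The 60-second window comes from the text above. The five-minute watch period, the polling interval, and the function name are assumptions made for illustration. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def should_roll_back(
    probe: Callable[[], int],
    unhealthy_window_s: float = 60.0,
    watch_period_s: float = 300.0,   # assumption: how long to watch a new release
    interval_s: float = 5.0,         # assumption: time between health checks
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the health check stays non-200 for a full window,
    False if the watch period ends without that happening."""
    started = clock()
    unhealthy_since: Optional[float] = None
    while clock() - started < watch_period_s:
        now = clock()
        if probe() == 200:
            unhealthy_since = None      # any healthy response resets the window
        elif unhealthy_since is None:
            unhealthy_since = now       # first failure starts the window
        elif now - unhealthy_since >= unhealthy_window_s:
            return True                 # caller reactivates the previous release
        sleep(interval_s)
    return False
```

In a deploy script, the probe would wrap an HTTP GET against the health endpoint, and a True result would kick off whatever reactivates the previous release on the platform in use.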

Infrastructure as code is not optional — it is your documentation

When I inherited a client's AWS account to do a cloud cost review, I found 23 EC2 instances, 11 RDS instances, and 6 load balancers. No Terraform, no CDK, no CloudFormation. Nobody knew what half of them did. Three were serving no traffic at all. One had been running since 2019.

That engagement paid for itself in the first sprint: I wrote Terraform to model what was actually in use, deleted the orphaned resources, rightsized the rest against CloudWatch metrics, and moved batch workloads to spot instances. The bill dropped 40% before I touched a single line of application code. Infrastructure as code would have prevented every one of those orphaned resources: anything that isn't in the repo stands out immediately, and drift from a state file shows up in the next terraform plan instead of sitting unnoticed in the console.

I use Terraform for everything that will live longer than a day. GitHub Actions for ephemeral environments. The rule is simple: if I can't recreate the entire environment from the repo in under 30 minutes, the infrastructure is undocumented.

Observability-first means before you ship, not after something breaks

With NyayX I instrumented the application before I wrote the first feature. Error rates, p95 latency, conversion funnel drop-off — all dashboarded before a single real user hit the system. This isn't paranoia. It's the difference between knowing you have a problem and guessing you might.

The payoff came in week two of beta. Onboarding had a 30% drop-off rate at step three. Without instrumentation I would have assumed users weren't interested. With it, I could see the drop-off happened on a specific form where validation errors weren't surfacing to the user. A two-line fix. That is the return on observability investment.

For the 2M-product e-commerce platform that faced Black Friday at 12,000 concurrent users: three weeks before the event, we ran EXPLAIN ANALYZE on every query exceeding 50ms. Found three queries doing full sequential scans on large tables. Added composite indexes. Query times dropped from 800ms to 12ms. No new servers. No code changes. Just observability data leading to the right fix before it mattered.
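The triage step can be sketched as follows, assuming the candidates come from the pg_stat_statements extension (PostgreSQL 13+ column names) and that the SQL is run through a driver that accepts psycopg-style named parameters. The sample rows and query texts are invented for illustration. Ranking by total time spent (mean time multiplied by call count) is one reasonable ordering, not necessarily the one used on that project.

```python
from typing import NamedTuple

# Assumes pg_stat_statements is enabled; %(threshold_ms)s is psycopg-style.
SLOW_QUERY_SQL = """
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > %(threshold_ms)s
ORDER BY mean_exec_time * calls DESC
"""


class QueryStat(NamedTuple):
    query: str
    calls: int
    mean_ms: float


def triage(stats: list[QueryStat], threshold_ms: float = 50.0) -> list[QueryStat]:
    """Keep queries slower than the threshold, worst total time first."""
    slow = [s for s in stats if s.mean_ms > threshold_ms]
    return sorted(slow, key=lambda s: s.mean_ms * s.calls, reverse=True)


# Hypothetical sample rows.
sample = [
    QueryStat("SELECT ... FROM products WHERE ...", calls=90_000, mean_ms=800.0),
    QueryStat("SELECT ... FROM carts WHERE ...", calls=500_000, mean_ms=12.0),
    QueryStat("SELECT ... FROM orders WHERE ...", calls=2_000, mean_ms=120.0),
]
for stat in triage(sample):
    print(f"{stat.mean_ms:>6.0f} ms x {stat.calls:>6} calls  {stat.query}")
```

Each query that survives the filter is then a candidate for EXPLAIN ANALYZE with representative parameters, which is where a sequential scan on a large table becomes visible.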

Database connections are infrastructure, not an afterthought

At 12,000 concurrent users, your application will attempt thousands of simultaneous database connections. PostgreSQL runs one backend process per connection and ships with a default max_connections of 100; in my experience it degrades sharply above roughly 200. The solution that saved that Black Friday deployment was PgBouncer in transaction pooling mode: it multiplexed 8,000 application-side connections down to 150 actual database connections, delivering a 4x throughput increase with no application changes.
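Why so few server connections can serve so many clients follows from Little's law: under transaction pooling a server connection is held only while a transaction runs, so the number busy at once is the transaction rate multiplied by the average transaction time. The sketch below works that arithmetic through. The load figures and the 1.5x headroom factor are hypothetical, chosen only to land near the pool size mentioned above.

```python
import math


def server_connections_needed(
    tx_per_second: float, avg_tx_ms: float, headroom: float = 1.5
) -> int:
    """Little's law: connections busy at once = arrival rate x hold time,
    padded by a headroom factor for bursts."""
    busy_at_once = tx_per_second * avg_tx_ms / 1000
    return math.ceil(busy_at_once * headroom)


# Hypothetical load: 5,000 transactions per second at 20 ms each.
print(server_connections_needed(5000, 20))   # 150
print(round(8000 / 150, 1))                  # 53.3 client connections per server connection
```

The same arithmetic explains the caveat: long-running transactions raise the hold time and eat the pool, so session-level features and slow queries need attention before relying on transaction pooling.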

This is a DevOps concern, not a developer concern. By the time a developer notices connection exhaustion, the incident is already in progress. The right moment to configure connection pooling is during infrastructure provisioning, alongside your database deployment. It should be in your Terraform module, not discovered during an outage.

Cloud cost is a product decision

Teams that treat cloud cost as a finance problem instead of an engineering discipline consistently overspend. Every architecture decision has a cost consequence. Synchronous fan-out to 10 services instead of an event queue? You're paying for the latency and the over-provisioning required to absorb spikes. Polling an API every minute instead of using webhooks? You're paying for compute that generates mostly empty responses.

For the Amazon SP-API platform I built, the inventory sync originally polled the Catalog Items API for 50,000 SKUs on a 15-minute schedule. This burned through rate limit quota and required expensive Lambda concurrency to handle. The right architecture was to subscribe to ITEM_INVENTORY_UPDATE events via the Notifications API, pipe them into SQS, and process changes as they arrived. Sync latency dropped from 24 hours to under 15 minutes. API call volume dropped by 90%. The compute cost dropped in proportion.
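On the consuming side, a standard SQS queue can deliver messages more than once and out of order, so a batch is usually collapsed to the newest event per SKU before anything is written. A minimal sketch of that step follows. The "Body" key matches what SQS returns for a received message, but the payload fields (sku, event_time, quantity) are a hypothetical shape, not the actual SP-API notification schema.

```python
import json


def latest_per_sku(messages: list[dict]) -> dict[str, dict]:
    """Collapse a batch of queue messages to the newest event for each SKU.

    Assumes event_time is an ISO 8601 UTC string in one consistent format,
    so plain string comparison orders events correctly.
    """
    latest: dict[str, dict] = {}
    for message in messages:
        event = json.loads(message["Body"])
        current = latest.get(event["sku"])
        if current is None or event["event_time"] > current["event_time"]:
            latest[event["sku"]] = event
    return latest


# Hypothetical batch: one duplicate and one out-of-order delivery.
batch = [
    {"Body": json.dumps({"sku": "A1", "event_time": "2026-06-01T10:05:00Z", "quantity": 7})},
    {"Body": json.dumps({"sku": "A1", "event_time": "2026-06-01T10:01:00Z", "quantity": 9})},
    {"Body": json.dumps({"sku": "B2", "event_time": "2026-06-01T10:02:00Z", "quantity": 3})},
    {"Body": json.dumps({"sku": "B2", "event_time": "2026-06-01T10:02:00Z", "quantity": 3})},
]
print({sku: event["quantity"] for sku, event in latest_per_sku(batch).items()})
# {'A1': 7, 'B2': 3}
```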

Review your cloud bill monthly with the same discipline as your revenue. Flag any service whose cost is growing faster than your user base for architectural review. That growth rate discrepancy is almost always a design problem, not a scaling success.
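That monthly check is simple enough to automate. A minimal sketch, assuming per-service costs for two consecutive months and a user count for each; the service names and figures are made up.

```python
def flag_for_review(
    costs_prev: dict[str, float],
    costs_now: dict[str, float],
    users_prev: int,
    users_now: int,
) -> list[str]:
    """Return services whose cost grew faster than the user base did."""
    user_growth = users_now / users_prev - 1
    flagged = []
    for service, prev in costs_prev.items():
        now = costs_now.get(service, 0.0)
        if prev > 0 and now / prev - 1 > user_growth:
            flagged.append(service)
    return sorted(flagged)


# Hypothetical month: users up 10%, Lambda spend up 30%.
print(
    flag_for_review(
        costs_prev={"lambda": 400.0, "rds": 900.0, "s3": 100.0},
        costs_now={"lambda": 520.0, "rds": 950.0, "s3": 105.0},
        users_prev=10_000,
        users_now=11_000,
    )
)  # ['lambda']
```

Services that appear for the first time have no previous cost to compare against, so this sketch skips them; a real review would list new services separately.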

Environment parity prevents the class of bugs that only appear in production

"It works on my machine" is a symptom of environment divergence. For NyayX, every environment — local development, staging, production — runs the same Docker image built from the same Dockerfile. Local uses Docker Compose to wire up PostgreSQL and Redis. Staging and production run on AWS ECS with the same task definition, parameterised by environment-specific secrets. The application code does not know which environment it is in beyond what the environment variables tell it.

This sounds obvious until you see what happens without it. A client I worked with had an early Node.js 16 release in production and Node.js 20 in development. The Array.prototype.at() method they used freely in development, which only arrived in Node.js 16.6, didn't exist in production. They found this during a customer demo. Environment parity is the practice that prevents that conversation.
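Where a shared image is not yet in place, even a crude version check in CI catches this class of mismatch. The sketch below compares the runtime major version reported by each environment against a pinned version. The pinned value and the per-environment versions are hypothetical; the "v" prefix mirrors how Node.js reports process.version.

```python
def major(version: str) -> int:
    """Extract the major version from strings like 'v20.11.0' or '20.11.0'."""
    return int(version.lstrip("v").split(".")[0])


def parity_problems(pinned: str, actual_by_env: dict[str, str]) -> list[str]:
    """Return the environments whose runtime major version differs from the pin."""
    return sorted(
        env for env, version in actual_by_env.items() if major(version) != major(pinned)
    )


# Hypothetical versions reported by each environment.
print(
    parity_problems(
        "20.11.0",
        {"local": "v20.11.0", "staging": "v20.11.0", "production": "v16.4.2"},
    )
)  # ['production']
```

With one Docker image promoted through every environment the check becomes redundant, which is the stronger guarantee.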

Security is not a phase at the end

NyayX is a legaltech platform handling sensitive documents. Security had to be designed in, not bolted on. Row-level security in PostgreSQL ensures tenants cannot query each other's data at the database layer, regardless of application logic. Every API endpoint enforces authorisation before touching data. Sensitive fields are encrypted at rest with AES-256. The audit trail is append-only — no UPDATE or DELETE permissions on the audit log table. S3 objects are private by default with pre-signed URLs for time-limited access.
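The database-level controls can be sketched as a short list of PostgreSQL statements applied through any DB-API cursor. Every name here (documents, audit_log, tenant_id, app_user, the app.tenant_id setting) is a placeholder and not the NyayX schema, and the policy assumes tenant_id is a uuid column.

```python
# Placeholder names throughout; adjust to the real schema and roles.
SECURITY_DDL = [
    # Row-level security: rows are visible only when they match the tenant
    # set for the current session or transaction.
    "ALTER TABLE documents ENABLE ROW LEVEL SECURITY",
    # FORCE applies the policy to the table owner as well.
    "ALTER TABLE documents FORCE ROW LEVEL SECURITY",
    """CREATE POLICY tenant_isolation ON documents
       USING (tenant_id = current_setting('app.tenant_id')::uuid)""",
    # Append-only audit log: the application role may insert and read only.
    "REVOKE UPDATE, DELETE, TRUNCATE ON audit_log FROM app_user",
    "GRANT INSERT, SELECT ON audit_log TO app_user",
]


def apply_security_ddl(cursor) -> int:
    """Run each statement on a DB-API cursor; return how many were applied."""
    for statement in SECURITY_DDL:
        cursor.execute(statement)
    return len(SECURITY_DDL)
```

For the policy to work, the application sets the tenant at the start of each transaction, for example with SELECT set_config('app.tenant_id', <tenant id>, true). Superusers and roles with BYPASSRLS still bypass policies, so the application should connect as an ordinary role.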

None of this was retrofitted. Every one of those controls was in the first pull request that touched the relevant subsystem. In my experience, retrofitting security into an existing system costs roughly 10x what it costs to design it in. This is the DevOps practice most frequently deferred and most frequently regretted.

Backups are a promise you have not tested

Everyone has backups. Almost nobody has restores. That gap matters more than any other single item on this list, because it is the one you only discover you got wrong at the worst possible moment.

For NyayX, which stores legal documents clients are legally required to retain, an untested backup is not a backup — it is a liability with good intentions. So I automate the restore, not just the backup. Once a week a job provisions a throwaway database, restores the latest snapshot into it, runs row-count and integrity checks against expected ranges, then tears it down. If the restore fails or the numbers look wrong, I get paged. A backup job succeeding tells me almost nothing. A restore succeeding tells me everything.
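The verification step at the heart of that job can be sketched as a comparison of restored row counts against expected ranges. The table names and ranges below are hypothetical, and a real job would add integrity checks beyond counts.

```python
def verify_restore(
    row_counts: dict[str, int], expected: dict[str, tuple[int, int]]
) -> list[str]:
    """Compare restored row counts with expected (low, high) ranges.

    Returns a list of problems; an empty list means the restore looks sane.
    """
    problems = []
    for table, (low, high) in expected.items():
        count = row_counts.get(table)
        if count is None:
            problems.append(f"{table}: missing from restored database")
        elif not low <= count <= high:
            problems.append(f"{table}: {count} rows, expected {low}-{high}")
    return problems


# Hypothetical ranges, and a restore from an empty export.
expected = {"documents": (10_000, 50_000), "clients": (500, 5_000)}
print(verify_restore({"documents": 0, "clients": 1_200}, expected))
# ['documents: 0 rows, expected 10000-50000']
```

Any non-empty result is what should trigger the page, along with a failure of the restore command itself.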

I learned this the expensive way years ago, on a system where nightly backups had been succeeding for months while silently writing zero-byte files — a rotated credential had quietly broken the export. The backup dashboard stayed green the entire time, and nobody checks a green dashboard. The job that restores and verifies is what turns a backup from a hope into a guarantee. If you do one thing after reading this article, schedule a restore test before you schedule another backup.

Incident response needs a script before the incident

The worst time to decide your incident response process is during an incident. For every production system I run, there is a written runbook for the five most likely failure modes: database connection exhaustion, high error rate on the main API, deployment rollback, third-party API degradation, and certificate expiry. Each runbook has a decision tree, the commands to run, and the escalation path.

This sounds like overkill for a solo-engineered system. It is not. At 2am during an incident, cognitive load is your biggest enemy. A runbook reduces the problem from "what do I do" to "follow the steps." When the SP-API platform went down during a peak sales window because Amazon's Notifications API was delayed, I had a runbook that said: check SQS queue depth, check Lambda error rate, check SP-API status page, switch to polling fallback if SQS depth exceeds threshold. The resolution time was 11 minutes. Without the runbook, I estimate it would have been closer to 45.
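A runbook like that is close to executable already. The sketch below encodes the three checks and the one decision named above. The threshold value, the function name, and the wording of the outputs are assumptions for illustration, and the inputs would come from whatever monitoring is in place.

```python
def sp_api_runbook(
    queue_depth: int,
    lambda_error_rate: float,
    sp_api_status: str,
    depth_threshold: int = 1_000,   # assumption: tune to normal queue behaviour
) -> tuple[list[str], str]:
    """Record the three checks in runbook order, then pick the action."""
    findings = [
        f"SQS queue depth: {queue_depth}",
        f"Lambda error rate: {lambda_error_rate:.1%}",
        f"SP-API status: {sp_api_status}",
    ]
    if queue_depth > depth_threshold:
        action = "switch to polling fallback"
    else:
        action = "stay on event-driven sync and keep watching"
    return findings, action


findings, action = sp_api_runbook(4_200, 0.02, "degraded")
print(findings)  # ['SQS queue depth: 4200', 'Lambda error rate: 2.0%', 'SP-API status: degraded']
print(action)    # switch to polling fallback
```

Writing the decision down as code has a side benefit: the threshold becomes something that can be reviewed and tested, not something remembered at 2am.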

The runbook is only half of it. After every incident I write a short post-incident note: what broke, what the signal was, how long detection and resolution took, and the single change that would have prevented it. It is not a blame exercise — it is a list of system improvements. That SP-API outage produced exactly one action item: alert on SQS queue depth so the next delay pages me before it becomes a customer problem. That alert has fired twice since, both times early enough that nobody downstream noticed. An incident you learn nothing from is just downtime; an incident that produces one concrete prevention is cheap insurance.

The compounding return on DevOps discipline

Each of these practices compounds. CI/CD with automated rollbacks means you can ship confidently. Infrastructure as code means you can rebuild after disaster in minutes. Observability means you find problems before customers do. Connection pooling and cost discipline mean you can scale without re-architecture. Security by design means you can sell to enterprise customers. Tested restores mean a bad day is an inconvenience instead of a closure. Runbooks mean incidents are learning opportunities instead of emergencies.

I run DietGhar and NyayX as a solo founder-engineer. The only reason that is viable is that the infrastructure runs itself 95% of the time because each of these practices is in place. DevOps best practices are not a tax on your time. They are how you get your time back.
