Infrastructure Hardening: Lessons from Enterprise Scale
Most engineers have worked on systems held together by the knowledge of one or two people. The kind of setup where everything runs fine until someone goes on vacation, and then a single server hiccup turns into a weekend firefight. I lived inside that reality for months before deciding to do something about it.
This is the story of how I took a hand-managed monolith running on bare EC2 instances, replatformed it onto ECS with full infrastructure as code, and helped set the stage for a JPMorgan Chase acquisition. Not because I had a grand plan from the start, but because I watched my boss suffer and refused to accept that as normal.
What Was the Situation at Figg?
Figg was a card-linked offers company formed by merging Mogl and Augeo, which left the team with inherited infrastructure from both sides and very few engineers who understood either system deeply.
Before Chase bought us, the company was called Figg. We built card-linked offers, the technology behind those targeted deals that show up on your credit card statement. Figg itself was the product of a merger between two companies, Mogl and Augeo, and that merger left us running infrastructure from both sides with almost nobody who understood either one deeply.
My boss, Jason Bausewein, was the Principal Engineer and the only engineer still around from the Mogl days. He understood the entire system because he had built large portions of it. The other person with deep knowledge was Jarrod Cuzens, but after the merger he had moved into a Chief Architect role. In practice, that meant his time went toward leadership and cross-team coordination rather than hands-on engineering.
That left Jason as the single person who could troubleshoot, deploy, and maintain production. And production was not simple.
Why Was the Monolith on Borrowed Time?
The monolith ran across roughly ten hand-managed EC2 instances with no configuration management, no infrastructure as code, and a bus factor of one.
Our API and website ran as a monolith deployed across roughly ten EC2 instances along with a number of dedicated database servers. Every one of those boxes was hand-managed. No configuration management, no infrastructure as code, no automated provisioning. If you needed to know why a particular server was configured a certain way, you asked Jason or Jarrod. If neither was available, you waited.
The fragility of this setup was not theoretical. Jason worked almost every weekend. When a single box had an issue at 2 AM, his phone rang. He was the on-call, the escalation path, and the institutional memory all rolled into one person. There was no redundancy in the system, and there was no redundancy in the team. The bus factor was effectively one.
I watched this play out for months. A talented engineer burning through his energy and personal time because the infrastructure demanded constant human intervention. It was not sustainable, and it was not fair to him.
Why Did the Fix Require Going Directly to the CEO?
The problem was organizational, not technical. The team had the ability to fix the infrastructure but lacked the authorization and sprint prioritization to dedicate time to it.
I decided to skip the normal chain and meet with the CEO directly. Not because I wanted to go around anyone, but because the problem was organizational, not technical. The team did not lack the ability to fix this. They lacked the authorization and prioritization to dedicate time to it. Every sprint was focused on features, and the infrastructure debt kept compounding.
In that meeting, I laid out exactly how fragile everything was. I walked through what would happen if Jason got sick for a week, or if two servers failed at the same time, or if we lost the database without a tested recovery process. These were not hypothetical scenarios. They were predictable failures waiting for their turn.
The CEO listened. To their credit, they gave me the green light to containerize the entire platform.
How Did the Replatform Work?
The replatform migrated everything off hand-managed EC2 instances onto ECS for container orchestration, RDS for managed databases, and ElastiCache for caching, with the entire infrastructure codified in Terraform.
I worked with our SRE team to migrate everything off the hand-managed EC2 instances and onto a proper AWS architecture. The target stack was ECS for container orchestration, RDS for managed databases, and ElastiCache for caching. The goal was straightforward: remove every piece of infrastructure that required someone to SSH into a box to maintain it.
We wrote the entire infrastructure in Terraform. Every VPC, every security group, every ECS service definition, every RDS parameter group. If it existed in AWS, it existed in code. This was not just about automation. It was about making the system understandable to anyone who could read a Terraform file, not just the two people who had configured the original servers by hand.
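To give a sense of what that looks like in practice, here is a minimal Terraform sketch of an ECS service defined entirely in code. The cluster name, task sizes, launch type, and image URL are placeholders for illustration rather than Figg's actual modules, and the networking inputs are assumed to come from elsewhere in the configuration.

```hcl
# Illustrative sketch only: names, sizes, launch type, and the image URL
# are placeholders, not the real Figg configuration.

variable "private_subnet_ids" {
  type = list(string)
}

variable "api_security_group_id" {
  type = string
}

resource "aws_ecs_cluster" "main" {
  name = "platform"
}

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024

  container_definitions = jsonencode([
    {
      name      = "api"
      image     = "registry.example.com/api:1.0.0"
      essential = true
      portMappings = [
        { containerPort = 8080 }
      ]
    }
  ])
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.api_security_group_id]
  }
}
```

The specific values matter less than the fact that a new engineer can read a file like this and know exactly what is running and why, which is something the hand-configured boxes could never offer.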
For CI/CD, we built pipelines in Drone. Every merge to main triggered a build, ran tests, pushed a Docker image, and deployed to ECS. No more SSH. No more manual deployments. No more crossing your fingers and hoping the new code behaved the same way on the production box as it did on your laptop.
The migration was not a weekend project. It required careful planning to move a live production system without downtime. We ran the old and new infrastructure in parallel, shifted traffic gradually, validated behavior at each step, and decommissioned the legacy instances only after we had confidence in the new platform.
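One way to express that gradual shift, assuming an application load balancer with one target group pointed at the legacy EC2 fleet and another at the new ECS service (both defined elsewhere in the configuration), is weighted forwarding on the listener. This is a rough sketch of the idea, not our exact cutover plan:

```hcl
# Sketch of a gradual cutover. The load balancer, target groups, and
# certificate variable are assumed to be defined elsewhere; the weights
# are examples, not the actual rollout steps.

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"

    forward {
      # Legacy EC2 fleet keeps most of the traffic while the new platform is validated.
      target_group {
        arn    = aws_lb_target_group.legacy_api.arn
        weight = 90
      }

      # New ECS service takes a small slice; the weight grows as confidence grows.
      target_group {
        arn    = aws_lb_target_group.ecs_api.arn
        weight = 10
      }
    }
  }
}
```

Each increment of the shift then becomes a small, reviewable change: adjust the weights, apply, validate, and repeat until the legacy fleet serves nothing.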
What Were the Measurable Results?
Once the replatform was complete, the platform had only a single outage over the remaining three years, compared to the near-weekly incidents that had been occurring before.
Jason stopped working weekends. The on-call rotation became uneventful because the infrastructure handled failures automatically instead of paging humans.
Beyond stability, the replatform changed how the team operated. New engineers could understand the infrastructure by reading Terraform modules. Deployments happened multiple times per day through Drone pipelines instead of once a week through manual processes. Scaling was a configuration change, not a provisioning exercise.
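To illustrate what a scaling change looked like as configuration, capacity for a service like the API sketched earlier can either be a one-line bump to desired_count or, as in the sketch below, delegated to a target-tracking policy. The thresholds and capacities here are placeholders, not Figg's actual settings:

```hcl
# Illustrative only: builds on the aws_ecs_service.api sketch above.
# Capacities and the CPU target are placeholders.

resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 3
  max_capacity       = 12
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

Either way, the change goes through the same pull request and review flow as any other code.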
The discipline of infrastructure as code also created an unexpected benefit: a complete audit trail. Every change to production was a pull request with a review, a timestamp, and a reason. When questions came up during compliance reviews, we could point to the git history instead of trying to reconstruct what happened from memory.
How Did Infrastructure Hardening Prepare the Company for Acquisition?
The containerized, codified infrastructure made JPMorgan Chase’s acquisition due diligence straightforward because there were no mystery servers to reverse-engineer and no tribal knowledge locked in individual engineers.
When JPMorgan Chase eventually acquired Figg, the replatform paid off in a way none of us had originally anticipated. Rolling our code into Chase’s infrastructure went smoothly because everything was already containerized, codified, and portable. There were no mystery servers to reverse-engineer. No tribal knowledge locked in someone’s head. The Terraform modules described exactly what we ran and how it was configured.
Acquisition due diligence often exposes infrastructure debt that becomes a negotiation liability or a post-acquisition headache. For us, the infrastructure was an asset. It demonstrated operational maturity and made the integration timeline realistic instead of aspirational.
What Are the Key Takeaways from This Experience?
The deepest lesson is that infrastructure problems often require organizational solutions, meaning a conversation with leadership about prioritization rather than another sprint ticket.
The technical lessons are the obvious ones: containerize your workloads, codify your infrastructure, automate your deployments. But the deeper lesson is about recognizing when a problem needs to be escalated beyond the engineering team. The infrastructure at Figg was not broken because engineers lacked skill. It was broken because the organization had not prioritized fixing it. That required a conversation with leadership, not another sprint ticket.
I also learned that infrastructure work often gets dismissed as unglamorous compared to feature development. But the stability we achieved after the replatform is what allowed the feature teams to move faster. They stopped losing days to production incidents and deployment failures. The compound effect of reliable infrastructure on engineering velocity is difficult to overstate.
If you recognize your own situation in this story, where one or two people carry the weight of your production environment and everyone is hoping nothing breaks on a long weekend, consider having that direct conversation with leadership. The cost of the replatform was real, but it was a fraction of the cost we were paying in engineer burnout, lost weekends, and accumulated risk.
Related Posts
- Why Quality Gates Matter in Multi-Agent AI Development
- A Security Audit Checklist for Modern Applications
Dealing with similar infrastructure challenges? Schedule a free compliance audit to review your infrastructure security posture and compliance gaps.