With a major product launch on the horizon, we expected user load to triple. The infrastructure was not optimized for that demand, and incident response was inconsistent across teams, with no set process.
As the SRE Manager, I was responsible for implementing the infrastructure changes needed for high availability and scalability. In the process, I discovered long-standing gaps in incident response and observability that would make it much harder to stay reliable under pressure and peak traffic.
Therefore, rather than simply fulfilling the project requirements as presented, I became an advocate for change across the organization. I established an observability platform with 100% stack coverage, created custom SLOs, SLIs, and error budgets to improve visibility into performance and reliability, drafted an incident management playbook and socialized it across the company, and held hands-on training sessions for each team involved. In addition, I advocated for adopting Terraform for Infrastructure as Code to reduce deployment time, and held FinOps reviews to eliminate redundant cloud spend.
As a result, we not only handled the 3x traffic at product launch with zero critical incidents, but also achieved 15% savings in cloud costs over the following years and 40% faster deployments. Many of the processes I created were adopted as company-wide best practices.