As business moves online, the reliability of websites, cloud applications, and cloud infrastructure has become critical to success. The way we manage systems and their workloads has changed as well. Entry-level servers are pooled through virtualization and run distributed software architectures, so that an outage in one component does not have to cause downtime and losses. The focus is now on digital infrastructure and efficiency.
In search of strategic improvements in its operations, one of our clients, the largest financial institution in Latin America and one of the largest in the world, sought out Inmetrics’ team of specialists. The bank had a digital ecosystem with several integrated technologies, so we proposed the SRE methodology as the ideal solution: it would let the squad responsible for the PIX project focus on strategic areas and meet the planned time to market without compromising delivery quality. A team of Inmetrics specialists was therefore allocated to our client to structure and implement the ideal SRE monitoring model in that squad's operations.
Site Reliability Engineering (SRE) is an approach to operations that uses automation and software engineering to keep applications running continuously, efficiently, and reliably. The key concept is engineering: a data-driven approach to operations, an automation culture that increases efficiency and reduces risk, and a hypothesis-driven methodology for incident, performance, and capacity work.
The SRE methodology is adaptable and can be adopted by any squad in a company, according to the demand, maturity, or needs of the team. The initial phase of our monitoring project at this financial institution was developed as follows:
• We identified opportunities for improvement and understood the specific scenario of that technology environment alongside the squad responsible for the PIX project.
• From there, we surveyed their main needs.
• We structured an action plan based on brainstorming meetings, in which we assessed the possibilities for evolution and defined the strategies for that production environment.
• We started the implementation phase of the site reliability engineering (SRE) disciplines according to the maturity and focus of the squad in question.
From there, we defined our implementation methodology and the main objectives that we would pursue together with our client’s team. The methodology is organized as a pyramid built from the bottom up; its levels, as directed by Inmetrics experts, are listed here from top to bottom:
Final validation of the user experience of our client’s products and services through intelligent monitoring
Data correlation, generation and validation of mathematical models, consumption projection, limit analysis, and improvement reports with a guaranteed SLA
Injection of coordinated failures, monitoring of the results, and creation of systemic resilience gates in the application solution
Concentrate and structure event logs and reports. Define, improve and integrate infrastructure, business, and APM dashboards
Definition of SLIs and SLOs, instrumentation of critical services, creation of alerts, and automation of the fault response process
An initial brainstorm with involved teams, process refinements and full system
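To make the SLI and SLO level of the pyramid more concrete, the following is a minimal sketch of how an availability SLI can be compared with an SLO target and turned into an error budget figure. It is an illustration only, not the implementation used in the project: the function name, the request counts, and the 99.9% target are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the budget already spent (1.0 = exhausted)
    breached: bool          # True when the SLI is below the SLO target


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare an availability SLI against an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, sli < slo_target)


# Hypothetical numbers: 1,000,000 requests in the window, 500 of them failed,
# measured against an assumed 99.9% availability target.
report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo_target=0.999)
print(f"SLI={report.sli:.4%} budget consumed={report.budget_consumed:.0%} breached={report.breached}")
```

An alert at this level would typically fire on how fast the budget is being consumed rather than on individual failures, which is what allows the fault response process to be automated around a small number of meaningful signals.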
Our specialists brought SRE principles to the operations of the squad responsible for the PIX project in order to deal with infrastructure problems and automate processes. We were responsible for developing performance, strategy, and optimization plans for these operations.
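The consumption projection and limit analysis mentioned in the pyramid can be illustrated with a simple trend model. The sketch below fits a least-squares line to daily peak usage and estimates how many days remain before a capacity limit is reached. The linear model, the function name, and the sample figures are assumptions made for the example; the models used in the project are not described here.

```python
from typing import Optional, Sequence


def days_until_limit(daily_usage: Sequence[float], limit: float) -> Optional[float]:
    """Estimate days until usage reaches `limit`, using a least-squares line.

    Returns None when the trend is flat or decreasing (no projected crossing).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("at least two observations are required")

    # Day indices are 0, 1, ..., n-1, so their mean is (n - 1) / 2.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None

    crossing_day = (limit - intercept) / slope
    # Measure from the last observed day; never report a negative horizon.
    return max(0.0, crossing_day - (n - 1))


# Hypothetical daily peak utilization (%) and an assumed 80% capacity limit.
usage = [50.0, 52.0, 54.0, 56.0, 58.0]
remaining = days_until_limit(usage, limit=80.0)
print(remaining)
```

A straight line is the simplest possible choice; workloads with seasonality or step changes would need a richer model and validation against held-out data.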
With the implementation of SRE monitoring in the squad's operations, we made the systems more observable and considerably reduced the time spent on daily tasks such as spot troubleshooting and war rooms, because we brought insights and accurate information that effectively added value to our client’s processes.
The following gains could be observed in the very first stages of implementing the SRE methodology:
COMPLETE TICKET RESOLUTION
Less time and effort spent troubleshooting tickets
WAR ROOMS
Average time spent in war rooms sharply reduced
SRE monitoring increased efficiency and optimized the working time of our client’s squad by 75%
Technical negotiations in general improved, requiring less effort and fewer resources
Av. Eng. Luís Carlos Berrini, 105,
16º andar | Sala 1607
Brooklin Novo – SP
Brasil | CEP: 04571-010
+56 2 3203-9507
Cerro El Plomo, 5420
Oficina 1503
Las Condes | Santiago Chile
Código Postal : 7560742
comercial@inmetrics.cl
+57 1 646-9642
Carrera 19A #90-13
Oficina 304, Bogotá
Colômbia
Código Postal: 110221
comercial@inmetrics.co
+1 809.794.5333 ext. 5334
Calle Filomena Gómez de Cova No.3
Edificio Corporativo 2015, Piso 7
Local 701. Piantini
Av. Tamboré 267 – 21º andar
Torre Norte, Tamboré
Barueri SP – Brasil | CEP: 06460-000
© 2021. Inmetrics. All rights reserved