In the past, companies such as Google ran complex software systems manually. They appointed individuals, called system administrators, to assemble software components into a working service. The administrators ran the service, responded to events, and reported on them. As traffic grew and the systems became more complex, this work became very difficult.
The Conflict Between the Operations Team and the Product Development Team
The operations team and the development team had contrasting goals. They could not agree on when, and how fast, to release software to production. The development team's task was to launch new features that users could adopt, while the operations team's task was to keep services running smoothly.
However, each new feature launch risked interrupting the service, which made it hard for the operations team to do its job. As a result, the two teams were in constant conflict.
Solution of the Conflict
Benjamin Treynor Sloss originated the idea of the Site Reliability Engineering team in 2003, when he joined Google and was asked to run a production team. That team has since matured into Google’s present-day SRE team.
What is Site Reliability Engineering (SRE)?
SRE is a set of principles and practices that applies aspects of software engineering to development and operations problems. Its aim is to build reliable and scalable software systems. SRE is often described as a specific implementation of DevOps, the set of practices that combines software development and IT operations.
Who is a Site Reliability Engineer?
A site reliability engineer is a software engineer or someone with comparable qualifications. They use those skills to build high-quality software systems that solve complex operational problems, and they design and implement automation.
In short, site reliability engineers bridge development and operations. They apply software engineering ideas to system management problems, and they split their time between operational work and development work that improves site reliability. As a result, site performance improves as well.
Google caps the operational work of its SREs at 50% of their time. The remaining time goes to development work, which keeps the service stable and operational and moves it toward running and repairing itself. In this way, the team can complete its development tasks on time while keeping the amount of operational work small.
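The 50% limit lends itself to a simple check. The sketch below is only an illustration, not Google's tooling: the `WorkLog` record, its field names, and the sample hours are all hypothetical, and it assumes each engineer's time is logged as either operational or development work.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # the 50% upper limit on operational work


@dataclass
class WorkLog:
    """Hours one engineer logged in a period (hypothetical record shape)."""
    engineer: str
    ops_hours: float  # tickets, on-call, manual interventions
    dev_hours: float  # engineering and automation work


def ops_fraction(log: WorkLog) -> float:
    """Share of logged time spent on operational work (0.0 if nothing logged)."""
    total = log.ops_hours + log.dev_hours
    return log.ops_hours / total if total > 0 else 0.0


def over_cap(logs: list[WorkLog], cap: float = OPS_CAP) -> list[str]:
    """Names of engineers whose operational share exceeds the cap."""
    return [log.engineer for log in logs if ops_fraction(log) > cap]


# Made-up sample data: alice spends 30% of her time on ops, bob 65%.
logs = [
    WorkLog("alice", ops_hours=12, dev_hours=28),
    WorkLog("bob", ops_hours=26, dev_hours=14),
]
print(over_cap(logs))  # ['bob']
```

A team could run a check like this per quarter and treat any name in the result as a signal to move operational load elsewhere or to automate it.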
The ultimate goal of SRE, as per Google, is to automate work. An important way to do this is to create self-service tools for the teams that depend on a service (for example, automated test environment provisioning, logging, and statistics visualization). Automation reduces the ongoing work of all parties. It allows developers to focus on feature development and on the next task to be automated.
SRE works with product developers to ensure that the designed solution meets requirements such as availability, performance, security, and maintainability.