Many candidates and recruiters casually use the terms DevOps and Site Reliability Engineering interchangeably. In reality, they represent related but distinct ideas.

Over time, both terms have also been adopted as job titles. As a result, it is not uncommon to meet engineers working as a DevOps Engineer or an SRE who have had little exposure to the philosophies that originally shaped these ideas.

The tech community has always had a habit of blurring terminology, but if you are serious about building a career around these ideas, the differences matter. Understanding their origins and underlying principles gives you an uncommon advantage.

This post aims to help you in two specific ways.

First, it will give you a clearer sense of what each concept is about. Each has its own philosophy, origins, and guiding principles, and you may find that one resonates with you more than the other.

Second, throughout the post I reference a number of books, talks, and thinkers who have shaped these ideas. Their work is well worth exploring if you want to go deeper.

Developer - Operator Cooperation

Origin

The origin of the word DevOps is widely attributed to Patrick Debois, a Belgian IT consultant and Agile practitioner. Debois first used the term as a Twitter hashtag while promoting the inaugural DevOpsDays conference in 2009 in Ghent, Belgium.

The idea itself, however, had been building in the community for some time. A key influence was a landmark talk by John Allspaw and Paul Hammond titled “10+ Deploys per Day: Dev and Ops Cooperation at Flickr”, presented at the Velocity Conference in 2009. The talk demonstrated how close collaboration between development and operations teams enabled Flickr to deploy code to production more than ten times a day, something that was almost unheard of at the time.

A quick search on YouTube should be enough to find this talk, which remains well worth watching today.

In the years that followed, several attempts were made to give the emerging movement a more formal definition.

One of the most influential frameworks was the CAMS model, introduced by Damon Edwards and John Willis shortly after the first DevOpsDays. The CAMS model describes DevOps through four key pillars: Culture, Automation, Measurement, and Sharing.

Another widely cited framework is the Three Ways, introduced in the influential book The Phoenix Project by Gene Kim and his co-authors. This model describes DevOps through three principles: Flow, Feedback, and Continuous Learning.

A detailed discussion of these frameworks would make this post unnecessarily long. For our purposes, it is enough to note that both CAMS and the Three Ways attempt to formalise the philosophy behind the DevOps movement. Readers interested in exploring these ideas further are encouraged to read The Phoenix Project and related DevOps literature.

Finally, a more academic definition was proposed by researchers at the Software Engineering Institute (SEI). They describe DevOps as:

“A set of practices intended to reduce the time between committing a change to a system and placing the change into normal production, while ensuring high quality.”

Site Reliability Engineering

Origin

The term Site Reliability Engineering (SRE) was coined by Benjamin Treynor Sloss at Google around 2003. Sloss, who currently serves as a Vice President of Engineering at Google, famously described SRE as:

“What happens when you ask a software engineer to design an operations function.”

The discipline was first presented to a wider audience at SREcon in 2014, where Google engineers shared many of the practices that had been developed internally over the previous decade.

Two years later, the publication of Site Reliability Engineering: How Google Runs Production Systems provided the industry’s first comprehensive formalisation of SRE practices and principles.

The Reliability Philosophy

The SRE book explains in detail several principles that guide how reliable systems should be designed and operated. They are:

Embrace Risk
Define Service Level Objectives
Eliminate Toil
Monitoring Distributed Systems
Automate Everything Reasonable
Release Engineering
Simplicity

A full treatment of all these principles would make this post far too long, so I will focus on three of the most influential ones. As before, readers are strongly encouraged to read the Site Reliability Engineering book, which remains one of the definitive texts on the subject.

Embrace Risk

In an organisation with zero risk appetite, developers and operators often find themselves at loggerheads because they have conflicting incentives. Developers are incentivised to ship new features as quickly as possible. Operators, on the other hand, are responsible for ensuring that system stability is never compromised. One group facilitates change while the other resists it, which naturally leads to friction.

To reduce this friction, organisations must accept a certain level of risk. The SRE book quantifies this accepted risk as an error budget: the amount of unreliability a service is allowed over a given period. Both developers and operators agree that as long as this error budget is not depleted, new features can continue to be released. Conversely, once the budget is exhausted, feature releases are paused until the system becomes reliable again.

This shared agreement aligns the incentives of developers and operators.
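The release rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reliability is measured as the fraction of successful requests in a window, and the names `ErrorBudget`, `target`, `total_requests`, and `failed_requests` are hypothetical, not taken from the SRE book.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget tracked over one measurement window."""

    target: float         # agreed reliability target, e.g. 0.999 (assumed)
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests that is allowed to fail.
        return (1 - self.target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def releases_allowed(self) -> bool:
        # Feature releases continue only while some budget is left.
        return self.remaining > 0


healthy = ErrorBudget(target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(target=0.999, total_requests=1_000_000, failed_requests=1_200)

print(healthy.releases_allowed())    # True: about 600 failures of budget remain
print(exhausted.releases_allowed())  # False: the budget is overspent
```

In a real organisation the decision is a policy agreed between teams rather than a single boolean, but the sketch shows why both sides can reason from the same number.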

Service Level Objectives

Every production system should have a realistic and measurable reliability objective.

This objective is chosen carefully: failing to meet it should noticeably degrade the user experience, but exceeding it should not meaningfully improve it.

For example, a critical web service might have a reliability target of 99.9% availability. Falling below that threshold would likely result in user complaints and visible downtime. However, pushing reliability much beyond that level may not significantly improve user experience.

A 99.9% objective corresponds to an error budget of 0.1%, which is roughly 43 minutes of downtime per month.
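The arithmetic behind that figure is easy to check. The snippet below is a minimal sketch that assumes a 30-day month and treats the error budget purely as allowed downtime; the function name `allowed_downtime_minutes` is made up for this example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    error_budget = 1 - slo_percent / 100   # e.g. 0.001 for a 99.9% target
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return error_budget * window_minutes


print(f"{allowed_downtime_minutes(99.9):.1f} minutes")  # 43.2 minutes
```

Each additional nine shrinks the budget tenfold, which is one reason why pushing a target far beyond what users notice becomes expensive quickly.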

You can imagine how this shared reliability target creates an interesting dynamic. When the system is stable and the error budget remains largely intact, teams can move quickly and ship features. When reliability begins to degrade, attention naturally shifts toward stabilisation and operational improvements.

Eliminate Toil

Toil is defined as repetitive operational work that produces little long-term value and can be automated. Repeatedly running build commands, clicking through the cloud console to perform routine tasks, and manually provisioning infrastructure are all examples of toil.

SRE teams are typically staffed from the same talent pool as software developers. This does not mean an operator needs the same amount of coding experience as a developer, but that the two think alike, use the same jargon, and share the same appreciation and taste for good software.

Because SRE practitioners can write software, they are able to automate repetitive tasks and systematically reduce operational toil over time.

The SRE model even introduces a structural mechanism to encourage this. Operational work performed by SREs is capped at roughly 50% of their time. If a service requires more operational effort than that, the excess work is routed back to the development team.

This creates a natural feedback loop: developers are incentivised to build more reliable systems so that operational burden decreases over time.
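To make the cap concrete, here is a small sketch of the bookkeeping. It assumes operational work is tracked in hours and uses exactly 50% as the cap; the names `OPS_CAP` and `excess_ops_hours` are hypothetical and only serve the illustration.

```python
OPS_CAP = 0.5  # assumed cap: operational work limited to 50% of an SRE's time


def excess_ops_hours(ops_hours: float, total_hours: float, cap: float = OPS_CAP) -> float:
    """Return the operational hours above the cap, to be routed back to developers."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


# An SRE who spent 26 of 40 hours on operational work is 6 hours over the cap.
print(excess_ops_hours(ops_hours=26, total_hours=40))  # 6.0
# Below the cap, nothing is routed back.
print(excess_ops_hours(ops_hours=15, total_hours=40))  # 0.0
```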

Class SRE implements DevOps

Now that we understand the core philosophies of both DevOps and SRE, let’s define the relationship between them using an analogy from the world of Object-Oriented Programming.

In Object-Oriented Programming, it is common to define a generic interface. An interface does not contain an actual implementation. Instead, it defines a set of principles or behaviours that any implementation must follow. Different classes can then implement this interface in their own way while still adhering to the same contract.

We can think of DevOps in a similar way. DevOps describes a philosophy about how software should be built and operated, but it does not prescribe a single concrete way of doing so.

Site Reliability Engineering can then be seen as one such implementation. If we were to express this relationship using a simple programming analogy, it might look something like this:


interface DevOps {
    Culture
    Automation
    Measurement
    Sharing
}

class SRE implements DevOps {
    ServiceLevelObjectives
    ErrorBudgets
    ToilReduction
    Automation
}

SRE takes many of the ideas promoted by the DevOps movement and turns them into explicit engineering practices. Concepts like Service Level Objectives, error budgets, and toil reduction provide concrete mechanisms for achieving the cultural goals that DevOps advocates.

When Philosophy Becomes Job Titles

If DevOps is a philosophy and SRE is a discipline, why do we now see job titles like DevOps Engineer and Site Reliability Engineer advertised everywhere?

The answer is that over time, ideas that begin as philosophies often get translated into roles by organisations trying to operationalise them. In the process, some of the original meaning is inevitably lost.

In many organisations today, the title DevOps Engineer has come to describe someone responsible for building and maintaining CI/CD pipelines, managing infrastructure-as-code, and automating deployment workflows. While these are certainly valuable activities, reducing DevOps to a collection of tools or pipelines misses the original intent of the movement, which was to improve collaboration and shared ownership between development and operations teams.

Something similar and equally tragic has happened with the title Site Reliability Engineer. In the original conception at Google, SREs were software engineers tasked with designing reliable systems and automating operational work. In practice, however, some organisations simply rebrand traditional operations roles as SRE without adopting underlying principles such as error budgets, service level objectives, or systematic toil reduction.

Understanding these origins helps cut through the confusion. DevOps and SRE were never meant to be just job titles. One describes a philosophy about how software teams should collaborate, while the other describes a disciplined approach to building reliable systems.

DevOps and SRE: A Philosophy and a Discipline
