Many people say that site reliability engineering (SRE) is “putting the Ops in DevOps.” In this DevOps Chat, we speak with Stig Sorensen of Bloomberg’s SRE team. Stig tells us how SRE is helping Bloomberg keep pace with the fast-moving world of news and information.
Great discussion with a real-life practitioner about how SRE is being deployed in a real-world company. As usual, the streaming audio is immediately below, followed by the transcript of our conversation.
Alan Shimel: Hey, everyone. It’s Alan Shimel and you’re listening to another DevOps Chat. We’ve got a really good DevOps Chat lined up today on a really hot topic, and it’s SRE, site reliability engineering. And my guest is Stig Sorensen, who heads up – a member of the SRE team over at Bloomberg. Stig, welcome to DevOps Chat.
Stig Sorensen: Thank you very much. I’m glad to be on.
Shimel: So let’s get formalities out of the way. Stig, give our people your title and background here so that we can get started.
Sorensen: Okay, so my name is Stig Sorensen. I am the manager of the production visibility group at Bloomberg, responsible for building our CMDB tools and also building telemetry as a platform. I’m also one of the main evangelists for rolling out SRE as a role and as a – a new movement across Bloomberg.
Shimel: And Bloomberg is, of course, no small company, so rolling out this new role, this new way of looking at Ops and so forth, is – is of course not a – you know, it’s not something we should take lightly or think it’s easy.
Sorensen: Well, that’s very true. It’s – across our 5,000 engineers, it’s – that supports everything from our commercial website to – to market data feeds, to obviously our main product, the Bloomberg Professional Terminal which hundreds of thousands of the key influencers around the world rely on. And, with various states and various different tech stacks, it’s actually quite – quite a challenge.
Shimel: Excellent. So, so Stig, why SRE, why now at Bloomberg?
Sorensen: So SRE in, like, what we do isn’t really new to Bloomberg. It’s always been sort of the engineers that own production and sort of own the – the availability of production, and own the process around it.
What we’re sort of trying to do now is sort of really more divide the responsibility between the application engineers and the SREs such that we can have people that are experts on – either of them, but it’s sort of unfair in today’s complex world for everyone to know everything from how a – how an instrument trades on the street to figuring out how to – how to run and manage a high availability system. Right? So it’s more fitting it into infrastructure and making people experts.
Shimel: So an interesting point you bring up, Stig, and you said, we’ve always been doing this kind of thing. And it’s true. A lot of what we now today are calling SRE isn’t necessarily brand new in terms of task and responsibility. It – it may be there’s more of the packaging and some of that mixed in.
When – when did Bloomberg, let’s say, start referring to this role, to this, you know, needs as, you know, quote/unquote “SRE”?
Sorensen: So we sort of started small about two years ago and it’s really sort of – really caught fire in, I would say, the last year where we really made it sort of a more firm-wide. It’s always been pockets of where it’s sort of been done, but then we really, like, part of the puzzle is really sort of internal branding as well in terms of labeling the – not labeling the people, but identifying who is there creating the right environment for these people to learn, to share ideas. Right?
I think, as you said, the – it’s not necessarily the new thing to do. What’s new for us is sort of really trying to do it in a more – more standard way and a more persistent way across the – across the firm.
Shimel: Yep. So Stig, you know, 5,000 engineers. How many are, you know, quote/unquote “on the SRE team”?
Sorensen: So they – we don’t actually have a centralized SRE team today. We sort of rolled it out in a way where we aligned the SRE teams with the application teams. The reason for that is really a lot of – in order for the SRE teams to be successful, the applications also have to change. Sort of separating the responsibility too far away makes that harder.
So we sort of put them parallel, side-by-side, two equal teams with the different responsibilities and – and that’s really sort of how we kick start the teams. Most of the SRE – a lot – quite a few of the SRE teams are seeded from – from the application teams and a lot of, like, the – the core infrastructure teams and so on.
So all-in-all we have about 75 people that are in sort of the – in the sort of teams that we have sort of – the teams that are SRE teams. As I said, we have a lot more people doing it across the firm, but they’re doing it as a part-time – part-time job.
Shimel: Got it. And so Stig, you know, SRE – this concept of an SRE – as you said, though they – we’ve been doing these kinds of things forever, the idea of calling them SRE and labeling them as such is relatively new.
Is there any particular training or skill sets that you’re asking people to – to make sure they’re, you know, proficient in in order to be quote/unquote “an SRE”?
Sorensen: That – I think – I think sort of – the answer is that we are hiring and we’re trying to hire many, many more. They sort of come from two different camps. It’s either people with an operational or system admin background who have sort of picked up programming and are moving that way, or we have application engineers with an – with a more active view towards systems and towards availability – because we see SREs as software engineers just doing something – building a different type of software.
And I sort of jokingly tried to sort of classify SREs versus application developers in a way, sort of – I see application developers as inherently optimistic, they think everything will work all the time, and SREs are sort of the opposite, they think everything will break all the time. And that sort of relates – I think it’s more of a mindset than necessarily the skills. Skills you can always learn. And then I think some – as long as you have the right mindset, the right aptitude to do it, that – well, that’s what’s key to us. Right?
And I think there is – I think there’s way more SRE jobs than actually people that have done SRE before out there now. So we can’t go out and try to hire the – the ready SRE. We need to figure out how we get smart people with the right aptitude and training them.
And people mention training, we sort of try to do it in sort of like a lot of meet-ups, a lot of sort of internal meet-ups, external meet-ups, attending conferences to figure out what’s happening in industry, talking at conferences such that we also show what we’re doing.
And I think it’s – I think it’s really sort of a lot of on-the-job training, working with the application team, working with other SRE teams, and you realize that you’re not alone. There’s always – there’s always someone doing something very similar to you. You don’t need to invent the wheel every single time.
Shimel: Yeah, and Stig, we should not underestimate the importance of what you just said there. With a new field like SRE – and we saw this in DevOps, early on in DevOps – because there’s no sort of codified best practices, right? There’s the book from the Google team, and there’s other stuff coming out now but there really isn’t a – a best practices. There’s emerging practices. And so the idea of peer-to-peer sharing and – shared learning is – is really important to continue advancing, you know, the state of the art, if you will.
In addition to my DevOps dot com duties, I’m also one of the co-founders of DevOps Institute and this is exactly what we – the fact pattern that we saw in DevOps. And it’s the same fact pattern we’re seeing in SRE and frankly it’s one of the reasons why we are working, and expect to have out soon, an SRE course that kind of starts, you know, pulling together all of these emerging practices and having a common body of knowledge around it.
Sorensen: I think that’s great, that – that we’re sort of trying to standardize these things, but – because a lot of it also becomes sort of – you have to be flexible when you roll it out because not every application or every system can be rolled out – can – can get an SRE team and roll out in the same way because there – there’s different maturity of the systems, there’s different problems. Right?
Like, in terms of, like, do you focus on monitoring and a production environment or do you focus on more sort of the classic DevOps things and how you get things into production. Like, you really have to look at it from a holistic point of view. The goal of the SRE team is to improve availability of production. There’s multiple ways to do it. You could pick one and make it better, you’d probably – you’d probably improve it. Right?
And the – thinking – like, it has to do everything to make a difference is also probably not the right thing to think about.
Shimel: Yeah. Absolutely. So, Stig, let me – let me kind of pivot here a little bit and let’s talk about – what’s in it for Bloomberg with SRE? Why are they supporting you?
Sorensen: Well they – it’s – I think it’s really just looking at the services we provide. Right? Like, we’re providing real-time financial information to 300,000 of the most influential people around the world and it’s not really just about availability, it’s also about performance and latency of this – the data going around the world. Right? We – we process something – what, like 100 billion market data ticks a day and one-and-a-half million news stories a day. Right? So it’s – so in order to keep all these pipelines going, it is actually quite a big job. Right? And I think just – just making sure that we are available and don’t influence the financial markets by not being there.
Shimel: Yeah. I mean, that – you know, and that brings home the, you know, we’re not – sorry, Bloomberg doesn’t sell widgets that no-one cares about. Downtime and performance there is – is, you know, teamed up or, you know, ingrained with financial market performance and reporting. You can’t afford – you can’t afford downtime. You can’t afford, kind of, mess-ups.
But Stig, we spoke a little off-mic about this and it’s a topic I want to bring up is, you know, at Bloomberg, you guys have been doing this SRE thing now and you think it – you’re making progress and it’s good, but we really haven’t done any sort of formal ROI analysis or, you know, similar type of work. Correct?
Sorensen: Yeah, no – that’s very true. It’s – it’s probably one of the biggest challenges in terms of, we have support and everyone believes it’s the right thing to do. You have the gut feel that this is – we have to do this to make it better or to be – or to keep the service going we’ve always had.
But I think it’s sort of a – for us it’s – for us, like, measuring – _____ measuring availability is actually not black and white. It’s not like, how many – how many failures you had on a website. Right? Because if you are a certain market player and the real-time market data is delayed by a few – by one millisecond or hundreds of milliseconds, it could make a big difference for you. Right? And it’s – is that complete failure or partial failure? And, sort of, all these things in between.
It’s sort of something we’re – we’re really working on to try to define better metrics around it such that we can track our progress effectively to see if – if we had – if we – if we are succeeding. It’s – if you go by the gut feel, it only lasts so long but at some point you need to really ask yourself, you need to make sure that the return is there.
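Editor’s note: the point that availability is “not black and white” can be made concrete with a small sketch. The Python below is only an illustration, not Bloomberg’s method. The threshold values, the function names `classify` and `summarize`, and the sample data are all hypothetical. It buckets each delivery as good, degraded, or failed by latency, so that partial failure shows up as its own number rather than being folded into a single up/down figure.

```python
from collections import Counter

# Hypothetical thresholds in milliseconds; the interview names no numbers.
DEGRADED_MS = 1.0
FAILED_MS = 100.0


def classify(latency_ms):
    """Label one delivery. None means the message never arrived."""
    if latency_ms is None or latency_ms >= FAILED_MS:
        return "failed"
    if latency_ms >= DEGRADED_MS:
        return "degraded"
    return "good"


def summarize(latencies_ms):
    """Return the fraction of deliveries that fall in each category."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    counts = Counter(classify(x) for x in latencies_ms)
    total = len(latencies_ms)
    return {k: counts[k] / total for k in ("good", "degraded", "failed")}


# Made-up sample: latencies in ms, with None for a lost message.
samples = [0.2, 0.4, 0.9, 1.5, 30.0, 0.3, None, 250.0, 0.5, 0.7]
print(summarize(samples))
```

Tracking the degraded share separately is one way to capture the “things in between” that a plain failure count would miss; where the thresholds sit would depend on the customer and the product.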
Shimel: Absolutely. Stig, I mean, and let’s be clear. This is not a Bloomberg issue at all. I – you know, we haven’t really seen it across the board yet. You know, formal ROI type of analyses for SRE, but as you said, it – you know, it obviously makes sense or, you know, logical to think that it’s good for – for organizations and I think this is why we’re seeing so many organizations starting to adopt this.
Another thing I wanted to touch on, though, Stig, was the idea of, again, paralleling the DevOps experience, originally everyone said, oh, DevOps is great for unicorns. It’s great for Google, it’s great for Facebook, it’s great for the Twitters of the world, but it doesn’t really work on the horses. Right? On the just regular, large enterprises.
SRE, well, yeah, that’s a Google thing and it’s great for Google but is it going to work for Bloomberg. Right? Or similar kinds of things. And, you know, you’re living proof, if you will. But every road, Stig, there’s – it’s – you know, I call it the DevOps cha-cha. It’s never linear, one step in front of the other and it’s just, you know, smooth sailing. There’s always a failure here, a lesson learned here, a step back, two steps forward, one-two, one-two-three.
Give us, you know, I’m sure you could give us plenty of successes. Give us a failure or two that you guys really learned from. Lessons learned.
Sorensen: Cool, yes, it’s – I think sort of in – the lesson learned in failure of the transformation was really, I think – some of the teams that were labeled SRE were given too much operational work such that they didn’t necessarily have enough bandwidth to get themselves out of the quagmire they were in. And that sort of created, like, a new pattern in terms of that, like, SREs are operations, and they’re not. SREs are developers. But developing things to get rid of operations work, but if you don’t have time to develop, you’re never going to get out of it.
So I think that’s probably about the biggest lesson learned in terms of – in terms of the building the culture and getting sort of the – the optics to be right internally. And sort of in the same way, I mean, sort of like, it was also, sort of, SREs within the application teams and not having dedicated teams was also quite a – sort of an anti-pattern that we saw a bit where it’s easier to steal the SRE’s time to do application work. And then again, we don’t actually move forward. So the – so that’s sort of probably the biggest failure.
I think also we need to make sure that we have a good set of base tools and base things that – that can be shared between the teams such that not everyone needs to – needs to build everything from the ground up. And we probably – we probably started sort of in silos where sort of parallel things were built but it improved. It improved – it improved the production quality but it didn’t necessarily make us go faster in terms of rolling out SRE and so it – we sort of had to step back and sort of standardize a bit more of tools and make sure we had central ownership of some of these tools rather than individual teams owning them.
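Editor’s note: the first lesson above, that an SRE team buried in operational work never gets time to engineer its way out, is the kind of thing a team could track with a simple ratio. The sketch below is an assumption-laden illustration, not something described in the interview. The category names, the weekly hours, and the helper names `ops_share` and `over_cap` are hypothetical, and the 50% default cap mirrors the guideline in Google’s SRE book rather than any Bloomberg figure.

```python
# Hypothetical cap on the share of time spent on operational work.
# The 0.5 default follows the guideline in Google's SRE book; the
# interview itself gives no number.
OPS_CAP = 0.5


def ops_share(hours_by_category):
    """Fraction of logged hours spent on operational ("ops") work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    return hours_by_category.get("ops", 0.0) / total


def over_cap(hours_by_category, cap=OPS_CAP):
    """True when operational work exceeds the agreed cap."""
    return ops_share(hours_by_category) > cap


# Made-up week for one team, in hours.
week = {"ops": 28.0, "engineering": 10.0, "meetings": 2.0}
print(f"ops share: {ops_share(week):.0%}, over cap: {over_cap(week)}")
```

A team that stays over the cap week after week is in the situation Sorensen describes: too little development time left to remove the operational work in the first place.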
Shimel: Excellent. Stig, we’re coming up on the end of our time here. I wanted to kind of just ask you, so, you know good progress, off to a heck of a start, a fine beginning. Where do you see the future of SRE at Bloomberg specifically and in the greater market, let’s say, if you had a crystal ball, over the next three to five years?
Sorensen: I think – I think internally here, right, like we’re – we’re trying to grow significantly so for us to just be able to – to both convert people internally from an application to an SRE and also hire the people.
And I would hope for us and for larger companies like us where there’s a transformation, that it becomes – becomes more standardized such that we can have a centralized org that is the SRE org, and have sort of more of a – where we can – where we can scale – scale and support more teams whereas now it’s really one-to-one and – not necessarily one-to-one in terms of people but one-to-one in terms of teams.
And I would hope the industry in general also starts to think about the people that do not run on the big cloud providers and sort of actually own everything from the ground up, like us, and I would hope that more companies that are like us get into this such that –
[Inaudible due to radio noise]
– that it would be more tools and more things to support us. Right?
Shimel: Sure. Sure. Well, Stig Sorensen of the SRE team at Bloomberg, I want to thank you for being our guest on today’s DevOps Chat, and, you know, this is really – this whole SRE thing is certainly an area where we are watching closely. Maybe we could have you back on in a couple months and get an update.
Sorensen: I would love to, that would be great.
Shimel: All right. Stig, thanks again. This is Alan Shimel for DevOps dot com and you’ve just listened to another DevOps Chat. Have a great day, everyone.