Find and Fix, Sink or Swim
A few LinkedIn NOC engineers, including Ariel Casas, had a hand in forming what the once-fledgling professional networking company today calls its Site Reliability Engineering organization. “The main difference between SRE and devops is that SREs are basically developers who happen to be very good at operations. So we call them Site Reliability Engineers,” says Casas, now Manager, Site Reliability Engineering, LinkedIn, explaining the choice of terminology.
In those days, outages and similar events would thrust Casas and his colleagues into the middle of the excitement, forcing them to devise solutions to the operations challenges at hand. This gave them the opportunity to engage developers about how to work together to avoid future issues.
“Getting thrown into the middle of the storm and having to reverse engineer a service not knowing what it does, not knowing why it’s broken, we basically had to identify the root cause. We had a very, I mean, we still do, but we had a very, very complex infrastructure back when I first started. It was like descrambling an omelet,” says Casas.
Casas and crew located and scraped through log data, combed through config files, identified service dependencies, and figured out how services worked on the spot in order to understand an unexpectedly broken service. “Those opportunities were a great way for us to learn and figure out how our infrastructure worked,” says Casas, whose early training at LinkedIn was largely on-the-job, experiential learning.
Early team members like Casas received a lot of hands-on exposure to the various stages in LinkedIn’s application lifecycles, from conception to design to maintenance. Team members touched the application stack at many points and worked with developers to help them understand how the code they wrote would affect the infrastructure that the NOC had to support.
Playing With Fire
Asked about particular fires he and his co-workers had to extinguish, Casas recalls an early example when only a few engineers were on hand, watching the site and waiting for something anomalous to happen. And something did. “We had a company widget go down,” says Casas.
The widget served data that LinkedIn collected about specific enterprises that external media outlets were reporting on. In this instance, the CNN news service’s implementation of the widget stopped working. “We had no idea about the service or how the widget served that data, we simply knew it was broken and causing a poor experience for the customer,” says Casas.
The team used the Firebug debugging tool to determine the context path for the widget. From there they connected the dots to the service behind it and the port it was listening on. Mining that service’s logs, they found errors suggesting that the service would benefit from, and perhaps fully recover with, additional memory. “We increased the memory for the service, deployed it, restarted it, and it was really rewarding to see the widget behaving the way it should,” says Casas.
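For readers curious what that kind of log mining might look like, here is a minimal Python sketch. It is not the team's actual tooling: the article does not say which errors the logs contained or how they were formatted, so the error patterns, the sample log lines, and the function name count_memory_errors are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns: the article does not say which errors the logs held.
# These are common memory-related messages from Java services.
MEMORY_PATTERNS = [
    re.compile(r"OutOfMemoryError"),
    re.compile(r"GC overhead limit exceeded"),
]


def count_memory_errors(log_lines):
    """Count how many log lines match each memory-related pattern."""
    counts = Counter()
    for line in log_lines:
        for pattern in MEMORY_PATTERNS:
            if pattern.search(line):
                counts[pattern.pattern] += 1
    return counts


# Invented sample log lines, for illustration only.
sample_log = [
    "12:00:01 INFO  widget-service started",
    "12:05:13 ERROR java.lang.OutOfMemoryError: Java heap space",
    "12:05:14 ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "12:05:20 WARN  request to /widget timed out",
]

if __name__ == "__main__":
    for name, hits in count_memory_errors(sample_log).items():
        print(f"{name}: {hits} matching line(s)")
```

A cluster of matches like these would point toward memory pressure and support trying a larger memory allocation, which is the conclusion the team reached.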
On With the Formalities
With more than 150 engineers at LinkedIn now, the company has fewer fires to put out and more need for formal training, both external, such as conferences, and internal. SREs attend yearly conferences like Velocity, a WebOps-focused event where speakers from different organizations share their subject matter expertise. SREcon is another conference where SREs come to share knowledge with and train their peers. LinkedIn sends speakers as well as attendees to both conferences. “We also go to Python trainings to learn how to better code and to understand new concepts that we can use to improve our tooling,” says Casas.
For internal training, the company’s SREs who support select services, such as the internal SaltStack, provide education about developments in those areas. Other training includes Java programming, as the LinkedIn software stack is built on Java. LinkedIn’s Java SMEs also contribute code back to Java.
LinkedIn also trains SRE staff on Kafka, the messaging system that supports much of that group’s data streaming. “We have some pretty smart engineers who support, own, and develop Kafka here. They provide good training around how we implement and use it,” says Casas. LinkedIn sends Kafka SMEs to meet-ups to train other enterprises in its use.
An Open Door to the SRE Career Advancement Ladder
LinkedIn offers internal trainings with an open door to any high-performing NOC engineer or developer in SRE who wants to attend. “We also have more general trainings called TechTalks and these are high-level and open to anyone in the company,” says Casas.