I was having a conversation with a CxO-level customer as part of an AIOps/observability workshop, and from what I could tell, most customers like this one are confused about how to properly operationalize cloud-native production environments, especially the monitoring/observability portion. Here is how the conversation went.
“Andy, we are thinking about getting [vendor] to use for our observability solution based on your recent research. What do you think?”
“Well, I don’t want to endorse any specific vendor, as they are all good at what they do. But let’s talk about what you want to do, and what they can do for you, so you can figure out whether or not they are the right fit for you.” The conversation continued for a while, but the last piece is worthy of being called out specifically.
“So, we will be running our production microservices in AWS in the ____ region. And we are planning to use this particular observability provider to monitor our Kubernetes clusters.”
“Couple of items to discuss. First, you realize that this particular provider you are speaking of also runs in the same region of the same cloud provider as yours, right?”
“We didn’t know that. Is that going to be a problem?”
“Not particularly. However, you may get into a ‘circular dependency’ situation.”
“What is that?”
“Well, as an enterprise architect, I always call for separation of duties as a best practice. For example, having your developers test their own code is a bad idea, and having your developers figure out how to deploy it is a bad idea. Running your production services in the same region as your monitoring software is a problem in much the same way: How would you know about a production outage if the cloud region takes a hit, and your observability solution goes down at the same time your production services do?”
“Should we dump them and go get this other solution instead?”
“No, I am not saying that. Figure out what you are trying to achieve and have a plan for it. Selection of an observability tool should fit your overall strategy.”
For those who don’t understand the above conversation, here is the reason why this scenario could be a problem.
Those of us who come from an enterprise architecture background were taught, as a best practice, to operationalize production systems in a way that avoids circular dependencies. This includes not having two services depend on each other and not colocating monitoring, governance and compliance systems with the production systems themselves. If you were to monitor your production system, you would do it from a separate and isolated sub-system (server, data center rack, sub-net, etc.) to make sure that if your production system goes down, the monitoring system doesn’t go down, too. The same goes for public cloud regions: Although it’s unlikely, individual regions and services do experience outages. If your production infrastructure is running on the same services in the same region as your SaaS monitoring provider, not only will you be unaware that your production systems are down, but you also won’t have the data to analyze what went wrong. The whole idea behind having a good observability system is to know quickly when things go bad, what went wrong, where the problem is and why it happened so you can fix it quickly. You can check out this blog where I explain this in detail.
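To make the colocation risk concrete, here is a minimal Python sketch that flags any monitoring system sharing a cloud and region with a production system. Everything in it is an assumption made for illustration: the `Deployment` record, the `shared_failure_domains` helper, and the placeholder names such as `checkout-cluster` and `region-a` are hypothetical and do not come from any vendor's API. It also treats "same cloud and same region" as the only shared failure domain, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of where a system runs (illustrative only)."""
    name: str
    cloud: str
    region: str


def shared_failure_domains(production, monitoring):
    """Return (production, monitor) name pairs that share a cloud and region."""
    return [
        (prod.name, mon.name)
        for prod in production
        for mon in monitoring
        if (prod.cloud, prod.region) == (mon.cloud, mon.region)
    ]


# Placeholder inventory; names and regions are made up for this sketch.
production = [Deployment("checkout-cluster", "aws", "region-a")]
monitoring = [
    Deployment("saas-observability", "aws", "region-a"),
    Deployment("external-heartbeat", "other-cloud", "region-x"),
]

overlaps = shared_failure_domains(production, monitoring)
for prod_name, mon_name in overlaps:
    print(f"WARNING: {mon_name} shares a failure domain with {prod_name}")
```

A check like this could run against a service inventory during an architecture review. A flagged pair does not necessarily mean the tool is the wrong choice, only that you need a plan for the case where both go down together.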
The best practice would be to either:
When/if you get the dreaded 2 a.m. call, what is your plan of action? Just think it through thoroughly before it happens and have a playbook ready, so you won’t have to panic in a crisis.