My background includes a variety of clustered systems: first VMware, then OpenStack, Hadoop and its derivatives for a while, and these days the Mesos family. One of the many things I have learned along that journey is how interrelated infrastructure and application are. We all know about that relationship, but I think it goes further than most people recognize. A single degraded hard disk or loose network connection, which pushes work back out to the rest of the cluster, can drag the entire cluster to its knees. My time at F5 Networks taught me a metric ton about the nature of load balancing, so I can say confidently that the same could happen to your load-balanced application.
The biggest frustration for me in most multi-system configurations is the inability to watch enough places to diagnose a problem quickly, because your application could be brought to its knees just as easily by a memory leak as by failing hardware. The evidence is spread across multiple logs on multiple machines, split among the hardware, OS, and software, and you generally have to go looking for more information. Even if you have some of the best tools in the business, you are still looking in multiple places to find information about one problem and doing things like sorting by time to correlate events between the systems.
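To make the "sort by time to correlate events" chore concrete, here is a minimal Python sketch of merging per-layer logs into one timeline. Everything in it is an assumption for illustration: the `Event` type, the `parse_lines` and `merge_logs` helpers, the "ISO timestamp, space, message" line format, and the sample log lines are all hypothetical, and real logs rarely share a format or a synchronized clock.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, List, Mapping, NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str   # which layer the line came from (hypothetical labels)
    message: str


def parse_lines(source: str, lines: Iterable[str]) -> Iterator[Event]:
    """Parse 'ISO-8601-timestamp message' lines (an assumed format)."""
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        yield Event(datetime.fromisoformat(stamp), source, message)


def merge_logs(logs: Mapping[str, Iterable[str]]) -> List[Event]:
    """Merge logs that are each already in time order into one timeline."""
    streams = [parse_lines(source, lines) for source, lines in logs.items()]
    # heapq.merge lazily interleaves already-sorted inputs by the given key.
    return list(heapq.merge(*streams, key=lambda event: event.timestamp))


# Invented sample data: one log per layer, each sorted by time.
SAMPLE_LOGS = {
    "hardware": [
        "2017-03-01T10:00:01 disk sda: read latency rising",
        "2017-03-01T10:00:07 disk sda: reallocated sector count increased",
    ],
    "os": [
        "2017-03-01T10:00:03 iowait above 60%",
    ],
    "app": [
        "2017-03-01T10:00:05 request timeout on /checkout",
        "2017-03-01T10:00:09 worker restarted",
    ],
}

timeline = merge_logs(SAMPLE_LOGS)

if __name__ == "__main__":
    for event in timeline:
        print(event.timestamp.isoformat(), f"[{event.source}]", event.message)
```

Even in this toy case, the merged view puts the disk warning, the iowait spike, and the application timeout next to each other, which is the correlation you otherwise do by hand across machines.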
To be sure, other vendors offer hardware/OS diagnostics with their APM. But DataDog’s approach, adding APM from the infrastructure side (rather than the more common reverse scenario), will offer different insights. So I’m interested in seeing how DataDog’s offering pans out.
I obviously haven’t had the chance to play with DataDog’s new functionality, but it is a space that very much interests me and one where we as an industry need more work. So just the fact that DataDog is integrating these three monitoring layers is important. From here, it has a platform for adding things (security monitoring? inclusive root cause analysis?) or increasing the quality and quantity of across-the-application, metal-to-user information available to Ops and for alerting. In fact, it disappoints me that actual infrastructure isn’t a core part of the APM Conceptual Framework, because the hardware under the VM that is hosting the app server for your app really does matter when performance degrades.
At a minimum, APM can help you get from problem to solution somewhat faster. If done well, it can alert you to the problem and show you where it is, helping you get from problem to solution even more quickly.