At a company I previously worked at, I helped build the initial infrastructure to support a large cloud security product. It was one of the first times I had the opportunity to build something from nearly the very beginning. Like many operations professionals, I had spent so much of my career up to that point cleaning up other people’s messes and technical debt that I was excited to finally “do things right” from the outset.
Looking back, I wish I had known that building our monitoring platform would be hampered without the ability to retain and analyze both warm and cold log data. Here’s why.
Time Series to the DevOps Rescue
Hindsight is almost always 20/20, and today I can proudly say that my team and I built a killer application-monitoring platform. My goal was to make it easy for Dev and Ops folks to report time series metrics for their applications and build dashboards and alerts for app health. And we did just that.
While many ops colleagues would say that monitoring journeys usually start by capturing application logs with open source tools such as Logstash/Elasticsearch or Graylog2, or by buying commercial services from Splunk or Sumo Logic, not all journeys start that way. My team and I took a different approach: we optimized for time series metrics rather than the more traditional structured log management approach, leveraging newer toolsets such as Librato for real-time operations monitoring and performance analytics. Librato enabled us to scale much faster than traditional log solutions would have and gave us almost instant visibility across our containers, cloud servers and the apps themselves.
Logging Is Too Expensive
Using hosted services such as Librato to manage our time series metrics worked fine for a while. Eventually, however, the cost of SaaS monitoring grew, and since we had the time and energy to bring metrics in-house, we deployed Graphite, the open source, enterprise-grade monitoring tool. Graphite allowed us to keep giving our engineers affordable access to the granular time series metrics they needed to manage their applications.
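Part of what made Graphite so approachable is its plaintext protocol: the Carbon listener accepts simple “path value timestamp” lines over TCP. Here’s a minimal sketch of what instrumenting an app can look like; the hostname and metric name below are placeholders, not details from our actual deployment.

```python
# A minimal sketch of reporting one data point to Graphite's Carbon listener
# over the plaintext protocol ("metric.path value timestamp\n").
# The host and metric name are placeholders, not a real configuration.
import socket
import time

CARBON_HOST = "graphite.example.internal"  # hypothetical Carbon relay
CARBON_PORT = 2003                         # Carbon's default plaintext port


def send_metric(path, value, timestamp=None):
    """Send a single "path value timestamp" line to Carbon over TCP."""
    timestamp = int(timestamp or time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


# Example: report request latency for a service
send_metric("app.checkout.request_latency_ms", 42.7)
```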
While this allowed the business to respond quickly when an issue arose, time series metrics were only part of the solution. They allowed us to see what was happening, but not always why it happened. Time series metrics, and even the more advanced tracing technologies such as Jaeger and OpenTracing, are fantastic for debugging complex distributed systems, but they fall short when you need to investigate and analyze the root cause of a problem, outage or hack.
Digging Into Your Log History Gets Expensive, Quick
Tools such as the ELK stack have been booming, with Elastic alone hitting more than 100 million downloads last year, which shows that DevOps teams want more from traditional log management solutions. They’re looking for simplicity, ease of use and affordability, among other things. With JSON effectively the de facto standard for logging, most teams take log data with its predefined schema and index it into Elasticsearch/Lucene. Unfortunately, the Lucene indexing mechanism has an unintended consequence: a 5 to 10X increase in disk usage. And just like that, ops teams blow through their cloud budgets when they discover that 10GB of JSON data indexed into Elasticsearch can consume 50GB of disk. You wanted that index replicated for disaster recovery as well? Then budget 100GB of storage for every 10GB of logs.
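The back-of-the-envelope math is simple but sobering. The sketch below just restates the example above in code; the 5X expansion factor and single-replica assumption are illustrative, since real ratios depend on your mappings, analyzers and compression settings.

```python
# Back-of-the-envelope estimate of Elasticsearch disk usage for indexed JSON logs.
# The 5X expansion factor and single replica are assumptions for illustration;
# real ratios depend on mappings, analyzers and compression settings.

def estimated_disk_gb(raw_log_gb, expansion_factor=5.0, replicas=1):
    """Raw log volume -> total cluster disk, including replica copies."""
    primary_gb = raw_log_gb * expansion_factor
    return primary_gb * (1 + replicas)


print(estimated_disk_gb(10))              # 10GB of JSON -> 100.0GB with one replica
print(estimated_disk_gb(10, replicas=0))  # 50.0GB with no replication
```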
DevOps Normal: Still the Tough Choice Between Retention and Cost
Unfortunately, most teams choose to simply purge data from their “hot” Elasticsearch clusters as soon as they possibly can to keep costs down. This is far from ideal.
Companies moving to microservices-based architectures, deployed in containers and orchestrated with platforms such as Kubernetes, are going to be in for a rude awakening when they start building out centralized logging platforms to keep up with the volume. Security teams are going to feel the burden as well, as they come under increasing pressure to retain data for long periods for compliance and audit reasons. When the next big web-facing vulnerability comes out, the ability to go back weeks or months will be key to seeing which IP addresses accessed a potentially malicious endpoint. With that data available, teams could then take all those IPs and correlate them with other access logs.
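To make that concrete, here is a rough sketch of that kind of retroactive hunt, assuming archived access logs stored as gzipped, newline-delimited JSON; the field names and endpoint are hypothetical and would need to match whatever schema your services actually emit.

```python
# Sketch of a retroactive hunt through archived access logs: find every client
# IP that hit a suspicious endpoint. Assumes gzipped, newline-delimited JSON
# logs with hypothetical "remote_addr" and "path" fields; adjust to your schema.
import glob
import gzip
import json

SUSPECT_PATH = "/api/v1/debug/exec"  # hypothetical vulnerable endpoint


def suspicious_ips(archive_glob):
    ips = set()
    for filename in glob.glob(archive_glob):
        with gzip.open(filename, "rt", encoding="utf-8") as fh:
            for line in fh:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or corrupt lines
                if event.get("path") == SUSPECT_PATH:
                    ips.add(event.get("remote_addr", "unknown"))
    return ips


# The resulting IPs can then be correlated against other services' access logs.
print(sorted(suspicious_ips("/var/archive/access-logs/*/*.json.gz")))
```

None of this works, of course, if the logs were purged weeks ago to save on storage.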
Is centralized logging any better than it was 10 years ago? Or has the volume of data we generate exploded so much that the tools we have today are just barely keeping up? One thing is certain: this problem is far from solved, and many companies may find themselves drowning in their own data lakes.