Log analysis has the ability to transform teams. It centralizes communication, increases responsiveness, moves efforts from reactive to proactive, reduces time to resolution (TTR), and allows increasingly complex environments to grow without fear. But if log analysis becomes the bottleneck, it reverses all these benefits.
*** Author note: The credibility of this post has come into question due to my relationship with two of the four vendors mentioned in this post. My company, Fixate IO, has had a financial relationship with these vendors. However, the positions and opinions are all my own.
As with all the new DevOps tooling, you will find that it sometimes creates more problems than it solves. Chef and Puppet encourage server sprawl through rapid, unmanaged provisioning. Selenium decreases functional oversight when script coverage is not the same as a human. And log analysis, when limited or poorly fed, can obscure issues across the entire environment. The good news is that with the right tool and the right implementation, these problems can be avoided altogether.
First, the problem. Because organizations that have implemented log analysis rely on it deeply, anything that goes wrong with the platform can hamper the entire delivery pipeline. The reason is that it is connected to nearly every component of your environment, or it should be; that is why logs are so valuable. And it is sticky: there is a high degree of lock-in when you choose a log analysis platform. Normally lock-in is bad, but in DevOps analytics is a must, and it needs a consistent format and “language”. Finally, it has two potential bottlenecks: throughput and storage limits.
I’m assuming that if you have implemented log analysis, you did it correctly. Which means you were deliberate: before implementing, you asked questions and documented expected results; you designed or evaluated your logs beforehand so you knew how you would tag and query them; and you evangelized it, not hoarding it for operations only. If you did not do these things and you already have a system up and running, you have far greater problems.
Even when you have a great log analysis platform implementation a few things can still go wrong. They are:
1.) Not enough throughput: Not being able to ingest logs as they are created is a huge bottleneck. It also kills the value of response time. If you do not store logs in near real-time, you cannot analyze them, and thus you cannot react fast enough to interesting insights. Because throughput is an additional expense, this can also lead companies to limit what they transfer, avoiding the benefit of an integrated environment and forcing a selection bias on the logs you do store. You should always be motivated to store more, provided what you store has been architected to answer real questions.
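To make the throughput point concrete, here is a minimal sketch of one common mitigation: batching log lines so each network round-trip carries many events instead of one. The `BatchingShipper` class and its `send` callback are my own illustration, not any vendor's API.

```python
import time
from typing import Callable, List


class BatchingShipper:
    """Buffer log lines and flush them in batches, so each network
    round-trip carries many events instead of one (hypothetical sketch)."""

    def __init__(self, send: Callable[[List[str]], None],
                 max_batch: int = 100, max_delay: float = 1.0):
        self.send = send              # transport callback (illustrative)
        self.max_batch = max_batch    # flush when this many lines buffered
        self.max_delay = max_delay    # or when this many seconds have passed
        self._buffer: List[str] = []
        self._last_flush = time.monotonic()

    def emit(self, line: str) -> None:
        self._buffer.append(line)
        if (len(self._buffer) >= self.max_batch
                or time.monotonic() - self._last_flush >= self.max_delay):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.send(self._buffer)
            self._buffer = []
        self._last_flush = time.monotonic()


# Usage: seven events arrive, but only three sends go over the wire.
sent = []
shipper = BatchingShipper(sent.append, max_batch=3)
for i in range(7):
    shipper.emit(f"event {i}")
shipper.flush()  # push the final partial batch
```

The same idea (plus compression and retry) is what most log agents do under the hood; the point is that throughput is an architecture choice, not just a line item on the vendor invoice.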
2.) Now, storage limits. Similarly, if you are limited in how long you can store logs, you lose the value of historical analysis. Historical analysis is useful for more than reporting at your next Quarterly Business Review (QBR), where you will be questioned on the root cause and impact of a major outage. Historical data can also show you where your delivery pipeline’s weak spots are, so you can strengthen them as your developers ship code.
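One way to keep history without blowing out your platform's storage limits is to age logs out of the hot tier into cheap long-term storage (such as AWS S3). A hedged sketch; the function name and entry shape are hypothetical:

```python
from datetime import datetime, timedelta


def split_for_archive(entries, retention_days, now):
    """Split log entries into a 'hot' set kept in the analysis platform
    and a 'cold' set destined for cheap long-term storage (e.g. S3).
    Illustrative only; real platforms do this via retention policies."""
    cutoff = now - timedelta(days=retention_days)
    hot = [e for e in entries if e["ts"] >= cutoff]
    cold = [e for e in entries if e["ts"] < cutoff]
    return hot, cold


# Usage: a 30-day hot window keeps the recent deploy, archives the old outage.
now = datetime(2016, 3, 1)
entries = [
    {"ts": datetime(2016, 2, 28), "msg": "recent deploy"},
    {"ts": datetime(2016, 1, 1), "msg": "old outage"},
]
hot, cold = split_for_archive(entries, retention_days=30, now=now)
```

The archived set stays queryable for that QBR root-cause question, without paying hot-tier prices for a year of history.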
3.) And finally, they can turn your team into zombies. If you do not properly design your queries and the logs you are querying, your team will not only waste time trying to get answers; they will become addicted to log deep dives (it’s kind of fun, just like Reddit and Facebook during the day). Oh, and they will start creating their own reports, so that at the end of the year you have so many saved reports you do not know what is good, what is bad, what to save, and what to get rid of. And eventually you blame the platform.
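One way to avoid the zombie deep-dive trap is to design the log schema and the questions together: emit structured (JSON) log lines with a consistent set of fields, so saved queries match field names instead of fragile free text. A minimal sketch; the `log_event` and `ask` helpers are my own illustration, not any vendor's API:

```python
import json

STREAM = []  # stand-in for the log platform's ingest stream


def log_event(service, event, **fields):
    """Emit one structured (JSON) log line with a consistent schema."""
    STREAM.append(json.dumps({"service": service, "event": event, **fields}))


def ask(**criteria):
    """Answer a question designed up front: records matching all criteria."""
    records = (json.loads(line) for line in STREAM)
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]


# Usage: the "which checkouts failed?" question was decided before logging.
log_event("checkout", "order_failed", region="us-east", code=500)
log_event("checkout", "order_ok", region="eu-west", code=200)
log_event("login", "auth_failed", region="us-east", code=401)

failed_checkouts = ask(service="checkout", event="order_failed")
```

When the question exists before the log line does, a deep dive becomes a one-line lookup instead of an afternoon of spelunking.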
This list can double as criteria for evaluating log vendors if you do not yet have a log platform, or are thinking of changing. Some vendors make the problem worse with pricing models that foster deep dives and encourage the log dumping ground. Some of the market leaders and open source tools are the most guilty here. Both also encourage log analysis to be an operations-only tool, which is completely against DevOps.
But there is good news. First, spend a little effort and, no matter who the vendor is, you can fix it yourself. Don’t expect the vendor to make DevOps or analytics successful for you. They can’t, and they won’t. Put in a little effort, damn it!
Second, the more modern, purpose-built platforms have seen these issues and often built around them. Solutions like Loggly, LogEntries, New Relic Insights, and Sumo Logic.
In particular, LogEntries, a tool I have familiarity with, has put specific effort into starting with analysis and insights first, steering clear of the query-first mentality. And if you do get to querying, they offer a regular-expression-based language, a standard language, not one you have to attend special proprietary-language lock-in school for. They also allow for higher storage limits by letting you externalize logs to long-term AWS storage. And finally, they have a Dropbox-style share-with-a-friend program that gives even free-tier users more throughput, showing they understand how important throughput is.
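I won't reproduce any vendor's exact syntax here, but the appeal of a regex-based query language is simply that ordinary regular expressions, a skill your team already has, do the work. A plain-Python illustration with made-up log lines:

```python
import re

# Hypothetical log lines; the point is that a standard regex, not a
# proprietary query language, is enough to pull out what matters.
lines = [
    "2016-03-01 12:00:01 INFO  checkout ok user=42 latency_ms=120",
    "2016-03-01 12:00:02 ERROR checkout failed user=43 latency_ms=900",
    "2016-03-01 12:00:03 INFO  login ok user=44 latency_ms=80",
]

# Which users hit an error? Named groups keep the extraction readable.
error_users = [m.group("user")
               for line in lines
               if (m := re.search(r"ERROR .* user=(?P<user>\d+)", line))]
```

Any engineer who has used `grep` can write and review this; that is the whole argument against proprietary query languages.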
If you have already implemented an older log analysis tool that encourages collection, not analysis, you may find that it still serves you in other ways, and you can consider the move to a more modern approach.
But forget the tool. The biggest problem is you, and your inability to think beyond the now. It is the future that is going to burn you. At that point you can cut bait and leave for a new company that is hopefully doing the right thing, avoiding the cleanup, or you can start slipstreaming the better practices now. They include:
- Run a pre-mortem, or a post-mortem if you are already set up
- Invest in bandwidth and storage; do not limit yourself on either. And stop log favoritism.
- Start sharing reports. Yes this can create its own set of problems too, but they are all solvable.
- Think about what you want from a set of logs, before you start storing them.
- Do not trust the vendor to help your processes or culture, only the execution of their tool.
It is up to you to pick the right tool AND implement it the right way. You are capable of building out a great logging system that does not become the bottleneck, or give your software delivery pipeline some strange degenerative disease.