Let’s take a stroll down memory lane to the beginning of the web application era: when a client had a problem, they called the support team to complain about a service failure or bug. Customer support would write up an incident report and forward it to the operations team, who would examine it, mainly by digging through the logs with whatever helpful information the customer had provided.
You need Log Analysis
Can you imagine counting on end users to report problems today, when a typical log amasses 864,000 entries per day (600 entries per minute) and a network of 15 devices generates 13 million events per day? How many people, and how much time, would be needed to sift through all those files?
The two extreme positions regarding this high volume of data are: “Who needs it? Let’s make some space!” and “I want to keep all the data, all the time, because having more data equals better decisions and who knows when I will need it!” (digital hoarding).
The right approach is probably somewhere in the middle: carefully consider which data are likely to be of use (and which are required for PCI-DSS, HIPAA, SOX, GLBA, or FISMA compliance audits), and focus only on those. But for the big companies that own 80% of all that data right now, this doesn’t help much: even after all this “clean-up,” the volume remains extremely high.
And here is where tools like Splunk come in, making searchable not only data from network traffic, web servers, custom applications, application servers, GPS systems, stock market feeds, and social media, but also preexisting structured databases, logs, config files, messages, alerts, scripts, and metrics.
Splunk offers DevOps team members a centralized view across all of their machine data, enabling threat prevention, detection, and intelligence gathering, while at the same time capturing web interactions and key metrics such as time spent on page, bounce rates, navigation paths, and product performance.
But even though Splunk has essentially become a verb in the world of machine data and is arguably the most feature-rich logging solution, it has its caveats. Besides a price that is prohibitive for small companies, it comes with a steep learning curve and is not very user friendly. Setting up Splunk is a lengthy and tedious process: it is highly unlikely that data imported for the first time will be indexed correctly, so you need a team to work through a time-consuming procedure of formatting the raw data as required and preparing the platform to accurately read each field.
To grapple with complex real-life technical issues, you not only need a very good understanding of the entire system and good intuition about what to query, but also knowledge of Splunk’s Search Processing Language (SPL), which is very powerful but complex.
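To give a taste of that complexity, here is a small illustrative SPL query (the sourcetype and field names are hypothetical, not from any particular deployment) that counts error events per host over the last hour and ranks the noisiest hosts first:

```
error OR failed sourcetype=web_access earliest=-1h
| stats count BY host
| sort -count
```

Even a simple question like this requires knowing SPL’s pipeline syntax, its time modifiers, and the commands available at each stage.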
Every minute the DevOps team spends looking for an issue (a performance problem, bug, security incident, etc.) is time that customers are affected by it, so in recent years the requirements for being a front-runner have changed. Reacting hours or even minutes after an unexpected and unwanted event is no longer acceptable. You can no longer be merely reactive; you need to be proactive in preventing issues. Real-time messages alert you to activities or key metrics that indicate abnormal system behavior, letting you predict problems and act before they happen.
Let’s contrast the Splunk approach with the more modern offerings hitting the market.
A modern log analysis platform like Logentries, a cloud-based log management and real-time analytics service, does this by building a picture of what future performance should look like based on historical activity. If performance or other characteristics deviate from this model, admins are alerted immediately.
Through its visualization platform, Logentries aggregates data in real time and pushes information asynchronously when needed, rather than on a fixed time-based snapshot. This data includes not only events that did happen but also events that should have happened and didn’t (via the Inactivity Alerting feature): credit cards that stop being processed, or website traffic that halts unexpectedly. All this without spending a fortune or hiring an army of experts.
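The core idea behind inactivity alerting is easy to sketch. A minimal version (my own illustration, not Logentries’ actual implementation) flags every event stream that has been silent longer than a threshold:

```python
import time

def check_inactivity(last_event_times, now, threshold_seconds):
    """Return names of event streams with no activity within the threshold."""
    silent = []
    for name, last_seen in last_event_times.items():
        if now - last_seen > threshold_seconds:
            silent.append(name)
    return silent

# Example: payments last seen 10 minutes ago, web traffic 5 seconds ago.
now = time.time()
streams = {"card_payments": now - 600, "web_traffic": now - 5}
print(check_inactivity(streams, now, threshold_seconds=300))
```

With a 5-minute threshold, only `card_payments` is reported as silent; a real service would run such a check continuously and route the result into its alerting pipeline.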
Providing a simple, intuitive, and flexible search, Logentries fetches events using keywords, regular expressions, and field patterns. Logs are supplemented with real-time information from the infrastructure: CPU, memory, and disk I/O. Combined with HipChat, PagerDuty, and Campfire, Logentries provides actionable insights in every phase of operations, from development to deployment to ongoing management and support.
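Keyword and field-pattern search over raw log lines can be illustrated in a few lines of Python. This is only a sketch of the concept (the key=value log format and the `search` helper are invented for the example, not Logentries’ query engine):

```python
import re

LOG_LINES = [
    '2016-03-01T10:00:01 level=ERROR service=checkout msg="card declined"',
    '2016-03-01T10:00:02 level=INFO service=search msg="query ok"',
    '2016-03-01T10:00:03 level=ERROR service=checkout msg="timeout"',
]

# Extract key=value pairs (values may be quoted strings).
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def search(lines, **wanted):
    """Return lines whose extracted fields match every wanted key=value pair."""
    hits = []
    for line in lines:
        fields = {k: v.strip('"') for k, v in FIELD_RE.findall(line)}
        if all(fields.get(k) == v for k, v in wanted.items()):
            hits.append(line)
    return hits

for hit in search(LOG_LINES, level="ERROR", service="checkout"):
    print(hit)
```

A hosted service does the same parsing and filtering at scale, with indexing so that queries stay fast over millions of lines.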
The modern log system is designed for the DevOps methodology, where collaboration is paramount:
- the team Annotation feature lets engineers attach comments, advice, solutions, and relevant system context to events, visible to other DevOps engineers when similar problems are identified later
- Shareable dashboards centralize monitoring by publishing logs, time-stamped data, and trends across the DevOps team and the entire organization
- the Notifications module sends warnings to individuals, groups, or the entire team, using custom tagging and real-time alerting, to catch issues such as app crashes, memory shortages, and request timeouts before they become fatal.
By adding logs from firewalls, routers, vulnerability scanners, and Intrusion Detection/Prevention Systems (IDS/IPS), modern log analysis monitors, explores, and diagnoses system security events in real time and tracks malicious attempts against the network.
These services can be connected to any platform through an API. They also ship with a wide variety of integrations for popular cloud providers (such as AWS) and have agents and hooks for common OS platforms.
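Shipping an event over such an HTTP API typically amounts to POSTing a small JSON payload with a per-account token. A hedged sketch follows; the endpoint URL, token, and payload fields are placeholders I invented for illustration, not any vendor’s real API:

```python
import json
import urllib.request

INGEST_URL = "https://logs.example.com/v1/events"  # placeholder endpoint
API_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token

def build_event(host, level, message):
    """Serialize one log event as a JSON body for the ingestion API."""
    return json.dumps({"host": host, "level": level, "message": message})

def send_event(body):
    """POST the event (network call, shown for shape only)."""
    req = urllib.request.Request(
        INGEST_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_TOKEN,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

body = build_event("web-01", "ERROR", "request timeout")
print(body)
```

In practice you would rarely write this by hand: the vendor agents and cloud integrations mentioned above wrap exactly this kind of call for you.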
Logging can be improved even further, going beyond making sense of an overwhelming volume of data and providing real-time information. New and interesting features appear all the time; let’s take one that caught my eye while researching this article: passing complete information directly from production machines’ logs to development tools. Because log verbosity in production is often limited by performance constraints, a company called Takipi came up with the idea of jumping from a log-file error straight into a recorded debugging session, showing the source code and variable values at the moment of the error.
With the Internet of Things slowly emerging and creating limitless possibilities for connecting everything, the volume of data that needs to be logged and analyzed will only continue to expand into brontobytes and geopbytes, requiring new technologies that are difficult to predict. This makes the effort of extracting value from these logs critical, and it means the approach of hoarding all logs and learning complex query languages cannot scale.
Ten or twenty years from now we will look back and smile at the “legacy” logging software of the first decade of the century, when the biggest data centre (the Utah Data Centre) was only capable of storing 12 exabytes of data.