Log Management: What DevOps Teams Need to Know

One of the most important shifts of this decade is the rise of interconnected data across distributed systems. This explosive growth has not only sparked the need to rethink cloud and global IT strategies, but also disrupted traditional development, DevOps and ITOps practices. These changes are pushing technology professionals to develop features and products at higher velocity while also responding faster to support requests. In turn, teams are evaluating new tools and agile processes that support the needs of a competitive modern business.

Regardless of the pace of change within your group, basic functions such as infrastructure monitoring are crucial to operating production applications. At this new speed of business, concepts such as observability are important for enabling successful deployments with fewer user interruptions, not only in production but also in the early stages of your continuous integration/continuous delivery (CI/CD) pipeline. As part of “shifting left,” log and event management in your CI/CD pipeline lets developers monitor and observe application behavior before releasing it to production. While some might say this adds layers of process, time, effort and due diligence in dev environments, it can help you evolve your organization’s fundamental software development behaviors while reducing the number of preventable issues that reach production. It can create a smoother user experience and reduce the need to fight fires or rearchitect your solution after the application is live in production and at scale.

Setting the Testing and Dev Stage

The rapidly evolving technology landscape has increased the need for log management and observability across distributed systems and containers. Several changes are driving this: how applications and services are created, the ability to deploy applications across combinations of global infrastructure-as-a-service (IaaS) providers, the pervasiveness and volatile lifespan of containers, and the capacity to build services using various development languages. Together, they have increased the need to collect, monitor and trace data points across the connected systems that provide critical end user-facing value.

Traditionally, many considered log management a tedious task because each new server instance required you to run a search command across the logs local to that server. That approach isn’t scalable when clusters of more than 20 servers are built up in minutes and each application has its own logging format. In this scenario, you’d need to run 20 or more individual searches across multiple systems, compare timestamps across servers residing in different time zones and so forth. With the decreasing cost and increasing prevalence of containers and virtual systems, the number of systems running within a typical organization has grown rapidly as the business prospers.
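To make the timestamp problem concrete, here is a minimal Python sketch of what an aggregation layer does on your behalf: it converts every timestamp to UTC and merges the per-server logs into a single timeline. The server names, log lines and line layout (an ISO 8601 timestamp with a UTC offset, followed by the message) are assumptions made up for illustration, not the format of any particular product.

```python
from datetime import datetime, timezone
import heapq

# Hypothetical per-server logs. Each line is assumed to start with an
# ISO 8601 timestamp that carries its own UTC offset.
SERVER_LOGS = {
    "web-01": [
        "2024-03-01T09:00:05-05:00 request started",
        "2024-03-01T09:00:09-05:00 request failed: upstream timeout",
    ],
    "db-01": [
        "2024-03-01T15:00:07+01:00 slow query detected",
    ],
}


def parse_line(server, line):
    """Split a raw line into (UTC timestamp, server, message)."""
    stamp, message = line.split(" ", 1)
    utc_time = datetime.fromisoformat(stamp).astimezone(timezone.utc)
    return utc_time, server, message


def merge_logs(server_logs):
    """Merge all per-server logs into one timeline ordered by UTC time."""
    streams = [
        sorted(parse_line(server, line) for line in lines)
        for server, lines in server_logs.items()
    ]
    # heapq.merge combines already-sorted streams without re-sorting everything.
    return list(heapq.merge(*streams))


for utc_time, server, message in merge_logs(SERVER_LOGS):
    print(utc_time.isoformat(), server, message)
```

Once everything is expressed in UTC, the slow query on db-01 falls between the two web-01 events, an ordering that is easy to miss when comparing local timestamps by eye.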

In addition to increased usage of virtual systems and containers, independent scrum teams of developers can build in parallel using the technologies that best fit the services they are producing, while also sharing environments for builds, staging and production. With these distributed services, an aggregated log system helps teams quickly determine the root cause when debugging their application’s errors, exceptions or performance bottlenecks. A log aggregation system can reduce the element of “personalization” that often creeps in when developers use different technologies with various logging formats, which can cause confusion or slow productivity when one developer is left to understand another’s non-standardized, yet shared, workload. It also reduces noise by filtering out information that delays the team’s ability to find the source of an issue.
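As a rough illustration of that normalization and noise filtering, the Python sketch below maps two different logging formats onto one common record shape and then keeps only the errors. Both formats, the JSON key names (severity, app, msg) and the function names are hypothetical, chosen only to show the idea.

```python
import json
import re

# Hypothetical plain-text format used by one team: "LEVEL [service] message"
PLAIN_PATTERN = re.compile(
    r"^(?P<level>[A-Z]+) \[(?P<service>[^\]]+)\] (?P<message>.*)$"
)


def normalize(raw):
    """Return a dict with level, service and message, or None if unrecognized."""
    raw = raw.strip()
    if raw.startswith("{"):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            return None
        # Hypothetical JSON keys used by another team: severity / app / msg.
        return {
            "level": str(record.get("severity", "INFO")).upper(),
            "service": record.get("app", "unknown"),
            "message": record.get("msg", ""),
        }
    match = PLAIN_PATTERN.match(raw)
    if match:
        return match.groupdict()
    return None


def errors_only(lines):
    """Normalize every line, then drop everything that is not an error."""
    records = (normalize(line) for line in lines)
    return [r for r in records if r and r["level"] in {"ERROR", "CRITICAL"}]


sample = [
    "ERROR [billing] card declined",
    '{"severity": "error", "app": "cart", "msg": "stock lookup failed"}',
    "INFO [billing] heartbeat",
]
for record in errors_only(sample):
    print(record["service"], record["message"])
```

After normalization, a developer can search one field, level, instead of learning each team's conventions first.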

The reduced cost, increased collaboration and prevalence of container services are collectively driving an uptick in virtual environments and containers. As a result, more logs are often created in testing and dev environments, which can force organizations to change how they work in order to enable full observability across their infrastructure.

The 3 Pillars of Observability

Businesses are in the midst of adopting the three pillars of observability, a three-pronged method of monitoring that takes data from several vantage points to depict a more accurate and consumable view of overall health and stability:

External Monitoring: Health checks, for example, run against your internal and external applications and websites to show the “digital experience” of your users.
Metrics and Distributed Tracing: Tracing communications between applications distributed across systems and containers lets you identify errors and exceptions and resolve latency quickly.
Events and Logs: This data provides contextual information about events and, combined with information from the first two pillars, enables you to identify issues in the code.

These types of information are critical for DevOps teams that provide elastic and resilient services. Bringing application, system and infrastructure events into logs together with external monitoring and metric collection is quickly becoming the standard. With powerful log-parsing features such as live-tail search accessible from a web browser and the ability to visually pair log insights with added health metrics, the real value of log data becomes apparent with little effort.

Site reliability engineering (SRE) is another area where significant evolution has occurred. ITOps has always been asked to react, running from fire to fire. The evolution from ITOps to DevOps to SRE is forcing different types of data to be aligned and used very early in a CI/CD pipeline, and it enables SRE teams to get involved in architectural discussions at project inception instead of in the latter stages or at release. The SRE is more involved in monitoring, in understanding the development and architectural principles used to build application services, and in building in automation, high availability and circuit breaker patterns from inception.

This approach has allowed teams to be proactive and responsive rather than reactive. It has equipped our teams (which keep our business running successfully and our customers loving what we do) with the ability to focus on being scalable, elastic and resilient. Instead of troubleshooting systems in production, we auto-scale and replace them to minimize disruptions, then investigate the incident offline when possible.

Best Practices

Log and event management has become critical for everyone involved in building, supporting and even using mission-critical applications. Three trends are behind this: the increasing pace of application development required to support the needs of the business, the spread of containerized applications and virtual environments, and the emergence of new disciplines for observability (due to lowered costs and increased reliability). Log management is moving toward a more defined practice in which logs are compartmentalized and access control is defined:

Compartmentalizing Logs: This is a critical best practice because it lets us differentiate development, staging and production environments and segregate logs based on practical grouping. Using a log aggregation service provides you with a consistent experience and a common set of capabilities across all your environments.
Access Control: This is another best practice that should be implemented more broadly. For example, developers don’t need access to all of your production logs if approximately 95 percent of their work happens prior to reaching the production environment. In other words, even though tech pros are working to “de-silo” log management, we still want to silo access control and manage visibility. This can also help to ensure personally identifiable information (PII) isn’t exposed to anyone it shouldn’t be.
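
The two practices above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the role names, the role-to-environment policy, the record fields and the email-only redaction rule are all hypothetical, and a real log platform would enforce such rules in its own access-control layer rather than in application code.

```python
import re

# Hypothetical policy: which environments each role may read.
ROLE_ENVIRONMENTS = {
    "developer": {"development", "staging"},
    "sre": {"development", "staging", "production"},
}

# Simplified email matcher used as a stand-in for PII detection.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(message):
    """Mask email addresses so they are not shown to log readers."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", message)


def visible_logs(role, records):
    """Return only the records this role may see, with emails masked."""
    allowed = ROLE_ENVIRONMENTS.get(role, set())
    return [
        {**record, "message": redact(record["message"])}
        for record in records
        if record["environment"] in allowed
    ]


records = [
    {"environment": "staging", "message": "login failed for ana@example.com"},
    {"environment": "production", "message": "payment accepted"},
]
for record in visible_logs("developer", records):
    print(record["environment"], record["message"])
```

Tagging each record with its environment is what makes the access rule a simple lookup, which is one practical reason to compartmentalize logs in the first place.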

Looking Toward the Future of Log Management

As the landscape evolves and best practices become more defined, we should expect revitalized strategies with elevated expectations for the traditional scenarios where basic/free monitoring is leveraged. As the pillars of observability are used more broadly in the ITOps and DevOps communities, we’ll be able to move more efficiently from an end user-facing incident (for example, an application error traced across distributed systems) straight into the logs to find the specific error causing the issue. We’ll also see greater correlation between logs and other custom metrics, as well as much broader use of these techniques outside of production environments.
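
One way to picture the jump from a traced incident straight into the logs is a shared identifier carried by every log record. The Python sketch below assumes structured records with a hypothetical trace_id field; the field name, the sample services and the helper functions are illustrative and not taken from any specific tracing product.

```python
from collections import defaultdict

# Hypothetical structured log records; "trace_id" is an assumed field name.
RECORDS = [
    {"trace_id": "abc123", "service": "gateway", "level": "INFO",
     "message": "POST /checkout"},
    {"trace_id": "abc123", "service": "payments", "level": "ERROR",
     "message": "card processor timeout"},
    {"trace_id": "def456", "service": "gateway", "level": "INFO",
     "message": "GET /health"},
]


def index_by_trace(records):
    """Group log records by the trace they belong to."""
    index = defaultdict(list)
    for record in records:
        index[record["trace_id"]].append(record)
    return index


def first_error(trace_id, index):
    """Return the first ERROR record logged for a trace, or None."""
    for record in index.get(trace_id, []):
        if record["level"] == "ERROR":
            return record
    return None


index = index_by_trace(RECORDS)
culprit = first_error("abc123", index)
if culprit is not None:
    print(culprit["service"], culprit["message"])
```

Given the trace ID of a failed checkout, the lookup goes directly to the payments service error instead of requiring a search through every service's logs.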

DevOps teams need an intuitive event and log management system that is as de-siloed as we are. No longer can log management systems be relegated just to one on-premises storage system or one virtual environment; they need to fit the reality of how work is increasingly done today and will be in the future.

— Keith Kuchler