Logging is often considered an automatic process that doesn’t require much thought. We see logs from almost every system and take for granted that those logs contain the information we need. However, thinking through a logging strategy and architecture is extremely important for troubleshooting and performance efficiency. Regardless of how you log, the most important part of logging, after simply having logs in the first place, is having logs you can actually use. Actionable logs provide enough information, enough detail and enough history that you can use them to get something done.
Let’s examine an example system: an application with a microservices architecture, running on a Kubernetes cluster that lives on top of an RHEL box. By default, you get logs from the operating system and from Kubernetes. But to get logs for the different microservices, you have to build logging into your application code as you write it. So how do you set up useful, actionable logs?
Three Main Considerations
Consider the following elements when you think about building actionable logs.
Your logs should provide enough information to answer ‘when,’ ‘who’ and ‘where’ for any event. Take access logs as an example. The log line for an access event should tell you when that access occurred; this is the timestamp. It should also tell you who accessed the system. Often, this detail comes in the form of an IP address or another identifier, and the ‘who’ should cover not just human users, but also other systems acting on yours. Always log every access! Finally, the log line should tell you where the system was accessed, meaning which service or which part of the platform was reached.
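To make the ‘when,’ ‘who’ and ‘where’ concrete, here is a minimal Python sketch of a single access-log line. The field names (timestamp, actor, actor_type, target, action) and the service names are hypothetical, chosen only for illustration; the point is that each question maps to an explicit field rather than being buried in free text.

```python
import json
from datetime import datetime, timezone


def access_log_line(actor: str, actor_type: str, target: str, action: str) -> str:
    """Build one access-log line that answers when, who and where.

    All field names here are hypothetical; agree on your own with your teams.
    """
    record = {
        # When: an unambiguous UTC timestamp in ISO 8601 format
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Who: an IP address or other identifier, human or system
        "actor": actor,
        "actor_type": actor_type,  # e.g. "user" or "service"
        # Where: which service or part of the platform was accessed
        "target": target,
        "action": action,
    }
    return json.dumps(record)


# A human user and another system are logged the same way.
print(access_log_line("203.0.113.7", "user", "orders-service", "GET /orders/42"))
print(access_log_line("billing-service", "service", "orders-service", "GET /orders/42"))
```

Logging system-to-system access with the same fields as human access keeps the ‘who’ complete without a separate format.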
Your logs should provide enough details to tell you exactly what happened during any single event. Simply answering the who, when and where doesn’t actually tell you what happened, and a basic answer such as “This pod started,” doesn’t provide enough detail to differentiate a code change triggering a pod restart from a system crash triggering the orchestrator to restart pods.
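As a sketch of what ‘enough detail’ can look like, the Python example below logs the same surface event (a pod starting) with fields that record why it started. The event name, the reason and detail fields, and the pod and image names are all hypothetical; they are not Kubernetes’ own event schema.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orchestrator-events")


def pod_started(pod: str, reason: str, detail: str, previous_exit_code=None) -> dict:
    """Record *what* happened, not just that a pod started.

    Field names are hypothetical and for illustration only.
    """
    event = {
        "event": "pod_started",
        "pod": pod,
        "reason": reason,          # why the pod (re)started
        "detail": detail,          # human-readable specifics
        "previous_exit_code": previous_exit_code,  # None if there was no prior crash
    }
    log.info(json.dumps(event))
    return event


# Same surface event, two very different causes:
pod_started("orders-7f9c", "rollout", "image orders:1.4.2 replaced orders:1.4.1")
pod_started("orders-7f9c", "crash_restart", "container exited unexpectedly", previous_exit_code=137)
```

With a reason field in place, telling a code change apart from a crash becomes a filter on one field rather than guesswork.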
Finally, your logs should provide enough history to understand why something happened. Taken together, logs give you the context around the action that was taken. A single log line doesn’t provide enough context to understand the “why” behind any action or event, and writing log lines to a terminal somewhere without storing them is pointless. How could you tell whether an event was normal for a system without an easily accessible history?
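A minimal Python sketch of keeping history: the same log lines go to the terminal and to a rotating file, so there is something to look back on. The logger name, file sizes, backup count and the count_events helper are assumptions for illustration; in a real system the history would usually be shipped to central storage rather than kept on one host.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path: str) -> logging.Logger:
    """Log to the terminal and persist the same lines to a rotating file."""
    logger = logging.getLogger("history-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers from any earlier call so lines are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    console = logging.StreamHandler()  # live view in the terminal
    console.setFormatter(fmt)
    # Persisted history; size limit and backup count are arbitrary example values.
    to_file = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=10_000_000, backupCount=5, encoding="utf-8"
    )
    to_file.setFormatter(fmt)
    logger.addHandler(console)
    logger.addHandler(to_file)
    return logger


def count_events(log_path: str, needle: str) -> int:
    """Crude history check on the current file only (not rotated backups)."""
    with open(log_path, encoding="utf-8") as fh:
        return sum(1 for line in fh if needle in line)


path = os.path.join(tempfile.mkdtemp(), "app.log")
logger = build_logger(path)
for _ in range(3):
    logger.warning("cache miss for tenant acme")
print(count_events(path, "cache miss"))  # prints 3 for this fresh log file
```

Even a check this crude answers a question a terminal-only log cannot: has this happened before, and how often?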
All three of these components are necessary to create a plan that will lead to actionable logs: logs you can actually use to understand what happened, why it happened, and how you can act to handle the situation. If you can answer who, what, where, when and why from your logs, you probably have actionable logs and are well on your way to meeting or exceeding logging best practices.
Once you’ve thought through those higher-level considerations, you can think about the nitty-gritty of the logs themselves. When I think about the practicalities of logging, I focus on three main considerations: log levels and data needs; format and structure; and security and compliance.
Log Levels and Data Needs
Using log levels properly ensures that you collect and store the data that you need, and only that data. Logging can get very noisy, very fast in modern environments. Log levels let you fine-tune your logs to store only what you need, which helps you find information quickly in an emergency rather than digging through the haystack for the proverbial needle. In addition, when logging gets noisy, people are more likely to keep it turned on if they can reduce the noise and fine-tune the information, rather than having only an on-off switch.
Different programming languages use different combinations of log levels, but most of them fall into the general categories of major, minor and debug logs. Major logs, as you might guess, should be on in all environments. Minor logs are generally turned on in environments where you need to understand the data flow without necessarily knowing the minute details. Finally, debug logs are generally turned on in dev or QA environments, and only when you need to know every detail, such as when you’re hunting for issues that could turn into problems later on. Log levels aren’t necessarily set automatically, so you will need to make sure the right levels are set in your code, but this extra step is worthwhile.
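Here is one way the major/minor/debug split might look in Python’s standard logging module, which has DEBUG, INFO, WARNING, ERROR and CRITICAL levels. The mapping of those levels onto major, minor and debug, the environment names and the APP_ENV variable are all assumptions for illustration, not a fixed convention.

```python
import logging
import os

# Hypothetical mapping of environment -> minimum level that gets emitted.
# Assumed here: "major" ~ WARNING/ERROR/CRITICAL, "minor" ~ INFO, "debug" ~ DEBUG.
LEVEL_BY_ENV = {
    "prod": logging.WARNING,   # major events only
    "staging": logging.INFO,   # major + minor: follow the data flow
    "qa": logging.DEBUG,       # every detail
    "dev": logging.DEBUG,
}


def configure_logging(env: str) -> logging.Logger:
    logger = logging.getLogger("orders-service")
    # Unknown environments fall back to the quietest setting.
    logger.setLevel(LEVEL_BY_ENV.get(env, logging.WARNING))
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger


logger = configure_logging(os.environ.get("APP_ENV", "prod"))
logger.debug("payload parsed: %d line items", 3)  # emitted in dev and QA only
logger.info("order accepted")                     # emitted in staging, QA and dev
logger.error("payment provider timed out")        # emitted everywhere
```

Driving the level from configuration rather than code means turning up detail during an incident is a config change, not a redeploy of new logging statements.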
Format and Structure
You need to think about who or what is the primary audience of your logs. If you expect to sift through the logs yourself, you likely won’t mind a text-based format. However, most modern logging relies on machines to parse the logs and make the mounds of data more useful. In that case, you should use structured logging; JSON is the de facto standard for structured logs, thanks to the plethora of tools available for working with it.
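For machine-parsed logs, a minimal sketch in Python is a custom formatter that renders every record as one JSON object per line. Passing structured fields through extra={"fields": {...}} is a convention assumed here for illustration, as are the logger name and field names; dedicated structured-logging libraries offer richer versions of the same idea.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # applies %-style args
        }
        # Assumed convention: structured fields arrive via extra={"fields": {...}}.
        # Note that a field with the same name as a core key overwrites it.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            payload.update(fields)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("order placed", extra={"fields": {"order_id": "A-1001", "items": 3}})
```

Because each line is self-contained JSON, log tooling can filter on order_id or level directly instead of relying on fragile text matching.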
Security and Compliance
Logging for security and compliance could be (and has been!) a full article on its own. The short version, however, is that logging is a key pillar of a compliant system that is aware of, and actively monitoring, its security. If your stack doesn’t log everything to one or more append-only log files, stored in one or more secure locations with a read-only archive kept in a separate location, then you’re at risk of an attacker getting in and modifying or deleting logs so you can’t track their actions across your system.
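The Python sketch below shows only the application-side half of that idea: writing with the O_APPEND flag so each write lands at the end of the file, and copying the log into a separate archive directory with read-only permissions. The function names and paths are hypothetical. O_APPEND does not stop a process with write access from truncating or deleting the file, and file permissions do not stop root; real protection comes from filesystem attributes (for example chattr +a on Linux), write-once storage, and an archive on a separate system.

```python
import os
import shutil
import stat
import tempfile


def append_line(path: str, line: str) -> None:
    """Append one line; O_APPEND makes each write land at the end of the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o640)
    try:
        os.write(fd, (line.rstrip("\n") + "\n").encode("utf-8"))
    finally:
        os.close(fd)


def archive_read_only(path: str, archive_dir: str) -> str:
    """Copy a log into a separate archive location and mark the copy read-only."""
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, os.path.basename(path))
    shutil.copy2(path, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444
    return dest


# Hypothetical paths; in practice the archive would live on another system.
work = tempfile.mkdtemp()
log_path = os.path.join(work, "audit.log")
append_line(log_path, "user=203.0.113.7 action=login target=admin-console")
append_line(log_path, "user=203.0.113.7 action=logout target=admin-console")
archived = archive_read_only(log_path, os.path.join(work, "archive"))
```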
Your logging processes shouldn’t store personally identifiable information (PII) in clear text or behind an easily broken cipher, either. If you definitely need to reference PII in your logs in some form, it’s better to keep the PII itself in a secured, hashed database and log only a unique identifier (UID) that correlates to the database record. If you don’t need the PII, strip it out entirely.
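A minimal Python sketch of the UID approach: the PII goes into a vault, and only an opaque UID reaches the log line. The PiiVault class is a hypothetical, in-memory stand-in for a secured database. Using a keyed hash (HMAC-SHA256) so that the same value always maps to the same UID is a design choice assumed here, not something the approach requires.

```python
import hashlib
import hmac
import logging
import secrets
import uuid


class PiiVault:
    """Hypothetical stand-in for a secured PII database (in-memory for illustration)."""

    def __init__(self, key: bytes):
        self._key = key            # secret key for the keyed hash; never log it
        self._uid_by_digest = {}   # keyed hash of the PII -> UID, so repeats reuse a UID
        self._pii_by_uid = {}      # UID -> PII; in practice this lives in protected storage

    def tokenize(self, pii: str) -> str:
        digest = hmac.new(self._key, pii.encode("utf-8"), hashlib.sha256).hexdigest()
        uid = self._uid_by_digest.get(digest)
        if uid is None:
            uid = str(uuid.uuid4())  # random, so the UID reveals nothing about the PII
            self._uid_by_digest[digest] = uid
            self._pii_by_uid[uid] = pii
        return uid

    def lookup(self, uid: str) -> str:
        """Authorized callers can resolve a UID back to the PII when truly needed."""
        return self._pii_by_uid[uid]


logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("signup")
vault = PiiVault(secrets.token_bytes(32))

email = "jane.doe@example.com"
log.info("password reset requested user_uid=%s", vault.tokenize(email))  # no email in the log
```

Anyone reading the logs can still correlate events for the same user through the UID, while resolving it back to a person requires access to the vault.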
Getting Buy-In
All of these tips are fine and dandy until you realize that you can’t get enough buy-in from others to ensure everyone follows the same best practices. It’s generally best to agree on logging standards and practices across an organization. Shared standards also help in an emergency: anyone coming in from a different system starts from the same baseline and doesn’t have to learn another team’s conventions first. Ops teams working across many systems will appreciate the standards, as well. To get everyone to the table in the first place, start by appealing to each team’s needs. Ask people about their problems with logging and listen to their responses. You’ll likely need to address assumptions, misconceptions, complacency and resentment. The key is listening; you won’t get buy-in until everyone feels heard and validated.
To get everyone on the same page, first identify, as an entire organization, what you need from your logs based on the components of actionable logs. Make sure everyone has a seat at the table and the chance to discuss and unearth all needs, so no one is tempted to go around the team’s decision. One team’s needs will not match another’s, so all viewpoints must be heard and accounted for. Then, group related needs and set standards for them. Once you’ve agreed on your new standards, work together to update your systems. This will take time, just like resolving any technical debt! Start with smaller or less-critical systems, so you can watch how the standards play out in practice and how well everyone’s needs are met. Adjust, if you need to, with everyone’s agreement. Needs and standards will change, just as your systems will, and the standards need to adapt to keep meeting your teams’ needs. If you take the time to bring everyone to the table and make sure everyone is on board, you’ll be better off.
What You Need to Succeed
Logging may seem like simple magic that ‘just works,’ but there’s a lot to think about before you can have a successful, useful system. You need actionable logs that have enough information, details and history to be useful. You need to think about the data you need for all of your various systems, how you’ll structure that data and how you’ll ensure you stay secure and compliant. Finally, you need to ensure your team is on board with standards so logging can be as useful as possible. If you do all of that, though, you’ll find logging becomes a great way to communicate across teams and organizations to ensure you deliver the best product possible.