Every company is trying to get better insight into its operational effectiveness, and every one of them is running into the same problem: scale. So what does a scalable monitoring strategy look like, and how can you safeguard against the most significant issue in observability?
What is a Scalable Monitoring Strategy?
We’ll begin by identifying the two things most impacted by scale: cost and performance. Cost can be broken down into storage and compute. It is obvious that holding more data requires more storage, but compute matters just as much, because queries need to search through more data, and therefore burn more compute, to return a result.
This creates a trade-off between performance and cost-effectiveness. As the dataset grows, it becomes more and more challenging to keep queries resolving instantly, so engineers will often simply tolerate the performance decline. That, in turn, hurts usability and the general usefulness of the system: if every query takes 10 seconds to resolve instead of one, you’ve just cut your data discovery efficiency by a factor of 10.
Scaling Means Performance and Cost-Effectiveness
To scale a monitoring strategy, a wise architect needs to break this standoff between performance and cost and approach the problem like a data scientist. A scalable monitoring strategy starts with a simple question: What is the use case for this data?
Some data is only ever ingested because it might be needed; it spends its entire lifetime undisturbed and is eventually compressed or deleted. Other data is queried every minute of every day and is integral to operations. Knowing how data is used means an engineer can decide what that data is actually worth.
In the battle to create a scalable monitoring strategy, there’s a three-step approach.
Step One: Track Data Usage
Which data is ingested and never queried? Which data never spends any time at rest? Build a map of how often data is consumed so that any strategic choice is informed by the use case; a rough sketch of what such a map might look like follows the list of use cases below.
It is most common to group data into three different use cases:
Frequent Access – Data that is constantly queried and needs to be available at a moment’s notice.
Monitoring – Data that drives dashboards or trains machine learning models but which is largely useless after it has been processed.
Compliance – Data that is held in case it is needed but is only queried occasionally. For example, audit logs.
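To make this concrete, here’s a minimal sketch of what building that usage map could look like, assuming the frequently accessed data lives in an OpenSearch cluster (one of the options discussed below). The endpoint URL, the bucketing thresholds and the classify_indices helper are illustrative assumptions rather than a prescribed implementation, and since the underlying counters reset with the cluster, a real version would sample them over time.

```python
import requests

# A minimal sketch: pull per-index indexing and search counters from an
# OpenSearch-style _stats endpoint and bucket each index into one of the
# three use cases. URL and thresholds are illustrative assumptions.
OPENSEARCH_URL = "http://localhost:9200"  # assumed local cluster

def classify_indices():
    stats = requests.get(f"{OPENSEARCH_URL}/_stats/indexing,search").json()
    usage_map = {}
    for index, data in stats["indices"].items():
        ingested = data["total"]["indexing"]["index_total"]  # docs written
        queried = data["total"]["search"]["query_total"]     # queries served
        ratio = queried / ingested if ingested else 0
        if ratio > 1:        # read more often than written
            bucket = "frequent-access"
        elif queried > 0:    # consumed by dashboards or pipelines, then idle
            bucket = "monitoring"
        else:                # ingested but never read: compliance candidate
            bucket = "compliance-candidate"
        usage_map[index] = {"ingested": ingested, "queried": queried, "bucket": bucket}
    return usage_map

if __name__ == "__main__":
    for index, info in classify_indices().items():
        print(f"{index}: {info}")
```

The query-to-ingest ratio is a crude signal, but even a crude map like this is usually enough to reveal the indices that are paid for every day and read almost never.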
Step Two: Optimize Storage and Consumption
Once the use case of data—whether it’s logs, metrics or traces—is understood, the next stage is to optimize. This means creating different storage and query solutions for the different usages we’ve seen above.
Frequent Access – Rapid queries that can be optimized and tuned. OpenSearch is a good option, although management overhead can be painful, especially at scale.
Monitoring – This is largely about transforming data: for example, ingesting logs, converting them into metrics and deleting the original log lines. This is very powerful because metrics take up far less space than logs and can be stored much more cost-effectively (a minimal sketch of this transformation follows the list).
Compliance – Low-cost storage like Amazon S3 is a good option, but this data must still be accessible, even if simply by reindexing.
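As a rough illustration of that logs-to-metrics transformation, the sketch below rolls raw log lines up into per-minute counters by severity and keeps only the counters. The log format, the metric name and the logs_to_metrics helper are assumptions made for the example, not a standard pipeline.

```python
from collections import Counter
from datetime import datetime

# A minimal sketch of the logs-to-metrics idea: aggregate raw log lines
# into per-minute counters keyed by severity, then drop the originals.
def logs_to_metrics(log_lines):
    counters = Counter()
    for line in log_lines:
        # Assumed format: "2024-01-01T12:00:03Z ERROR payment failed ..."
        timestamp, level, *_ = line.split(" ", 2)
        minute = datetime.fromisoformat(timestamp.replace("Z", "+00:00")) \
                         .strftime("%Y-%m-%dT%H:%M")
        counters[(minute, level)] += 1
    # Only these small metric records are kept and shipped to storage;
    # the raw log lines they summarize are discarded.
    return [
        {"metric": "log_events_total", "minute": minute, "level": level, "value": count}
        for (minute, level), count in counters.items()
    ]

if __name__ == "__main__":
    sample = [
        "2024-01-01T12:00:03Z ERROR payment failed for order 1234",
        "2024-01-01T12:00:17Z ERROR payment failed for order 1235",
        "2024-01-01T12:00:41Z INFO payment succeeded for order 1236",
        "2024-01-01T12:01:02Z ERROR payment failed for order 1237",
    ]
    for metric in logs_to_metrics(sample):
        print(metric)
```

The counters grow with the number of distinct minute-and-severity combinations, not with log volume, which is why this approach stores so much more cheaply.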
How Easy is it to Build These Capabilities?
It is trivial to create an OpenSearch cluster, but it’s challenging to manage one at scale. Likewise, it’s simple to convert logs to metrics, but doing so performantly at scale is complex. Holding onto compliance logs as they grow almost without bound can be difficult, and reindexing that data is a non-trivial operation when dealing with large volumes.
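To illustrate why that reindexing step deserves planning, here’s a minimal sketch of rehydrating an archived compliance log file from S3 back into an OpenSearch index so it can be queried again. The bucket, key and index names are hypothetical, and a production version would need batching, retries and rate limiting, which is exactly where the difficulty at large volumes shows up.

```python
import json
import boto3
import requests

# A minimal sketch of rehydrating archived compliance logs: fetch an
# NDJSON archive object from S3 and bulk-index it back into OpenSearch
# so it becomes queryable again. Names below are hypothetical.
OPENSEARCH_URL = "http://localhost:9200"  # assumed cluster endpoint

def reindex_archive(bucket: str, key: str, target_index: str) -> None:
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Build a bulk request: one action line plus one document line per log event.
    bulk_lines = []
    for doc in body.splitlines():
        bulk_lines.append(json.dumps({"index": {"_index": target_index}}))
        bulk_lines.append(doc)
    payload = "\n".join(bulk_lines) + "\n"

    resp = requests.post(
        f"{OPENSEARCH_URL}/_bulk",
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()

# Example usage with hypothetical names:
# reindex_archive("audit-log-archive", "2024/01/01/audit.ndjson", "audit-restored")
```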
However, the key takeaway from this analysis should be that these are capabilities any organization needs if it intends to scale its monitoring. If they can be obtained through a SaaS vendor, that is a serious option to consider, because it lets a company take advantage of them immediately, without the upfront, unpredictable and often ongoing cost of in-house engineering.