Last issue, I cited CPU utilization as an example of a metric that is often misused to describe/explain/infer system performance and asserted that improved visibility can help to overcome such misuse. In this issue, I will expand on what improved visibility means and present two case studies that illustrate a recent antipattern that I’ve noticed in visibility: Interference caused by visibility interfaces.
I will describe some properties of ideal visibility interfaces with some real-world analogies. Together with two case studies of performance problems stemming from interfering monitoring interfaces, I will explain why interference can be so damaging to observability efforts. Because so few libraries and applications provide visibility through ideal interfaces, I also continue to argue for better visibility primitives from our operating systems.
Clarifying Visibility
Observability is a big deal in the DevOps and SRE communities. We cannot know how well our systems are doing without visibility into their state. Unlike Scotty on Star Trek, we absolutely should not rely on gut feeling to intuit the health of the engines that drive our enterprise systems. We have to build our systems to make their internal state visible to the outside world.
But what is not often discussed is the degree of visibility and the associated cost. A system’s visibility is not a binary “yes/no” attribute. Just as in the physical world, there are qualities to visibility that impact how observable a system actually is:
- Is the visibility continuous or is it discrete (in either time or space)?
- Does the act of observing perturb the system?
- Can everyone observe or can only a few agents observe and report to everyone else?
- Can everyone observe at the same time or is it one at a time?
Let’s ground these attributes of visibility by looking at some examples in the physical world. In the first article of this series, I mentioned that there is a lack of standardized language used in performance engineering. As far as I know, there is no standardized language around the attributes of visibility, so the adjectives that I am using are by no means “standard”–but the distinctions they identify are important.
Continuous Vs. Discrete Visibility
When in London, if I can see one of Big Ben’s1 four faces, I can always know what time it is just by looking up. My access to the time of day is continuously available. But if I can’t see any of the faces, I can still rely on the chimes—except that the bells only chime once every 15 minutes. This means that if I can’t see any of Big Ben’s four faces, my access to the time of day will be discrete.
If I witness an accident, the continuous access to time allows me to precisely note the time of the accident. If I have to fall back to relying on the chimes, the time that I attribute to the accident becomes less precise.
Likewise, in our cars we often need to know what our current speed is (e.g., as we approach speed traps or enter school zones). It would be absolutely useless if our speedometer reported our speed in five-minute snapshots. I do not want to know how fast I was going three minutes ago. I need to know how fast I am going now!
All other things being equal, continuous visibility is preferable. Being able to examine a system’s state on demand is far preferable to having to wait for some arbitrary interval to pass before that information becomes available—unless there is an excessive cost to examining the system state in real-time.
Overhead, Interference and Multiple Viewer Contention
Measurement overhead is likely the type of cost that most people are familiar with—and the type of measurement cost that many of us focus on. The time to place down a ruler to measure the dimensions of a sheet of paper or the time that it takes to make inter-process communication calls to obtain a service’s state are costs associated with taking a measurement. Of course, we want this measurement overhead to be as low as possible, but we can sometimes live with overhead because the cost is mostly borne by the observer.
Then there are measurements that interfere with the operation of the system being measured. For example, take the annual physical exams that our doctors recommend. The purpose of these exams is to take measurements of the state of our bodies to detect health regressions. Pretty important stuff. But these exams are intrusive. We have to take time off from work, break our daily routine and trek over to the doctor’s office to get poked and prodded. Some people even skip their annuals because it is so disruptive to their lives. Many of the exciting advances in consumer medical devices essentially facilitate more continuous monitoring of vitals so we can rely less and less on these intrusive annual physicals. Again, continuous visibility is preferred!
In computer systems, it is virtually impossible to take software-based measurements that don’t contend with the target software for hardware resources. So, implementing visibility in software means that some interference is unavoidable. But, contending for software resources like software locks and critical regions is another matter entirely. As such, we must be cognizant of contention for software resources when designing our systems for visibility. This is the essence of the Non-Interference Prime Directive for Visibility.
A Non-Interference Prime Directive From the PMWG

The participants in the Performance Management Working Group (PMWG) had different goals and priorities. My priorities lay in establishing a minimally intrusive, low-level visibility interface for operating system and process metrics. This became the Data Capture Interface (DCI) layer in the Universal Measurement Architecture specification. A key part of the DCI included this statement about the performance impact of monitoring:

1.2.2 Performance
The addition of any metrics acquisition subsystem should not noticeably affect the performance of the measured system. (Many performance tool builders assert that system performance should not be altered by more than 5% when there is measurement activity.) Although it is beyond this specification to stipulate a performance degradation figure (that figure belongs in an implementation's design specification), the performance goal does impose a requirement that the programming interfaces specified in this document be capable of being implemented in the most efficient manner possible on the target operating system.

I actually wanted something stronger than a generic statement about performance impact in the specification and thought we should provide a reference implementation that would embody the key principles of low overhead, low interference and low contention. To that end, I prototyped a version of UNIX SVR4 where the kernel sysinfo data structure used by the sar/sadc utilities was placed in a page-aligned area of kernel address space. I also made sure that the rest of the page was not occupied by any other data (to address security concerns) and arranged for that page of kernel address space, which is normally protected against reads and writes from user space, to be readable (but still not writable) by all user processes at a fixed address in the user address space (it was a prototype to demonstrate how "things could be").

Sadc (the data collection component of sar) normally accessed the sysinfo data structure via /dev/kmem (a special file that gives file access semantics to the kernel virtual address space). On my prototype system, I modified sadc to simply dereference the fixed virtual address at which I had placed sysinfo. No open, lseek and read—just a memory dereference. On the 3B2/400 system on which my prototype ran, running sadc once a second normally took almost 5% of a single CPU. On my prototype system, I was able to run sadc 100 times a second with almost no measurable CPU usage.

Additionally, in my prototype, EVERY process had read-only access to the sysinfo data via simple memory dereference. The system became super visible—with virtually no overhead. Everyone could look at sysinfo at any time at the cost of a memory access. Visibility Nirvana!

Alas, Nirvana nixed; paradise lost. PMWG participants did not wish to put the requirements of a reference implementation on their kernel developers. Instead, we opted for the weaker statement about performance you see above.
Just as with annual physical exams, a computer system measurement implementation that results in the slowing down of the system presents a tremendous disincentive to taking that measurement.2 Measurement interference is something that we need to design out of our implementations—not design into them.
For example, if a system maintains a linked list of objects that are actively being added and removed, and traversal and manipulation of this linked list is protected by read and write locks, an implementation of a monitoring interface that needs to traverse this collection of objects in read-only fashion should not participate in this locking protocol. This statement might rankle some software engineers who are used to “data integrity at all costs,” but this highlights some of the differences in priorities between general software engineering and the practical needs of performance engineering.3 Implementing monitoring interfaces that cause software interference is counterproductive—they won’t be usable when they are most needed.
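To make the distinction concrete, here is a minimal C sketch (all names hypothetical) of one way to honor the directive: the monitoring path never touches the lock-protected list at all. Instead, the writers maintain a few plain aggregate counters alongside the list, and the monitor simply loads those counters, accepting momentarily stale or inconsistent values in exchange for guaranteed non-interference.

```c
/*
 * Minimal sketch (hypothetical names): decoupled monitoring counters.
 * Worker threads update the linked list under its rwlock as usual, but
 * also bump plain aggregate counters.  The monitoring path reads only
 * the counters, never the list, and never takes the lock, accepting
 * momentarily stale values in exchange for guaranteed non-interference.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct conn_stats {
    atomic_ulong adds;      /* total insertions             */
    atomic_ulong removes;   /* total removals               */
    atomic_ulong length;    /* current length (approximate) */
};

static struct conn_stats stats;
static pthread_rwlock_t conn_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Writer path: the list itself is protected; the counters are not. */
void conn_add(void)
{
    pthread_rwlock_wrlock(&conn_lock);
    /* ... link the new object into the list ... */
    pthread_rwlock_unlock(&conn_lock);

    atomic_fetch_add_explicit(&stats.adds, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&stats.length, 1, memory_order_relaxed);
}

/* Monitoring path: no lock, no list traversal, just loads. */
void monitor_report(void)
{
    printf("adds=%lu removes=%lu length~=%lu\n",
           atomic_load_explicit(&stats.adds, memory_order_relaxed),
           atomic_load_explicit(&stats.removes, memory_order_relaxed),
           atomic_load_explicit(&stats.length, memory_order_relaxed));
}
```

The point is not the specific counters but the decoupling: the observer pays a memory-access cost and never queues behind, or blocks, the code doing the real work.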
So far, we have only considered a single observer of our systems. When there are multiple observers, we also have to ask whether the observers interfere with each other. There are plenty of examples in real life where this happens. For example, if ten people are trying to measure the dimensions of a piece of paper, physical constraints probably allow at most two or four measurements to be taken at a time (assuming an organized pipeline of participants measuring width and then height). An even more realistic example of serialized observer access might be the portholes in the lower decks of a cruise ship. Some of the smaller portholes only allow one person to view the outside at a time. So, when a pod of whales swims by, viewers have to queue and take turns to gain access to observe them.
With computer systems, measurement contention between multiple observers also tends to discourage observation—not because everyone is so polite but because no one wants to get stuck in a line.
Fragmented Vs. Holistic Visibility
A final dimension of visibility is related to the field of vision. In the physical world, this is illustrated by comparing the porthole view from inside a cruise ship with the wide open, unobstructed view from the top deck.4 Compared to the top deck, portholes only provide a partial view of the ocean. One sometimes needs to stitch together the views from multiple portholes to get a more complete view of the outside world. The importance of the more complete view is the inherent bigger picture—where each fragment provides context for and about neighboring fragments.
In computer systems, visibility is inherently fragmented. Each component provides a visibility portal (or not) into its own state. Like the portholes on the lower decks of cruise ships, we can approximate a holistic view if we are able to query the state of each component in a reasonably timely fashion. Again, this is only possible if each component supports continuous visibility. Unfortunately:
- Many key low-level components do not provide any visibility.
- Many other components only provide discrete visibility.
- Some limit visibility even further by providing discrete visibility through network sockets with endpoints that are off-host. As a result, co-located components are unable to examine each other’s visibility interfaces in any practical manner.
In turn, the opaque, fragmented views that result from the discrete visibility in many components result in a fragmented mentality among developers. Engineers develop a mindset that they are implementing visibility only for themselves and their components, rather than thinking about their visibility as part of a whole—which provides a circular argument for justifying the discrete view they fall back on.
Rather than thinking of passengers on a cruise ship and portholes, think of the components co-executing on a host as workers in a factory—with data moving through and between them. The current fragmented view places each worker in their own room, where they are unable to see the state of the other workers. With a holistic view supported by continuous visibility into one’s neighbors, workers can make better plans and decisions. And “factory monitors” can get a more holistic view of events and better correlate different components’ states. Being able to build useful context around observations is a key attribute of observability.
In contrast, modern UNIX and Linux systems support continuous visibility for operating system and process metrics through the /proc interface. We can query /proc for a supported metric at any time. The utility (and success) of the /proc interface is unquestioned–being able to access process and system metrics through the filesystem namespace is a fantastic idea.5
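As a small illustration of what continuous visibility feels like in practice, here is a hedged C sketch that samples this process's resident set size directly from /proc whenever it is called, with no daemon, agent, or remote call involved (the file layout is Linux-specific; other UNIX /proc implementations differ, see proc(5)).

```c
/* Minimal sketch: on-demand ("continuous") visibility via Linux /proc.
 * Any process can read these files at any time; no daemon, no RPC.
 * Field layouts are Linux-specific (see proc(5)). */
#include <stdio.h>

/* Return this process's resident set size in pages, or -1 on error,
 * by reading the second field of /proc/self/statm. */
long self_rss_pages(void)
{
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f)
        return -1;
    long size, rss;
    int n = fscanf(f, "%ld %ld", &size, &rss);
    fclose(f);
    return (n == 2) ? rss : -1;
}

int main(void)
{
    printf("rss: %ld pages\n", self_rss_pages());
    return 0;
}
```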
Applications and libraries have grown accustomed to the ability to query standard system metrics at will. If this mode of continuous visibility were to spread to the full stack, it stands to reason that the ecosystem would evolve to benefit from this enhanced observability. This should provide an impetus for the operating system to provide primitives to facilitate continuous visibility for the entire stack.
But while we wait for visibility nirvana, there are some practical problems that need to be addressed. Some important visibility interfaces that currently exist are violating the prime directive in a big way.
Prime Directive Violations
I delivered a talk at SRECon 2021 entitled Latency Distributions and Micro-benchmarking to Identify and Characterize Kernel Hotspots, in which two of the hotspot case studies are examples of standard, commonly-used visibility interfaces that violate the Non-Interference Prime Directive. Let’s examine those two cases.
When the Cost of Running Netstat is Too Dang High
We recently discovered that netstat can seriously delay the creation and destruction of UNIX domain sockets (UDS) on Solaris. More specifically, in Solaris, the kstat mechanism to query for the list of all active UDS sockets (an ioctl call) can take seconds of CPU time, especially when there are tens of thousands of these sockets in the system. By itself, this makes logical sense and is not a problem, since more sockets mean more data transferred from the kernel to user space. But the impact of a long netstat run time on actual UDS socket creation and destruction is surprising (and disappointing).
Here is the output of a program I wrote which collects a logarithmic histogram of the socket() and close() times for creating and destroying UDS sockets 100K times on a Solaris 11.3 system:
|        | <1us | <10us | <100us | <1ms  | <10ms | <100ms | <1sec | >1sec | max       |
|--------|------|-------|--------|-------|-------|--------|-------|-------|-----------|
| socket | 0    | 0     | 72556  | 27433 | 11    | 0      | 0     | 0     | 1967055ns |
| close  | 0    | 98262 | 907    | 815   | 16    | 0      | 0     | 0     | 1364475ns |
Now, here are the results from the same program, but this time I’ve concurrently run a single instance of netstat -f unix:
|        | <1us | <10us | <100us | <1ms  | <10ms | <100ms | <1sec | >1sec | max         |
|--------|------|-------|--------|-------|-------|--------|-------|-------|-------------|
| socket | 0    | 0     | 39006  | 60976 | 18    | 0      | 0     | 0     | 2137995ns   |
| close  | 0    | 96837 | 1388   | 1757  | 15    | 2      | 1     | 0     | 308073600ns |
Look at how the single netstat call has pushed the maximum close time from about 1.4ms to over 300ms!6 And the distribution of socket() times, which was clustered in the 10-100usec range, has shifted to be clustered more in the 0.1-1msec range. The act of running netstat has a huge impact on the time to create and close UDS sockets.
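For readers who want to reproduce this kind of measurement, here is a minimal sketch of the approach (not the exact program used above, and with error handling omitted): time each socket() and close() call with a monotonic clock and drop the latency into power-of-ten buckets.

```c
/* Minimal sketch (not the exact program used above) of a logarithmic
 * latency histogram for UNIX domain socket create/close.  Buckets are
 * <1us, <10us, <100us, <1ms, <10ms, <100ms, <1s, >=1s.
 * Error handling is omitted for brevity. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>

#define ITERS    100000
#define NBUCKETS 8

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void record(uint64_t ns, uint64_t *buckets, uint64_t *max)
{
    uint64_t limit = 1000;                      /* first bucket: < 1us */
    int i;
    for (i = 0; i < NBUCKETS - 1 && ns >= limit; i++)
        limit *= 10;
    buckets[i]++;
    if (ns > *max)
        *max = ns;
}

int main(void)
{
    static const char *labels[NBUCKETS] =
        { "<1us", "<10us", "<100us", "<1ms", "<10ms", "<100ms", "<1s", ">=1s" };
    uint64_t sock_hist[NBUCKETS] = {0}, close_hist[NBUCKETS] = {0};
    uint64_t sock_max = 0, close_max = 0;

    for (int i = 0; i < ITERS; i++) {
        uint64_t t0 = now_ns();
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        uint64_t t1 = now_ns();
        close(fd);
        uint64_t t2 = now_ns();

        record(t1 - t0, sock_hist, &sock_max);
        record(t2 - t1, close_hist, &close_max);
    }

    for (int i = 0; i < NBUCKETS; i++)
        printf("%-7s socket=%llu close=%llu\n", labels[i],
               (unsigned long long)sock_hist[i],
               (unsigned long long)close_hist[i]);
    printf("socket max %lluns, close max %lluns\n",
           (unsigned long long)sock_max, (unsigned long long)close_max);
    return 0;
}
```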
Disabling netstat is not the right long-term solution, as doing so would eliminate key visibility. Artificial throttling of netstat usage is also not the right long-term solution, as this compromises the timeliness of visibility (i.e., continuous visibility). The right thing to do is to eliminate the tight coupling between netstat monitoring and the UDS implementation.
When PS Stands for “Pretty Slow”
Before Linux aficionados smugly dismiss monitoring interference as a Solaris-only problem, there is potentially an even bigger problem on Linux. There, the monitoring contention arises from accessing /proc itself and is of the "contending observers" type.
Let’s look at running concurrent instances of “ps auxww” on one of our Linux development boxes with 36 cores and 72 hardware threads (RHEL 7.6):
| Concurrency | Wall time (avg) | User time | Sys time |
|-------------|-----------------|-----------|----------|
| 1           | .984            | .081      | .902     |
| 2           | 1.269           | .071      | 1.013    |
| 4           | 3.043           | .066      | 1.429    |
| 8           | 4.106           | .070      | 1.396    |
As we can see, with increasing concurrency, the average wall time for ps completion goes up (along with system CPU time). The ps instances compete with one another!7 This is even more surprising since ps is essentially a read-only operation.
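A rough sketch of how such an experiment can be harnessed follows (hypothetical and simplified: it reports the elapsed wall time for the whole batch rather than the per-instance averages and CPU times shown above).

```c
/* Minimal sketch of the concurrency experiment: fork N children, each
 * exec'ing "ps auxww" with output discarded, and report the elapsed
 * wall time once all have finished.  Error handling kept minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1;    /* concurrency level */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            int devnull = open("/dev/null", O_WRONLY);
            dup2(devnull, STDOUT_FILENO);
            execlp("ps", "ps", "auxww", (char *)NULL);
            _exit(127);                        /* exec failed */
        }
    }
    while (wait(NULL) > 0)
        ;                                      /* reap all children */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("concurrency %d: %.3f s wall\n", n, secs);
    return 0;
}
```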
Further tests showed that /proc traversals for processes and threads not only contend with each other, but also contend with access to other, non-process-related special files under /proc (e.g., /proc/vmstat). This is significant because so many libraries and applications depend on non-process data available through /proc to make runtime decisions.
My colleague Gary Liku will be presenting additional findings around this /proc contention problem at SRECon22 EMEA in late October.
What Happened?
In the old days, tools like ps and netstat would obtain their data by opening the special file /dev/kmem – a file-based interface to the kernel virtual address space. If we knew the virtual address for an object and its size, we could simply seek to the virtual address in that file and read() the object. The decoupled nature of this read meant that it was impossible to do any synchronization or coordination with actual code that operated on the object. An object could change in the middle of a read.8 As mentioned earlier, the expectations around monitoring data consistency can be looser than the standard data flows that most software engineers deal with. This decoupled data access adhered to the Non-Interference Prime Directive and suited the monitoring use case very well.
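For those who never used it, the /dev/kmem pattern looked roughly like the following sketch (the structure and address are placeholders; the real address was traditionally looked up in the kernel symbol table with nlist(3), and elevated privileges were required).

```c
/* Sketch of the old /dev/kmem access pattern: given a kernel virtual
 * address, seek to it and read the object.  The address and structure
 * below are placeholders; this requires privileges and a kernel that
 * still exposes /dev/kmem. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

struct sysinfo_placeholder {        /* stand-in for the real kernel struct */
    unsigned long cpu[5];
    unsigned long runque, runocc;
};

int read_kernel_object(off_t kaddr, void *buf, size_t len)
{
    int fd = open("/dev/kmem", O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = -1;
    if (lseek(fd, kaddr, SEEK_SET) == kaddr)
        n = read(fd, buf, len);     /* no locks, no coordination with kernel */
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

int main(void)
{
    struct sysinfo_placeholder si;
    off_t sysinfo_kaddr = 0;        /* placeholder: traditionally found via nlist(3) */

    if (read_kernel_object(sysinfo_kaddr, &si, sizeof(si)) == 0)
        printf("runque=%lu runocc=%lu\n", si.runque, si.runocc);
    return 0;
}
```

The prototype described in the PMWG sidebar replaced this entire open/lseek/read sequence with a single dereference of a fixed, read-only mapped address: equally decoupled, but orders of magnitude cheaper.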
In modern UNIX systems, the use of /dev/kmem has fallen out of favor–and for good reason. In addition to the performance overhead of seek/read to access kernel objects, /dev/kmem gave all-or-nothing access to the full kernel virtual address space. This meant that applications that did monitoring via /dev/kmem had to also be trusted with elevated privileges.
The general purpose decoupled access through /dev/kmem has been replaced with (potentially) coupled access to kernel objects through interfaces like /proc on many UNIX and Linux systems. But, as we have concluded, just because coupled access is allowed does not mean that it should be exercised without consideration of the interference cost.
Assuming that a visibility interface will be used sparingly or sporadically often turns out to be a bad assumption. The more useful a visibility interface, the more often it will be used.
Visibility in User Space
As mentioned earlier, while the operating system provides continuous visibility interfaces (e.g., /proc) for accessing well-known system and process metrics, application code generally does not. It can be argued that, in avoiding continuous visibility interfaces, applications are also able to avoid the problem of interference and observer contention–and that would be a valid argument. But, I feel that the ideal of holistic views of the co-located components of a system is so compelling that applications should either (1) lobby for the operating system to provide proper continuous visibility primitives for applications (/proc provides an excellent framework for building such primitives) or (2) implement their own continuous visibility interfaces.9
Java and the JVM provide examples of how continuous visibility interfaces can work in user space. Very early on, Java’s designers saw the benefit of continuous visibility into JVMs and came up with the Java Management Extensions (JMX) framework. Through this framework, the JVM, Java libraries, and Java applications are able to expose metrics through getters on management beans. These getters can be discovered through a common namespace. JMX also allows external clients to invoke these getters through a standardized remote procedure call mechanism.
Getters can be implemented as simply as returning the current value of an object that summarizes code state, or implemented with arbitrarily complex code–including invoking getters on other management beans or querying remote databases. But, the overhead and non-interference concerns outlined earlier should apply to getters implemented for performance visibility.
Finally, it is interesting to note that Oracle (and perhaps other) JVMs also implement a lower-level metric exposure mechanism that leverages a memory-mapped file to provide a shared memory interface to JVM metadata and some low-level metrics.10 Even with the continuous visibility provided by JMX, JVMs also recognize the advantages of lower overhead, more decoupled, coordination-free visibility utilizing shared memory.
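To illustrate what such a shared-memory approach can look like in user space, here is a hedged C sketch (hypothetical file layout and names, in the spirit of the mechanisms above rather than a copy of any of them): the application maps a small file and updates counters in place, and any observer can map the same file read-only and load current values at memory-access cost, with no syscall, lock, or RPC per sample.

```c
/* Minimal sketch (hypothetical layout) of shared-memory metric exposure:
 * the application mmaps a small file and updates counters in place; any
 * observer can mmap the same file read-only and load current values at
 * memory-access cost, with no syscall or lock per sample. */
#include <fcntl.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct metrics_page {
    char         magic[8];          /* e.g. "METRIX1" for readers to check */
    atomic_ulong requests;
    atomic_ulong errors;
};

struct metrics_page *metrics_create(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct metrics_page)) != 0) {
        close(fd);
        return NULL;
    }
    struct metrics_page *m = mmap(NULL, sizeof(struct metrics_page),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (m == MAP_FAILED)
        return NULL;
    memcpy(m->magic, "METRIX1", 8);
    return m;
}

/* Application hot path: one relaxed atomic increment per event. */
void metrics_count_request(struct metrics_page *m)
{
    atomic_fetch_add_explicit(&m->requests, 1, memory_order_relaxed);
}
```

An observer opens the same path O_RDONLY, maps it with PROT_READ, checks the magic field, and dereferences the counters; writer and readers never coordinate.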
We should all keep the prime directive in mind and design our visibility mechanisms in the least intrusive, most decoupled manner that is practical. And, as users, we should require that the visibility mechanisms we rely on adhere to the prime directive.
1Yes, I know; Big Ben is not actually the name of the tower with the clock faces: https://en.wikipedia.org/wiki/Big_Ben
2These implementations are not useless. But they are generally relegated to being used for performance debugging – often in non-production environments.
3Similar compromises are taken when building loosely coupled distributed systems—especially in extremely large systems where eventual consistency is an acceptable norm. We need to think of the design of monitoring systems as loosely coupled with the system being observed.
4Recent insights from cognitive neuroscience suggest that this complete visual view is actually an illusion created by our brains (https://www.science.org/doi/10.1126/sciadv.abk2480), but the concept of an accurate holistic view of the state of systems remains an ideal goal.
5Unfortunately, no one has chosen (yet) to implement a low cost, low interference, low contention mechanism like the one I prototyped for UNIX SVR4.
6The amount of impact of netstat on UDS socket creation and close turns out to be a function of the number of UDS sockets on the system. The more UDS sockets, the greater the impact.
7This behavior gets more pronounced as the number of lightweight processes/threads in the system increases because the overhead appears to grow quadratically.
8My memory access prototype, which was orders of magnitude less expensive than a read of /dev/kmem, was also decoupled in a similar manner.
9Everywhere I go, I re-implement my libmetrix C library—metric registration and exposure through memory mapped files which approximates what I believe would be ideal metrics primitives based on /proc. An ideal metrics registration and exposure interface for user-level libraries and applications should be implemented under /proc itself.
10Similar in spirit to my aforementioned libmetrix library.
Thanks
Thanks again to everyone who provided feedback on earlier drafts of this installment. And again, special thanks to Peter Wainwright for helping to get my thoughts better organized for this installment.