Latency can have an enormous impact on user experience. Here’s what you need to know to mitigate it before it hurts your customers.
Businesses have been digitizing for decades, but recent events have accelerated the process beyond all expectations. Customers are now using more digital services than ever, and their switching costs are often extremely low. That puts enormous pressure on businesses to deliver not just online services but also consistently high performance. Businesses that once measured performance in seconds now measure it down to the millisecond and even the microsecond. For that reason, they’re eager to minimize latency and its impact on their bottom line.
Latency can be defined as the delay before a transfer of data begins following an instruction for its transfer. However we define it, its effects are clear: A study by Amazon found that 100 milliseconds (ms) of latency cost 1% in sales. A similar study by Google found that a half-second (500 ms) lag in returning search results reduced traffic by 20%. A study by Loadstorm showed that latency can have similar effects on any business’s conversion rates. Long latencies lead to abandoned shopping carts and frustrated users who leave your site to find the same information, product or service elsewhere.
The moral: As businesses try to convert users and move them through their pipeline, latency can be the greatest enemy. But how do you measure it, and what can you do about it?
Given the impact of latency on the user experience, it’s surprising how many businesses don’t give it enough attention—or even measure it properly. A common mistake is to focus on average performance, which is ultimately a theoretical measurement that might not reflect actual end user experience.
A better, more meaningful way to measure real-world latency is with percentiles.
Latency is typically calculated in the 50th, 90th, 95th, and 99th percentiles, commonly referred to as p50, p90, p95, and p99. Imagine 10 latency measurements: 1, 2, 5, 5, 18, 25, 33, 36, 122, and 1000 milliseconds (ms). A p50 measurement represents the median performance of the system. In this case, the p50 measurement is 18 ms, meaning 50% of users experienced that latency or less. The p90 measurement is 122 ms, meaning that 9 of the 10 latencies measured 122 ms or less.
Each measurement within a given percentile reflects a real latency that actually affected at least one end user, and likely many more.
By contrast, the mean of our 10 measurements comes to roughly 125 ms, not even close to the midpoint of the observed data; in fact, it sits slightly above the 90th percentile.
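To make the arithmetic concrete, here is a minimal Python sketch that reproduces these numbers using the nearest-rank method (production monitoring tools may interpolate between samples, so their results can differ slightly):

```python
# Nearest-rank percentiles vs. the mean for the sample measurements above.
import math

samples_ms = [1, 2, 5, 5, 18, 25, 33, 36, 122, 1000]

def percentile(values, pct):
    """Smallest sample such that at least pct percent of samples are <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))   # 1-based rank
    return ordered[rank - 1]

print("p50 :", percentile(samples_ms, 50))           # 18 ms
print("p90 :", percentile(samples_ms, 90))           # 122 ms
print("p99 :", percentile(samples_ms, 99))           # 1000 ms
print("mean:", sum(samples_ms) / len(samples_ms))    # 124.7 ms -- above the p90
```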
It is also vital to keep the outliers in mind: the so-called “long-tail latencies.” Under real-world conditions, p99 (and higher) latencies are often the ones you need to worry about most.
Gil Tene, the CTO of Azul Systems, addresses this problem in his presentation, “How NOT to Measure Latency.” He notes that p99 latencies can easily affect a far larger proportion of users than you might expect. If a typical user session involves five page loads averaging 40 resources per page, about 18% of users will experience at least one response longer than even the p99.9. He also points out that for more than 95% of users to get through such a session without ever seeing a response worse than a given threshold, that threshold has to sit at roughly the p99.97.
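The arithmetic behind those figures is straightforward, assuming independent requests; the session size (five pages of 40 resources) comes from Tene’s example:

```python
# Probability arithmetic behind the long-tail observations above.
pages, resources_per_page = 5, 40
requests = pages * resources_per_page            # 200 requests per session

# Chance a session sees at least one response slower than the p99.9:
print(f"{1 - 0.999 ** requests:.1%}")            # ~18.1%

# Percentile a latency ceiling must sit at so that ~95% of sessions never
# exceed it: solve p ** 200 = 0.95 for p.
p = 0.95 ** (1 / requests)
print(f"p{p * 100:.2f}")                         # ~p99.97
```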
As Google engineer Luiz André Barroso put it:
“When a request is implemented by work done in parallel, as is common with today’s service-oriented systems, the overall response time is dominated by the long tail distribution of the parallel operations. Every response must have a consistent and low latency or the overall operation response time will be tragically slow.”
In other words, the slowest operation in a set of parallel operations usually defines the user experience. It doesn’t matter that 99% of the operations finish quickly if the other 1% holds everything up. This explains why it’s not enough to keep p95 latencies blazing fast. You have to keep long-tail latencies (p99 and up) consistently low as well.
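To see how quickly the tail takes over in a fan-out, here is a toy simulation; the latency distribution (99% of operations near 5 ms, 1% near 500 ms) is an illustrative assumption, not a measurement:

```python
# Toy fan-out simulation: the user-visible latency of a request is the
# slowest of its parallel backend operations.
import random

random.seed(42)

def one_operation_ms():
    # 99% of operations take ~5 ms; 1% land in the long tail around 500 ms.
    return random.gauss(5, 1) if random.random() < 0.99 else random.gauss(500, 50)

def fanout_request_ms(n_parallel=100):
    return max(one_operation_ms() for _ in range(n_parallel))

trials = [fanout_request_ms() for _ in range(10_000)]
slow = sum(t > 100 for t in trials) / len(trials)
print(f"Requests dominated by a tail operation: {slow:.0%}")   # ~63%, i.e. 1 - 0.99**100
```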
Causes of Long-Tail Latency
To fix the problem of long-tail latency, we need to understand why it occurs. Long-tail latency has many causes, but, significantly, it is not usually caused by application-specific problems or normal network lag. Instead, long-tail latency has systemic causes, usually related to problems in the underlying infrastructure. These can include pauses due to Java Virtual Machine (JVM) “garbage collection,” context switches, database repair, cache flushes and so on. This makes long-tail latency very tricky to diagnose and fix, as it’s often a “whack-a-mole” exercise.
Disk I/O and Network Bottlenecks
The main causes of long-tail latency are network and system bottlenecks. IT commonly invests heavily in the best components available: fast solid-state drives (SSDs, including NVMe devices), CPUs and memory. But it often shoots itself in the foot by configuring a “narrow” network that bottlenecks all that highly performant hardware.
SSD settings are also commonly misconfigured. Cloud SSDs have two associated throttles: burst and sustained. Background operations such as streaming and compaction can chew up the sustained throughput, which may be one-tenth of the burst rate. While a backend process hogs I/O, there is little capacity left for frontend, user-facing operations. Some transactions will be able to draw on the burst capacity, but steady-state frontend traffic will start backlogging and eventually time out.
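Here is a back-of-envelope sketch of that failure mode; the throughput figures are illustrative assumptions, not any vendor’s published limits:

```python
# Burst vs. sustained throughput budget, with background I/O eating the budget.
burst_mb_s      = 1000   # short-burst throughput the disk allows
sustained_mb_s  = 100    # sustained throughput (here, 1/10 of burst)

compaction_mb_s = 80     # background compaction/streaming traffic
frontend_mb_s   = 40     # steady user-facing traffic

headroom = sustained_mb_s - compaction_mb_s
print(f"Sustained headroom left for frontend traffic: {headroom} MB/s")

if frontend_mb_s > headroom:
    backlog_per_s = frontend_mb_s - headroom
    print(f"Once the burst allowance is spent, the backlog grows by "
          f"{backlog_per_s} MB/s until queries start timing out.")
```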
Database performance is strongly influenced by the combined speed of CPU and memory on the one hand and disk speed on the other. CPU tends to be faster, while disk tends to be slower. SSDs are catching up with CPUs, but the rule still generally holds.
This speed mismatch is most often a problem for write workloads bounded by disk speed. It can arise either because the disk is slow or because the payloads are large; large payloads shift the bottleneck to the disk. When the time comes to write to disk, any relative disk slowness causes queries to back up in the buffer, and the result is outlier p99 latencies.
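A toy queue simulation illustrates the effect; the arrival and drain rates below are illustrative assumptions:

```python
# Writes land in an in-memory buffer instantly but drain to disk at a fixed
# rate; even with matched average rates, bursts make the backlog (and p99) grow.
import random

random.seed(1)
drain_per_tick = 100                 # writes the disk can flush per tick
queue, waits = 0, []

for _ in range(10_000):
    queue += random.randint(80, 120)             # bursty incoming writes
    queue -= min(queue, drain_per_tick)          # disk drains what it can
    waits.append(queue / drain_per_tick)         # approx. wait for a new write

waits.sort()
print(f"p50 wait: {waits[len(waits) // 2]:.1f} ticks")
print(f"p99 wait: {waits[int(len(waits) * 0.99)]:.1f} ticks")
```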
Exceeding Your Latency Budget
Every system is designed with a latency budget: the relationship between the latency it must deliver and the throughput it is expected to sustain. p95 and p99 latencies are a function of the throughput a given system is designed to support. You can think of it in terms of highway traffic: throughput is the number of lanes on the highway. More lanes mean the highway can carry more vehicles, and hence more trips. Latency corresponds to the speed limit; many lanes combined with a low top speed still reduces the overall number of trips. The relationship between the two defines the latency budget.
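One common way to make that relationship concrete is Little’s Law, which ties throughput, latency and in-flight work together; the figures in this sketch are illustrative assumptions:

```python
# Little's Law: in-flight requests = throughput x latency.
throughput_rps = 20_000      # requests per second the system must sustain
latency_s      = 0.005       # 5 ms typical service time

print(f"In-flight requests at 5 ms: {throughput_rps * latency_s:.0f}")    # 100

# If latency degrades to 50 ms at the same offered load, the system must hold
# ten times as much work in flight -- or start queueing and timing out.
print(f"In-flight requests at 50 ms: {throughput_rps * 0.050:.0f}")       # 1000
```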
Ideally, if you just add more throughput, you should be able to reduce latencies, right? Not necessarily. Just as freeway traffic expands to fill an extra lane, still leaving you stuck in a jam, you can easily overwhelm systems that weren’t well designed to begin with, even as you add throughput. In fact, trying to add more throughput can degrade all of your latencies. In the networking world, that kind of poor architectural choice is called bufferbloat.
If you push more throughput into a system while other bottlenecks remain, you only end up with longer latencies, potentially to the point of transactions timing out. That can exacerbate a bad situation, since every failed transaction will attempt one or more retries, adding yet more load.
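The retry amplification is easy to quantify with a small sketch; the timeout rate and retry count here are illustrative assumptions:

```python
# How timeouts plus retries inflate the load offered to an already-struggling system.
base_load_rps = 10_000   # requests per second clients actually need
timeout_rate  = 0.20     # fraction of attempts that exceed the client timeout
max_retries   = 2        # extra attempts a client makes after timeouts

extra = sum(base_load_rps * timeout_rate ** attempt for attempt in range(1, max_retries + 1))
print(f"Offered load: {base_load_rps + extra:,.0f} rps "
      f"({extra / base_load_rps:.0%} extra work from retries)")
```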
Systems that handle big data have several considerations. These include the most obvious hardware-oriented factors: storage (the reason NVMe SSDs are so prevalent these days), CPU cores, memory (RAM) and network interfaces.
Modern servers running Intel Xeon Scalable Processor Platinum chips are common across AWS, Azure and Google Cloud. Server sizing comes down to how much throughput each of these multicore chips can support.
Additional considerations get into how your database system was designed to handle data distribution and replication for high availability, peer-to-peer versus leader/follower designs, how it minimizes hops and redirects between servers, effective use of caching, how efficiently it utilizes all the hardware available to it and so on.
These are all components that define a system’s latency budget. It’s important to understand and stay within those limitations. When they are exceeded, outlier latencies are virtually assured to occur.
Latency and the Database
So far we’ve treated long-tail latency as a systemwide phenomenon. Now let’s consider the role of the database. Most HTTP and API requests result in at least one database call, and often several, frequently a mix of reads and writes. Hence, the database plays a central role in the end user experience of latency.
NoSQL databases were invented specifically to address the need for low-latency, globally distributed data stores. The basic premise of the NoSQL database is the ability to “scale out” on cheap, commodity hardware. When more capacity is needed, administrators simply add servers to database clusters. This simplicity, however, has morphed into a vulnerability. Clusters often grow out of control, resulting in a phenomenon called “node sprawl.” The architecture of the first-gen NoSQL databases actively encourages larger clusters of less powerful machines. From the perspective of latency, this architecture virtually ensures inconsistent p99 performance due to more internode traffic, context switches, disk failures and network hiccups, all of which create latency outliers.
One database that is particularly susceptible to this problem is Apache Cassandra. First, Cassandra is implemented in Java, making it vulnerable to pauses caused by garbage collection (GC). Some monitoring teams even have a dedicated metric devoted to GC stall percentage. Compaction and repair operations, which become more onerous in larger clusters, add to the outliers that plague Cassandra deployments.
Second, Cassandra has limited ability to utilize modern hardware for compaction and streaming. Since Cassandra is an “append-only” system, keeping data coherent for reads requires a compaction process. Servers with a large number of cores are readily available on public clouds, from IaaS providers and for on-prem deployment. However, Cassandra can’t scale compaction linearly with the number of available cores, so it cannot effectively utilize those larger servers. This limitation results in higher read latency, due to the need to deploy many more, smaller servers in the cluster.
Since Cassandra exposes settings for storage media, JVM heap cache and concurrency on reads and writes, the burden is placed on administrators to properly configure the system for specific workloads. When workloads are unpredictable and spiky, administrators must constantly tune the Cassandra cluster to optimize read and write latencies.
Many IT organizations turn to external caches to insulate end users from the poor long-tail performance of their databases. Caches are often used to enhance read performance, but write latencies derive little or no benefit: transactional systems, which depend on real-time performance, cannot drop mutations, so every write must still reach the database. Because transactional systems depend on low long-tail latency for writes as well as reads, if latency spikes are too frequent the client experience deteriorates and becomes sluggish.
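A minimal cache-aside sketch makes the asymmetry plain; SlowDatabase and the dict-based cache below are hypothetical stand-ins, not any particular product’s API:

```python
import time

class SlowDatabase:
    """Toy persistent store with noticeable per-operation latency."""
    def __init__(self):
        self._data = {}

    def read(self, key):
        time.sleep(0.010)                 # simulated disk/network latency
        return self._data.get(key)

    def write(self, key, value):
        time.sleep(0.010)                 # writes always pay this cost
        self._data[key] = value

class CacheAsideStore:
    def __init__(self, database):
        self.cache = {}                   # fast in-memory cache
        self.database = database

    def read(self, key):
        if key in self.cache:             # hit: microseconds, not milliseconds
            return self.cache[key]
        value = self.database.read(key)   # miss: pay full database latency
        self.cache[key] = value
        return value

    def write(self, key, value):
        # Every mutation must reach the database; the cache cannot absorb it
        # without risking lost writes, so write latency stays database-bound.
        self.database.write(key, value)
        self.cache[key] = value           # keep the cache coherent (best effort)

store = CacheAsideStore(SlowDatabase())
store.write("user:42", {"name": "Ada"})   # ~10 ms: database-bound
store.read("user:42")                     # fast: served from the cache
```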
Still, a front-end cache for reads is often a necessary evil, to the point where it is treated as obligatory. However, maintaining coherence between the cache and persistent storage is a well-documented hassle that can easily bog down operations and negatively impact customers. The best approach is to identify a system that intrinsically combines cache and persistent storage in a way that optimizes both performance and data integrity.
Winning the Battle for Consistently Low Long-Tail Latency
With these issues in mind, there are a number of guidelines your organization can follow to minimize long-tail latency.
- Design a system that ensures the lowest p99 latency based on your organization’s current and future throughput targets.
- Don’t ignore performance outliers. Embrace them and use them to optimize system performance.
- Reduce the number of components in your architecture, from redundant cache systems to bloated hardware infrastructure. Shrink database clusters by scaling vertically on more powerful hardware.
- Adopt database infrastructure that scales linearly with your resources: CPU, memory, storage, network and number of nodes in the cluster.
- When using database-as-a-service (DBaaS) offerings, look out for “scaling bloopers”—points at which you can’t scale linearly and have to increase spending beyond the marginal benefit you get from the system.
- Make sure your DBaaS vendor offers multi-region deployments, so you can run the service near your application and eliminate long-haul data traffic. While this will not reduce latency within the database itself, it will reduce end-to-end latency between the database and the client application.
Latency often gets short shrift in IT planning because it’s so poorly understood and so badly measured. But you can’t ignore it. It does its damage whether it’s recognized or not (and the less you recognize it, the more damage it does). Luckily, as we’ve seen, there are good ways to measure latency and cut it down to size. With the right metrics and an informed strategy, businesses can turn what used to be a fearsome adversary into a customer-experience advantage.