Unreliable Server Scare | Information Batteries

Welcome to The Long View—where we peruse the news of the week and strip it to the essentials. Let’s work out what really matters.

This week: We worry about chips failing randomly, we ponder a new way of thinking about workload shifting, and we grok Arm’s IPO.

1. Smaller Features Lead to Random Failures

First up this week: Could cosmic rays be causing increasing numbers of errors in cloud servers? As chips get faster and more densely packed, the worry is that undetected bit-flips could cause serious problems.

Analysis: How can you mitigate the problem?

As workloads grow in number and complexity, this is becoming a real issue, not merely a theoretical one. (See also: DevSecOps worries such as Rowhammer.) Think about where your critical data live and how you would detect a random error there. Even consider running important workflows in lockstep pairs and comparing the results.
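To make the lockstep idea concrete, here is a minimal sketch in Python. The names (`run_in_lockstep`, `LockstepMismatch`) are made up for illustration, and it assumes the task is deterministic and its result can be pickled. It runs both copies in one process, which would not catch a core that is consistently wrong; a real deployment would presumably place the two runs on different cores or hosts.

```python
import hashlib
import pickle


class LockstepMismatch(RuntimeError):
    """Raised when redundant runs of the same task disagree."""


def digest(value):
    # Hash a serialized copy of the result so large outputs compare cheaply.
    return hashlib.sha256(pickle.dumps(value)).hexdigest()


def run_in_lockstep(task, *args, runs=2):
    # Run the task several times and accept the result only if every
    # run produced byte-identical output.
    results = [task(*args) for _ in range(runs)]
    if len({digest(r) for r in results}) != 1:
        raise LockstepMismatch(f"{runs} runs of {task!r} disagreed")
    return results[0]


total = run_in_lockstep(sum, [1, 2, 3])
```

The price is at least double the compute, so it probably only makes sense for the small set of workflows where a silent wrong answer is worse than a slow one.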

John Markoff: Tiny Chips, Big Headaches

Imagine for a moment that the millions of computer chips inside … the largest data centers in the world had rare, almost undetectable flaws. … As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the biggest networks. … There is growing evidence that the problem is worsening with each new generation of chips.
…
Both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. … Facebook researchers noted that some processors would pass manufacturers’ tests but then began exhibiting failures when they were in the field. … Smaller transistors, three-dimensional chips and new designs that create errors only in certain cases all contributed to the problem … according to Google.
…
Intel executives said they were familiar with the Google and Facebook research papers and were working with both companies to develop new methods for detecting and correcting hardware errors. [But] computer engineers are divided over how to respond to the challenge.

Have we hit a wall? So asks Thutmose V:

For more than 50 years, semiconductor manufacturers have been trying to keep up with Moore’s Law, which … has effectively served as the target for semiconductor engineers to push them forward. … The size of some parts of the transistors is now 10 nanometers or less. 10 nanometers is about equal to the width of 100 silicon atoms—100 atoms is not very many.
…
When things are that small, does quantum randomness start to be noticeable? Normally if the odds of something failing is a trillion to one, you don’t need to worry. But if you have 10 billion transistors operating at a clock speed of 3 billion times a second, a one in a trillion chance of failure is nowhere near good enough. … It is a miracle these things work as well as they do.
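The arithmetic behind that claim is easy to check. This back-of-the-envelope sketch uses the figures from the quote and assumes, pessimistically, that every transistor switches on every clock cycle and that failures are independent. Neither assumption comes from the source.

```python
# Figures from the quote above; the failure model itself is an assumption.
transistors = 10e9   # 10 billion transistors
clock_hz = 3e9       # 3 billion cycles per second
p_fail = 1e-12       # one-in-a-trillion chance per switching event

events_per_second = transistors * clock_hz
expected_failures_per_second = events_per_second * p_fail
print(f"{expected_failures_per_second:.1e} expected failures per second")

# Per-event odds needed to average about one failure per year.
seconds_per_year = 365 * 24 * 3600
p_needed = 1 / (events_per_second * seconds_per_year)
print(f"{p_needed:.1e} per-event failure odds for one failure a year")
```

Under those assumptions, trillion-to-one odds work out to roughly 30 million failures every second, which is why the commenter calls it nowhere near good enough.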

We’ve known about the problem for years, but only now is it becoming reality, says dsgrntlxmply:

We observed certain network failures to be caused by corruption of an on-chip routing table memory in a SoC that lacked parity protection for these tables. The failure rate was low, but was higher in units that were deployed at higher altitudes. … The problem was eventually mitigated by software table sweeps, and in newer SoC generations, by parity in critical tables.

These Single Event Upsets are said to be caused by muons generated by high energy cosmic ray interactions in the upper atmosphere. Third semester undergrad physics.
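For readers who have not met parity protection, here is a toy sketch of the two mitigations the commenter mentions: a parity bit per table entry, plus a periodic software sweep. It is not the commenter’s implementation. `ParityTable` and its methods are hypothetical, and a single parity bit only catches an odd number of flipped bits in an entry.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class ParityTable:
    def __init__(self, entries):
        self.words = list(entries)
        # One stored parity bit per entry, computed when the entry is written.
        self.parity_bits = [parity(w) for w in self.words]

    def sweep(self):
        """Return the indexes whose data no longer matches its parity bit."""
        return [
            i
            for i, (w, p) in enumerate(zip(self.words, self.parity_bits))
            if parity(w) != p
        ]

    def repair(self, index, known_good_word):
        # Rewrite a corrupted entry from a trusted copy held elsewhere.
        self.words[index] = known_good_word
        self.parity_bits[index] = parity(known_good_word)


table = ParityTable([0b1010, 0xFF, 0x07])
table.words[1] ^= 1 << 3   # simulate a single event upset flipping one bit
corrupted = table.sweep()  # the sweep flags entry 1
```

Parity only detects the error. Repairing it still needs a known-good copy of the entry, which is why the sketch takes one as an argument.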

2. Mind-Bending Speculative-Compute Research

Many of us are familiar with delaying a workload until a time when servers are idle—or when energy is cheap. But some fascinating research asks: What if we could predict a future workload and perform it in advance?

Analysis: “Information batteries” can be cheaper than Li-ion

The researchers argue that pre-caching even parts of a workload’s output when energy costs are low, or even negative, can save far more money than traditional grid-scale storage such as batteries or pumped hydro. Examples include transcoding video speculatively instead of on the fly, or predicting future machine learning training workloads.
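As a rough illustration of the idea, and not the researchers’ actual system, the sketch below “charges” a cache with predicted results while energy is cheap and “discharges” it when a request matches. The `InformationBattery` class, its method names and the price threshold are all hypothetical.

```python
class InformationBattery:
    def __init__(self, compute, price_threshold):
        self.compute = compute                  # the expensive function
        self.price_threshold = price_threshold  # only precompute at or below this price
        self.cache = {}

    def charge(self, predicted_inputs, energy_price):
        """Speculatively compute predicted work while energy is cheap."""
        if energy_price > self.price_threshold:
            return 0
        stored = 0
        for item in predicted_inputs:
            if item not in self.cache:
                self.cache[item] = self.compute(item)
                stored += 1
        return stored

    def request(self, item):
        """Return (result, was_precomputed) for a real request."""
        if item in self.cache:
            return self.cache[item], True
        return self.compute(item), False


battery = InformationBattery(compute=lambda x: x * x, price_threshold=0.0)
stored = battery.charge([1, 2, 3], energy_price=-5.0)   # negative price: precompute
skipped = battery.charge([4], energy_price=20.0)        # expensive: do nothing
hit = battery.request(2)
miss = battery.request(4)
```

Every miss means the speculative energy was wasted on a wrong guess, so the economics depend heavily on how good the prediction is.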

Caitlin Dawson: A New Way to Store Sustainable Energy

Renewable energy has an intermittency problem. … What if surplus renewable energy could be stored as computation? … That’s the thinking behind “information batteries,” a new system proposed by … Barath Raghavan, an assistant professor in computer science at the USC Viterbi School of Engineering … and Jennifer Switzer, a Ph.D. student from UC San Diego.
…
When renewable energy is available in excess, it is used to speculatively perform computations. … The stored computed results can then be used later when green energy is less plentiful. … For instance, every day, YouTube data centers transcode more than 700,000 hours of videos to different resolutions. Many of these computations are predictable and can be performed at a time when there is excess green energy.
…
[But] computations that are completed in advance do not need to match exactly with the computations completed at a later time. … The challenge, said the researchers, is determining what computation to perform, where and when, and how … to efficiently retrieve the results later. … It is only possible in some workloads and in some contexts. But Raghavan believes with improved prediction and integration into large systems, the technology points towards a promising future alternative.

This isn’t just batch computing with a fancy name. cognizant_ape explains:

On AWS we do a lot of batch processing when the data centers are low on traffic. This is just the standard SPOT instances which you can trigger once the cost goes below a certain amount and then end when it goes above a certain amount.

The big issue is being effective with I/O and setup time. The article seems to be getting ambitious to the point of a distributed compiler with a flexible optimization metric.

Cut to the chase. It hinges on prediction, says bradley13:

Predicting loads is easy, but predicting the computations that will make up those loads? Not so much. … Imagine: Most employees start work in an hour — let’s predict which database queries they will generate. Your success rate better be high.

3. Confirmed: Arm will IPO, Not be Sold to Nvidia

SoftBank and Nvidia have bowed to the inevitable. Arm, designers of ubiquitous chip cores, will IPO sometime in the next 12 months.

Analysis: The only sane choice, but ecosystem risk remains

As I’ve said before, it seemed everyone was against the idea of Nvidia owning Arm. It would have put the kibosh on Arm’s open licensing stance, dooming the ecosystem to scraps from the table as Nvidia grabbed first dibs on everything. Now regulators must turn their attention to the very real risk that Nvidia will acquire Arm anyway, via the stock market.

Richard Waters, Arash Massoudi, James Fontanella-Khan and Antoni Slodkowski: $66bn sale of chip group Arm to Nvidia collapses

The deal, which would have been the largest ever in the chip sector, was set to give … Nvidia control of a company that makes technology at the heart of most of the world’s mobile devices. A handful of Big Tech companies that rely on Arm’s chip designs, including Qualcomm and Microsoft, had objected to the purchase [arguing] that Nvidia would get an unfair advantage by having first rights to Arm’s technology.
…
Jensen Huang, Nvidia’s chief executive, hoped to use Arm’s processor designs to cement his company’s growing role in data centres. … Nvidia abandoned its pursuit of Arm at a board meeting on Monday, said a person familiar with the discussion.
…
The deal’s failure brought an immediate management shake-up at Arm, with chief executive Simon Segars replaced by Rene Haas, head of the company’s intellectual property unit. … In the UK, where politicians view Arm as a strategic national asset, attention is set to shift to whether the company will be listed in London [but] Haas said … no decisions had been made on where Arm would be listed.

Reminding us why this matters to DevOps, here’s Timothy Prickett Morgan:

Now there is no question that Arm Holdings will be going public. … The plan is to do so before the end of [its] fiscal year in March 2023. … Arm Holdings will be returning to its IP licensing roots, having spun off some uncore business back into SoftBank, and seeking out new markets as it has been able to do with hyperscalers [and] cloud builders.
…
When Intel was worried about AMD Epyc chips, what it really needed to worry about was boosting price/performance to keep Arm CPUs out of the hyperscaler and cloud builder datacenters. [But AWS] has Graviton Arm server chips on a pretty regular schedule, and you can bet Masayoshi’s last yen that Graviton4 will debut at re:Invent 2022 this year.

But what happens next? Pun it up, HereAndGone:

If Softbank is going to take ARM public, Nvidia will find it much cheaper simply to acquire a controlling interest. … Looks like a duck, walks like a duck, but cheaper than a duck—while ducking regulatory issues. This plan is not quacked up to solve what it pretends to solve and will likely prove most fowl.