Ken Shirriff Tracks Down Intel's Infamous Pentium FDIV Bug — in the Silicon Itself
"I can confirm [the flaw] in silicon," Shirriff writes — having tracked down the PLA block responsible for a half-billion-dollar recall.
Noted reverse engineer and vintage electronics enthusiast Ken Shirriff has turned his attention to one of the darkest days in Intel's storied history: the discovery, and subsequent fallout, of the infamous FDIV bug in its Pentium processor range.
"Intel launched the Pentium processor in 1993. Unfortunately, dividing sometimes gave a slightly wrong answer, the famous FDIV bug," Shirriff explains. "Replacing the faulty chips cost Intel $475 million. I reverse engineered the circuitry and can explain the bug."
Intel's Pentium processors, the first to receive a trademarkable name in place of the easily-cloned numerical nomenclature that had been used for the 80486 and its predecessors, were a smash hit for the company, delivering a serious performance uplift. Sadly, they also came with a design flaw: an issue with the integrated floating-point unit (FPU) that would, sometimes but not often, deliver the wrong answer during floating-point division. While Intel would initially downplay the severity of the problem, it would eventually replace any affected parts on demand, at a cost to itself of $475 million.
"The Pentium uses a division algorithm called SRT. It generates two bits at a time, making division twice as fast," Shirriff explains. "SRT's secret is quotient digits can be negative: -2, -1, 0, 1, 2. A 2048-entry table gives the digit for a particular divisor and remainder. Unfortunately, five entries were wrong."
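For readers who want to see the idea in motion, the signed-digit recurrence Shirriff describes can be sketched in a few lines of Python. This is an illustrative sketch only, not Intel's circuit: the function names are made up for this example, the arithmetic uses exact fractions, and the quotient digit is chosen by exact rounding rather than by the Pentium's table lookup on truncated divisor and remainder bits.

```python
from fractions import Fraction


def select_digit(shifted_remainder, divisor):
    """Pick a quotient digit in {-2, -1, 0, 1, 2}.

    Assumption: exact rounding stands in for the hardware lookup table,
    which works from only a few leading bits of each operand.
    """
    nearest = round(shifted_remainder / divisor)
    return max(-2, min(2, nearest))


def srt_radix4_divide(dividend, divisor, steps=16):
    """Approximate dividend / divisor, producing two quotient bits per step."""
    d = Fraction(divisor)
    r = Fraction(dividend)
    if d <= 0:
        raise ValueError("this sketch expects a positive divisor")

    # Pre-scale so the starting remainder satisfies |r| <= (2/3) * d,
    # the bound that keeps every later digit within -2..2.
    scale = Fraction(1)
    while abs(r) > Fraction(2, 3) * d:
        r /= 4
        scale *= 4

    quotient = Fraction(0)
    weight = Fraction(1)
    digits = []
    for _ in range(steps):
        weight /= 4                  # each digit is worth a quarter of the last
        shifted = 4 * r              # shift the remainder left by two bits
        q = select_digit(shifted, d)
        r = shifted - q * d          # a negative q adds the divisor back
        quotient += q * weight
        digits.append(q)

    return quotient * scale, digits
```

Calling `srt_radix4_divide(1, 3)` returns the approximate quotient as a `Fraction` along with the list of digits chosen at each step. Because a digit may be negative, a slightly wrong guess can be corrected on a later step, which is what allows real hardware to pick digits from a small table instead of performing a full comparison. A wrong table entry breaks that guarantee.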
These entries, Shirriff explains, were stored in a programmable logic array (PLA) on the chip itself, visible in high-resolution photography of an unencapsulated Pentium processor die. "Smart mathematicians figured out Pentium's division algorithm and the missing entries in 1995 by examining the pattern of errors," Shirriff says. "But I can confirm it in silicon.
"Moreover, I see 16 missing entries in the table, not just five, but 11 of them don't cause errors due to luck. Intel claimed the bug was due to an error in a script to download the entries into the PLA. But due to the 16 missing entries, I think they made a mathematical error in constructing the table, misjudging the effect of a seven-bit adder."
Shirriff's full analysis is available on his Mastodon account; at the time of writing, a promised longer write-up had yet to be published on his website.