Intel’s Heracles Chip Speeds Up FHE Computing


Worried your latest request to a cloud-based AI will reveal a little too much about you? Do you want to know your genetic risk of disease without revealing it to the services that calculate the response?
There is a way to perform calculations on encrypted data without ever decrypting it. This is called fully homomorphic encryption, or FHE. But there is a pretty big catch. Computing on today’s CPUs and GPUs can take thousands or even tens of thousands of times longer than just working with the decrypted data.
So universities, startups and at least one processor giant have been working on specialized chips that could bridge this gap. Last month at the IEEE International Solid-State Circuits Conference (ISSCC) in San Francisco, Intel presented its answer, Heracles, which accelerates FHE computing tasks up to 5,000 times compared to a high-end Intel server processor.
Startups are racing to beat Intel and each other at commercialization. But Sanu Mathew, who leads security circuitry research at Intel, believes the processor giant has a big lead because its chip can do more computing than any other FHE accelerator ever built. “Heracles is the first hardware that works at scale,” he says.
Scale is measurable both physically and in terms of computational performance. While other FHE research chips are about 10 square millimeters or smaller, Heracles is about 20 times larger and is built using Intel’s most advanced 3-nanometer FinFET technology. And it’s flanked inside a liquid-cooled package by two 24GB high-bandwidth memory chips, a configuration typically seen only in GPUs intended for training AI.
Heracles showed what that scale means for computing performance in live demonstrations at ISSCC. At its core, the demo was a simple private query to a secure server. It simulated a voter’s request to confirm that her ballot had been recorded correctly. The state, in this case, holds an encrypted database of voters and their votes. To preserve confidentiality, the voter does not want the information on her ballot to be decrypted at any point. So, using FHE, she encrypts her ID and vote and sends them to the government database. There, without decrypting anything, the system determines whether there is a match and returns an encrypted answer, which the voter then decrypts on her side.
On an Intel Xeon server processor, the process took 15 milliseconds. Heracles did it in 14 microseconds. While this difference isn’t something a single human could notice, verifying 100 million ballots represents over 17 days of CPU work compared to just 23 minutes on Heracles.
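The scaling arithmetic is easy to check. The short Python sketch below uses only the per-query times and the ballot count quoted above; the variable names are illustrative.

```python
# Per-query latencies quoted in the article.
XEON_SECONDS = 15e-3       # 15 milliseconds on the Xeon
HERACLES_SECONDS = 14e-6   # 14 microseconds on Heracles
BALLOTS = 100_000_000

# Total time if every ballot is verified one after another.
xeon_days = BALLOTS * XEON_SECONDS / 86_400
heracles_minutes = BALLOTS * HERACLES_SECONDS / 60
speedup = XEON_SECONDS / HERACLES_SECONDS

print(f"Xeon: {xeon_days:.1f} days")
print(f"Heracles: {heracles_minutes:.1f} minutes")
print(f"Speedup: about {speedup:.0f}x")
```

The roughly thousandfold ratio for this demo sits near the low end of the range of speedups Intel reported.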
Looking back on the five-year journey to bring the Heracles chip to life, Ro Cammarota, who led the project at Intel until last December and is now at the University of California, Irvine, says, “We have proven and delivered on everything we promised.”
Expansion of FHE data
FHE is basically a mathematical transformation, much like the Fourier transform. It encrypts data using a quantum-proof algorithm, but, more importantly, it uses counterparts to the mathematical operations normally applied to unencrypted data. These counterparts achieve the same results on encrypted data.
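To get a feel for what computing on ciphertexts means, here is a toy Python sketch. It is not FHE and not quantum-proof: it uses textbook RSA with tiny primes, which happens to preserve multiplication only, whereas FHE schemes support both addition and multiplication. The primes, the exponent, and the function names are assumptions chosen purely for illustration.

```python
# Toy textbook RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (needs Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 12
# Multiply the two ciphertexts without ever decrypting them.
c_product = (encrypt(a) * encrypt(b)) % n
result = decrypt(c_product)
print(result)
```

The script prints 84, the product of 7 and 12, even though the multiplication was carried out on encrypted values.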
One of the main things holding back this kind of secure computing is the explosion in the size of data once it is encrypted for FHE, Anupam Golder, a research scientist at Intel’s Circuit Research Lab, told engineers at ISSCC. “Usually, the size of ciphertext is the same as that of plaintext, but for FHE it is several times larger,” he said.
While volume is a big problem, the type of calculations you need to do with that data is also a problem. FHE involves very large numbers that must be calculated precisely. Although a processor can do this, it is very slow: adding and multiplying integers takes about 10,000 extra clock cycles in FHE. Worse still, processors are not designed to perform such calculations in parallel. Although GPUs excel at parallel operations, precision is not their strong point. (In fact, from generation to generation, GPU designers have devoted more and more of the chip’s resources to calculating less and less precise numbers.)
FHE also requires unusual operations with names like “twiddling” and “automorphism,” and it relies on a computationally intensive noise-removal process called bootstrapping. None of these run efficiently on a general-purpose CPU. So although clever algorithms and software libraries full of workarounds have been developed over the years, a hardware accelerator is still needed if FHE is to tackle large-scale problems, Cammarota says.
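The article does not name the transform involved. As an assumption, the Python sketch below uses the number-theoretic transform (NTT), an integer analogue of the Fourier transform that is common in lattice-based FHE schemes; its "twiddle factors" are the powers of a root of unity. The tiny modulus, the transform length, and the naive algorithm are illustrative choices only, and real schemes use far larger parameters and fast, butterfly-style algorithms.

```python
P = 17   # small prime modulus
N = 8    # transform length
W = 2    # primitive 8th root of unity mod 17 (2**8 % 17 == 1, 2**4 % 17 == 16)

def ntt(values, root):
    """Naive O(N^2) number-theoretic transform modulo P."""
    return [sum(v * pow(root, j * k, P) for j, v in enumerate(values)) % P
            for k in range(N)]

def intt(values):
    """Inverse transform: use the inverse root, then scale by 1/N."""
    n_inv = pow(N, -1, P)
    return [(x * n_inv) % P for x in ntt(values, pow(W, -1, P))]

def cyclic_multiply(a, b):
    """Schoolbook product of two polynomials modulo x^N - 1 and P."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % N] = (out[(i + j) % N] + ai * bj) % P
    return out

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
# Transform, multiply point by point, then transform back.
fast = intt([(x * y) % P for x, y in zip(ntt(a, W), ntt(b, W))])
assert fast == cyclic_multiply(a, b)
print(fast)
```

The point of the transform is that an expensive polynomial multiplication becomes a cheap element-by-element multiplication, which is the kind of work that parallel hardware handles well.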
The Labors of Heracles
Heracles began five years ago as part of a DARPA program to accelerate FHE using purpose-built hardware. It was developed as part of an “overall system-level effort from theory and algorithms to circuit design,” Cammarota says.
One of the first problems was how to calculate with numbers larger than the 64-bit words that are the most precise on a processor today. There are ways to divide these gigantic numbers into chunks of bits that can be calculated independently of each other, thus providing some degree of parallelism. From the start, the Intel team made a big bet that they could make this work in smaller 32-bit chunks, while still maintaining the necessary precision. This decision gave the Heracles architecture some speed and parallelism, because 32-bit arithmetic circuits are considerably smaller than 64-bit ones, Cammarota says.
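One well-known way to split a huge number into independently computable chunks is a residue number system, which relies on the Chinese remainder theorem. The article does not say that this is exactly what Heracles does, so treat the Python sketch below as an assumption-laden illustration; the three moduli are hypothetical and were chosen only because each fits in 32 bits and they share no common factors.

```python
from math import gcd, prod

# Hypothetical moduli, each below 2**32 and pairwise coprime.
MODULI = [2**31, 2**31 - 1, 2**32 - 1]
assert all(gcd(m1, m2) == 1
           for i, m1 in enumerate(MODULI) for m2 in MODULI[i + 1:])
M = prod(MODULI)  # any integer below M is represented exactly

def to_residues(x):
    """Split a big integer into one small remainder per modulus."""
    return [x % m for m in MODULI]

def from_residues(residues):
    """Rebuild the big integer with the Chinese remainder theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        total += r * partial * pow(partial, -1, m)
    return total % M

a = 123_456_789_012_345_678
b = 9_876_543_210
ra, rb = to_residues(a), to_residues(b)
# Each chunk is multiplied on its own; no carries cross between chunks.
product = from_residues([(x * y) % m for x, y, m in zip(ra, rb, MODULI)])
assert product == a * b
```

Because the chunks never exchange carries, each one can be handed to a separate small arithmetic unit, which is where the parallelism comes from.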
At the heart of Heracles are 64 computing cores, called tile pairs, arranged in an eight-by-eight grid. These are so-called Single Instruction Multiple Data (SIMD) calculation engines designed to perform the polynomial math, manipulations, and other elements that make up computing in FHE and to perform them in parallel. An on-chip 2D mesh network connects tiles to each other with large 512-byte buses.
Making encrypted computing efficient depends on getting those huge amounts of data to the compute cores quickly. The sheer volume of data involved meant linking 48 GB of expensive high-bandwidth memory to the processor over 819-GB-per-second connections. Once on the chip, data is held in 64 MB of cache memory, slightly more than in an Nvidia Hopper-generation GPU. From there, it can flow across the chip at 9.6 terabytes per second, moving from one tile pair to the next.
To ensure that computation and data movement do not get in each other’s way, Heracles runs three synchronized instruction streams simultaneously: one to move data on and off the processor, one to move data within it, and a third to perform the calculations, Golder explained.
All of this translates to massive speedups, according to Intel. Heracles, running at 1.2 gigahertz, takes just 39 microseconds to complete FHE’s critical mathematical transformation, a 2,355-fold improvement over an Intel Xeon processor running at 3.5 GHz. Across seven key operations, Heracles was 1,074 to 5,547 times as fast.
The range reflects how much data movement each operation involves, Mathew says. “It’s all about balancing the movement of data and the analysis of numbers,” he explains.
FHE competition
“It’s a very good job,” Kurt Rohloff, chief technology officer at FHE software company Duality Technology, says of Heracles’ results. Duality was part of a team that developed a competing accelerator design under the same DARPA program in which Intel designed Heracles. “When Intel starts talking about scale, it usually carries a lot of weight.”
Duality’s focus is less on new hardware and more on software products that perform the kind of encrypted queries that Intel demonstrated at ISSCC. At the scale used today, “there is less need for [specialized] hardware,” says Rohloff. “Where you start to need hardware is emerging applications around deeper machine learning-oriented operations, like neural networks, LLMs, or semantic search.”
Last year, Duality demonstrated an FHE-encrypted version of the language model BERT. Like more famous LLMs such as ChatGPT, BERT is a transformer model. However, it is only about one-tenth the size of even the most compact LLMs.
John Barrus, vice president of product at Dayton, Ohio-based Niobium Microsystems, an FHE chip startup that spun off from another DARPA competitor, agrees that encrypted AI is a key target of FHE chips. “There are many smaller models that, even with the expansion of FHE data, will perform very well on accelerated hardware,” he says.
With no stated commercial plans from Intel, Niobium expects its chip to be “the world’s first commercially viable FHE accelerator, designed to enable encrypted calculations at speeds practical for real-world cloud and AI infrastructure.” Although it did not announce when a commercial chip would be available, the startup revealed last month that it had signed a deal worth 10 billion South Korean won ($6.9 million) with Seoul-based chip design company Semifive to develop the FHE accelerator for manufacturing using Samsung Foundry’s 8-nanometer process technology.
Other startups, including Fabric Cryptography, Cornami and Optalysys, have worked on chips to accelerate FHE. Nick New, CEO of Optalysys, says Heracles achieves the level of acceleration you could hope for from a fully digital system. “We are looking to overcome this numerical limit,” he says. His company’s approach is to use the physics of a photonic chip to perform the computationally intensive transformation steps of FHE. That photonic chip is in its seventh generation, he says, and one of the next steps is to integrate it in 3D with custom silicon that performs the non-transformation steps and coordinates the entire process. A commercial, fully 3D-stacked chip could be ready in two or three years, New says.
Intel will keep developing its chip as competitors develop theirs, Mathew says. That work will include fine-tuning the software to improve how much the chip can speed up calculations. It will also involve trying larger FHE problems and exploring hardware improvements for a potential next generation. “It’s like the first microprocessor…the start of a whole journey,” says Mathew.



