To break the memory wall, HBM can be paired with DDR5


A new memory scheme emerges, capable of taking bandwidth to the next level.

In 2024, if you need to stitch together dozens, hundreds, thousands, or even tens of thousands of accelerators, the interconnect becomes a significant challenge.

Nvidia has NVLink and InfiniBand. Google's TPU pods talk to one another over optical circuit switches (OCS). AMD has Infinity Fabric for die-to-die, chip-to-chip, and, soon, node-to-node traffic. And of course, there is good old Ethernet.

The trick is not so much building a big enough mesh as it is fending off the hefty performance penalties and bandwidth bottlenecks that come with packetization. And none of this addresses the fact that all of this AI compute is tied to HBM, which fixes the ratio of memory to compute.

"The industry is using Nvidia GPUs as the world's most expensive memory controllers," says Dave Lazovsky, whose company Celestial AI has just secured $175 million in a Series C funding round, backed by USIT and many other venture capital giants, to commercialize its photonic fabric.


Last summer, we examined Celestial's photonic fabric architecture, a family of silicon photonics interconnects, interposers, and chiplets designed to decouple AI compute from memory. Less than a year later, the company is working with several hyperscale customers and a major processor manufacturer to integrate the technology into their products. Lazovsky did not name names.

However, the fact that Celestial counts AMD Ventures among its backers, and that AMD senior vice president and product technology architect Sam Naffziger was discussing the possibility of co-packaged silicon photonics chiplets the same day the funding was announced, certainly caught some attention. That said, AMD putting money into a photonics startup does not mean we will necessarily see Celestial's chiplets in Epyc CPUs or Instinct GPU accelerators.

While Lazovsky could not say who Celestial is working with, he did offer some clues as to how the technology is being integrated, along with a sneak peek at an upcoming HBM memory appliance.

As we discussed when we first dug into Celestial's product strategy, the company's offerings fall into three broad categories: chiplets, interposers, and an optical equivalent of Intel's EMIB or TSMC's CoWoS, which it calls OMIB.

Not surprisingly, most of the attention on Celestial is focused on the chiplets. "What we're not doing is trying to force our customers into any particular product implementation. The lowest-risk, fastest, and least complex way to deliver an interface to the photonic fabric today is a chiplet," Lazovsky told The Next Platform.

Broadly speaking, these chiplets can be used in one of two ways: either to add more HBM memory capacity, or to serve as a chip-to-chip interconnect, analogous to an optical NVLink or Infinity Fabric.

The chiplets themselves are slightly smaller than an HBM stack and provide an optical-electrical interconnect with an aggregate off-chip bandwidth of 14.4 Tb/s, or 1.8 TB/s.

That being said, we were told a chiplet could be built to support higher bandwidths. The first-generation technology supports about 1.8 Tb/s per square millimeter. Meanwhile, Celestial's second-generation photonic fabric will jump from 56 Gb/s to 112 Gb/s PAM4 SerDes and will double the number of channels from four to eight, effectively quadrupling the bandwidth.
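For the curious, the arithmetic checks out. Here is a quick sanity check in Python, using only the figures quoted above; it also shows where the two headline numbers for the chiplet come from:

```python
# Sanity check on the generational bandwidth claims, using only figures
# quoted in this article.
GEN1_LANE_GBPS = 56       # first-gen SerDes rate, Gb/s per lane
GEN2_LANE_GBPS = 112      # second-gen PAM4 SerDes rate, Gb/s per lane
GEN1_CHANNELS = 4         # channel count, first gen
GEN2_CHANNELS = 8         # channel count, second gen

scaling = (GEN2_LANE_GBPS * GEN2_CHANNELS) / (GEN1_LANE_GBPS * GEN1_CHANNELS)
print(f"Gen 2 vs Gen 1 bandwidth: {scaling:.0f}x")         # -> 4x

# And the chiplet's headline figure, converted from bits to bytes:
print(f"14.4 Tb/s = {14.4 / 8:.1f} TB/s")                  # -> 1.8 TB/s
```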

Thus, 14.4 Tb/s is not so much an upper limit as a reflection of what existing chip architectures can actually consume. That makes sense; any extra bandwidth would otherwise go to waste.

This connectivity means that Celestial can achieve interconnect speeds similar to NVLink, only with fewer steps along the way.

While chip-to-chip connectivity is fairly self-explanatory (put a photonic fabric chiplet on each package and run optical fiber between them), memory expansion is a different animal entirely. Although 14.4 Tb/s is hardly slow, it is a bottleneck for multiple stacks of HBM3 or HBM3e. That means beyond a certain point, adding more HBM buys you only capacity, not bandwidth. Still, two HBM3e stacks in place of one is nothing to sneeze at.
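To see where that crossover sits, a rough sketch helps. The per-stack peak bandwidth below is our own assumption based on published HBM pin rates (a 1024-bit interface at 9.6 Gb/s per pin for HBM3e), not a figure from Celestial:

```python
# Rough model of when the photonic link, rather than the HBM itself,
# becomes the bottleneck. The per-stack bandwidth is assumed, not quoted.
LINK_TB_S = 14.4 / 8                      # photonic link budget: 1.8 TB/s
HBM3E_STACK_TB_S = 1024 * 9.6 / 8 / 1000  # ~1.23 TB/s per HBM3e stack

for stacks in (1, 2, 3):
    demand = stacks * HBM3E_STACK_TB_S
    verdict = "link-bound" if demand > LINK_TB_S else "stack-bound"
    print(f"{stacks} stack(s): {demand:.2f} TB/s vs {LINK_TB_S:.1f} TB/s -> {verdict}")
```

By these numbers the link saturates somewhere between the first and second stack, which squares with the framing above: a second stack still doubles capacity, but past that you are only buying capacity.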

Celestial has an interesting answer in its memory expansion module. Since the link tops out at 1.8 TB/s, the module carries just two HBM stacks totaling 72 GB. Alongside them sit four DDR5 DIMMs supporting up to 2 TB of additional capacity.
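The totals are easy to reconcile; note that the per-stack and per-DIMM splits below are our inference from those totals, not numbers Celestial provided:

```python
# Capacity math for the expansion module. Only the 72 GB and 2 TB totals
# come from the article; the splits are inferred for illustration.
hbm_stacks, gb_per_stack = 2, 36     # 2 x 36 GB stacks -> 72 GB of HBM
dimms, gb_per_dimm = 4, 512          # 4 x 512 GB DIMMs -> 2 TB of DDR5
print(f"HBM tier:  {hbm_stacks * gb_per_stack} GB")
print(f"DDR5 tier: {dimms * gb_per_dimm // 1024} TB")
```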

Lazovsky was reluctant to spill all the beans on the product, but did tell us it will use Celestial's silicon photonics interposer technology as the interface between the HBM, the interconnect, and the controller logic.

Speaking of that controller, we were told the 5nm ASIC effectively turns the HBM into a write-through cache for the DDR5. "It gives you the capacity and cost of DDR with the bandwidth and all the advantages of the 32 pseudo-channels of the HBM interface, hiding the latency," Lazovsky explained.
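For readers who want that cache behavior spelled out, here is a minimal software analogue of a write-through cache. This is purely our own sketch (Celestial's ASIC does this in hardware across 32 pseudo-channels), and every name in it is hypothetical:

```python
# Software analogue of a write-through cache: writes land in both the
# fast tier (HBM stand-in) and the backing tier (DDR5 stand-in), so the
# backing store is always consistent; reads are served from the fast
# tier on a hit. Purely illustrative; all names are hypothetical.

class WriteThroughCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fast: dict[int, bytes] = {}     # stands in for 72 GB of HBM
        self.backing: dict[int, bytes] = {}  # stands in for 2 TB of DDR5

    def write(self, addr: int, data: bytes) -> None:
        self.backing[addr] = data            # write-through: DDR5 always updated
        if addr in self.fast or len(self.fast) < self.capacity:
            self.fast[addr] = data           # keep the hot line in HBM too

    def read(self, addr: int) -> bytes:
        if addr in self.fast:                # hit: HBM bandwidth and latency
            return self.fast[addr]
        data = self.backing[addr]            # miss: fetch at DDR5 speed
        if len(self.fast) >= self.capacity:  # naive eviction to make room
            self.fast.pop(next(iter(self.fast)))
        self.fast[addr] = data
        return data

mem = WriteThroughCache(capacity=2)
mem.write(0x1000, b"model weights")
assert mem.read(0x1000) == b"model weights"  # served from the fast tier
```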

He added that this is not far from what Intel has done with Xeon Max or what Nvidia has done with its GH200 superchip. "It's essentially a turbocharged Grace-Hopper without all the cost overhead and with higher efficiency."

How much more efficient? "Our memory transaction energy is about 6.2 picojoules per bit, while the overhead for remote memory transactions over NVLink and NVSwitch is about 62.5 picojoules per bit," Lazovsky claimed, adding that latency is not bad either.
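Taking those figures at face value, the ratio works out to roughly an order of magnitude:

```python
# Energy-per-bit comparison, using Lazovsky's quoted figures.
photonic_pj = 6.2    # pJ/bit over the photonic fabric
nvlink_pj = 62.5     # pJ/bit for remote transactions over NVLink/NVSwitch
print(f"~{nvlink_pj / photonic_pj:.0f}x less energy per bit")   # ~10x
```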

"The total round-trip latency for these remote memory transactions, including two trips through the photonic structure and memory read time, is 120 nanoseconds," he added, "So, it will be a bit more than the local memory of about 80 nanoseconds, but it's faster than going to Grace and reading parameters and pulling them to Hopper."

As we understand it, sixteen of these memory modules can be ganged together behind a memory switch, and multiple such appliances can be linked together over the optical fabric.
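If that topology holds, the pooled capacity behind a single switch is substantial. The module totals come from the figures above; the single-switch arrangement is as we understand it:

```python
# Pooled capacity behind one memory switch, as we understand the topology.
modules = 16
tb_per_module = 2 + 72 / 1024        # 2 TB of DDR5 plus 72 GB of HBM each
print(f"~{modules * tb_per_module:.1f} TB per switch")   # ~33.1 TB
```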

This means chips built with Celestial's interconnect will be able not only to talk to one another, but also to share a pool of memory, alongside compute, storage, and networking resources.

"It allows you to perform machine learning operations, such as broadcasting and reducing, in a very, very efficient way without switching," Lazovsky said.

The challenge for Celestial is timing. Lazovsky told us that he expects to begin providing samples of the photonic fabric chiplets to customers sometime in the second half of 2025. Then, he anticipates at least another year before we see products using this design hitting the market, with sales growth expected in 2027.

However, Celestial is not the only startup chasing silicon photonics. Another photonics startup, Ayar Labs, which counts Intel among its backers, has already integrated its optical interconnect into prototype accelerators.

Then there's Lightmatter, which secured a $155 million Series C round in December of last year and is trying to do something very similar to Celestial with its Passage interposer. At the time, Lightmatter CEO Nick Harris claimed it had customers using Passage to "scale to supercomputers with 300,000 nodes." Of course, like Lazovsky, Harris won't say who those customers are.

And then there's Eliyan, which is attempting to do away with interposers entirely with its NuLink PHY, or, if you must have them, to boost the performance and scale of interposers.

Regardless of who comes out on top in this race, the shift toward co-packaged optics and silicon photonics interposers seems to be only a matter of time.
