Computing is easy, while data movement and storage are becoming increasingly difficult.
Although many people focus on the floating-point and integer processing architectures of various compute engines, we find ourselves spending more and more time thinking about memory hierarchies and interconnect hierarchies. That is because computing is easy, while moving and storing data is getting harder and harder.
To put some simple numbers on it: over the past two decades, the compute capacity of CPUs and GPUs has grown by 90,000X, but DRAM memory bandwidth has grown by only 30X, and interconnect bandwidth by only 30X as well. The industry has made progress in some areas in recent years, but compute and memory are still far out of balance, which means that for a great many AI and HPC workloads, we overspend on compute engines that are starved for memory.
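To put those growth rates side by side, here is a minimal back-of-the-envelope sketch in Python; it only restates the article's round numbers and is not a measurement.

```python
# Rough compute-vs-memory imbalance sketch, using the round numbers quoted
# above (90,000X compute growth vs. 30X bandwidth growth over two decades).
compute_growth = 90_000
dram_bw_growth = 30
interconnect_bw_growth = 30

# Bytes moved per unit of compute shrinks by the ratio of the growth rates.
print(f"Compute outgrew DRAM bandwidth by {compute_growth / dram_bw_growth:,.0f}X")                  # 3,000X
print(f"Compute outgrew interconnect bandwidth by {compute_growth / interconnect_bw_growth:,.0f}X")  # 3,000X
```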
With this in mind, we have been mulling the architectural innovations at the network physical layer (PHY) created by Eliyan, which were presented in a different and very useful way at the MemCon 2024 conference this week. Co-founder and CEO Ramin Farjadrad took some time to walk us through how the NuLink PHY and its use cases have evolved over time, and how they can be used to build compute engines that are better, cheaper, and more capable than those built with today's silicon interposer-based packaging.
A PHY is the physical transceiver that links any number of other kinds of interfaces on or inside a switch chip, network interface, or compute engine to a physical medium (copper wire, optical fiber, or radio signal), which in turn connects them to each other or to the network.
A silicon interposer is a special circuit bridge used to connect HBM stacked DRAM memory to compute engines such as GPUs and custom ASICs, which are typically used for bandwidth-sensitive applications in HPC and AI. Sometimes regular CPUs that need high-bandwidth memory use HBM as well.
Eliyan was founded in San Jose in 2021 and currently has 60 employees. The company has just closed a $60 million Series B round led by memory maker Samsung and Tiger Global Capital. Eliyan raised $40 million in its Series A in November 2022, led by Tracker Capital Management, with Celesta Capital, Intel, Marvell, and memory maker Micron Technology also participating.
Farjadrad worked as a design engineer at Sun Microsystems and LSI Logic during the internet boom, was a co-founder and chief engineer for switch ASICs at Velio Communications (now part of LSI Logic), and was a co-founder and chief technology officer at Aquantia, which makes Ethernet PHYs for the automotive market. Marvell acquired Aquantia in September 2019 and put Farjadrad in charge of networking and automotive PHYs. Marvell has become one of the largest PHY makers, competing with the likes of Broadcom, Alphawave Semi, Nvidia, Intel, Synopsys, Cadence, and now Eliyan in the design of these critical system components.

Eliyan's other co-founders include Syrus Ziai, the head of engineering and operations, who over the years was vice president of engineering at Qualcomm, Ikanos, PsiQuantum, and Nuvia, and Patrick Soheili, who runs business and corporate development and who was responsible for product management and artificial intelligence strategy at eSilicon. That company is renowned for creating the ASICs in Apple's iPod music players and for developing 2.5D ASIC packaging and HBM memory controllers. eSilicon was acquired by Inphi for $213 million at the end of 2019 to expand Inphi's PHY capabilities, and Marvell closed the loop by acquiring Inphi for $10 billion in a deal announced in October 2020 and completed in April 2021.
The funding will be used for PHYs as well as I/O SerDes and retimer devices. A SerDes, which converts the parallel data coming off a device into serial data that can be pushed over wires, optical fiber, or the air, is from a certain perspective just a special kind of PHY used in switch ASICs and the like. As bandwidth goes up and the length of copper wire that can carry a clean signal goes down, retimers will be used more and more.
Next, let's talk about 2.5D packaging.
2.5D Packaging
As Moore's Law slows in terms of transistor density, and as the cost of transistors rises rather than falls with each successive process node, we have all become aware of the reticle limits of modern chip etching. With conventional lithography, including extreme ultraviolet (EUV), the maximum die area that can be etched onto a silicon wafer in a single exposure is 26 mm by 33 mm.
What many may not realize is that there are also limits on the size of the silicon interposer that lets chiplets link to each other on top of an organic substrate, with the organic substrate acting like the motherboard underneath each compute engine socket and its associated HBM memory. How big that interposer can be depends on the technology used to make it. Interposers are manufactured with the same lithography as the chips themselves, but with certain techniques an interposer can now reach an area of about 2,500 mm^2 rather than being capped at the 858 mm^2 reticle limit that applies to chips, while other approaches get close to 1,900 mm^2; according to Farjadrad, there are plans to push this to 3,300 mm^2. Organic substrate packages have no such area limitation. This matters when you talk about 2.5D packaging of chiplets.
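For a quick sanity check of those areas, here is a small sketch using only the reticle dimensions and interposer areas quoted above.

```python
# Reticle and interposer area arithmetic from the figures quoted above.
reticle_mm2 = 26 * 33   # 858 mm^2 single-exposure reticle limit

interposer_areas_mm2 = {
    "~1,900 mm^2 class": 1900,
    "~2,500 mm^2 class": 2500,
    "planned ~3,300 mm^2": 3300,
}

print(f"Reticle limit: {reticle_mm2} mm^2")
for name, area in interposer_areas_mm2.items():
    print(f"{name}: about {area / reticle_mm2:.1f}X the reticle limit")
```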
Farjadrad walked everyone through the different 2.5D approaches competing with Eliyan's NuLink PHY, along with their feeds, speeds, and limitations.
Here is how TSMC does 2.5D with its Chip-on-Wafer-on-Substrate (CoWoS) process, which is used to build Nvidia and AMD GPUs with their HBM stacks, among other things:

Technically speaking, the diagram above illustrates TSMC's CoWoS-R interposer technology, which is typically used to link GPUs, CPUs, and other accelerators to HBM memory. The CoWoS silicon interposer is currently limited to roughly two reticles' worth of silicon, which is precisely the size of Nvidia's just-launched "Blackwell" B100 and B200 GPUs. That is no coincidence; it represents the largest device Nvidia could build.
TSMC has a less well-known technology called CoWoS-L, which is more complex to manufacture and is akin to the embedded bridges used in other approaches.
There is a bridging technique called Fan-Out Wafer Level Packaging with Embedded Bridge, championed by chip packager Amkor Technology, and a variant called FOCoS-B from ASE Holdings. Here are the feeds and speeds of this packaging approach:
High trace density means you can get high chip-to-chip bandwidth at low power, but the reach is limited and so is the routability.
Intel's EMIB (Embedded Multi-die Interconnect Bridge) approach, which places the silicon bridge directly into the organic substrate that houses the chiplets (no interposer required), is similar to what Eliyan has done with NuLink:
However, EMIB is plagued by issues such as long production cycles, low yield, limited coverage, and restricted routability.
That leaves the modified 2D MCM approach that Eliyan proposes with NuLink:

Farjadrad says that NuLink is a PHY with roughly ten times the data rate of traditional MCM packaging, and that trace lengths between NuLink PHYs can reach 2 to 3 centimeters, or 20 to 30 times the roughly 1 millimeter trace lengths supported by CoWoS and other 2.5D packaging options. As you will see, that extra distance on the traces, combined with the fact that NuLink PHYs drive bidirectional signaling over those wires, allows for some unique compute engine designs.
"In the current architecture, when you run data packets between memory and the ASIC, the traffic is not bidirectional at the same time," Farjadrad explained. "We need our own special protocol to maintain memory consistency and to make sure there are no conflicts between reads and writes. We knew that when we built the PHY, we would need to create a companion protocol for specific applications. This is one of our biggest differentiators. Having the best PHY is one thing, but combining it with the right expertise for AI applications is another important factor, and we know how to do that."
When NuLink was first introduced in November 2022, it did not have that name, and Eliyan had not yet come up with the idea of using the PHY to create a Universal Memory Interface (UMI). NuLink was simply a way to implement the UCI-Express chiplet interconnect protocol and to support any protocol carried by the original Bunch of Wires (BoW) chiplet interconnect, which Farjadrad and his team created a few years earlier and donated as a proposed standard to the Open Compute Project. Here is how Eliyan stacks NuLink up against various memory and chiplet interconnect protocols:
Intel's MDFIO, short for Multi-Die Fabric I/O, is used to interconnect the four compute chiplets in the "Sapphire Rapids" Xeon SP processor, and EMIB is used to link those chiplets to the HBM memory stacks in the Max Series CPU variant of Sapphire Rapids with HBM. OpenHBI, based on the JEDEC HBM3 electrical interconnect, is also an OCP standard. The UCI-Express we write about here is essentially PCI-Express with a CXL coherence overlay, designed to be the die-to-die interconnect for chiplets. Nvidia's NVLink, now used to bond the chiplets in the Blackwell GPU complex, is in the table, but Intel's XeLink, used on the GPU chiplets of the "Ponte Vecchio" Max Series GPU, is missing from it. Unlike UCI-Express, the NuLink PHY is bidirectional, which means that for the same number of wires as UCI-Express (or more), you get double the bandwidth per wire or better.
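To see why simultaneous bidirectional signaling matters, here is a hypothetical illustration; the wire count and per-wire rate are made-up parameters, not UCI-Express or NuLink specifications.

```python
# Hypothetical illustration of why simultaneous bidirectional signaling
# doubles aggregate bandwidth for a fixed wire count. The wire count and
# per-wire rate below are illustrative assumptions only.
wires = 1000
gbps_per_wire = 16

# Unidirectional PHY: wires must be split between TX and RX directions.
uni_each_way = (wires // 2) * gbps_per_wire / 1000     # Tb/s per direction
# Simultaneous bidirectional PHY: every wire carries traffic both ways at once.
bidi_each_way = wires * gbps_per_wire / 1000           # Tb/s per direction

print(f"Unidirectional: {uni_each_way:.1f} Tb/s each way")    # 8.0
print(f"Bidirectional:  {bidi_each_way:.1f} Tb/s each way")   # 16.0, or 2X the aggregate
```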
As the table shows, the expensive packaging options use bumps with a pitch of 40 to 50 microns, and the die-to-die distances are only around 2 millimeters. The PHY bandwidth density can be very high (measured in Tb/s per millimeter of beachfront on the chiplet edge), and power efficiency varies by approach. Latencies are all under 4 nanoseconds.
On the right side of the table are interconnect PHYs that can be used with standard organic substrate packaging and 130-micron bumps, making them the cheaper option. These include Cadence's UltraLink PHY, AMD's Infinity Fabric PHY, Alphawave Semi's OIF Extra Short Reach (XSR) PHY, and Eliyan's NuLink.
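As an illustration of how the "Tb/s per millimeter of beachfront" metric comes together, here is a hypothetical calculation; only the bump pitches come from the text, while the bump-row depth and per-wire signaling rate are assumptions made for the sake of the example, not figures from Eliyan or the table.

```python
# Hypothetical bandwidth-density arithmetic for a die-to-die PHY. Only the
# bump pitches (40-50 microns vs. 130 microns) come from the text; the row
# depth and per-wire rate are illustrative assumptions.
bump_rows = 8           # assumed depth of the signal bump field
gbps_per_wire = 16      # assumed per-wire signaling rate

for label, pitch_um in [("advanced packaging (45 um pitch)", 45),
                        ("organic substrate (130 um pitch)", 130)]:
    wires_per_mm = (1000 / pitch_um) * bump_rows
    density_tbps_per_mm = wires_per_mm * gbps_per_wire / 1000
    print(f"{label}: ~{wires_per_mm:.0f} wires/mm, ~{density_tbps_per_mm:.1f} Tb/s per mm of beachfront")
```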
Longer links open up the geometry of compute and memory complexes and also eliminate thermal crosstalk between the ASICs and the HBM. Stacked memory is very sensitive to heat, and as GPUs run hotter, the HBM has to be kept cool to operate properly. If you can move the HBM further away from the ASIC, you can run the ASIC faster (about 20 percent faster, Farjadrad estimates) and at higher temperatures, because the memory is no longer close enough to be directly affected by the rising heat of the ASIC.
Additionally, by removing the silicon interposer (or its equivalent) from devices like GPUs, moving to organic substrates, and using fatter bumps and more widely spaced components, you can cut the manufacturing cost of a dual-ASIC device with a dozen HBM stacks from roughly $12,000, at a chip-plus-packaging yield of about 50 percent, to roughly $6,800, at a yield of 87 percent (the yield arithmetic is sketched below).

Eliyan has kept pushing the bidirectional capability of its PHY, and it can now handle bidirectional traffic simultaneously, which it calls UMI-SMD.
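The quoted cost drop lines up almost exactly with the yield improvement, as this minimal check shows; it assumes only that packaged-unit cost scales roughly as raw build cost divided by yield, and uses the figures above.

```python
# Minimal yield-vs-cost sketch using the article's figures. If packaged-unit
# cost scales roughly as (raw build cost) / (chip + packaging yield), the
# yield improvement alone explains most of the quoted cost difference.
cost_interposer = 12_000   # dual-ASIC + 12 HBM stacks on a silicon interposer
yield_interposer = 0.50
cost_organic = 6_800       # same complex on an organic substrate
yield_organic = 0.87

print(f"Quoted cost ratio: {cost_interposer / cost_organic:.2f}X")   # ~1.76X
print(f"Yield ratio alone: {yield_organic / yield_interposer:.2f}X") # ~1.74X
```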
Therefore, the NuLink PHY (now renamed UMI) is smaller and faster than UCI-Express. What can you do with it?
How about building bigger compute engines with 24 or more HBM stacks and 10 to 12 compute chiplets in a reworked compute engine package? Such a device takes a quarter to a fifth of the time to make because it is built on standard organic substrates. In the early 1990s, after IBM began sliding from its peak around 1989, there was a saying associated with the company: you can find better, but you can't pay more.
Here is what Eliyan believes the role of HBM4 might be in the future:
Using the NuLink UMI PHY, you can cut that nearly in half again, leaving more room for the logic of whichever XPU you choose. Alternatively, if you are willing to forgo the interposer, build a larger device, and tolerate a 13 square millimeter UMI PHY, you can build a cheaper device and still drive 2 TB/second out of each HBM4 stack.
Back in November 2022, when Eliyan first pitched its idea, it compared a GPU that uses an interposer to connect to its HBM memory with a machine that drops the interposer and doubles up the ASIC (just as Blackwell did), pairing 24 HBM stacks with those ASIC chips.
On the left is the architecture of the Nvidia A100 and H100 GPUs and their HBM memory. In the middle is an Nvidia chart showing how performance improves as AI applications get more HBM memory capacity and more HBM memory bandwidth. It is well known that the H200, with 141 GB of HBM3E memory and 4.8 TB/second of bandwidth, delivers 1.6 to 1.9 times the performance of the H100 on some workloads, even though both use exactly the same GH100 GPU; the H100 simply gets only 80 GB of HBM3 memory and 3.35 TB/second of bandwidth.
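For reference, here are the capacity and bandwidth ratios implied by those H100 and H200 figures, set against the reported workload uplift; all numbers come from the text above.

```python
# Memory capacity and bandwidth ratios for the H200 vs. H100 figures quoted
# above, next to the reported 1.6X-1.9X workload uplift. Same GH100 silicon;
# only the memory changes.
h100 = {"capacity_gb": 80,  "bandwidth_tbps": 3.35}
h200 = {"capacity_gb": 141, "bandwidth_tbps": 4.8}

print(f"HBM capacity:  {h200['capacity_gb'] / h100['capacity_gb']:.2f}X")        # ~1.76X
print(f"HBM bandwidth: {h200['bandwidth_tbps'] / h100['bandwidth_tbps']:.2f}X")  # ~1.43X
print("Reported performance uplift: 1.6X to 1.9X on some workloads")
```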
Memory is not the big power draw; the GPU is. And the limited evidence we have seen so far suggests that the GPUs from Nvidia, AMD, and Intel have long been limited by HBM memory capacity and bandwidth, owing to how hard this stacked memory is to manufacture. These companies sell GPUs, not memory, and they maximize revenue and profit by pairing as little HBM memory as they can get away with against their prodigious compute. Each generation offers more memory than the one before, but GPU compute always grows faster than memory capacity and bandwidth. The design Eliyan proposes could restore the balance between compute and memory and make these devices cheaper.
Perhaps that is a bit too much for the GPU makers to stomach, so with the introduction of UMI, the company has stepped back a bit and shown how to build larger, more balanced Blackwell GPU complexes using a mix of interposers and organic substrates along with the NuLink PHY.
At the lower left is how you build a Blackwell-Blackwell superchip, which has a single NVLink port running at 1.8 TB/second connecting two dual-chip Blackwell GPUs together.

Using the NuLink UMI approach, shown on the right side of the diagram above, there are two ports providing roughly 12 TB/second of bandwidth between the two Blackwell GPUs, which is slightly more than the 10 TB/second link Nvidia uses to lash the two Blackwell chips together inside the B100 and B200, and more than six times the bandwidth of an Nvidia B200 superchip design (if there were such a thing). If Nvidia wanted to stick with its CoWoS manufacturing process, Eliyan could keep the same eight stacks of HBM3E memory on the interposer, but it could also add another eight stacks of HBM3E to each Blackwell device, for a total of 32 stacks of HBM3E, which would work out to 768 GB of capacity and 25 TB/second of bandwidth.
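Here is a quick check of the arithmetic in that comparison, using only the figures quoted above; the per-stack HBM3E values are implied by the totals rather than stated directly.

```python
# Checking the arithmetic in the Blackwell comparison above, using the
# article's figures. Per-stack HBM3E values are implied by the totals.
nvlink_superchip_tbps = 1.8    # NVLink port between two Blackwell superchip packages
nulink_umi_tbps = 12.0         # two NuLink UMI ports between two packages
print(f"UMI vs. NVLink superchip link: {nulink_umi_tbps / nvlink_superchip_tbps:.1f}X")  # ~6.7X

hbm_stacks = 32
total_capacity_gb = 768
total_bw_tbps = 25
print(f"Implied per-stack capacity:  {total_capacity_gb / hbm_stacks:.0f} GB")   # 24 GB
print(f"Implied per-stack bandwidth: {total_bw_tbps / hbm_stacks:.2f} TB/s")     # ~0.78 TB/s
```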
This UMI approach is applicable to any XPU and any type of memory, and you can do such crazy things, all on a huge organic substrate without the need for an interposer layer:
Any memory, any co-packaged optics, any PCI-Express or other controllers can be linked to any XPU using NuLink. At that point, the socket truly becomes the motherboard.
For larger complexes, Eliyan can build a NuLink Switch.