The advancement of semiconductors is fueling the boom of artificial intelligence.
In 1997, IBM's Deep Blue supercomputer defeated world chess champion Garry Kasparov. It was a breakthrough demonstration of supercomputer technology and a first showcase of the potential for high-performance computing to one day surpass human-level intelligence. Over the next decade, we began to apply artificial intelligence to many practical tasks, such as facial recognition, language translation, and recommending movies and merchandise.
Fifteen years later, artificial intelligence had evolved to the point of being able to "synthesize knowledge." Generative AI, such as ChatGPT and Stable Diffusion, can compose poetry, create art, diagnose diseases, write summary reports and computer code, and even design integrated circuits that rival those designed by humans.
Artificial intelligence has become a digital assistant for all human endeavors, presenting immense opportunities. ChatGPT is a great example of how artificial intelligence democratizes the use of high-performance computing and brings benefits to everyone in society.
All these wonderful AI applications owe their existence to three factors: innovations in efficient machine-learning algorithms, the availability of vast amounts of data for training neural networks, and progress in energy-efficient computing through advances in semiconductor technology. Despite its ubiquity, this last contribution to the generative AI revolution has received less recognition than it deserves.
Over the past three decades, the major milestones in artificial intelligence have each been enabled by the leading semiconductor technology of their time and would have been impossible without it. Deep Blue was realized using a mix of 0.6-micrometer and 0.35-micrometer node chip manufacturing technology. The deep neural network that won the ImageNet competition, launching the current era of machine learning, was built with 40-nanometer technology. AlphaGo, which conquered the game of Go, used 28-nanometer technology, and the initial version of ChatGPT was trained on computers built with 5-nanometer technology. The latest version of ChatGPT is served by systems using even more advanced 4-nanometer technology. Every layer of the computer systems involved, from software and algorithms to architecture, circuit design, and device technology, acts as a multiplier for AI performance. But it is fair to say that the foundational transistor device technology has driven the progress of the layers above it.
If the AI revolution is to continue at its current pace, it will need even more from the semiconductor industry. Within a decade, it will need a GPU with one trillion transistors, ten times as many devices as in today's typical GPUs. The computation and memory access required for AI training have increased by orders of magnitude over the past five years. Training GPT-3, for example, takes the equivalent of more than 5 billion billion operations per second sustained for an entire day (that is, 5,000 petaflop-days), along with 3 trillion bytes (3 TB) of memory capacity.
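As a rough sanity check on those training numbers, here is a minimal Python sketch converting the 5,000 petaflop-day figure into total operations and comparing it against the common 6 × parameters × tokens rule of thumb. The rule of thumb and the GPT-3 parameter and token counts are outside assumptions, not figures from this article:

```python
# Back-of-the-envelope check of the GPT-3 training-compute figure cited above.
PETA = 1e15
SECONDS_PER_DAY = 24 * 60 * 60

# The article's figure: 5,000 petaflop/s sustained for one day.
total_flops = 5_000 * PETA * SECONDS_PER_DAY
print(f"Total operations: {total_flops:.2e} FLOPs")   # ~4.3e23 FLOPs

# Cross-check with the 6*N*D rule of thumb (an assumption, not from the
# article), using the published GPT-3 scale: N = 175e9 parameters trained
# on D = 300e9 tokens.
n_params, n_tokens = 175e9, 300e9
estimate = 6 * n_params * n_tokens
print(f"6*N*D estimate:   {estimate:.2e} FLOPs")      # ~3.2e23, same order
```

Both routes land in the same order of magnitude, which is all such estimates are meant to show.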
The computational power and memory access required by new generative artificial intelligence applications are continuing to grow rapidly. We now face an urgent question: how can semiconductor technology keep pace?
From Integrated Circuits to Integrated Chiplets
Since the invention of integrated circuits, semiconductor technology has been dedicated to shrinking feature sizes so that we can cram more transistors into thumbnail-size chips. Today, integration has risen a level: we are moving beyond 2D scaling into 3D system integration, assembling many chips into a tightly integrated, massively interconnected system. This represents a paradigm shift in semiconductor technology integration.
In the era of artificial intelligence, the capability of a system is proportional to the number of transistors integrated within the system. One of the main limitations is that photolithographic chip manufacturing tools are designed to produce ICs no larger than approximately 800 square millimeters, known as the reticle limit. However, we can now expand the size of integrated systems beyond the photolithographic mask limit. By connecting multiple chips to a larger interposer (a silicon wafer with built-in interconnections), we can integrate a system that contains a significantly greater number of devices than what could be contained on a single chip. For example, TSMC's Chip-on-Wafer-on-Substrate (CoWoS) technology can accommodate computational chips spanning up to six reticle areas, along with a dozen or so High Bandwidth Memory (HBM) chips.
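To make that arithmetic concrete, here is a small sketch, assuming an illustrative footprint of roughly 100 square millimeters per HBM stack; only the ~800 mm² reticle limit, the six-reticle compute budget, and the "dozen or so" HBM count come from the text:

```python
# Rough arithmetic for the scaling headroom CoWoS provides.
RETICLE_MM2 = 800                # max die area per exposure (article figure)
compute_area = 6 * RETICLE_MM2   # CoWoS: up to six reticles of compute silicon
hbm_count, hbm_area = 12, 100    # "a dozen or so" HBM stacks; ~100 mm^2 per
                                 # stack is an assumed footprint
system_silicon = compute_area + hbm_count * hbm_area
print(f"Single reticle-limited die: {RETICLE_MM2} mm^2")
print(f"CoWoS system silicon:       {system_silicon} mm^2 "
      f"({system_silicon / RETICLE_MM2:.1f}x one reticle)")
```

Under these assumptions, one packaged system carries roughly 7.5 times the silicon of a single reticle-limited chip.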
How Nvidia Utilizes CoWoS Advanced Packaging
CoWoS is TSMC's advanced technology for packaging chips on a silicon wafer, and it is already in production. Examples include Nvidia's Ampere and Hopper GPUs. Each consists of a GPU chip and six high-bandwidth memory cubes, all situated on a silicon interposer. The compute GPU die is about as large as chip manufacturing tools currently allow. Ampere has 54 billion transistors, and Hopper has 80 billion. The transition from 7-nanometer technology to the denser 4-nanometer technology allowed a 50 percent increase in the number of transistors packed into roughly the same area. Ampere and Hopper are the mainstays of today's large language model (LLM) training. Training ChatGPT requires tens of thousands of such processors.
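A quick check that these figures hang together, using only the transistor counts quoted above:

```python
# Consistency check of the Ampere -> Hopper numbers quoted above.
ampere_transistors = 54e9   # 7-nm Ampere (article figure)
hopper_transistors = 80e9   # 4-nm Hopper (article figure)

density_gain = hopper_transistors / ampere_transistors - 1
print(f"Transistor-count increase at roughly constant die area: "
      f"{density_gain:.0%}")   # ~48%, matching the ~50% figure in the text
```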
HBM exemplifies another key semiconductor technology of growing importance for AI: integrating systems by stacking chips atop one another, which we at TSMC call System on Integrated Chips (SoIC). HBM consists of a stack of vertically interconnected DRAM chips atop a control logic IC. It uses vertical interconnections known as through-silicon vias (TSVs) to pass signals through each chip and solder bumps to form the connections between the memory chips. Today, high-performance GPUs use HBM widely.
Looking ahead, 3D SoIC technology can offer a "bumpless alternative" to today's conventional HBM, providing denser vertical interconnections between stacked chips. Recent advances have demonstrated HBM test structures with 12 layers of chips stacked using hybrid bonding, a copper-to-copper connection technique with a density higher than solder bumps can provide. This memory system, bonded at low temperature atop a larger base logic chip, has an overall thickness of just 600 micrometers.

For high-performance computing systems composed of many dies running large AI models, high-speed wired communication may soon limit computing speed. Optical interconnects are already being used to connect server racks in data centers. We will soon need optical interfaces based on silicon photonics, packaged together with GPUs and CPUs. This will allow energy- and area-efficient scaling of bandwidth and enable direct optical GPU-to-GPU communication, so that hundreds of servers can act as a single giant GPU with unified memory. Driven by the demands of AI applications, silicon photonics will become one of the semiconductor industry's most important enabling technologies.
Towards Trillion-Transistor GPUs
How AMD Uses 3D Technology
The AMD MI300A accelerated processing unit utilizes not only CoWoS but also TSMC's 3D technology, System on Integrated Chips (SoIC). The MI300A combines GPU and CPU cores and is designed to handle the largest AI workloads. The GPU performs the intensive matrix-multiplication operations of AI, the CPU controls the operation of the entire system, and high-bandwidth memory (HBM) serves both. Nine compute dies built with 5-nanometer technology are stacked atop four 6-nanometer base dies dedicated to cache and I/O traffic. The base dies and the HBM sit on a silicon interposer. The processor's compute portion contains 150 billion transistors.
As mentioned earlier, typical GPU chips used for AI training have already reached the reticle limit, with transistor counts around 100 billion. Continuing the trend of increasing transistor counts will require multiple chips, interconnected through 2.5D or 3D integration, to share the computation. Integrating multiple chips through CoWoS, SoIC, and related advanced packaging technologies allows a far larger total transistor count per system than could be squeezed into a single chip. We predict that within ten years, multi-chiplet GPUs will contain more than one trillion transistors.
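The projection is, at heart, multiplication. A minimal sketch, assuming an illustrative 2x transistor-density gain over the decade and six compute chiplets per package; only the ~100-billion per-die figure and the six-reticle CoWoS budget come from the text:

```python
# Illustrative arithmetic behind the trillion-transistor projection.
per_die_today = 100e9   # today's reticle-limited GPU die (article figure)
density_gain = 2.0      # assumed density improvement over the coming decade
chiplets = 6            # compute dies per package, within CoWoS's
                        # six-reticle budget (article figure)

system_total = per_die_today * density_gain * chiplets
print(f"Projected system transistor count: {system_total:.1e}")  # 1.2e+12
```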
All these chiplets must be connected together in a 3D stack, and fortunately the industry has been able to rapidly shrink the pitch of vertical interconnects, increasing connection density. There is also plenty of room for more: we see no reason why interconnect density cannot grow by an order of magnitude, or even beyond.
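Why pitch matters so much: vertical connections form a two-dimensional grid, so density grows with the square of the pitch reduction. A short sketch, with pitch values that are illustrative assumptions rather than vendor roadmaps:

```python
import math

# Areal density of vertical interconnects scales as 1/pitch^2.
def density_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for a square grid with the given pitch."""
    return (1000 / pitch_um) ** 2

for pitch in (36.0, 9.0, 3.0):   # assumed microbump and hybrid-bond pitches
    print(f"pitch {pitch:>5.1f} um -> {density_per_mm2(pitch):>12,.0f} /mm^2")

# A 10x density gain only requires shrinking pitch by sqrt(10) ~ 3.2x:
print(f"pitch shrink needed for 10x density: {math.sqrt(10):.2f}x")
```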
Energy-Efficient Performance Trends for GPUs
So how do all these innovative hardware technologies improve system performance? We can already see the trend in server GPUs by looking at the steady improvement in a metric called energy-efficient performance (EEP), a combined measure of a system's energy efficiency and its speed. Over the past 15 years, the semiconductor industry's EEP has improved roughly threefold every two years. We believe this trend will continue at its historical pace, driven by innovation on many fronts: new materials, device and integration technologies, extreme ultraviolet (EUV) lithography, circuit design, system architecture design, and the co-optimization of all these elements.
Largely thanks to advances in semiconductor technology, a metric known as energy-efficient performance is on track to triple every two years (EEP units are 1/femtojoule-picosecond).
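Compounding makes the implications of that trend concrete. A small sketch of what tripling every two years amounts to over the 15-year window cited above:

```python
# What a 3x-every-two-years EEP trend compounds to.
rate_per_2yr = 3.0   # historical EEP improvement rate (article figure)
years = 15           # the window the article cites

total_gain = rate_per_2yr ** (years / 2)
print(f"EEP gain over {years} years: {total_gain:,.0f}x")   # ~3,800x
print(f"Annualized: {rate_per_2yr ** 0.5:.2f}x per year")   # ~1.73x/yr
```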
In particular, the gains in EEP will come from the advanced packaging technologies discussed here. In addition, concepts such as system technology co-optimization (STCO), in which the different functional parts of a GPU are separated onto their own chiplets, each built with the best-performing and most economical technology for that function, will become increasingly important.
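To illustrate the STCO idea, here is a toy sketch that assigns each function the cheapest process node adequate for it. The node names are real, but every cost and requirement below is invented for illustration:

```python
# Toy sketch of STCO reasoning: each GPU function goes on its own chiplet,
# fabricated in whichever node is adequate and cheapest for it.
NODE_COST = {"3nm": 5.0, "5nm": 3.0, "6nm": 1.5}   # assumed relative costs

# Illustrative requirements: logic benefits most from leading-edge density;
# SRAM and I/O circuits scale poorly, so older nodes suffice for them.
requirements = {
    "compute logic": ["3nm"],
    "SRAM cache":    ["3nm", "5nm", "6nm"],
    "I/O + analog":  ["3nm", "5nm", "6nm"],
}

for function, viable_nodes in requirements.items():
    best = min(viable_nodes, key=NODE_COST.get)
    print(f"{function:>13}: {best} (relative cost {NODE_COST[best]})")
```

Under these invented numbers, only the compute logic pays for the leading edge, which is exactly the economy STCO is after.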
A Mead-Conway Moment for 3D Integrated Circuits
In 1978, Carver Mead, a professor at the California Institute of Technology, and Lynn Conway of the Xerox Palo Alto Research Center invented a computer-aided design method for integrated circuits. With a set of design rules describing chip scaling, engineers could easily design very-large-scale integration (VLSI) circuits without needing to know much about the underlying process technology.
3D chip design needs a similar capability. Today, designers must understand chip design, system architecture, and hardware-software optimization, while manufacturers must understand chip technology, 3D IC technology, and advanced packaging. Just as in 1978, we again need a common language to describe these technologies in a way that electronic design tools can understand. Such a hardware description language gives designers the freedom to work on 3D IC system design without worrying about the underlying technology. It is on the way: an open-source standard called 3Dblox has already been embraced by most of today's technology companies and electronic design automation (EDA) companies.
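To suggest what such a technology-agnostic description buys designers, here is a conceptual Python sketch. It is emphatically not 3Dblox syntax, just a model of describing a stack of dies while leaving bonding and pitch details to the tools underneath:

```python
# Conceptual sketch (not actual 3Dblox syntax) of a technology-agnostic
# 3D-IC description: which dies exist, how they stack, and how they relate,
# with process details left to the implementing technology.
from dataclasses import dataclass, field

@dataclass
class Die:
    name: str
    function: str                                # e.g. "compute", "cache", "io"

@dataclass
class Stack:
    tiers: list = field(default_factory=list)    # ordered bottom to top

    def place(self, die: Die, on_top_of: str | None = None) -> None:
        """Add a die as the next tier. Bonding method and interconnect
        pitch are deliberately absent from the description."""
        self.tiers.append((die, on_top_of))

system = Stack()
system.place(Die("base0", "io"))                 # I/O base die
system.place(Die("core0", "compute"), "base0")   # compute die stacked above
system.place(Die("l3",    "cache"),   "core0")   # cache die on top

for die, below in system.tiers:
    print(f"{die.name} ({die.function})" + (f" on {below}" if below else ""))
```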
The Future Beyond the Tunnel
In the era of artificial intelligence, semiconductor technology is a key driver of new AI capabilities and applications. New GPUs are no longer constrained by yesterday's standard sizes and form factors, and new semiconductor technology is no longer limited to shrinking the next generation of transistors on a two-dimensional plane. An integrated AI system can be composed of as many energy-efficient transistors as needed, an efficient system architecture tailored to specialized computational workloads, and software optimized together with the hardware.
Over the past 50 years, the development of semiconductor technology has been like walking through a tunnel. The path ahead was clear because there was a defined road, and everyone knew what needed to be done: shrink the transistors. Now we have reached the end of the tunnel. From here on, semiconductor technology will become harder to develop. Yet beyond the tunnel, many more possibilities lie open.