AI giants plan a supercomputer project worth more than $110 billion

Supercomputing in the United States has a new milestone to look forward to.

According to reports, Microsoft and OpenAI are developing a massive data center to house an AI-focused supercomputer containing millions of GPUs. The Information reports that the project could cost over 115 billion dollars, and that the supercomputer, currently known internally at OpenAI as "Stargate," will be located in the United States.

The report states that Microsoft will foot the bill for the data center, which could be "100 times more expensive" than some of the largest data centers in operation today. Stargate is set to be the largest in a series of data center projects the two companies hope to build over the next six years, and executives aim to have it operational by 2028.

The report indicates that OpenAI and Microsoft are building these supercomputers in phases, with Stargate being the Phase 5 system. Sources cited by The Information suggest that the less costly Phase 4 system could be launched as early as 2026, possibly in Mt. Pleasant, Wisconsin. Stargate itself may require a power supply of at least several thousand megawatts, to the point that Microsoft and OpenAI are considering alternative energy sources, such as nuclear power.
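
To put those power figures in perspective, here is a minimal back-of-the-envelope sketch. The GPU count is taken as a midpoint of the report's "millions of GPUs," and the per-accelerator wattage and overhead factor are illustrative assumptions, not numbers from the report.

```python
# Rough sanity check on the scale of power implied by "millions of GPUs".
num_gpus = 2_000_000      # assumed midpoint of the report's "millions of GPUs"
watts_per_gpu = 700       # assumed draw of one H100-class accelerator, in watts
overhead = 1.5            # assumed multiplier for CPUs, networking, and cooling

total_gw = num_gpus * watts_per_gpu * overhead / 1e9
print(f"Estimated facility draw: {total_gw:.1f} GW")  # ~2.1 GW, i.e. thousands of megawatts
```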

Sources mention that data centers of this scale will be challenging to construct, partly because the design needs to "place more GPUs into a single rack than in the past to enhance the efficiency and performance of the chips, which also means devising innovative methods to maintain good thermal performance."

It sounds like these companies might also use this design to reduce their reliance on NVIDIA. The report states that OpenAI wants to avoid using NVIDIA's InfiniBand networking in Stargate, even though Microsoft uses it in current projects; OpenAI says it would prefer to use Ethernet instead.

Many details remain to be determined, so the price and plans seem likely to change, and it is currently unclear when the final details will be settled. The Information also points out that the location of this computer has not yet been decided, nor whether it will be built as a single data center or as "multiple nearby data centers."

Earlier this year, reports indicated that OpenAI Chief Executive Officer Sam Altman was ambitiously pursuing artificial intelligence chips and hoped to raise up to 7 trillion dollars to build a wafer fab to produce them. Last year, Microsoft announced a 128-core Arm data center CPU and the Maia 100 accelerator specifically for AI projects, and there were also reports that Microsoft was developing its own networking equipment for AI data centers. With the rise of artificial intelligence, demand for Nvidia's GPUs is high, so it makes sense that companies like Microsoft and OpenAI would want some alternative options.

"We have been planning the next generation of infrastructure innovation to continue to drive the development of artificial intelligence," Microsoft Chief Communications Officer Frank Shaw told The Information, but he did not directly comment on the supercomputer plan.

Microsoft has invested billions of dollars in its partnership with OpenAI, primarily in the form of computing power to run its models. If something like Stargate becomes a reality, this partnership will only deepen as the scale and complexity of the investment grow.

What are today's leading supercomputers?

In November 2023, the 62nd edition of the TOP500 list of the world's most powerful supercomputers was officially released at the SC23 supercomputing conference. Frontier at Oak Ridge National Laboratory in the United States still holds the number one position, while China's Sunway TaihuLight and Tianhe-2A remain in the top fifteen, ranking 11th and 14th, respectively.

The top-ranked Frontier continues to lead with an HPL performance of 1.194 EFlop/s. Based on the HPE Cray EX235a architecture, it is equipped with 2 GHz AMD EPYC 64C processors and a total of 8,699,904 CPU and GPU cores. Frontier also has a rated energy efficiency of 52.59 GFlops/W and moves data over HPE's Slingshot-11 interconnect.
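
As a quick illustration of how the efficiency rating relates to real power consumption, dividing the quoted HPL performance by the quoted GFlops/W figure gives the power the machine drew during the benchmark run. Both inputs come straight from the list entry above; the result is only a rough sketch, not an official power figure.

```python
# Implied power draw = HPL performance / energy efficiency, using the values above.
hpl_flops = 1.194e18                 # 1.194 EFlop/s
efficiency_flops_per_watt = 52.59e9  # 52.59 GFlops/W

power_mw = hpl_flops / efficiency_flops_per_watt / 1e6
print(f"Implied power draw during HPL: {power_mw:.1f} MW")  # roughly 22.7 MW
```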

The second-ranked Aurora supercomputer at Argonne National Laboratory in the United States debuted on the list with an HPL performance of 585.34 PFlop/s. It should be noted that this result was submitted while Aurora was still only partially built, at about half of its planned final scale. According to the plan, the completed Aurora will have 21,248 Intel Xeon Max series CPUs, 63,744 Intel Max series GPUs, and 20.42 PB of memory, with a peak performance of 2 EFlop/s, far exceeding Frontier.

The third-ranked Eagle, installed in Microsoft's Azure cloud in the United States, has an HPL performance of 561.2 PFlop/s, which is also the highest ranking achieved by a cloud service provider. It is built with Intel Xeon Platinum 8480C processors and Nvidia H100.

The fourth-ranked Fugaku supercomputer in Japan has an HPL score of 442.01 PFlop/s. It is built on Fujitsu's self-developed 48-core, Arm-based A64FX processor, with a total of about 160,000 CPU chips installed.

The fifth-ranked supercomputer is LUMI, a EuroHPC system located in Kajaani, Finland, with an HPL performance of 379.07 PFlop/s. It is based on the HPE Cray EX235a architecture and is equipped with 2 GHz AMD EPYC 64C processors and AMD Instinct MI250X GPUs.

The sixth-ranked system is Leonardo, installed at a EuroHPC site at CINECA in Italy, with an HPL performance of 238.7 PFlop/s. It is an Atos BullSequana XH2000 system with Intel Xeon Platinum 8358 32C 2.6 GHz processors and NVIDIA A100 SXM4 40 GB accelerators, interconnected with four-rail NVIDIA HDR100 InfiniBand.

The seventh-ranked supercomputer globally is Summit at Oak Ridge National Laboratory (ORNL) in Tennessee, USA, built by IBM. It currently has an HPL performance of 148.8 PFlop/s, with 4,356 nodes, each equipped with two POWER9 CPUs (22 cores each) and six NVIDIA Tesla V100 GPUs (80 SMs each), connected via a Mellanox dual-rail EDR InfiniBand network.

The eighth-ranked system is MareNostrum 5 ACC, recently installed at the EuroHPC/Barcelona Supercomputing Center in Spain. Built with Xeon Platinum 8460Y processors, NVIDIA H100 accelerators, and NDR200 InfiniBand, it achieves an HPL performance of 138.2 PFlop/s.

The ninth-ranked newcomer, Eos, is an NVIDIA-built DGX SuperPOD composed of DGX H100 systems with Intel Xeon Platinum 8480C processors, NVIDIA H100 accelerators, and NDR400 InfiniBand, achieving an HPL performance of 121.4 PFlop/s.

The tenth-ranked system is Sierra at Lawrence Livermore National Laboratory in California, USA. Its architecture is very similar to that of the seventh-ranked Summit, consisting of 4,320 nodes, each with two POWER9 CPUs and four NVIDIA Tesla V100 GPUs, and it achieves a performance of 94.6 PFlop/s.

Additionally, the top spot on the GREEN500 list still belongs to Henri at the Flatiron Institute in New York. The system has an energy-efficiency rating of 65.40 GFlops/W and an HPL score of 2.88 PFlop/s. Henri is a Lenovo ThinkSystem SR670 equipped with Intel Xeon Platinum processors and NVIDIA H100 GPUs, has a total of 8,288 cores, and ranks 293rd on the TOP500 list.
