
Article summary :: TL;DR

The article "Elon Musk's Colossus Supercomputer with NVIDIA chips powering GROK" can be summarized as follows: Colossus, developed by Elon Musk's xAI, is described as the largest and most powerful AI training supercomputer. It currently runs on 100,000 NVIDIA chips to power GROK and is soon to grow to 200,000. Key topics: Colossus, AI training, GROK, xAI, Elon Musk, NVIDIA chips, 100,000 and 200,000 GPUs.

The GROK supercomputer Colossus runs on 100,000 NVIDIA chips for now, soon to be 200,000.
  • Colossus: The largest and most powerful AI training supercomputer developed by Elon Musk's xAI.
  • Location: Memphis, Tennessee, in an industrial park on the Mississippi River. The building was previously home to a Swedish appliance manufacturer, Electrolux.

Key Features

  • Hardware:
    • Over 100,000 Nvidia HGX H100 GPUs connected with exabytes of data storage.
    • Liquid cooling system using vast amounts of water to maintain optimal temperatures.
  • Speed: Claimed to be the fastest supercomputer on the planet, built to power the AI model Grok.
  • Construction: Built in just 122 days, significantly faster than traditional supercomputer clusters, which take years.

Data Hall Configuration

  • Structure:
    • The facility features a raised-floor data hall design, separating power, cooling, and GPU clusters into three levels.
    • Four data halls, each containing 25,000 GPUs.

Cooling System 

The water cooling system of the GROK supercomputer
  • Liquid Cooling:
    • Utilizes a network of pipes to circulate water, removing heat from GPUs efficiently.
    • Hot water is sent to a chiller before being pumped back in, maintaining optimal temperatures.

GPU and CPU Configuration

  • GPU Racks:
    • Each rack contains eight Nvidia H100 GPUs and has an independent water cooling system.
    • Racks can be serviced without shutting down the entire cabinet, minimizing downtime.
  • CPU Usage:
    • Two CPUs for every eight GPUs, handling data preparation and operating system tasks.
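The figures quoted in this summary can be cross-checked with simple arithmetic. The following is only a rough sketch: it assumes the four data halls are filled evenly and that every group of eight GPUs is paired with exactly two CPUs, as stated above. The constant and function names are illustrative and do not come from xAI or Nvidia.

```python
# Figures taken from the summary above; all names are illustrative.
DATA_HALLS = 4
GPUS_PER_HALL = 25_000
GPUS_PER_UNIT = 8   # eight H100 GPUs per unit
CPUS_PER_UNIT = 2   # two CPUs for every eight GPUs


def cluster_totals(halls=DATA_HALLS, gpus_per_hall=GPUS_PER_HALL):
    """Derive total GPUs, eight-GPU units, and CPUs from the per-hall figure."""
    total_gpus = halls * gpus_per_hall
    units = total_gpus // GPUS_PER_UNIT
    cpus = units * CPUS_PER_UNIT
    return {"gpus": total_gpus, "units": units, "cpus": cpus}


print(cluster_totals())
# {'gpus': 100000, 'units': 12500, 'cpus': 25000}
```

Under these assumptions, the 100,000 GPUs correspond to 12,500 eight-GPU units and 25,000 CPUs.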

Data Management

  • Storage:
    • The system holds exabytes of data (1 exabyte = 1 billion gigabytes) for training purposes.
    • Data is transferred via a high-speed network powered by Nvidia BlueField-3 DPUs, each capable of handling 400 Gbps.
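To give a sense of scale for these storage and network figures, here is a back-of-the-envelope sketch. It assumes decimal (SI) units, a link running at its full 400 Gbps line rate, and no protocol overhead. The `links` parameter is hypothetical, since the summary does not say how many links carry data in parallel.

```python
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18
LINK_GBPS = 400          # per-link rate quoted in the summary
SECONDS_PER_DAY = 86_400


def exabytes_to_gigabytes(exabytes):
    """Convert whole exabytes to gigabytes using decimal (SI) units."""
    return exabytes * BYTES_PER_EB // BYTES_PER_GB


def transfer_days(exabytes, links=1):
    """Idealized days needed to move the data over `links` parallel links."""
    bits = exabytes * BYTES_PER_EB * 8
    seconds = bits / (links * LINK_GBPS * 10**9)
    return seconds / SECONDS_PER_DAY


print(exabytes_to_gigabytes(1))        # 1000000000
print(round(transfer_days(1), 1))      # 231.5
```

Under these idealized assumptions, a single 400 Gbps link would need roughly 231 days to move one exabyte, which illustrates why a cluster of this size has to spread data movement across many links at once.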

Energy Supply

  • Power Source:
    • Primarily powered by Tesla Megapack batteries, ensuring consistent energy supply to the supercomputer.
    • This setup mitigates fluctuations from the traditional power grid, crucial for efficient training sessions.

Financial Aspects

  • Funding:
    • xAI raised $6 billion in venture capital, valuing the company at $24 billion in total.
    • Elon Musk is reportedly seeking additional funding to increase the company's valuation to $40 billion.

Future Developments

The massive GROK supercomputer with NVIDIA chips is currently the world's leading system
  • Expansion Plans:
    • Plans to double the size of Colossus to over 200,000 H100 GPUs within the next two months. 
  • AI Evolution:
    • Grok has recently been upgraded to include vision capabilities, allowing it to analyze images alongside text.

Conclusion

Colossus represents a significant leap in AI training capabilities, combining cutting-edge hardware, innovative cooling solutions, and efficient energy management to pave the way for advanced artificial intelligence development. The rapid growth and ambitious plans of xAI position it as a formidable player in the AI landscape.