There is so much to share, so here is our take from Jensen’s Keynote address…
After a five-year Covid hiatus, Jensen Huang took the stage for an in-person Keynote at the SAP Center to an adoring crowd of techies and investors. Like in the olden days, the man in the black leather jacket spoke for two hours about the miracles of AI, and the next generation of hardware and software technologies that make AI sing and dance.
I know your time is valuable, so let’s get to the news! First we’ll cover the hardware, then the software.
Blackwell: Four times faster training, and 30 times the inference performance.
Blackwell has arrived. Blackwell is a GPU composed of two full-sized GPUs delivering 20 PetaFlops in a “single” (double-chip) GPU that delivers four times the training and, get this, THIRTY TIMES the inference throughput per GPU compared to an H100. Blackwell can scale up to 576 GPUs (H100 scaled to 256) thanks to a new and faster 5th-gen NVLink. An included 2nd-gen Transformer Engine with FP4 precision adds to the performance boost as does a decompression engine that is 20X faster.
GB200 and the NVL72: The Rack-scale design point for AI
Most of the marketing thrust, however, is focused not on the Blackwell GPU, but on the three-die Superchip called GB200, which has two Blackwells and one Grace Arm CPU. This departure from the 1-1 ratio of Grace-Hopper chips makes a lot of sense, as Grace was overkill for the GH200: a single Grace has enough I/O and compute bandwidth to feed and manage two Blackwells, or four GPU dies. This should help lower TCO for using the Grace-driven platform and could take Grace from a small portion of Nvidia revenue to a significant driver of new installations.
The NVLink-enabled GB200 NVL72 rack includes 72 Blackwell GPUs and 36 Grace CPUs. Nvidia says this rack alone can train a 27-trillion-parameter model. Of course most AI Factories, for whom this rack is designed, will use multiple racks to train such a massive model even faster. Nvidia said that its AWS-hosted Ceiba AI supercomputer will now consist of 20,000 GB200 GPUs instead of the initially announced 16,000 H100s.
Two GB200s go into a compute tray, and there are 18 compute trays per rack. Two NVSwitches go in each switch tray. Everything is water cooled at 2 liters per second, and the rack weighs 3,000 lbs and consumes 120 kW in total.
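The tray math is easy to check. Here is a minimal sketch that rebuilds the rack totals from the per-tray figures above; the per-GPU power number is only a naive division of total rack power (it includes CPUs, switches, and everything else in the rack), not a figure Nvidia published.

```python
# Rack composition, using only the figures quoted in this article.
GPUS_PER_SUPERCHIP = 2        # each GB200 pairs two Blackwell GPUs...
CPUS_PER_SUPERCHIP = 1        # ...with one Grace CPU
SUPERCHIPS_PER_TRAY = 2
COMPUTE_TRAYS_PER_RACK = 18
RACK_POWER_KW = 120

superchips = SUPERCHIPS_PER_TRAY * COMPUTE_TRAYS_PER_RACK
gpus = superchips * GPUS_PER_SUPERCHIP
cpus = superchips * CPUS_PER_SUPERCHIP

# Naive share of rack power per GPU; this lumps in CPUs, switches, and
# other components, so treat it as an upper bound, not a GPU TDP.
kw_per_gpu = RACK_POWER_KW / gpus

print(f"{superchips} superchips -> {gpus} GPUs and {cpus} CPUs per rack")
print(f"about {kw_per_gpu:.2f} kW of rack power per GPU")
```

The totals line up with the "72" in the NVL72 name and with the 36 Grace CPUs Nvidia quotes for the rack.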
Then there is the Mixture of Experts (MoE) problem, covered in more detail below. Nvidia astounded the audience by claiming that while the GB200 is eight times faster than the H100 for “traditional” LLMs like GPT-3, which isn’t bad for sure, the GB200 is an amazing 30 times faster than an H100 for inferencing a 1.8T parameter MoE. (Cue the mic drop.)
New Transformer Engine
I get a lot of questions about the Nvidia Transformer Engine. Basically, this technology allows each tensor to be computed at the optimal precision, now down to FP4. This means that if a competing GPU has the same number of Flops, the Blackwell will be perhaps twice as fast in inference processing thanks to the Transformer Engine. Per Ian Buck, Nvidia VP of HPC and Hyperscale, “What it does is it tracks the accuracy dynamic range of every layer of every tensor and the entire neural network as it proceeds in computing and as the model is training over time, we’re constantly monitoring the ranges of every layer and adapting to stay within the bounds of the numerical precision to get the best performance.”
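To make the "monitor the range and adapt" idea concrete, here is a toy sketch of per-tensor scaling. It is not Nvidia's implementation: the class and function names are made up for illustration, and the low-precision format is faked with a coarse integer grid (`fmt_max=7`) rather than real FP4 or FP8 arithmetic. It only shows why tracking each tensor's recent maximum and rescaling before rounding preserves far more information than rounding the raw values.

```python
from collections import deque

class TensorScaler:
    """Hypothetical helper: tracks recent absolute maxima of one tensor."""
    def __init__(self, fmt_max, history_len=4):
        self.fmt_max = fmt_max                      # largest value the toy format holds
        self.amax_history = deque(maxlen=history_len)

    def observe(self, values):
        self.amax_history.append(max(abs(v) for v in values))

    def scale(self):
        amax = max(self.amax_history)
        return self.fmt_max / amax if amax > 0 else 1.0

def fake_quantize(values, scale, fmt_max):
    """Scale, round to the coarse grid, clip, then scale back."""
    out = []
    for v in values:
        q = round(v * scale)
        q = max(-fmt_max, min(fmt_max, q))
        out.append(q / scale)
    return out

activations = [0.02, -0.15, 0.4, -0.33]

scaler = TensorScaler(fmt_max=7)
scaler.observe(activations)

scaled = fake_quantize(activations, scaler.scale(), 7)   # range-aware
unscaled = fake_quantize(activations, 1.0, 7)            # no range tracking

def max_err(approx):
    return max(abs(a - b) for a, b in zip(activations, approx))

print("range-aware max error:", max_err(scaled))
print("unscaled max error:   ", max_err(unscaled))
```

Without scaling, every value in this small-magnitude tensor rounds to zero; with a scale derived from the observed maximum, the same handful of grid levels covers the tensor's actual range.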
Now let’s see how this beast scales. Nvidia pointed out that today’s AI models like Meta’s Llama 2 are 95% compute-bound (and memory-bound) and only 5% communication-bound. But the next generation of AI models, for which Grace Blackwell is designed, use a “Mixture of Experts”, which is 40% compute-bound and 60% communication-bound. Nvidia concludes that a chip like the H100 would be 18 times slower on such models, as the GPUs are all trying to talk to each other.
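Why the compute/communication split matters so much can be seen with a back-of-the-envelope Amdahl's-law calculation. This is a sketch using the percentages above and a hypothetical 4x compute gain; it is not how Nvidia arrived at its 18x figure, which the keynote did not break down.

```python
def overall_speedup(compute_frac, comm_frac, compute_gain, comm_gain=1.0):
    """Amdahl-style speedup when compute and communication speed up separately."""
    return 1.0 / (compute_frac / compute_gain + comm_frac / comm_gain)

# Hypothetical: compute gets 4x faster, the interconnect stays the same.
dense = overall_speedup(0.95, 0.05, compute_gain=4)   # Llama-2-like split
moe = overall_speedup(0.40, 0.60, compute_gain=4)     # MoE-like split

# Same compute gain, but the interconnect also gets 4x faster.
moe_fast_links = overall_speedup(0.40, 0.60, compute_gain=4, comm_gain=4)

print(f"dense model, faster compute only: {dense:.2f}x")
print(f"MoE model, faster compute only:   {moe:.2f}x")
print(f"MoE model, faster compute + links: {moe_fast_links:.2f}x")
```

With a 60% communication share, a faster GPU alone buys well under 2x; the interconnect has to speed up too, which is the argument for the NVLink changes described next.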
NVLink Gen 5: faster with 3.6TFlops offload
Nvidia is expanding NVLink to Multi-Rack Scale with 3.6TF in-network compute support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology. SHARP improves the performance of MPI and Machine Learning collective operations by offloading operations from CPUs and GPUs to the network and eliminates the need to send data multiple times between endpoints.
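The "send data once" point can be illustrated with a toy model. The sketch below is not the SHARP protocol or any Nvidia API; the function names are invented, and the "switch" is just a Python function that sums vectors. It compares how much data each endpoint must send in a standard ring all-reduce, which takes 2(N-1) steps of 1/N of the vector each, with a scheme where the network does the summing.

```python
def switch_reduce(endpoint_vectors):
    """Toy switch: element-wise sum of the vectors arriving on its ports."""
    length = len(endpoint_vectors[0])
    return [sum(vec[i] for vec in endpoint_vectors) for i in range(length)]

def in_network_allreduce(endpoint_vectors):
    """Each endpoint uploads its vector once; the switch returns the sum to all."""
    reduced = switch_reduce(endpoint_vectors)
    results = [list(reduced) for _ in endpoint_vectors]
    vectors_sent_per_endpoint = 1.0
    return results, vectors_sent_per_endpoint

def ring_sent_fraction(num_endpoints):
    """Data each endpoint sends in a ring all-reduce, in units of one vector."""
    return 2 * (num_endpoints - 1) / num_endpoints

gradients = [[1, 2], [3, 4], [5, 6]]          # three endpoints, two values each
results, sent = in_network_allreduce(gradients)

print("every endpoint ends with:", results[0])
print("in-network: vectors sent per endpoint =", sent)
print("ring, 8 endpoints: vectors sent per endpoint =", ring_sent_fraction(8))
```

In the ring scheme each endpoint sends close to two full vectors' worth of data as the endpoint count grows; with the reduction done in the network it sends one, and the GPUs skip the intermediate additions as well.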
DGX and the DGX SuperPOD
As usual, Nvidia is also deploying the new chips (B200 and GB200) in HGX system boards for OEMs and the Nvidia DGX system, respectively. Unlike the NVL72, both are air-cooled with a reported 15X inference performance and 3X training compared to the DGX H100.
The new SuperPOD is a liquid-cooled rack-scale architecture built with NVIDIA DGX GB200 systems and provides 11.5 exaflops of AI supercomputing at FP4 precision and 240 terabytes of fast memory, scaling to more with additional racks. Each DGX GB200 system features 36 NVIDIA GB200 Superchips (which include 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs) connected as one supercomputer via fifth-generation NVIDIA NVLink.
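A quick sanity check ties these figures together. Assuming the 20-petaflop-per-GPU number quoted earlier is an FP4 figure (an assumption; the precision was not stated alongside it), one 72-GPU DGX GB200 system works out to about 1.44 exaflops, which suggests the 11.5-exaflop headline describes a configuration of roughly eight such systems.

```python
PFLOPS_PER_GPU = 20          # per-Blackwell figure quoted earlier (assumed FP4)
GPUS_PER_SYSTEM = 72         # one DGX GB200 system
SUPERPOD_EXAFLOPS = 11.5     # headline SuperPOD figure

system_exaflops = PFLOPS_PER_GPU * GPUS_PER_SYSTEM / 1000
systems_implied = SUPERPOD_EXAFLOPS / system_exaflops
gpus_implied = round(systems_implied) * GPUS_PER_SYSTEM

print(f"one system: {system_exaflops:.2f} exaflops")
print(f"systems needed for {SUPERPOD_EXAFLOPS} exaflops: {systems_implied:.2f}")
print(f"GPUs in that configuration: {gpus_implied}")
```

Eight systems of 72 GPUs is 576 GPUs, the same number given earlier as Blackwell's NVLink scale-up limit.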
HW Availability
So, when is all this goodness shipping? Nvidia was a bit coy on that minor detail, but Jensen showed the logo of every 1st tier CSP, and every small and large Server OEM. So it certainly looks like Blackwell will be a 2024 revenue driver, with B100 shipping ASAP and the GB200 shipping later this year.
So, what happens to the H100? Well, if you are an enterprise looking to train or fine-tune a model, or run inference on these ~80B models, the H100 will remain the most cost-effective platform. But if you are an AI factory creating a 10 Trillion parameter MoE model, you’ll need the GB200 and probably in the NVL72 rack.
NIM: pre-built domain-specific Inference Microservices
During the last quarterly earnings call, Nvidia indicated that the company’s software business is reaching critical mass at a $1B run rate. Nvidia’s software helps clients get AI (or HPC) running quickly, and has now taken the next logical step. The company has created a concept called a “NIM” (for Nvidia Inference Microservices, we think) that consists of pre-built Kubernetes containers, models, APIs, and inference engines like Triton for developers building domain-specific copilots. NIMs are included in the all-you-can-eat Nvidia AI Enterprise solution at $4500/GPU/Year.
Let that sink in a minute. For illustrative purposes, if Nvidia sells 1 million GPUs (about 1/4 of expected GPU shipments) into companies or sovereign data centers with AI Enterprise, that would generate $4.5B in annual, sticky, high-margin revenue. And of course, Nvidia would be happy to sell additional software licenses for the hundreds of millions of installed-base GPUs.
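The arithmetic is simple enough to play with. The sketch below uses the $4,500 list price and the roughly 4 million GPU shipment estimate implied above; the other attach rates are hypothetical what-ifs, not forecasts.

```python
LICENSE_USD_PER_GPU_YEAR = 4_500
EXPECTED_GPU_SHIPMENTS = 4_000_000   # implied by "1 million is about 1/4"

def annual_software_revenue(licensed_gpus, price=LICENSE_USD_PER_GPU_YEAR):
    """Annual AI Enterprise revenue in USD for a given number of licensed GPUs."""
    return licensed_gpus * price

base_case = annual_software_revenue(1_000_000)
print(f"1M licensed GPUs: ${base_case / 1e9:.1f}B per year")

# Hypothetical attach rates against the shipment estimate.
for attach_rate in (0.10, 0.25, 0.50):
    licensed = int(EXPECTED_GPU_SHIPMENTS * attach_rate)
    revenue = annual_software_revenue(licensed)
    print(f"attach rate {attach_rate:.0%}: ${revenue / 1e9:.1f}B per year")
```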
So, one has to wonder: are we about to transition from a “Hardware is dragging in some software” model to a new paradigm where the “Software is dragging in the hardware”? Business value and time to market are driven by models and optimized software, and NIMs could make it easier to deploy inference capacity. Nvidia says you can deploy a model in 10 minutes. Which, of course, comes with really cool GPUs.
Here’s an example of NIMs. Nvidia has been marketing “Clara” as its comprehensive starting point for the health care industry. Now, with NIMs, healthcare microservices are pre-built and easy to deploy with standard APIs and deployment flexibility, cloud or on prem. This is how Nvidia will go from a huge toolbox of stuff to consumable and deployable AIs.
NIM microservices provide the fastest and highest-performing production AI container for deploying models from NVIDIA, AI21, Adept, Cohere, Getty Images, and Shutterstock as well as open models from Google, Hugging Face, Meta, Mistral AI and Stability AI, and will soon support models from Microsoft. ServiceNow announced that it is using NIM to develop and deploy new domain-specific copilots and other generative AI applications faster and more cost effectively.
Omniverse Updates
Omniverse, the Nvidia platform for 3D collaboration and digital twins, continues to expand into new markets and attract new partners. Nvidia announced new APIs to simplify the integration of CAD and CAE software into Omniverse. “Everything manufactured will have digital twins,” said Jensen Huang, founder and CEO of NVIDIA. “Omniverse is the operating system for building and operating physically realistic digital twins. Omniverse and generative AI are the foundational technologies to digitalize the $50 trillion heavy industries market.”
One of the immediately relevant use cases Nvidia showcased on the GTC floor was the use of data center digital twins to model the change out from older GPU technologies to the new GB200 platform. To bring up new data centers as fast as possible, NVIDIA first built its digital twin with software tools connected by Omniverse. Engineers visualized multiple CAD datasets in full physical accuracy and photorealism in Universal Scene Description (OpenUSD) using the Cadence Reality digital twin platform, powered by NVIDIA Omniverse APIs. This technology is helping to streamline the design and construction process for new and updated data centers, particularly when implementing cutting-edge hardware like the GB200 platform.
cuLitho: moving into production with Synopsys and TSMC
The semiconductor manufacturing industry has been looking into GPU-accelerated computational lithography as a way to accelerate throughput since Nvidia introduced cuLitho a year ago. Now, TSMC and Synopsys are ready to take this 40X improvement in lithography throughput into production on TSMC manufacturing lines, and not just on the most advanced process nodes. It is widely believed this AI platform will transform the semiconductor manufacturing industry.
Conclusions
Anyone who has been wondering whether Nvidia might lose its competitive advantage should rest assured that the leader will continue to lead. With a newfound 4X training advantage, a 30X inference advantage, and the new NIM inference deployment model, Nvidia looks well situated to take on all the competitors and come out holding the bulk of its >80% market share.
But make no mistake, the competition has gone from only one viable alternative (Google TPU) to at least 8, adding AMD MI300, Intel Gaudi, Microsoft Maia, AWS chips, Meta MTIA, Cerebras, and Groq, with more in the wings ready to pounce. As these players bring their chips to market, the Nvidia software stack adds to the challenges they must overcome. And we don’t see anyone coming even close to Nvidia in software like NIMs, Omniverse, and AI Enterprise any time soon, although the availability of LLM models and OpenAI Triton across the field could blunt that advantage somewhat.