Demand for Real-time AI Inference from Groq® Accelerates Week Over Week – Morningstar

70,000 Developers in the Playground on GroqCloud and 19,000 New Applications Running on the LPU Inference Engine

MOUNTAIN VIEW, Calif., April 2, 2024 /PRNewswire/ -- Groq®, a generative AI solutions company, announced today that more than 70,000 new developers are using GroqCloud™ and more than 19,000 new applications are running on the LPU™ Inference Engine via the Groq API. The rapid migration to GroqCloud since its launch on March 1st indicates a clear demand for real-time inference as developers and companies seek lower latency and greater throughput for their generative and conversational AI applications.

“From AI influencers and startups to government agencies and large enterprises, the enthusiastic reception of GroqCloud from the developer community has been truly exciting,” said GroqCloud General Manager, Sunny Madra. “I’m not surprised by the unprecedented level of interest in GroqCloud. It’s clear that developers are hungry for low-latency AI inference capabilities, and we’re thrilled to see how it’s being used to bring innovative ideas to life. Every few hours, a new app is launched or updated that uses our API.”

The total addressable market (TAM) for AI chips is projected to reach $119.4B by 2027. Today, ~40% of AI chips are leveraged for inference, and that share alone would put the TAM for inference chips at ~$48B by 2027. Once applications reach maturity, they often allocate 90-95 percent of compute resources to inference, indicating a much larger market over time. The world is just beginning to explore the possibilities AI presents, and that percentage is likely to increase as more applications and products are brought to market, making ~40% an extremely conservative estimate. With nearly every industry and government worldwide looking to leverage generative and/or conversational AI, the TAM for AI chips, and for systems dedicated to inference in particular, appears to be limitless.
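The market sizing above can be checked with quick back-of-the-envelope arithmetic (the figures are the projections cited in this release, not independent data):

```python
# Back-of-the-envelope check of the inference TAM figures cited above.
total_tam_2027 = 119.4e9      # projected AI-chip TAM by 2027, in USD
inference_share_today = 0.40  # ~40% of AI chips used for inference today

inference_tam = total_tam_2027 * inference_share_today
print(f"Inference TAM at today's 40% share: ${inference_tam / 1e9:.1f}B")

# If mature applications shift to 90-95% inference, the same total implies:
for share in (0.90, 0.95):
    print(f"At {share:.0%} inference share: ${total_tam_2027 * share / 1e9:.1f}B")
```

At a 40% share this yields roughly $47.8B, consistent with the ~$48B figure in the text.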

“GPUs are great. They’re what got AI here today,” said Groq CEO and Founder, Jonathan Ross. “When customers ask me whether they should still buy GPUs I say, ‘Absolutely, if you’re doing training because they’re optimal for the 5-10% of the resources you’ll dedicate to training, but for the 90-95% of resources you’ll dedicate to inference, and where you need real-time speed and reasonable economics, let’s talk about LPUs.’ As the adage goes, ‘what got us here won’t get us there.’ Developers need low latency inference. The LPU is the enabler of that lower latency and that’s what’s driving them to GroqCloud.”

GPUs are great for training models, bulk batch processing, and running visualization-heavy workloads, while LPUs specialize in running real-time deployments of Large Language Models (LLMs) and other AI inference workloads that deliver actionable insights. The LPU fills a gap in the market by providing, via the Groq API, the real-time inference required to make generative AI a reality in a cost- and energy-efficient way.

Chip Design & Architecture Matter
Real-time AI inference is a specialized system problem. Both hardware and software play a role in speed and latency. No amount of software can overcome hardware bottlenecks created by chip design and architecture.

First, the Groq Compiler is fully deterministic and schedules every memory load, operation, and packet transmission exactly when needed. The LPU Inference Engine never has to wait for a cache that has yet to be filled, resend a packet because of a collision, or pause for memory to load – all of which plague traditional data centers using GPUs for inference. Instead, the Groq Compiler plans every single operation and transmission down to the cycle, ensuring the highest possible performance and fastest system response.
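The idea of fully static, cycle-exact scheduling can be illustrated with a toy model. The operation names and cycle counts below are invented for illustration only; they are not Groq's actual instruction set or latencies:

```python
# Toy illustration of static scheduling: every operation is assigned a start
# cycle at compile time, so execution never stalls on a cache miss or a
# retried packet. All names and latencies here are invented.
schedule = [
    # (start_cycle, operation, latency_in_cycles)
    (0, "load weights tile", 4),
    (4, "load activations", 2),
    (6, "matmul tile", 8),
    (14, "stream result to neighbor", 3),
]

cycle = 0
for start, op, latency in schedule:
    # A deterministic compiler guarantees no operation starts before its
    # inputs are ready, so there is never a runtime stall.
    assert start >= cycle, "schedule conflict"
    cycle = start + latency
    print(f"cycle {start:3d}-{cycle - 1:3d}: {op}")

print(f"total: {cycle} cycles, known exactly at compile time")
```

The key point is that total execution time is known before the program runs, which is what "plans every single operation and transmission down to the cycle" means in practice.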

Second, the LPU is based on a single-core deterministic architecture, making it faster for LLMs than GPUs by design. The Groq LPU Inference Engine relies on SRAM for memory, which is 100x faster than the HBM memory used by GPUs. Furthermore, HBM is dynamic and has to be refreshed a dozen or so times per second. While the impact on performance isn’t necessarily large compared to the slower memory speed, it does complicate program optimization.
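Memory speed matters for LLM inference because generating each token requires streaming essentially all of the model's weights through the compute units, so single-user decode speed is roughly bounded by memory bandwidth divided by model size. A rough sketch of that bound, using illustrative bandwidth and precision assumptions (not measured figures):

```python
# Rough, bandwidth-bound estimate of single-user decode speed: generating one
# token requires reading every weight once, so tokens/sec ~ bandwidth / bytes.
# Bandwidth and precision figures below are illustrative assumptions.
params = 70e9                             # Llama-2 70B parameters
bytes_per_param = 2                       # fp16
weight_bytes = params * bytes_per_param   # 140 GB of weights read per token

for name, bandwidth_gb_s in [("HBM-class memory (~3 TB/s assumed)", 3000),
                             ("hypothetical 10x faster memory", 30000)]:
    tokens_per_sec = (bandwidth_gb_s * 1e9) / weight_bytes
    print(f"{name}: ~{tokens_per_sec:.0f} tokens/s per user")
```

This is why a faster memory hierarchy translates directly into higher per-user token rates for inference.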

No CUDA Necessary
GPU architecture is complicated, making it difficult to program efficiently. Enter: CUDA. CUDA abstracts away the complex GPU architecture and makes the hardware practical to program. Each new model also requires highly tuned CUDA kernels to run fast, which in turn demands substantial validation and testing, creating more work and adding complexity to the software stack.

Conversely, the Groq LPU Inference Engine does not require CUDA or kernels – hand-optimized, low-level device programs – because of the Tensor Streaming architecture of the LPU. The LPU design is elegantly simple because the Groq Compiler maps operations directly to the LPU without any hand-tuning or experimentation. Furthermore, Groq quickly compiles models with high performance because it doesn't require the creation of custom kernels for new operations, a step that hamstrings GPUs when it comes to inference speed and latency.

Prioritizing AI’s Carbon Footprint Through Efficient Design
LLMs are estimated to grow in size by 10x every year, making AI output incredibly costly when using GPUs. While scaling up yields some economies, energy efficiency will continue to be an issue when working within the GPU architecture because data still needs to move back and forth between the chips and HBM for every single compute task. Constantly shuffling data burns energy, generates heat, and increases the need for cooling, which, in turn, requires even more energy.

Understanding that energy consumption and cooling costs play fundamental roles in compute cost, Groq designed the LPU as, in effect, an AI token factory, maximizing efficiency through an assembly-line approach that minimizes off-chip data flow. As a result, the current-generation LPU is 10x more energy-efficient than the most energy-efficient GPU available today. The Groq LPU Inference Engine is the only available solution that leverages an efficiently designed hardware and software system to satisfy the low-carbon-footprint requirements of today, while still delivering an unparalleled user experience and production rate.
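Energy efficiency for inference is commonly expressed as joules per token, i.e. power draw divided by throughput. A minimal sketch of that relationship, using purely hypothetical power and throughput numbers (the 10x ratio, not the absolute values, is the claim from the text):

```python
# Energy efficiency for inference, expressed as joules per token:
#   joules/token = power draw (watts) / throughput (tokens/second)
# Both systems below use hypothetical figures, for illustration only.
def joules_per_token(power_watts: float, tokens_per_sec: float) -> float:
    return power_watts / tokens_per_sec

baseline = joules_per_token(power_watts=700, tokens_per_sec=50)
efficient = joules_per_token(power_watts=700, tokens_per_sec=500)  # 10x throughput

print(f"baseline system:      {baseline:.1f} J/token")
print(f"10x-efficient system: {efficient:.1f} J/token")
```

At equal power draw, a 10x throughput advantage is exactly a 10x reduction in energy per token, which is why architecture-level efficiency compounds at data-center scale.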

What Supply Chain Challenges?
From day one, Groq has understood that a dependency on limited materials and a complex, global supply chain would increase risk, as well as hinder growth and revenue. Groq has side-stepped supply chain challenges by designing a chip that relies on neither 4-nanometer silicon nor HBM, which is in extremely limited supply, to deliver record-breaking speeds. In fact, the current-generation LPU is made with 14-nanometer silicon, and it consistently delivers 300 tokens per second per user when running Llama-2 70B. The LPU is the only AI chip designed, engineered, and manufactured entirely in North America.
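For a sense of what 300 tokens per second per user means in practice, a quick conversion into response latency (the response lengths below are illustrative; the throughput figure is from the text):

```python
# What 300 tokens/second per user means for response latency.
# Response lengths are illustrative; the throughput is the figure cited above.
tokens_per_second = 300

for response_tokens in (100, 500, 1000):
    seconds = response_tokens / tokens_per_second
    print(f"{response_tokens:4d}-token response: ~{seconds:.2f} s")
```

A typical chat-length answer of a few hundred tokens streams back in roughly one to two seconds, which is what the release means by real-time inference.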

About Groq
Groq® is a generative AI solutions company and the creator of the LPU™ Inference Engine, the fastest language processing accelerator on the market. It is architected from the ground up to achieve low latency, energy-efficient, and repeatable inference performance at scale. Customers rely on the LPU Inference Engine as an end-to-end solution for running Large Language Models and other generative AI applications at 10x the speed. Groq Systems powered by the LPU Inference Engine are available for purchase. Customers can also leverage the LPU Inference Engine for experimentation and production-ready applications via an API in GroqCloud™ by purchasing Tokens-as-a-Service. Jonathan Ross, inventor of the Google Tensor Processing Unit, founded Groq to preserve human agency while building the AI economy. Experience Groq speed for yourself at

Media Contact for Groq
Allyson Scott
[email protected]

