Why CPUs also make sense for AI inference - interview with Ampere Computing's Jeff Wittich
James Martin • 13/11/23 • 4 min read
As CPO of US-based chipmaker Ampere Computing, Jeff Wittich has an important message for IT executives: artificial intelligence inference doesn’t necessarily need supercomputers, or GPUs. In many cases, he claims, CPUs are not only good enough, they’re even ideal. Why? Because they can offer right-sized compute power with minimal energy consumption, thereby limiting AI’s impact on the planet and on cloud budgets. We spoke to Wittich ahead of his keynote at ai-PULSE on November 17…
How does Ampere want to be considered by cloud providers today when it comes to AI?
Jeff Wittich: Ampere’s mission from day one has been to deliver sustainable computing for modern performance environments like the cloud. That extends to AI too. Cloud service providers (CSPs) should consider Ampere for all needs in the cloud, including when looking to build AI workload capabilities.
We know one of CSPs’ biggest challenges is power consumption. Using more power is costly, plus power is scarce, and you can’t expand your data center infinitely. This means we need to deliver more efficient systems over time, to provide more compute capacity without consuming more power.
AI inference has really brought this to the forefront, as demand for it has increased rapidly, making that power challenge even more difficult to solve. We have a solution that tackles that.
Often when we talk about AI, we forget that AI training and inference are two different tasks.
Training [or teaching the AI model with large quantities of data] is a one-off, gigantic task that takes a long time; and for that one time, you might be OK to use the considerable amounts of power required by GPUs and supercomputers.
Inference [or using the trained AI model on a regular basis] is different, as it can be millions of tasks running every second. Inference is your “scale” model, that you’re running all the time, so efficiency is more important here.
So whereas accelerators can make a lot of sense for training, inference workloads don’t need to run on supercomputing hardware.
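The training-versus-inference argument above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the function names and all three input figures are hypothetical placeholders, not numbers from Ampere or from the interview. It estimates how many days of continuous serving it takes for cumulative inference energy to match a one-off training run.

```python
SECONDS_PER_DAY = 86_400


def inference_kwh_per_day(wh_per_request: float, requests_per_second: float) -> float:
    """Energy used in one day by an inference service that runs continuously (kWh)."""
    return wh_per_request / 1000.0 * requests_per_second * SECONDS_PER_DAY


def breakeven_days(training_kwh: float, wh_per_request: float,
                   requests_per_second: float) -> float:
    """Days of serving after which inference has used as much energy as training."""
    return training_kwh / inference_kwh_per_day(wh_per_request, requests_per_second)


# All three inputs are made-up placeholders, not measurements.
TRAINING_KWH = 100_000.0      # hypothetical one-off training cost
WH_PER_REQUEST = 1.0          # hypothetical energy per inference request
REQUESTS_PER_SECOND = 100.0   # hypothetical sustained load

days = breakeven_days(TRAINING_KWH, WH_PER_REQUEST, REQUESTS_PER_SECOND)
print(f"Inference matches training energy after {days:.1f} days")

# Halving the energy per request doubles the time it takes to reach that point.
days_efficient = breakeven_days(TRAINING_KWH, WH_PER_REQUEST / 2, REQUESTS_PER_SECOND)
print(f"With half the energy per request: {days_efficient:.1f} days")
```

With these placeholder inputs the always-on service overtakes the training run in under two weeks, which is why per-request efficiency dominates the long-run energy bill in this kind of model.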
In fact, general-purpose CPUs are good at inference, and they always have been. Our CPUs are especially well-suited to the task because they are high-performance and balanced. Plus you need predictable latency in these cases, and to keep processing close to the core, not have it bouncing around all over the place. Having a lot of cores is useful too, as is flexibility. It may be that AI inference isn’t 100% of what you’re asking a CPU to do. If it can do other things at the same time, you get higher overall utilization.
How can CPUs be enough for inference, when the current trend is “throwing more expensive, power-hungry, and narrowly specialized hardware at AI”*?
JW: AI needs today cover a whole spectrum. What are your project’s compute requirements? Do you need to be inferencing all the time? What about memory bandwidth? For the vast majority of that spectrum, CPUs will be the right-sized solution. Some inference needs may have a particularly high memory footprint, and therefore need a GPU.
But I think we’ll see a shift in time to smaller, more versatile solutions. It’s like I could have come to work in a Ferrari today, when what I actually need is a more economical electric vehicle that’ll get me here in the same time.
We’re still in the hype and research phase for AI, due to the euphoria around these massive large language models (LLMs), where the instinct is to throw the most possible power at a problem and see what happens. But at some point, these use cases will mature, and efficiency and sustainability will win out.
Not everyone will be able to pay for a solution like ChatGPT, which features all of human knowledge. We’ll see more specialization of models, as well as refinement of existing models. Overall, models will become smaller, and more focused on specific tasks.
*A quote from Ampere's recent white paper.
What are the most interesting inference use cases for Ampere chips today?
JW: We’re already seeing some great examples. They range from real-time voice-to-text translation in any language, which makes meetings with colleagues in other countries easier and improves accessibility for people with hearing impairments, to generative AI use cases such as artwork, videos, or simplifying everyday routine tasks. These cases all work well with our CPUs.
More specifically, Matoha uses Ampere CPUs to power its near-infra-red spectroscopy. This allows them to scan a 30-year-old landfill for waste no one back then thought of recycling. They can scan a bottle, figure out what type of plastic it is, and send it to the right recycling location. And it works with other materials too, like fabrics.
We also have Red Bull Racing, the highly successful Formula One team, which uses our processors for pre- and in-race day analysis, to optimize their racing strategies. They have a limited amount of time to run these analyses, using complex models based on past race data. Our CPUs allow them to process a lot of data in a very short time, so they can change strategies in real time, for example, if the weather changes.
How exactly do Ampere CPUs transfer training data from Nvidia GPUs, for inference?
JW: It’s a common misperception that you need to run training and inference on the same models. It’s actually very easy to take one framework and run it on another piece of hardware. It’s particularly easy when you use [AI frameworks like] PyTorch and TensorFlow; the models are extremely portable.
We have a whole AI team at Ampere, which has developed software called AI-O, that allows us to have compatibility across all AI frameworks. So there’s no need to adapt data models at all. Just take a model trained with any GPU, put it on an Ampere CPU and it’ll run great. AI-O does some optimization on the data and processing sides, but you don’t need to use it unless you really want to improve performance. Otherwise, no need for quantization or anything like that. People think (transferring from training GPUs to inference CPUs) is incredibly complicated, but it’s not!
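As a rough illustration of the portability described above, here is a minimal sketch using stock PyTorch only. It does not use AI-O, whose interface is not described in this interview. The toy network and the file name are hypothetical stand-ins for a real trained model; the point is that weights saved on the training device can be mapped onto a CPU at load time without altering the model.

```python
import os
import tempfile

import torch
from torch import nn


def build_model() -> nn.Module:
    # Hypothetical toy network standing in for a real trained model.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


# "Training" side: use a GPU if one is present, otherwise fall back to CPU.
train_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
trained = build_model().to(train_device)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pt")  # hypothetical file name
    torch.save(trained.state_dict(), path)

    # Inference side: map every saved tensor onto the CPU while loading.
    cpu_model = build_model()
    state = torch.load(path, map_location="cpu")
    cpu_model.load_state_dict(state)

# Switch to evaluation mode and run a forward pass without gradient tracking.
cpu_model.eval()
with torch.inference_mode():
    batch = torch.randn(3, 4)
    logits = cpu_model(batch)

print(logits.shape)
```

The only CPU-specific step is `map_location="cpu"`; the weights are copied as they are, with no quantization or retraining.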
Can data models be adapted to get maximum performance from Ampere CPUs?
JW: Yes, just use the software library we have (AI-O): it’s sophisticated, it gets better results, and it makes sure the way the code is compiled is well-suited to…