Forget training, find your killer apps during AI inference

Most organizations will never train their own artificial intelligence (AI) models. Instead, most AI customers' primary focus is on applying AI to production applications and inference, with fine-tuning and data processing as the key challenges.

The keys here are retrieval-augmented generation (RAG) and vector databases, the ability to reuse AI prompts, and co-pilot capabilities that allow users to query corporate information in natural language.

That's the view of Pure Storage executives who spoke to Computer Weekly this week at the company's Accelerate event in London.

Naturally, the key objectives they identified fit well with the functionality recently added to Pure's storage offering, including the newly released Key Value Accelerator, as well as its ability to provide capacity on demand.

But they also illustrate key challenges for organizations pursuing AI at this stage of its maturity, called the “post-training stage.”

In this article, we look at what customers need from storage for AI during the production stages, when data is constantly being ingested and inference constantly being run.

Don't buy GPUs; they change too quickly

Most organizations won't train their own AI models because it is currently too expensive. That's because graphics processing unit (GPU) hardware is incredibly costly, and because it evolves at such a fast pace that obsolescence sets in very quickly.

Most organizations now tend to buy GPU power from the cloud for the training phases.

According to Pure Storage founder and chief visionary John Colgrove, there's no point in trying to build your own artificial intelligence training farms when the GPU hardware could be obsolete in a generation or two.

“Most organizations say, 'Oh, I want to buy this equipment, I'll get five years of use out of it and depreciate it over five or seven years,'” he said. “But right now you can’t do that with GPUs.

“I think when things improve at a fantastic rate, you're better off renting rather than buying,” Colgrove said. “It's like buying a car. If you're going to keep it for six, seven, eight or more years, you buy it, but if you're going to keep it for two years and replace it with a newer one, you're leasing it.”

Find your killer AI app

For most organizations, the practical use of AI will not come in the model training phase. Instead, it will come where they can use AI to build a killer application for their business.

Colgrove gives the example of a bank. “For the bank, we know that a killer app is going to be something that customers experience,” he said. “But how does AI work right now? I take all my data from whatever databases I have to interact with the client. I suck it into some other system. I transform it like the old batch ETL process, spend weeks training, and then get the result.

“It’s never going to be a killer app,” Colgrove said. “The killer application will involve some inferences I can draw. But those inferences will have to be applied to conventional systems if they are customer-facing.

“What this means is that when you actually apply AI, to get value from it, you'll want to apply it to the data you already have and what you're already doing with your customers.”

In other words, for most customers, the AI challenges lie in production: specifically, the ability to quickly process and add data, draw inferences from it, and fine-tune existing AI models. And then to be able to do it all again when the next idea comes along to make things even better.

Pure Storage EMEA chief technology officer Fred Lherault summed it up this way: “It's really about, 'How do I connect models to my data?' Which, first of all, means: 'Have I reached the right level of defining what my data is, processing it, preparing it for AI, and placing it in an architecture where it can be accessed by a model?'”

Key technology foundations for flexible AI

The inference phase has become the key focus for most AI customers. The challenge here is to be able to process and manage data to build and iterate on AI models throughout their lifecycle. That means customers must be able to connect flexibly to their own data.

This means leveraging technologies including vector databases, RAG pipelines, co-pilot capabilities, and the caching and reuse of prompts and inference results.
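To make the pattern concrete, here is a minimal sketch of how a RAG lookup typically fits together, in Python. The embed() function is a placeholder standing in for a real embedding model, and the small in-memory index stands in for a vector database; none of this is specific to Pure's products.

```python
# Minimal RAG lookup sketch: embed documents, retrieve by similarity,
# and assemble an augmented prompt for the language model.
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Placeholder embedding; a real pipeline would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Corporate documents are chunked, embedded, and stored alongside their vectors.
documents = ["Q3 revenue grew 12% year on year.",
             "The refund policy allows returns within 30 days."]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    q = embed(question)
    scored = sorted(index, key=lambda item: float(np.dot(q, item[1])), reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# The retrieved chunks are prepended to the prompt before it goes to the LLM.
question = "What was revenue growth in Q3?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```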

The key storage issues are twofold. It means being able to connect to, for example, RAG data sources and vector databases. It also means being able to handle surges in storage demand and to mitigate them where possible. The two are often related.

“An interesting thing happens when you put data into vector databases,” Lherault said. “Some computation is required, but then the data is augmented with vectors that can then be searched. That's the whole purpose of a vector database, and this kind of augmentation can sometimes result in 10 times the data.”

“If you have a terabyte of raw data that you want to use with an AI model, that can mean you need a 10TB database to run it,” he said. “All of these processes are new to many organizations when they want to use their data with AI models.”
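As a rough illustration of where that multiplier can come from, the back-of-envelope calculation below uses assumed chunk sizes, embedding dimensions and index overheads (not figures from Pure), and it lands in the same order of magnitude Lherault describes.

```python
# Back-of-envelope estimate of vector-database expansion, using assumed
# chunk sizes, embedding dimensions and index overheads.
raw_bytes = 1e12                      # 1 TB of raw text
chunk_chars = 1_000                   # ~1,000 characters per chunk (assumed)
chunks = raw_bytes / chunk_chars      # roughly 1 billion chunks

dim = 1536                            # embedding dimension (assumed)
bytes_per_float = 4                   # float32 per vector element
vector_bytes = chunks * dim * bytes_per_float

metadata_bytes = chunks * 500         # per-chunk IDs and metadata (assumed)
index_overhead = 1.5                  # index structure overhead factor (assumed)

total = raw_bytes + (vector_bytes + metadata_bytes) * index_overhead
print(f"vectors alone: {vector_bytes / 1e12:.1f} TB")
print(f"total footprint: {total / 1e12:.1f} TB (~{total / raw_bytes:.0f}x raw)")
```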

Meeting storage capacity requirements

These capacity surges can also occur during tasks such as checkpointing, where massive amounts of data are saved as snapshot-like points that allow rollback during AI processing.
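The sketch below shows the general checkpoint-and-rollback pattern, with small pickle files standing in for the very large snapshots a real AI job would write to shared storage; the directory layout and state contents are illustrative assumptions, not a specific product workflow.

```python
# Minimal checkpoint/rollback sketch for a long-running AI job.
import pickle
from pathlib import Path

CKPT_DIR = Path("checkpoints")         # assumed mount point on shared storage
CKPT_DIR.mkdir(exist_ok=True)

def save_checkpoint(step: int, state: dict) -> Path:
    path = CKPT_DIR / f"step_{step:08d}.pkl"
    with path.open("wb") as f:
        pickle.dump(state, f)          # real jobs write gigabytes or terabytes here
    return path

def load_latest() -> dict | None:
    files = sorted(CKPT_DIR.glob("step_*.pkl"))
    if not files:
        return None
    with files[-1].open("rb") as f:
        return pickle.load(f)          # roll back to the most recent good state

state = load_latest() or {"step": 0, "weights": [0.0] * 4}
for step in range(state["step"] + 1, state["step"] + 4):
    state = {"step": step, "weights": [w + 0.1 for w in state["weights"]]}
    save_checkpoint(step, state)       # these bursts are the capacity surges described above
```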

Pure aims to solve these problems with its Evergreen-as-a-service model, which allows customers to quickly expand capacity.

The company also offers ways to stop storage requirements growing too quickly, as well as to improve performance.

The newly introduced Key Value Accelerator allows customers to save AI prompts so the work done on them can be reused. Typically, a large language model accesses cached tokens that represent previous responses, but GPU cache is limited, so responses often have to be recalculated. Pure's KV Accelerator allows those tokens to be stored externally in file or object format.

This can speed up response times by up to 20 times, Lherault says. “The more users ask different questions, the faster you run out of cache,” he added. “If two users ask the same question at the same time and do it on two GPUs, they will both have to perform the same calculations. It's not very efficient.

“We allow those pre-computed key values to be stored in our storage, so the next time someone asks a question that has already been asked, or asks something that produces the same tokens we already have, the GPU won't have to do the calculations,” Lherault said.

“This helps reduce the number of GPUs you need, and for some complex questions that generate thousands of tokens, we have sometimes seen the answer come back up to 20 times faster.”
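The general idea of persisting pre-computed key-value results so they can be reused across requests can be sketched as follows. This is an illustration of the pattern only, assuming a simple hash-keyed file cache and a placeholder prefill function; it is not Pure's KV Accelerator implementation.

```python
# Sketch of reusing pre-computed key-value results across requests,
# keyed by a hash of the prompt and backed by file storage.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("kv_cache")           # stands in for external file/object storage
CACHE_DIR.mkdir(exist_ok=True)

def expensive_prefill(prompt: str) -> list[float]:
    # Placeholder for the GPU prefill pass that builds the KV cache for a prompt.
    return [float(ord(c)) for c in prompt]

def kv_for(prompt: str) -> list[float]:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():                  # cache hit: skip the GPU recomputation
        return pickle.loads(path.read_bytes())
    kv = expensive_prefill(prompt)     # cache miss: compute once, persist for next time
    path.write_bytes(pickle.dumps(kv))
    return kv

# Two users asking the same question reuse one prefill instead of computing twice.
kv_for("What is our travel expense policy?")
kv_for("What is our travel expense policy?")
```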