AI storage: Machine learning, deep learning and storage needs

Artificial intelligence workloads impact storage, with NVMe flash needed for GPU processing at the highest levels of performance – and there are other choices right through the AI data lifecycle


Facebook has just short of 2.4 billion active users and sees 350 million photo uploads a day, plus more than 500,000 comments posted every minute. How does it track, monitor and gain value from this amount of information?

“There are billions of users and no way for humans to scale to do the analytics,” says Chirag Dekate, a research director covering artificial intelligence (AI), machine learning and deep learning at Gartner.

So, Facebook uses learning systems and AI to scan posts. “No one can analyse every video or image, for banned speech or inflammatory material, or tag or for ad revenue generation,” says Dekate.

Social media sites are just one example of a growing number of applications of AI, which has moved from academic research into areas as diverse as medicine, law enforcement, insurance, and retailing.

Its growth has far-reaching implications for enterprise IT systems, including data storage.

AI is a broad term that covers a wide range of use cases and applications, as well as different ways of processing data. Machine learning, deep learning, and neural networks all have their own hardware and software requirements and use data in different ways.

“Machine learning is a subset of AI, and deep learning is a subset of machine learning,” says Mike Leone, senior analyst at ESG.

Deep learning, for example, will make several passes over a data set to reach a decision, and learns from its own predictions based on the data it reads.

Machine learning is simpler and relies on human-written algorithms and training with known data to develop the ability to make predictions. If the results are incorrect, data scientists will change the algorithms and retrain the model.

A machine learning application could draw on thousands of data points. A deep learning data set will be orders of magnitude larger, easily running to millions of data points.

“Deep learning acts similarly to a human brain in that it consists of multiple interconnected layers similar to neurons in a brain,” says Leone. “Based on the accuracy or inaccuracy of predictions, it can automatically re-learn or self-adjust how it learns from data.”

Storage for AI can vary

Data storage requirements for AI vary widely according to the application and the source material. “Depending on the use case, the data set varies quite dramatically,” says Dekate. “In imaging, it grows almost exponentially as files tend to be really, really huge.

“Any time you do image recognition or video recognition or neural systems, you are going to need new architecture and new capabilities. But in a use case like fraud detection, you can use an infrastructure stack without new hardware for incredible results.”

Medical, scientific and geological data, as well as imaging data sets used in intelligence and defence, frequently combine petabyte-scale storage volumes with individual file sizes in the gigabyte range.

By contrast, data sets used in areas such as supply chain analytics, or maintenance, repair and overhaul in aviation – two growing areas for AI – are much smaller.

According to Gartner’s Dekate, a point-of-sale data set, used for retail assortment prediction, typically runs to 100MB to 200MB, whereas a modern, sensor-equipped airliner will produce 50GB to 100GB of maintenance and operational data per flight.

CPUs, GPUs and I/O

The issue for AI systems is how quickly they need to process data. In the airline business, predictive maintenance data has to be analysed while the aircraft is on the ground, with turnaround times ranging from several hours for a long-haul flight to just minutes for a low-cost carrier.
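As a rough illustration of that time pressure, the sketch below works out the sustained read rate needed to get through one flight's data once within a turnaround window. The 100GB figure is the upper end of Dekate's per-flight estimate; the 25-minute and three-hour windows are assumptions chosen for illustration, not figures from any airline.

```python
def required_throughput_mb_s(data_gb: float, window_minutes: float) -> float:
    """Sustained read rate (MB/s) needed to scan data_gb once within window_minutes."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return data_gb * 1000 / (window_minutes * 60)


# Hypothetical turnaround windows: (data per flight in GB, minutes on the ground)
scenarios = {
    "low-cost carrier, 25-minute turnaround": (100, 25),
    "long-haul, 3-hour turnaround": (100, 180),
}

for label, (data_gb, minutes) in scenarios.items():
    rate = required_throughput_mb_s(data_gb, minutes)
    print(f"{label}: {rate:.1f} MB/s")
```

This covers a single sequential pass only. Analysis that makes several passes over the data, or reads it in random order, would need proportionally more headroom.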

A facial or number plate recognition system, meanwhile, needs an answer in moments, while an automated insurance claim system needs one in minutes.

This has prompted AI developers to build GPU-intensive clusters, which are the most effective way to process the data and run complex algorithms at speed. But these GPU clusters – often based on Nvidia DGX hardware – are expensive and available only in small numbers.

As Alastair McAulay, an IT expert at PA Consulting, points out, academic and industrial high-performance computing (HPC) systems are typically run at very high utilisation rates because of their scarcity and cost.

Research institutes employ specialists to squeeze the last drop of performance from the hardware. In the enterprise, integration with existing data systems can be more important.

NVMe the medium of choice

“We see judicious application of solid-state storage bringing a massive benefit,” says McAulay. “But it is more about what file system to use, how that is optimised, and whether any accelerators are needed to get the most from [off-the-shelf] storage hardware. They are putting most effort into file systems and managing data.”

Flash storage is commonplace now, while NVMe flash is emerging as the medium of choice for applications that require the fastest access for data stored near the GPU. Spinning disk is still there too, but is increasingly being relegated to bulk storage on lower tiers.

Josh Goldenhar, vice-president at NVMe-focused storage supplier Excelero, says a system’s PCIe bus and the limited storage capacity within GPU-dense servers can be a greater limitation than the speed of storage itself.

A common misconception, however, is that AI systems need storage with high IOPS performance, when in fact it is the ability to deal with randomised I/O that is important.

“If you analyse deep learning, it is more random-read intensive while the output is negligible – it can be kilobytes,” says Gartner’s Dekate. “It is not high IOPS that is needed necessarily, but architecture that is random read-optimised.”
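The access pattern Dekate describes can be sketched in a few lines. The example below is a minimal illustration, not a benchmark: it writes a small file of fixed-size records, then reads every record once in shuffled order, the way a training epoch typically visits its samples. The record size, record count and file name are all assumptions made for the demonstration.

```python
import os
import random
import tempfile

RECORD_SIZE = 4096   # assumed fixed record size in bytes
NUM_RECORDS = 256    # assumed data set size, kept tiny for the demo


def write_dataset(path: str) -> None:
    """Write NUM_RECORDS fixed-size records, each filled with its own index byte."""
    with open(path, "wb") as f:
        for i in range(NUM_RECORDS):
            f.write(bytes([i]) * RECORD_SIZE)


def read_epoch(path: str, seed: int) -> list:
    """Read every record once in shuffled order and return the order observed."""
    order = list(range(NUM_RECORDS))
    random.Random(seed).shuffle(order)
    seen = []
    with open(path, "rb") as f:
        for idx in order:
            f.seek(idx * RECORD_SIZE)    # jump to a non-contiguous offset
            record = f.read(RECORD_SIZE)
            seen.append(record[0])       # first byte identifies the record
    return seen


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "samples.bin")
    write_dataset(data_path)
    epoch = read_epoch(data_path, seed=0)

bytes_read = NUM_RECORDS * RECORD_SIZE
print(f"read {bytes_read} bytes in {len(epoch)} random-offset reads")
```

Every read lands at a different, unpredictable offset, and nothing of any size is written back. That is why storage tuned for random reads matters more here than a headline IOPS figure measured on mixed or sequential workloads.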

AI phases and I/O needs

The storage and I/O requirements of AI are not the same throughout its lifecycle.

Conventional AI systems need training, and during that phase they will be more I/O-intensive, which is where they can make use of flash and NVMe. The “inference” stage will rely more on compute resources, however.

Deep learning systems, with their ability to retrain themselves as they work, need constant access to data.

“When some organisations talk about storage for machine learning/deep learning, they often just mean the training of models, which requires very high bandwidth to keep the GPUs busy,” says Doug O'Flaherty, a director at IBM Storage.

“However, the real productivity gains for a data science team are in managing the entire AI data pipeline from ingest to inference.”

The outputs of an AI program, for their part, are often small enough that they are no issue for modern enterprise IT systems. This suggests that AI systems need tiers of storage and, in that respect, they are not dissimilar to traditional business analytics or even enterprise resource planning (ERP) and database systems.


Justin Price, AI lead and chief data scientist at Logicalis UK, says an on-premise system needs at least the performance of SSD storage to deliver commercial value. But AI systems need bulk storage too, and this points to spinning disk as well as use of the cloud and even tape.

“Every node can be different, and you can use a mixed environment,” says Chris Cummings, chief marketing officer at software-defined storage maker Datera. “The key is to be flexible and match the requirements of the different applications.

“If the information is ‘hot’, you have to cache it to NVMe, but you might copy it out to flash.”
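One way to picture the mixed environment described here is as a simple placement rule that maps each data set to a tier according to how often it is read. The sketch below is hypothetical throughout: the thresholds, the data set names and the tier labels are assumptions for illustration, not anything Datera or Logicalis has described.

```python
from dataclasses import dataclass

# Hypothetical thresholds: reads per day that separate the tiers.
HOT_THRESHOLD = 100
WARM_THRESHOLD = 1


@dataclass
class DataSet:
    name: str
    reads_per_day: float


def choose_tier(ds: DataSet) -> str:
    """Map a data set to a storage tier by how often it is read."""
    if ds.reads_per_day >= HOT_THRESHOLD:
        return "nvme"    # hot data, kept close to the GPUs
    if ds.reads_per_day >= WARM_THRESHOLD:
        return "flash"   # warm data, still on solid state
    return "bulk"        # cold data: spinning disk, cloud or tape


# Made-up catalogue entries for the demonstration.
catalogue = [
    DataSet("current-training-set", 5000),
    DataSet("last-quarter-features", 12),
    DataSet("raw-archive", 0.01),
]

placement = {ds.name: choose_tier(ds) for ds in catalogue}
print(placement)
```

A real policy would weigh more than read frequency, such as file size, cost per terabyte and how quickly data must be recalled, but the principle of matching tier to application requirement is the same.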

Cloud storage is also an attractive option for enterprises with large volumes of data. Using it for AI is feasible, says Yinglian Xie, CEO of analytics company Datavisor, but it means moving AI engines to where the data is. Currently, cloud-based AI is limited to applications that do not rely on the latest generation of GPUs.

“Storage depends on the specific use case and algorithm,” says Xie. “For some applications, such as deep learning, it is compute-intensive. For that, we see customers use GPU-intensive architecture. On the other hand, for storage-intensive applications, it is better to bring computation to where the data resides.”

So, less GPU-intensive applications are potential candidates for the cloud. Google, for example, has developed AI-specific chips to work with its infrastructure. But, as IBM’s O'Flaherty cautions, technical and financial constraints mean that for now the cloud is more likely to support AI than to be at its core.
