
Decision points in storage for artificial intelligence, machine learning and big data

Artificial intelligence and machine learning storage is not one-size-fits-all. Analytics workloads differ, with varied storage requirements for capacity, latency, throughput and IOPS. We look at the key decision points.

Data analytics has rarely been more newsworthy. Throughout the Covid-19 coronavirus pandemic, governments and bodies such as the World Health Organization (WHO) have produced a stream of statistics and mathematical models.

Businesses have run models to test post-lockdown scenarios, planners have looked at traffic flows and public transport journeys, and firms use artificial intelligence (AI) to reduce the workload for hard-pressed customer services teams and to handle record demand for e-commerce.

All that places more demand on storage.

Even before Covid-19, industry analysts at Gartner pointed out that expansion of digital business would “result in the unprecedented growth of unstructured data within the enterprise in the next few years”.

Advanced analytics needs powerful computing to turn data into insights. Machine learning (ML) and AI take this to another level because such systems need rich datasets for training and rapid access to new data for operations. These datasets can run to multiple petabytes.

Sure, all data-rich applications put pressure on storage systems, but the demands can differ.

“Data-intensive applications have multiple storage architectures. It is all about the specific KPI [key performance indicator] of the specific workload,” says Julia Palmer, research vice-president at Gartner. “Some of those workloads require lower latency and some of them require higher throughput.”

AI, ML and big data: Storage demands

All big data and AI projects need to mix performance, capacity and economy. But that mix will vary, depending on the application and where it is in its lifecycle.

Projects based on unstructured data, especially images and video, involve large individual files.

AI applications in areas such as surveillance and facial recognition, and in geological, scientific and medical research, also use large files and so need petabyte-scale storage.

Applications based on business systems data, such as sales or enterprise resource planning (ERP), might only need a few hundred megabytes to be effective.

Sensor-based applications that include maintenance, repair and overhaul technologies in transport and power generation could run to the low hundreds of gigabytes.

Meanwhile, applications based on compute-intensive machine learning and dense neural networks need high throughput and low latency, says Gartner’s Palmer. But they also need access to scalable, low-cost storage for potentially large volumes of data.

AI and ML applications have distinct cycles of storage demand, too. The learning or training phase is the most data-intensive, with more data making for a better model, and storage needs to keep up with the compute engines that run the algorithm. Model training needs high throughput and low latency.
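As a rough illustration of what keeping up with the compute engines means in practice, the Python sketch below prefetches the next shard of training data from storage while the current one is being processed. It is a generic, self-contained example rather than anything specific to the vendors quoted here; the shard files, sizes and dummy training step are all assumptions.

```python
# Minimal sketch: overlap shard reads with compute so storage throughput,
# not the training loop, sets the pace. Shard size and count are illustrative.
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

SHARD_SIZE = 16 * 1024 * 1024  # 16 MiB per synthetic shard (assumption)

def make_shards(directory, count=4):
    """Create small synthetic training shards so the sketch is self-contained."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"shard_{i:03d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(SHARD_SIZE))
        paths.append(path)
    return paths

def read_shard(path):
    """Sequential read of one shard -- throughput-bound rather than IOPS-bound."""
    with open(path, "rb") as f:
        return f.read()

def train_step(data):
    """Stand-in for the compute phase; a real job would run on GPUs here."""
    time.sleep(0.05)
    return len(data)

with tempfile.TemporaryDirectory() as tmp:
    shards = make_shards(tmp)
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_read = pool.submit(read_shard, shards[0])
        for i in range(len(shards)):
            data = next_read.result()  # blocks only if storage lags compute
            if i + 1 < len(shards):
                next_read = pool.submit(read_shard, shards[i + 1])  # prefetch
            train_step(data)
    print("processed", len(shards), "shards")
```

If the storage tier cannot deliver the next shard before the training step finishes, the `result()` call stalls and expensive compute sits idle, which is exactly the imbalance high-throughput storage is meant to avoid.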

IOPS is not the only measure

Once the system is trained, requirements can be modest because the model only needs to examine relevant data.

Here, latency becomes more important than throughput. But this presents a challenge for IT departments because conventional storage solutions usually struggle to perform well for both sequential and random input/output (I/O).
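The difference between the two access patterns is easy to see with a simple measurement. The sketch below times a large sequential read against small random reads on the same file. It is purely illustrative, with arbitrary file and block sizes; a real benchmark would use a purpose-built tool such as fio and bypass the operating system's cache.

```python
# Minimal sketch of measuring sequential versus random reads on one file,
# the two access patterns that are hard to serve equally well.
# File size, block size and read counts are arbitrary assumptions, and the
# OS page cache will flatter the random case here.
import os
import random
import tempfile
import time

FILE_SIZE = 64 * 1024 * 1024   # 64 MiB test file
BLOCK = 4096                   # 4 KiB blocks, a typical random-I/O unit

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(FILE_SIZE))
    path = f.name

try:
    # Sequential: stream the file front to back in large chunks (throughput).
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):
            pass
    seq_s = time.perf_counter() - start

    # Random: seek to scattered offsets and read small blocks (IOPS/latency).
    offsets = [random.randrange(0, FILE_SIZE - BLOCK) for _ in range(5000)]
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    rand_s = time.perf_counter() - start

    print(f"sequential: {FILE_SIZE / seq_s / 1e6:.0f} MB/s")
    print(f"random 4K:  {len(offsets) / rand_s:.0f} reads/s")
finally:
    os.remove(path)
```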

For data analytics, typical batch-based workflows need to maximise the use of computing resources to speed up processing.

As a result, big data and analytics projects work well with distributed data, notes Ronan McCurtin, vice-president for northern Europe at Acronis.

“It is better to have distributed storage for data analytics and, for example, apply Hadoop or Spark technologies for big data analysis. In this case, the analyst can solve issues with memory limitations and run tasks on many machines. AI/ML training/inference requires fast SSD storage.”
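For a sense of what that looks like in practice, the sketch below shows a generic distributed aggregation in PySpark of the kind McCurtin describes. The input path, column names and output location are hypothetical, and it assumes a Spark cluster or local installation is available.

```python
# Minimal sketch of a distributed aggregation over data held in distributed
# storage; the dataset path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-rollup").getOrCreate()

# Read a (hypothetical) partitioned Parquet dataset; the work is split
# across however many executors the cluster provides.
readings = spark.read.parquet("hdfs:///data/sensor_readings/")

daily = (
    readings
    .groupBy("device_id", F.to_date("event_time").alias("day"))
    .agg(F.avg("value").alias("avg_value"),
         F.count("*").alias("samples"))
)

daily.write.mode("overwrite").parquet("hdfs:///data/rollups/daily/")
spark.stop()
```

The point is less the specific query than that storage and compute scale out together: each executor reads its own partitions from the distributed store rather than funnelling everything through one machine's memory.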

But solid-state technology is typically too expensive for large volumes of data and long-term retention, while the need to replicate volumes for distributed processing adds further cost.

As Stephen Gilderdale, a senior director at Dell Technologies, points out, organisations have moved on from a primary focus on ERP and customer relationship management (CRM) to heavier use of unstructured data.

And analytics has moved on too. It is no longer simply a study of historical data, “looking backwards to move forwards”. Instead, predictive and real-time analytics, including analysis of sensor data, are growing in importance.

Here, data volumes are lower, but the system will need to process the data very quickly to deliver insights back to the business. System designers need to ensure the network is not the bottleneck. This is prompting architects to look at edge processing, often combined with centralised cloud storage and compute.
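As a hedged illustration of that edge-plus-centre pattern, the Python sketch below aggregates raw sensor readings locally and ships only compact summaries onwards, so the network carries kilobytes rather than the full stream. The sensor feed and the upload step are simulated placeholders.

```python
# Minimal sketch of edge-side aggregation: summarise raw readings locally
# and send only small rollups towards central/cloud storage. The sensor
# feed and the upload call are placeholders, not a real integration.
import json
import random
import statistics
import time

def read_sensor():
    """Placeholder for a real sensor read (assumption)."""
    return random.gauss(50.0, 5.0)

def upload(summary):
    """Placeholder for a push to central or cloud storage (assumption)."""
    print("uploading:", json.dumps(summary))

SAMPLES_PER_WINDOW = 100    # raw readings collapsed into one summary

for _ in range(3):          # three windows, just to show the loop
    samples = [read_sensor() for _ in range(SAMPLES_PER_WINDOW)]
    upload({
        "window_end": time.time(),
        "count": len(samples),
        "mean": round(statistics.mean(samples), 2),
        "max": round(max(samples), 2),
    })
```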

AI/ML storage approaches and limitations

To meet the requirements imposed by AI/ML, IT managers need to pick and mix from the following types of storage:

  • High performance – NVMe and flash.
  • High capacity – performance spinning disk, perhaps combined with flash/advanced caching.
  • Offline and cold storage – capacity-optimised disk, cloud storage, tape.

Analytics and AI/ML are natural candidates for tiered storage, as this allows system designers to put the most expensive, best-performing resources as close as possible to compute, while still using large-capacity storage for archive data.
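The sketch below shows what such a tiering decision might look like in code. It is a hypothetical policy rather than a product feature; the tier names, thresholds and datasets are illustrative assumptions mapped onto the three broad tiers listed above.

```python
# Minimal sketch of a tiering policy across the three broad storage tiers
# listed above; thresholds, tier names and datasets are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataSet:
    name: str
    last_accessed: datetime
    in_active_training: bool

def choose_tier(ds: DataSet, now: datetime) -> str:
    """Map a dataset to a storage tier by how 'hot' it currently is."""
    if ds.in_active_training:
        return "nvme_flash"          # keep training data next to compute
    if now - ds.last_accessed < timedelta(days=30):
        return "capacity_disk"       # warm data, still queried occasionally
    return "cold_archive"            # cloud object storage or tape

now = datetime.now()
datasets = [
    DataSet("image_training_set", now - timedelta(days=2), True),
    DataSet("last_quarter_sales", now - timedelta(days=10), False),
    DataSet("2018_sensor_logs", now - timedelta(days=900), False),
]
for ds in datasets:
    print(ds.name, "->", choose_tier(ds, now))
```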

Architectures will also depend on the type of data a system handles. Gartner, for example, suggests that AI/ML using unstructured data could use NVMe-over-fabrics, persistent memory and distributed file systems, most likely on-premise or in a hybrid cloud architecture.

Data analytics projects, meanwhile, are more likely to use converged file and object storage and hybrid models, so they can scale but also take advantage of the economies of long-term cloud storage. Analytics projects might process a few hours’ or several years’ worth of data, depending on the business questions, so being able to reload older data quickly and economically has its own value.
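As one illustration of reloading archived data from a cloud cold tier, the sketch below asks Amazon S3 to stage an object out of a Glacier storage class and then checks its restore status. The bucket and key are hypothetical, and the example assumes the boto3 library and valid AWS credentials.

```python
# Minimal sketch of pulling archived data back from a cloud cold tier: an
# S3 object in a Glacier storage class. Bucket and key are hypothetical.
import boto3

s3 = boto3.client("s3")

# Ask S3 to stage the archived object into a readable state for two days.
# 'Bulk' is the slowest, cheapest retrieval tier; 'Expedited' costs more.
s3.restore_object(
    Bucket="analytics-archive",                      # hypothetical bucket
    Key="2019/clickstream/part-0000.parquet",        # hypothetical key
    RestoreRequest={
        "Days": 2,
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)

# The restore is asynchronous; poll the object's metadata to see progress.
head = s3.head_object(Bucket="analytics-archive",
                      Key="2019/clickstream/part-0000.parquet")
print(head.get("Restore", "restore not yet reported"))
```

The retrieval tier chosen is a direct cost-versus-speed trade-off, which is why planning how often older data will be reloaded matters as much as where it is parked.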

Real-time analytics needs data sources, compute and storage to be closely coupled. This is prompting organisations to use the cloud-based hyperscalers – primarily Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform – for tiers of compute and storage performance, as well as multiple physical locations.

There is no universal technology solution, however, and a degree of compromise is inevitable. “AI workloads are diverse and some are fundamentally different from any other workload the organisation may have run in the past,” says Palmer.

Analytics and AI: Build or buy?

Larger AI and business intelligence (BI) projects will need significant investment in storage, compute and networking. That has prompted some businesses to look to the cloud, and others to buy in analytics “as a service”.

But for most, venturing into data-rich applications will be a blend of existing and new capabilities.

“Buying technology is easy, but AI, ML and analytics rarely arrive or operate in perfect, pristine environments,” cautions Nick Jewell, director of product evangelism and enablement at data analytics firm Alteryx. “The reality is that most systems of insight are built on architectures that have existing dependencies or a legacy of some kind.”

CIOs also need to decide if AI and advanced analytics are a project, or a long-term strategic choice for the business.

Discrete projects, especially where data is already in the cloud, might make good use of a cloud or outsourced solution. But if the business wants to drive long-term value from analytics, and later AI, it needs to connect its existing data to the analytics platforms. For that, the storage architecture will need to measure up.

