
Scale-out NAS hits the spot for large datasets, AI and machine learning

Despite the rise of cloud and object storage, scale-out NAS is a key choice for the big datasets increasingly prevalent in artificial intelligence and machine learning scenarios

If you thought the humble NAS (network-attached storage) was dead, think again.

Last year, analysts at HFT MI estimated that the global market for NAS products would be worth $45.2bn by the end of 2023, growing at a rate of 20.2% annually. Much of this growth is driven by one specific category of NAS product – the scale-out NAS.

In a recent report, Markets and Markets predicted that the scale-out NAS market would reach $32.6bn by 2022. That is a formidable figure, especially when you consider the frenetic growth of off-premise cloud storage services, which are slickly marketed and often easier for threadbare, overburdened IT teams to administer.

That is easy to understand when you consider the innate characteristics of scale-out NAS products. As the name implies, they are literally designed to be scalable, allowing operators to add storage capacity as required without having to make any drastic (or, indeed, expensive) structural changes. Should a scale-out NAS reach the limits of its capacity, adding some breathing room is merely a matter of installing new nodes.

Before we delve into the market, it is worth discussing the fundamental value proposition of scale-out NAS, especially in comparison to older conventional NAS products.

So, what is it good for? Scale-out NAS addresses two specific problems that are endemic in contemporary data and content-rich enterprises.

The first is scale: scale-out NAS is well suited to handling large volumes of unstructured data in a cost-effective way.

In many cases, it is simply cost-effective to procure scale-out NAS hardware because it is designed to be vertically and horizontally scalable. By design, users can expand the available storage by adding nodes, which in some cases can be done independently of processing, network and I/O capability. These nodes build into storage clusters that share resources across a parallel file system.

This point is important because it drastically reduces the initial cost of ownership. It is possible to buy a controller with a relatively limited amount of storage and increase it later as needs dictate. Given that enterprise storage systems routinely cost tens or hundreds of thousands of dollars, if not millions, this soothes the financial sting somewhat.
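To make the arithmetic concrete, here is a minimal sketch of the start-small, add-nodes-later economics. The per-node capacity, protection overhead and prices are hypothetical assumptions for illustration, not any supplier's figures.

```python
# A minimal sketch of scale-out economics: start small, add nodes as data grows.
# Per-node capacity, protection overhead and pricing are hypothetical assumptions.

def usable_capacity_tb(nodes: int, raw_tb_per_node: float,
                       protection_overhead: float = 0.25) -> float:
    """Usable capacity once erasure-coding/replication overhead is deducted."""
    return nodes * raw_tb_per_node * (1 - protection_overhead)

def expansion_cost(nodes_added: int, cost_per_node: float) -> float:
    """Growth is bought node by node, rather than as a forklift upgrade."""
    return nodes_added * cost_per_node

print(usable_capacity_tb(4, raw_tb_per_node=120))    # initial cluster: 360.0 TB usable
print(usable_capacity_tb(12, raw_tb_per_node=120))   # after expansion: 1080.0 TB usable
print(expansion_cost(8, cost_per_node=40_000))       # cost of the extra eight nodes only
```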

NAS is flexible

The next advantage of scale-out NAS is that it is flexible. Businesses can buy the storage they need, rather than a one-size-fits-all solution. This is particularly helpful during the initial design processes, because it means infrastructure architects and IT procurement managers are not required to forecast storage trends, which can be difficult to discern years in advance. 

Of course, it is pointless to talk about scale without considering specific capabilities. 

Scale-out NAS systems use parallel file systems, such as IBM's Spectrum Scale/General Parallel File System (GPFS), Dell EMC's OneFS, pNFS (Parallel NFS) and the open-source Lustre.

Unlike ordinary file systems, parallel file systems can store vast numbers of files. Lustre, for example, uses a system called Lustre Metadata Targets (MDT) to track and organise files. Each MDT has a theoretical maximum limit of four billion files. It is possible to have multiple different MDTs, for data lakes of the largest scale. 

Spectrum Scale is similarly capable, with each file system able to store 2^64 files – or, written plainly, 18,446,744,073,709,551,616. It supports volumes of up to 8 yottabytes (or 8,000,000,000,000 terabytes) and files up to 8 exabytes (8,000,000 terabytes) in size. 
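Worked through as plain arithmetic, those headline figures look like this (the 16-MDT configuration is simply an illustrative assumption):

```python
# The headline limits above, restated as plain arithmetic.

LUSTRE_FILES_PER_MDT = 4_000_000_000        # theoretical per-MDT ceiling
mdts = 16                                   # hypothetical multi-MDT file system
print(f"Lustre namespace: {LUSTRE_FILES_PER_MDT * mdts:,} files")

SPECTRUM_SCALE_MAX_FILES = 2 ** 64          # files per Spectrum Scale file system
print(f"Spectrum Scale: {SPECTRUM_SCALE_MAX_FILES:,} files")

TB_PER_YOTTABYTE = 10 ** 12                 # 1 YB = 1,000,000,000,000 TB
print(f"8 YB volume limit = {8 * TB_PER_YOTTABYTE:,} TB")
```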

Of course, it is extremely unlikely that any organisation – even the biggest – would brush up against these limits, not least because provisioning that amount of storage would be hugely expensive. 

Nonetheless, those limits exist because most commercial scale-out NAS systems go far beyond what is possible with ordinary file systems. It is not uncommon to see systems deployed with several petabytes of storage.

And because these systems are often intended to be shared resources, sometimes handling thousands of concurrent reads and writes, scale-out NAS systems often permit huge amounts of throughput. Of course, this is often contingent on the configuration of the system, as well as the underlying network architecture.
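As a back-of-envelope illustration (with purely hypothetical per-node and network figures), aggregate throughput is bounded by whichever runs out first: the nodes serving the data or the network carrying it.

```python
# Back-of-envelope throughput estimate for a scale-out NAS cluster.
# All figures are hypothetical assumptions, not vendor specifications.

def aggregate_throughput_gbytes_s(nodes: int, per_node_gbytes_s: float,
                                  uplinks: int, uplink_gbytes_s: float) -> float:
    """Throughput is capped by the smaller of what the nodes can serve
    and what the network between cluster and clients can carry."""
    node_bound = nodes * per_node_gbytes_s
    network_bound = uplinks * uplink_gbytes_s
    return min(node_bound, network_bound)

# Example: 8 nodes at ~3 GB/s each, reached over 4 x 25GbE links (~3.1 GB/s each).
print(aggregate_throughput_gbytes_s(8, 3.0, 4, 3.1))   # network-bound: 12.4 GB/s
```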

A place for scale-out NAS

Those writing the obituary for NAS technologies often assume that cloud-based services can serve as a direct replacement for on-premise storage. That is a faulty assumption. 

In fact, there are countless scenarios where it is not tenable, in terms of cost or performance, to use an external third-party service. 

Let’s ignore the common corporate scenarios. It is obvious that large blue-chip companies with hundreds of thousands of employees will use their own in-house storage solutions. But what about the more exotic use-cases?

A great example would be the visual arts. Increasingly, film and television studios prefer to shoot footage in ultra-high-resolution formats, to cater for viewers with more sophisticated home cinema systems. An hour of unprocessed 4K footage routinely weighs in at more than 100GB. Bear in mind that an hour-long TV show is often the culmination of tens of hours of footage, which is selected, edited and compiled into a single programme.
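A rough worked example shows what that means for storage. The 30:1 shooting ratio and 10-episode season below are assumptions for illustration, not industry figures.

```python
# Rough storage estimate for an episodic production, using the figure quoted above.
# The shooting ratio and season length are illustrative assumptions.

GB_PER_HOUR_4K_RAW = 100        # unprocessed 4K footage, per hour
shooting_ratio = 30             # hours shot per finished hour (assumption)
episodes = 10                   # one season of hour-long episodes (assumption)

raw_hours = episodes * shooting_ratio
raw_storage_tb = raw_hours * GB_PER_HOUR_4K_RAW / 1000
print(f"{raw_hours} hours of raw footage is roughly {raw_storage_tb:.0f} TB before editing")
```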


NAS systems can provide studios with a single hub for their content. But what makes scale-out NAS systems so uniquely useful for this sector is that it is trivial to provision additional storage, which means studios don’t have to delete old footage to accommodate new projects. “Director’s cut”, anyone? 

And then there are other data-rich environments. An obvious example is machine learning and artificial intelligence, which involve training models on vast quantities of data. Obviously, the scale of this depends greatly on the actual task.

On the relatively small side of things, you’ll find datasets like Stanford University’s MURA, which is used to train algorithms to identify musculoskeletal abnormalities. This consists of just 40,561 files. Meanwhile, Google’s open-source Google-Landmarks-v2 dataset includes a whopping five million files. This is used to train computer vision applications to identify iconic landmarks, such as San Francisco’s Golden Gate Bridge, as well as natural phenomena such as Niagara Falls. 

In this scenario, scale-out NAS systems offer the flexibility to accommodate datasets that are not just large, but continuously expanding. Google-Landmarks-v2 is a case in point: it is twice as big as the previous version, despite being released only a year later.
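For teams in this position, even a simple script run against the NAS mount can show how quickly a dataset is growing, so capacity can be provisioned ahead of need. A minimal sketch (the mount path is a hypothetical example):

```python
# A minimal sketch for tracking a growing training dataset on a NAS mount:
# count files and total bytes under the dataset directory.
# The mount path is a hypothetical example.

import os

def dataset_footprint(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for everything under root."""
    files, size = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            files += 1
            size += os.path.getsize(os.path.join(dirpath, name))
    return files, size

count, nbytes = dataset_footprint("/mnt/scaleout-nas/datasets/landmarks-v2")
print(f"{count:,} files, {nbytes / 1e12:.2f} TB")
```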

Scale-out NAS: The big five suppliers

As you would expect, the NAS market is filled with multiple competing suppliers, all trying to grab a slice of the pie. There are five major suppliers: Dell EMC (Isilon), NetApp, HPE, IBM and Hitachi Vantara.

These suppliers distinguish themselves in various ways. Many have their own operating systems. NetApp uses its Ontap platform (which supports pNFS), while Dell EMC's Isilon has OneFS.

Each of these has its own inherent advantages and disadvantages. Quite often, the choice between platforms comes down to the “wetware” of the organisation, with the deciding factor being levels of internal familiarity with a particular operating system. 

Other areas of distinction include the media offered. Some suppliers have all-flash systems, while others offer a mix of flash and disk-based storage. 

And then there's cloud capability. This manifests itself slightly differently across suppliers. Many offer their own cloud services, and it is not surprising to see them push customers towards their proprietary offerings. Fortunately, there is some give here, as suppliers tend to offer users a degree of choice.

Hitachi, for example, allows operators of its scale-out NAS systems to connect to its own cloud services, as well as to the more popular Azure and AWS. Dell EMC, on the other hand, offers its own cloud storage products (CloudPools and Virtustream), as well as AWS, Google Cloud and Azure. 

Scale-out NAS suppliers: Minnows in a scale-out sea

As well as the big five, there are smaller niche (boutique, even) suppliers, all of which try to distinguish themselves by offering industry-specific features.

San Jose-based Quantum, for example, heavily targets customers in the media and visual arts industries. One product line aimed at this market – the Quantum F-Series – accomplishes this by offering NVMe-based storage which permits faster I/O performance.  

Taiwan-based QNAP also targets content-rich creatives, while catering to customers running AI and scientific computing workloads. It offers scale-out NAS systems with hardware and software support for on-system model training, as well as video rendering.

That latter bit is interesting, because it highlights the increasing potency of NAS hardware. In the case of QNAP, many of its systems come with hardware that strongly resembles that of a professional workstation. The TS-2888X, for example, can support up to 512GB of RAM and up to four GPUs, and comes with a capable Intel Xeon processor.

Of course, this indicates another interesting industry trend. People don’t necessarily just want a NAS; they also want something that can store their files and perform high-intensity computing tasks, such as model training and video rendering.  
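To illustrate the idea of computing where the data lives, here is a hedged sketch of setting up model training directly against a local share on such a system. It assumes PyTorch and torchvision are installed; the dataset path and the deliberately tiny model are hypothetical.

```python
# A sketch of training directly against data on the NAS's own volumes,
# avoiding a staging copy to a separate compute cluster.
# Assumes PyTorch/torchvision are installed; path and model are hypothetical.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on {device}, {torch.cuda.device_count()} GPU(s) visible")

# Data read straight from the local share, with no copy over the network.
data = datasets.ImageFolder(
    "/share/datasets/landmarks",
    transform=transforms.Compose([transforms.Resize((224, 224)),
                                  transforms.ToTensor()]))
loader = DataLoader(data, batch_size=64, shuffle=True, num_workers=8)

# A deliberately tiny model, just to show the training job lives on the box.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 224 * 224, len(data.classes))).to(device)
```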

How big is the scale-out NAS market really? 

We have already discussed the rosy forecasts produced by forward-looking analysts, who predict that the scale-out NAS market will continue to thrive, despite the drumbeat of various doomsayers.

But actually determining the current size of this market isn’t easy. Financial disclosure documents from publicly traded big players such as Dell EMC and NetApp reveal that the former, for example, sold $4.1bn worth of storage technology during the first quarter of 2019 alone. 

However, this figure is not broken down further by category. You can’t just attribute this figure to the scale-out NAS market, because Dell EMC also sells SAN (storage-area network), archiving, and cloud storage hardware. 

But it is true that cloud (and, to a lesser extent, object) storage technologies are growing rapidly, and command a vast amount of enterprise IT spending. According to Allied Market Research, the total cloud storage market is expected to reach $97.4bn by 2020. But how much of that growth comes at the expense of old-school NAS products?

It is not clear. It should also be noted that cloud storage technologies are often used in tandem with NAS technologies. By combining both approaches, enterprises can take advantage of greater availability and flexibility, and many scale-out NAS providers offer integrated cloud functionality.

But what about object storage? Again, this is a rapidly growing sphere, which, according to Market Research Future, is expected to reach $6bn by 2023. 

Rather than organising files in a hierarchical format, with folders and subdirectories, object storage has a flat structure. It organises data as objects and keeps track of them through an approach that leans heavily on metadata. This lends itself well to tasks that involve vast numbers of files, such as a large music-streaming service, or a social network that hosts billions of photos.
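A toy sketch of the difference: objects sit in one flat namespace keyed by ID, and lookups go through metadata rather than a directory path. The identifiers and metadata fields below are purely illustrative.

```python
# A toy illustration of a flat, metadata-driven object store.
# Identifiers and metadata fields are purely illustrative.

store = {}  # object_id -> (data, metadata)

def put(object_id: str, data: bytes, **metadata) -> None:
    store[object_id] = (data, metadata)

def find(**criteria) -> list[str]:
    """Locate objects by their metadata, not by a directory path."""
    return [oid for oid, (_, meta) in store.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

put("a1b2c3", b"...", user="alice", type="photo", album="holiday")
put("d4e5f6", b"...", user="bob", type="photo", album="pets")
print(find(user="alice", type="photo"))   # ['a1b2c3']
```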

There are obvious enterprise applications for this technology. But, at the same time, object storage solves a problem that only the largest companies are likely to experience – those that service millions of users. While it is conceivable that object storage will consume the top tier of the enterprise storage market, it is far more likely that small to medium-sized enterprises will stick with tried-and-tested technologies.
