
Disaggregated systems try to unlock the NVMe conundrum

NVMe can unlock the true potential of flash, but storage controllers put a bottleneck back in the I/O path. We look at how suppliers are trying to solve the problem


The shared storage array has seen amazing success as a key component of IT infrastructure over the past 30 years. Consolidation of storage from many servers into one appliance has provided the ability to deliver more efficient services, increase availability and reduce costs.

But as storage media moves towards NVMe-connected flash, shared arrays are showing their age and are being superseded by a new wave of disaggregated storage products.

To understand the root cause of the impending issue with shared storage, we need to look at the media in use.

In recent years, hard drives have given way to flash (Nand) storage that is orders of magnitude faster than spinning media, but for a long time spinning disk was the only practical option.

The performance profile of a single hard drive was such that centralising input/output (I/O) through one or more controllers didn't hurt performance and, in most cases, improved it.

In other words, the spinning disk drive itself was the bottleneck in I/O. The controller provided much-needed functionality with no further effect on performance.

Performance-wise, an HDD-based array might, for example, deliver latency of 5ms to 10ms. Flash set the bar at less than 1ms, with suppliers looking to achieve ever lower numbers. The first all-flash systems were based on SAS/Sata drives and connectivity.

The next transition in media is towards NVMe drives, where the I/O protocol is much more efficiently implemented. As a result, traditional array designs that funnel I/O through two or more centralised controllers simply can’t exploit the aggregate performance of a shelf of NVMe media. In other words, the controller is now the bottleneck in I/O.

Removing the controller bottleneck

The answer so far to the NVMe performance conundrum has been to remove the bottleneck completely.

Rather than have all I/O pass through a central controller, why not have the client system access the drives directly?

With a fast, low-latency network and direct connectivity to each drive in a system, the overhead of going through shared controllers is eliminated and the full value of NVMe can be realised.

This is exactly what new products coming to the market aim to do. Client servers running application code talk directly to a shelf of NVMe drives, with the result that much lower latency and much higher performance numbers are achieved than with traditional shared systems.

NVMe implementation

Creating a disaggregated system requires separation of the data and control planes. Centralised storage implements both planes in the controller. Disaggregated systems move control to separate elements of the infrastructure and/or to the client itself.

Splitting the functionality removes controller overhead from the I/O path, but it also complicates management, because the functions that were performed centrally still have to be done somewhere.
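As a rough illustration of where the responsibilities sit after the split, the control plane decides where data lives and how it is protected, while the data plane simply moves blocks to and from the drives. The sketch below is purely illustrative; the class and method names are hypothetical and not taken from any supplier's product.

```python
# Purely illustrative sketch of the control/data plane split.
# Class and method names are hypothetical.

class ControlPlane:
    """Owns metadata and placement decisions. In a disaggregated design
    this runs on separate nodes (or is distributed), out of the I/O path."""

    def create_lun(self, name: str, size_bytes: int) -> str:
        raise NotImplementedError  # reserve metadata, return a LUN identifier

    def resolve(self, lun_id: str, lba: int) -> tuple[str, int]:
        raise NotImplementedError  # map a logical block to (drive_id, offset)


class DataPlane:
    """Moves data between client and drives. In a disaggregated design
    this runs on the client itself, over a low-latency network."""

    def write(self, drive_id: str, offset: int, data: bytes) -> None:
        raise NotImplementedError  # e.g. an NVMe-over-fabrics write

    def read(self, drive_id: str, offset: int, length: int) -> bytes:
        raise NotImplementedError
```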

To understand what this means in practice, consider the I/O to and from a single LUN on shared storage mapped to a client server. I/O to that LUN is addressed by logical block address (LBA).

The client writes from block 0 to the highest block number available, based on the block size and capacity of the LUN. The controllers in the array are responsible for mapping each logical address to a physical location on the storage media.

As data passes through a shared controller, each block may be deduplicated, compressed, protected (by Raid or erasure coding, for example) and assigned one or more physical locations on storage. If a drive fails, the controller rebuilds the lost data. If a new LUN is created, the controller reserves space in metadata and allocates physical capacity on disk or flash as the LUN is used.
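To make the controller's workload concrete, here is a minimal sketch of a centralised write path, with deliberately simplistic deduplication, compression and placement logic. The class is hypothetical and glosses over much of what a real array does (parity or erasure-coded writes, garbage collection, wear levelling and so on), but it shows the steps every block goes through.

```python
import hashlib
import zlib

# Hypothetical, much-simplified write path of a centralised controller.
# Real arrays implement this in optimised firmware/software.

class SimpleController:
    def __init__(self, num_drives: int):
        self.num_drives = num_drives
        self.dedupe_index = {}          # content hash -> (drive, offset)
        self.lba_map = {}               # (lun_id, lba) -> (drive, offset)
        self.next_offset = [0] * num_drives

    def write(self, lun_id: str, lba: int, block: bytes) -> None:
        # 1. Deduplicate: if identical content is already stored, just point at it
        digest = hashlib.sha256(block).hexdigest()
        if digest in self.dedupe_index:
            self.lba_map[(lun_id, lba)] = self.dedupe_index[digest]
            return

        # 2. Compress the block before it hits the media
        compressed = zlib.compress(block)

        # 3. Choose a physical location (trivial placement here; a real array
        #    would also write Raid parity or erasure-coded shards)
        drive = hash(digest) % self.num_drives
        offset = self.next_offset[drive]
        self.next_offset[drive] += len(compressed)

        # 4. Record the logical-to-physical mapping in metadata
        location = (drive, offset)
        self.dedupe_index[digest] = location
        self.lba_map[(lun_id, lba)] = location
        # ... the actual media write of `compressed` to (drive, offset) goes here
```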

In disaggregated systems, these functions still need to be done and are, in general, passed out to the client to perform. The client servers need visibility of the metadata and data, and a way to coordinate with each other so that no data corruption occurs.
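In a disaggregated design, broadly the same steps move to the client, which needs shared metadata and some form of locking or ownership so that clients do not tread on each other's data. The following sketch is hypothetical: `shared_metadata` and `nvme_write` are stand-ins for whatever distributed metadata service and network data path (NVMe over Fabrics with RDMA, for example) a real product supplies.

```python
# Hypothetical sketch of a client-side write in a disaggregated system.
# `shared_metadata` and `nvme_write` stand in for a real product's
# distributed metadata service and network data path.

class DisaggregatedClient:
    def __init__(self, shared_metadata, nvme_write):
        self.metadata = shared_metadata   # shared/distributed LBA map
        self.nvme_write = nvme_write      # direct write to a remote NVMe drive

    def write(self, lun_id: str, lba: int, block: bytes) -> None:
        # Clients must coordinate: take a lock (or own a shard of the
        # address space) before touching the shared mapping.
        with self.metadata.lock(lun_id, lba):
            drive, offset = self.metadata.allocate(lun_id, lba, len(block))

            # Data travels straight from client to drive over the
            # low-latency network; there is no shared controller in the path.
            self.nvme_write(drive, offset, block)

            # Data protection is now the client's job too; a simple
            # mirrored copy stands in for Raid/erasure coding here.
            m_drive, m_offset = self.metadata.allocate(lun_id, lba, len(block))
            self.nvme_write(m_drive, m_offset, block)

            # Commit the logical-to-physical mapping so other clients see it
            self.metadata.commit(lun_id, lba,
                                 [(drive, offset), (m_drive, m_offset)])
```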

Why disaggregate?

The introduction of NVMe offers great performance improvements.

For certain applications, low latency is essential, but without disaggregation the only real way to achieve it is to deploy storage in the client server itself. NVMe flash drives deliver very low latency, and Intel's Optane NVMe drives go lower still.

Unfortunately, putting storage back into servers isn't scalable or cost-effective; those drawbacks are the reason shared storage was adopted in the first place. Disaggregation provides a middle ground that combines the benefits of media consolidation with (seemingly) local storage, to get the highest performance possible from the new media.

The type of applications that need low latency include financial trading, real-time analytics processing and large databases where transaction performance is a direct function of individual I/O latency times.

There's an analogy here with the early days of flash storage, when all-flash arrays were deployed in the enterprise for applications that would have been expensive to rewrite or that could only be sped up by delivering lower latency.

In the first implementations it’s likely we will see disaggregated systems deployed on only those applications that will benefit most, as there are some disadvantages to the architecture.

Compromises

As highlighted already, depending on the implementation, client servers in disaggregated systems have a lot more work to do to maintain metadata and perform calculations such as Raid/erasure coding, compression and deduplication.

Support is limited to specific operating systems and may require the deployment of kernel-level drivers or other code that creates dependencies on the OS and/or application. Most systems use high-performance networks such as InfiniBand or 40Gb Ethernet with custom NICs.

This increases the cost of systems and will introduce support challenges if this technology is new to the enterprise organisation. As with any technology, the enterprise will have to decide whether the benefits of disaggregation outweigh the support and cost issues.

One other area not yet fully settled is the standards by which these systems will operate. NVMe over a network, or NVMe over Fabrics (NVMeF), is defined by the NVM Express organisation and covers the use of physical transports such as Ethernet and InfiniBand with access protocols such as RDMA over Converged Ethernet (RoCE) and Internet Wide-Area RDMA Protocol (iWarp), which provide remote direct memory access (RDMA) from the client server to individual drives.

Some suppliers in our roundup have pushed ahead with their own implementations in advance of any standards being ratified.

NVMe supplier systems

DVX is a disaggregated storage system from startup Datrium. The company defines its offering as open convergence and has a model that uses shared storage and DRAM or flash cache in each client server. The company claims some impressive performance figures, achieving an IOmark-VM score of 8,000 using 10 Datrium data nodes and 60 client servers.

E8 Storage offers dual or single appliance models. The E8-D24 dual controller appliance offers Raid-6 protection across 24 drives, whereas the E8-S10 implements Raid-5 across 10 drives. Both systems use up to 100GbE with RoCE and can deliver up to 10 million IOPS with 40GBps throughput. E8 also offers software-only systems for customers that want to use their own hardware. Note that the dual controller implementation is to provide metadata redundancy.

Apeiron Data Systems offers a scale-out system based on 24-drive NVMe disk shelves. Client servers are connected using 40Gb Ethernet. Apeiron claims performance figures of 18 million IOPS per shelf/array and an aggregate of 142 million with eight shelves. Latency figures are as low as 100µs with MLC flash and 12µs with Intel Optane drives.

Excelero offers a platform called NVMesh that is deployed as a software system across multiple client servers. Each client server can contribute and consume storage in a mesh architecture that uses Ethernet or InfiniBand and a proprietary protocol called RDDA. Systems can be deployed in disaggregated mode with dedicated storage or as a converged system. Latency is rated as low as 200µs, with 5 million IOPS and 24GBps of bandwidth.
