Big data storage: Defining big data and the type of storage it needs

Big data storage: What is big data, what compute/storage system configurations are used for big data analytics, and what type of storage infrastructure does it require?

Big data has emerged as a key buzzword in business IT over the past year or two. It’s easy to be cynical, as suppliers try to lever in a big data angle to their marketing materials.

But it’s not all hype. Big data is a term that can be applied to some very specific characteristics in terms of scale and analysis of data. And it’s not necessarily going to be the preserve of very large companies such as Facebook and Google.

Relatively small organisations can get business value and insight from analysis of their data sets.

In this podcast, I discuss what defines big data and the key attributes required of big data storage.

What is big data?

The simplest explanation of the big data phenomenon is that, on the one hand, it is about very large amounts of data and, on the other, it is almost always about running analytics on those large data sets.

On the face of it, neither the volume of data nor the analytics element is really new. For many years, enterprise organisations have accumulated growing stores of data. Some have also run analytics on that data to gain value from large information sets.

A notable example is the oil and gas industry, which has for decades run very large data sets through high-performance computing (HPC) systems to model underground reserves from seismic data.

There have also been analytics in data warehousing, for example, where businesses would interrogate large data sets for business value.

But both of these examples highlight what we mean by big data in the contemporary sense through what they lack. HPC and data warehousing deal with very large data sets and entail analytics, but the data is overwhelmingly structured and operations run in batches – often overnight for data warehousing, and over several days or even weeks in research computing.

By contrast, what we call big data today often deals with very large unstructured data sets, and is dependent on rapid analytics, with answers provided in seconds.

Examples of this are Facebook, Google and Amazon, which analyse user statuses or search terms to trigger targeted advertising on users' pages.

But big data analytics are not restricted to these web giants. All sorts of organisations, and not necessarily huge ones, can benefit – from finance houses interested in analysing stock market behaviour to police departments aiming to analyse and predict crime trends.

Key requirements of big data storage

At root, the key requirements of big data storage are that it can handle very large amounts of data and continue to scale to keep pace with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
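As a rough, back-of-envelope illustration of what that IOPS requirement can mean in practice (all the figures below are assumptions chosen purely for illustration), the throughput and read rate needed to scan a data set within a target time work out as follows:

    # Back-of-envelope estimate (all figures are assumptions): the throughput
    # and IOPS needed to scan a data set within a target time at a given I/O size.
    data_set_bytes = 10 * 10**12   # 10 TB to be scanned for one analytics query
    target_seconds = 120           # answer wanted within two minutes
    io_size_bytes = 2**20          # 1 MiB per read

    throughput = data_set_bytes / target_seconds   # bytes per second
    iops = throughput / io_size_bytes              # read operations per second

    print("Required throughput: %.1f GB/s" % (throughput / 10**9))   # ~83.3 GB/s
    print("Required IOPS at 1 MiB reads: %d" % iops)                  # ~79,472

Smaller I/O sizes push the IOPS figure far higher, which helps explain why flash and parallelism across many nodes feature so heavily in the configurations discussed below.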

The largest big data practitioners – Google, Facebook, Apple, etc – run what are known as hyperscale computing environments.

These comprise vast amounts of commodity servers with direct-attached storage (DAS). Redundancy is at the level of the entire compute/storage unit, and if a unit suffers an outage of any component it is replaced wholesale, having already failed over to its mirror.

Such environments run the likes of Hadoop and NoSQL databases such as Cassandra as analytics engines, and typically use PCIe flash storage in the server, either alone or alongside disk, to cut storage latency to a minimum. There’s no shared storage in this type of configuration.
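To make that shared-nothing, data-local approach a little more concrete, here is a toy, single-process sketch in Python of the MapReduce pattern that engines such as Hadoop run at scale. The node names and log lines are invented for illustration; in a real cluster the map step runs on whichever node holds each shard on its local disks.

    from collections import defaultdict

    # Pretend each commodity node holds a shard of the data on its own DAS disks.
    shards = {
        "node-1": ["error timeout", "login ok"],
        "node-2": ["error disk", "login ok", "error timeout"],
    }

    def map_phase(lines):
        # In a real cluster this runs on the node that stores the shard:
        # emit a (word, 1) pair for every word seen.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reduce_phase(pairs):
        # Aggregate the mapped output into a count per word.
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    # The "shuffle" step simply gathers every node's mapper output before reducing.
    mapped = [pair for shard in shards.values() for pair in map_phase(shard)]
    print(reduce_phase(mapped))  # {'error': 3, 'timeout': 2, 'login': 2, 'ok': 2, 'disk': 1}

Only the small mapped results cross the network; the bulk reads stay on each node's local disks, which is what makes a shared storage array unnecessary in this model.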

Hyperscale computing environments have been the preserve of the largest web-based operations to date, but it is highly probable that such compute/storage architectures will bleed down into more mainstream enterprises in the coming years.

The appetite for building hyperscale systems will depend on an enterprise's ability to take on a great deal of in-house hardware building and maintenance, and on whether it can justify such systems for a limited set of tasks alongside more traditional environments that run large numbers of applications on less specialised systems.

But hyperscale is not the only way. Many enterprises, and even quite small businesses, can take advantage of big data analytics. They will need the ability to handle relatively large data sets and to handle them quickly, but may not need quite the same response times as organisations that use analytics to push adverts out to users within a few seconds.

So the type of big data storage system with the required attributes will often be scale-out or clustered NAS. This is file-access shared storage that can scale out to meet capacity or increased compute requirements. It uses parallel file systems distributed across many storage nodes, which can handle billions of files without the kind of performance degradation that ordinary file systems suffer as they grow.

For some time, scale-out or clustered NAS was a distinct product category, with specialised suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought relatively recently by big storage suppliers – EMC and Hitachi Data Systems, respectively.

Meanwhile, clustered NAS has gone mainstream, and the big change here came with NetApp incorporating true clustering and petabyte-scale, parallel file system capability into the Data ONTAP operating system that runs its FAS filers.

The other storage format that is built for very large numbers of files is object storage. This tackles the same challenge as scale-out NAS – that traditional tree-like file systems become unwieldy when they contain large numbers of files. Object-based storage gets around this by giving each file a unique identifier and indexing the data and its location. It’s more like the DNS way of doing things on the internet than the kind of file system we’re used to.
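A minimal sketch of that idea in Python, with an invented ObjectStore class and a content hash standing in for the unique object identifier; real object stores differ in detail, but the flat index in place of a directory tree is the point:

    import hashlib

    class ObjectStore:
        def __init__(self):
            self.index = {}   # object ID -> node holding the data (no directory tree)
            self.nodes = {}   # node name -> {object ID: data}

        def put(self, data, node="node-1"):
            object_id = hashlib.sha256(data).hexdigest()   # unique identifier
            self.nodes.setdefault(node, {})[object_id] = data
            self.index[object_id] = node                   # record where it lives
            return object_id

        def get(self, object_id):
            node = self.index[object_id]   # one index lookup, no path to walk
            return self.nodes[node][object_id]

    store = ObjectStore()
    object_id = store.put(b"seismic trace 42")
    assert store.get(object_id) == b"seismic trace 42"

Retrieval is a single lookup against the identifier, however many billions of objects the index holds, much as DNS resolves a name without walking a hierarchy.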

Object storage systems can scale to very high capacity and to file counts in the billions, so they are another option for enterprises that want to take advantage of big data. Having said that, object storage is a less mature technology than scale-out NAS.

So, to sum up, big data storage needs to be able to handle large capacities and provide the low latency that analytics work demands. You can choose to do it like the web giants do, with a hyperscale environment, or adopt scale-out NAS or object storage in a more traditional IT department.
