Storage 101: Unstructured data and its storage needs

Unstructured data exists in huge volumes, but often it is semi-structured with metadata. We lift the lid on unstructured data and key approaches to its storage

Stephen Pritchard

Published: 22 Nov 2018

Estimates suggest that upwards of 80% of business information is unstructured data.

That could cause headaches for anyone who needs to manage, organise and keep all that data secure.

One survey, commissioned by Igneous, an unstructured data management provider, found that 82% of respondents manage one billion or more files and objects. In fact, a staggering 59% of those asked manage more than 10 billion files.

Unstructured data is, broadly, all data and information that does not have a predefined data model.

In practical terms, for IT that means information that is outside a relational database, or stored outside an application environment such as an enterprise resource planning (ERP) or human resources (HR) system that sits on top of a database.

However, a growing volume of information could best be described as semi-structured. Although such data is not held in a database, there is some structure there, mostly in its metadata.

And, as technology – including object storage – allows for richer metadata, the boundaries between structured and unstructured information could blur further.

Unstructured data in context

Business information is, for the most part, generated by systems, or by people. Data from systems is most likely to be structured. An order number created by a sales system, and stored in a database, is a typical example.

Unstructured data is often created by people. An email from a sales team confirming the order would be unstructured, as would a social media message or voicemail complaining the order was late.

A photograph of an item damaged in delivery would, superficially, also be unstructured data – although the metadata the camera embeds in the image file, such as the time and location of the shot, is semi-structured information.
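
To make the distinction concrete, here is a minimal Python sketch that splits an email into its semi-structured part (the headers) and its unstructured part (the free-text body). The message, addresses and order number are invented for illustration, and `split_email` is a hypothetical helper rather than part of any product mentioned here.

```python
from email import message_from_string

# Invented example message: the headers act as metadata, the body is free text.
raw_email = """\
From: sales@example.com
To: customer@example.com
Subject: Order 10042 confirmed
Date: Thu, 22 Nov 2018 09:30:00 +0000

Hi, thanks for your order. We expect to deliver it early next week.
"""

def split_email(raw):
    """Return (metadata, body): header fields as a dict, body as plain text."""
    msg = message_from_string(raw)
    metadata = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
    body = msg.get_payload().strip()
    return metadata, body

metadata, body = split_email(raw_email)
print(metadata["Subject"])
print(body)
```

The headers can be filtered and sorted like database fields, whereas the body needs text analysis before it yields anything comparable.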

Data can also move between being unstructured and structured during its lifecycle. A business seeing a spike in delivery complaints could combine metadata from customer photographs with geo-tracking information from delivery vehicles in a business intelligence tool.
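
The kind of combination described above might look something like the following sketch. It assumes, purely for illustration, that each complaint photograph has already been tagged with an order number and that the tracking system records which vehicle carried each order. The field names and the `complaints_by_vehicle` function are hypothetical.

```python
from collections import Counter

# Hypothetical metadata attached to customer complaint photographs.
photo_metadata = [
    {"order_id": "A100", "taken_at": "2018-11-01T10:15"},
    {"order_id": "A101", "taken_at": "2018-11-01T11:40"},
    {"order_id": "A102", "taken_at": "2018-11-02T09:05"},
    {"order_id": "A999", "taken_at": "2018-11-02T14:30"},  # no tracking record
]

# Hypothetical records from a delivery tracking system.
delivery_tracking = [
    {"order_id": "A100", "vehicle": "VAN-7"},
    {"order_id": "A101", "vehicle": "VAN-7"},
    {"order_id": "A102", "vehicle": "VAN-3"},
]

def complaints_by_vehicle(photos, tracking):
    """Count complaint photographs per delivery vehicle, joined on order_id."""
    vehicle_by_order = {t["order_id"]: t["vehicle"] for t in tracking}
    return Counter(
        vehicle_by_order[p["order_id"]]
        for p in photos
        if p["order_id"] in vehicle_by_order
    )

counts = complaints_by_vehicle(photo_metadata, delivery_tracking)
print(counts.most_common())
```

Once joined on a shared key, the photograph metadata behaves like any other structured record, which is the point at which a business intelligence tool can take over.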

Although free text-based analysis – and even image analysis – is becoming more powerful, most text analysis tools use a database engine of some sort.

Structured data usually comprises small pieces of information, such as the value of a single database entry, although collectively data volumes can be large.

Unstructured data comes in a much wider range of sizes, from a few kilobytes for a message to potentially terabytes for uncompressed video footage.

Semi-structured machine data

As well as dealing with more data, information and storage managers now need to handle a wider range of data types, both in centralised and user systems.

IT has moved a long way beyond spreadsheets and word processor files on the desktop and a few shared databases, to a much richer range of information sets. Audio, image and video files now sit alongside information from the web and, increasingly, information from connected devices and the internet of things (IoT).

Data generated by sensors and connected devices is essentially semi-structured. Whether it is a temperature sensor in a factory, or a surveillance camera stream, the raw data is of limited use. Metadata, such as time and location, is essential for human or automated analysis of the raw information.

Without metadata, it is difficult – perhaps impossible – to make informed decisions. It is also the metadata that allows analysts to categorise information and move it into a structured environment, such as a database, for processing.

Interrogating the data, whether for a simple historical report or sophisticated predictive analysis, is impossible without a framework of metadata.
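
As a rough illustration of metadata carrying data into a structured environment, the sketch below loads a few sensor messages into an in-memory SQLite table and runs a simple report. The JSON layout, sensor names and sites are assumptions made for this example; real devices will emit their own formats.

```python
import json
import sqlite3

# Invented sensor messages: each raw value arrives with time and location metadata.
raw_messages = [
    '{"sensor_id": "temp-01", "site": "factory-a", "ts": "2018-11-22T09:00:00", "value": 21.5}',
    '{"sensor_id": "temp-01", "site": "factory-a", "ts": "2018-11-22T10:00:00", "value": 23.0}',
    '{"sensor_id": "temp-02", "site": "factory-b", "ts": "2018-11-22T09:00:00", "value": 19.0}',
]

def load_readings(messages):
    """Parse the messages and store them as rows in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (sensor_id TEXT, site TEXT, ts TEXT, value REAL)"
    )
    rows = [
        (m["sensor_id"], m["site"], m["ts"], m["value"])
        for m in map(json.loads, messages)
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def average_by_site(conn):
    """Simple historical report: mean reading per site."""
    cur = conn.execute(
        "SELECT site, AVG(value) FROM readings GROUP BY site ORDER BY site"
    )
    return dict(cur.fetchall())

conn = load_readings(raw_messages)
averages = average_by_site(conn)
print(averages)
```

Without the `site` and `ts` fields, the three temperature values could not be grouped or ordered at all, which is the sense in which the metadata makes the analysis possible.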

Industry observers expect IoT data volumes to grow rapidly. Gartner, for example, expects to see 20 billion IoT-connected devices by 2020. Another estimate, from IDC, suggests that the global volume of data, driven in part by IoT, will reach 163ZB (zettabytes) by 2025.

The need to capture metadata is having a greater impact on businesses’ data management and storage needs. As much as 5.2ZB of data will need to be analysed and, perhaps because of that, 26% of data will be in the public cloud by 2025, IDC predicts.

Storage choices

Cloud storage is an attractive option for at least some types of unstructured data. The cloud is well-suited to information that needs to be accessed infrequently. One type of data already widely migrated to public cloud storage is archive material.

Firms that need to keep long-term records can make use of the cloud’s low cost per gigabyte, paying retrieval fees only for the data they actually need to access, for example in a disaster recovery or forensic investigation scenario.

Long-term data storage can also tolerate the performance lags that potentially come with use of the public cloud. Systems such as Microsoft SharePoint – a common repository for unstructured business information – are less affected by latency than a transactional system, such as ERP, which relies on a relational database.

For semi-structured data, the cloud is appealing because of its use of object storage.

With semi-structured data, the business value can lie as much in the metadata as the data itself. As object stores can distribute data – and metadata – across multiple locations, they raise the prospect of fast, localised searches for metadata while gaining the scale economies of the cloud for raw data.
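
The idea of searching metadata without touching the raw data can be sketched with a toy, in-memory object store. This is an illustration of the concept only, not the API of any real object storage product: `TinyObjectStore` and its methods are hypothetical, and a production system would distribute both the objects and the index.

```python
from dataclasses import dataclass

@dataclass
class StoredObject:
    key: str
    data: bytes
    metadata: dict

class TinyObjectStore:
    """Toy object store: payloads plus a separate index over metadata."""

    def __init__(self):
        self._objects = {}
        self._index = {}  # (field, value) -> set of object keys

    def put(self, key, data, metadata):
        # Metadata values must be hashable (strings, numbers) to be indexed.
        old = self._objects.get(key)
        if old is not None:
            # Drop index entries belonging to the object being overwritten.
            for item in old.metadata.items():
                self._index[item].discard(key)
        self._objects[key] = StoredObject(key, data, dict(metadata))
        for item in metadata.items():
            self._index.setdefault(item, set()).add(key)

    def find(self, **criteria):
        """Return keys whose metadata matches every criterion, using the index only."""
        sets = [self._index.get(item, set()) for item in criteria.items()]
        if not sets:
            return sorted(self._objects)
        return sorted(set.intersection(*sets))

    def get(self, key):
        return self._objects[key].data

store = TinyObjectStore()
store.put("photos/0001.jpg", b"...", {"site": "depot-north", "camera": "cam-2"})
store.put("photos/0002.jpg", b"...", {"site": "depot-north", "camera": "cam-5"})
store.put("photos/0003.jpg", b"...", {"site": "depot-south", "camera": "cam-2"})

matches = store.find(site="depot-north", camera="cam-2")
print(matches)
```

Because `find` consults only the index, the index could in principle live close to the user while the bulky payloads sit in cheaper, more distant storage.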

Object storage gaining traction

The larger the dataset, the more attractive the object model, and as a result, object storage is gaining traction in industries as diverse as media and entertainment, life sciences, and oil and gas.

According to Forrester analysts Boris Evelson and Elizabeth Cullen, cloud-based text analytics tools can be up and running in minutes, even if it takes rather longer to train algorithms to become productive. As businesses can now run analytics in the cloud, there is a stronger case for keeping data in the cloud too.

Performance requirements, though, will act to keep some unstructured and semi-structured datasets on-premise. Over the past decade, storage suppliers have steadily improved the performance of network-attached storage (NAS) – still the go-to architecture for on-premise, unstructured data.

Clustered NAS can offer performance close to direct-attached storage or a storage area network (SAN). Data that requires fast processing, such as real-time analytics or customer-facing systems, can be supported on NAS.

And CIOs are likely to favour NAS, or on-premise object storage, where data security and compliance considerations rule out the cloud. In this case, policy requirements may well trump technical or cost considerations.
