How to manage semi-structured data

Semi-structured data is here to stay, and it offers the potential for business advantage to companies that handle and analyse it well. Mark Whitehorn has advice on how to do that.

Data can be classified as structured, semi-structured or unstructured – but what bearing do these classifications have on a company’s data-handling strategy? The short answer is that it is becoming more important in our rapidly changing IT world to be aware of different data forms and how (or if) you need to manage them.

Structured data is data that has been split into small, discrete units. Each piece of data concerns one thing (to use a good Anglo-Saxon catchall word), for example, the last name of a customer. Structured data is typically stored in tables. Continuing with our example, one column of data would list the last names of all customers, and each row would pertain to one customer. These tables, in turn, are typically stored in a relational database.
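To make the customer example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample customers are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " first_name  TEXT NOT NULL,"
    " last_name   TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customer (customer_id, first_name, last_name) VALUES (?, ?, ?)",
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing"), (3, "Grace", "Hopper")],
)

# One column holds every customer's last name; each row is one customer.
last_names = [
    row[0]
    for row in conn.execute("SELECT last_name FROM customer ORDER BY last_name")
]
print(last_names)
```

Because every value sits in a known column, retrieving or filtering by last name is a one-line query.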

In very many cases, we find that data in the real world is not structured quite as neatly as this. But we impose the database structure upon it for the simple reason that doing so makes the data easy to retrieve and query. In practice, this works well for managing most business data; stock control, finance, human resources and other corporate systems all submit fairly readily to an imposed data structure.

The problem is that some data is not amenable to rigorous structuring – and such data is becoming more and more prevalent. A great deal of data relevant to the enterprise is turning up in documents, images and emails as well as tweets and other social media data. All of these can be described as semi-structured data.

The term unstructured data is also bandied around but is not, in my opinion, a viable classification. Virtually all data has some kind of structure – only random noise is truly unstructured, and it has very little commercial value.

Our options for managing semi-structured data are:

a) Ignore it (probably fatal in a competitive climate)

b) Force it into structured relational form

c) Adopt a different storage mechanism

Let’s consider those options one by one.

Ignore it. This is a good one to rule out: So much data is being created and collected in semi-structured forms that most enterprises cannot afford to disregard the outpouring of it. Doing so is viable only if there is no compelling business advantage in being able to track and analyse such data.

Stay relational. Relational database engines have been significantly modified over the years to handle what are characterised by the database manufacturers as “complex data types.” XML is one example: It is considered by many to be an excellent way of holding classic semi-structured data. Most common document formats are, or can be, rendered into XML, and almost all relational engines now have an XML data type, which means that documents can often be stored in a relational database. But the additional complexity of handling semi-structured data means there will inevitably be a trade-off, and in general that will equate to slower retrieval times. However, holding the documents in the database does make it very easy to find all tweets that refer to your product, all emails that mention “politician” and so on.

Other examples of complex data types are those that can handle spatial and image data.
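The XML approach might be sketched as follows. SQLite, used here because it ships with Python, has no native XML data type, so the documents are held in a TEXT column and parsed in Python; an engine with a real XML type would normally do the filtering inside the query itself. The table layout, the element names and the product name "WidgetPro" are all assumptions made up for the illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# body_xml stands in for an XML-typed column (SQLite has no XML type).
conn.execute(
    "CREATE TABLE message (message_id INTEGER PRIMARY KEY, body_xml TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO message (message_id, body_xml) VALUES (?, ?)",
    [
        (1, "<tweet><user>anna</user><text>Loving my new WidgetPro</text></tweet>"),
        (2, "<tweet><user>ben</user><text>Rain again today</text></tweet>"),
        (3, "<email><from>cy@example.com</from>"
            "<text>The WidgetPro arrived broken</text></email>"),
    ],
)


def tweets_mentioning(connection, term):
    """Return the ids of stored tweets whose <text> element contains term."""
    matches = []
    rows = connection.execute(
        "SELECT message_id, body_xml FROM message ORDER BY message_id"
    )
    for message_id, body_xml in rows:
        root = ET.fromstring(body_xml)
        text = root.findtext("text", default="")
        # Case-insensitive substring match, restricted to <tweet> documents.
        if root.tag == "tweet" and term.lower() in text.lower():
            matches.append(message_id)
    return matches


print(tweets_mentioning(conn, "WidgetPro"))
```

Note that every stored document is parsed on each search, which is a small-scale picture of the slower retrieval mentioned above.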

Adopt a different approach. There is increasing interest in adopting alternative data management and storage mechanisms. Imagine you store patient X-rays as images. We store data so we can retrieve it later and also so we can query it, but running a query against an X-ray image is a somewhat bizarre concept because the X-ray is simply a collection of pixels. What often happens in practice is that this and other semi-structured data comes with some attached metadata and can also undergo some form of analysis in order to generate further metadata. (In a nutshell, metadata is data about data.) In the case of an email, the attached metadata might include length, sender, recipient, time/date and so on. Automatic semantic analysis of the email could be performed and that might yield metadata about the tone of the email (as in, angry, conciliatory, praising, etc.), its grammatical construction (correct, lax, etc.) and so on.
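As a rough illustration of the email case, the sketch below pulls the attached metadata out of a message with Python's standard email module and then derives a "tone" value. The tone step is only a toy word count standing in for genuine semantic analysis, and the sample message, the word lists and the function name extract_metadata are assumptions invented for this example.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Sample message invented for the illustration.
RAW_EMAIL = """\
From: alice@example.com
To: support@example.com
Date: Mon, 03 Mar 2014 09:30:00 +0000
Subject: Order update

Thank you, the replacement arrived and works great.
"""

# Toy word lists: a stand-in for real semantic analysis (an assumption).
POSITIVE_WORDS = {"thank", "great", "excellent", "love"}
NEGATIVE_WORDS = {"broken", "angry", "terrible", "refund"}


def extract_metadata(raw):
    """Return attached and derived metadata for one plain-text email."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a str for a simple, non-multipart message
    words = {w.strip(".,!?").lower() for w in body.split()}
    positive = len(words & POSITIVE_WORDS)
    negative = len(words & NEGATIVE_WORDS)
    if positive > negative:
        tone = "positive"
    elif negative > positive:
        tone = "negative"
    else:
        tone = "neutral"
    return {
        # Attached metadata, read straight from the headers.
        "sender": msg["From"],
        "recipient": msg["To"],
        "sent_at": parsedate_to_datetime(msg["Date"]).isoformat(),
        "length": len(body),
        # Derived metadata, produced by (toy) analysis of the content.
        "tone": tone,
    }


metadata = extract_metadata(RAW_EMAIL)
print(metadata)
```

The point is the shape of the result: a free-text message goes in, and a small, regular record comes out.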

Metadata is typically highly structured and is therefore highly susceptible to analysis. So you could then store the emails and the metadata in a relational database and query the metadata to find not just those emails that mention your product but, more specifically, those that are well written and also positive about the product.
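A query of that kind might look like the sketch below, again using sqlite3. The two-table layout, the tone and grammar values and the product name "WidgetPro" are assumptions for illustration; the metadata rows are typed in by hand here, whereas in practice they would come from automatic analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (
    email_id INTEGER PRIMARY KEY,
    body     TEXT NOT NULL
);
CREATE TABLE email_metadata (
    email_id INTEGER PRIMARY KEY REFERENCES email(email_id),
    sender   TEXT NOT NULL,
    tone     TEXT NOT NULL,   -- e.g. 'positive', 'angry'
    grammar  TEXT NOT NULL    -- e.g. 'correct', 'lax'
);
""")
conn.executemany(
    "INSERT INTO email (email_id, body) VALUES (?, ?)",
    [
        (1, "The WidgetPro is superb and arrived on time."),
        (2, "widgetpro gr8 luv it"),
        (3, "The WidgetPro stopped working after a day."),
        (4, "Thank you for the prompt invoice."),
    ],
)
conn.executemany(
    "INSERT INTO email_metadata (email_id, sender, tone, grammar) VALUES (?, ?, ?, ?)",
    [
        (1, "anna@example.com", "positive", "correct"),
        (2, "ben@example.com", "positive", "lax"),
        (3, "cy@example.com", "angry", "correct"),
        (4, "dee@example.com", "positive", "correct"),
    ],
)


def well_written_positive(connection, product):
    """Ids of emails that mention product, read as positive and are well written."""
    rows = connection.execute(
        "SELECT e.email_id "
        "FROM email AS e "
        "JOIN email_metadata AS m ON m.email_id = e.email_id "
        "WHERE e.body LIKE '%' || ? || '%' "
        "  AND m.tone = 'positive' "
        "  AND m.grammar = 'correct' "
        "ORDER BY e.email_id",
        (product,),
    )
    return [row[0] for row in rows]


print(well_written_positive(conn, "WidgetPro"))
```

The text search narrows the set to emails that mention the product; the structured metadata columns do the rest of the filtering.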

And now think again about X-rays, which are classic semi-structured data. While you wouldn’t query against a raw X-ray image, you can query against its metadata. The attached metadata might include patient ID, doctor ID, extensive information about how and when the X-ray was taken and so on. Automatic analysis of the image might yield metadata like diagnosis, prognosis and so on. In this solution the semi-structured data might be stored simply as image files in the file system and the structured metadata would be stored in a relational database and linked to the image. A query could then pull out all the X-rays for doctor ID 1234 that involved broken limbs and display the images.
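The X-ray arrangement can be sketched in the same way: the database holds only the structured metadata plus a path that links each row to an image file. The column names, diagnosis labels and file paths below are hypothetical, and the sketch returns the paths rather than opening the images.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE xray (
    xray_id    INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL,
    doctor_id  INTEGER NOT NULL,
    taken_at   TEXT NOT NULL,     -- ISO 8601 date
    diagnosis  TEXT NOT NULL,     -- assumed output of image analysis
    image_path TEXT NOT NULL      -- link to the image in the file system
)
""")
conn.executemany(
    "INSERT INTO xray VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 501, 1234, "2014-01-10", "broken limb", "/data/xrays/0001.png"),
        (2, 502, 1234, "2014-01-11", "pneumonia", "/data/xrays/0002.png"),
        (3, 503, 1234, "2014-02-02", "broken limb", "/data/xrays/0003.png"),
        (4, 504, 5678, "2014-02-03", "broken limb", "/data/xrays/0004.png"),
    ],
)


def xray_paths_for(connection, doctor_id, diagnosis):
    """Return image paths for one doctor's X-rays with a given diagnosis."""
    rows = connection.execute(
        "SELECT image_path FROM xray "
        "WHERE doctor_id = ? AND diagnosis = ? "
        "ORDER BY xray_id",
        (doctor_id, diagnosis),
    )
    return [row[0] for row in rows]


# The application would then load and display each file at these paths.
print(xray_paths_for(conn, 1234, "broken limb"))
```

Keeping the pixels in the file system and only the metadata in the database means the query never has to touch the images themselves.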

The bottom line is that semi-structured data is here to stay, and it offers the potential for business advantage to any company that handles and analyses it well.

Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling and business intelligence (BI). Based in the UK, Whitehorn works as a consultant for a number of national and international companies and is a mentor with Solid Quality Mentors. In addition, he is a well-recognized commentator on the computer world, publishing articles, white papers and books. Whitehorn is also a senior lecturer in the School of Computing at the University of Dundee, where he teaches the master's course in BI. His academic interests include the application of BI to scientific research.
