'Big data' technologies emerge to battle large, complex data sets
Technologies that allow the storage, management and analysis of ‘big data’ are coming on stream: NoSQL, Hadoop and MapReduce among them. Mark Whitehorn explains the reasons for their emergence and how they work.
‘Big data’ is a term that is appearing with increasing frequency. It is often used to mean large volumes of structured data (such as your customer list) but also to describe semi-structured data (images, documents, emails and so on). Here we take a look at the big data technologies that are emerging to allow us to store, manage and analyse both types of big data in data warehouses.
Analysis of structured data
Databases usually store their data on disks, and retrieving data from disks is typically the slowest part of any database query. One way to speed up the analysis of structured data is to optimise disk access, which brings us neatly to columnar databases. Relational databases store data in tables comprising rows and columns. In a customer database, for example, each row would contain all of the data about an individual customer, while each column would contain the data about one facet (e.g., date of birth) of all the customers.
Customer

| CustomerID | FirstName | LastName | DateOfBirth |
|------------|-----------|----------|-------------|
| 1          | Fred      | Jones    | 12/03/1956  |
| 2          | Sally     | Smith    | 25/04/1972  |
| 3          | Jane      | Yo       | 23/05/1987  |
Normal relational databases write the table to disk row by row. So all of the data about one customer is contiguous and, once the disk head has moved to the correct part of the disk, reading the entire row is very rapid. The problem is that queries run against data warehouses very rarely return a single customer; instead, they are much more likely, for example, to return the average age of all customers.
You can probably see where this is going. A columnar database simply stores the table column by column rather than row by row. Now all of the dates of birth are stored sequentially and are much faster to read. Columnar databases are bad at doing transactional work (where we would often access individual customers) but very good for analytical access across big, well-structured data sets.
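To make the difference concrete, here is a toy sketch in Python. The layouts and the average-age query are purely illustrative; a real columnar engine works on disk pages and compressed column segments, not Python lists.

```python
from datetime import date

# Row-oriented layout: each customer record is stored (and read) as a whole.
rows = [
    (1, "Fred", "Jones", date(1956, 3, 12)),
    (2, "Sally", "Smith", date(1972, 4, 25)),
    (3, "Jane", "Yo", date(1987, 5, 23)),
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "CustomerID":  [1, 2, 3],
    "FirstName":   ["Fred", "Sally", "Jane"],
    "LastName":    ["Jones", "Smith", "Yo"],
    "DateOfBirth": [date(1956, 3, 12), date(1972, 4, 25), date(1987, 5, 23)],
}

def age(born, today=date(2012, 1, 1)):
    """Whole years between born and today."""
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

# Row store: the query has to touch every field of every row...
avg_row = sum(age(r[3]) for r in rows) / len(rows)

# ...whereas the column store reads only the DateOfBirth column.
dob = columns["DateOfBirth"]
avg_col = sum(age(d) for d in dob) / len(dob)

print(avg_row, avg_col)  # identical answers; the column store reads far less data
```

The results are identical, but the column store answers the average-age question after reading one column rather than every byte of every row, which is exactly the access pattern of a data warehouse query.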
What about analysis of large quantities of semi-structured data?
Semi-structured data -- X-ray images, for example -- can be stored in a relational database. That’s the good news; the bad news is that it doesn’t work very well. Other storage options are emerging in the form of big data technologies: NoSQL databases such as Cassandra, and the distributed computing tools Hadoop and MapReduce.
NoSQL (or “not only SQL” to the enlightened, as opposed to the narrow-minded “no SQL at any price”) describes a class of database management systems. NoSQL databases are likely to be nonrelational, distributed, open source and horizontally scalable. Like columnar databases, they address a limitation of transactional relational databases: poor performance during certain intensive operations, such as serving Web pages or data streaming. One example is Cassandra, an open source distributed database management system with no single point of failure. It can handle very large data volumes across many commodity servers.
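As a flavour of what that looks like in practice, here is a minimal sketch using the open source DataStax Python driver (cassandra-driver). The node addresses, keyspace and table are illustrative assumptions, not taken from any particular deployment.

```python
from cassandra.cluster import Cluster

# Hypothetical addresses of two commodity nodes in the cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect()

# Replication is declared per keyspace; with a factor of 3, every row
# lives on three machines, so no single node is a point of failure.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS warehouse
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS warehouse.customer (
        customer_id int PRIMARY KEY,
        first_name  text,
        last_name   text
    )
""")
session.execute(
    "INSERT INTO warehouse.customer (customer_id, first_name, last_name) "
    "VALUES (%s, %s, %s)",
    (1, "Fred", "Jones"),
)
cluster.shutdown()
```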
Hadoop is an open source shared storage and analysis system: the storage element is supplied by the Hadoop Distributed File System (HDFS) and the analysis by MapReduce (see below). HDFS can scale up from one server to thousands and replicates data across clusters of commodity computers. The obvious advantage is that if one disk fails, no data is lost; another is the cost benefit of using commodity kit to provide this redundant storage.
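A hedged sketch of what storing a file in HDFS looks like, assuming a working Hadoop installation with the standard `hdfs` shell command on the PATH; the file and directory names are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an HDFS filesystem shell command and return its output."""
    return subprocess.run(
        ["hdfs", "dfs", *args], check=True, capture_output=True, text=True
    ).stdout

# Copy a (hypothetical) X-ray file into the distributed file system.
hdfs("-mkdir", "-p", "/warehouse/xrays")
hdfs("-put", "-f", "patient_4711.dcm", "/warehouse/xrays/")

# HDFS replicates each block of the file (three copies by default),
# so the loss of one disk or node does not lose the data.
print(hdfs("-ls", "/warehouse/xrays"))
```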
MapReduce is a software framework and programming model that provides a means of combining and coordinating data from multiple sources. It is also designed for huge data volumes: a MapReduce job can process many terabytes of data on thousands of machines.
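The canonical illustration of the model is a word count. The toy Python sketch below runs in a single process purely to show the shape of a MapReduce job; on Hadoop, the map and reduce functions would run in parallel across the cluster and the framework would handle the grouping (shuffle/sort) step for you.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data big ideas", "big clusters of data"]

def map_phase(doc):
    # Emit (key, value) pairs: one per word.
    for word in doc.split():
        yield (word, 1)

# Map: applied independently to every input split.
pairs = [kv for doc in documents for kv in map_phase(doc)]

# Shuffle/sort: group the pairs by key (in Hadoop, the framework does this).
pairs.sort(key=itemgetter(0))

# Reduce: combine all the values that share a key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=itemgetter(0))}

print(counts)  # {'big': 3, 'clusters': 1, 'data': 2, 'ideas': 1, 'of': 1}
```

Because each map call and each reduce call is independent, the framework can scatter them across thousands of machines, which is where the terabyte-scale throughput comes from.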
If semi-structured data is tricky to handle, why do we store it at all? It’s because we may want to inspect it again in future. For instance, a doctor may wish to re-inspect an X-ray for comparison with subsequent X-rays. Also, we may want to run automated scanning processes against it: Sticking with the medical example, techniques are constantly under development for automatic detection of anomalies that could be tumours, fractures or whatever. The results from a doctor’s reappraisal or a scan are derivable data and can be stored as structured data in a relational database. We are seeing the development of techniques that tie semi-structured data to structured data so we can run a query that asks for all the X-rays for males over 35 living in Northumberland that show evidence of a particular type of fracture.
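Here is a minimal sketch of that idea, using SQLite for brevity. The table, columns and file paths are illustrative assumptions: the structured, derivable data (the finding) is stored alongside a reference to the semi-structured X-ray itself, which is what makes the query above possible.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE xray (
        xray_id     INTEGER PRIMARY KEY,
        patient_sex TEXT,
        patient_age INTEGER,
        county      TEXT,
        finding     TEXT,  -- derived data: the result of a reappraisal or automated scan
        image_path  TEXT   -- reference to the semi-structured data itself
    )
""")
db.execute(
    "INSERT INTO xray VALUES (1, 'M', 42, 'Northumberland', "
    "'spiral fracture', '/warehouse/xrays/patient_4711.dcm')"
)

# "All the X-rays for males over 35 living in Northumberland that show
# evidence of a particular type of fracture."
for (path,) in db.execute("""
    SELECT image_path FROM xray
    WHERE patient_sex = 'M' AND patient_age > 35
      AND county = 'Northumberland' AND finding = 'spiral fracture'
"""):
    print(path)
```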
So much valuable information is tied up in semi-structured data that it’s becoming much more important to understand the advantages recent technical developments can bring to storing, managing and analysing such data.