What is big data and how can it be used to gain competitive advantage?

Cliff Saran discovers how big data can deliver big business benefits without excessive IT investment

According to the "IBM Business Analytics and Optimization for the Intelligent Enterprise" study, one in three business leaders frequently makes decisions without the information they need, and half do not have access to the information they need to do their jobs. That has significant competitive implications.

Data analytics represents a massive opportunity for IT. Sanjay Mirchandani, chief information officer at EMC, believes there is a perfect storm, driven by affordable IT and the ability to gather information. "The onus on IT is to leverage data," he says.

EMC is beginning to build its internal IT expertise in "big data", with Mirchandani's chief architect now working alongside marketing to examine data flow. Establishing a centre of excellence makes sense, because big data requires a very different approach to data analytics from traditional business intelligence (BI).

The information explosion unfolding as organisations try to make sense of emerging requirements, such as social media campaigns on Facebook and the imminent roll-out of smart metering, will create vast quantities of data. IT needs to rethink what it means to provide data analytics to support such applications, because traditional BI techniques simply fail when applied to massive data sets.

The business case

The McKinsey Global Institute recently published a report on the opportunities for business and government in using big data. According to McKinsey, the use of big data is becoming a key way for leading companies to outperform their peers. "We estimate that a retailer embracing big data has the potential to increase its operating margin by more than 60%. We have seen leading retailers such as Tesco use big data to capture market share from local competitors, and many other examples abound in industries such as financial services and insurance," the report says.

A study from IBM shows that companies that excel at finance efficiency and have more mature business analytics and optimisation can experience 20 times more profit growth and 30% higher return on invested capital.

Technical challenges

The biggest problem, according to Quocirca analyst Clive Longbottom, is that the type of data businesses have to manage is changing. "More and more binary large objects (BLOBs) are appearing in the database, and these require different approaches [to analysing rows and columns] to be able to identify and report on what the content actually is and in identifying patterns and making sense out of what this means to the user," says Longbottom.

In a recent Gartner report, Mark Beyer, a research director at the analyst firm, warns that big data will cause traditional practices to fail, no matter how aggressively information managers address dimensions beyond volume.

Beyer is responsible for assessing where to place big data on Gartner's hype cycle. While early use of the term suggests big data is all about data volumes, the Gartner paper identifies 12 dimensions of big data, split into quantification, access control and qualification. He says the analysis of data becomes complex: "Data does not have to be 100% accurate. If I cannot look at all the data, I need to do sampling."

Gartner uses the term "linked data" to describe a data quality process used when sampling large data volumes. The report defines linked data as data from various sources that have relationships with each other and retain that context, so that the data remains useful to humans and computers.

Beyer uses a radio telescope (see SKA - the world's most powerful telescope) as an example. "In [the SKA] radio telescope, which takes a snapshot of two-thirds of the galaxy each day, you can link the images together and assign a weighting to effects like gravitational pull," he says. This weighting can change over time, which affects the link between the images. An astronomer can use the linking to assess whether an image is showing a stellar object or just an artefact.
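As a toy illustration of the linked data idea, the Python sketch below links successive snapshots of the same patch of sky with weighted relationships and scores how persistently a candidate source appears across them. The snapshot structure, link weights and threshold are all invented for illustration and are not part of the SKA's actual processing.

```python
# Toy sketch of "linked data": snapshots of the same patch of sky are linked,
# each link carries a weighting (standing in for effects such as gravitational
# pull, and free to change over time), and a candidate source that persists
# across strongly weighted links is more likely to be a real stellar object
# than an artefact. All names, weights and the threshold are invented.
snapshots = {
    "t0": {"src_42"},
    "t1": {"src_42"},
    "t2": set(),            # the candidate is missing from this snapshot
}

# Weighted links between consecutive snapshots.
links = [("t0", "t1", 0.9), ("t1", "t2", 0.7)]

def persistence_score(source, snapshots, links):
    """Sum the weights of links whose two snapshots both contain the source."""
    return sum(weight for a, b, weight in links
               if source in snapshots[a] and source in snapshots[b])

score = persistence_score("src_42", snapshots, links)
print("likely a stellar object" if score > 0.5 else "probably an artefact")
```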

MapReduce is another aspect of big data (see Watson - the IBM supercomputer). If a large country built a smart electricity grid with billions of meters detecting power usage, the meters would send huge amounts of data to those monitoring the health of the grid, Gartner notes. But most readings would amount to "noise", because nearly all would simply report normal operation.

Gartner says a technology such as MapReduce can perform a statistical affinity or clustering analysis by moving the initial processing to the meters to gather similar readings together. An analyst can then specify that the system should filter out readings within normal parameters and display only abnormal readings. This type of processing cuts down on the actual data volume being transmitted.
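As a rough, self-contained illustration of the pattern Gartner describes, the Python sketch below applies a map step at the "meter" side to emit only abnormal readings and a reduce step to group them for an analyst. The field names, the assumed normal voltage band and the sample readings are invented for illustration; a real deployment would run this on a distributed framework rather than in a single process.

```python
# Minimal MapReduce-style sketch of the smart-meter example above.
# The field names and the "normal" voltage band are illustrative assumptions,
# not taken from any real smart-grid system.
from collections import defaultdict

NORMAL_RANGE = (220.0, 240.0)  # assumed acceptable voltage band

def map_reading(reading):
    """Map step: emit (meter_id, voltage) only for abnormal readings."""
    voltage = reading["voltage"]
    if not NORMAL_RANGE[0] <= voltage <= NORMAL_RANGE[1]:
        yield reading["meter_id"], voltage

def reduce_readings(mapped_pairs):
    """Reduce step: group the abnormal readings by meter for the analyst."""
    grouped = defaultdict(list)
    for meter_id, voltage in mapped_pairs:
        grouped[meter_id].append(voltage)
    return dict(grouped)

readings = [
    {"meter_id": "m1", "voltage": 231.0},  # normal - filtered out as "noise"
    {"meter_id": "m2", "voltage": 198.5},  # abnormal - kept
    {"meter_id": "m2", "voltage": 201.2},  # abnormal - kept
]

mapped = (pair for r in readings for pair in map_reading(r))
print(reduce_readings(mapped))  # {'m2': [198.5, 201.2]}
```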

Is big data expensive?

Big firms have the big budgets needed to run big data projects. The technology behind IBM's Watson project involved a supercomputer with nearly 15 terabytes of RAM; the technology the SKA needs does not even exist yet.

But big data can be achieved with more affordable IT. Most large enterprises have been running enterprise applications and customer relationship management (CRM) systems long enough to have aggregated a critical mass of data in those applications, says Rebecca Wettemann, vice-president at Nucleus Research.

While huge scientific projects, gameshow-winning computers and medical research may seem the natural home of big data analytics, there is no reason why smaller organisations, or those not at the cutting edge of IT, cannot benefit. Wettemann believes cloud computing creates a huge big data opportunity for smaller companies.

"One of the really interesting things is the ability for smaller businesses to access applications that are available in the cloud, particularly in marketing and e-commerce," she says.

Experts agree that big data offers a big opportunity for businesses. However, the technology is still relatively immature, and the idea of a complete big data solution remains a pipe dream.

There are products that can handle aspects of big data, such as the analysis of large datacentres, but this is not the complete story - businesses should begin to pilot big data projects to evaluate how to benefit from the additional insight big data can potentially offer.

SKA - the world's most powerful telescope

Once it is built, the Square Kilometre Array (SKA) will be the world's most powerful radio telescope. The $1.5bn array could revolutionise science by answering some of the most fundamental questions that remain about the origin, nature and evolution of the universe. It will involve exabyte-scale computing to process the huge data sets it will capture.

Construction is due to start in 2016. Two countries are bidding to host the project: Australia and South Africa.

South Africa has partnered with eight other African countries in its bid to bring the SKA to the continent. Richard Armstrong, a researcher based at Oxford University, is working on the South African bid. Coming from a computer science background, Armstrong is researching the use of massively multi-core integer processors to process the vast amount of data the telescope will produce. The SKA is expected to produce 100 times the amount of data on the internet - volumes of information that cannot be stored.

"SKA is an instrument that will continuously produce data. There is no way to store this amount of information so we need to process data in real time," says Armstrong.

He expects the SKA will record snapshot data (like a photographic image), but it will be near impossible to record "time stream data". One of the taskforces for the SKA is looking at efficient networking, which will be essential for streaming vast amounts of information. Significantly, unlike traditional computing, which uses error correction to ensure data is transmitted accurately, the SKA does not need a "clean signal".

"In astronomy, the signal we're looking for is generally buried in noise anyway," he says. To lift the signal above the noise, researchers can perform a form of trend analysis across repeated snapshots: a genuine signal (such as a distant galaxy or star) remains essentially static and is reinforced by each snapshot, while random noise cancels itself out.

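The stacking effect Armstrong describes can be sketched in a few lines of NumPy: a faint, static point source is reinforced by averaging many noisy snapshots, while the random noise shrinks roughly as one over the square root of the number of snapshots. The image size, signal strength and noise level below are arbitrary values chosen for illustration.

```python
# Sketch of snapshot stacking: the static signal survives averaging,
# the random noise largely cancels. All values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

true_sky = np.zeros((64, 64))
true_sky[32, 32] = 1.0                     # one faint, static point source

# Each individual snapshot is dominated by noise (std-dev 2.0 vs signal 1.0).
snapshots = [true_sky + rng.normal(scale=2.0, size=true_sky.shape)
             for _ in range(500)]

stacked = np.mean(snapshots, axis=0)       # noise shrinks roughly as 1/sqrt(N)

print("single-snapshot pixel value:", round(float(snapshots[0][32, 32]), 2))
print("stacked pixel value:        ", round(float(stacked[32, 32]), 2))  # close to 1.0
```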

Watson - the IBM supercomputer

"Who is Bram Stoker?" was the question that, in February 2011, gave Watson, the IBM supercomputer, victory in the US gameshow, Jeopardy.

Eddie Epstein, a researcher at IBM, is part of the team that developed the IT used in the contest that pitted man against machine. "Jeopardy represents a broad domain question and answer system, where the question is expressed in natural language and the system then analyses raw data to come up with a top answer," he says.

Watson was designed by a small team of IBM Research scientists, with groups focused on algorithms, systems, game strategy, speech, and data annotation. As one of the core leaders of the Watson project, Epstein led the systems team responsible for scaling out Watson's computation over thousands of Power7 cores to achieve the speed needed to be competitive.

In terms of hardware, the system uses a cluster of 90 IBM Power 750 servers, each with 32 processor cores, giving a total of 2,880 cores. Because of the speed needed to access and process the information in real time for the Jeopardy challenge, each server was configured with 128 to 256 Gbytes of RAM, and the servers were connected using a 10Gbps network.

"Watson is rated at 80 teraflops of processing power. For Jeopardy it used about a third of that. It took about three seconds to answer a quiz show question," says Epstein.

Watson's knowledge is self-contained. According to Epstein, the system performs a deep analysis of the question - an approach known as DeepQA - which he says produces "ideas and the types of answers to look for".

Watson is trained to improve its understanding of what constitutes a correct answer. The raw data comprised 50Gbytes drawn from an assortment of sources, including encyclopaedias, dictionaries, the Bible and more.

One of the steps in getting the right answer involved using Hadoop, the open source scalable, distributed data processing engine, to perform MapReduce functions, which are designed to process large data sets.

A query typically generates 100 candidate answers, which are then ranked in terms of confidence. "It is not an expert system," Epstein notes, "but instead it combines multiple analysis techniques using different algorithms to evaluate different aspects of an answer in real time."
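The general pattern Epstein describes - several independent analysis routines each score a candidate answer, the scores are combined into a single confidence figure, and the candidates are ranked - might be sketched as follows. The scorers, weights and fields here are hypothetical and are not IBM's actual DeepQA pipeline.

```python
# Hypothetical sketch of confidence-based ranking: independent scorers judge
# different aspects of each candidate answer and a weighted combination
# produces the final confidence. None of this is IBM's real DeepQA code.
def type_match(candidate, question):
    """Does the candidate's type match the type of answer the question expects?"""
    return 1.0 if candidate["answer_type"] == question["expected_type"] else 0.0

def source_support(candidate, question):
    """How much supporting evidence was found in the source material?"""
    return min(candidate["supporting_passages"] / 10.0, 1.0)

SCORERS = [(type_match, 0.6), (source_support, 0.4)]  # assumed weights

def confidence(candidate, question):
    return sum(weight * scorer(candidate, question) for scorer, weight in SCORERS)

question = {"expected_type": "person"}
candidates = [
    {"text": "Bram Stoker", "answer_type": "person", "supporting_passages": 8},
    {"text": "Dracula",     "answer_type": "novel",  "supporting_passages": 9},
]

ranked = sorted(candidates, key=lambda c: confidence(c, question), reverse=True)
print([c["text"] for c in ranked])  # ['Bram Stoker', 'Dracula']
```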

Epstein's recent projects have focused on the field of unstructured text analytics, and he played a key role in developing the unstructured information management architecture (UIMA). Used by IBM's DeepQA technology, UIMA is designed to support interoperability and scale-out of text and multimodal analysis applications.

He believes Watson could be deployed in the medical domain. "There is such a large amount of information needed to do drug analysis and treatment that it is hard for any individual to keep up. Watson could be used to simplify the list of possible drugs that a doctor could potentially prescribe for a given illness," he says.
