
Data management: What does the future hold?

We explore how firms are replacing transactional database management systems with new database architectures


Transactional database management systems (DBMS) such as IBM DB2 or Microsoft SQL Server have a long and glorious history reaching back to the rule of the mainframe computing paradigm with one server and many users (be they humans or other computing processes).

Transactional systems handle the orderly and secure updating of a relational database when several concurrent transactions try to access the same data item, or when an outage interrupts an update in progress.

Queries, updates and data definitions in the DBMS typically use structured query language (SQL) between the user and the DBMS.

The advantages of the classic transactional DBMS are: data independence; efficient data access; data integrity and security; data administration; concurrent access; and crash recovery.
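
To make the transactional guarantees concrete, here is a minimal Python sketch using the standard library's sqlite3 module (the accounts table and the transfer scenario are invented for illustration). If any statement in the transaction fails, the whole update is rolled back, so the data never ends up in a half-updated state:

```python
import sqlite3

# Hypothetical accounts table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: either both updates apply, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
    except sqlite3.IntegrityError:
        print("Transfer rejected - balance would go negative; transaction rolled back")

transfer(conn, "alice", "bob", 150)  # fails the CHECK constraint, leaves data unchanged
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```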

Obvious downsides

However, with the advent of cloud computing, hybrid datacentres and virtualisation, the downsides of transactional DBMS are becoming very obvious:

• Expensive and complicated to set up and maintain. The relationships between data points are essentially constructed at query time, which can be expensive in resource terms.

• Longer latency when receiving data from a number of storage locations.

• Software is general-purpose, not suited for special-purpose tasks (for example, text stream processing).

• Not good for applications that need real-time processing.

NoSQL

To manage big data and data volumes that vary significantly, non-relational databases using NoSQL (Not Only SQL) have been developed by major consumer internet players that stress scalability, such as Google, Amazon, Microsoft and Yahoo. NoSQL databases such as MongoDB, Riak, Couchbase and Cassandra rely on specialised frameworks to store data, which is accessed through dedicated query APIs. They relax the requirement for consistency to achieve better availability, and use partitioning for better scalability.
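
As a rough illustration of the document-oriented NoSQL style, the sketch below assumes a local MongoDB instance and the pymongo driver; the database, collection and field names are hypothetical. Records are stored as schemaless documents and retrieved through the driver's query API rather than SQL:

```python
from pymongo import MongoClient

# Assumes a MongoDB server listening on localhost:27017.
client = MongoClient("mongodb://localhost:27017/")
orders = client["shop_demo"]["orders"]  # hypothetical database and collection

# Documents need no predefined schema; each can carry different fields.
orders.insert_one({"customer": "acme", "items": ["widget", "gadget"], "total": 42.50})
orders.insert_one({"customer": "globex", "total": 19.99, "priority": True})

# Query API instead of SQL: find all orders with a total over 20.
for doc in orders.find({"total": {"$gt": 20}}):
    print(doc["customer"], doc["total"])
```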

Graph databases

Graph databases such as SAP Hana, Neo4j and ArangoDB take the non-relational approach a step further, creating a “graph” of relationships. When a query is run against the data, the results can be pulled out far more efficiently than is possible in a relational or basic non-relational database.
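
Below is a brief sketch of how a graph query traverses relationships that are stored explicitly, rather than reconstructed with joins at query time. It assumes a local Neo4j instance and the official neo4j Python driver; the node labels, relationship type and credentials are made up for the example:

```python
from neo4j import GraphDatabase

# Assumes a Neo4j server on the default bolt port with these (hypothetical) credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships are stored explicitly, so no join is computed at query time.
    session.run("CREATE (a:Person {name:'Ann'})-[:WORKS_WITH]->(b:Person {name:'Bo'})")

    # Traverse the graph: find everyone Ann works with, directly or transitively.
    result = session.run(
        "MATCH (p:Person {name:'Ann'})-[:WORKS_WITH*1..3]->(colleague) "
        "RETURN colleague.name AS name"
    )
    for record in result:
        print(record["name"])

driver.close()
```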

DSMS for high volumes

Users often want to monitor very large data streams, but not actually get notified until specific conditions occur. This requires incoming data to be treated as a continuous infinite stream of data integrating live and historical sources, rather than as data in static tables or files.

Typical use cases include network management systems, telcos’ call detail records, transaction records in financial systems and healthcare monitoring systems.


A data stream management system (DSMS) is a set of programs comprising input processors, a continuous query (CQ) engine and a low-latency cache/buffer connected to back-end data storage facilities. The DSMS manages continuous data streams, computes functions over data streams and provides other functionalities as in a DBMS.

DSMS engines such as IBM InfoSphere Streams, SAP Event Stream Processor, SAS Event Stream Processing and the open source PipelineDB are useful for applications that need to process data streams with real-time or near-real-time response and quality of service (QoS).

So instead of the application querying stored data, as with SQL, data is presented to continuous queries (CQs). The CQ produces its own output stream for back-end storage, which an SQL (or NoSQL, for unstructured data) process can then query. The benefits of a CQ engine are:

• Accessibility: Live data can be used while still in motion, before being stored.

• Completeness: Historical data can be streamed and integrated with live data for more context.

• High throughput: High-velocity, high-volume data can be processed with minimal latency.
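
The continuous-query idea can be illustrated with a toy Python generator (not a real DSMS engine; the sensor feed, window size and threshold are invented). The query stands still while the data flows past it, and only low-volume derived results are passed on for back-end storage:

```python
from collections import deque

def continuous_query(stream, window_size=5, threshold=30.0):
    """Toy continuous query: sliding-window average over an unbounded stream,
    emitting an output record only when the average crosses a threshold."""
    window = deque(maxlen=window_size)
    for reading in stream:          # the stream never needs to be stored in full
        window.append(reading)
        avg = sum(window) / len(window)
        if len(window) == window_size and avg > threshold:
            yield {"window_avg": round(avg, 2), "alert": True}

# Hypothetical live sensor feed; in practice this would be a socket or message queue.
sensor_feed = [21.0, 22.5, 29.0, 31.5, 33.0, 34.5, 36.0, 28.0]

for output in continuous_query(sensor_feed):
    print(output)   # only derived, low-volume results reach back-end storage
```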

Complex event processing (CEP)

With the arrival of hybrid cloud storage, big data and unstructured data in data lakes, the need to correlate and analyse information from multiple (and sometimes ad hoc) data sources has increased dramatically. Similarly, much smaller autonomous units, such as the ones being developed for driverless cars, need to correlate and act on streams of car performance data, real-time sensor data alongside information from the surroundings, and data downloaded from external sources, such as traffic and weather information.

CEP combines data from multiple sources to infer events or patterns that, taken together, indicate specific actions. Most of the major database suppliers have products in this category, such as Tibco Streambase, Oracle Event Processing, SAP ESP, Microsoft StreamInsight, GigaSpaces XAP and Red Hat’s BRMS.

The goal of CEP is to identify opportunities or threats, and respond to them as quickly as possible. These events may happen internally across an organisation, such as sales leads, stock levels, security systems, orders or customer service calls, or they may originate from unstructured news items, text messages, social media posts, stock market feeds, traffic and weather reports. The CEP event may also be defined as a “change of state,” when a measurement exceeds a predefined threshold of time, temperature or other value.
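
The following toy Python sketch illustrates the CEP principle rather than any supplier’s engine; the event sources, event types and 60-second window are hypothetical. Two independent streams are correlated, and an action is inferred only when the combined pattern occurs within the time window:

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str       # e.g. "badge_reader" or "server_room_sensor"
    kind: str         # e.g. "door_open" or "temp_spike"
    timestamp: float  # seconds since some epoch

def detect_incident(events, window=60.0):
    """Infer a composite event: a temperature spike within `window` seconds
    of a door-open event is treated as a possible cooling failure."""
    events = sorted(events, key=lambda e: e.timestamp)
    door_openings = []
    for ev in events:
        if ev.kind == "door_open":
            door_openings.append(ev.timestamp)
        elif ev.kind == "temp_spike":
            if any(0 <= ev.timestamp - t <= window for t in door_openings):
                yield f"Cooling incident inferred at t={ev.timestamp}"

feed = [
    Event("badge_reader", "door_open", 100.0),
    Event("server_room_sensor", "temp_spike", 130.0),   # within 60s of the door event
    Event("server_room_sensor", "temp_spike", 400.0),   # no correlated door event
]

for alert in detect_incident(feed):
    print(alert)
```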

Developing predictive capabilities

CEP systems are limited to historical data and real-time information flows, which does not allow them to learn from past mistakes and reprogram themselves to improve their performance. Some processes, such as risk assessment and portfolio management, require better predictive capabilities. Software with predictive capabilities is very much an ongoing, continuous-development effort and relies on algorithms developed by data scientists. The user community must constantly assess the validity and transparency of the artificial intelligence (AI) process to understand (and trust) these systems.

Artificial intelligence

Artificial intelligence encompasses a wide range of technologies focused on improving operational efficiency and providing new insights from existing data. AI tools automate a wide range of processes that may involve reasoning, knowledge, planning, learning and natural language processing.

A tool like h2o.ai can be used to build smarter machine learning/AI applications that are fast and scalable. Another AI tool, SigOpt, is used by developers to run experiments and make better products with less trial and error. On the marketing front, companies might employ Conversica, an AI tool that prioritises and qualifies customer leads through automated email conversations.

To integrate AI tools with existing data sources, most AI systems are being built by system integration specialists, from the usual suspects such as Accenture down to specialists including Greymaths in the UK and Colenet in Germany.

Most AI processes require vast amounts of data to train and fine-tune a system, and AI remains very much a development project. It is primarily driven by industry verticals such as finance, with its huge data volumes, millisecond response requirements, very high transaction volumes and high value, and pharmaceuticals, where drug development requires the modelling and trialling of vast amounts of data.

Machine learning

At the lowest AI level, process automation technology replaces manual handling of repetitive and high-volume tasks. The next level is reached when the AI acquires self-learning capabilities (machine learning) where programs “learn” to build models by using observations (example data) combined with experience data. The resulting models can be predictive or descriptive and continue to evolve by obtaining more knowledge about the data at hand.
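
The “learning a model from observations” step can be sketched in a few lines of Python with scikit-learn; the synthetic data and feature meanings below are invented purely for illustration. The model is fitted on example observations and then makes predictions about data it has not seen:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic observations: two hypothetical features per example, with known labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the "pattern" the model must discover

model = LogisticRegression().fit(X, y)    # learn a predictive model from examples

# Apply the learned model to new, unseen observations.
new_points = np.array([[1.5, 0.5], [-2.0, -1.0]])
print(model.predict(new_points))          # expected: [1 0]
```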

Currently, machine learning is mainly applied to complex problems that have previously been handled by humans, but with no replicable explanation of exactly how they solve them, and where solving the problems manually is very time-consuming and costly.

As opposed to rule-based AI, where the program is supplied with rules and encoded experience, machine learning is employed in computing tasks where explicit algorithms with good performance are not available and where the programming can only provide rules for reasoning. Example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), learning to rank, and computer vision.
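
Taking email filtering from that list as a concrete case, a toy scikit-learn sketch (with a tiny invented corpus) shows why such tasks suit machine learning: no explicit filtering rule is written, and the classifier derives its own decision criteria from labelled examples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training corpus: 1 = spam, 0 = legitimate.
emails = [
    "win a free prize now", "cheap loans click here",
    "meeting agenda attached", "quarterly report and minutes",
]
labels = [1, 1, 0, 0]

# No hand-written filtering rules: the model learns word weights from the examples.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

print(spam_filter.predict(["free prize inside", "agenda for the quarterly meeting"]))
# expected output: [1 0]
```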

Machine learning in commercial products comes from the likes of Microsoft, whose Dynamics CRM service enables users to identify patterns over time from issues that crop up, speeding time to resolution and improving performance.

Cisco recently announced its Encrypted Traffic Analytics (ETA) to find malware in encrypted traffic. Besides the initial data packet in the connection, ETA looks at the sequence of packet lengths and times, and the byte distribution across packet payloads within a flow. The detection process improves over time by expanding its machine learning models without hogging resources or slowing traffic. The first ETA offering uses NetFlow data from Cisco’s Catalyst 9000 switches and 4000 series Integrated Services Routers, integrated with Cisco StealthWatch security analytics.

Deep learning

Deep learning is a specific method of machine learning aimed at data-intensive, machine-learning processes. It relies on GPU acceleration, both for training and inference, and so requires tight integration of hardware and software components.

In the US, Nvidia with its DGX line and Volta GPU architecture delivers GPU acceleration to datacentres, desktops, laptops and the world’s fastest supercomputers. For cloud applications, Nvidia NGC deep learning software is available on services from Amazon, Google, IBM and Microsoft.

In Japan, Fujitsu announced a deep learning system for the research institute Riken, which in terms of operations will be one of the largest-scale supercomputers in Japan, to accelerate research and development in AI technology.

On the software side, the Google Brain Team has contributed significantly to machine learning and deep neural network research with its open source TensorFlow software. TensorFlow’s flexible architecture runs on one or more CPUs or GPUs in a desktop, server or mobile device with a single API. The system underpins a range of industry-specific deep learning offerings. In the legal field, Intraspexion uses TensorFlow as the core of an early warning system to investigate and prevent potential litigation.
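
A minimal TensorFlow sketch in Python illustrates the single-API idea; the toy regression task and layer sizes are chosen arbitrarily for illustration. The same code runs on CPU or on any visible GPU without modification:

```python
import numpy as np
import tensorflow as tf

# TensorFlow transparently uses any available GPU; otherwise it falls back to CPU.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# Toy synthetic task: learn y = 3x + 1 with a tiny dense network.
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1).astype("float32")
y = 3.0 * x + 1.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)

print(model.predict(np.array([[0.5]], dtype="float32")))
# should approach 2.5 (= 3*0.5 + 1) as training converges
```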

Cognitive analytics

Cognitive analytics relies on processing and storage combinations to mimic the human brain in making deductions from vast amounts of data. This brings us back to the world of the mainframe and the supercomputer. 

Recently, the US Air Force and IBM announced a plan to build the world’s first cognitive analytics supercomputer to enable analysis of multiple big data feeds (audio, video or text). It will run on an array of 64 TrueNorth Neurosynaptic processors. Each core is part of a distributed network and operates in parallel on an event-driven basis, not clock-based like conventional CPUs.

The processing power is equivalent to 64 million neurons and 16 billion synapses. Estimates of the average human brain indicate 100 billion neurons and 100 trillion synapses – so cognitive analytics is still in its infancy.

Commercially, the financial sector is at the forefront of cognitive analytics adoption, and financial services company Opimas predicts that in 2017, finance firms in the investment sector will spend $1.5bn on robotic process automation, machine learning, deep learning and cognitive analytics, with that sum rising to $2.8bn by 2021.

Caveats in data management

Given the breadth of data management issues, the field as a whole is maturing slowly. The driving forces include the different technologies, platforms and capabilities that shape how data management is practised. Continued use of SQL (and NoSQL for unstructured data) to access new data stores and data structures is essential to ensuring easy access for developers, affordable adoption, and integration with existing enterprise infrastructure. NoSQL’s scalability is crucial, as is the ability to enhance corporate data management strategies.

Likewise, open source developments such as Hadoop for big data and Spark for data processing will enable companies to expand their data management functionality while reducing costs, allowing them to focus on business issues rather than technology.

With huge repositories of unstructured big data, especially in data lakes, the need for consistent frameworks and data governance increases. Companies need to develop business rules and glossaries, and clearly identify governance responsibility roles, before making seismic changes to their data management.

Bernt Ostergaard is an industry analyst at Quocirca.
