
Riding the cloud database wave

Demand for cloud databases is set to surge, driven by new use cases and greater decentralised IT spending

Adoption of cloud databases – those delivered via a cloud consumption model – appears to be ramping up.

Referred to as dbPaaS (database platform as a service) by analyst Gartner, the market for cloud databases is dominated by public cloud providers. Amazon Web Services (AWS), Microsoft, Google, Oracle and Alibaba are among the leaders identified by the analyst firm in its Magic Quadrant for cloud database management systems, published in December 2023. Broadly speaking, these providers offer a range of data management capabilities. Other leaders identified by Gartner include MongoDB, which specialises in non-relational database technology, and Snowflake and Databricks, which focus on data warehouses and data lakes.

According to Gartner’s August 2023 Forecast analysis: Database management systems, worldwide, the market for database management systems (DBMS) is set to grow at a compound annual growth rate of 16.8% through 2027 to reach $203.6bn, accounting for 27% of the total infrastructure software market spend in 2027. The forecast shows that the percentage of spending on cloud dbPaaS will grow from 55% of the total DBMS market in 2022 to 73.5% by 2027.

According to Gartner, a transition in DBMS software purchasing – away from legacy, centralised IT groups and towards decentralised lines of business within an enterprise – is driving this increase in DBMS spending. Traditionally, with centralised IT services, different areas of a business shared a DBMS. Gartner notes that these units have now been given the freedom to choose a DBMS based on their own criteria and to build their own databases, rather than using shared systems.

However, Forrester vice-president and principal analyst Noel Yuhanna warns that some cloud databases are built on proprietary technology, which makes it challenging to migrate to other databases. There is also a lack of visibility into costs. “Without monitoring and management, excessive usage of the infrastructure can lead to unexpected costs,” he says. Yuhanna recommends IT decision-makers weigh how far a cloud database can be customised compared with an on-premise database, since some impose customisation constraints.

The hybrid approach

There are instances where IT decision-makers will look at options to ringfence their public cloud database platform in a specific region. However, there will clearly be use cases where – perhaps to comply with regional data and privacy regulations – data stores and databases need to be deployed on-premise.

Hyperconverged infrastructure providers such as Nutanix, for instance, offer pay-per-use database-as-a-service offerings, which give IT decision-makers automation tools for database management and the ability to deploy across hybrid IT environments, including public and private clouds.

Vector databases and vector embeddings

In 2022, Dale Markowitz, applied artificial intelligence (AI) engineer at Google, described vector embeddings as a way to represent almost any kind of data – text, images, videos, users or music – as points in space whose locations are semantically meaningful.

Google’s Word2Vec, developed in 2013, is an example of such an approach. With Word2Vec, similar words cluster together: plotted on a graph, “king”, “queen” and “prince” all sit near one another, as do related words such as “walked”, “strolled” and “jogged”.

“Embeddings allow us to find similar data points. I could build a function, for example, that takes as input a word (for example, “king”) and finds me its 10 closest synonyms. This is called a nearest neighbour search. Not terribly interesting to do with single words, but imagine instead if we embedded whole movie plots,” says Markowitz.
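
To make the idea concrete, below is a minimal sketch of a nearest neighbour search in Python. The five toy vectors and their values are made up for illustration; real embeddings such as Word2Vec’s are learned from large corpora and run to hundreds of dimensions.

import numpy as np

# Toy three-dimensional embeddings. Real models learn vectors with
# hundreds of dimensions from large text corpora.
embeddings = {
    "king":     np.array([0.90, 0.80, 0.10]),
    "queen":    np.array([0.88, 0.82, 0.15]),
    "prince":   np.array([0.85, 0.75, 0.20]),
    "walked":   np.array([0.10, 0.20, 0.90]),
    "strolled": np.array([0.12, 0.18, 0.88]),
}

def nearest_neighbours(word, k=3):
    """Return the k words whose embeddings are closest by cosine similarity."""
    query = embeddings[word]
    scores = []
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((other, float(sim)))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

print(nearest_neighbours("king"))  # "queen" and "prince" score above "walked"

The same arithmetic underpins production systems; the difference is scale, which is why dedicated indexes and databases are needed once there are millions of vectors.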

Forrester’s Yuhanna says vector databases are critical in supporting next-generation generative AI (GenAI) and large language model (LLM) initiatives, particularly to enable similarity searches. A vector database is used to query vector embeddings, and many cloud databases offer this functionality to improve the accuracy of AI applications.
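
In the relational world, this capability often arrives as an extension. The sketch below assumes a PostgreSQL instance with the pgvector extension available and the psycopg driver installed; the connection string, table, column names and vectors are illustrative only.

import psycopg  # psycopg 3; assumes a reachable PostgreSQL with pgvector

conn = psycopg.connect("dbname=demo user=demo")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(3)  -- real embeddings use hundreds of dimensions
        )
    """)
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
                ("king", "[0.90, 0.80, 0.10]"))
    # Nearest neighbours by cosine distance; <=> is pgvector's cosine operator
    cur.execute("SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
                ("[0.88, 0.82, 0.15]",))
    print(cur.fetchall())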

Yuhanna estimates that 35% of large organisations are currently looking at cloud databases with vector capabilities to support their GenAI initiatives. He says cloud databases frequently serve as feature stores, facilitating the training, storage and development of machine learning models to support AI initiatives.

Certain use cases require a combination of on-premise and public cloud databases. For instance, MongoDB recently released a preview of its Atlas Edge Server, which gives developers the capability to deploy and operate distributed applications in the cloud and at the edge. Atlas Edge Server provides a local instance of MongoDB with a synchronisation server that runs on local or remote infrastructure. According to MongoDB, this significantly reduces the complexity and risk involved in managing applications in edge environments.
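
Because applications can talk to the local edge instance with a standard MongoDB driver, client code stays conventional and synchronisation is left to the server. The sketch below uses pymongo; the connection string, database and collection names are placeholders rather than values from MongoDB’s documentation.

from pymongo import MongoClient

# Placeholder address for a local edge instance; synchronisation with the
# cloud-hosted Atlas cluster is handled by the edge server, not the client.
client = MongoClient("mongodb://localhost:27021")
readings = client["factory"]["sensor_readings"]

# Writes land locally first and are reconciled with the cloud when connected
readings.insert_one({"sensor": "line-3", "temperature_c": 71.4})
print(readings.count_documents({"sensor": "line-3"}))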

Data integration

One term that comes up often in enterprise data architecture is the data pipeline. Data teams need a way to ingest data from corporate IT systems that may sit in silos, including databases and enterprise applications. This ingestion often relies on complex and fragile data connectors, whose failures can cause operational disruptions.
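
A common mitigation is to wrap each connector call in retries with exponential backoff, so a transient failure in a source system does not take down the whole pipeline. The sketch below is generic Python; fetch_orders stands in for any source-specific connector and returns made-up data.

import time

def with_retries(fetch, attempts=5, base_delay=1.0):
    """Call a source connector, retrying transient failures with backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError as exc:  # the transient failure class we expect
            if attempt == attempts - 1:
                raise  # give up and surface the failure to the orchestrator
            delay = base_delay * (2 ** attempt)
            print(f"fetch failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

def fetch_orders():
    return [{"order_id": 1, "total": 99.0}]  # placeholder payload

rows = with_retries(fetch_orders)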

An example of what dbPaaS providers are offering is Databricks’ recently introduced LakeFlow tool. It automates the deployment, operation and monitoring of pipelines at scale in production, with built-in support for continuous integration/delivery (CI/CD) and advanced workflows that provide triggering, branching and conditional execution.

The data connectivity part of LakeFlow, called Connect, supports MySQL, Postgres, SQL Server and Oracle, as well as enterprise applications such as Salesforce, Dynamics, SharePoint, Workday and NetSuite.

The extract, transform and load (ETL) component of Databricks’ LakeFlow tool offers what the company claims is a real-time mode for low-latency streaming without any code changes. The final part of the tool provides automated orchestration, data health and delivery. According to Databricks, it offers enhanced control flow capabilities and full observability to help detect, diagnose and mitigate data issues for increased pipeline reliability.
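
Databricks has said LakeFlow’s pipeline component builds on its existing Delta Live Tables framework, so a declarative pipeline is likely to look broadly like the sketch below. Treat it as an illustrative shape rather than verified LakeFlow syntax: the dlt module is only available inside a Databricks pipeline runtime, and the table names and source path are invented.

import dlt  # Databricks pipeline runtime module, not the PyPI package of the same name
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders landed from cloud storage")
def raw_orders():
    # 'spark' is provided by the runtime; the landing path is a placeholder
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/orders/"))

@dlt.table(comment="Orders cleaned for downstream analytics")
def clean_orders():
    return dlt.read_stream("raw_orders").where(col("total") > 0)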

Interoperability

By its very nature, a dbPaaS is deployed on top of a public cloud platform, which means IT buyers risk being locked into whatever their public cloud provider chooses to do.

Snowflake’s recent decision to open source its Polaris Catalog is an attempt to provide greater platform interoperability around the Apache Iceberg table format.

Originally developed by Netflix, Iceberg is described as a table format for large, slow-moving tabular data. It provides metadata describing database tables. One benefit is that it offers a standard way for enterprises to run analytics across multiple data lakes.
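
In practice, Iceberg tables are created and queried through an engine such as Spark. The sketch below assumes PySpark with a matching iceberg-spark-runtime package on the classpath, and uses a local, Hadoop-style catalog with invented names purely for illustration.

from pyspark.sql import SparkSession

# Assumes the matching iceberg-spark-runtime jar is on the Spark classpath
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.local.type", "hadoop")
         .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

# Iceberg keeps table metadata (schema, snapshots, partitioning) with the data,
# which is what lets different engines query the same table consistently
spark.sql("CREATE TABLE IF NOT EXISTS local.analytics.events "
          "(id BIGINT, category STRING, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO local.analytics.events VALUES (1, 'login', current_timestamp())")
spark.sql("SELECT category, count(*) FROM local.analytics.events GROUP BY category").show()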

At its annual user conference in June 2024, Snowflake said it would provide enterprises and the entire Iceberg community with new levels of choice, flexibility and control over their data, with full enterprise security and Apache Iceberg interoperability with AWS, Confluent, Dremio, Google Cloud, Microsoft Azure and Salesforce, among others.

At the time, Christian Kleinerman, executive vice-president of product at Snowflake, said: “Organisations want open storage and interoperable query engines without lock-in. Now, with the support of industry leaders, we are further simplifying how any organisation can easily access their data across diverse systems with increased flexibility and control.”

Snowflake’s goal is to offer the Apache Iceberg community a way to harness their data through an open and neutral approach, which, according to Kleinerman, offers “cross-engine interoperability on that data”.

Data quality

A key area that can hold back enterprise IT projects is the quality of data. In a recent blog, Stephen Catanzano, senior analyst of data platforms at Enterprise Strategy Group, points to research by the analyst firm showing that 79% of organisations recognise the need to use AI in mission-critical processes to better compete, yet 62% of line-of-business stakeholders only somewhat trust their organisation’s data.

“This disparity between needing AI and trusting data needs to close quickly. We found that most organisations are heavily focused on data quality as part of data governance to gain trust and to deliver decision-making-ready data to decision-empowered employees,” writes Catanzano.
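
Much of that data quality work starts with simple, automatable checks. The sketch below uses pandas to flag nulls, duplicate keys and out-of-range values in an invented orders table; the column names and rules are illustrative, not drawn from any particular product.

import pandas as pd

# Invented sample data: one null total, one duplicate id, one negative total
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "total":    [99.0, None, 45.5, -10.0],
})

report = {
    "null_totals":     int(orders["total"].isna().sum()),
    "duplicate_ids":   int(orders["order_id"].duplicated().sum()),
    "negative_totals": int((orders["total"] < 0).sum()),
}
print(report)  # {'null_totals': 1, 'duplicate_ids': 1, 'negative_totals': 1}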

The blog discusses Informatica’s Cloud Data Access Management (CDAM) product, which, according to Catanzano, represents a path towards helping organisations achieve their goals in terms of data quality and governance. “With data becoming increasingly pivotal in driving business outcomes, it has become imperative for organisations to have robust governance mechanisms in place,” he writes.  

When CDAM was announced, Brett Roscoe, senior vice-president and general manager of data governance at Informatica, blogged that the product provides AI-powered data governance, which enables organisations to deploy analytics and AI with automated, policy-based security and privacy controls driven by metadata intelligence. 

Setting the scene for AI

Assuming Gartner’s forecast is a fair indication of where the database market is heading, it would seem central IT control of enterprise databases is giving way to each business unit choosing the most appropriate database for its specific requirements. The fact that cloud databases tend to be easier to deploy and potentially offer a lower total cost of ownership makes them attractive to IT buyers.

As Forrester’s Yuhanna points out, they also offer IT leaders a way to streamline IT operations and a quicker way to deploy database applications. He adds: “There is a significant correlation between the adoption of cloud-based DBMS and the rate of AI adoption.” 
