AI in the enterprise: How to build an AI dataset

The successful execution of an enterprise's AI strategy lives or dies on the quality of the data underpinning it, so how can companies ensure they are on the right path when it comes to cracking on with the early stages of this process?

Fleur Doidge

Published: 09 Sep 2024

Finding and acquiring the right data to build an enterprise dataset is perhaps the most critical task facing organisations that want to build their own artificial intelligence (AI) models.

Even with hands-on experience, things can easily go wrong, according to Waseem Ali, CEO at consultancy Rockborne. “It always starts with the data,” Ali says. “If your data isn’t good, the model won’t be good.”

Quite often, he suggests, the aim for enterprises should not be to take over the world with their first project, but to run a pilot that enables them to take things further.

Examine the specific business need and requirement for the data or digital project, and ask what problem needs solving and what “hunch” needs querying, but avoid deep dives into “global impacts” at first.

Work from first principles towards acquiring data for the specific use case in question, as Johannes Maunz, AI head at industrial IoT specialist Hexagon, explains.

“There’s not one ML or deep learning model to solve all use cases,” Maunz says. “Compare your status quo with what you need to improve. What available data needs to be captured? Do that in a small or finite way, just for that use case.”

Hexagon’s approach usually focuses on its own sensors, with data for construction use cases on walls, windows, doors and so on. Right up to what is rendered in the browser, Hexagon knows about the data and its standards, format, consistency and so on.

Consider first the conforming data and datasets the business already has or can use. This typically entails working closely with legal and privacy teams, even in an industrial, in-house setting. Ensure the data earmarked for use does not contain any private personal information, Maunz recommends. And, from here, enterprises can build the model they want to use and train it – assuming costs and feasibility are in place.

From there, the decision points needed to make things work can become transparent, along with the signal values used to estimate factors such as usability and viability, business effects, or potential performance versus competition data.

For data the enterprise does not currently hold, some negotiation with partners or customers might be required to acquire it.

“People are quite open, frankly – but there’s always a contract in place,” Maunz says. “Only then do we start doing what we usually call data campaigns. Sometimes it even makes sense to start with more data than needed, so that the enterprise can down-sample.”
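Down-sampling of the kind Maunz mentions can be as simple as drawing a reproducible random subset. The following is a minimal sketch, not Hexagon's method: the `down_sample` function, the `readings` records and the target size of 100 are all hypothetical, chosen only for illustration.

```python
import random


def down_sample(records, target_size, seed=0):
    """Return a reproducible random subset of at most target_size records."""
    records = list(records)
    if target_size >= len(records):
        # Nothing to trim: keep everything that was collected.
        return records
    # A seeded generator makes the same subset come back on every run.
    rng = random.Random(seed)
    return rng.sample(records, target_size)


# Hypothetical over-collected sensor readings (illustrative data only).
readings = [{"sensor_id": i, "value": i * 0.5} for i in range(1000)]
subset = down_sample(readings, 100)
print(len(subset))
```

Fixing the seed means the same subset can be regenerated later, which helps when a pilot's results need to be audited or repeated.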

Data quality and simplicity can be essential

Emile Naus, partner at supply chain consultancy BearingPoint, highlights the focus on data quality for AI/ML. Keep things simple where possible. Complexity makes correct decision-making difficult and damages outcomes – and then there is bias and intellectual property to consider. “Internal data isn’t perfect, but at least you’ll have a view of how good it is,” adds Naus.

Compared with an easy-to-use 2D line fit, or even a 3D line fit, a complicated multi-dimensional fit powered by AI/ML can drive much better outcomes – optimising production, solution “recipes”, minimising waste, and more – but only if enterprises are “let loose” on the right data, he warns.
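For reference, a simple 2D line fit can be only a few lines of code. The sketch below assumes an ordinary least-squares fit, which is one common reading of the term rather than anything Naus specifies. The `fit_line` function and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means every x value is identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

A baseline like this gives a point of comparison: if a far more complex model cannot beat it on held-out data, the extra complexity may not be paying for itself.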

“As with all models, because an AI model is used to build a model, and a model is always wrong, data governance is key,” he says. “The bits you don’t have might actually be more important. You have to work out how complete data is and how accurate.”

Andy Crisp, senior vice-president of data and analytics at Dun & Bradstreet (D&B), recommends the use of client insights and critical data elements to establish data quality standards and tolerances, measuring and monitoring.

“The data that [clients] want or acquire from us [for example] is also then potentially feeding their models,” Crisp says. “We’re calculating about 46 billion data-quality calculations, taking our data and then maybe doing it again against those standards, and then publishing data-quality observations [each month].”

A specific attribute through the lens of a specific standard, for instance, must in effect perform well enough to be passed to the next team, who take those standards and tolerances, the outcomes of those measurement and observation points, and then work with data management to capture, curate and maintain the data.
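As a rough illustration of measuring an attribute against a standard and a tolerance, the sketch below scores the completeness of each critical field and records whether it clears a minimum threshold. It is not D&B's process. The function names, field names, sample records and thresholds are all hypothetical.

```python
def completeness(records, field):
    """Fraction of records in which the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def check_tolerances(records, tolerances):
    """Score each critical field and note whether it meets its minimum."""
    observations = {}
    for field, minimum in tolerances.items():
        score = completeness(records, field)
        observations[field] = {"completeness": score, "passed": score >= minimum}
    return observations


# Hypothetical company records with some gaps (illustrative data only).
companies = [
    {"name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"name": "Beta plc", "postcode": ""},
    {"name": "Gamma LLP", "postcode": "EF3 4GH"},
    {"name": None, "postcode": "IJ5 6KL"},
]
report = check_tolerances(companies, {"name": 0.7, "postcode": 0.9})
print(report)
```

Here only three of four records carry each field, so `name` clears its 70% tolerance while `postcode` falls short of its 90% tolerance and would be flagged before being passed to the next team.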

“There’s no substitute for spending time on things and developing your understanding,” Crisp agrees. “Start by cutting one piece of wood, and check the length before you go and cut 50 planks all wrong.”

Enterprises need to “know what good looks like” to improve data performance and insights, which can then be pulled together. Keep problem statements tight, narrowing identification of data for the required datasets. Meticulous annotation and metadata can enable curation of control datasets and a truly scientific approach that identifies and helps minimise bias.

Beware big, bold statements conflating multiple factors and make sure to “test to destruction”. This is one area in IT where enterprises do not want to “move fast and break things”. All data used must meet standards that must themselves be continually examined and remediated.

“Measure and monitor, remediate and improve,” Crisp says, noting that D&B’s quality-engineering team comprises some 70 team members worldwide. “Competent engineering will help with trying to reduce hallucinations, etc.”

Greg Hanson, North Europe, Middle East and Africa general vice-president at Informatica, agrees that goal-setting is crucial and can help enterprises determine how best to spend their time in terms of cataloguing information, integrating information and what data is required to train the AI to support outcomes.

Even an enterprise’s own data will typically be fragmented and hidden across locations, clouds or on-premise systems.

“Catalogue all your data assets and understand where that data resides,” Hanson says. “Consider AI for faster data management too.”

Ensure governance prior to ingestion

Apply all data quality rules before ingestion by the AI engine, assuming appropriate governance and compliance. If an enterprise is not measuring, quantifying and fixing, then it is just going to make incorrect decisions at an accelerated pace, says Hanson, adding: “Remember: garbage in, garbage out.”
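One way to picture rules applied before ingestion is a gate that splits incoming records into those fit for the model and those held back for remediation. This is a minimal sketch under stated assumptions, not a description of any supplier's product: the `is_valid` rules, the `temperature_c` field and its plausible range are hypothetical.

```python
def is_valid(record):
    """Hypothetical rules: an integer id and a temperature in a plausible range."""
    temperature = record.get("temperature_c")
    return (
        isinstance(record.get("id"), int)
        and isinstance(temperature, (int, float))
        and -50.0 <= temperature <= 150.0
    )


def quality_gate(records):
    """Split records into those safe to ingest and those needing remediation."""
    accepted, rejected = [], []
    for record in records:
        (accepted if is_valid(record) else rejected).append(record)
    return accepted, rejected


# Illustrative incoming batch: one clean record, one missing value, one outlier.
incoming = [
    {"id": 1, "temperature_c": 21.5},
    {"id": 2, "temperature_c": None},
    {"id": 3, "temperature_c": 900.0},
]
accepted, rejected = quality_gate(incoming)
print(len(accepted), len(rejected))
```

Keeping the rejected records, rather than silently dropping them, gives the enterprise something to measure, quantify and fix.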

Tendü Yogurtçu, CTO at data suite supplier Precisely, says that based on company size and industry type, an organisation may consider a steering committee or a cross-functional council to help define best practice and processes across all relevant AI initiatives. This can also assist with acceleration by identifying common use cases or patterns across teams, which itself can continue to change as organisations learn from pilots and production.

Data governance frameworks may require expansion to include AI models. That said, potential AI use cases abound.

“Take insurance. To model risk and pricing accuracy, insurers need detailed information about wildfire and flood risks, parcel topography, the exact location of the building within the parcel, proximity to fire hydrants, and the distance to potentially risky points of interest such as gas stations,” Yogurtçu explains.

However, building AI models – especially generative AI (GenAI) – can prove expensive, warns Richard Fayers, senior data and analytics principal at consultancy Slalom.

“Maybe, in some areas, companies can work together – such as legal or medicine,” Fayers says. “Where we start to see value is when you augment [GenAI] with your data – there are various ways you could do it.”

In architecture, for example, users can supplement use of large language models (LLMs) with their own datasets and documentation to be queried. A similar strategy might work for creating a ticketing search platform that intelligently considers a set of natural-language-based criteria that are not linked one-for-one to the metadata and tags.

“For instance, if you could use a ticketing platform that enables you to discover ‘a performance on the weekend that’s suitable for children’, that’s a type of search which right now can prove quite difficult,” Fayers says.

Even for a more “conversational” approach with the likes of ChatGPT, dataset building and prompt engineering still mandate a focus on data quality and governance, he says, with prompt engineering set to become an essential skillset in high demand.
