Acquiring the appropriate data to construct a robust enterprise dataset is arguably the most crucial challenge organizations face when developing their own artificial intelligence (AI) models.
According to Waseem Ali, CEO of consultancy Rockborne, even experienced practitioners can encounter significant hurdles. “It always starts with the data,” Ali explains. “Poor data leads to a poor model.” He advises that enterprises should not aim to conquer the world with their initial projects; rather, they should pilot a smaller initiative that allows them to build on their successes.
Organizations need to identify their specific business needs and data requirements for digital projects. They should focus on understanding the problems they need to solve and the hypotheses they want to explore, steering clear of overly ambitious considerations of “global impacts” in the early stages. Johannes Maunz, AI head at industrial IoT expert Hexagon, emphasizes the importance of approaching data acquisition with a specific use case in mind. “There isn’t a single machine learning or deep learning model that fits all scenarios,” he states. “Assess your current situation and identify what data is needed to improve it. Collect this data in a targeted and manageable way, focused solely on that use case.”
Hexagon typically relies on its own sensors, gathering data relevant to construction applications, such as walls, windows, and doors. From the point of data capture to final browser rendering, Hexagon ensures consistency and standards in its data management processes.
Enterprises should first consider the compliant datasets they already hold or can readily access. Maunz advises working closely with legal and privacy teams, even in industrial settings, to ensure that the data used contains no private or sensitive information. Once this groundwork is in place, organizations can build and train the necessary models, provided the project is feasible and the budget is available.
At this stage, it is important to be transparent about the decision points and signal values that influence usability, viability, business impact, and competitive performance. Where enterprises do not hold the data they need, they may have to negotiate with partners or customers to acquire it. “People are usually quite open to sharing data, but contracts are essential,” Maunz notes. “That’s when we typically initiate data campaigns. In some cases, starting with more data than needed can allow enterprises to refine their focus later.”
### The Importance of Data Quality and Simplicity
Emile Naus, a partner at supply chain consultancy BearingPoint, stresses the significance of data quality in AI and machine learning initiatives. Keeping things simple is crucial: complexity can impede sound decision-making and lead to poor outcomes, with complications that might include bias and intellectual property issues. “Internal data may not be flawless, but at least you have visibility into its quality,” Naus points out.
He argues that simple models, such as a straightforward 2D line fit, can be effective, but that more complex AI and ML models can deliver better results, optimizing production, refining solutions, and minimizing waste, provided enterprises use the right data well. “As with all models, it’s important to recognize that an AI model is created using other models, and since models are inherently imperfect, data governance becomes critical,” he warns. “Missing information may turn out to be more significant than what you have.”
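To make the "2D line fit" concrete, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function names and the sample numbers are illustrative assumptions, not anything Naus or BearingPoint describe; the point is only how little machinery the simplest useful model needs, and that its residuals show where it falls short.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Difference between each observed y and the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up example data: hours of machine time vs. units produced.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [12.0, 19.0, 31.0, 42.0, 48.0]

slope, intercept = fit_line(hours, units)
print(f"units = {slope:.2f} * hours + {intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals(hours, units, slope, intercept)])
```

Large or patterned residuals are a hint that either the model is too simple or the data is missing something, which is where the governance questions Naus raises come in.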
Andy Crisp, senior vice president of data and analytics at Dun & Bradstreet (D&B), recommends leveraging client insights and critical data elements to establish standards for data quality. “The data that clients acquire from us has the potential to feed into their models,” Crisp explains. “We conduct around 46 billion data-quality evaluations, recalibrating our datasets according to established standards and subsequently publishing data-quality insights each month.”
A specific data attribute needs to perform adequately according to established standards before being handed off to another team responsible for data management, where it can be captured, curated, and maintained. “Investing time to develop your understanding is invaluable,” Crisp advises. “Start with a single piece, check its dimensions, and ensure accuracy before attempting a larger batch.”
Enterprises must recognize what “good” data performance looks like to enhance their insights. To do this, they should keep problem statements concise and focus on identifying the necessary data for their datasets. Careful annotation and metadata management will facilitate the curation of control datasets, enabling a rigorous, scientific approach that minimizes bias.
Firms should be wary of grand claims that blur several factors together, and should test thoroughly. This is one area where moving too fast can lead to significant setbacks. All data used must adhere to standards that require ongoing evaluation and remediation. “Measure, monitor, remediate, and enhance,” Crisp emphasizes, noting that D&B’s quality-engineering team consists of about 70 professionals worldwide. “Competent engineering is vital for minimizing inaccuracies.”
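The "measure" step of that loop can start small. The sketch below checks one attribute against a standard before it is handed off, in the spirit of Crisp's advice to verify a single piece before attempting a batch. Everything here is an assumption for illustration: the `QualityReport` type, the two metrics (completeness and validity), and the threshold values are hypothetical, not D&B's actual standards or tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Mapping


@dataclass
class QualityReport:
    attribute: str
    total: int
    completeness: float  # share of records where the value is present
    validity: float      # share of present values that pass the rule
    passed: bool


def measure_attribute(
    records: Iterable[Mapping],
    attribute: str,
    is_valid: Callable[[object], bool],
    min_completeness: float = 0.95,  # hypothetical standard
    min_validity: float = 0.98,      # hypothetical standard
) -> QualityReport:
    """Measure one attribute against completeness and validity thresholds."""
    rows = list(records)
    total = len(rows)
    # Treat missing keys, None, and empty strings as "not present".
    present = [r[attribute] for r in rows if r.get(attribute) not in (None, "")]
    valid = [v for v in present if is_valid(v)]
    completeness = len(present) / total if total else 0.0
    validity = len(valid) / len(present) if present else 0.0
    passed = completeness >= min_completeness and validity >= min_validity
    return QualityReport(attribute, total, completeness, validity, passed)


# Made-up sample: start with a handful of records, not the full dataset.
sample = [
    {"company": "Acme", "employees": 120},
    {"company": "Globex", "employees": -5},    # present but invalid
    {"company": "Initech", "employees": None}, # missing
    {"company": "Umbrella", "employees": 4000},
]

report = measure_attribute(
    sample, "employees", lambda v: isinstance(v, int) and v >= 0
)
print(report)
```

Running the same function on a schedule and storing the reports gives the "monitor" step; a failed report is the trigger to remediate before the attribute reaches a model.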
Greg Hanson, vice president for North Europe, the Middle East, and Africa at Informatica, concurs that setting goals is essential for enterprises to efficiently allocate their resources toward cataloging and integrating data required for effective AI training. Often, an organization’s data is fragmented across various locations—whether in cloud environments or on-premises infrastructures. “Make sure to catalog all data assets and understand where they reside,” advises Hanson. “Consider leveraging AI for accelerated data management as well.”
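As a rough illustration of what cataloging assets and recording where they reside can mean in practice, the following sketch keeps a small in-memory inventory. The `DataAsset` fields, the location labels, and the example entries are all hypothetical; a real catalog would normally live in a dedicated tool rather than a Python dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    name: str
    location: str  # assumed labels: "cloud" or "on_premises"
    system: str    # where the data physically lives
    owner: str     # team accountable for the asset
    tags: frozenset = frozenset()


class DataCatalog:
    """Minimal in-memory inventory of data assets."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        # Refuse duplicates so each asset has exactly one catalog entry.
        if asset.name in self._assets:
            raise ValueError(f"asset already cataloged: {asset.name}")
        self._assets[asset.name] = asset

    def by_location(self, location):
        return sorted(a.name for a in self._assets.values() if a.location == location)

    def with_tag(self, tag):
        return sorted(a.name for a in self._assets.values() if tag in a.tags)


catalog = DataCatalog()
catalog.register(DataAsset("orders", "cloud", "warehouse", "sales-ops",
                           frozenset({"training-candidate"})))
catalog.register(DataAsset("sensor_logs", "on_premises", "plant-historian", "engineering",
                           frozenset({"training-candidate"})))
catalog.register(DataAsset("hr_records", "on_premises", "hr-database", "people-team",
                           frozenset({"personal-data"})))

print("On-premises assets:", catalog.by_location("on_premises"))
print("Candidates for AI training:", catalog.with_tag("training-candidate"))
```

Even a list this simple answers the two questions Hanson raises: what data exists, and where it sits.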
### Governance Before Data Ingestion
It’s essential to enforce data quality standards before feeding data into the AI engine, ensuring proper governance and compliance. Failure to measure, quantify, and address data quality concerns can lead to rapid, erroneous decision-making, warns Hanson, who succinctly summarizes, “Garbage in, garbage out.”
Tendü Yogurtçu, CTO at data suite provider Precisely, suggests that organizations may benefit from establishing a steering committee or cross-functional council to define best practices and processes for relevant AI projects. Such structures can expedite progress by identifying common use cases or patterns among teams, which may evolve as companies learn from both pilot programs and deployed solutions.
Data governance frameworks may need to be expanded to incorporate AI models, which present numerous potential use cases. “For instance, in the insurance sector, accurate risk modeling requires comprehensive data on factors like wildfire and flood risks, topography, building locations, access to fire hydrants, and distances to potential hazards such as gas stations,” Yogurtçu explains.
However, developing AI models—particularly generative AI (GenAI)—can be costly, cautions Richard Fayers, senior data and analytics principal at Slalom consulting. “In some domains, companies might find advantages in collaborating, especially in fields like law or medicine,” Fayers observes. “Value arises when you augment GenAI with your own datasets.”
In architecture, for example, users can enhance large language model applications with their own data and documentation to enrich queries. A similar tactic could be employed in creating a ticketing platform that intelligently searches based on natural-language criteria not strictly linked to metadata.
“For example, imagine a ticketing service that can help you find ‘family-friendly performances this weekend.’ That type of search is currently quite challenging,” Fayers notes. However, he reiterates that constructing datasets and refining prompts for generative models like ChatGPT demands a steadfast commitment to data quality and governance, with prompt engineering emerging as a crucial skill in high demand.
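The ticketing idea can be sketched as ranking events by how closely their free-text descriptions match a natural-language query, rather than filtering on fixed metadata fields. The example below is deliberately a toy: `embed` is a plain word-count vector standing in for the learned text embedding a production system would get from a language model, and the event data is invented. It shows the shape of the approach, not how Fayers or Slalom would build it.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, events, top_k=3):
    """Return titles of the events whose descriptions best match the query."""
    q = embed(query)
    scored = [(cosine(q, embed(e["description"])), e["title"]) for e in events]
    scored = [s for s in scored if s[0] > 0]   # drop events with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))   # best score first, then by title
    return [title for _, title in scored[:top_k]]


# Invented listings; note that no "family-friendly" metadata field exists.
EVENTS = [
    {"title": "Puppet Matinee",
     "description": "A gentle puppet show, family friendly and suitable for young children"},
    {"title": "Late Night Comedy",
     "description": "Stand-up comedy for adults only"},
    {"title": "Symphony Gala",
     "description": "An evening orchestral concert"},
]

print(search("family-friendly performances", EVENTS))
```

The toy matches only on shared words, so it would miss a listing described as "great for kids". Closing that gap is what a real embedding model adds, and the date constraint in "this weekend" would still need a structured filter alongside the text match.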