Data for AI — 3 Vital Tactics to Get Your Organization Through the AI PoC Chasm

(Views in this article are the author’s own.)
Craig Suckling, UK’s Central Digital and Data Office Government Chief Data Officer

In the first part of this series, we covered how “data for AI” has become the new “data is the new oil.” But with an exponentially growing wealth of data ready to drive the next generation of value, it is easy to get lost in the bloat, driving costs sky-high while seeing little in return.

Data needs to be harnessed, curated, and pointed intentionally at the right business problems, in the right format, and at the right time. There is nothing new in that statement.

With AI, all the usual data challenges still apply, preventing organizations from delivering value fast and scaling broadly.

Legacy systems and siloed data block access to the data needed to train and fine-tune models. Central governance teams cannot keep pace with business demand for new AI use cases while also serving as experts in a wide variety of multi-modal data. Business appetite for AI continues to increase, but trust in data quality remains low and lineage is fuzzy.

The list goes on, and these challenges compound each other, frustrating efforts to shift from AI PoCs to production scale.

Getting your organization through the AI PoC chasm requires adopting the following 3 principles:

  1. Keeping a high bar on the data you choose to invest in

  2. Storing data in the right tool for the job, with interoperability

  3. Enabling uniform and autonomous access to data across the organization

The good news — AI is increasingly available to fix the data “back office” problem and speed up our ability to deliver customer-facing “front of house” AI use cases. Put another way, AI is folding back on itself to help solve the challenges that impede AI from scaling.

1. Keep a high data bar

The manufacturing industry has been one of the earlier adopters of AI. Forward-thinking manufacturing organizations use AI to reduce product defects, speed up production, and maximize yield.

Achieving these goals at scale without intelligent automation is a challenge: human quality managers can only scale so far against the demands of complex parallel manufacturing processes, detailed inspection requirements, and multivariate problem-solving, all while adhering to health and safety policies.

AI has proven to be highly effective in augmenting the role of quality managers across manufacturing. McKinsey’s research shows some manufacturing companies have been able to realize 20-40% reductions in yield loss with AI agents working alongside quality managers.

In the same way that AI has improved how physical products are manufactured, it can assist in the data preparation process, optimizing the yield of high-quality data products that can be used for AI use cases.

(Side note: there is precedent for this “AI for AI” scenario. In the chip manufacturing industry, NVIDIA uses AI to optimize the production of the GPUs that are themselves used to train AI.)

We now have the opportunity to automate manual heavy lifting in data prep. AI models can be trained to detect and strip out sensitive data, identify anomalies, infer records of source, determine schemas, eliminate duplication, and crawl over data to detect bias. There is an explosion of new services and tools available to take the grunt work out of data prep and keep the data bar high.
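
To make this concrete, below is a minimal sketch of the kind of grunt work being automated, using simple regular expressions and hashing to redact obvious sensitive values and drop duplicate records. The patterns and sample records are illustrative only; in practice, trained models or managed data-prep services take on the harder cases this sketch ignores.

```python
import hashlib
import re

# Illustrative patterns only; production PII detection typically relies on
# trained models or managed services rather than regexes alone.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace obvious sensitive values with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

def deduplicate(records: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text."""
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(record.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

if __name__ == "__main__":
    raw = [
        "Contact jane@example.com for the Q3 report.",
        "Contact jane@example.com for the Q3 report.",
        "Call +44 20 7946 0000 to confirm the order.",
    ]
    print([redact_pii(r) for r in deduplicate(raw)])
```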

By automating these labor-intensive tasks, AI empowers organizations to accelerate data preparation, minimize errors, and free up valuable human resources to focus on higher-value activities, ultimately enhancing the efficiency and effectiveness of AI initiatives.

2. Right tool for the job

AI models need data for a host of different reasons. Data is not only needed during training cycles for building new models. It is also required for fine-tuning foundation models to increase their relevance within a specific functional domain, and during model inference to provide contextual grounding and reduce hallucinations with techniques like Retrieval Augmented Generation (RAG) and Memory Augmented Generation.

These different usage categories all require different modes of data storage. Training a new model calls for varied forms of data, from large volumes of text to high-quality image, audio, and video, all of which must be stored in standard formats that training pipelines can process efficiently.
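
As one illustration of what a standard, pipeline-friendly format can look like, the sketch below writes a handful of curated text records to Parquet, a columnar format that training pipelines can scan efficiently. The file path, source names, and records are placeholders, and the sketch assumes the pyarrow library is available.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder records; real corpora are curated at far larger scale and
# usually written to object storage in a cloud data lake, not local disk.
records = {
    "doc_id": ["doc-001", "doc-002"],
    "source": ["support_tickets", "product_manuals"],
    "text": [
        "The device restarts when the firmware update completes.",
        "Hold the reset button for five seconds to restore defaults.",
    ],
}

table = pa.Table.from_pydict(records)
pq.write_table(table, "training_corpus.parquet")  # illustrative path

# Training pipelines can then read back only the columns they need.
print(pq.read_table("training_corpus.parquet", columns=["text"]).num_rows)
```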

Cloud-based data lakes suit the needs of training data well as they can handle the terabytes or petabytes of raw data that go into training AI models with requisite scalability, durability, and integration capabilities. In comparison, data required for fine-tuning generative AI models is typically restricted to domain-specific data, representative of the task or domain you want to tune the model for, such as medical documents, legal contracts, or customer support conversations.

Typically, data for fine-tuning needs to be labeled and structured to guide the model’s learning towards a desired output, making relational or NoSQL document databases the better choice for fine-tuning. Retrieval Augmented Generation (RAG), meanwhile, relies on knowledge bases: structured collections of information (e.g., documents, web pages, databases).

RAG stores also need to support efficient information retrieval through techniques such as indexing, searching, and ranking, and they need to capture the relationship between user input, retrieved knowledge, and LLM output. For this reason, vector databases (which use embeddings to preserve semantic relationships) and graph databases (which represent relationships in a structured graph) are well suited to RAG.
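
The sketch below shows the retrieval step of RAG in its simplest form: embed a small knowledge base, rank passages by cosine similarity to the query, and hand the top matches to the model as grounding context. The embed function here is a stand-in for a real embedding model or vector database client, and the passages are invented for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model or vector database client.
    This hash-seeded random vector only illustrates the mechanics."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vector = rng.standard_normal(384)
    return vector / np.linalg.norm(vector)

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    doc_vectors = np.stack([embed(d) for d in documents])
    scores = doc_vectors @ embed(query)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

if __name__ == "__main__":
    knowledge_base = [
        "Refunds are processed within 14 days of a returned order.",
        "Premium support is available on weekdays from 9am to 5pm.",
        "Orders over £50 qualify for free standard delivery.",
    ]
    # The retrieved passages would be passed to the LLM as grounding context.
    print(retrieve("How long do refunds take?", knowledge_base))
```

In a production system, the in-memory ranking above is what a vector database’s index performs at scale, alongside the filtering and ranking features mentioned earlier.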

Add to this the plethora of BI, analytics, and traditional machine learning use cases that support other data insight and intelligence work across an organization, and it quickly becomes evident that variety really matters. Providing AI developer teams with a diverse choice of data storage technologies is imperative to match the right tool to the job at hand.

3. Uniform and autonomous access (with security)

AI is being experimented with and adopted broadly across organizations. With so much activity and interest, it is difficult to centralize work, and often centralization creates bottlenecks that slow down innovation.

Encouraging decentralization and autonomy in delivering AI use cases is beneficial as it increases capacity for innovation across many teams, and embeds work into the business with a focus on specific business priorities. However, this can only work with a level of uniformity in data across the organization.

Businesses need to standardize how data is cataloged, tagged with metadata, discovered, and accessed by teams so that there is a level of consistency in how data is interpreted and used in AI use cases. This is especially important for the role AI plays in augmenting work that goes across departments.

For example, realizing a seamless customer experience across sales, accounts, and support teams requires every team working on AI use cases to share a common definition of the customer and of their purchase and support history. To optimize a supply chain across demand forecasting, inventory management, and logistics planning, teams need consistent data products for suppliers, SKUs, and customer orders.
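
As a hedged sketch of what that shared definition could look like, the snippet below describes a common “customer” data product contract with catalog metadata that any team could discover and build against. All names, owners, tags, and lineage entries are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DataProductContract:
    """A shared, discoverable definition that teams register in the catalog."""
    name: str
    owner: str
    description: str
    schema: dict[str, str]                            # column name -> type
    tags: list[str] = field(default_factory=list)
    lineage: list[str] = field(default_factory=list)  # upstream sources

# Illustrative contract: sales, accounts, and support teams build their AI
# use cases against this one definition instead of three divergent versions.
customer = DataProductContract(
    name="customer_360",
    owner="sales-data-team",
    description="Single definition of a customer, with purchase and support history.",
    schema={
        "customer_id": "string",
        "purchase_history": "array<order>",
        "support_tickets": "array<ticket>",
    },
    tags=["pii", "gold", "cross-functional"],
    lineage=["crm.accounts", "orders.transactions", "support.tickets"],
)

print(customer.name, customer.tags)
```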

Increasing uniformity in data products across functions lets organizations improve the quality and reliability of their AI models. Combined with a choice of data tools to match the AI task at hand and a high quality bar on the data itself, this allows AI work to be conducted with increasing autonomy, delivering greater business value faster.

In the next article in this series, we will continue to move up the stack, away from data sourcing and management and into the activation of value.

Next up is Agentic AI: how it is transforming the way we think about conducting data work and augmenting how we use data to deliver business value.

Note: The article was first published on the author’s LinkedIn blog. It has been republished with consent.

About the Author:

Craig Suckling is the Government Chief Data Officer at the UK Central Digital and Data Office (CDDO). With over 20 years as an analytics, data, and AI/ML leader, Suckling brings a wealth of experience and expertise in innovating for the future, developing transformation strategies, navigating change, and generating sustainable growth.
