Artificial Intelligence (AI) has been a factor in the development of innovative technologies since the advent of the modern internet. It provided the foundation for Netflix’s first recommendation algorithm, Cinematch, which launched in 2000, seven years before the company started its streaming service. Since then, just about every consumer and enterprise technology solution has had some element of AI behind it.
For most end-users, AI was seemingly invisible. Until the launch of ChatGPT in 2022, the practical concept of AI was hidden from the consumer — while we either loved or made fun (and frequently both) of Netflix and Amazon recommendations, we did not spend a lot of time thinking about the algorithms or the data that was powering them.
In fact, most of the time, consumers were happy to give tech companies full access to their data in return for highly personalized and frictionless experiences. Their unwitting agreement has provided trillions of data points about identities, transactions, and behaviors that have created the fuel for the models that drive the rapid proliferation of more and more AI-supported solutions.
But then ChatGPT burst onto the scene and put AI into the hands of almost everyone. And thus, we’ve entered the “Age of AI.”
There are many implications to the widespread use of AI, not least of which are ethical, business, and political considerations, and it is beyond the scope of this article to address all of them. The topic I do want to explore is how the data management discipline will need to change and adapt in the age of AI.
Let us start with the definition:
Data management is a business and technology discipline aimed at ensuring that the data in an enterprise is usable for any business purpose, current or future.
At a high level, data management concerns itself with a basic data lifecycle of three stages: Collect, Connect, and Use.
To make this a little less abstract, let us use an example of a customer – we’ll call him Joe Smith – who is opening a small business checking account in a bank called BestBank. And let’s assume a realistic level of data management maturity at BestBank.
Collect: Joe Smith provides the required documentation for BestBank to comply with Know Your Customer (KYC) and Anti-Money Laundering (AML) requirements. Happily, that gives BestBank enough identifying information to be able to onboard Joe Smith as a verified customer and assign him a unique ID.
Less happily, this ID is only unique to Joe in the system that processes small business accounts and not to the entire bank. Joe has a personal account at BestBank, but his ID there is different from the one assigned when his business account got onboarded (remember, we are assuming a realistic, but not ideal level of data management maturity at BestBank).
Behind the scenes, data management’s role at this stage is to:
Establish data quality checks when Joe’s information is entered into the BestBank business accounts onboarding system.
Ensure metadata is created to support compliance with privacy, information security, and government and/or commercial standards guidelines.
Identify the most important business entities (in this example, Joe as a customer and the small business checking account as a product) through master data management.
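As a rough illustration of the first task in this list, entry-time data quality checks might look like the sketch below. The field names (`full_name`, `tax_id`, `date_of_birth`) and the rules themselves are hypothetical; real KYC/AML onboarding checks are far more extensive.

```python
from datetime import date

# Hypothetical onboarding rules, for illustration only.
def check_onboarding_record(record: dict, today: date) -> list[str]:
    """Return a list of data quality issues found in one onboarding record."""
    issues = []
    # Completeness: a name must be present and non-blank.
    if not (record.get("full_name") or "").strip():
        issues.append("full_name is missing")
    # Format: assume a 9-digit tax identifier with no separators.
    tax_id = record.get("tax_id") or ""
    if not (tax_id.isdigit() and len(tax_id) == 9):
        issues.append("tax_id must be exactly 9 digits")
    # Validity: a date of birth must exist and lie in the past.
    dob = record.get("date_of_birth")
    if dob is None:
        issues.append("date_of_birth is missing")
    elif dob >= today:
        issues.append("date_of_birth must be in the past")
    return issues
```

Running checks like these at the point of entry means a bad record is rejected or flagged before it reaches any downstream system.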
Connect: Joe’s business and personal account information is brought together in a data lake and connected, making it possible for Joe to see all of his accounts in the BestBank award-winning app.
At this stage, the role of data management is to:
Integrate datasets, usually by moving them into centralized data repositories.
Identify and connect common keys within different datasets through master data management processes and tooling.
Track data movement through data lineage processes and tooling.
Track and improve data quality through data quality management processes and tooling.
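To make the common-key task concrete, here is a minimal sketch of linking Joe's personal and business records under one master key. It assumes an exact match on a normalized name plus a tax ID; the record layout and system names are hypothetical, and real master data management tooling typically relies on fuzzy or probabilistic matching rather than exact comparison.

```python
from collections import defaultdict

def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())

def link_records(records: list[dict]) -> dict[str, list[str]]:
    """Group system-specific IDs under one master key (normalized name + tax ID)."""
    groups = defaultdict(list)
    for rec in records:
        master_key = f"{normalize(rec['name'])}|{rec['tax_id']}"
        groups[master_key].append(f"{rec['system']}:{rec['local_id']}")
    return dict(groups)

# Hypothetical records from two siloed onboarding systems.
records = [
    {"system": "personal", "local_id": "P-1001", "name": "Joe Smith", "tax_id": "123456789"},
    {"system": "business", "local_id": "B-77", "name": "JOE  SMITH", "tax_id": "123456789"},
    {"system": "personal", "local_id": "P-2002", "name": "Jane Doe", "tax_id": "987654321"},
]
linked = link_records(records)
```

Here both of Joe's system-specific IDs end up under the same master key, which is what allows the app to show all of his accounts together.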
Use: In addition to improving Joe’s digital experience, the connected data in the data lake is also used for BestBank financial reporting, risk reporting, marketing, and cross-sale outreach to Joe. His data is included in a cohort to use in predictive analytics to personalize marketing and sales to people with similar demographic profiles.
In this usage stage, data management must:
Enforce data usage rights through metadata and access management processes and tooling; e.g., ensure Joe’s data does not get used in predictive analytics if he didn’t consent to it.
Use data lineage processes and tooling to track what data is being used and where it is being used.
Identify and surface insights through dashboards, machine learning, and AI.
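The first task in this list, enforcing usage rights, can be sketched as a consent filter applied before any cohort is built. The `consented_purposes` field and the purpose names are assumptions made for illustration; in practice this metadata would live in a consent or access management system.

```python
def eligible_for(purpose: str, customers: list[dict]) -> list[dict]:
    """Keep only customers whose consent metadata permits the given purpose."""
    return [c for c in customers if purpose in c.get("consented_purposes", set())]

# Hypothetical customer records with consent metadata attached.
customers = [
    {"id": "C1", "name": "Joe Smith", "consented_purposes": {"servicing"}},
    {"id": "C2", "name": "Jane Doe", "consented_purposes": {"servicing", "predictive_analytics"}},
    {"id": "C3", "name": "No Metadata"},  # missing consent metadata is treated as no consent
]
cohort = eligible_for("predictive_analytics", customers)
```

Treating missing metadata as "no consent" is a deliberate default in this sketch: a record should have to prove it is usable rather than the other way around.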
What are the common data management challenges in the “Collect” phase and does AI help?
Data quality on entry: While AI cannot directly improve data quality, it can enhance the data entry experience, leading to more accurate and consistent data input. For instance, generative AI can enable voice-to-data transcription, reducing errors and simplifying data entry.
Time to market for ingesting new external data sets: The primary challenge in ingesting external data lies in the diversity of formats and time required to develop logic to bring this data into the standardized format and associate it with the vendor-provided metadata. AI-powered tools can help with all three steps:
Automate the parsing out of the vendor file
Automate the loading into the standardized schema
Read the vendor documentation and associate descriptions and definitions with the standardized data
This automation would significantly reduce the time needed to realize value from the multitude of external datasets companies process.
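To show what these steps amount to, the sketch below parses a vendor file and loads it into a standardized schema using a column mapping. The vendor column names, the target schema, and the mapping are all hypothetical. The mapping is written by hand here; deriving it from the vendor file and documentation is the part an AI-powered tool would automate.

```python
import csv
import io

# Hypothetical mapping from one vendor's column names to a standardized schema.
VENDOR_TO_STANDARD = {"CustName": "customer_name", "Acct#": "account_number", "Bal": "balance"}

def load_vendor_file(text: str, mapping: dict[str, str]) -> list[dict]:
    """Parse vendor CSV text and rename mapped columns; unmapped columns are dropped."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({std: raw[src].strip() for src, std in mapping.items()})
    return rows

vendor_csv = "CustName,Acct#,Bal,Internal\nJoe Smith, 001 ,250.00,x\n"
standardized = load_vendor_file(vendor_csv, VENDOR_TO_STANDARD)
```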
Business metadata creation: The challenge of business metadata creation is that it is a manual process that requires subject matter experts (SMEs) to step away from their day-to-day duties to create, curate, and update business descriptions of both atomic and derived data.
While the primary beneficiaries of business metadata are the data consumers, the most knowledgeable people are those in the business unit that creates the data. As a result, there can be a misalignment of incentives that makes business metadata one of the hardest data management disciplines to implement (and often the major reason data catalog implementations do not bring the value that companies expect).
Machine learning (ML)-based tools can automate the process by generating business metadata from existing documents and usage patterns, streamlining the metadata creation process, and ensuring consistency. While these tools are still at a low level of maturity, I have seen significant progress just in the last year.
Uncovering hidden data: Until the advent of generative AI, the information stored in documents was difficult to access and make available for operations and analytics. Generative AI changed that: it possesses a remarkable ability to extract information from documents, transforming previously inaccessible data into a consumable format. This capability opens new avenues for data analysis and insights.
What are the common data management challenges in the “Connect” phase and does AI help?
Connecting disparate datasets: This is the major challenge that results from source data residing in siloed operational systems, created by different business processes, and in different business lines. This challenge is especially tough on companies that have not established a robust Master Data Management (MDM) discipline in the data onboarding process.
Connecting these datasets requires a lot of manual work by people who understand both business and data. In most organizations, these people are in very short supply and very high demand, so this work often takes a significant amount of time, which impacts time to market and thus detracts value from data initiatives.
Advanced ML tools, with human oversight, can streamline the process of identifying common keys and relationships between diverse datasets, facilitating seamless data integration. Similar to business metadata tooling, these tools are still at a low level of maturity but are rapidly progressing.
Understanding data quality: Traditional approaches to data quality assume the prior knowledge of what kind of errors are expected and monitoring for these errors. Usually, the data quality measurement process starts with profiling existing data and then creating data quality checks that are based on current concerns.
This approach is time- and labor-intensive and not very useful, since it uses past errors to predict future ones. The “learning” part of AI/ML-enabled tools (especially the new class of tools in the Data Observability category) removes the need for extensive manual data profiling and bypasses the past-performance bias inherent in the traditional data quality approach.
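The difference between a hand-written rule and a learned baseline can be illustrated with a deliberately simple statistical stand-in. The sketch below flags a metric that deviates sharply from its own history; the metric (`daily_null_rate`) and the three-standard-deviation threshold are assumptions, and commercial data observability tools use considerably more sophisticated models.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # A perfectly constant history: any change at all is unexpected.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical share of null values in one column over the past week.
daily_null_rate = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012]
```

No one had to anticipate "null rate above 5%" as a rule; the expected range comes from the data itself.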
Understanding data lineage: While there are tools that can crawl a data environment and discover how data elements move through it, they produce so much output that it is hard to make sense of it. These tools typically investigate only the physical attribute layer, making it very hard to discern where the data originated.
As a result, the majority of data lineage efforts are very manual, take a long time, and their results become outdated the moment they are finished. Ultimately, they struggle to demonstrate ROI. Combining AI/ML-enabled metadata management platforms and data observability tooling can transform the field of data lineage and enable organizations to get the value that has proven elusive so far.
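At its core, lineage is a graph of which datasets feed which others, and the question "where did this data originate?" is a traversal of that graph. The sketch below assumes the lineage has already been captured as a simple mapping from each dataset to its direct sources; the dataset names are hypothetical.

```python
def upstream_sources(lineage: dict[str, list[str]], target: str) -> set[str]:
    """Return every dataset that feeds `target`, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(target, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    return seen

# Hypothetical lineage: each key lists the datasets it is built from.
lineage = {
    "risk_report": ["data_lake.customers", "data_lake.accounts"],
    "data_lake.customers": ["personal_onboarding", "business_onboarding"],
    "data_lake.accounts": ["business_onboarding"],
}
sources = upstream_sources(lineage, "risk_report")
```

The traversal is the easy part; the hard part, and the one the tooling needs to solve, is keeping the mapping itself complete and current.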
Let us look at the changes generative AI is bringing to the “Use” stage since ML-enabled predictive analytics have been with us for a few years now and their value and risks are reasonably well understood.
How does generative AI address the common challenges of the “Use” stage?
Operationalizing the output of both business and predictive analytics into everyday use by decision-makers: This is a major challenge for organizations that rely on analytics to make informed decisions (which, in essence, is every organization).
Even with all the advancements in dashboard and reporting tools, and even with well-developed data storytelling expertise, dashboards and reports can only be understood with the dedicated attention of users. In some cases, their complexity renders them of only marginal value.
To illustrate this problem, here is a real-life example from my own experience:
For a large financial services customer, we implemented a 360 view of sales prospects. It gave great results, and we operationalized it as an additional tab in our Customer Relationship Management (CRM) system. We implemented it that way based on the well-defined business requirements and in close collaboration with a business sponsor.
It should have been a resounding success. But, as it turned out, salespeople don't ever want to open more tabs! This functionality went unused (notwithstanding all the training and communication we did) until we figured out how to use limited real estate on the main CRM page to display this information.
Generative AI could have solved this by presenting the information in a user-preferred format, whether a document, speech, or even a meme. In the example above, imagine a salesperson on the way to see the prospect listening to a summary of the information about this prospect: “You are about to meet Jane Doe.
She is the CFO of the company ABC, which is a subsidiary of XYZ. Within our company, John Smith has been working with the CFO of XYZ for the past six months and has sold US$1 million worth of our product during that time frame.” No tab surfing required!
But this power comes with significant new challenges that we are just beginning to understand and find ways to address.
Let us look at the challenges we outlined above from the data management perspective. To ascertain whether or not our efforts are achieving their intended outcomes, we have to know the following about our data:
From the data management perspective ensuring correctness has two components: the accuracy of the dataset used in training of the model and the quality of the dataset at the time of execution.
However, in traditional data quality approaches, accuracy is one of the hardest data quality dimensions to assess. That is because it requires a set of contextual data quality rules that are both time-consuming and error-prone.
For example, if a record has both a year of birth (say 1972) and an age (say 60), each value can be within its valid range, yet taken together they are clearly incorrect. Even so, data stewards rarely come up with rules that cover two fields in relation to each other.
ML-enabled tools can be much more useful since they bring a learning component to evaluating accuracy and can discover the expected relationship between the two fields as part of the model training process.
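For comparison, this is what the hand-written version of such a cross-field rule looks like for the birth year and age example. The function name and the `as_of_year` parameter are hypothetical; the point is that a steward would have to think of and write this rule explicitly, whereas an ML-enabled tool could learn the relationship from the data.

```python
def age_matches_birth_year(birth_year: int, age: int, as_of_year: int) -> bool:
    """True when the stated age is consistent with the birth year.

    A person's age is either (as_of_year - birth_year) or one year less,
    depending on whether their birthday has already passed this year.
    """
    expected = as_of_year - birth_year
    return age in (expected - 1, expected)
```

Checked as of 2024, a record with a birth year of 1972 and an age of 60 fails this rule, even though each field on its own would pass a range check.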
From the data management perspective, data observability during execution is becoming increasingly important since it shows what data is being fed to the model in real time. It also can provide “data lineage at execution” creating transparency into the data sources that were used to derive the result.
Addressing and mitigating bias is both challenging and important, especially in applications of AI that can have an impact on people (e.g., hiring, loan approvals, claims adjudication). From the data management perspective, at a very minimum, data observability and metadata management are key capabilities that create the foundation for discovering the biases that exist in data.
From the data management perspective, data rights are key considerations for responsible AI. Data rights encompass both data privacy considerations (e.g. did the customer agree to the use of their data for marketing purposes?) and data source management (e.g., are we ensuring we are not using copyrighted material or irresponsible content?).
Metadata management combined with data observability is key to managing both data rights and data source context, and they provide key transparency at the time of training and execution into the data used and where that data came from.
“With great power comes great responsibility…” – Uncle Ben Parker, Spider-Man
AI delivers significant power to the data management lifecycle and efficacy in that it:
Revolutionizes the accessibility of all data and information produced by both people and machines.
Delivers a new generation of AI-powered tools that can automate, enrich, and thus significantly accelerate the process of making data available and usable for any business purpose.
On the other hand, successful and responsible deployment of AI-enabled systems requires both expanding and rethinking data governance practices and processes. In a somewhat circular fashion, AI-enabled tools such as data observability, ML-enabled data quality, and metadata management are rapidly being created to make these new practices effective.
We are still in AI’s infancy, and its power will evolve significantly in the coming years. So will the challenges and risks. As our experience evolves and our understanding deepens, I expect we will be revisiting this topic again and again and coming up with new and, hopefully, even better answers.
About the Author:
Julia Bardmesser is CEO of Data4Real, LLC and a Board Advisor to several fintech startups. Bardmesser has over 25 years of experience in building technology and business capabilities that enable business growth, innovation, and agility.
She has led transformational initiatives as Head of Data and Architecture at Voya Financial and as a senior executive at Deutsche Bank, Citi, FINRA, and others.
Her recent awards include the Engatica 2023 list of the World’s Top 200 Business and Technology Innovators; a 2022 WLDA Changemaker in AI award; and CDO Magazine’s List of Global Data Power Women three years in a row.
Bardmesser holds a Master of Arts in Economics from New York University.