Opinion & Analysis
Written by: dsocietydev
Updated 11:12 AM UTC, Wed August 9, 2023
In October 2022, I wrote about the challenge organizations face in connecting technical with business metadata when deploying a data catalog tool. As MIT CDOIQ Program Director Richard Wang points out, the article highlighted the importance of building a data governance strategy around a data catalog tool. Since data governance is mainly about communication (Hopwood, 2008), people and their responsibilities for creating and maintaining metadata are a central aspect of any data governance strategy.
In this article, I explain…
…why it is essential to consider data assets and people in the same strategy…
…and how to automate the interaction between them.
A motivation from industry: The role of domain experts
One of my longest research collaborations at the Software Competence Center Hagenberg (SCCH) is with voestalpine Stahl GmbH, an international steel manufacturer (Bechny et al. 2021, Ehrlinger et al. 2018). Voestalpine operates a factory-owned power plant to support its steel production and performs energy price forecasts every 15 minutes. Our task at SCCH was to identify outliers in the data before the actual forecast is calculated. An outlier is a data point that is considered abnormal, i.e., far from the rest of the data.
The identification of such an outlier by statistical methods does not automatically indicate the validity of the data point as outliers can have different causes:
The outlier can be an invalid data point, which means that the data does not properly represent the real-world entity it describes. This is what Wand and Wang (1996) describe as a “data quality problem.”
The outlier can also be a valid data point, which means it reflects the real world (the energy data in our use case) correctly. In this case, the outlier can still be an indicator of an error in the real world, e.g., an error in the production process that needs to be fixed.
While outliers can be detected fully automatically using machine learning (ML), domain experts are required to determine whether an outlier is an error in the data or an error in the real-world process the data represents (i.e., a product or process error).
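To make the division of labor concrete, here is a minimal sketch of the automatic half of this workflow: a simple z-score detector flags candidate outliers, while classifying each flagged point as a data error or a genuine real-world anomaly remains a manual, expert task. The function name, threshold, and sample readings are illustrative assumptions, not the actual voestalpine/SCCH pipeline.

```python
# Minimal sketch: statistical outlier detection as a first, automatic step.
# All names, thresholds, and values are illustrative assumptions.
from statistics import mean, stdev

def detect_outliers(values, z_threshold=2.5):
    """Return indices of points more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if sigma > 0 and abs(v - mu) / sigma > z_threshold]

# Detection is automatic, but classification is not: each flagged point still
# needs a domain expert to decide whether it is an error in the data or a
# genuine (but anomalous) event in the production process.
readings = [10.1, 10.3, 9.9, 10.0, 55.0, 10.2, 9.8, 10.1, 10.0, 10.2]
print(detect_outliers(readings))  # → [4] (the spike is flagged for expert review)
```

A statistical threshold like this can only say that a point is unusual; it cannot say why, which is exactly where the domain expert comes in.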
Background: The role of data and knowledge
Data are facts and symbols that represent some object in the real world, for example, sensor measurements, a product description, or a business process representation. These representations are only models and therefore always different from the real object they represent.
Drawing conclusions from data requires context. In a human conversation, we can ask questions when we are not sure whether we have understood something correctly. An artificial intelligence (AI) does not (yet) have this ability, but makes statements and predictions based on the data at hand. Hence, it is essential to provide as much context to the data as possible.
Here, we speak of information, that is, collections or uses of data (Sebastian-Coleman 2012). In the end, data only has meaning in a specific context. Context can be provided by metadata, which is explicit knowledge about data since it documents what the data is intended to represent, how it was designed, how it can be used, and how it should not be used.
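The four aspects of metadata named above (what the data represents, how it was designed, how it can and cannot be used) can be captured in a machine-readable record. The sketch below is an illustrative assumption, not a standard metadata schema; the class, field names, and example values are invented for this article's energy use case.

```python
# Hedged sketch: metadata as explicit, machine-readable context for a data asset.
# The schema and example values are illustrative, not a standard.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str
    represents: str       # what the data is intended to represent
    designed_by: str      # how (and by whom) it was designed
    intended_use: str     # how it can be used
    limitations: str      # how it should not be used
    tags: list = field(default_factory=list)

meta = DatasetMetadata(
    name="plant_energy_15min",
    represents="Energy data of the factory-owned power plant, 15-minute intervals",
    designed_by="Sensor ingestion pipeline; plant operations team",
    intended_use="Energy price forecasting",
    limitations="Raw values; outliers not yet validated by domain experts",
    tags=["energy", "time-series"],
)
```

A record like this turns implicit knowledge in someone's head into explicit context that both people and tools (such as a data catalog) can query.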
Data responsibility roles
Who is responsible for providing metadata or context to data?
There is a clear answer to this question: the responsible people. One of the main reasons for poor data quality is the lack of responsibility employees feel for a particular data set. In their study, Labadie et al. (2020) found that the data steward is the most important data responsibility role for companies. A data steward is a business role that understands the respective data and its value and can be held accountable for its quality.
Further optional roles include data owner (accountable for the data), data architect, data scientist, and solution architect. The leadership role of the Chief Data Officer (CDO) is specifically dedicated to the implementation of a data catalog since this ‘data leader’ is also responsible for shaping the vision and strategy of data governance within an organization.
Establishing responsibility rules is therefore one of the main aspects of successful metadata maintenance and governance. In other words, the success of a data catalog depends to a large extent on the people who maintain it.
Takeaways – the role of the CDO is to coordinate all relevant people. More specifically, at least the following four actions are expected:
Know all relevant stakeholders (e.g., data scientists, managers, domain experts) and their expectations of the data
Assign the right people with the right role to the respective data assets
Use incentives to encourage people to annotate and maintain metadata
Continuously evaluate the assignment of data responsibility roles
The main challenge will be to synchronize the language between different stakeholders (e.g., between domain experts and IT people) and to ensure that everyone has the same point of view. Without a clear assignment of the right people to data, it is not possible to successfully deploy a data catalog.
Conclusion and outlook
In an era of digitalization, organizations can only succeed if they manage to automate the communication between data assets, data consumers, and data producers. For this purpose, a decentralized platform is required (see data mesh architecture) that connects…
…the underlying database infrastructure storing all the data (i.e., technical metadata),
…the people consuming the data (e.g., data scientists, business analysts), and
…the people producing the data or overseeing the data production process (e.g., data owner).
Such a platform ideally provides all technical and business metadata in a machine-readable form and assigns data consumers and data producers to the respective data assets. Data consumers should be able to find and access the data (data democratization). They will be the first to recognize errors in the data since they work with it and should therefore be enabled to easily report them via the platform.
Data producers, on the other hand, might know the cause behind the error due to their knowledge of the data production process and therefore can provide an explanation or a possible fix to the problem.
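The consumer/producer interaction described above can be sketched as a small data structure: consumers report suspected errors against a data asset via the platform, and only the assigned producers can respond with a cause or fix. All class, field, and method names here are assumptions for illustration, not the API of any real data mesh platform.

```python
# Illustrative sketch of the platform-mediated interaction between data
# consumers and data producers; names and fields are assumptions.
from dataclasses import dataclass, field

@dataclass
class IssueReport:
    asset: str
    reported_by: str      # data consumer who spotted the problem
    description: str
    status: str = "open"
    resolution: str = ""

@dataclass
class DataAsset:
    name: str
    producers: list       # people who can explain or fix the data
    consumers: list
    issues: list = field(default_factory=list)

    def report_issue(self, consumer, description):
        """A consumer flags a suspected data error via the platform."""
        issue = IssueReport(self.name, consumer, description)
        self.issues.append(issue)
        return issue

    def resolve_issue(self, issue, producer, resolution):
        """Only an assigned producer may explain or fix the reported error."""
        assert producer in self.producers
        issue.status = "resolved"
        issue.resolution = resolution

asset = DataAsset("plant_energy_15min", producers=["data_owner"], consumers=["analyst"])
issue = asset.report_issue("analyst", "Spike at 03:15 looks implausible")
asset.resolve_issue(issue, "data_owner", "Sensor recalibrated; value corrected")
```

Encoding the role assignment in the asset itself is what makes the interaction automatable: the platform always knows who to notify when a consumer reports a problem.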
The aim of a data governance strategy is to clearly define and document this interaction, which is automated through the platform. This central interaction institutionalizes metadata annotation and hence enables long-term improvements in data quality across the organization. As a result, the automated interchange, integration, and analytics of data will receive a tremendous boost.
About the Author
Lisa Ehrlinger is Senior Researcher at the Software Competence Center Hagenberg (SCCH) in Austria, where she leads the Data Management and Data Quality team. Since 2016, Ehrlinger has also been a Lecturer at Johannes Kepler University (JKU), Linz, Austria, on databases, information systems, and ontology engineering.
A proven data expert, Ehrlinger has more than 12 years of experience as an IT practitioner and more than seven years of scientific experience as a researcher in fundamental as well as applied research projects. Her research interests and publications cover the topics of data quality, data catalogs, metadata management, knowledge graphs, and information integration.
Ehrlinger received her diploma in computer science and her Ph.D. in Automated Continuous Data Quality Measurement from JKU. Ehrlinger is a Europe region member of the CDO Magazine Global Editorial Board.