The Challenge of Connecting Technical with Business Metadata
(Europe) Over the last couple of years, data catalogs have evolved from a “nice to have” data management tool into a key requirement for organizations. Consequently, more and more company partners of the Software Competence Center Hagenberg (SCCH) seek our help either in choosing the most suitable data catalog for their organization or in setting up an existing one. In our experience, data leaders often presume that metadata management can be fully automated. However, no single data management tool can fully substitute for domain experts, which is why we have identified a comprehensive data governance strategy as a success factor for deploying a data catalog.
This article clarifies the difference between technical and business metadata and why both kinds are required. This distinction is important for CDOs and data leaders to assess the degree of automation that can realistically be achieved when deploying a data catalog.
Data catalogs aim to solve the complex task of metadata management. Metadata is “data that defines or describes other data” (Gartner 2016). Both concepts have their origin in library management, where the term “catalog” described the index containing all information on the books (e.g., each book’s location or content), that is, the metadata (Gartner 2016). Philip Dutton, co-founder of Solidatus, likewise uses the analogy of a library to explain metadata and concludes that “metadata gives organizations the ability to find and understand the data that one is looking for.” Prof. John Talburt goes one step further and calls the “deliberate act of creating and distributing data with missing or inadequate metadata annotation” data littering, underscoring the absolute necessity for organizations to care about proper metadata annotation.
Indeed, there are many data catalog tools on the market that promise to solve this task, for example, Collibra Data Catalog, Informatica Enterprise Data Catalog, Alation, or the open source catalog DataHub, to name only a few. The diversity of data catalog tools inevitably led to different opinions about which functionalities must be provided by these tools (Korte et al. 2019; Zaidi et al. 2017). As a result, it is not always clear what can be expected from a data catalog tool.
Following a systematic literature review, our research group working at SCCH and Johannes Kepler University Linz defined the core functionalities of a data catalog, which are among others (1) technical metadata management, and (2) business metadata or “business context” management (Ehrlinger et al. 2021).
Technical metadata is metadata that can be collected automatically from the IT database infrastructure. In other words, technical metadata comprises the physical characteristics of data within a database, e.g., column names, data types, or whether a column may contain NULL values. Metadata falls into three distinct types (Gartner 2016; Quimbert et al. 2020):
Descriptive metadata: for example, a title (“customer name” for a column name), description, author, or creation date of a data asset.
Administrative metadata: describes inherent data asset characteristics, e.g., file format, data types (e.g., character), text encoding, or rights metadata like access or copyrights.
Structural metadata: describes how data assets relate to each other (e.g., that a customer from the customer table is connected to the corresponding invoices in the invoice table).
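To make the idea of automatic collection concrete, the following is a minimal sketch of harvesting technical metadata (column names, data types, NULL constraints) from a relational database, using Python’s built-in sqlite3 module. The schema, table, and column names are invented for illustration; a real data catalog would query each source system’s own information schema.

```python
import sqlite3

# Build a tiny example schema in memory (hypothetical tables for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        cust_nm TEXT NOT NULL
    );
    CREATE TABLE invoice (
        inv_id  INTEGER PRIMARY KEY,
        cust_id INTEGER REFERENCES customer(cust_id),
        amount  REAL
    );
""")

def collect_technical_metadata(conn):
    """Harvest column names, data types, and NULL constraints for every table."""
    catalog = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = []
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        for _, name, dtype, notnull, _, pk in conn.execute(
                f"PRAGMA table_info({table})"):
            columns.append({"column": name,
                            "type": dtype,
                            "nullable": not notnull,
                            "primary_key": bool(pk)})
        catalog[table] = columns
    return catalog

metadata = collect_technical_metadata(conn)
```

The resulting dictionary captures descriptive metadata (names) and administrative metadata (data types, constraints); structural metadata such as the foreign-key link between invoice and customer could be harvested analogously via `PRAGMA foreign_key_list`.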
However, merely collecting all technical metadata into a central repository does not provide direct value to business users. The metadata typically arrives in its original format, which means it is as inconsistent and diverse as the actual data; it needs to be cleaned, enriched, and ideally brought into a useful form (e.g., a business data model) before domain experts and non-IT staff can benefit from it.
To leverage the full potential of automation with data catalogs, it is essential that all kinds of metadata are made available in machine-readable form. Apart from the automatically collected (technical) metadata, a lot of knowledge resides within the people of an organization, that is, the “business context” or business metadata. To allow non-technical users to find and truly understand data (cf. Dutton 2022), it is therefore essential to enrich the technical metadata with business metadata in the form of a company-wide terminology that is independent of specific data sources or technologies. This can either be done through the establishment of a business glossary, or through the annotation of the technical metadata with business context attributes. Table 1 clarifies the difference between technical and business metadata by an example.
Table 1: The difference between technical and business metadata.
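A business glossary can be sketched as a simple mapping from company-wide terms to their definitions, stewards, and the technical columns that implement them. The terms, column paths, and steward role below are hypothetical, chosen only to illustrate how a glossary entry annotates technical metadata:

```python
# Hypothetical business glossary: maps a company-wide term to its definition,
# its responsible steward, and the technical columns that implement it.
business_glossary = {
    "Customer Name": {
        "definition": "The full legal name of a customer.",
        "steward": "Sales Data Steward",
        "mapped_columns": ["crm.customer.cust_nm", "billing.invoice.cust_name"],
    },
}

def business_term_for(column_path, glossary):
    """Return the business term annotating a technical column, if any."""
    for term, entry in glossary.items():
        if column_path in entry["mapped_columns"]:
            return term
    return None
```

Note that the mapping itself (which columns mean “Customer Name”) is exactly the knowledge that resides with domain experts and cannot be harvested automatically.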
Lessons Learned and Takeaways
A data catalog is not a silver bullet that solves all of an organization’s data challenges. Yet it is a very useful data management tool, which (1) supports the automated gathering of technical metadata and (2) provides an interface for business users to add contextual information. According to our research, all data catalog tools support the first functionality, but not necessarily the second; business context annotations are not always considered part of metadata management.
The Challenge of Connecting Technical and Business Metadata
This discussion brings me back to the challenge outlined in the headline. In our experience, it is relatively easy to automatically collect all technical metadata (i.e., structure, data types, timestamps) from the data sources and to store it in a central repository (i.e., a data catalog) – see Figure 1. However, structuring, cleaning, and annotating technical metadata with business context still remain tasks that involve people. A data catalog supports people in these tasks, because complete automation is not yet possible.
Figure 1: Connecting technical metadata from the IT infrastructure with business context from domain experts.
As a result, it is of utmost importance to build a comprehensive data governance strategy around the deployment of a data catalog to support organizations in finding and understanding all their data. Currently, no single data management tool can substitute for domain expertise; it can only support domain experts in connecting their knowledge to technical metadata. This is why our research group also defined “user responsibility roles” (e.g., data steward assignment) as a core functionality of data catalogs (Ehrlinger et al. 2021). In the future, report catalogs that automatically profile business reports and their underlying logic might provide the link between technical and business metadata.
References
Dutton, P. (2022). Metadata Gives Organizations the Ability To Find and Understand Data. In CDO Magazine (online).
Ehrlinger, L., Schrott, J., Melichar, M., Kirchmayr, N., & Wöß, W. (2021). Data Catalogs: A Systematic Literature Review and Guidelines to Implementation. In International Conference on Database and Expert Systems Applications (pp. 148-158). Springer, Cham.
Gartner, R. (2016). Metadata. Springer.
Korte, T., Fadler, M., Spiekermann, M., Legner, C., Otto, B. (2019). Data Catalogs - Integrated Platforms for Matching Data Supply and Demand. Reference Model and Market Analysis (Version 1.0). Fraunhofer Verlag, Stuttgart.
Talburt, J. (2022). Data Speaks for Itself: Data Littering. In The Data Administration Newsletter (online).
Quimbert, E., Jeffery, K., Martens, C., Martin, P., & Zhao, Z. (2020). Data Cataloguing. In Towards Interoperable Research Infrastructures for Environmental and Earth Sciences (pp. 140-161). Springer, Cham.
Zaidi, E., De Simoni, G., Edjlali, R., & Duncan, A. D. (2017). Data Catalogs Are the New Black in Data Management and Analytics. Gartner Research (online).
About the Author
Lisa Ehrlinger is a Senior Researcher at the Software Competence Center Hagenberg (SCCH) in Austria. In her role, she drives the scientific development and management of the Data Management and Data Quality research focus. SCCH is an internationally recognized research center that integrates fundamental research with practical application at the intersection of data science and software science. Since 2016, Ehrlinger has also been a lecturer at Johannes Kepler University (JKU) Linz, Austria, teaching databases, information systems, and ontology engineering.
A proven data expert, Ehrlinger has more than 12 years’ experience as an IT practitioner and more than seven years’ scientific experience as a researcher in fundamental as well as applied research projects. Her research interests and publications cover the topics of data quality, data catalogs, metadata management, knowledge graphs, and information integration.
Ehrlinger has served as a program committee member for several scientific workshops (GraphSM, MLKgraphs, QEKGraph) and the International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA) since 2019. She also presented her most recent research on data quality tools at the MIT CDOIQ Symposium in 2019 and 2020.
Ehrlinger received her diploma in computer science and her Ph.D. in Automated Continuous Data Quality Measurement from JKU.
Ehrlinger is a Europe region member of the CDO Magazine Global Editorial Board.