From Chaos to Clarity — A Robust Management Framework for Establishing Agile DataOps

Modern enterprises continuously create, ingest, transform, distribute, archive, and often destroy a wide range of data and information in the course of normal operations. The volumes of structured and unstructured data involved are enormous, and typically include information about customers, staff, events, legal agreements, products, services, transactions, operations, risk, and finances.

These enterprises exchange data and information continuously with commercial, government, and academic counterparties, as well as with internally disparate organizations, such as vertically, horizontally, or geographically segregated groups.

Rarely do these enterprises have full transparency into internal and external data operations. Many organizations are hopelessly siloed or “stove-piped,” and managers and staff struggle to find, access, interoperate with, and reuse data held by other organizations across the enterprise.

Only recently have leadership and management begun to realize that these data sets are tangible operational assets with real direct and indirect financial value, rather than byproducts of technology processes and operational sunk costs or liabilities.

To find, access, interoperate with, and reuse information seamlessly, smart enterprises are beginning to catalog and organize their internal structured and unstructured data. As these enterprises and their internal organizations become fluent in their data assets and data products, they can begin to exchange metadata, data, and information with external counterparties and dramatically improve performance with Agile DataOps.

What is "Agile DataOps?"

Agile DataOps is a methodology that combines agile principles, data engineering, and DevOps to streamline the process of developing, deploying, and managing data-intensive applications and workflows. It focuses on automating and optimizing end-to-end data pipelines, from data creation or ingestion to consumption, using a continuous delivery and improvement approach.

The objective of DataOps is to enable seamless collaboration between data engineers, data scientists, operational users, and other stakeholders, while ensuring data quality, security, and compliance. By implementing DataOps principles and practices, organizations can reduce time-to-market, improve agility, and increase the value of data assets.

The reality is that IT and operations management must be able to understand the full scale and breadth of data coursing through these systems, including business priorities and definitions, statistics, legacy and current transformations, access controls for specific data collections, Data Governance priorities, and usage policies. IT is in a tough spot. This is where a strategic data organization can help.

The ideal solution for the agile, automated enterprise is to establish an “Agile DataOps” Governance program and set up and maintain a catalog of enterprise data, with metadata and business definitions from each source system across the internal data platform. With a mature data organization, the enterprise will catalog data and metadata it receives from its counterparties.

Doing so helps the enterprise understand and contextualize data flows, data content, and lineage, and enables more seamless integration and appropriate mapping to internal data collections. In the Agile DataOps framework, this materializes as, effectively, a catalog of catalogs, sometimes described as “Amazonification” or a Hyper-Catalog.
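
To make the “catalog of catalogs” idea concrete, here is a minimal Python sketch of how source catalogs might be federated behind a single search surface. All class, catalog, and dataset names are illustrative assumptions, not references to any particular product.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatasetEntry:
    """Metadata for one data collection registered in a source catalog."""
    name: str
    owner: str                      # accountable business owner or steward
    business_definition: str        # plain-language definition for the glossary
    source_system: str              # system of record the data originates from
    tags: List[str] = field(default_factory=list)

@dataclass
class SourceCatalog:
    """Catalog maintained by one internal platform or one external counterparty."""
    catalog_id: str
    datasets: Dict[str, DatasetEntry] = field(default_factory=dict)

    def register(self, entry: DatasetEntry) -> None:
        self.datasets[entry.name] = entry

class HyperCatalog:
    """A catalog of catalogs: federates many source catalogs behind one search surface."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, SourceCatalog] = {}

    def attach(self, catalog: SourceCatalog) -> None:
        self._catalogs[catalog.catalog_id] = catalog

    def search(self, term: str) -> List[DatasetEntry]:
        """Find datasets across every attached catalog by name, definition, or tag."""
        term = term.lower()
        return [
            entry
            for catalog in self._catalogs.values()
            for entry in catalog.datasets.values()
            if term in entry.name.lower()
            or term in entry.business_definition.lower()
            or any(term in tag.lower() for tag in entry.tags)
        ]

# Federate an internal catalog and a counterparty catalog, then search both at once.
crm = SourceCatalog("crm_catalog")
crm.register(DatasetEntry("customers", "Sales Ops", "Active customer accounts", "CRM", ["customer", "pii"]))

partner = SourceCatalog("partner_feed_catalog")
partner.register(DatasetEntry("shipments", "Logistics", "Inbound shipment events from partners", "EDI", ["supply-chain"]))

hyper = HyperCatalog()
hyper.attach(crm)
hyper.attach(partner)
print([e.name for e in hyper.search("customer")])   # ['customers']

In practice, a commercial catalog would add harvesting connectors, versioning, and access workflows; the point here is only that the Hyper-Catalog stores metadata about other catalogs rather than the data itself.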

Agile DataOps: The vision

In the modern enterprise, authorized users will be able to access comprehensive, integrated, curated information about anything to which they are entitled, whether they need it to do their job or to satisfy meaningful curiosity. They will have access to the complete inventory of available data from internal systems and appropriate external counterparties, organized by source, subject, and operational use case.

Information discovery using natural language will be commonplace: a faceted search paradigm will allow users to see the most contextually relevant content and search results, along with the validating metadata that aids their analysis of operational use cases. In some cases, users will be able to generate high-quality prose, images, and software code suitable for business use from curated data sets managed by the organization. Data Governance will be highly automated.
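
Faceted search over catalog metadata is easy to picture in code. The following sketch uses a toy in-memory index with invented entries and facet names; a real catalog would back this with a search engine and natural-language query parsing.

from collections import defaultdict
from typing import Dict, List

# Toy index of catalog entries with facet metadata (all names are illustrative).
ENTRIES = [
    {"name": "customers_curated", "source": "CRM", "subject": "customer", "quality": "gold"},
    {"name": "orders_raw", "source": "ERP", "subject": "transaction", "quality": "bronze"},
    {"name": "orders_curated", "source": "ERP", "subject": "transaction", "quality": "gold"},
]

def faceted_search(entries: List[Dict[str, str]], **facets: str) -> List[Dict[str, str]]:
    """Return entries matching every requested facet value (e.g. subject='transaction')."""
    return [e for e in entries if all(e.get(k) == v for k, v in facets.items())]

def facet_counts(entries: List[Dict[str, str]], facet: str) -> Dict[str, int]:
    """Counts shown alongside results so users can narrow the result set further."""
    counts: Dict[str, int] = defaultdict(int)
    for e in entries:
        counts[e[facet]] += 1
    return dict(counts)

results = faceted_search(ENTRIES, subject="transaction", quality="gold")
print([r["name"] for r in results])          # ['orders_curated']
print(facet_counts(ENTRIES, "quality"))      # {'gold': 2, 'bronze': 1}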

The catalog will define data access controls, or apply existing policies, so that staff understand the process and requirements for gaining access to each collection of data. It will help users differentiate between authoritative and secondary sources and understand, via graph-powered lineage, how data flows through the organization, including its provenance, how and when it is transformed, and where to find good-quality, curated data that is ready for use. And it will suggest the next best course of action for myriad data user personas.
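
Graph-powered lineage and collection-level access policies can be illustrated with a small sketch. The dataset names, lineage edges, and policy model below are assumptions made purely for illustration.

from typing import Dict, List, Set

# Directed lineage edges: upstream dataset -> downstream datasets (illustrative names).
LINEAGE: Dict[str, List[str]] = {
    "crm.customers_raw": ["warehouse.customers_staged"],
    "warehouse.customers_staged": ["warehouse.customers_curated"],
    "erp.orders_raw": ["warehouse.orders_curated"],
    "warehouse.customers_curated": ["reporting.customer_360"],
    "warehouse.orders_curated": ["reporting.customer_360"],
}

# Access policy attached to collections rather than individual users (assumed model).
ACCESS_POLICY: Dict[str, str] = {
    "reporting.customer_360": "role:analyst",
    "warehouse.customers_curated": "role:data_engineer",
}

def upstream(dataset: str) -> Set[str]:
    """Walk the lineage graph backwards to find every source feeding a dataset (its provenance)."""
    parents = {u for u, children in LINEAGE.items() if dataset in children}
    found = set(parents)
    for p in parents:
        found |= upstream(p)
    return found

# Five upstream sources: the raw CRM and ERP feeds plus the staged and curated tables between them.
print(upstream("reporting.customer_360"))
print(ACCESS_POLICY.get("reporting.customer_360", "open"))   # role:analyst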

It will be cloud-native and knowledge-graph-based so that it can scale to thousands or millions of users, support a diverse set of use cases, and connect to any type of data source. These critical capabilities describe how the catalog will simplify the concept of operations for data users.

They drive a variety of critical business use cases including the following:

Data discovery: Self-service access to a comprehensive inventory of trusted data assets and metadata to make informed business decisions. Beyond the data itself, the discovery process will include key concepts, definitions, users and personas, and insights derived from the data and metadata being examined. This holistic approach to data discovery gives the user a fully contextualized search result set with which to work. And there will be no “data littering!” “Just as data have gained parity with software, metadata needs to have full standing with data,” and data without full supporting metadata creates a mess. Don’t litter.

Agile Data Governance: Empower key stakeholders from across the business to participate in an inclusive data enablement process and enable organizations to create and curate data products. This process must strike a balance between Positive Data Control™ and enabling speed and efficiency when deploying data in response to difficult or complicated business questions.

Cloud Data Migration: Cloud migration requires data access, policies, and metadata to remain consistent throughout the process. Catalogs provide insight into your on-premises and cloud data, so data teams can make smart decisions about what to migrate and when. A modern data infrastructure is a living, breathing organism, and users need the catalog as the “front office” for their data operations, even as the underlying infrastructure constantly changes and evolves over time.

Semantic layer: Bridge the gap between how data consumers understand the business and how companies store the data. Business terminology can be represented in the catalog as concepts and relationships that are understandable by people and machines in the same way. Often the relationships between data, concepts, and the people working on the data are just as important as the data itself. A minimal sketch of such a concept-to-asset mapping follows this list.

Agile DataOps: The catalog will have a key role in change management and daily operations by reading, recording, and conveying changes to the data platform that will affect the way the “lights on” operational systems interoperate. This functional capability will help enable true operational artificial intelligence at machine speed.
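
As referenced in the semantic layer item above, here is a minimal Python sketch of business concepts mapped to the physical assets that implement them. The concept names, definitions, and table paths are hypothetical.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Concept:
    """A business term defined once and linked to the physical assets that realize it."""
    name: str
    definition: str
    physical_assets: List[str] = field(default_factory=list)   # tables/columns implementing the concept
    related_to: List[str] = field(default_factory=list)        # other concept names

class SemanticLayer:
    """Maps shared business vocabulary onto stored data so people and machines resolve terms the same way."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}

    def define(self, concept: Concept) -> None:
        self._concepts[concept.name] = concept

    def resolve(self, term: str) -> List[str]:
        """Answer 'where is this business term actually stored?'"""
        concept = self._concepts.get(term)
        return concept.physical_assets if concept else []

layer = SemanticLayer()
layer.define(Concept(
    name="Active Customer",
    definition="A customer with at least one transaction in the trailing 12 months.",
    physical_assets=["warehouse.customers_curated.status", "reporting.customer_360.is_active"],
    related_to=["Customer", "Transaction"],
))
print(layer.resolve("Active Customer"))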

In today's data-driven world, organizations face significant challenges in managing and leveraging the vast amounts of data they accumulate. Traditional centralized approaches to Data Governance have proven to be inadequate for handling the scale and complexity of modern data ecosystems. As a result, new paradigms such as Data Mesh and Data Fabric have emerged to address these challenges. 

Building on the Agile DataOps (ADO) principles, Data Mesh and Data Fabric emerge as paradigms that empower teams to take ownership of data products while facilitating centralized data controls and seamless integration of enterprise sources. Both frameworks enable Agile DataOps and manifest the “producer-consumer” model.

The Data Mesh

Data Mesh seeks to empower individual teams within an organization to take ownership of data products originated or ingested by their domains.

What is "Data Mesh?"

Data Mesh is a decentralized approach to Data Governance and a conceptual framework for building decentralized, scalable data architectures within complex organizations. It emphasizes treating data as a product and promotes the creation of cross-functional data teams that take ownership of specific data domains. These teams are responsible for the end-to-end data delivery process, from data ingestion to consumption, and operate in a self-organizing and autonomous way.

The Data Mesh approach also emphasizes standardizing data infrastructure and data access, and using domain-driven design principles to structure data into business-aligned domains. Ultimately, the goal of Data Mesh is to create a more flexible, efficient, and scalable data architecture that can better support the needs of modern data-driven organizations.

Rather than relying on a single, centralized data team for all data management, Data Mesh distributes data responsibility to the teams closest to the data source, referred to as “data product teams.” These teams are composed of domain experts and stewards who understand the nuances and context of the data they work with.

Data Mesh is built on the following principles:

  • Domain-oriented ownership

Data product teams take ownership of domain-specific data assets, such as products and services, counterparties, transactions, people, and financial information, including data quality, data modeling, and data pipelines.

  • Federated computational governance

Data product teams, necessarily cross-functional, are responsible for the computational aspects of their data products, enabling them to choose the appropriate technologies and tools that suit their specific needs.

  • Self-serve data infrastructure as a platform

Data Mesh provides a data infrastructure platform that enables self-serve capabilities for reporting, analytics, native integration, and hypothesis testing, allowing data product teams to manage their data independently.

  • Product-centric culture

Data products are treated as first-class products, assets with tangible value, with a focus on user experience, documentation, and discoverability. 

Data Mesh enhances Data Governance by promoting data ownership and accountability within the organization. It encourages collaboration, reduces bottlenecks, and enables faster innovation by empowering teams to make data-driven decisions within their respective domains.
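
One way to picture how these principles fit together is a sketch of a domain-owned data product being admitted to the mesh under federated policies. The descriptor fields and policy checks below are illustrative assumptions, not a standard schema.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataProduct:
    """A domain-owned data product published to the mesh (all names are illustrative)."""
    name: str
    domain: str                      # owning domain team, e.g. "sales"
    owner: str                       # accountable product owner / steward
    output_port: str                 # where consumers read it, e.g. a table or topic
    sla_freshness_hours: int
    metadata: Dict[str, str] = field(default_factory=dict)

# Federated computational governance: global policies applied uniformly,
# while each domain keeps control of how its product is built.
GLOBAL_POLICIES: List[Callable[[DataProduct], bool]] = [
    lambda p: bool(p.owner),                        # every product has an accountable owner
    lambda p: "classification" in p.metadata,       # every product is classified (pii, internal, public)
    lambda p: p.sla_freshness_hours <= 24,          # freshness SLA within the enterprise standard
]

def admit_to_mesh(product: DataProduct) -> bool:
    """A product joins the mesh only if it satisfies every federated policy."""
    return all(policy(product) for policy in GLOBAL_POLICIES)

orders = DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team",
    output_port="warehouse.sales.orders_v1",
    sla_freshness_hours=6,
    metadata={"classification": "internal"},
)
print(admit_to_mesh(orders))   # True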

The Data Fabric

Data Fabric relies on a singular, centralized collection of data controls that enable a data services layer to integrate data from enterprise sources for many purposes.

What is "Data Fabric?"

Data Fabric, like Data Mesh, is an approach to modernizing data architecture. Data Fabric differs from Data Mesh in that it is a centralized, highly controlled approach that aims to provide a unified and consistent view of data across an organization. Both approaches emphasize data agility, scalability, and interoperability, but Data Fabric focuses on cross-functional integration, while Data Mesh describes autonomous, self-organizing data teams that enable better alignment with localized business needs.

Data Fabric ensures management controls are robust, drives the adoption of "data sharing frameworks," and reduces organizational silos, while Data Mesh accelerates time to data but reduces cross-functional integration and can spawn redundancies and data quality challenges.

Ultimately, the choice between Data Fabric and Data Mesh depends on the specific needs and goals of each organization.

Data Fabric is a unified and integrated data architecture that aims to provide a cohesive view of data assets across an organization. It acts as an abstraction layer that enables seamless data integration, access, and governance.

Data Fabric is built on the following principles and methods:

  • Data integration

Data Fabric facilitates the integration of diverse data sources, both within and outside the organization, ensuring data consistency and accuracy.

  • Data access and discovery

It provides a centralized framework for accessing and discovering data assets, allowing users to find and retrieve relevant data efficiently.

  • Data governance and security

Data Fabric enforces Data Governance policies, ensuring compliance with regulations, privacy guidelines, and data security standards across the entire data landscape.

  • Data quality and master data management

It incorporates mechanisms for data quality management, data profiling, and master data management to ensure data accuracy and reliability.

Data Fabric strengthens Data Governance strategies by providing a holistic view of data assets, enabling consistent Data Governance practices, and facilitating data-driven decision-making at an enterprise level.
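
To contrast with the mesh sketch above, the following illustrates the fabric pattern: one centrally managed policy table and a single read path that delegates to per-source connectors. The dataset names, roles, and connectors are purely hypothetical stand-ins.

from typing import Callable, Dict, List

# A single, centrally managed policy table applied to every source the fabric virtualizes.
POLICIES: Dict[str, Dict[str, str]] = {
    "hr.employees":        {"min_role": "hr_analyst", "masking": "mask_pii"},
    "finance.gl_postings": {"min_role": "controller", "masking": "none"},
}

# Registered connectors, one per underlying system (toy stand-ins for real connectors).
CONNECTORS: Dict[str, Callable[[str], List[dict]]] = {
    "hr":      lambda table: [{"employee_id": 1, "ssn": "123-45-6789"}],
    "finance": lambda table: [{"posting_id": 9, "amount": 250.0}],
}

def read(dataset: str, role: str) -> List[dict]:
    """Single entry point: enforce the central policy, then delegate to the right connector."""
    policy = POLICIES.get(dataset, {})
    if policy.get("min_role") and role != policy["min_role"]:
        raise PermissionError(f"{role} may not read {dataset}")
    source, table = dataset.split(".", 1)
    rows = CONNECTORS[source](table)
    if policy.get("masking") == "mask_pii":
        rows = [{k: ("***" if k == "ssn" else v) for k, v in r.items()} for r in rows]
    return rows

print(read("hr.employees", role="hr_analyst"))   # SSN masked by the central policy
# read("hr.employees", role="intern")            # would raise PermissionError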

Key differences between Data Mesh and Data Fabric:

While both Data Mesh and Data Fabric contribute to Data Governance strategies, they differ in their approach and focus in the following areas:

  • Data ownership and responsibility

Data Mesh emphasizes decentralized data ownership, empowering domain-specific teams to manage their data products. In contrast, Data Fabric centralizes Data Governance and management, providing a unified view of data assets across the organization.

  • Data integration

Data Mesh primarily focuses on the computational aspects of data integration, with each data product team responsible for integrating data sources within their domain. Data Fabric, on the other hand, offers a centralized data integration framework that spans teams and systems.

  • Scope of governance

Data Mesh focuses on domain-specific governance, where individual teams have ownership and control over their data. Data Fabric provides a broader scope of governance, ensuring consistency, compliance, and security across the entire data landscape, including data integration, access, and discovery.

  • Flexibility vs. standardization 

Data Mesh promotes flexibility and autonomy by allowing teams to choose their technologies and tools. This enables teams to tailor their data infrastructure to their specific needs. Data Fabric, on the other hand, emphasizes standardization and uniformity, providing a consistent and integrated data environment across the organization.

  • Scale and complexity

Data Mesh is well-suited for large and complex organizations with diverse data domains and a need for agility and scalability. It addresses the challenges of managing distributed data ownership. Data Fabric is beneficial for organizations with a strong focus on data integration, consistency, and enterprise-wide governance.

  • User experience and discoverability

Data Mesh places a strong emphasis on treating data products as first-class products, focusing on user experience, documentation, and discoverability within individual domains. Data Fabric aims to provide a cohesive and centralized data discovery platform for accessing and discovering data assets across the organization.

Data Mesh and Data Fabric are two distinct approaches that contribute to Data Governance strategies, albeit with different focuses and principles. Data Mesh enables decentralized data ownership and empowers domain-specific teams, fostering collaboration and innovation. Data Fabric provides a centralized and integrated data architecture that ensures consistency, governance, and a holistic view of data assets across the organization.

What is the right pattern for your organization?

Organizations need to understand their business model and strategic vision explicitly to select and develop the right architectural pattern. This means mapping business processes to tactical requirements, the technology landscape, and the ideal governance approach, centralized or distributed, to decide which paradigm aligns best with their goals.

Data Mesh and Data Fabric each enable architectural and operational solutions for managing data effectively, enabling data-driven decision-making, and establishing robust Data Governance strategies in the ever-evolving landscape of data management.

A traditional approach to data management works to control costs while simultaneously deploying the next set of features to support emerging business requirements. It also struggles to maintain an array of legacy systems while complying with growing and changing business, legal, and regulatory requirements.

Often, these systems are poorly integrated, using a wide range of legacy methods and technologies, and there are massive amounts of duplication and data quality problems.

For example, when possible, most organizations prefer to buy “commercial off-the-shelf” (COTS) business systems rather than build new solutions. But as COTS products proliferate from different vendors and for different use cases, unexpected or unplanned integrations are needed.

Many organizations handle these integrations with manual processes and spreadsheets for lack of systemic integration, and these workarounds also begin to proliferate, often at an exponential rate. These hidden costs are massive; McKinsey estimates that manual integrations consume more than 30% of gross enterprise labor hours.

At the same time, operational organizations struggle to stay ahead of changing customer, market, and economic conditions by optimizing business processes supported by fragmented collections of data and information.

Savvy, modern organizations are now taking a different approach to data, treating these assets as first-class citizens, or products that have measurable value. The concept of DataOps, or Data Operations, has evolved over the past decade, and innovative leaders and technologists have begun to recognize the value of findable, accessible, integrated, and reusable data based on carefully governed data standards.

This has spurred new job roles, like data stewards, and new technologies, like modern data catalogs, which together are shaping the new paradigm that is facilitating data access and democratization.

Data stewards are data specialists, usually in business functions where manual integrations have been required, who know where to get and how to prepare good quality data, all while maintaining Governance and control over sensitive data.

The role of data steward has begun to appear in many organizations and is becoming a pivotal requirement in ensuring quality data is delivered in an efficient and governed manner. In practice, data stewards are human catalogs that help the business navigate complex processes and systems to get the data needed to operate.

Meanwhile, modern enterprises have turned to data catalogs to document this information exchange and make it easier for people to find, access, interoperate, and reuse data seamlessly, no matter the source system or locality. The catalog becomes the self-service instantiation of the data stewards’ work and expertise.
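
A simple way to see the catalog as the self-service instantiation of steward expertise is the sketch below, in which a steward's knowledge is captured once as a catalog entry and answered on demand. The fields, dataset name, and steward address are illustrative assumptions.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CatalogEntry:
    """What a data steward would otherwise explain by hand, captured once in the catalog."""
    dataset: str
    steward: str                 # the human expert accountable for the data
    authoritative: bool          # distinguishes the system of record from downstream copies
    preparation_notes: str       # how to get the data into a usable, governed shape
    access_process: str          # how a new consumer requests access

CATALOG: Dict[str, CatalogEntry] = {}

def publish(entry: CatalogEntry) -> None:
    CATALOG[entry.dataset] = entry

def how_do_i_use(dataset: str) -> List[str]:
    """Self-service answer to the questions a steward used to field one email at a time."""
    e = CATALOG[dataset]
    return [
        f"Steward: {e.steward}",
        f"Authoritative source: {'yes' if e.authoritative else 'no, see system of record'}",
        f"Preparation: {e.preparation_notes}",
        f"Access: {e.access_process}",
    ]

publish(CatalogEntry(
    dataset="warehouse.customers_curated",
    steward="jane.doe@example.com",
    authoritative=True,
    preparation_notes="Deduplicated nightly; join on customer_id, not email.",
    access_process="Request the 'customer-read' role via the access portal.",
))
print("\n".join(how_do_i_use("warehouse.customers_curated")))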

The growing and evolving data management landscape in complex organizations of all sizes demands a comprehensive, strategic approach that aligns with modern best practices and emerging technologies, enabling creativity without stifling cost-effective innovation and productivity.

Forward-thinking organizations recognize data as an asset with tremendous value and impact, enabling the mission and business objectives. The results empower organizations to reduce risks, enhance decision-making capabilities, improve operational efficiencies, amplify cost savings, and ultimately advance toward their target-state opportunities.

Authored with credit to Diane Schmidt, Ph.D., Chief Data Officer, Government Accountability Office, Daniel Pullen, Ph.D., Chief Data Scientist, Science Applications International Corporation, Patrick McGarry, General Manager, US Federal, data.world, and Steve Sieloff, VP of Product Development & Thought Leadership, Vérité Data.

About the Author:

Dr. Justin J Magruder is Chief Data Officer at SAIC, a Fortune 500 Information Technology Services company based in Reston, Virginia. Magruder is a pioneer and a thought leader in the field of data governance, master and reference data, and data operations, with more than 25 years supporting data operations, leaders, and decision makers to improve business performance through better data management.

He has led efforts at a number of world-class organizations to improve business, financial, and operational performance, to reduce costs and manage operational risks, and to improve the quality of customer, account, portfolio, and product data, as well as transaction data processing and analytics.

In his role with SAIC, he is continuously developing and leading implementation of its Enterprise Data Strategy, including Lakehouse and DataOps solutions to support Artificial Intelligence, Zero Trust, and Information Governance programs.
