Demystifying DataOps in the Delivery of a Modern Data Management Environment

A recent Pulse Q&A survey found that over 70% of companies are investing in DataOps. Is your organization one of them? DataOps is hailed by some as the next big thing, yet it remains a mystery to many. The term was first introduced to describe a set of best practices for improving the handoff between data science and IT operations, but it has since expanded to mean an overall approach to the delivery of analytics-ready data. Faster time to value? Yes. Higher data quality? Yes. Well understood and consistently implemented? No.

Lenny Liebmann coined the term DataOps in a 2014 blog post stressing the importance of adopting a set of best practices to reduce the complexity of moving data science models into production. His premise was that just as DevOps improved the coordination between software development and operations, DataOps would do the same for data science. The following year, Andy Palmer expanded the concept, describing DataOps as a discipline for improving the delivery of analytics-ready data across several key facets: data engineering, data integration, data quality, and data security/privacy.

Organizations have been managing data for decades, so what is driving the change to a new approach? The characteristics of the data are changing rapidly across volume, variety and variability, which puts enormous pressure on current data management approaches. The traditional approach to preparing data for analytics is Extract, Transform & Load (ETL), which dates to the 1970s and became the standard method for constructing data warehouses. Initially, ETL was implemented with custom programs; later, ETL tools provided workbenches that simplified development.

The traditional ETL process begins with a specification document that identifies the name and location of each source data attribute across the collection of operational databases. Next, the transformations required to align the data into a common format are documented. Finally, how the data is to be loaded into the target database is defined. The specification document is then handed to an ETL developer, who writes the programs that read from the source databases, rationalize the data, and write it into the target database. In this approach, every step must be defined in advance, and any change to the sources, transformations or targets requires the specification to be revised and new ETL programs to be developed.
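
The flavor of that specification-driven approach can be sketched in a few lines of Python. The table names, column names and rules below are hypothetical, and a simple list stands in for the warehouse table; the point is that every attribute, transformation and target must be spelled out before the job can run.

SPEC = {
    "extract": {                            # target attribute -> where it lives in the source
        "CUSTOMER_ID": {"table": "CUSTOMER", "column": "CUST_KEY"},
        "STREET":      {"table": "CUSTOMER", "column": "ADDR_LINE1"},
    },
    "transform": {                          # documented standardization rules
        "CUSTOMER_ID": str.strip,
        "STREET":      lambda value: value.strip().title(),
    },
    "load": {"target_table": "DW_CUSTOMER"},
}

def run_etl(source_row: dict, warehouse: list) -> None:
    extracted = {name: source_row[loc["column"]]            # Extract: only the listed columns
                 for name, loc in SPEC["extract"].items()}
    transformed = {name: SPEC["transform"][name](value)     # Transform: only the listed rules
                   for name, value in extracted.items()}
    warehouse.append(transformed)                           # Load: list stands in for DW_CUSTOMER

dw_customer: list = []
run_etl({"CUST_KEY": " 1001 ", "ADDR_LINE1": "1 MAIN ST"}, dw_customer)
print(dw_customer)   # [{'CUSTOMER_ID': '1001', 'STREET': '1 Main St'}]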

Let’s use a simple example to illustrate the traditional ETL approach. A fictitious company, HarveyCorp, wants to use data science to better understand its customers. The company is an insurance group comprising three divisions. The first is the core group, which has been in business for over 50 years; its customer data lives in a custom-developed system backed by an Oracle database. The second division was acquired five years ago, and its customer data is stored in a cloud-based CRM system. The third was acquired last year, and its customer data is stored in a SQL Server database. To keep the example simple, the data science team wants to perform demographic analysis of customers across the divisions, and all of the data sits in a single table in each system. The team requests only five data points: customer ID, street, city, state and zip code.

The data team must first create the specification document defining the data to acquire from each of the three systems. Most likely, the names of the data points differ in each system, so each must be uniquely specified for the development of the extraction program. Customer ID, for example, may be called CUSTNO in one system, ID in another, and CUST_NO in the third. The specification must also define how to rationalize across the three versions and, finally, describe the desired target. Any change to a source database will disrupt the entire process and cause it to fail.
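
A hand-maintained mapping of the kind that specification produces might look like the sketch below. Only the names CUSTNO, ID and CUST_NO come from the example; the division labels and the zip-code columns are made up for illustration.

SOURCE_MAPPINGS = {
    "core_oracle":   {"customer_id": "CUSTNO",  "zip_code": "ZIP"},
    "cloud_crm":     {"customer_id": "ID",      "zip_code": "POSTAL_CODE"},
    "sqlserver_crm": {"customer_id": "CUST_NO", "zip_code": "ZIPCODE"},
}

def rationalize(source: str, row: dict) -> dict:
    """Rename a source row's columns into the agreed target schema."""
    mapping = SOURCE_MAPPINGS[source]
    return {target_col: row[source_col] for target_col, source_col in mapping.items()}

print(rationalize("cloud_crm", {"ID": "A-42", "POSTAL_CODE": "30301"}))
# {'customer_id': 'A-42', 'zip_code': '30301'}

# If the cloud CRM renames ID in its next release, the mapping is stale and
# the job fails until the specification and the programs are reworked.
try:
    rationalize("cloud_crm", {"CUSTOMER_ID": "A-42", "POSTAL_CODE": "30301"})
except KeyError as missing:
    print(f"ETL job failed: expected source column {missing} not found")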

How does DataOps modernize this process? First, rather than using programs to extract pre-defined attributes, the DataOps approach uses pipelines that attach to the source data and flow all of it into a central repository. Second, once the data from the sources has been acquired, the attributes are rationalized using machine learning rather than pre-defined mappings. Third, and most importantly, changes to any of the source systems are detected and resolved by the DataOps process instead of causing failure. Handling change, such as the addition or removal of data attributes, known as schema drift, is built into the DataOps process, whereas the traditional ETL approach requires significant manual maintenance to absorb it.
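
As a simplified illustration of these three shifts, and not of any particular product, the sketch below ingests every column it is given and uses name similarity from Python's standard difflib module as a stand-in for the machine-learning matching a real DataOps platform would apply; columns it cannot match are flagged for review rather than breaking the run.

import difflib

# The five attributes the data science team asked for (from the example above).
TARGET_SCHEMA = ["customer_id", "street", "city", "state", "zip_code"]
NORMALIZED_TARGETS = {t.replace("_", ""): t for t in TARGET_SCHEMA}

def ingest(row: dict) -> dict:
    """Flow every source column through as-is; nothing is pre-selected."""
    return dict(row)

def propose_mapping(source_columns) -> dict:
    """Stand-in for ML-based rationalization: propose the closest target
    attribute for each source column based on name similarity."""
    proposals = {}
    for col in source_columns:
        normalized = col.lower().replace("_", "")
        match = difflib.get_close_matches(normalized, list(NORMALIZED_TARGETS), n=1)
        if match:
            proposals[col] = NORMALIZED_TARGETS[match[0]]
    return proposals

# A new CRM release renamed the customer key and added LOYALTY_TIER (schema
# drift); the pipeline still runs, maps what it can and flags the rest.
row = ingest({"CUSTOMER_ID": "A-42", "STREET": "1 Main St", "LOYALTY_TIER": "Gold"})
mapping = propose_mapping(row)
print(mapping)                  # {'CUSTOMER_ID': 'customer_id', 'STREET': 'street'}
print(set(row) - set(mapping))  # {'LOYALTY_TIER'} -> routed for review, not a failure

At the default similarity threshold, names like CUSTNO or CUST_NO from the earlier example would also land on customer_id, while a terse name such as ID would not; this is exactly where real platforms go beyond column names and profile the data values themselves.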

DevOps and DataOps have a causal relationship. The key aim of DevOps is to shrink the gap between making changes to an application and releasing those changes into production. One benefit of DevOps is the ability to deliver a minimum viable product (MVP) containing base capabilities and then rapidly iterate to add features over time. Implementing DevOps while retaining the traditional approach to data will cause frequent failures in the delivery of analytics-ready data due to schema drift. DataOps creates the resilience required to manage the frequent changes to operational data sources driven by frequent application releases.

Although the premise of DataOps as presented by Liebmann has greatly expanded, it is still fundamentally about reducing the complexity of moving data into production. Organizations that successfully implement DataOps will realize three key benefits:

1. Reduce the complexity of acquiring data by using pipelines instead of traditional ETL programs,

2. Reduce the complexity of, and accelerate, rationalization across data sources by using machine learning instead of traditional data mapping, and

3. Accelerate the delivery of analytics-ready data for data science through a modernized data management approach.

Understanding the DataOps process and its benefits removes the mystery as to why it is considered the next big thing in delivering analytics-ready data.

Mark Ramsey is the managing partner of Ramsey International, providing advisory services to global organizations in the design and delivery of an ecosystem of best-in-class technologies to deliver production-level, large-scale data and analytics solutions. He was the first R&D Chief Data and Analytics Officer for GSK, the first CDO for Samsung Mobile, and he led an IBM Business Analytics and Optimization practice, delivering hundreds of data and analytics projects across insurance, banking, telco, retail, pharmaceuticals, CPG, healthcare, government, airline, hospitality, manufacturing and automotive industries. He can be contacted at mark@ramsey.international
