The History of Data Architecture Patterns

How we collect and process data has evolved alongside changes to business models. Since the inception of the modern computer, we have witnessed four generations of data architecture patterns, and having just entered the fourth, we are seeing tremendous and exciting innovation in data processing technology. Figure 1 summarizes the four generations:

Figure 1. A summary of the four major evolutions of data processing technology over the last 80-plus years. With the rise of data-driven organizations and business models centered around data, a fourth and new generation of data patterns and technology has emerged.

What exactly is this fourth generation, which we call data-centric architecture? How does it differ from prior generations? What does it make possible that wasn’t possible before? Let’s take a quick look back at how the computer-, application- and analytic-centric models built on one another, and use that trajectory to characterize the major trends that will drive innovation in the fourth generation.

1st Generation (1960s): Mainframe

As mentioned above, data arose out of the first commercially viable enterprise computing platform, the mainframe. Businesses brought their functions to the mainframe, which in turn hosted the application logic/programs, compute, storage and data within a vertically integrated and tightly coupled stack. Because mainframes were a very large capital investment, the platform was usually shared across multiple business functions to fully amortize the cost. The sales team could run a mainframe program to list new customer accounts and another program to provide MIS reports around how many new sales were captured over the previous month. Simultaneously, the order fulfillment team could run its own independent programs to process transactions and generate reports within the same machine. Figure 2 displays a simple conceptual model of what the mainframe architecture pattern looked like.

Figure 2. The computer-centric architecture pattern. Functions share one common technology platform, and the programming logic, compute and data are all vertically integrated inside the one machine.

Having the logic, compute and data integrated neatly in one place made mainframes incredibly optimized, secure and blazingly fast. Many organizations still use mainframes for high-speed and complex transactional processing. However, there were two major limitations to the mainframe:

(1) Mainframes stored and encoded data in a way that was highly optimized for technical programs but otherwise inscrutable to everyday users. There was no intuitive model for representing the data. For example, a salesperson could register a new customer account by logging into the program “W3TC” and entering the code “99” in the field “AVXN” on screen “03.” While this was a highly efficient way for the mainframe program to store and retrieve data, it also trapped the data inside the context of the program, which made it nearly impossible to share with anyone who wasn’t familiar with that program.

(2) Mainframes by design required finite technical resources (i.e., compute cycles, memory and storage) to be shared by multiple business functions. Very often, programs would sit in a queue, waiting for one function to complete its processes before another could run. This created competition and friction across the functions, and it couldn’t be resolved by simply buying more mainframes; the capital investment required would have been cost-prohibitive.

As business executives started to realize how information technology and automation could create value, it became a business model imperative to digitize as many business functions as possible. This would create unmanageable stress on the shared mainframe model. Something new was needed to make business function optimization at enterprise scale a reality.

2nd Generation (1980s): Distributed computing, relational database management systems (RDBMS) and business data warehouses

As businesses analyzed how to increase profitability, they began decomposing big monolithic processes into more discrete business components. Each component could be individually optimized for a peak level of productivity. Specialized enterprise business software and applications emerged in the marketplace to accelerate function-specific digitization and automation. Such business software could run its own programs on its own technology using equipment much cheaper than the mainframe,1 and without having to share resources with any other applications or functions. In addition, this generation introduced the ability to save data in a model that represented the business function. Instead of locking data in fixed codes tied to the program, it became possible to define business subject areas, attributes and relationships using a vocabulary natural to the function, and then save the data in tables using matching descriptive labels. It was further possible to continuously customize and extend those definitions as needed through simple configuration changes to the underlying data model. The end result was that each business function would define the custom data model required to operate its processes, and then store that model inside a dedicated relational database management system (RDBMS) linked to an application.
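
To make this shift concrete, here is a minimal sketch in Python using the standard library’s SQLite module. The table and column names are purely hypothetical; the point is that the function’s data model is now described in its own business vocabulary rather than in opaque program codes.

```python
import sqlite3

# In-memory database standing in for a function's dedicated RDBMS.
conn = sqlite3.connect(":memory:")

# Instead of opaque codes ("AVXN" = "99" on screen "03"), the sales
# function defines its own subject area with descriptive labels.
conn.execute("""
    CREATE TABLE customer_account (
        account_id      INTEGER PRIMARY KEY,
        customer_name   TEXT NOT NULL,
        account_status  TEXT NOT NULL,   -- e.g., 'NEW', 'ACTIVE', 'CLOSED'
        opened_date     TEXT NOT NULL    -- ISO-8601 date
    )
""")

# Registering a new customer is now self-describing ...
conn.execute(
    "INSERT INTO customer_account (customer_name, account_status, opened_date) "
    "VALUES (?, ?, ?)",
    ("Acme Corp", "NEW", "1987-03-14"),
)

# ... and the data can be queried outside the original program's context.
for row in conn.execute(
    "SELECT customer_name, opened_date FROM customer_account "
    "WHERE account_status = 'NEW'"
):
    print(row)
```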

As companies evolved to become more open to using data to inform business decisions (i.e., “data-informed”), the need emerged for business functions to analyze and introspect their data beyond its original operational context of use. However, because the RDBMS was mostly focused on saving and querying data for operational use only, a separate system was needed to archive historical data and analyze it. Online analytical processing (OLAP) workloads typically require more memory and CPU in order to introspect larger volumes of data, over longer time horizons, than normal operational transaction processing (OLTP). Learning from the painful experience of the mainframe generation, the business functions did not want OLAP queries and workloads competing with business-critical OLTP workloads for the same fixed compute resources inside the RDBMS. The concept of dedicated business data warehouses was introduced to avoid such competition. Each data warehouse contained its own CPU, memory, data storage and data model types (e.g., star schemas and dimensional models), packaged in a configuration optimally designed for executing the more resource-intensive OLAP queries. This combination became the foundation of the application-centric pattern, which distributed compute, memory, storage and data across the multiple operational and analytic systems inside each business function.
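
For readers unfamiliar with the term, the sketch below shows what a “star schema” might look like in practice: a central fact table of transactions surrounded by descriptive dimension tables, arranged so that analytical queries can slice the facts by any dimension. The tables and the query are hypothetical examples, again using SQLite for brevity.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Dimension tables: descriptive attributes used to slice and dice.
warehouse.executescript("""
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month         TEXT,
        year          INTEGER
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- Fact table: one row per sale, with foreign keys pointing at the dimensions.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# A typical OLAP-style query: total revenue by product category and year.
query = """
    SELECT p.category, d.year, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.year
"""
print(warehouse.execute(query).fetchall())  # empty until data is loaded; shown for shape
```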

Figure 3. The application-centric architecture pattern. Business functions independently purchase one or more business applications. Each business application comes with its own resources to run programming logic, and saves its operational data in its own data model within a dedicated relational database. To enable data analysis inside a function, data is copied, archived and integrated inside business data warehouses.

This model was initially tremendously successful and, along with further innovations in computing, storage and networking, contributed to an era of unprecedented business productivity. However, new companies born in the internet era realized that having a complete, end-to-end view of data across many functions was tremendously valuable. They used this intelligence to design innovative products and deliver them via novel sales and service channels, which enabled them to begin capturing and retaining sizable market share over incumbents. Furthermore, savvy bad actors realized that many companies had information blind spots, and that they could exploit intelligence gaps across functions to commit fraud or launder money. The inability of many large companies to respond to these threats exposed weaknesses in the application-centric model:

(1) Offering each business function full control over its own technology made the individual function agile and flexible, but at the cost of creating data towers or data silos inside companies. Data was created for the sole purpose of operating the function; it was “owned” by that function’s business application, used primarily by that application, and saved in a data model best understood by that application. Business processes that required access to data across functions and systems (such as upsell/cross-sell marketing programs, or corporate functions like compliance, risk, finance and human resources) needed to reach out to each application owner individually to extract copies of its data and then stitch those copies together into their own data models. Only then could the data be processed by their applications. In other words, you had to continually copy and bring the data to each function. This made it very difficult and cumbersome to understand what was happening holistically inside a company.

(2) Distributed computing made managing data really, really complex. Before, data used to sit in a few mainframe machines in one big room in a data center. Now, data could be in any of hundreds, if not thousands, of servers across multiple data centers. Not only did this make protecting data from unauthorized access and use more difficult, it also meant that data governance was imperative. Investment was required in people, processes and systems to manage data. These included inventory catalogs, metadata management systems for both business metadata (the data dictionaries or glossaries that provide semantic context to what the data means) and technical metadata (the physical information about how the data is being saved in the data store); reference data systems, master data management systems, data quality systems, data lineage management systems … and on and on. Without these capabilities — and unfortunately, even with them — it was nearly impossible to know what data even existed, let alone what it meant, where it came from, what its quality was and whether it could be trusted for use by others. As data was copied and transformed from one function to the next, poor data began spreading virally across enterprises. This, in turn, reduced trust within the organization around using data for competitive differentiation. To stop the spread of bad data, a business had to prioritize which data was most important, also known as finding its critical data elements, and then devote its entire focus to managing and cleaning them. Meanwhile, new entrants were using data to grab market share by the bushel and somehow didn’t seem to be encumbered by the same challenges. Clearly, they were doing something different.

3rd Generation (2000s): Enterprise analytics with data warehouses, big data, NoSQL and data lakes

Whereas the purpose of data in the application-centric model was to optimize the business function, the next stage of evolution for data-enabled companies emphasized the analysis of end-to-end data drawn from many business functions. The new generation of data processing technology was largely influenced by the rise of the internet. Structured data generated as a byproduct of application transaction processing now lived side by side with unstructured data, which included user-generated content (HTML pages, blog posts, image uploads, etc.) as well as other digitized forms of documents, such as PDFs, Word documents and spreadsheets. Digitization created vast volumes of both structured and unstructured data, and new technologies were needed to store it, bring it together and run analytical models across all of it. Multiple terms were invented to describe and market these data innovations: enterprise data warehouses, big data, Not Only SQL (NoSQL) databases, enterprise data lakes, data lakehouses, data oceans, data fabric. The core premise was that you could copy and save data from multiple source databases and warehouses, then combine and integrate it even though it came in different forms and models. You would save it inside platforms with a configuration of compute, memory and storage optimized for even more complex and resource-intensive analytics over larger data volumes than ever before. Then, finally, you would use new programming languages to build natural language processing, statistical and machine learning models on top of the data to mine it and uncover insights that likely would not have otherwise been found. Those insights would then be brought back to the business functions.

This generation also included new technologies for saving semantic and conceptual data relationships so that (mostly) unstructured data could be easily interpreted and shared by both people and machines. Rather than saving data in a relational table structure, which stores data in matrices made up of row-and-column relationships (two-dimensional tuples), graph databases emerged as a way to save more complex semantic relationships, such as subjects, predicates and objects stored in a three-part, or triple, model. Linking conceptual and semantic dictionaries to the data gave rise to the concept of knowledge graphs, a new way for businesses to save information about their functions and processes. Business functions now had multiple options for serving data out for consumption and analytics: they could save data in relational tables inside a data lakehouse and/or in knowledge graph triple stores for semantic querying.
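
As a rough illustration of the triple model (plain Python, with made-up facts and no particular graph database assumed), each statement is a subject-predicate-object triple, and relationships can be traversed by pattern-matching on any of the three positions:

```python
# Each fact is a (subject, predicate, object) statement.
triples = {
    ("acme_corp", "is_a", "customer"),
    ("acme_corp", "located_in", "chicago"),
    ("order_1001", "placed_by", "acme_corp"),
    ("order_1001", "contains", "product_widget"),
    ("product_widget", "in_category", "hardware"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "What do we know about acme_corp?"
print(match(subject="acme_corp"))

# Traverse relationships: which product categories has acme_corp ordered from?
for order, _, _ in match(predicate="placed_by", obj="acme_corp"):
    for _, _, product in match(subject=order, predicate="contains"):
        print(match(subject=product, predicate="in_category"))
```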

The analytic-centric pattern evolved data architectures into something that looks like Figure 4. 

Figure 4. The analytic-centric architecture pattern. Data is copied from source systems and business warehouses into enterprise analytic platforms which are used to mine it for insights to be fed back to the business functions. These new platforms can save the data in relational (tuple) or semantic (triple) stores. 

Analytic technologies further evolved with the advent of cloud computing. Cloud economies of scale allowed data platforms to be deployed and run more effectively and efficiently than in a company’s own data center, opening the opportunity for companies with IT budgets of any size to invest in enterprise analytic capabilities.

While cloud computing enabled users to more easily integrate and analyze data across business functions, it didn’t fundamentally resolve all of the issues inherent in the application-centric model. Worse yet, the analytic-centric model introduced issues of its own:

  • It relied on copying and transforming data even more than before, exacerbating the complexity of data management and trust issues around data integrity and quality.
  • It offered a way to work around, but not fundamentally break down, data silos.
  • It weakened data protections even further, because new platforms were designed with centralization and ease of integration in mind, not security.

Because it was easy for businesses to stand up data lakes, many large enterprises ended up with multiple data warehouses and lakes: some inside private clouds in their own data centers, others on multiple different public clouds. The ultimate result of the transition to the analytic-centric model was a mixed bag. Even though businesses continued to invest heavily in data, it paradoxically became even harder to find data, understand what it meant, know its quality or enable access to high-quality data. Becoming data-driven would require a whole new technology architecture pattern.

4th Generation (2020s): Collaborative Database

Here, we finally arrive at data-centric architecture, where the primary resource is the data and everything else revolves around it. In other words, we bring the function, which is temporal, to the data, which is permanent. This differs from the previous architectures, where the primary resource was the function and you brought the data to the function, an approach that inevitably led to copy after copy of data as more functions required access to one another’s data. Looking at how prior generations of data technology saved and processed data informs how fourth-generation technology should ideally work. This new generation should be able to:

  • Simplify and consolidate the data ecosystem like the mainframe model did, by going back to a single copy of source data in one ‘system.’
  • Still give a business the flexibility to define the data model so that the data is interpretable by the function outside of the context of the program, like the distributed computing model did.
  • Give the business the ability to control its own IT resources (compute, memory and storage) without having to share them.
  • Enable functions across the enterprise to access and analyze transactional and semantic data in the same place, as big data technology did.
  • Also make data more secure and resistant to cyberthreats.

We call the data innovation that meets these principles and enables data-centric architecture the Collaborative Database. These data stores comprise a consolidated platform that multiple business applications read from and write to, with security and privacy built in. The same system can also be used to run analytic processes over data at massive scale without requiring additional copies to be made. Further, this architecture, which puts data at the center, can enable sharing both inside and outside of organizations. The architecture pattern gets simplified to the design in Figure 5.
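
To make the idea more tangible, here is a deliberately simplified sketch in Python of one shared store that several functions read and write in place. All names and the policy mechanism are hypothetical illustrations of the pattern, not a description of any particular product; the point is that access control and the single copy of data live together at the data layer rather than inside each application.

```python
from dataclasses import dataclass, field

@dataclass
class CollaborativeStore:
    """A single shared data store; access policy lives with the data."""
    records: list = field(default_factory=list)
    # Hypothetical policy: which functions may read which subject areas.
    read_policy: dict = field(default_factory=dict)

    def write(self, function: str, subject_area: str, record: dict) -> None:
        # Every function writes into the same store; no copies are made.
        self.records.append({"function": function, "subject_area": subject_area, **record})

    def read(self, function: str, subject_area: str) -> list:
        # The store, not the application, decides who can see what.
        allowed = self.read_policy.get(function, set())
        if subject_area not in allowed:
            raise PermissionError(f"{function} may not read {subject_area}")
        return [r for r in self.records if r["subject_area"] == subject_area]

store = CollaborativeStore(read_policy={
    "sales":      {"customer"},
    "compliance": {"customer", "order"},  # cross-functional read, no copy needed
})

store.write("sales", "customer", {"name": "Acme Corp", "status": "NEW"})
store.write("fulfillment", "order", {"order_id": 1001, "customer": "Acme Corp"})

print(store.read("compliance", "order"))  # compliance analyzes order data in place
```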

Figure 5. The data-centric architecture pattern, which builds from prior generations. Data is pulled out of the application layer and saved in one place. All business functions can access data for both operational and analytic processing using their defined data models. But, like the distributed computing pattern, functions can still share data without needing to compete for CPU or memory resources. This common Collaborative Database can further be extended to functions outside of the company, such as partners, suppliers or customers.

The ultimate business objective behind data-centric architecture is twofold. First, it establishes a business culture in which any business function can safely and securely consume any other internal or external data, whether coming from another function, a supplier or a customer. Second, it dramatically simplifies the data environment and tears down the data silos that were the unintended consequence of the prior generations’ function-centric focus. Adoption of data-centric architecture must inevitably be followed by the gradual consolidation and retirement of the many redundant, copycat data stores implemented in past generations.

Figure 6. The evolution of data processing technologies and how they’ve aligned to business model transformations. The fourth generation of data processing technology allows organizations to become data-driven.

About the Author

Eliud Polanco is a seasoned data executive with extensive experience in leading global enterprise data transformation and management initiatives. Prior to his current role as President of Fluree, a data collaboration and transformation company, Polanco was the Head of Analytics and Data Standards at Scotiabank. He led a full-spectrum data transformation initiative to implement new tools and technology architecture strategies, both on-premises and cloud, for ingesting, analyzing, cleansing, and creating consumption-ready data assets.

Polanco’s professional experience also includes serving as Global Head of Analytics and Big Data at HSBC, Head of Anti-Financial Crime Technology Architecture for Deutsche Bank in the U.S., and Head of Data Innovation at Citi.
