The Technology Innovations Behind a Data-Driven World

In our previous article, “The Data-Driven Organization,” we introduced the concept of data-driven organizations: firms capable of outsmarting the competition by leveraging their own proprietary internal data and external third-party data to a degree their peers cannot. They accomplish this by purposefully placing the creation, collection, and use of data at the center of their business process and technology architecture design. We believe that “data first” is the new business and technology imperative of our era.

Becoming a data-driven organization requires significant changes to the internal business culture and social contracts between the business functions that produce and consume data. It also requires breaking down foundational barriers inherent in the technology architecture that unexpectedly trap data and keep it from being fully leveraged. As we will see, the technologies that store and process data have not historically made it easy for companies to securely share data both internally and with external partners, suppliers, or customers. This is because data sharing requires continuously copying, translating, and transforming data across many contexts of use. This makes data overall harder to manage and protect from leakage and cyberthreats. To foster a culture where the default habit is to share data, an entirely new architecture pattern is needed. This pattern must be more secure than existing technologies and reduce the need for massive copying.

In this article, we will:

  • Describe what a data-centric technology architecture looks like.

  • Describe how it is different from, or similar to, existing data processing technologies.

The third and fourth articles will close the series with a discussion of data governance and of how to practically adopt a data-centric architecture, especially in light of the legacy footprint of existing data processing technologies.

An Introduction to Data-Centric Technology Architectures 

Data processing and storage technologies, and the architecture patterns around them, have historically evolved to keep pace with changes in business models, as well as with cultural attitudes about how data can and should be used. In the beginning, all business logic and data processing could be performed on a single integrated computer: the mainframe. As businesses became more diversified, componentized, and complex, new technologies such as the database and the data warehouse were invented. With each major transformative business shift came a new generation of data processing technologies and architectures, which both enabled new business capabilities and fixed the unintended consequences or perceived shortcomings of the previous generation. In the same fashion, the transformation toward becoming data-driven necessitates innovations in data processing technology. We are now witnessing the evolution of data processing technology into its fourth generation since the inception of the modern computer. We summarize the four generations below:

Figure 1. A summary of the four major evolutions of data processing technology over the last 80+ years. With the rise of the data-driven organization and of business models centered on data, a new, fourth generation of data patterns and technologies has emerged.

The differences between the four generations of architecture are primarily characterized by where the data sits relative to the compute and the application programming logic. A data-centric architecture consolidates data into a single layer, while the compute and the application logic can be distributed across the functions. In this model, we bring the application logic from the function, which is temporal, to the data, which is permanent. This differs from previous-generation architecture patterns, where the primary resource was the function and you brought the data to the function, which inevitably led to copy after copy of data being created as more functions required access to one another’s data. You can see a visual representation of the evolution of data architectures in Figure 2. We also provide a full primer on each generation’s architecture pattern, and why each evolved the way it did, in a separate article.

We call the new type of data processing technology generated by this architecture pattern a collaborative database. These are new types of data stores designed with data sharing across functions and applications in mind: multiple business applications can both read from and write into the same data store, with security and privacy built in. The same system can also run analytic processes on data at massive scale without requiring additional copies. Further, this architecture can easily be shared within the boundaries of the organization, across geographies and technical platforms (e.g., multiple clouds), or extended across a trusted network of partners, suppliers, customers, peers, or even competitors.

Before we get there, we need to explain how this technology actually works. What is a collaborative database? How will it fundamentally change how data is stored, processed, and shared compared to relational databases and analytics platforms? A series of technology innovations, especially around Web 3.0 principles of decentralized data storage, interoperability, and control, provides the foundation for this fourth generation of data processing technology. In the next section, we will cover the eight most interesting innovations that differentiate collaborative databases and make them the engine organizations need to migrate from data-informed or data-enabled to data-driven.

Figure 2. The evolution of data processing technologies and how they’ve aligned to business model transformations. The fourth generation of data processing technology allows organizations to become data-driven.

What Makes Data-Centric Technologies Different From Existing Data Processing Technologies?

As mentioned above, data centricity relies on a new data technology — the collaborative database — which makes it easy to manage data as the product, as opposed to data as a byproduct, of business applications. The technical design of the collaborative database rests on a set of bedrock principles:

  • It must be easy to define what data means, in a way that can be interpreted by the original producers but also by multiple consumers of data.
  • Everyone’s contributions to data must be immediately available to everyone else, as long as they have permission to see the data by policy.
  • It must be easy to define policies at the level of granularity necessary to control who is allowed to contribute or see what data.

  • It must be easy to define who you want to share data with, and it must be easy to add or remove producers and consumers from your trusted network.

In order to meet these principles, collaborative databases are built by combining elements of eight innovative technologies that address how data can be created, stored, protected, accessed and consumed:

Figure 3. The eight innovations that make up the collaborative database technology platform.

Let’s quickly run through each of them to describe how they all contribute to data-centric design and solve the issues inherent in prior generations of data processing technology.

How Data Is Created, Stored and Protected

1. Multiple writers, multiple readers

Traditional relational database management systems (RDBMSs) were designed around a trust model in which there is one predominant creator of original data: the business application. By comparison, collaborative databases operate under the presumption that multiple business functions or applications will share the same “database.” To achieve this, they must be designed in an inherently “trustless,” or “zero-trust,” manner. In principle, any business function could write into or read from the same database, provided that the following five conditions are met (a simple sketch of a guarded write follows the list):

  1. Prior to each individual request to read or write data, the writer or reader must be verifiable by any participant in the network as being who they claim to be.
  2. Each individual write or read request must be permissioned by policy before it is executed.
  3. Each individual write transaction must be tamper-proof, i.e., you can prove mathematically, and through logs, that what was written has not been altered.
  4. Conflicts between write operations can be managed by policy, i.e., you can define the rules for how to handle two writers trying to alter the same piece of data simultaneously.
  5. The history of every executed request to read or write data must be publicly available to everyone, traceably and in a tamper-proof manner, so that data is inherently auditable.
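
To make these conditions concrete, here is a minimal Python sketch. It is purely illustrative, not any vendor’s implementation: the identity registry, policy table, and ledger layout are invented for the example. It shows conditions 1, 2, 3, and 5 working together, with a write accepted only after the caller’s signature is verified and a policy check passes, and every accepted write appended to a hash-chained, auditable log. Conflict handling (condition 4) is omitted for brevity.

```python
# Minimal, illustrative sketch of a guarded write path; all names are invented.
import hashlib
import hmac
import json
import time

SHARED_KEYS = {"sales-app": b"demo-secret"}          # assumed identity registry
POLICY = {"sales-app": {"can_write": {"customer"}}}  # assumed per-writer policy

ledger = []  # each accepted write links to the previous entry via its hash


def verify(writer: str, payload: bytes, signature: str) -> bool:
    """Condition 1: prove the writer is who they claim to be."""
    key = SHARED_KEYS.get(writer)
    if key is None:
        return False
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def permitted(writer: str, collection: str) -> bool:
    """Condition 2: every request is checked against policy before it runs."""
    return collection in POLICY.get(writer, {}).get("can_write", set())


def append(writer: str, collection: str, record: dict, signature: str) -> dict:
    payload = json.dumps({"collection": collection, "record": record},
                         sort_keys=True).encode()
    if not verify(writer, payload, signature):
        raise PermissionError("identity could not be verified")
    if not permitted(writer, collection):
        raise PermissionError("policy does not allow this write")
    prev_hash = ledger[-1]["hash"] if ledger else "genesis"
    entry = {"writer": writer, "payload": payload.decode(),
             "prev": prev_hash, "ts": time.time()}
    # Conditions 3 and 5: the hash chain makes tampering detectable and keeps
    # the history of executed writes auditable by anyone holding the ledger.
    entry["hash"] = hashlib.sha256((prev_hash + entry["payload"]).encode()).hexdigest()
    ledger.append(entry)
    return entry


# Usage: a sales application signs and submits a customer record.
body = json.dumps({"collection": "customer",
                   "record": {"id": "c-1", "name": "Acme"}}, sort_keys=True).encode()
sig = hmac.new(SHARED_KEYS["sales-app"], body, hashlib.sha256).hexdigest()
print(append("sales-app", "customer", {"id": "c-1", "name": "Acme"}, sig)["hash"])
```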

2. Network-distributed by nature

Traditional databases physically exist inside one or more host servers. When a writer records a transaction in its database, the information can only be communicated to another application if the database broadcasts it to other host systems (e.g., other databases or data warehouses) through some integration channel. This can happen by writing to a message queue, publishing an event, exposing data via an API endpoint, or extracting a copy and sending it via secure file transfer (SFTP). The problem with sharing data this way is that it encourages the creation of many point-to-point connections between data sources. These connections, in turn, make it very complicated and expensive to make changes without inevitably affecting some consumer downstream. The proliferation of connections also actively contributes to the continuous, redundant proliferation of data copies taking place today, making data ever more difficult to manage.

Collaborative databases are conceptually more of a peer network than a single physical system. They are built on standard, open protocols that enable multiple computers anywhere in the world to maintain a shared index, or ledger, of all transactions made by all participants in the network. When any participant, such as the sales business application, records a new or updated transaction in the collaborative database (by writing the change using the standard protocol), the record of the change is automatically broadcast in real time to the ledger copies of all permissioned participants in the network. Each consumer can independently validate the integrity and accuracy of the change. This allows each business function to manage its own copy of the index, or ledger, without physically moving the underlying data. This is critical, as it ultimately limits data sprawl, duplication, and proliferation.
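
A toy sketch can illustrate the broadcast-and-validate idea. Everything here is hypothetical and greatly simplified (a real protocol would also handle consensus, permissions, and failure modes): each permissioned participant keeps its own copy of the ledger and independently validates every change before appending it locally.

```python
# Toy, illustrative sketch of peers maintaining and validating a shared ledger.
import hashlib
import json


class Participant:
    def __init__(self, name: str):
        self.name = name
        self.ledger = []  # this participant's local copy of the shared ledger

    def validate(self, entry: dict) -> bool:
        prev = self.ledger[-1]["hash"] if self.ledger else "genesis"
        expected = hashlib.sha256((prev + entry["payload"]).encode()).hexdigest()
        return entry["prev"] == prev and entry["hash"] == expected

    def receive(self, entry: dict) -> None:
        if self.validate(entry):
            self.ledger.append(entry)


class Network:
    def __init__(self):
        self.participants = []

    def broadcast(self, writer, record: dict) -> None:
        prev = writer.ledger[-1]["hash"] if writer.ledger else "genesis"
        payload = json.dumps(record, sort_keys=True)
        entry = {"payload": payload, "prev": prev,
                 "hash": hashlib.sha256((prev + payload).encode()).hexdigest()}
        for participant in self.participants:  # every permissioned peer gets the change
            participant.receive(entry)


# Usage: sales records a change; finance's local ledger stays in sync.
net = Network()
sales, finance = Participant("sales"), Participant("finance")
net.participants += [sales, finance]
net.broadcast(sales, {"customer": "c-1", "status": "active"})
print(len(finance.ledger))  # 1 -- finance validated and applied the same entry
```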

The boundaries of the collaborative database will not necessarily be a physical host server like an RDBMS, but the network of willing participants. To add business functions to the collaborative database, whether in or outside of an organization’s four walls, simply invite them to participate in the network. To remove them, remove them from the network or change their access policies. The underlying method of storage is not relevant; the data can physically be in a private or public cloud, inside a data center or outside, as long as it is reachable by parties via the network.

3. Composable, reusable data

Business applications have evolved from big, complex programs to a microservice design. Individual, discrete units of functionality and programming logic are now made to be reusable, exposed via application programming interfaces (APIs), and chained and orchestrated together by workflow. 

Collaborative databases do the same with data. Data is no longer bound to a database host that collects it into a schema of tables linked together by that host’s data model, an arrangement that then requires complex systems integration to merge data from multiple systems. Instead, data is hosted in discrete, reusable, interconnectable blocks, designed by protocol to be chained and orchestrated together at query time, based on policy permissions, by anyone in the network.

Think of how the adoption of microservices fundamentally changed how application development teams wrote source code, managed change, and released new business functionality. Composable data will similarly change how data producers and consumers serve, consume, and collaborate on data. Businesses will no longer have to create and maintain multiple consumption views or physicalized extracts of data for each consumer. Instead, consumers will be able to request and retrieve whatever relevant blocks of data they need, at query time, based on policy. 
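
The following sketch, with invented block identifiers and a made-up read policy, illustrates the composability idea: data lives in small, addressable blocks that are linked and assembled at query time, with policy deciding which blocks a given consumer may see.

```python
# Illustrative sketch of composable data blocks assembled at query time.
BLOCKS = {
    "customer:c-1": {"name": "Acme", "orders": ["order:o-9"]},
    "order:o-9":    {"total": 120.0, "items": ["item:i-3"]},
    "item:i-3":     {"sku": "SKU-42", "qty": 2},
}

READ_POLICY = {"analytics": {"customer", "order"}}  # assumed per-consumer policy


def resolve(block_id: str, consumer: str, depth: int = 2):
    """Chain blocks together at query time, honoring the consumer's policy."""
    if block_id.split(":")[0] not in READ_POLICY.get(consumer, set()):
        return None  # block withheld by policy
    block = dict(BLOCKS[block_id])
    if depth > 0:
        for key, value in block.items():
            if isinstance(value, list) and all(v in BLOCKS for v in value):
                block[key] = [resolve(v, consumer, depth - 1) for v in value]
    return block


# Usage: analytics may traverse customers and orders, but item blocks are withheld.
print(resolve("customer:c-1", "analytics"))
```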

4. Semantically describable data

In most organizations, the meaning and context of data (its data dictionary, or business metadata) has typically been saved separately from the data itself, stashed inside a metadata management or data governance system. Adding to the complexity, a separate data catalog holds the inventory of physical metadata about the data across its various data stores (such as schemas, tables, column names, data types, and relationships to other tables). It takes active, consistent effort to link the business metadata in data dictionaries with the technical metadata in data catalogs. In the collaborative database, the business and technical metadata, as well as the contextual relationships between them, are stored directly inside the data. This means that each atomic block of data, by definition, must contain technical information about itself (its class) as well as its business context (the semantic model). This can be thought of as an embedded knowledge-graph view of the data for all blocks within the network. This is important because it enables blocks of data to be discovered and reused by different functions; after all, you need to know what something is before you can attempt to use or reuse it.
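
As an illustration of what a self-describing block might look like, here is a small Python example in the spirit of JSON-LD and RDF triples. The identifiers are invented and the vocabulary is borrowed from schema.org purely as an example: the technical class and the business semantics travel with the values, so any consumer can expand the block into discoverable triples.

```python
# Illustrative, self-describing data block; not a prescribed format.
block = {
    "@context": {                      # business semantics carried with the data
        "schema": "https://schema.org/",
        "name": "schema:name",
        "email": "schema:email",
    },
    "@type": "schema:Person",          # technical class of this block
    "@id": "did:example:customer:c-1",
    "name": "Ada Lovelace",
    "email": "ada@example.com",
}


def describe(blk: dict) -> list:
    """Expand a block into subject-predicate-object triples for discovery."""
    ctx, subject = blk["@context"], blk["@id"]
    triples = [(subject, "rdf:type", blk["@type"])]
    for key, value in blk.items():
        if not key.startswith("@"):
            triples.append((subject, ctx[key], str(value)))
    return triples


# Any consumer (or catalog) can discover what this block is before reusing it.
for triple in describe(block):
    print(triple)
```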

Furthermore, most organizations don’t have just one data dictionary or business glossary. As previously described, each business function will likely have its own model to describe its processes. To facilitate interoperability between functions, whether for users, programs, or APIs, these dictionary-to-dictionary semantic relationships also need to be preserved inherently inside the data. Because of the extensibility of Web 3.0 semantic open standards, graphs, and triple stores, it is easy to define and add those relationships as extended properties of a data block.

5. Defensible data

In the beginning, the policies and business logic (rules) governing who can read and write data were built into the program hosted by the mainframe or into the business application itself. If an organization needed to enforce a data privacy policy that applied to every enterprise function, each individual program or business application owner was accountable for changing their software to execute the policy. As systems became more complex, middleware such as centralized corporate directories and identity and access management systems was introduced, enabling policies to be controlled from outside a business application or API. The challenge is that the attack surface expanded from the sources of data to the APIs to the middleware layer. As data continues to be copied over and over, and then exposed and made available via APIs, the threat surface that needs to be protected continues to grow exponentially, beyond the means of most information security programs.

Collaborative databases embed the data access control policies directly inside the data, so that access control cannot be decoupled from the data. Furthermore, these policies can apply rich, robust programmable logic with sophisticated business rules; think of how blockchain smart contracts work. When access cannot be separated from the data, and the data is “smart” enough to expose itself to readers only when business conditions are met (even if anyone could access the ledger itself), we describe the data as “defending itself.”
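
A minimal sketch of this idea, with an invented record and an assumed business rule, shows a policy that travels with the data and is evaluated as programmable logic on every read attempt:

```python
# Illustrative sketch of "data defending itself"; the record and rule are invented.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardedRecord:
    values: dict
    # The access rule is part of the data itself, not of an external middleware layer.
    policy: Callable[[dict], bool] = field(default=lambda request: False)

    def read(self, request: dict) -> dict:
        if not self.policy(request):
            raise PermissionError("policy embedded in the data denied this read")
        return self.values


# Assumed rule: only the risk team may see the record, and only while the deal is open.
record = GuardedRecord(
    values={"deal_id": "d-7", "exposure": 1_250_000},
    policy=lambda req: req.get("team") == "risk" and req.get("deal_open") is True,
)

print(record.read({"team": "risk", "deal_open": True}))    # allowed by the embedded policy
try:
    record.read({"team": "marketing", "deal_open": True})  # denied by the data itself
except PermissionError as err:
    print(err)
```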

How Data Is Accessed and Consumed

6. Multi-modality

As discussed before, traditional RDBMSs were originally designed to handle the transactional operations required to run business functions in real time (such as OLTP database reads and writes). Analytic processes (i.e., OLAP queries), which consume far more compute and memory, could not be performed inside the RDBMS without risking significant impairment of business operations. To compensate, data was typically copied out of databases and heavily transformed for use inside analytic platforms (data warehouses, big data, and NoSQL databases), which were better designed for OLAP-type workloads. Unfortunately, in this pattern, the more data that needed to be analyzed, the more copying and transformation was required, exacerbating the complexity of data management.

Collaborative databases, by contrast, are designed to natively process both OLTP and OLAP queries. This is accomplished by using modern on-demand distributed computing techniques that automatically provision resources on a just-in-time basis, with dedicated CPU and memory for each workload type. Transactional read, transactional write, and analytic workloads can each spin up their own compute when needed, rather than competing for the same fixed resources. This enables many parallel processors to operate on the same physical instance of data without requiring the data to be copied.
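
The sketch below is a loose illustration of workload isolation rather than true on-demand provisioning: transactional lookups and analytic scans run in separately sized worker pools while reading the same single copy of the data.

```python
# Loose, illustrative sketch of isolating OLTP-style and OLAP-style workloads.
from concurrent.futures import ThreadPoolExecutor

DATA = [{"id": i, "amount": i * 10.0} for i in range(1_000)]  # one shared copy

oltp_pool = ThreadPoolExecutor(max_workers=2)  # sized for many small lookups
olap_pool = ThreadPoolExecutor(max_workers=4)  # sized for heavy scans


def point_lookup(record_id: int) -> dict:
    """OLTP-style request: fetch one record."""
    return next(r for r in DATA if r["id"] == record_id)


def total_amount() -> float:
    """OLAP-style request: scan and aggregate every record."""
    return sum(r["amount"] for r in DATA)


# The two workload types never compete for the same fixed compute resources.
lookup = oltp_pool.submit(point_lookup, 42)
report = olap_pool.submit(total_amount)
print(lookup.result(), report.result())
```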

7. Schema-on-query (multilingualism) 

Traditional RDBMSs are based on the principle of “schema-on-write”: you must define one data model in advance before you can start writing or loading data into the database. Analytic platforms introduced the concept of “schema-on-read,” which decouples the definition of the data model from the loading of the data. This makes it possible to bring together and load copies of data from multiple sources into one target system, the analytic platform, and then build one or more consumption data models later by inferring the source data model when reading or introspecting the data.

Collaborative databases advance this further with “schema-on-query.” Because each block of data stores its semantic context inside itself, and the knowledge graph triple-store structure allows one or more data dictionaries to be semantically linked, it is possible to query data from any one of many possible schemas, or semantic contexts. At query time, the consumer can choose a specific vocabulary it understands, define it as part of the query context, and query the data using that model. For example, suppose System A describes these records as “clients,” System B as “customers,” and System C as “parties.” When all three share the same collaborative database platform, a user could issue a query to “return all customers” using the System B context and retrieve all clients, customers, and parties. This removes the need to extract copies of data and transform them into multiple consumption schemas so that business functions can share data with one another.
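
Here is a deliberately simplified sketch of the clients/customers/parties example. In practice the equivalences would come from the semantic links stored with the data; here they are hard-coded for illustration.

```python
# Simplified sketch of schema-on-query: the consumer picks a vocabulary at query time.
DATA = [
    {"system": "A", "type": "client",   "name": "Acme"},
    {"system": "B", "type": "customer", "name": "Globex"},
    {"system": "C", "type": "party",    "name": "Initech"},
]

# Hard-coded stand-in for the semantic links between the three dictionaries.
CONTEXTS = {
    "A": {"client":   {"client", "customer", "party"}},
    "B": {"customer": {"client", "customer", "party"}},
    "C": {"party":    {"client", "customer", "party"}},
}


def query(term: str, context: str) -> list:
    """Return every record whose type is semantically equivalent to `term`."""
    equivalents = CONTEXTS[context].get(term, {term})
    return [row for row in DATA if row["type"] in equivalents]


# "Return all customers" in System B's vocabulary retrieves all three records.
print(query("customer", context="B"))
```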

8. Self-harmonization 

In traditional RDBMSs, there was one primary writer (the business application), one data model associated with the data, and one set of data values, which was presumed to be a version of the “truth.” Once data from multiple business functions and systems began to be integrated into analytic platforms, it became obvious that there was no single “truth,” because each business application could define its own data model and data dictionary. This led organizations to stand up reference data and master data management programs so that data could be harmonized, or normalized, and queried consistently and accurately across sources.

By contrast, collaborative databases cannot, by definition, assume that data will come from one source or writer. Nor can they assume that the source(s) will always be internal (in fact, the boundary between internal and external can blur), that data harmonization standards will be consistently applied at the source, or that teams of people will be on hand to manage mapping tables, matching logic, or survivorship rules. Instead, they will take advantage of new and emerging generative artificial intelligence and machine learning techniques to self-harmonize new data added to the network, which can then be reinforced by the “crowd” of users who actively consume the data. There are two levels of self-harmonization that AI can facilitate. The first is when new data added to the store is detected as likely to be semantically equivalent, or synonymous, to other blocks of data in the network (i.e., “schema harmonization”). The second is when the contents of new blocks of data added to the network are identified as equivalent to the contents of another block (i.e., “content harmonization”). The most sophisticated content harmonization can not only introspect the contents of data and learn whether there are conflicts, but also learn how to resolve them and create “Golden Sources of Truth” inside the network.
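
A toy sketch of schema harmonization might look like the following. A production system would use learned embeddings or a trained model rather than simple string similarity, and the field names below are invented.

```python
# Toy sketch: flag likely-synonymous field names in a newly added block so the
# "crowd" (or a model) can confirm the suggested equivalences.
from difflib import SequenceMatcher

KNOWN_FIELDS = {"customer_name", "email_address", "postal_code"}


def suggest_matches(new_fields: set, threshold: float = 0.7) -> dict:
    """Propose candidate equivalences between new and known field names."""
    suggestions = {}
    for new in new_fields:
        best = max(KNOWN_FIELDS,
                   key=lambda known: SequenceMatcher(None, new, known).ratio())
        if SequenceMatcher(None, new, best).ratio() >= threshold:
            suggestions[new] = best  # candidate semantic equivalence to confirm
    return suggestions


# Usage: a new block arrives with slightly different field names.
print(suggest_matches({"cust_name", "email_addr", "loyalty_tier"}))
```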

Data governance shifts from a model of centralized management to one in which a crowd of producers and consumers participates in describing and fixing data simply by actively using it. We’ll explore these possibilities a little more in our upcoming article on data governance.

Putting It All Together

As should now be evident, the integration of these eight technology innovations into a single platform, the collaborative database, will empower businesses to transform into the data-driven organizations described in our first article.

We hope that by now it is clear what a data-centric architecture is, how a collaborative database enables it, and what makes it different from existing data technologies. We have also discussed the business benefits of adopting a data-centric architecture: faster access to data for the business, a simpler and less complex data architecture, and stronger information security controls, among others.

But if you’re come this far, reading this may have yielded even more questions. Where do we even begin? How do we even think about adopting yet another technology on top of the hundreds of mainframes, relational databases, data marts, data warehouses and data lakes across on-premise data centers and/or multiple cloud providers? Or even before that, who in the organization actually owns this system and who manages the data, if not the business function owner? What does it mean to manage data in a data-centric architecture? 

These are critical questions. In our next article, we will explore the implications of data-centric architectures for data ownership and governance models. We will then conclude with a final discussion of adoption considerations, including how to build the business case and finance the transformation, which process to begin with, and how to undertake a transformation effort across the dimensions of people, skills, processes, and technology.

About the Author

Eliud Polanco is a seasoned data executive with extensive experience in leading global enterprise data transformation and management initiatives. Prior to his current role as President of Fluree, a data collaboration and transformation company, Polanco was the Head of Analytics and Data Standards at Scotiabank. He led a full-spectrum data transformation initiative to implement new tools and technology architecture strategies, both on-premises and cloud, for ingesting, analyzing, cleansing, and creating consumption-ready data assets.

Polanco’s professional experience also includes serving as Global Head of Analytics and Big Data at HSBC, Head of Anti-Financial Crime Technology Architecture for Deutsche Bank in the U.S., and Head of Data Innovation at Citi.
