Is Your Modern Data Stack Ready for AI?

The modern data stack has evolved alongside business analytics to become an integral part of modern business operations. However, as artificial intelligence (AI) becomes central to the enterprise, the modern data stack must evolve to meet the unique demands of AI systems.

Meticulous joining and time-aware feature engineering are essential for AI-ready data: each training example must be built only from information that was available at the moment of the prediction it represents. Furthermore, AI decision-making processes demand a multitude of input columns, whereas dashboard metrics usually rely on a single input or a ratio of two inputs.
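
As a minimal sketch of time-aware joining, pandas' merge_asof can attach to each prediction event the most recent account snapshot available at that time, never a later one, preventing future information from leaking into training data (the tables and column names here are hypothetical):

    import pandas as pd

    # Hypothetical prediction events and account snapshots.
    events = pd.DataFrame({
        "account_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-01-20"]),
    })
    snapshots = pd.DataFrame({
        "account_id": [1, 1, 2],
        "snapshot_time": pd.to_datetime(["2024-01-01", "2024-01-31", "2024-01-10"]),
        "balance": [1000.0, 1200.0, 500.0],
    })

    # Point-in-time join: for each event, use the latest snapshot at or
    # before the event time, so no future data leaks into the features.
    features = pd.merge_asof(
        events.sort_values("event_time"),
        snapshots.sort_values("snapshot_time"),
        left_on="event_time",
        right_on="snapshot_time",
        by="account_id",
        direction="backward",
    )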

Minimize Data Movement

An AI-ready data stack minimizes the movement of raw data and, when movement is unavoidable, transfers it in a bandwidth-efficient format. AI systems consume high volumes of detailed data, making data movement expensive and inefficient. For example, when an AI system at a major bank failed in production, investigations traced the root cause to a legacy data pipeline tool designed for dashboard summary statistics.

Large volumes of data were transferred in an inefficient JSON format, overwhelming the network. AI-ready data should be prepared and processed within the modern cloud data platform or data warehouse to ensure scale and efficiency, with data pipelines capable of handling AI's large-scale demands.
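
As an illustrative sketch with synthetic data, writing the same records in a columnar, compressed format such as Parquet instead of row-oriented JSON typically shrinks the bytes that must cross the network by an order of magnitude:

    import os

    import numpy as np
    import pandas as pd

    # One million synthetic transaction records.
    df = pd.DataFrame({
        "account_id": np.random.randint(0, 10_000, size=1_000_000),
        "amount": np.random.exponential(50.0, size=1_000_000).round(2),
    })

    df.to_json("transactions.json", orient="records")  # verbose, row-oriented
    df.to_parquet("transactions.parquet")              # columnar, compressed (needs pyarrow)

    print(os.path.getsize("transactions.json"))     # typically tens of MB
    print(os.path.getsize("transactions.parquet"))  # typically a few MB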

Recognizing the key differences between BI and AI is vital for businesses as they assess their data infrastructure and determine the most effective approach for their needs. The modern data stack was originally created to simplify the ingestion and computation of data so that Analytics Engineers and BI professionals could manage data pipelines without relying on Data Engineers for everything.

Although the modern data stack has matured to include automated machine learning (auto-ML) and machine learning operations (MLOps), the same progress is not apparent in data pipeline tools, which frequently draw from traditional BI architectures that are not well suited to AI applications.

High Computational Complexity

Perhaps the most fundamental difference is in the computational horsepower needed in the stack, reflecting the greater demands of AI versus BI. Preparing data for AI usage involves more intricate and resource-intensive processes, known as ‘feature engineering,’ compared to the simpler data preparation for dashboard-focused BI systems.
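
A hedged sketch of that difference: a dashboard metric is often a single aggregate, while feature engineering fans the same raw table out into many per-entity, time-windowed inputs (the columns and windows below are invented for illustration):

    import pandas as pd

    transactions = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "ts": pd.to_datetime(["2024-01-02", "2024-01-20",
                              "2024-01-05", "2024-01-18", "2024-01-30"]),
        "amount": [20.0, 35.0, 10.0, 15.0, 40.0],
    })

    # BI: one number feeds the dashboard.
    total_revenue = transactions["amount"].sum()

    # AI: many engineered inputs per customer, computed as of a cutoff date
    # so that training data matches what production will see.
    cutoff = pd.Timestamp("2024-02-01")
    recent = transactions[transactions["ts"] >= cutoff - pd.Timedelta(days=14)]
    features = transactions.groupby("customer_id").agg(
        txn_count=("amount", "size"),
        total_spend=("amount", "sum"),
        last_seen=("ts", "max"),
    ).join(recent.groupby("customer_id")["amount"].sum().rename("spend_14d"))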

Training-Prediction Consistency

AI systems train on carefully constructed historical data samples, then deploy using live data, necessitating two distinct yet consistent data pipelines. This is not as easy as it first seems. Discrepancies between the training pipeline and the production pipeline, often called training-serving skew, can silently degrade a model once it is deployed, so ensuring consistency between the two is crucial.
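
One common mitigation, sketched here under assumed column names, is to define each transformation exactly once and call the same function from both the historical training pipeline and the live prediction pipeline:

    import numpy as np
    import pandas as pd

    def add_features(df: pd.DataFrame) -> pd.DataFrame:
        """Single source of truth for feature logic, shared by both pipelines."""
        out = df.copy()
        out["amount_log"] = np.log1p(out["amount"])
        out["is_weekend"] = out["ts"].dt.dayofweek >= 5
        return out

    historical = pd.DataFrame({"amount": [10.0, 250.0],
                               "ts": pd.to_datetime(["2024-01-06", "2024-01-08"])})
    live = pd.DataFrame({"amount": [99.0],
                         "ts": pd.to_datetime(["2024-02-03"])})

    train_features = add_features(historical)  # batch pipeline, training sample
    live_features = add_features(live)         # production pipeline, identical logic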

Rapid Experimentation

With AI, your data stack must accommodate rapid iterations and experiments, as well as a short time to market. Machine learning models, the foundation of modern AI, rely on past data patterns to predict the future. When those patterns change, a problem known as ‘model drift’, predictions degrade and models can fail. For example, in 2021, one telco discovered that its AI-powered network routing required urgent upgrades after it failed to meet contractual service standards for its customers.

Work-from-home and videoconferencing had become widespread among COVID-impacted businesses, causing the telco's artificial intelligence system to fall behind. Machine learning models need regular retraining and updating, which may require frequent changes to the data pipeline. A modern data stack designed for stability may not suffice for the rapid iterations needed to maintain AI systems.
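
A minimal drift check, sketched here with the population stability index (PSI) on synthetic data, compares a feature's distribution at training time against its live distribution and raises an alert when the shift crosses a conventional threshold (0.2 is a common rule of thumb):

    import numpy as np

    def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
        """Population stability index between a baseline and a live sample."""
        edges = np.histogram_bin_edges(expected, bins=bins)
        e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        e_pct = np.clip(e_pct, 1e-6, None)  # guard against log(0) in sparse bins
        a_pct = np.clip(a_pct, 1e-6, None)
        return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)  # behavior seen at training time
    live = rng.normal(0.5, 1.2, 10_000)      # behavior after patterns shift

    if psi(baseline, live) > 0.2:
        print("Drift detected: schedule retraining")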

Data Security & Governance

The detailed data required by AI introduces new security risks associated with moving and storing sensitive information. AI governance standards are still developing, with limited industry guidelines available. Many existing data governance practices can help reduce AI system risks and protect a company's reputation. However, many organizations have ungoverned sandboxes or data lakes for machine learning purposes, introducing significant reputational and business risks.

High-quality enterprise governance practices should include role-based access controls for feature engineering. The data pipelines utilized by AI systems should have the same audit trails, approval workflows, and version controls as the machine learning models that consume the data.
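
As a purely illustrative sketch, not any specific product's API, a feature definition can carry its own version, owner, approval status, and permitted roles, so that pipelines refuse to run unapproved or unauthorized definitions:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class FeatureDefinition:
        """Versioned, governed feature metadata (hypothetical schema)."""
        name: str
        version: str
        owner: str
        approved: bool
        allowed_roles: frozenset = field(default_factory=frozenset)

    def check_access(feature: FeatureDefinition, role: str) -> None:
        if not feature.approved:
            raise PermissionError(f"{feature.name} v{feature.version} is not approved")
        if role not in feature.allowed_roles:
            raise PermissionError(f"role {role!r} may not use {feature.name}")

    spend_30d = FeatureDefinition(
        name="spend_30d", version="1.2.0", owner="risk-team", approved=True,
        allowed_roles=frozenset({"credit_scoring", "fraud_detection"}),
    )
    check_access(spend_30d, "credit_scoring")  # passes; any other role is rejected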

Granular Data Quality

High-quality data is crucial for AI system success. Unlike business dashboards, AI systems cannot rely on the averaging effect of broad summary metrics to mask data quality issues. AI systems operate at a granular level, where even small errors or missing values in individual records can significantly skew predictions. AI pipelines therefore demand more granular error detection and correction than BI pipelines.

Without human intervention to apply common sense and domain knowledge, AI systems may make harmful decisions impacting customers and business processes. For example, a healthcare organization’s model validation team discovered that doctors were using two different spellings for the same drug, and the proposed AI system was unable to handle the alternative spelling extracted from doctors’ notes.
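
A simple mitigation, sketched with the standard library's difflib (the vocabulary and misspelling are invented), is to canonicalize free-text values against a controlled vocabulary before they reach the model, routing anything unmatched to human review:

    import difflib

    # Hypothetical controlled vocabulary of canonical drug names.
    CANONICAL = ["acetaminophen", "amoxicillin", "atorvastatin"]

    def canonicalize(raw: str) -> str | None:
        """Map a free-text spelling to its canonical form, if close enough."""
        matches = difflib.get_close_matches(raw.lower(), CANONICAL, n=1, cutoff=0.8)
        return matches[0] if matches else None

    print(canonicalize("Acetominophen"))  # -> "acetaminophen"
    print(canonicalize("unknown drug"))   # -> None, flag for human review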

To address these challenges, the modern data stack must incorporate comprehensive data quality safeguards, including early warning systems that trigger alerts the moment quality standards are not met, protecting the AI system's reliability and accuracy.
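
A minimal sketch of such an early warning (the thresholds and columns are assumptions) validates every batch at the record level, rather than trusting averaged summaries, and emits an alert as soon as a rule is breached:

    import pandas as pd

    # Hypothetical per-column quality rules.
    RULES = {
        "amount": {"max_null_rate": 0.01, "min": 0.0},
        "age": {"max_null_rate": 0.0, "min": 0, "max": 120},
    }

    def check_batch(df: pd.DataFrame) -> list[str]:
        """Return alert messages for a batch of records."""
        alerts = []
        for col, rule in RULES.items():
            null_rate = df[col].isna().mean()
            if null_rate > rule["max_null_rate"]:
                alerts.append(f"{col}: null rate {null_rate:.1%} exceeds limit")
            values = df[col].dropna()
            if "min" in rule and (values < rule["min"]).any():
                alerts.append(f"{col}: values below {rule['min']}")
            if "max" in rule and (values > rule["max"]).any():
                alerts.append(f"{col}: values above {rule['max']}")
        return alerts

    batch = pd.DataFrame({"amount": [10.0, None, -5.0], "age": [34, 29, 130]})
    for alert in check_batch(batch):
        print("DATA QUALITY ALERT:", alert)  # quarantine the batch, page the on-call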

To fully harness AI's potential in modern business operations, we must embrace a new approach to data management explicitly designed for AI systems. This innovative approach should be flexible and adaptable, allowing for rapid iteration and retraining of machine learning models, ensuring their continued relevance in a rapidly changing world.

By acknowledging the unique challenges posed by AI and proactively addressing them, businesses can unlock the full potential of this powerful technology and achieve better business outcomes. It is time to create a modern data stack that can keep pace with the dynamic world of AI, supporting businesses as they navigate an increasingly complex competitive landscape.

About the Author

Colin Priest is the Chief Evangelist at FeatureByte. He is a seasoned business leader with more than 30 years of experience across various industries, including finance, healthcare, security, oil and gas, government, telecommunications, and marketing. With a focus on data science initiatives, he has held several CEO and general management roles, while also serving as a business consultant, data scientist, thought leader, behavioral scientist, and educator.

Colin's passion for exploring the synergy between humans and AI has led him to contribute to projects on AI ethics, governance, and the future of work. He has gained global recognition from the World Economic Forum and Singapore government for his work on AI governance and ethics. Additionally, he is a passionate healthcare advocate who does pro-bono work for cancer research.
