Sri Krishnamurthy

Data-driven decision making has enabled companies to harness significant value from data in all aspects of business. Chief data officers (CDOs) leading digital transformation efforts are seeing the importance of data at every stage, from collection to processing to learning insights from datasets. With innovations in hardware, cloud and algorithms, and the availability of large datasets and analytical tools, the adoption of data science, artificial intelligence (AI) and machine learning (ML) is exploding. In addition, the AI/ML revolution has provided an impetus to ensure the collection, processing, storage and governance of data are robust and well-managed within the organization. Recognizing the value of data to the enterprise, data governance efforts have scaled up to identify, address and mitigate risks associated with data. As more and more decisions become automated, AI/ML models, which rely heavily on data, face challenges associated with interpretability, bias, explainability, fairness and model governance. Most organizations undergoing digital transformation efforts recognize that data governance and AI/ML model governance are heavily intertwined but continue to treat them as two separate silos.

In this article, we seek to bring clarity to some of the data and model governance challenges that arise when adopting data science and AI/ML processes in the enterprise. As the role and importance of the CDO evolves within an organization, it is essential to recognize how the landscape of data and model-driven methods is changing traditional business practices. We believe a holistic data and model governance framework is needed to successfully adopt AI/ML within the enterprise and to plan for a future where data-driven decision making plays a key role in the execution of business strategies.

Every CDO needs to think about five key drivers when establishing a comprehensive data and model governance practice for adopting AI/ML in the enterprise.

1. DATA AND MODEL GOVERNANCE ARE INTERTWINED

When working with AI/ML models, it is important to realize that data drives the models. In a recent engagement as a third-party independent model validation agency, we asked the model governance team for a meeting with the data governance team. The data governance team questioned why they had to be involved in a model validation exercise for machine learning models. We had to discuss the interdependencies and convince the data governance team to engage in the model validation process, since the two processes are intertwined. In the past, data governance and model governance were treated as separate silos, and organizations drew distinctions between the realm of data modeling and the model development world. In today’s world of AI/ML, a comprehensive strategy is needed to integrate data governance and model governance issues.

2. DATA QUALITY IS PARAMOUNT FOR PRODUCTIONIZING MACHINE LEARNING MODELS

In machine learning, garbage in means garbage out. It is important to govern every step of the machine learning workflow, including data processing steps, because poor-quality data hinders the adoption of AI and machine learning processes in production. In a recent article [1], Tom Redman emphasizes the importance of data quality and concludes that machine learning models are useless if the data quality is bad. It is estimated that 80% of the effort in machine learning is spent in data processing. We recently worked on a project where the modelers threw out 60-70% of the data in the modeling process because of data quality issues, which significantly affected the quality of the machine learning outputs. A comprehensive data quality framework is required to cater to machine learning needs in organizations. This includes having a comprehensive strategy covering data acquisition, storage, pre-processing, handling missing values, feature engineering, master data management and archiving, and having processes to address and mitigate data risks.
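As a rough illustration of what a first data quality check might look like in practice, the sketch below reports the fraction of records with missing values per field and counts how many records remain usable. The field names, the sample records and the helper functions are hypothetical and chosen only for illustration; a real framework would cover many more checks than missing values.

```python
from typing import Any


def is_missing(value: Any) -> bool:
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def missing_value_report(rows: list[dict[str, Any]], required: list[str]) -> dict[str, float]:
    """Return, for each required field, the fraction of rows where it is missing."""
    total = len(rows)
    report = {}
    for name in required:
        missing = sum(1 for row in rows if is_missing(row.get(name)))
        report[name] = missing / total if total else 0.0
    return report


def usable_rows(rows: list[dict[str, Any]], required: list[str]) -> list[dict[str, Any]]:
    """Keep only the rows in which every required field is present."""
    return [row for row in rows if not any(is_missing(row.get(name)) for name in required)]


# Hypothetical records, for illustration only.
records = [
    {"customer_id": "c1", "income": 52000, "region": "NE"},
    {"customer_id": "c2", "income": None, "region": "NE"},
    {"customer_id": "c3", "income": 61000, "region": ""},
    {"customer_id": "c4", "income": 48000, "region": "SW"},
]
required_fields = ["customer_id", "income", "region"]

report = missing_value_report(records, required_fields)
clean = usable_rows(records, required_fields)

print(report)
print(f"{len(clean)} of {len(records)} rows usable")
```

Tracking a report like this over time gives the business an early signal of how much data is being discarded before modeling begins.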

3. METADATA MANAGEMENT IS IMPORTANT

In addition to processing datasets, it is important to consider the metadata associated with these datasets. In the last five years, the machine learning life cycle has matured significantly. Feature stores [2] are becoming extremely popular for serving features to machine learning applications. Feature stores provide curated datasets to machine learning applications and enable traceability of data. In addition, metadata management, especially versioning different data snapshots and tracking the provenance and lineage of datasets, is essential to enhance reproducibility and to track machine learning model performance. Companies like Amazon [3] have proposed frameworks to manage the provenance and lineage of metadata when building machine learning models. In addition, open source projects like Delta Lake [4] are being proposed to enable life-cycle management of data lakes for machine learning projects. This is an evolving area but is becoming important as the scale of data to be managed increases within the enterprise.
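To make the idea of snapshot versioning and lineage concrete, here is a minimal sketch that derives a version identifier from a content hash and records which parent snapshots a dataset was built from. It is not how feature stores or Delta Lake work internally; the `SnapshotRecord` class, the `register` helper and the dataset names are assumptions made for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(rows: list[dict]) -> str:
    """Hash a canonical JSON form of the rows so identical content gets the same version."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


@dataclass
class SnapshotRecord:
    """Hypothetical metadata record for one dataset snapshot."""
    name: str
    version: str                      # content hash of the snapshot
    source: str                       # free-text note on how the snapshot was produced
    parents: list[str] = field(default_factory=list)  # versions this snapshot derives from


def register(name: str, rows: list[dict], source: str, parents=()) -> SnapshotRecord:
    """Create a metadata record linking a snapshot to its parent snapshots."""
    return SnapshotRecord(name, fingerprint(rows), source, [p.version for p in parents])


# Hypothetical data, for illustration only.
raw_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
raw = register("transactions_raw", raw_rows, source="hypothetical upstream export")

cleaned_rows = [row for row in raw_rows if row["amount"] is not None]
cleaned = register(
    "transactions_clean",
    cleaned_rows,
    source="dropped rows with missing amount",
    parents=[raw],
)

print(cleaned.version[:12], "derived from", [p[:12] for p in cleaned.parents])
```

Storing the version of the training snapshot alongside each trained model is one simple way to tie a model's performance back to the exact data it saw.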

4. FAIRNESS, BIAS, PRIVACY, SECURITY

AI/ML governance has been an important topic of discussion in the last year. At a recent model governance conference in San Francisco, I had the opportunity to discuss the topic of AI/model governance with various governance teams from multiple financial organizations. The lack of comprehensive guidance from regulators, the pace of technological innovation, and the plethora of options for building machine learning systems today, from open-source tools to black boxes, make adopting machine learning a complicated process. In addition, with models becoming so complex to design, especially in unanticipated and volatile situations like COVID-19, explicit efforts need to be made to understand the behavior of models, especially when addressing stressed and edge cases. The World Economic Forum [5] recently issued comprehensive guidelines to address governance issues pertaining to the adoption of AI/ML products. In addition, GDPR, the European Union’s guidelines for adoption of AI [6], etc., provide guidelines to ensure that issues like fairness, bias, privacy, security, interpretability, explainability and auditability are addressed as part of AI adoption within the enterprise. Companies must have a comprehensive strategy to formulate policies on how to address these aspects and to address potential gaps in the data processing steps. Data annotation, synthetic data generation and tagging/labeling are novel areas to many organizations, and governance policies must assess how these new areas will impact their operational processes, model development and deployment.
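As one small example of how a fairness question can be turned into a measurable check, the sketch below compares the rate of positive model decisions across groups, a quantity often called the demographic parity difference. This is only one of many possible fairness metrics and is not prescribed by the guidelines cited above; the group labels and decisions are hypothetical.

```python
from collections import defaultdict


def selection_rates(groups: list[str], decisions: list[int]) -> dict[str, float]:
    """Fraction of positive (1) decisions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(groups: list[str], decisions: list[int]) -> float:
    """Gap between the highest and lowest group selection rates (0.0 means equal rates)."""
    rates = selection_rates(groups, decisions)
    return max(rates.values()) - min(rates.values())


# Hypothetical model decisions (1 = approved, 0 = declined), for illustration only.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
decisions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, decisions)
gap = demographic_parity_difference(groups, decisions)

print(rates)
print(f"demographic parity difference: {gap:.2f}")
```

What size of gap is acceptable is a policy decision rather than a technical one, which is why such checks belong in the governance process and not only in the modeling team.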

5. ADDRESSING THE SKILL GAPS

Despite the downturn in the economy, organizations adopting and relying on data-driven decision making continue to experience skill shortages in the areas of machine learning and data processing. At QuantUniversity [7], we have trained thousands of analysts and data professionals in data and machine learning techniques in the last few years. Despite the rapid growth of educational programs teaching AI and machine learning topics, there is a shortage of qualified data and machine learning professionals who can address evolving challenges within the enterprise. Companies must proactively review skill gaps and ensure that comprehensive teams are formed within organizations. In addition to model, data and operational risks, companies take on a huge reputational risk when data-related and model-related issues affect business processes. Security breaches within organizations, use of stale data in models, and the wrong parameters applied to models can cause enormous shocks in the operations of organizations and, if the scale is large, could lead to systemic shocks affecting financial markets, supply chains, etc. Organizations must ensure that well-trained personnel who can enforce the data and model governance policies are available to address the growing challenges of machine learning.

CONCLUSION

It is said that data is the new fuel when it comes to AI/ML models. As organizations move toward data-driven decision making, it is important for CDOs to proactively develop strategies to enable the benefits of AI/ML methods within their organizations. The rise of AI/ML in the enterprise has created novel challenges for CDOs, who are responsible for getting the data strategy right so that AI/ML is adopted successfully. To summarize, with the introduction of AI/ML methods in the enterprise:

1. Governance needs to be more comprehensive and integrated across data and AI/ML.

2. Data quality needs to be prioritized and streamlined from the ground up and driven by business.

3. Issues like metadata management must be proactively designed from the beginning.

4. Issues of privacy, bias, security and fairness need to be assessed and factored into workflow design.

5. Organizations must evaluate the evolving skill needs, and proactively gear up toward acquiring or retraining employees to address the skill gaps.

The AI/ML revolution has just begun, and CDOs are front and center in steering their organizations’ data strategies toward the fourth industrial revolution. While the technologies are exciting and companies are leaping toward adopting these frontier areas, factoring in governance throughout the process is the responsible thing to do and will lead to successful outcomes.

REFERENCES

1. https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless

2. http://featurestore.org/

3. https://pdfs.semanticscholar.org/093c/3b389384812ea16f1ad18ce6c5f43c4f7106.pdf

4. https://databricks.com/product/delta-lake-on-databricks

5. https://www.weforum.org/white-papers/ai-governance-a-holistic-approach-to-implement-ethics-into-ai

6. https://ec.europa.eu/digital-single-market/en/artificial-intelligence

7. www.quantuniversity.com

Sri Krishnamurthy, CFA, CAP, is the founder of QuantUniversity.com, a data and quantitative analysis company. Sri is a recognized AI and machine learning expert with more than two decades of experience in quantitative analysis, statistical modeling, and data and model governance. Prior to starting QuantUniversity, Sri worked at Citigroup, Endeca, and MathWorks, and has consulted with more than 25 companies, including leadership teams at many Fortune 500 companies. Sri serves as an adjunct professor and has trained more than 1,000 students in quantitative methods, analytics and big data in the industry, at Babson College, Northeastern University and Hult International Business School. Sri is a frequent speaker on AI and machine learning-related topics, and he has spoken at various industry gatherings and conferences hosted by the CFA Institute, PRMIA, CQF, ARPM, ODSC, ReWork, GFMI, Marcus Evans, QWAFAFEW, QCon, SAMSI, PAPIS, MathWorks, Babson College, Northeastern University, COSEAL, DataCon, etc.

Sri earned master of science degrees in computer systems engineering and computer science from Northeastern University, and an MBA with a focus on investments from Babson College.

Sri can be reached at sri@quantuniversity.com.