Poor data quality is public enemy number one for all artificial intelligence projects and programs. The now-famous “Weapons of Math Destruction” and other sources illustrate what can go horribly wrong. More commonly, however, bad data adds enormous time, expense, and uncertainty, compromising, and sometimes killing, otherwise promising work. The quality requirements for artificial intelligence are steep, far steeper than for anything else most businesses face, and meeting them is tough. Yet it is easy to underestimate them. So step one is to develop a detailed understanding of the issues, and that is the subject of this article. The point is especially timely, as ChatGPT reminds leaders that AI must be on their radar screens.
I’ll describe further steps in a subsequent article.
The first issue is the sheer breadth and depth of data quality requirements. I often start data quality discussions with the following vignette: You’re on your commute home when you get a call from the principal at your teenager’s school. “Your child was caught fighting today. I don’t think they started it, but our policy is clear. A week’s suspension.”
Naturally, your attention turns to your kid, and your mind reels. When you get home, you ask your child, as calmly as you can, “How was school today?”
Their response is immediate. “It was great. I got an A-minus on my Spanish test.”
Now, you do not doubt your child got an A-minus. The response was accurate. But it certainly was not relevant to the conversation you wanted to have. And you’re pretty certain your child knows this.
The vignette illustrates an important point: In many situations, accuracy is not enough.
I find that two broad categories of quality requirements are important in almost all artificial intelligence projects:
Are the data right, that is, “correct enough” for the task at hand?
Are these the right data for the task at hand?
Let’s consider each in turn. In most ongoing business operations, we only worry about the data right category (usually because the right data question was dealt with sometime earlier). But still, only about 3% of companies’ data meet basic standards in this category. This statistic stems from the so-called Friday Afternoon Measurement, in which those familiar with a data set call out obvious (to them) errors. It supports the case that leaders should adopt a “guilty till proven innocent” attitude with respect to data quality.
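The Friday Afternoon Measurement is simple enough to sketch in a few lines. Reviewers who know the data examine a sample of roughly 100 recent records and flag those with any obvious error; the count of error-free records is the score. The data, field names, and the pass threshold below are all illustrative assumptions, not part of the published method.

```python
def fam_score(flagged_records):
    """Friday Afternoon Measurement: given, for each of ~100 recent
    records, whether reviewers found at least one obvious error,
    return the number of error-free records (the FAM score)."""
    return sum(1 for has_error in flagged_records if not has_error)

# Hypothetical review: of 100 recent records, 24 had at least one error.
flags = [True] * 24 + [False] * 76
score = fam_score(flags)            # 76 error-free records out of 100
meets_basic_standard = score >= 97  # the threshold here is an assumption
```

Under this illustrative threshold, a score of 76 falls well short, consistent with the finding that very few companies’ data clear the bar.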
Further, the data right category includes issues that don’t come up otherwise, such as “Are all the labels correct?” It also includes issues that only arise when one combines data from various sources, as is often important in AI projects. For example, “Is this ‘Mary Smith’ from source A the same person as this ‘Smith, ME,’ from source B?” Such issues multiply rapidly with the number of sources. There is a lot to think through here.
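A minimal sketch of the “Mary Smith” problem, using only Python’s standard library: normalize the two name formats, then score their similarity. The normalization rule and the match threshold are assumptions for illustration; real record linkage also compares addresses, dates of birth, and other fields.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Rewrite 'Last, First' as 'first last', lowercased, for comparison."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return name.lower().strip()

def similarity(a, b):
    """Similarity of two names on a 0-to-1 scale."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

s = similarity("Mary Smith", "Smith, ME")  # compares 'mary smith' vs 'me smith'
possible_match = s > 0.7  # threshold is illustrative, not a standard
```

Here the pair scores high enough to flag for review, but not to merge automatically; deciding what happens in that gray zone is exactly the kind of judgment call that makes multi-source work hard.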
As if data right requirements weren’t enough, for AI, right data requirements are equally important. And there are many of them. Most people know AI requires “lots of data (enough to train a model).” Four other requirements — relevancy, predictive power, comprehensiveness, and freedom from bias — are almost always important, and there may be others. Since most people are less familiar with these terms, I’ve included brief descriptions in the box.
All of these can be tough, though bias garners the most attention. Consider how many ways a data set can embed historical bias. A historical data set may clearly reflect racial bias because redlining was practiced. But what about unintended biases based on age, high school attended, years in the military, whether one’s parents are alive, and so forth? It is difficult enough to even state the requirement, much less test it. Leaders ignore the potential for bias at their peril. They should also pay considerable attention to understanding the full range of right data requirements.
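One simple, far-from-sufficient screen for unintended bias is to check whether a seemingly neutral feature correlates with a protected attribute, i.e., acts as a proxy for it. The feature names and data below are hypothetical, and passing this check does not establish freedom from bias.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical records: zip_code_group looks neutral; protected is a
# 0/1 protected-class indicator from the same records.
zip_code_group = [1, 1, 1, 0, 0, 1, 0, 0]
protected      = [1, 1, 0, 0, 0, 1, 0, 1]

r = pearson(zip_code_group, protected)
# |r| well above zero suggests the feature may act as a proxy
```

Even this toy check only catches simple linear relationships; real proxies can hide in combinations of features, which is one reason stating and testing the bias requirement is so hard.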
Dealing with quality issues to create a model involves cleaning the old data. It is time-consuming, fraught, expensive, and simply no fun. Perhaps 80% of the time needed to develop a model goes to cleaning the data. Worse, with even the best cleaning, errors go undetected and/or uncorrected, with no way to understand the impact on the predictive model.
But cleaning is only part of the task. The second big issue is that quality requirements come up twice — once in the old data used to train the model and then again in the new data when the model is put to work. Cleaning new data on an ongoing basis quickly becomes more onerous and expensive than cleaning up old data. As I’ll discuss, you must attack this issue differently.
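Because the model meets quality requirements a second time in production, many teams put a validation gate in front of scoring: each incoming record is checked against the same rules applied to the training data before it is allowed through. The rules, field names, and records below are hypothetical; a sketch of the idea, not a production design.

```python
# Hypothetical per-field validation rules, mirroring those used on
# the training data.
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
    "state": lambda v: isinstance(v, str) and len(v) == 2,
}

def validate(record):
    """Return the list of fields that fail; an empty list means pass."""
    return [f for f, ok in RULES.items()
            if f not in record or not ok(record[f])]

good = validate({"age": 42, "income": 55000.0, "state": "NJ"})  # []
bad = validate({"age": -5, "income": 55000.0})                  # ['age', 'state']
```

Records that fail can be quarantined for repair rather than silently scored, which is cheaper than cleaning everything after the fact, and far cheaper than acting on a bad prediction.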
A very important “people issue” exacerbates these two. Senior-level data scientists know that developing a deep understanding of how and why the data they use are created, of the subtleties in their definitions, and of their strengths and weaknesses concerning quality makes for better models and business results. Doing so helps them evaluate whether these are the right data and whether they are right enough for their models. But, as the article “Everyone wants to do the model work, not the data work,” written by Google researchers, points out, too many data scientists simply do not want to do this essential work. Nor do most have the skills to do so, even if they wanted to. “The data work” is time-consuming, grubby, and filled with real-world complications. Data quality and “getting out there” are not featured in data science curricula. So much the better to focus on models; after all, the computer hums along with whatever it is given.
Don’t shortchange any of these issues; you simply must understand them in specific detail as they relate to each AI project.
About the Author
Thomas C. Redman, Ph.D., “the Data Doc,” is President of Data Quality Solutions. He helps startups and multinationals, senior executives, chief data officers, and leaders buried deep in their organizations chart their courses to data-driven futures, emphasizing quality and analytics. Redman holds a doctorate in statistics.