Opinion & Analysis

Clean Data Is to LLMs What Rich Fertilizer Is to Crops

The author draws parallels between developing robust AI models and cultivating a healthy crop.

Written by: Tarun Sood | Chief Data Officer, American Century Investments

Updated 4:54 PM UTC, Wed September 13, 2023

Developing robust AI models and cultivating a healthy crop share surprising parallels, especially when examining the importance of the data preparation phase in AI/ML and the role of fertilizers in agriculture. Clean data is crucial for the effective training of Large Language Models (LLMs).

These LLMs excel at generating human-like text, and their competence is rooted in the quality of their training datasets. Rigorous data quality is therefore fundamental to achieving excellence in LLMs, and data cleansing is the vital step in that journey: it ensures that the model receives data that is not only of top quality but also pertinent to and representative of the desired outputs.

The adage, “Garbage in, garbage out,” possesses significant relevance in the domain of AI/ML. Suppose we compare an AI model to a student in a classroom. Students rely heavily on educational materials to shape their knowledge base. However, if there are several inaccuracies or biased perspectives within these materials, it can severely impede the student’s learning.

Similarly, when subjected to inconsistent or biased datasets, AI models often generate outputs that may deviate from expected norms. Though high in computational confidence, these outputs can misrepresent or distort the objective truths they aim to deduce.

This divergence underscores the importance of ensuring the accuracy of the training data. If a student’s misconceptions can be attributed to flawed textbooks, by analogy, the anomalies in AI predictions can often be traced back to the quality of data upon which these models are trained.

The challenge becomes more pronounced with language models, which must tease out subtle relationships, detect complex patterns, and comprehend nuanced contexts. High data quality is essential for LLMs and just as critical for AI and ML models more broadly.

Suppose we compare developing an AI/ML model to executing our business strategy. The steps required to deploy an ML model successfully would be analogous to the following:

Strategic Vision (Problem Definition): We begin by identifying the nature of the challenge. Is it a categorical assessment (classification), a predictive analysis (regression), or a segmentation exercise (clustering)?
Resource Acquisition (Data Collection): Secure premium, relevant data. It’s akin to choosing the finest ingredients for an exquisite dish.
Data Refinement (Data Preparation): Ensure ingredients are pristine for cooking any dish. Data must undergo cleansing to remove anomalies, ensuring optimal model performance.
Methodology Selection (Model Selection): Align the business problem with the suitable algorithmic approach—choosing the right tool for the task.
Calibration (Model Training): Feed the selected model curated data to foster learning and adaptability.
Performance Review (Model Evaluation): Assess the model’s efficacy with diverse datasets, ensuring it is primed for various scenarios.
Operational Integration (Deployment): Integrate the validated model into the operational framework to drive decision-making and innovation.
Ongoing Oversight (Monitoring & Maintenance): Update and refine the model in alignment with evolving business landscapes and data shifts.

The list above covers the steps in the model development lifecycle. For this article, however, let us drill into the data collection and preparation stages of AI/ML models and see how data quality is embedded in each data phase.

As the picture above shows, data cleaning is crucial in any AI/ML data preparation phase.

First, a robust data collection process lays the foundation for precise and trustworthy AI/ML predictions.
Second, ensuring sound data quality at the data collection phase minimizes time-consuming and often complex data cleaning tasks, accelerating model deployment timelines.
Third, addressing data quality upfront can prevent expensive corrective actions later, particularly when errors propagate through AI/ML pipelines.
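To make the point about catching problems at collection time concrete, here is a minimal Python sketch of a record check that could run as data arrives. The function name `validate_record` and the field names (`id`, `amount`, `currency`) are hypothetical, chosen only for illustration; a real pipeline would likely rely on a dedicated validation tool and its own schema.

```python
def validate_record(record, required=("id", "amount", "currency")):
    """Return a list of problems found in one incoming record.

    An empty list means the record passed these (illustrative) checks.
    The field names are assumptions, not taken from any real schema.
    """
    problems = []

    # Flag required fields that are absent or empty.
    for field in required:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Flag an amount that is present but not a number.
    amount = record.get("amount")
    if amount is not None and not isinstance(amount, (int, float)):
        problems.append("amount is not numeric")

    return problems


good = {"id": 1, "amount": 9.5, "currency": "USD"}
bad = {"id": 2, "amount": "n/a", "currency": ""}

print(validate_record(good))
print(validate_record(bad))
```

Rejecting or quarantining records that return a non-empty list at this stage keeps the errors out of the downstream cleaning and training steps.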

Data quality is an explicit step in which data is cleansed by handling missing values, removing outliers, and correcting errors. It is also an implicit step in the various data preparation phases: data collection, integration and transformation, reduction and splitting, and feature engineering.
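The three explicit cleansing tasks named above can be sketched in a few lines of Python using only the standard library. This is one possible approach under stated assumptions, not a prescribed method: missing values are filled with the median, outliers are dropped using the common 1.5 x IQR rule, and label errors are limited to stray whitespace and inconsistent casing. The function names `clean_values` and `normalize_label` are hypothetical.

```python
from statistics import median, quantiles


def clean_values(raw):
    """Fill missing values with the median, then drop IQR outliers.

    `raw` is a list of numbers in which None marks a missing value.
    """
    present = [v for v in raw if v is not None]
    fill = median(present)

    # Step 1: handle missing data by imputing the median.
    filled = [fill if v is None else v for v in raw]

    # Step 2: remove outliers outside 1.5 * IQR of the quartiles.
    q1, _, q3 = quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in filled if low <= v <= high]


def normalize_label(label):
    """Step 3: correct simple entry errors (whitespace, casing)."""
    return label.strip().lower()


readings = [10, 12, None, 11, 13, 500, 12]
print(clean_values(readings))
print(normalize_label("  Equity "))
```

Whether to impute, drop, or flag a value depends on the dataset; the thresholds here are conventional defaults rather than recommendations.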

Clean data ensures that models learn the actual patterns and relationships, not noise or errors. It is crucial for both model training and detecting model drift, ensuring timely retraining and optimal performance. Maintaining high data quality reduces biases and curbs overfitting, ensuring more accurate and generalizable AI/ML models.
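As a rough illustration of drift detection, the Python sketch below compares recent values of a single feature against the training baseline and flags a shift when the mean moves by more than a chosen number of baseline standard deviations. This is a deliberately simple heuristic picked for illustration; the function names and the threshold of 2.0 are assumptions, and production systems typically use richer statistical tests.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean,
    measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance")
    return abs(mean(recent) - mean(baseline)) / spread


def has_drifted(baseline, recent, threshold=2.0):
    """True when the shift exceeds the (assumed) threshold."""
    return drift_score(baseline, recent) > threshold


training_values = [10, 11, 9, 10, 10]
print(has_drifted(training_values, [10, 10, 11]))
print(has_drifted(training_values, [14, 15, 13]))
```

A check like this is only meaningful when both the baseline and the recent data are clean; noisy or erroneous inputs can either mask real drift or trigger false alarms, which is why data quality matters for monitoring as well as training.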

In summary, data in AI/ML acts as an anchor for algorithms, paralleling the role of soil for crops, and data cleaning in AI/ML is analogous to using fertilizer for those crops. Just as fertilizer rectifies soil deficiencies, enriching it to optimize plant growth, data cleaning corrects inconsistencies, missing values, and errors in datasets to optimize model training.

If crops are given poor soil lacking the necessary nutrients, their growth is stunted, producing a subpar yield. Similarly, an AI/ML model trained on unclean data can produce flawed predictions and insights, undermining the model’s performance. In both scenarios, the initial groundwork, whether it is fertilizing the soil or cleaning the data, lays the foundation for the success of the entire endeavor.

About the author:

Tarun Sood is the Chief Data Officer at American Century Investments. Sood joined American Century Investments in 2022, and he is responsible for spearheading the company’s data and analytics initiatives, formulating enterprise data strategies, ensuring data governance, and managing data science and engineering.

Over the years, he has nurtured a deep passion for data analytics, artificial intelligence, machine learning, and championing data-driven and cloud transformations. He counseled top executives on data analytics and AI/ML strategies in his earlier endeavors.

Sood has become a sought-after speaker at conferences discussing these intricate and evolving subjects. Beyond his primary duties, he actively contributes to various boards and is prominent in numerous data and analytics communities.