Opinion & Analysis

Introduction to Synthetic Data: Balancing AI Innovation and Privacy

Written by: Gokula Mishra

Updated 6:32 PM UTC, Thu February 8, 2024

As adoption of AI and data-driven solutions grows, more countries around the globe are putting data protection regulations in place. Gartner predicts, “By 2024, 75% of the Global Population will have Its Personal Data Covered Under Privacy Regulations.”

These regulations force companies to walk a fine line: they want to use all available data to build better-quality AI solutions, but they must comply with the regulations of every country in which they operate. One increasingly popular way to deal with this situation is to use synthetic data.

What is synthetic data?

Synthetic data is artificial, computer-generated data that mimics real-world data in format, granularity, and statistical properties. However, it is not sourced from actual observations or events. It is generated using a variety of mathematical and statistical techniques, as well as deep learning models such as Generative Adversarial Networks (GANs).

What is it not? It is not dummy data or random data with the same schema. You can think of it as generative AI for tabular data.
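To make the difference from dummy data concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption: the "real" table is itself simulated, the column names (age, income) are hypothetical, and the synthesizer is a simple fitted Gaussian model, one of the basic statistical techniques, not a GAN and not any vendor's method. It shows that random data with the same schema loses the relationship between columns, while even a simple fitted model keeps it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "real" table: two correlated numeric columns (age, income).
n = 5_000
age = rng.normal(40, 10, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)
real = np.column_stack([age, income])

# Dummy data: same schema and per-column ranges, but each column is drawn
# independently, so the relationship between the columns is lost.
dummy = np.column_stack([
    rng.uniform(real[:, 0].min(), real[:, 0].max(), size=n),
    rng.uniform(real[:, 1].min(), real[:, 1].max(), size=n),
])

# Simple statistical synthesizer (an assumption for illustration):
# fit the mean and covariance of the real table, then sample new rows.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=n)


def correlation(table):
    """Pearson correlation between the two columns of a table."""
    return float(np.corrcoef(table, rowvar=False)[0, 1])


real_corr = correlation(real)
dummy_corr = correlation(dummy)
synthetic_corr = correlation(synthetic)

print(f"real: {real_corr:.2f}  dummy: {dummy_corr:.2f}  synthetic: {synthetic_corr:.2f}")
```

The dummy table's correlation should come out near zero, while the synthetic table's should sit close to the real one. A Gaussian fit captures only means and linear relationships, which is one reason more expressive models are used in practice.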

It’s important to note that while synthetic data can be highly useful, it’s not a perfect replacement for real data in all situations. The generated data might not capture all the nuances and complexities of the real world, potentially leading to limitations in certain applications. However, advances in synthetic data generation techniques are continually improving the quality and utility of synthetic data for various use cases.

Synthetic data uses

There are many use cases for synthetic data such as:

Privacy protection: When dealing with sensitive or personal data, we need to anonymize it to comply with data privacy laws and regulations, so that AI/ML and other analytics projects can proceed without hassle.
Testing and development: Synthetic data can be used in software testing, model development, and algorithm tuning.
Data set augmentation: Make your data better than real (fix imbalances, increase volume, etc.).
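As one illustration of the augmentation use case, the sketch below fixes a class imbalance by creating new minority-class rows that lie between randomly chosen pairs of existing ones. It is a simplified take on the interpolation idea behind SMOTE, not SMOTE itself (which interpolates toward nearest neighbors). The function name `oversample_minority`, the two-class setup, and the sample data are all hypothetical.

```python
import numpy as np


def oversample_minority(features, labels, minority_label, seed=0):
    """Add interpolated minority rows until the minority class matches the rest."""
    rng = np.random.default_rng(seed)
    minority = features[labels == minority_label]
    n_needed = int((labels != minority_label).sum() - len(minority))
    if n_needed <= 0 or len(minority) < 2:
        return features, labels
    # Pick random pairs of minority rows, then a random point on the
    # straight line between each pair.
    first = minority[rng.integers(0, len(minority), size=n_needed)]
    second = minority[rng.integers(0, len(minority), size=n_needed)]
    weight = rng.random((n_needed, 1))
    new_rows = first + weight * (second - first)
    new_labels = np.full(n_needed, minority_label, dtype=labels.dtype)
    return np.vstack([features, new_rows]), np.concatenate([labels, new_labels])


# Hypothetical imbalanced data set: 90 rows of class 0, 10 rows of class 1.
rng = np.random.default_rng(1)
features = rng.normal(size=(100, 3))
labels = np.array([0] * 90 + [1] * 10)

balanced_features, balanced_labels = oversample_minority(
    features, labels, minority_label=1
)
print(balanced_features.shape, int((balanced_labels == 1).sum()))
```

Note that every new row is derived directly from real rows, so a technique like this needs access to the source data and offers no privacy protection by itself.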

There are many more use cases, which will be covered in depth in the following articles.

It is important to note that different use cases require different synthetic data technology. Just as there is no single cybersecurity tool, synthetic data exists on a spectrum and can be used for many different tasks. For example, data set augmentation requires a lot of technical customizability (and access to the source data). However, technical customizability creates privacy risk, making it a poor form factor for data sharing.

Currently, the most common use case for synthetic data is privacy and compliance. Around 54% of CDO survey respondents list regulatory and ethical issues as a primary barrier to adoption.

Only 4% of Large Enterprises Have Generative AI Projects in Production (Source: AlphaWise, Morgan Stanley Research 2Q23 CIO Survey).

The key success factors are:

Simplifying data governance and access controls to enable data democratization
Strengthening third-party risk management and reducing time-to-value for new tools
Reducing regulatory risk exposure through anonymized data safe harbors

To adopt synthetic data at enterprise scale:

Resolve the legal challenge unambiguously and automatically
Automate data quality: deliver high-quality data without a team of synthetic data experts
Use standard connectors, not custom integrations: the existing data stack should still work

There are many methods for generating synthetic data. I will cover the technology behind synthetic data generation in a follow-up article.

While synthetic data techniques have been in use for a while, only recently have they drawn attention from the AI community as a way to deal with data challenges as well as privacy-related restrictions on data usage.

I want to emphasize that generating synthetic data is only part of the picture. Its management, deployment, governance, and versioning, as well as testing its validity for the purpose it was originally designed for, are equally important.
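As a rough idea of what one automated validity check might look like, the sketch below compares a single numeric column of real and synthetic data using the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical distributions, where 0 means identical and values near 1 mean very different. The sample data is simulated, and a real validation suite would cover much more (relationships between columns, downstream model performance, privacy metrics).

```python
import numpy as np


def ks_statistic(real_column, synthetic_column):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    real_sorted = np.sort(real_column)
    synthetic_sorted = np.sort(synthetic_column)
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real_sorted, synthetic_sorted])
    real_cdf = np.searchsorted(real_sorted, grid, side="right") / len(real_sorted)
    synthetic_cdf = (
        np.searchsorted(synthetic_sorted, grid, side="right") / len(synthetic_sorted)
    )
    return float(np.max(np.abs(real_cdf - synthetic_cdf)))


# Hypothetical columns: one faithful synthetic sample, one that has drifted.
rng = np.random.default_rng(0)
real_column = rng.normal(50, 10, size=2_000)
good_synthetic = rng.normal(50, 10, size=2_000)
drifted_synthetic = rng.normal(70, 10, size=2_000)

good_score = ks_statistic(real_column, good_synthetic)
drifted_score = ks_statistic(real_column, drifted_synthetic)
print(f"faithful: {good_score:.3f}  drifted: {drifted_score:.3f}")
```

A check like this could be rerun for each new version of a synthetic data set, which ties validity testing to the versioning and governance concerns above. Any pass/fail threshold would be a judgment call for the specific use case.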

A number of vendors are emerging in this space, such as Subsalt, Cognida.ai, Ydata, Tonic, Datomize, and Betterdata, along with open source products such as Synner, Datagene, and mirrorGen. I will cover the product and vendor landscape and customer findings in detail in a follow-up article.

The next article will focus in detail on the use cases for synthetic data.

About the Author:

Gokula Mishra is Chief Editorial Reviewer of the CDO Magazine editorial board, former VP of Data Science & AI/ML at Direct Supply and Head of Data Analytics and Supply Chain globally at McDonald's. He brings 30+ years of data analytics and AI/ML experience across many industries, creating lasting business value internally and externally.