Branded Content

The Top 5 Approaches to AI Data Privacy

Matching AI use cases to appropriate delivery mechanisms.

Written by: Shayde Christian | Chief Data and Analytics Officer, Cloudera

Updated 1:31 PM UTC, Mon July 8, 2024

While proper use case selection, data quality, and ethical responsibility remain important considerations for employing generative AI (GenAI) apps in your data community, data privacy is still the most critical criterion.

For most of us, the question isn’t whether we will deliver AI applications, but which approach or architecture we will use.

There are myriad ways to answer those questions, but ultimately the relative need for data privacy will dictate our approach. So how can you tell which approach is right for a specific situation?

Whether your organization is trying to generate marketing content, summarize use cases, or train image recognition models, the underlying need for data privacy remains constant. There are a number of paths one could take, but for our purposes let’s zero in on five of the most relevant and explore some of the dos and don’ts of AI data privacy.

1. Expose data directly to powerful open AI models via prompts

GenAI has inspired employees to increase their own productivity, efficiency, or effectiveness. For many organizations valid use cases exist, and that justifies authorizing employees to prompt openly accessible AI models like ChatGPT, Llama, or Claude directly. Nevertheless, depending on the configuration, models may be trained on the data employees provide them, thus exposing it to third parties.

Prompting open models directly may be appropriate for researching and summarizing publicly available customer and prospect information, for conducting competitor analysis, or for generating marketing content.

For each of those use cases, the data being used or generated is publicly available, leaving only text in the prompt and response as “exposed data.”

Best practices when employing this approach include:

Drafting and communicating policies restricting the use of words or terms in prompts that might identify or imply the corporation, its intentions, or its employees.
Reviewing and approving/rejecting open AI model usage across the enterprise.
Opening channels to share (or expose) how employees are using open models, and governing usage accordingly.
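
As one illustration of the first practice, a prompt can be screened for restricted terms before it leaves the organization. The sketch below is a minimal example under stated assumptions, not a complete control: the term list, the function names, and the placeholder text are all hypothetical, and a real policy would supply its own terms and enforcement point.

```python
import re

# Hypothetical restricted terms; a real policy would supply its own list.
RESTRICTED_TERMS = ["Acme Corp", "Project Falcon", "acme.example"]


def find_restricted_terms(prompt, terms=RESTRICTED_TERMS):
    """Return the restricted terms that appear in the prompt (case-insensitive)."""
    hits = []
    for term in terms:
        # Word boundaries avoid flagging a term embedded in a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            hits.append(term)
    return hits


def redact(prompt, terms=RESTRICTED_TERMS, placeholder="[REDACTED]"):
    """Replace every restricted term in the prompt with a placeholder."""
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        prompt = re.sub(pattern, placeholder, prompt, flags=re.IGNORECASE)
    return prompt


if __name__ == "__main__":
    draft = "Summarize public reviews of Acme Corp competitors."
    print(find_restricted_terms(draft))
    print(redact(draft))
```

A simple term list only catches literal matches. It will not catch paraphrases or context that identifies the company indirectly, so it complements the review and governance practices above and does not replace them.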

2. Expose data confidentially to an auspiciously forgetful open AI model

Use retrieval-augmented generation (RAG) to expose private data to open AI models to enhance responses to your prompts. Popular models like ChatGPT, Llama, or Claude can be used in your cloud instance where your data is double-encrypted at rest.

Although your organizational data is exposed to the model to generate its response, the model does not train on that data and expunges it from memory at or before the end of your session. None of the prompts, data, or responses are made available to third parties.

This approach may be appropriate for generating product release content, creating custom messaging based on customer sentiment, mapping customer needs to product offerings, or summarizing customer cases. The use of RAG across the enterprise should also be governed.
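
The core RAG pattern can be sketched in a few lines: retrieve the private records most relevant to a question, then place only those records in the prompt. The example below is a simplified illustration under stated assumptions. The sample documents and function names are hypothetical, word overlap stands in for the embedding-based search a production system would more likely use, and no model is actually called. Encryption and session handling are outside its scope.

```python
import re

# Hypothetical private documents; in practice these would live in your own store.
DOCUMENTS = {
    "case-101": "Customer reports login failures after the latest mobile app update.",
    "case-102": "Customer requests a refund for a duplicate subscription charge.",
    "case-103": "Customer praises the new dashboard but wants export to CSV.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in for vector search)."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in documents.items()]
    # Highest overlap first; ties broken by id for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


def build_prompt(question, documents, top_k=2):
    """Assemble a prompt that contains only the retrieved records as context."""
    ids = retrieve(question, documents, top_k)
    context = "\n".join(f"[{i}] {documents[i]}" for i in ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("Which customer asked for a refund?", DOCUMENTS, top_k=1))
```

Because only the retrieved records enter the prompt, the retrieval step is also a natural place to apply access controls and logging when governing RAG usage.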

3. Do not expose data (build)

If you do not want to expose your data at all, you can develop large language models (LLMs) from scratch. Informal polling and dialog with my peers suggest that this typically costs US$2-3 million. Building models may be well-suited when model complexity is high, when effective models are unavailable, and when model outputs have the potential to yield large-scale efficiency or significant product differentiation.

This was the only suitable approach for insurers, for instance, who trained home-grown image recognition models to recommend claims decisions instantly, eliminating the need for humans to inspect damage or calculate settlements.

Building from scratch was once a good option for chatbots and virtual assistants for internal engineering or customer use, but robust open-source foundational models have since emerged as a buy option.

4. Do not expose data (buy)

There is another method to retain data privacy that is far less costly than building an LLM, and it is suitable for many AI applications: bring the models to your data. Cloudera Machine Learning (CML) excels at this. It uses AMPs (Accelerators for ML Projects) to bootstrap open-source foundational models, from Hugging Face, for example, into your private cloud for isolated training where your data lives.

Foundational models come well-trained in domains such as medical language or SQL coding, so you do not need to train from scratch. Instead, you fine-tune them privately on your medical claims or enterprise data repository code.

This bring-models-to-your-data approach may be effective for assessing legal contract vulnerabilities and proposing mitigation clauses, or for automating data pipeline coding, documentation, and optimization.

5. Extend data protections to AI software

Most of your software vendors already claim that their latest versions contain AI, and tomorrow even Jell-O might ship with AI. To protect your data, consider the following:

Review contracts and security agreements carefully. Obtain addendums to extend MSA data protection to AI.
Test AI thoroughly not only to ensure data privacy but also to gauge the effectiveness of harm mitigation mechanisms.
Develop employee awareness campaigns well in advance, particularly before implementing NLP models grafted into your conferencing, emailing, and direct messaging software.
Your organization may rightly own all the data in those apps, but many employees might not have said what they said, or said it the way they did, had they envisioned that one day AI would be reading it, summarizing it, and opining on it to wider audiences.

For more, learn the Top 5 Lessons Learned from CDAOs Successfully Implementing GenAI.

About the Author:

Shayde Christian is Chief Data and Analytics Officer at Cloudera. He guides data-driven cultural change for Cloudera to generate maximum value from data. Christian enables customers to get the absolute best from their Cloudera products such that they can generate high-value use cases for competitive advantage.

Previously a principal consultant, Christian formulated data strategy for Fortune 500 clients and designed, constructed, or turned around failing enterprise information management organizations. He enjoys laughter and is often the cause of it.