Opinion & Analysis

What We Learned Deploying GenAI for Compliance at Scale: 4 Lessons and 1 Big Misconception

Written by: Shesh Narayan Gupta | Senior Manager of Data Science at Capital One

Updated 3:27 PM EDT, June 3, 2026

The first time our automated AI system flagged a customer notice as non-compliant, everyone got excited. We had been grinding on this for months. Then it flagged the same notice as compliant two runs later. Nobody said anything for a second. That moment probably taught me more about deploying AI in financial services than anything else.

Regulatory compliance in financial services is document-heavy, time-intensive, and unforgiving. For years, the answer was headcount: more compliance advisors, more reviewers, more manual checks. Then generative AI (GenAI) arrived, and the pressure to adopt it came fast, often from the top.

Having led a team doing exactly this work at a major US financial institution, I want to offer something more useful than the usual enthusiasm: an honest account of what actually happens when you deploy GenAI for compliance in production.

The short version: it works, it is worth doing, and it will take longer and cost more than your first estimate.

The tool we used and why

Our team used Meta’s Llama with Scout capabilities, leveraging its combined text and vision features to process financial documents. The vision capability was particularly important for our use case because compliance documents in financial services are not always clean, structured text files.

They are PDF files with variable formatting, scanned notices, letters with embedded tables, and documents that look very different from one another, even when they serve the same regulatory purpose.

Llama Scout’s ability to handle both text extraction and visual document understanding in a single model simplified our architecture considerably. We did not need separate pipelines for typed versus scanned content, which matters when you are operating at enterprise scale with thousands of documents flowing through a system.

There was also a practical constraint: it was the only model approved by our enterprise security and compliance teams for generative AI use.

That said, choosing the right model is only the beginning of the problem.

How we actually built it

Since people always ask: no, we did not use RAG. I know that is the default answer for anything LLM plus documents right now, but our problem did not call for it. The regulatory requirements we were checking against were not a large, evolving corpus that needed dynamic retrieval.

They were specific, known, well-defined attributes. Instead of retrieving context at inference time, we embedded the relevant regulatory criteria directly into each prompt. We took a different approach.

We broke each regulatory requirement down into a discrete, testable question. Something like: Does this adverse action letter contain a FICO credit score?

That single question maps to a specific regulatory requirement. We wrote a prompt for it. Then another question for the next requirement. Then another. Each prompt was scoped to one clear intent rather than asking the model to evaluate compliance broadly. Broad questions get broad answers. In compliance, that is useless.

Figure 1: Document processing pipeline architecture

This approach meant more prompts to manage but much more reliable outputs. The model is not being asked to reason across ten requirements at once. It is being asked one specific thing and expected to answer it with evidence. That structure turned out to matter a lot for the evaluation side too.

On the document side, we converted everything to images first, including PDFs. Non-image documents were converted, then compressed to keep file sizes manageable without losing the detail the model needed to read. Each page was treated as a separate image. We extracted text from each page individually and then combined the full page sequence to give the model complete document context before asking it anything.

That pipeline handled a lot of the variability problem because we stopped fighting document formats and just converted everything to the same thing.

For evaluation, we measured model output against what a human reviewer would produce, including their justification and the specific evidence they cited. Not just whether the answer was right or wrong but whether the reasoning matched.

A model that gets the right answer for the wrong reason is a problem waiting to happen in a regulated environment. We caught several of those early and it changed how we thought about prompt design going forward.

Lesson 1: Context matters more than the model

Here is what nobody tells you early enough: context determines everything. A model that does not understand the regulatory framework it is working in will give you answers that sound completely reasonable and are completely wrong.

In practice, that means doing a lot of prompt engineering and building your context layer before you touch production code. Your model needs to know not just what the document says but also what it should say, under which state law, under which version of that law. That is a lot of context to encode into prompts, and building it is painstaking work.

We spent significantly more time on the context layer than on the model itself. There is a natural instinct to treat model selection as the hard problem. It is not. The hard problem is ensuring the model has sufficient understanding of your specific regulatory landscape, and that requires compliance experts in the room, not just engineers.

Below is an example of a typical prompt structure our team used:

Figure 2: Example prompt structure for single-rule compliance check

Lesson 2: Document variability will break your first pipeline

In a controlled demo, your GenAI system will perform beautifully. Feed it a clean and well-formatted document, and it will extract the relevant clauses accurately. Then you move to production.

Production documents come from dozens of sources, in dozens of formats. Some are scanned at an angle. Some have handwritten annotations. Some have tables nested inside tables. A model that handles 95% of documents correctly in testing will encounter the difficult 5% continuously in production, because volume is high and edge cases accumulate fast.

We learned this the hard way and then spent significant time rebuilding our pre-processing pipeline to handle what the real world actually sends you.

The practical solution is building better pre-processing into your pipeline and investing in evaluation infrastructure early. You need to know, for every document that flows through your system, whether the model’s output was correct. Which brings us to the next lesson.

Lesson 3: You need ground truth before you need a model

Nobody talks enough about evaluation. How do you actually know if the model is right? In most applications, that question is annoying but manageable. In compliance, it matters legally. You cannot just eyeball a sample and move on.

We had to build ground truth before we could run real experiments. A set of documents with known correct answers, verified by actual compliance professionals. It took time. It was not the kind of work that makes anyone’s highlight reel. But without it, we had no way to tell whether the model was getting better or just getting more confident while being wrong. Those two things can look identical from the outside.

Budget for this. It is not optional. The work that does not feel like AI work is where most of your time actually goes: building ground truth, securing data approvals, writing evaluation scripts. Not the model. The infrastructure around the model.

Lesson 4: Data security approvals will take longer than you expect

This one catches nearly every team off guard. In a regulated financial institution, you cannot simply connect a model to your production data and start experimenting. Every data access request goes through a formal approval process.

In our case, that meant Privacy and Security Assessment (PSA) approvals before we could work with any data that touched real customer or regulatory information. The approval process was frustrating. We had engineers ready to go and were waiting weeks. But these controls exist because financial data is sensitive in ways that matter to real people.

When I stopped treating the approvals as obstacles and started treating them as part of the job, the whole process got easier. Your timeline needs to include them. Not as a risk item. As a given.

We worked with synthetic and anonymized data during early development and built our roadmap around the reality that production data access takes time.

If you are leading an enterprise AI initiative in a regulated industry and your timeline does not account for data security approval cycles, your timeline is wrong.

The biggest misconception: GenAI can do everything

The biggest mistake I see is people expecting full autonomy. The pitch sounds clean: feed the model the state laws, connect it to your document systems, and let it tell you what is compliant.

Done. No more compliance reviewers sitting in a room reading notices all day. That vision is appealing, but it does not reflect how these systems actually perform. Reality is messier.

Take a simple case: A customer notice has certain wording. A state law requires disclosure in a specific way. A compliance expert with ten years of experience reads both and makes a call, drawing on case history, regulatory context, and professional judgment built up over years. That judgment is not something you can prompt your way to.

What the model can do is flag things, surface relevant regulatory text, cut down the volume of documents a human has to read cover-to-cover. That is genuinely valuable and nothing to dismiss. But making the final call on an ambiguous case? Not there yet. Maybe not ever for certain categories of decision.

Teams that go in expecting full autonomy either end up with a system that makes wrong calls nobody catches, or they add so many guardrails that the whole thing slows down to roughly the speed of the manual process it was supposed to replace. Neither is good.

What actually changes when you get this right

When it works, the impact is tangible and measurable. Reviewers stop reading every document and start focusing on flagged cases. Turnaround drops. Consistency improves significantly, which in a regulated environment may be the most important outcome of all.

The business case is there. I have seen it.

The organizations that get the most out of this are not the ones that ship fastest. They are the ones that invest in the foundational work first and then ship something that actually holds up.

Move carefully. It is faster in the long run.

About the Author:

Shesh Narayan Gupta is Senior Manager of Data Science at Capital One, where they lead ACOD (Automated Compliance and Data Oversight), a team applying Generative AI and advanced analytics to regulatory compliance in financial services. They are a published author of two Apress books on Generative AI and a researcher whose work has appeared on arxiv and is under review at Springer Nature.