Michael Avdeev · Insights · 6 min read
Scan Your Data Before It Enters the LLM
A security engineer at a mid-size fintech pinged me a few months back. They’d fine-tuned an internal LLM on two years of customer support tickets to help agents draft responses faster.
Worked great. Until someone asked it: “What’s the account number for John Smith in Denver?”
The model answered. With a real account number. From a real customer.
The support tickets had never been scrubbed. Account numbers, SSNs, email addresses—all of it went straight into the training set. Now it’s encoded in the model weights. There’s no query to delete it. No file to remove. It’s just… in there.
They ended up retraining from scratch. Took three months.
The Problem Nobody’s Talking About
Everyone’s racing to deploy LLMs. Fine-tuning on internal data. Building RAG pipelines over document stores. Indexing years of email, contracts, and customer records to make their AI “smarter.”
But here’s what’s getting skipped: scanning that data first.
Not for format or quality. For sensitive data. PII. PHI. Credentials. The stuff that, once it’s in the model, doesn’t come out.
The training pipeline looks something like this:
1. Collect internal documents
2. Clean and chunk for embedding or fine-tuning
3. Train or index
4. Deploy
Step 1.5—“scan for sensitive data”—doesn’t exist in most workflows. And that’s where the liability enters.
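To make the missing step concrete, here's a rough sketch of where it sits. This is illustrative Python, not any particular framework's API; the regex and stub functions are placeholders for whatever loaders, chunkers, and scanners you actually use.

```python
import re

# Illustrative sketch only: the regex and the stub functions stand in for your
# real loaders, chunkers, and scanners.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def collect_documents():
    # Stand-in for pulling support tickets, contracts, exports, etc.
    return ["Customer asked about billing cycles.",
            "Per the agent's note, SSN on file is 123-45-6789."]

def scan_for_sensitive_data(docs):
    # Step 1.5: separate clean documents from flagged ones before anything
    # gets chunked, embedded, or fine-tuned on.
    clean = [d for d in docs if not SSN_RE.search(d)]
    flagged = [d for d in docs if SSN_RE.search(d)]
    return clean, flagged

def clean_and_chunk(docs, size=512):
    # Naive fixed-size chunking, purely for illustration.
    return [d[i:i + size] for d in docs for i in range(0, len(d), size)]

def train_or_index(chunks):
    print(f"Training/indexing on {len(chunks)} clean chunk(s)")

docs = collect_documents()
docs, flagged = scan_for_sensitive_data(docs)   # the step most pipelines skip
print(f"Excluded {len(flagged)} flagged document(s) pending remediation")
train_or_index(clean_and_chunk(docs))
```

The detector itself is beside the point. What matters is that the gate runs before anything gets chunked, embedded, or trained on.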
Why This Is Different from Traditional Data Governance
With traditional data storage, you find sensitive data, you delete it or encrypt it. Problem solved.
With LLMs, it’s not that simple.
Fine-Tuning Bakes Data Into Weights
When you fine-tune a model on data containing PII, that information becomes part of the model itself. It’s not stored in a database you can query. It’s distributed across billions of parameters. There’s no “SELECT * FROM model_weights WHERE contains_ssn = true” to find and remove it.
The only fix is retraining. Which means:
- Scrubbing your training data properly
- Re-running compute (expensive)
- Re-validating the model
- Re-deploying
For large models, that’s not a weekend project.
RAG Pipelines Index Everything
Retrieval-Augmented Generation (RAG) is supposed to be safer—you’re just indexing documents, not training on them. But the index still contains whatever you fed it.
If your document store has SSNs embedded in contracts, those SSNs are now retrievable. The model can surface them in responses. Your “grounded” AI just became a sensitive data search engine.
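One way to guard against that, sketched here with stand-in patterns and a plain list in place of a real embedding model and vector store, is to gate every chunk before it gets indexed:

```python
import re

# Minimal sketch: the patterns and the in-memory "index" are stand-ins for your
# real detectors, embedding model, and vector store.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PAN_RE = re.compile(r"\b(?:\d[ -]?){15,16}\b")   # crude credit-card-shaped match

def is_safe_to_index(chunk: str) -> bool:
    return not (SSN_RE.search(chunk) or PAN_RE.search(chunk))

def index_documents(chunks):
    index = []                                    # stand-in for the vector store
    for chunk in chunks:
        if is_safe_to_index(chunk):
            index.append(chunk)                   # embed + upsert in a real pipeline
        else:
            print(f"Held back from index: {chunk[:40]}...")
    return index

chunks = [
    "Exhibit B lists the contractor's SSN 987-65-4321 for tax reporting.",
    "The renewal term is 24 months with a 30-day notice period.",
]
print(f"Indexed {len(index_documents(chunks))} of {len(chunks)} chunks")
```

Anything held back goes to remediation instead of quietly disappearing. The point is that nothing reaches the retriever unscanned.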
Prompt Injection Meets Sensitive Data
Here’s where it gets worse. Prompt injection attacks can trick models into revealing training data or indexed content. If your model has seen sensitive data, an attacker might be able to extract it.
Research has shown that LLMs can be prompted to regurgitate training data verbatim, including PII that was in the original dataset. One study found that 16.9% of generated responses contained memorized PII, and 85.8% of that PII was authentic. Another demonstrated that, with the right prompts, the phone numbers of roughly 1 in 15 individuals in the training set could be extracted. If you didn’t scan before training, you’re gambling that nobody figures out how to ask the right question.
What to Scan For
Before any data enters your LLM pipeline—whether for fine-tuning, RAG indexing, or even prompt context—scan for:
Identity Data
- Names + SSNs, dates of birth
- Email addresses, phone numbers
- Account numbers, customer IDs
Financial Data
- Credit card numbers (full PANs)
- Bank account and routing numbers
- Transaction details with identifiable info
Health Data
- Medical record numbers (MRNs)
- Diagnoses, ICD-10 codes
- Treatment notes, prescriptions
Credentials
- API keys, tokens
- Passwords (even hashed ones in logs)
- Connection strings
Legal/HR Data
- Employment records
- Salary information
- Performance reviews
- Legal case details
The goal isn’t to block all internal data—it’s to catch the stuff that shouldn’t be there before it’s too late to remove.
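Most of these categories reduce to a small set of detectable patterns. Here's a deliberately minimal, illustrative set; production scanners layer validation, context rules, and ML-based entity detection on top of patterns like these:

```python
import re

# Illustrative detectors only. Real scanners pair patterns like these with
# validation, context rules, and ML-based name/entity detection.
DETECTORS = {
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone_us":    re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
    "aws_key":     re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def luhn_ok(number: str) -> bool:
    # Card-number checksum; weeds out random 15- or 16-digit strings.
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def scan_text(text: str) -> dict:
    findings = {}
    for label, pattern in DETECTORS.items():
        hits = pattern.findall(text)
        if label == "credit_card":
            hits = [h for h in hits if luhn_ok(h)]
        if hits:
            findings[label] = hits
    return findings

print(scan_text("Card 4111 1111 1111 1111, contact jane@example.com, SSN 123-45-6789"))
```

The Luhn check is the detail worth copying: it separates real card numbers from random digit runs and keeps false positives manageable.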
Where the Sensitive Data Hides
In my experience, the highest-risk sources for LLM training contamination are:
Support tickets — customers paste full credit card numbers, SSNs, account details. Support agents copy them into notes. Nobody scrubs them before archiving.
Email archives — years of internal communication, often with attachments containing PII. HR emails alone are a minefield.
Contracts and legal documents — names, addresses, SSNs in exhibits and schedules. OCR’d PDFs that nobody’s reviewed in years.
CRM exports — customer data dumps used for “analytics” that end up in training sets.
Slack/Teams exports — engineers sharing credentials, PMs pasting customer data for context, HR discussing personnel issues.
Legacy document stores — SharePoint sites from 2015, Box folders from acquisitions, NAS drives that “might have something useful.”
The common pattern: data that was never meant to be public, sitting in a format that’s easy to bulk-ingest into an LLM pipeline.
A Simple Pre-Training Workflow
Here’s what I recommend before any data touches your LLM:
1. Inventory Your Sources
What data are you planning to use? Support tickets? Internal docs? Customer records? List everything.
2. Scan Everything
Run sensitive data discovery across all source data. Flag files containing:
- Direct identifiers (SSN, credit card, MRN)
- Indirect identifiers (name + DOB, name + address)
- Credentials and secrets
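Assuming your sources have already been exported as plain text under a local directory (the directory name and the two patterns below are placeholders), a bare-bones version of this step looks like the following:

```python
import re
from pathlib import Path

# Placeholder patterns; a real scan would use a much fuller detector set
# (like the one sketched earlier) plus OCR for scanned documents.
PATTERNS = {
    "ssn":        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credential": re.compile(r"\bAKIA[0-9A-Z]{16}\b|password\s*=", re.IGNORECASE),
}

def scan_source_dir(root: str) -> list:
    findings = []
    for path in Path(root).rglob("*.txt"):           # exported tickets, docs, etc.
        text = path.read_text(errors="ignore")
        hits = {name: len(p.findall(text)) for name, p in PATTERNS.items()}
        if any(hits.values()):
            findings.append({"file": str(path), "hits": hits})
    return findings

for result in scan_source_dir("./training_data"):    # hypothetical export directory
    print(f"{result['file']}: {result['hits']}")
```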
3. Remediate or Exclude
For each flagged file:
- Redact: Replace sensitive values with tokens ([SSN_REDACTED])
- Exclude: Remove from training set entirely
- Review: Manual check for edge cases
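For the redaction option, a minimal pass with typed tokens might look like this; the patterns are illustrative, not exhaustive:

```python
import re

# Illustrative redaction: replace matched values with typed tokens so the text
# stays usable for training without carrying the raw identifiers.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN_REDACTED]"),
    (re.compile(r"\b(?:\d[ -]?){15,16}\b"), "[CARD_REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL_REDACTED]"),
]

def redact(text: str) -> str:
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

ticket = "Customer 555-12-3456 (amy@example.com) disputed a charge."
print(redact(ticket))
# Customer [SSN_REDACTED] ([EMAIL_REDACTED]) disputed a charge.
```

Typed tokens keep the structure of the text intact, so a model can still learn what a support exchange looks like without ever seeing the underlying values.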
4. Document Your Process
When regulators ask “how do you ensure PII doesn’t enter your models?”—you need an answer. Keep logs of what was scanned, what was found, and what action was taken.
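One lightweight way to keep that record, assuming your scan produces per-file findings like the sketches above (the log path and fields here are illustrative):

```python
import json
from datetime import datetime, timezone

# Illustrative audit trail: append one JSON line per scanned file so you can
# later show what was scanned, what was found, and what was done about it.
def log_scan_result(path, findings, action, log_file="scan_audit.jsonl"):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": path,
        "findings": findings,          # e.g. {"ssn": 2, "credential": 0}
        "action": action,              # "redacted", "excluded", or "reviewed"
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")

log_scan_result("exports/tickets_2023.txt", {"ssn": 2}, "redacted")
```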
5. Repeat for Updates
Your model isn’t static. Neither is your training data. Every time you add new documents or retrain, scan again.
The Regulatory Pressure Is Coming
Right now, AI governance is mostly self-regulated. That’s changing fast.
The EU AI Act Article 10 requires high-risk AI systems to demonstrate data governance. Training data must be “relevant, sufficiently representative, and to the best extent possible, free of errors.” That includes ensuring personal data is handled appropriately. The Act entered into force August 2024, with most provisions applying by August 2026.
GDPR already applies to LLMs. If your model can output a person’s data, that person has a right to request deletion. Good luck deleting someone from model weights.
In the US, the FTC has already used “algorithmic disgorgement” to force companies to delete AI models trained on improperly collected data. Cases include Everalbum (2021) and Weight Watchers (2022). As an FTC associate director stated in 2023, algorithmic disgorgement is now a “significant part” of their enforcement strategy.
The pattern is clear: regulators are going to ask “what was in your training data?” You need an answer before they ask.
Why Speed Matters Here
AI projects move fast. Timelines are measured in sprints, not quarters. Nobody wants to wait six months for a data governance platform to deploy before they can train a model.
This is exactly why we built Risk Finder as a containerized scanner. Pull the Docker image, point it at your training data, get results. No agents. No complex deployment. No data leaving your environment.
For AI/LLM workflows specifically:
- Scan document stores before indexing into RAG
- Scan exports before fine-tuning
- Scan embedding sources before vectorization
- Full OCR for scanned documents that might contain hidden PII
The scan happens in your environment. Your training data never leaves. You get a report of what’s risky before anything enters the model.
The Bottom Line
Your LLM is only as clean as your training data. Once sensitive data gets baked into model weights or indexed into RAG pipelines, you can’t just delete it.
The time to catch PII is before it enters the pipeline. Not after your model starts answering questions about real customers.
Scan before you train.