Michael Avdeev · Insights · 4 min read

77% of Employees Paste Company Data into AI Chatbots. Does Anyone Know What’s in Your Training Set?

I was on a call last week with a CISO at a mid-sized financial services firm. They’re building an internal LLM for customer service automation. Smart move. I asked a simple question: “Do you know if there’s PII in your training data?”

Silence.

Not “yes” or “no.” Just… nothing. They hadn’t looked. Nobody had.

This isn’t an outlier. A recent study found 77% of employees using AI have pasted company data into chatbots—and 22% of those cases included confidential information. Stanford’s 2025 AI Index documented 233 AI-related incidents last year. That’s a 56% jump from the year before.

The governance gap is real. And it’s getting wider.


The Problem Nobody Wants to Talk About

Here’s the thing about training an LLM on sensitive data: that model becomes an exposure vector for everything it learned.

Not theoretical. Happening now.

IBM’s 2025 Cost of a Data Breach Report found that 13% of organizations experienced breaches involving AI models or applications. And 40% of data security incidents now occur within AI applications themselves.

The risks stack up fast:

  • Training data leakage — Models memorize things. Sometimes they spit them back out.
  • Prompt injection — Attackers craft queries to extract what the model “knows.”
  • RAG pipeline exposure — You indexed a customer database export by accident. Now your chatbot surfaces credit card numbers to anyone who asks nicely.
  • Shadow AI — Employees using unauthorized tools to process data you never approved.

That last one? It’s everywhere. A Big 4 consultant I know describes it as “information is power” turning into “I need to keep everything.” People hoard data in AI tools because it feels useful. Nobody’s checking what’s in it.


The GDPR Problem That’ll Ruin Someone’s Year

Picture this: A customer invokes their GDPR “right to be forgotten.” Your company trained a model on their data six months ago.

Now what?

If that customer’s data influenced your model, you might need to retrain the whole thing without their data. That’s not an inconvenience. That’s a compliance catastrophe.

The EU AI Act introduces mandatory data governance requirements for high-risk AI systems. Proving you scanned your training data for PII before ingestion isn’t optional anymore. It’s becoming law.

A friend in legal told me last month: “Everyone’s worried about AI bias. Nobody’s asking where the training data came from.” She’s right. Operationalizing AI governance is lagging way behind the pace of deployment.


High Concern, Low Visibility

A study of 300 tech leaders found over three-quarters rated AI governance “extremely important.” Top concerns:

  • System integration risks
  • Data security vulnerabilities
  • Managing LLM costs
  • Regulatory compliance

But here’s the disconnect: these same organizations often have zero visibility into what sensitive data exists in their training datasets, fine-tuning corpora, or RAG document stores.

High concern. Low visibility. That gap? That’s where breaches happen.


What Actually Works

I’ve seen this done right exactly twice. Both times, the approach was the same:

1. Scan Before You Train

Before any dataset enters an LLM pipeline, run it through classification. Every time. Not sampling—full scans. You’re looking for:

  • PII (SSNs, names, addresses)
  • PHI (medical records, diagnoses)
  • Financial data (credit cards, bank accounts)
  • Credentials (API keys, passwords, tokens)

One scan. Before ingestion. Not after the model ships.
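Here’s a minimal sketch of what that pre-ingestion gate might look like. The regex patterns and function names (scan_record, gate_dataset) are illustrative assumptions, not any particular product’s API; a real classifier catches far more than a handful of regexes can.

```python
import re

# Illustrative patterns only -- a real classifier catches far more than regex can.
PATTERNS = {
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "api_key":     re.compile(r"\b(?:sk_|AKIA)[A-Za-z0-9_\-]{16,}\b"),
}

def scan_record(record_id: str, text: str) -> list[dict]:
    """Return one finding per pattern hit so nothing ships unreviewed."""
    return [
        {"record": record_id, "type": label, "offset": m.start()}
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]

def gate_dataset(records: dict[str, str]) -> list[dict]:
    """Scan every record before ingestion and return all findings."""
    return [f for rid, text in records.items() for f in scan_record(rid, text)]

# Usage: run this before the dataset ever reaches the training pipeline.
findings = gate_dataset({"ticket-001": "Customer SSN is 123-45-6789, update billing."})
if findings:
    print(f"{len(findings)} sensitive finding(s) -- block ingestion and remediate first.")
```

The point isn’t the patterns, it’s the gate: nothing enters the pipeline until the findings list is empty.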

2. Audit Your RAG Pipelines

If you’re building retrieval-augmented generation, your document stores become part of your AI’s memory. That memory needs to be clean.

I talked to a team that indexed three years of customer support tickets into their RAG system. Guess what was in those tickets? SSNs. Lots of them. They found out when a beta tester asked the chatbot for “examples of customer data” and got exactly that.
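Auditing before indexing can be as simple as splitting the corpus into a clean pile and a flagged pile. The sketch below makes the same assumptions as the one above: the pattern is illustrative, and audit_before_indexing is a hypothetical helper, not a specific vector-store API.

```python
import re

# Same idea as the pre-training gate: illustrative patterns, swap in real classifiers.
SENSITIVE = re.compile(
    r"\b\d{3}-\d{2}-\d{4}\b"        # SSN-shaped strings
    r"|\b(?:\d[ -]?){13,16}\b"      # card-number-shaped strings
)

def audit_before_indexing(documents: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split documents into clean (safe to embed) and flagged (redact first)."""
    clean, flagged = [], []
    for doc in documents:
        (flagged if SENSITIVE.search(doc["text"]) else clean).append(doc)
    return clean, flagged

# Only the clean pile goes into the vector store; the flagged pile gets redacted first.
tickets = [
    {"id": "T-1001", "text": "Password reset requested, resolved same day."},
    {"id": "T-1002", "text": "Customer read out SSN 987-65-4321 to verify identity."},
]
clean, flagged = audit_before_indexing(tickets)
print(f"{len(clean)} documents indexed, {len(flagged)} held back for redaction")
```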

3. Keep Scanning

Data doesn’t stay static. New documents hit knowledge bases daily. Employees upload files to shared drives that feed AI systems. Continuous scanning catches drift before it becomes exposure.
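A scheduled job that only rescans what changed keeps this cheap. The sketch below assumes a local folder feeding your knowledge base and a small state file tracking the last run; both names are made up for illustration.

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("scan_state.json")   # remembers when the last scan ran

def files_changed_since_last_scan(root: Path) -> list[Path]:
    """Return files modified since the previous run -- the drift worth rescanning."""
    last_run = 0.0
    if STATE_FILE.exists():
        last_run = json.loads(STATE_FILE.read_text())["last_run"]
    changed = [p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime > last_run]
    STATE_FILE.write_text(json.dumps({"last_run": time.time()}))
    return changed

# Run on a schedule (cron, CI, etc.): only new or edited files get rescanned,
# so drift gets caught daily instead of discovered after the model ships.
for path in files_changed_since_last_scan(Path("./knowledge_base")):
    print(f"rescanning {path}")   # hand each changed file to your classifier here
```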


The Cost of Getting This Wrong

Stanford’s research shows AI incidents are accelerating. Not stabilizing. Accelerating.

Consider:

  • A healthcare AI trained on unredacted patient records
  • A financial services model that memorized account numbers
  • A legal AI that indexed privileged communications

These aren’t hypotheticals. They’re the inevitable result of shipping AI without data discovery.


Stop Guessing. Start Scanning.

We built Risk Finder for exactly this. Before datasets enter your AI pipeline:

  • 150+ classifiers catch PII, PHI, PCI, and credentials in a single pass
  • Flat-rate pricing means scanning large training sets doesn’t blow your budget
  • Local processing keeps data in your environment—no cloud dependencies
  • JSON output plugs into pipelines for automated remediation
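Here’s roughly how that last piece could work in CI. To be clear, the JSON shape below is a made-up example, not Risk Finder’s actual output schema; the point is that structured findings let a pipeline step block ingestion automatically.

```python
import json
import sys

# Made-up report shape for illustration -- not Risk Finder's actual schema.
SAMPLE_REPORT = """[
  {"type": "ssn",   "file": "tickets/2023-q1.csv", "line": 4182},
  {"type": "email", "file": "tickets/2023-q1.csv", "line": 77}
]"""

def gate_on_findings(report_json: str, blocking_types: set[str]) -> int:
    """Return a non-zero exit code if any blocking finding is present."""
    findings = json.loads(report_json)
    blockers = [f for f in findings if f["type"] in blocking_types]
    for f in blockers:
        print(f"BLOCKED: {f['type']} at {f['file']}:{f['line']}")
    return 1 if blockers else 0

# In CI, a non-zero exit fails the step and halts ingestion automatically.
sys.exit(gate_on_findings(SAMPLE_REPORT, blocking_types={"ssn", "credit_card", "api_key"}))
```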

By next year, AI governance will be a top security priority for most organizations. The question is whether you’ll be ahead of it—or explaining to regulators why PII ended up in your production model.


Start a free trial | See Risk Finder | Try the free scanner


Your employees are pasting data into AI tools right now. Do you know what’s in there? Scan before someone else finds out.

Organizations face an ever-evolving landscape of cyber threats and regulatory scrutiny. The global average cost of a data breach in 2024 is $4.88M, IBM highlights in the 2024 Cost of Data Breach. Effective and accurate data classification has emerged as a critical strategy for enterprises to manage risks, enhance security posture, and build resilience.