How to Discover Sensitive Data in Your Environment

A CISO at a regional hospital called me after their breach notification. Ransomware. Standard playbook—encrypt systems, exfiltrate data, demand payment.

The hard part wasn’t the ransom decision. It was answering the question regulators asked: what patient data was compromised?

They didn’t know. Not because the forensics failed. Because they’d never mapped what sensitive data existed where. The breach affected a legacy file server nobody remembered provisioning. It held 340,000 patient records exported from a system decommissioned in 2019.

Nobody knew that data was there. Until the attackers found it.

The Discovery Problem Most Organizations Ignore

Ask any security team: “Where is your sensitive data?”

You’ll get a confident answer about production databases. Maybe some mention of cloud storage policies. Perhaps a reference to the DLP tool watching endpoints.

Now ask: “What about the file shares? The legacy systems? The test environments with production data copies? The backup archives? The employee laptops with years of downloaded attachments?”

Silence.

Most organizations know where sensitive data should be. They don’t know where it actually is.

Why? Because discovery gets treated as a one-time project. Someone scanned the file shares in 2021. That inventory is fiction now. Data has been created, copied, migrated, and abandoned a thousand times since.

The organizations that get breached—and can’t answer basic questions about what was exposed—are the organizations that stopped looking.

Where Sensitive Data Actually Hides

Sensitive data doesn’t stay where you put it.

The Obvious Places (That Still Aren’t Scanned)

Production databases get attention. But when did you last scan what’s in them versus assuming schema-level classification is accurate? Databases accumulate free-text fields, notes columns, and attachment blobs that contain PII nobody mapped.

Cloud storage has policies. But S3 buckets created for “temporary” projects become permanent. Azure Blob containers proliferate faster than governance can track. Google Drive folders get shared externally before anyone reviews contents.

The Hidden Places (Where Breaches Actually Start)

Legacy file shares. Every organization has them. The Windows file server migrated three times but never cleaned. The NAS device in the closet “the accounting team uses.” The departmental share with 15 years of accumulated exports, spreadsheets, and reports—all containing customer PII nobody classified.

Test and development environments. Developers copy production data for testing. It’s faster than generating synthetic data. That “temporary” copy persists for years. Same PII, zero security controls.

Backup archives. You keep backups for disaster recovery. Those backups contain full copies of every sensitive record that existed at snapshot time. Including data you thought you deleted years ago.

Employee endpoints. Downloads folder. Desktop. Email attachments saved locally. Exports from CRM, ERP, and HR systems. Years of accumulated files on devices that might leave the building tonight.

Email archives. Every sensitive document ever attached to an email. Every customer SSN someone pasted into a message. Sitting in PST files, email archives, and backup tapes.

SaaS application exports. Someone exported the customer list from Salesforce for a mail merge in 2018. That CSV with 50,000 customer records is still on a shared drive.

The Truly Forgotten

Decommissioned systems. The server got powered off. The data wasn’t deleted—it was orphaned. Sitting on storage that’s still mounted, accessible, and unmonitored.

Acquired company assets. M&A transactions absorb entire data estates. Due diligence checked the big systems. Nobody inventoried the file shares, legacy applications, or departmental data.

Shadow IT. Dropbox accounts employees created before IT provided cloud storage. Personal Google Drives with work documents. Tools adopted by teams without security review.

The Methodology: How to Find PII in Files Systematically

Random scanning isn’t discovery. You need a systematic approach that covers your actual data estate—not just the parts you remember exist.

Step 1: Map Your Data Estate (Honestly)

Before scanning anything, inventory where data might exist. This isn’t a technical exercise—it’s an organizational one.

Interview questions:

What systems existed five years ago that don’t exist now? Where did their data go?
What file shares do departments use outside official IT systems?
Where do employees store files when working remotely?
What third-party tools have access to export customer data?
What happens to data during system migrations?

Technical discovery:

Network scans for file shares and storage devices
Cloud account enumeration (all regions, all services)
Endpoint inventory including mapped drives
Backup system catalogs

You won’t get it perfect. That’s fine. “We don’t know what’s on that legacy NAS” is better than pretending it doesn’t exist.
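
For the file-share part of the technical inventory, a first pass can be as simple as walking a mounted share and counting what is there. The sketch below is one hypothetical way to do that in Python: the function name summarize_share and the five-year staleness cutoff are illustrative assumptions, not part of any particular tool.

```python
import os
import time
from collections import Counter

def summarize_share(root, stale_after_days=5 * 365, now=None):
    """Walk `root`; return (file count per extension, sorted list of stale paths)."""
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86400
    by_extension = Counter()
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # A real inventory should record unreadable entries;
                # this sketch simply skips them.
                continue
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_extension[ext] += 1
            # Files untouched since the cutoff are candidates for "forgotten" data.
            if info.st_mtime < cutoff:
                stale.append(path)
    return by_extension, sorted(stale)
```

A share dominated by old .csv and .xlsx files is a reasonable hint that exports have been piling up there, which is useful input for the prioritization in the next step.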

Step 2: Prioritize by Risk, Not Convenience

You can’t scan everything simultaneously. Prioritize based on risk factors:

High priority:

Externally accessible storage (public S3 buckets, externally shared drives)
Systems with broad internal access (company-wide file shares)
Environments with weak access controls (legacy systems, test environments)
Recently acquired assets (M&A data estates)
Data feeding AI/ML pipelines

Medium priority:

Departmental storage with restricted access
Cloud storage with proper IAM controls
Production databases (usually better protected, but verify)

Lower priority:

Air-gapped archives with physical access controls
Encrypted backup tapes in offsite storage

Start where exposure is highest. A public S3 bucket with PII is more urgent than an encrypted tape vault.
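
One way to make this ordering repeatable is to score each storage target against the risk factors above. The following is a minimal sketch: the weights are invented for illustration (the list above ranks the factors but assigns no numbers), and StorageTarget, risk_score, and scan_order are hypothetical names.

```python
from dataclasses import dataclass

# Illustrative weights only; tune them to your own environment.
WEIGHTS = {
    "externally_accessible": 8,
    "broad_internal_access": 3,
    "weak_access_controls": 3,
    "recently_acquired": 2,
    "feeds_ai_pipeline": 2,
}

@dataclass
class StorageTarget:
    name: str
    factors: frozenset  # subset of the WEIGHTS keys

def risk_score(target):
    """Sum the weights of every risk factor present on the target."""
    return sum(WEIGHTS[factor] for factor in target.factors)

def scan_order(targets):
    """Highest score first; ties are broken by name so the order is stable."""
    return sorted(targets, key=lambda t: (-risk_score(t), t.name))
```

With these example weights, a public bucket (one factor, score 8) is scanned before a legacy test environment with weak controls and broad access (score 6), and an offsite tape vault with no factors lands last.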

Step 3: Scan Completely, Not Sampled

Here’s where most discovery efforts fail: sampling.

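Sampling means scanning a fraction of files, say 10%, and extrapolating to the rest. It’s faster. It’s cheaper. It’s also useless for security, because an inventory has to find specific files, not estimate an average.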

The PII hiding in your environment isn’t evenly distributed. It’s concentrated in specific files, folders, and systems. Sampling misses the 47,000-row customer export sitting in one spreadsheet on one share. It misses the database backup someone copied to their desktop. It misses the single document that contains your entire breach notification list.
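
The arithmetic behind that claim is short. As a rough sketch, assume each file lands in the sample independently with the same probability (a simplification; real samplers differ, but the numbers come out close for large shares). The helper name miss_probability is hypothetical.

```python
def miss_probability(sample_fraction, sensitive_files=1):
    """Chance that a uniform random sample contains none of the sensitive files.

    Assumes each file is included independently with probability
    `sample_fraction`, so all k sensitive files are missed with
    probability (1 - sample_fraction) ** k.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    if sensitive_files < 0:
        raise ValueError("sensitive_files must not be negative")
    return (1.0 - sample_fraction) ** sensitive_files
```

Under that assumption, a 10% sample misses a single critical spreadsheet 90% of the time, and still misses all of three such files about 73% of the time.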

Real discovery means:

Full file scanning — every file, not samples
Complete text extraction — content, not just metadata
Archive handling — ZIP, tar, PST, nested containers
OCR — scanned documents and images
All file types — not just Office docs

If your tool samples or skips file types, your inventory is incomplete, and a partial picture creates false confidence.
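
Archive handling is the item that most often gets skipped, so here is a minimal sketch of what "nested containers" means in practice. It covers ZIP only, using Python's standard zipfile module; tar, PST, and OCR need other libraries and are out of scope here. The function name iter_archive_members and the "!/" path separator are illustrative choices.

```python
import io
import zipfile

def iter_archive_members(data, prefix="", max_depth=5):
    """Yield (path, bytes) for every file in a ZIP, descending into nested ZIPs.

    `max_depth` bounds recursion depth only. A production scanner would also
    cap total extracted size to guard against decompression bombs.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            payload = archive.read(info)
            # Office formats such as .docx and .xlsx are ZIP containers too,
            # so this check also opens them up for content scanning.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                yield from iter_archive_members(
                    payload, prefix=path + "!/", max_depth=max_depth - 1
                )
            else:
                yield path, payload
```

Each yielded payload can then go to text extraction and classification, with the composite path recording where inside the container the file was found.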

Step 4: Classify with Context, Not Just Patterns

Finding a 9-digit number isn’t finding a Social Security Number. Pattern matching generates noise—test data, reference numbers, false positives that overwhelm security teams.

Good classification looks at:

Context — Is this SSN in a tax document or a test file?
Multiple elements together — Name + SSN + DOB means PII. SSN alone might be noise.
Document structure — Headers, forms, layouts all provide signals
Confidence scores — Prioritize high-confidence findings, review the rest

You want findings you can act on. If your tool produces 500,000 alerts, you don’t have discovery—you have noise.
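
To make the difference between pattern matching and contextual classification concrete, here is a deliberately small sketch. Everything in it is an assumption for illustration: the dashed SSN format, the keyword lists, the point values, and the 60-point threshold. Real classifiers use far richer signals, but the shape is the same: a bare match is weak evidence, and surrounding elements raise or lower confidence.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DOB_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
CONTEXT_WORDS = re.compile(
    r"\b(ssn|social security|date of birth|dob|patient|taxpayer)\b", re.IGNORECASE
)
TEST_MARKERS = re.compile(r"\b(test|sample|dummy|fixture)\b", re.IGNORECASE)

def classify(text):
    """Return (label, score) where score is a rough 0-100 confidence."""
    if not SSN_PATTERN.search(text):
        return "no_ssn_pattern", 0
    score = 30  # a bare pattern match is weak evidence on its own
    if CONTEXT_WORDS.search(text):
        score += 30  # nearby vocabulary suggests a real identifier
    if DOB_PATTERN.search(text):
        score += 30  # a second PII element appears alongside it
    if TEST_MARKERS.search(text):
        score -= 40  # the content looks like test data
    score = max(0, min(100, score))
    label = "likely_pii" if score >= 60 else "needs_review"
    return label, score
```

A record with a name, SSN, and date of birth scores high and goes to the top of the queue, while the same digits in an order reference or a test fixture fall into the review pile instead of generating an alert.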

Step 5: Maintain Continuous Visibility

Discovery isn’t a project. It’s a capability.

Data estates change daily. New files created. Old files copied. Systems migrated. Employees hired and departed. A point-in-time inventory becomes fiction within months.

What this looks like in practice:

Scheduled rescans of high-risk storage (weekly or monthly)
Change detection to catch new sensitive data fast
Scanning tied to IT workflows — migrations, acquisitions, new storage
Alerts when PII shows up somewhere unexpected
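
Change detection does not have to be elaborate. A minimal sketch, assuming a share small enough to hash in full: record a content digest per file, then rescan only what is new or different since the last snapshot. The function names snapshot and files_to_rescan are hypothetical; larger estates would more likely rely on file system change journals or storage event notifications.

```python
import hashlib
import os

def snapshot(root):
    """Map each file's path (relative to `root`) to a SHA-256 digest of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not exhaust memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(path, root)] = digest.hexdigest()
    return digests

def files_to_rescan(previous, current):
    """Paths that are new or whose content changed since the previous snapshot."""
    return sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
```

Storing each snapshot alongside the scan results keeps the weekly or monthly rescan proportional to what changed rather than to the size of the share.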

The organizations that know their data are the ones that keep looking.

When Discovery Becomes Urgent

Sometimes you can’t wait. If any of these apply, stop planning and start scanning:

Before AI Training

Data used for LLM training gets encoded into model weights, and there’s no “delete” button for PII baked in there. Data indexed for RAG can be removed later, but until it is, any query can surface it. Scan before that data touches your AI.

Before Migration

Cloud migrations, data center moves, and system upgrades are a chance to purge sensitive data you no longer need. They are also a chance to carry that liability forward unnoticed. Discover before you migrate, not after.

During M&A

Acquiring a company means acquiring their data liability. That “small” target company might have 20 years of unmanaged file shares containing customer PII you’re now responsible for.

After Breach Indicators

If you suspect compromise, discovery scope determines notification scope. The faster you know what data exists where, the faster you can answer the question regulators will ask: “What was exposed?”

Before Compliance Audits

HIPAA, PCI DSS, GDPR, CCPA—all assume you know where sensitive data exists. Walking into an audit without discovery means walking in blind.

What Discovery Should Actually Cost

Legacy approaches to sensitive data discovery—big platform deployments, per-GB scanning fees, months of professional services—priced most organizations out of comprehensive coverage.

The economics were backwards: the more data you had, the more discovery cost. Organizations with the biggest risk exposure paid the most. So they scanned less.

That’s fixable now. Flat-fee pricing. No per-GB charges. Deploy a Docker container in hours instead of a six-month implementation. Scan 1,100+ file types with full text extraction, no sampling.

When scanning is cheap, you scan everything. When it’s expensive, you sample and hope.

The Question You Need to Answer

If your organization were breached tomorrow, could you answer:

What sensitive data existed in the compromised systems?
How many individuals need to be notified?
What regulatory reporting obligations are triggered?

If the answer is “we’d need to figure that out,” you have a problem.

The companies that come through breaches with the least regulatory and legal damage can answer those questions immediately. They already knew what data existed where.

Discovery isn’t prevention. Breaches will happen. But the difference between a manageable incident and a disaster often comes down to one thing: knowing what you have before someone else finds it.
