How to Detect PHI in Scanned Documents and PDFs

Somewhere in your organization, there’s a folder full of scanned intake forms. Patient names, Social Security Numbers, insurance IDs, medical history — all captured as images inside PDFs.

Your DLP tool has scanned that folder a hundred times. It found nothing.

That’s because traditional DLP can’t read images. It looks at the text layer of a document. Scanned PDFs usually have no text layer unless OCR was applied when they were created; they’re pictures of text. To your DLP, a scanned medical record looks exactly like a photo of a sunset: just pixels.

This is how PHI hides in plain sight.

The Scanned Document Problem

Healthcare runs on paper. Or at least, it used to. The transition to electronic records created a massive backlog of scanned documents:

Patient intake forms — handwritten or typed, then scanned
Insurance cards and IDs — photocopied at check-in
Faxed records — still the standard for inter-provider communication
Legacy medical records — digitized but never OCR’d
Explanation of Benefits (EOBs) — scanned for billing disputes
Lab results and imaging reports — often arrive as PDF attachments

Every one of these contains PHI. Every one of them is invisible to traditional scanning tools.

A 2024 study found that over 60% of healthcare organizations have PHI in scanned documents that has never been classified or inventoried. It’s not that they don’t care — it’s that their tools literally can’t see it.

Why Traditional DLP Misses Scanned Content

Traditional Data Loss Prevention tools work by pattern matching against text. They look for strings that match known formats:

Social Security Numbers: XXX-XX-XXXX
Credit card numbers: XXXX-XXXX-XXXX-XXXX (Visa numbers, for example, start with 4)
Medical Record Numbers: varies by system

This works great for text files, Word documents, Excel spreadsheets, and PDFs with embedded text. It fails completely for:

Image-Based PDFs

When you scan a paper document, you get a picture. The resulting PDF contains that picture, not searchable text. Open a scanned PDF and try to select the text — you can’t. Neither can your DLP.

Embedded Images

A Word document might have a scanned form pasted in as an image. The document itself is searchable, but the image within it isn’t. DLP scans the text, reports “clean,” and moves on — while PHI sits in the embedded image.

Fax-to-Email

Fax-to-email services deliver incoming faxes as PDF or TIFF attachments. These are image files. Every faxed prescription, referral, or lab result is invisible to text-based scanning.

Screenshots and Photos

Someone photographed a patient chart for a quick consult. That photo is now in an email attachment, a Slack message, or a shared drive. It’s a JPEG. DLP doesn’t read JPEGs.

The result: your “compliant” environment has blind spots everywhere scanned content exists.
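A minimal sketch of that blind spot, assuming a single illustrative SSN pattern (real DLP products combine many rules and validation checks). The name find_ssn_like and both sample inputs are made up for the demonstration; the second input only stands in for the pixel data of a scan.

```python
import re

# Illustrative pattern only: nine digits grouped 3-2-4 with dashes.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(data: bytes) -> list[str]:
    """Return every SSN-shaped string found in the raw bytes of a file."""
    return [m.decode("ascii") for m in SSN_PATTERN.findall(data)]


# A text file stores the characters themselves, so the pattern matches.
text_file = b"Patient: John Smith\nSSN: 555-12-3456\n"

# An image stores pixel values. These bytes stand in for a scan of the same
# form: a human can read the SSN on screen, but it is not in the byte stream.
fake_scan = bytes([0x89, 0x50, 0x4E, 0x47]) + bytes(range(256)) * 4

print(find_ssn_like(text_file))  # ['555-12-3456']
print(find_ssn_like(fake_scan))  # []
```

The second call returns nothing even though the "document" shows the same number to a person, which is the gap OCR has to close.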

OCR + Classification: How Modern Tools Work

The solution is Optical Character Recognition (OCR) combined with sensitive data classification. Here’s how it works:

Step 1: Identify Image Content

The scanner detects files that contain images — not just image files (JPEG, PNG, TIFF) but also documents with embedded images (PDFs, Word docs, PowerPoint).

Step 2: Extract Images

For compound documents like PDFs, the scanner extracts each embedded image for separate processing. A 50-page scanned PDF might contain 50 separate images.

Step 3: Run OCR

OCR converts the image to text. Modern OCR handles:

Typed text (high accuracy)
Handwritten text (moderate accuracy, improving with ML)
Rotated or skewed documents
Poor scan quality
Multiple languages

Step 4: Classify the Extracted Text

Once OCR produces text, standard classification runs against it. The same 250+ classifiers that find SSNs in spreadsheets now find them in scanned intake forms.

Step 5: Report with Context

Results show the original file, the page or image where PHI was found, and the specific data types detected. You know exactly which scanned document needs attention.

The key insight: OCR transforms image content into text that classifiers can read. Without OCR, scanned documents are invisible. With it, they’re just another file type.
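Steps 3 through 5 can be sketched as follows. This is an illustration under stated assumptions, not how any particular product is implemented: the OCR engine is passed in as a plain function (a real deployment would call an engine such as Tesseract there), fake_ocr is a stub with canned text, and the two classifier patterns are simplified placeholders.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Simplified, hypothetical classifiers; production rule sets are far larger.
CLASSIFIERS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOB": re.compile(r"\bDOB:?\s*\d{2}/\d{2}/\d{4}\b"),
}


@dataclass
class Finding:
    file: str
    page: int
    data_type: str
    count: int


def scan_pages(file: str, page_images: list[bytes],
               ocr: Callable[[bytes], str]) -> list[Finding]:
    """OCR each page image, classify the text, and report hits per page."""
    findings = []
    for page_no, image in enumerate(page_images, start=1):
        text = ocr(image)                                # Step 3: run OCR
        for data_type, pattern in CLASSIFIERS.items():   # Step 4: classify
            hits = pattern.findall(text)
            if hits:                                     # Step 5: keep context
                findings.append(Finding(file, page_no, data_type, len(hits)))
    return findings


def fake_ocr(image: bytes) -> str:
    """Stub OCR engine: returns canned text for two placeholder 'images'."""
    canned = {
        b"<page-1-pixels>": "Clinic hours: 9 to 5",
        b"<page-2-pixels>": "Name: John Smith SSN 555-12-3456 DOB: 01/02/1980",
    }
    return canned.get(image, "")


results = scan_pages("intake-forms.pdf",
                     [b"<page-1-pixels>", b"<page-2-pixels>"], fake_ocr)
for finding in results:
    print(finding)
```

Passing the OCR step in as a function keeps the classification logic identical for native text and scanned pages, which is the point of the section above.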

Why Pattern Matching Isn’t Enough: EDM and HDM for PHI

Here’s the problem with pattern-based classification: it guesses.

A regex that matches XXX-XX-XXXX will flag every string that looks like a Social Security Number. Some are real SSNs. Some are other nine-digit identifiers, such as account numbers or product codes, that happen to share the format. Your team spends hours triaging false positives instead of remediating real risk.

In healthcare, you can’t afford that noise. When you’re determining breach scope or proving HIPAA compliance, you need certainty — not probability scores.

Exact Data Matching (EDM)

EDM compares discovered data against a known dataset. Instead of asking “does this look like an SSN?” it asks “is this SSN in our patient database?”

How it works:

Load your known PHI — patient names, MRNs, SSNs, DOBs from your EHR or claims system
Scan your environment — Risk Finder extracts text from documents (including OCR for scanned content)
Match against known records — when a scanned intake form contains “John Smith, 555-12-3456,” EDM confirms that’s an actual patient, not a random string

The result: matches confirmed against real patient records, not just “high confidence” pattern hits.
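The three steps above can be sketched like this. It is a simplified illustration, not Risk Finder's implementation: the function names, the sample records, and the single SSN field are assumptions, and a real EDM index would usually cover several fields (name, MRN, DOB) together.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def normalize(ssn: str) -> str:
    """Strip everything but digits so '555-12-3456' equals '555123456'."""
    return re.sub(r"\D", "", ssn)


def exact_data_match(text: str, known_ssns: set[str]) -> tuple[list[str], list[str]]:
    """Split pattern hits into confirmed (in the known set) and unconfirmed."""
    known = {normalize(s) for s in known_ssns}
    confirmed, unconfirmed = [], []
    for candidate in SSN_PATTERN.findall(text):
        if normalize(candidate) in known:
            confirmed.append(candidate)
        else:
            unconfirmed.append(candidate)
    return confirmed, unconfirmed


# Hypothetical export from an EHR or claims system.
known_patients = {"555-12-3456", "123456789"}

# Text as it might come out of OCR on a scanned intake form.
ocr_text = "John Smith, 555-12-3456. Part no. 900-11-2222 on back order."

confirmed, unconfirmed = exact_data_match(ocr_text, known_patients)
print(confirmed)    # ['555-12-3456']
print(unconfirmed)  # ['900-11-2222']
```

The pattern still does the discovery; the known dataset decides which hits count as confirmed. A value that OCR misreads will land in neither list, so match quality still depends on scan quality.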

Hashed Data Matching (HDM)

Some organizations can’t load raw patient data into a scanning tool — even one that runs locally. HDM solves this:

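Hash your sensitive data: SSNs become one-way hashes like a3f2b8c1... Because only about a billion nine-digit SSNs exist, a keyed or salted hash is needed to keep the values from being recovered by brute force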
Load only the hashes — no plaintext PHI in the matching database
Compare hash-to-hash — discovered SSNs are hashed on-the-fly and compared

You get the same exact-match confirmation without exposing the original patient data during scanning.
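A sketch of the hash-to-hash comparison, assuming a keyed hash (HMAC-SHA256) as the fingerprint. The source does not say which hash function Risk Finder uses, so the algorithm, the function names, and the demo key are all assumptions made for illustration.

```python
import hashlib
import hmac
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Demo value only; a real key would live in a secret store, not in code.
DEMO_KEY = b"example-key-keep-real-keys-in-a-secret-store"


def fingerprint(value: str, key: bytes) -> str:
    """Keyed one-way hash of the digits in an identifier."""
    digits = re.sub(r"\D", "", value)
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


def build_hash_index(known_values: set[str], key: bytes) -> set[str]:
    """Done once, where the patient data lives; only hashes leave that system."""
    return {fingerprint(v, key) for v in known_values}


def hashed_data_match(text: str, index: set[str], key: bytes) -> list[str]:
    """Hash each pattern hit on the fly and keep those present in the index."""
    return [c for c in SSN_PATTERN.findall(text) if fingerprint(c, key) in index]


index = build_hash_index({"555-12-3456"}, DEMO_KEY)
matches = hashed_data_match("SSN 555-12-3456, ref 900-11-2222", index, DEMO_KEY)
print(matches)  # ['555-12-3456']
```

The keyed hash is a deliberate choice in this sketch: a plain SHA-256 of an SSN can be reversed by hashing every possible nine-digit value, while an HMAC cannot be recomputed without the key.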

Why This Matters for Healthcare

Approach	Accuracy	False Positives	Use Case
Pattern matching	~80-90%	High	Initial discovery, broad scans
EDM	Exact match to known records	Near zero	Breach scope, compliance proof
HDM	Exact match to known hashes	Near zero	High-security environments

For breach notification, you need to tell HHS exactly whose PHI was exposed. “We found 1,247 possible SSNs” doesn’t cut it. “We confirmed 892 patient records were in the compromised folder” does.

For HIPAA audits, showing that you matched against known patient data demonstrates a higher standard of due diligence than pattern-only scanning.

Risk Finder supports both EDM and HDM: load your patient database (or hashes) and turn “possible PHI” into “confirmed PHI.” Combined with OCR, that confirmation extends to scanned documents, with the caveat that a value the OCR step misreads cannot be matched.

File Types That Contain Hidden PHI

Not all PHI hides in obvious places. Here’s where scanned content accumulates:

High-Risk File Types

File Type	Why PHI Hides Here
Scanned PDFs	No text layer — pure image content
TIFF files	Standard fax output format
JPEG/PNG images	Photos of documents, ID cards, charts
Multi-page PDFs	Mixed content — some pages OCR’d, some not
Email attachments	Faxes forwarded as attachments

Unexpected Locations

Location	What’s There
Email archives (.PST)	Years of faxed records as attachments
SharePoint document libraries	Scanned contracts with SSNs
Shared drives	Legacy “scanned documents” folders
Cloud storage	Mobile uploads of patient forms
Backup archives	Historical scans never classified

The Nested Problem

PHI doesn’t just hide in images — it hides in images inside archives inside folders:

/shared/old-records/2019-archive.zip
  └── patient-files/
       └── intake-forms.pdf (50 pages, all scanned images)
            └── Page 23: SSN, DOB, diagnosis codes

A scanner that doesn’t extract archives, process PDFs, run OCR, and classify the result will never find that SSN on page 23.
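The archive-unpacking part of that chain can be sketched with the standard library alone. This is an illustration limited to ZIP files; the helper names, the "!/" path separator, and the depth limit are assumptions, and the PDF bytes here are a placeholder rather than a real scanned document.

```python
import io
import zipfile
from typing import Iterator


def walk_archive(data: bytes, prefix: str = "",
                 max_depth: int = 5) -> Iterator[tuple[str, bytes]]:
    """Yield (path, content) for each file in a ZIP, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            content = archive.read(info)
            # Recurse into nested archives, but cap the depth as a guard
            # against maliciously deep nesting.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(content)):
                yield from walk_archive(content, path + "!/", max_depth - 1)
            else:
                yield path, content


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP so the example needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"patient-files/intake-forms.pdf": b"%PDF-1.4 (scanned pages)"})
outer = make_zip({
    "2019-archive.zip": inner,
    "readme.txt": b"old records index, nothing sensitive",
})

paths = [path for path, _ in walk_archive(outer)]
for path in paths:
    print(path)
# 2019-archive.zip!/patient-files/intake-forms.pdf
# readme.txt
```

Each file yielded this way would then go through image extraction, OCR, and classification; skipping any one of those stages leaves the SSN on page 23 undiscovered.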

Step-by-Step: Scanning PDFs for PHI with Risk Finder

Here’s how to find PHI in scanned documents using Risk Finder:

1. Deploy the Scanner

Pull the Docker image and run it against your target location:

docker pull inspectdatainc/risk-finder
docker run -v /path/to/documents:/scan inspectdatainc/risk-finder

Risk Finder runs locally — your documents never leave your environment.

2. OCR Runs Automatically

Risk Finder includes built-in OCR. When it encounters a scanned PDF or image file, it automatically:

Detects image-based content
Extracts embedded images from PDFs
Runs OCR to convert to text
Classifies the extracted text

No configuration required. No separate OCR license.

3. Review Results

The scan report shows:

File path: Exact location of the document
Page number: Which page contains PHI (for multi-page PDFs)
Data types found: SSN, MRN, DOB, diagnosis codes, etc.
Match count: How many instances per document
Risk level: Severity based on data type combination

4. Prioritize Remediation

Focus on high-risk findings first:

SSN + Medical data = HIPAA breach risk
Open share + PHI = Immediate exposure
High volume = Systematic process problem
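One way to turn those three rules into a sortable score is sketched below. The weights, the 100-match threshold, the type labels, and the FileFinding fields are all hypothetical; they are not Risk Finder's scoring model, only an example of ranking findings by the combinations listed above.

```python
from dataclasses import dataclass

# Hypothetical labels for data types that count as medical information.
MEDICAL_TYPES = {"MRN", "DIAGNOSIS_CODE"}


@dataclass
class FileFinding:
    path: str
    data_types: set[str]
    match_count: int
    open_share: bool = False


def risk_score(finding: FileFinding) -> int:
    """Higher score means remediate sooner. Weights are illustrative only."""
    score = 0
    if "SSN" in finding.data_types and finding.data_types & MEDICAL_TYPES:
        score += 3  # SSN + medical data
    if finding.open_share and finding.data_types:
        score += 3  # PHI on an openly shared location
    if finding.match_count >= 100:
        score += 1  # high volume suggests a process problem
    return score


findings = [
    FileFinding("/billing/eob-batch.tiff", {"SSN"}, 400),
    FileFinding("/hr/badge-photo.jpg", set(), 0, open_share=True),
    FileFinding("/shared/old-records/intake-forms.pdf",
                {"SSN", "DIAGNOSIS_CODE"}, 50, open_share=True),
]

ranked = sorted(findings, key=risk_score, reverse=True)
for finding in ranked:
    print(risk_score(finding), finding.path)
# 6 /shared/old-records/intake-forms.pdf
# 1 /billing/eob-batch.tiff
# 0 /hr/badge-photo.jpg
```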

5. Document for Compliance

Export results as PDF or JSON for your compliance records. You now have evidence that you’ve scanned for PHI — including in scanned documents that other tools miss.

HIPAA Compliance Requirements for Document Scanning

HIPAA doesn’t explicitly mention OCR. But it does require you to know where PHI exists.

The Security Rule Requirements

§ 164.308(a)(1)(ii)(A) — Risk Analysis:

Conduct an accurate and thorough assessment of the potential risks and vulnerabilities to the confidentiality, integrity, and availability of electronic protected health information held by the covered entity or business associate.

You can’t assess risk to PHI you don’t know exists. If your scanned documents contain PHI and you’ve never inventoried them, your risk analysis is incomplete.

§ 164.310(d)(1) — Device and Media Controls:

Implement policies and procedures that govern the receipt and removal of hardware and electronic media that contain electronic protected health information into and out of a facility, and the movement of these items within the facility.

“Electronic media” includes storage containing scanned documents. If you don’t know which drives contain PHI in scanned form, you can’t properly control them.

What Auditors Look For

During a HIPAA audit or an investigation by the HHS Office for Civil Rights (which, confusingly, is also abbreviated OCR), auditors ask:

Where is your PHI inventory? — You need to show you know where PHI exists
What scanning methodology did you use? — “We ran DLP” isn’t sufficient if DLP can’t read images
How do you handle scanned documents? — Auditors know this is a gap
What’s your remediation process? — Finding PHI is step one; controlling it is step two

Organizations that can demonstrate OCR-based scanning show a higher level of due diligence than those relying on text-only tools.

Breach Notification Implications

When a breach occurs, you need to determine what was exposed. If the compromised system contained scanned documents, you need to know:

Did those documents contain PHI?
Whose PHI was in them?
What data elements were exposed?

Without OCR-based classification, you’re guessing. With it, you have an inventory that tells you exactly what was at risk.

Every healthcare organization has scanned documents. Most have never classified them for PHI.

Traditional DLP tools give you a false sense of security. They scan your environment, report “compliant,” and completely miss the scanned intake forms, faxed records, and image-based PDFs that contain the most sensitive patient data.

OCR-based classification closes that gap. It reads what other tools can’t see.

The question isn’t whether you have PHI in scanned documents. You do. The question is whether you know where it is before an auditor — or an attacker — finds it first.

→ Find PHI hiding in your scanned documents. Start your free risk assessment — includes built-in OCR, 250+ classifiers, EDM/HDM for near-perfect match accuracy, and support for PDFs, images, and archives.

Risk Finder includes OCR and EDM/HDM capability at no additional cost. All processing happens locally in your environment — scanned documents and patient data are never uploaded to external services.