Michael Avdeev · Compliance · 9 min read
How to Detect PHI in Scanned Documents and PDFs
Somewhere in your organization, there’s a folder full of scanned intake forms. Patient names, Social Security Numbers, insurance IDs, medical history — all captured as images inside PDFs.
Your DLP tool has scanned that folder a hundred times. It found nothing.
That’s because traditional DLP can’t read images. It looks at the text layer of a document. Scanned PDFs don’t have a text layer — they’re pictures of text. To your DLP, a scanned medical record looks exactly like a photo of a sunset: just pixels.
This is how PHI hides in plain sight.
The Scanned Document Problem
Healthcare runs on paper. Or at least, it used to. The transition to electronic records created a massive backlog of scanned documents:
- Patient intake forms — handwritten or typed, then scanned
- Insurance cards and IDs — photocopied at check-in
- Faxed records — still the standard for inter-provider communication
- Legacy medical records — digitized but never OCR’d
- Explanation of Benefits (EOBs) — scanned for billing disputes
- Lab results and imaging reports — often arrive as PDF attachments
Every one of these contains PHI. Every one of them is invisible to traditional scanning tools.
A 2024 study found that over 60% of healthcare organizations have PHI in scanned documents that has never been classified or inventoried. It’s not that they don’t care — it’s that their tools literally can’t see it.
Why Traditional DLP Misses Scanned Content
Traditional Data Loss Prevention tools work by pattern matching against text. They look for strings that match known formats:
- Social Security Numbers: XXX-XX-XXXX
- Credit card numbers: 4XXX-XXXX-XXXX-XXXX
- Medical Record Numbers: varies by system
This works great for text files, Word documents, Excel spreadsheets, and PDFs with embedded text. It fails completely for:
Image-Based PDFs
When you scan a paper document, you get a picture. The resulting PDF contains that picture, not searchable text. Open a scanned PDF and try to select the text — you can’t. Neither can your DLP.
Embedded Images
A Word document might have a scanned form pasted in as an image. The document itself is searchable, but the image within it isn’t. DLP scans the text, reports “clean,” and moves on — while PHI sits in the embedded image.
Fax-to-Email
Fax machines convert to PDF or TIFF. These are image files. Every faxed prescription, referral, or lab result is invisible to text-based scanning.
Screenshots and Photos
Someone photographed a patient chart for a quick consult. That photo is now in an email attachment, a Slack message, or a shared drive. It’s a JPEG. DLP doesn’t read JPEGs.
The result: your “compliant” environment has blind spots everywhere scanned content exists.
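To make the blind spot concrete, here is a minimal Python sketch of text-layer pattern matching. The two patterns and the `scan_text` helper are illustrative assumptions, not any vendor's actual classifiers. The point is only that a matcher returns nothing when the text layer it receives is empty, which is exactly what an image-only scan provides.

```python
import re

# Illustrative patterns only; real classifiers add validation and context checks.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b4\d{3}-\d{4}-\d{4}-\d{4}\b")


def scan_text(text):
    """Return every SSN-like and card-like string found in a text layer."""
    return {
        "ssn": SSN_RE.findall(text),
        "card": CARD_RE.findall(text),
    }


# A PDF with embedded text exposes a text layer the patterns can search.
typed_pdf_text = "Patient: John Smith  SSN: 555-12-3456"

# An image-only scan exposes no text layer, so the matcher sees an empty string.
scanned_pdf_text = ""

print(scan_text(typed_pdf_text))    # {'ssn': ['555-12-3456'], 'card': []}
print(scan_text(scanned_pdf_text))  # {'ssn': [], 'card': []}
```

The second call reports "clean" even though the page image may show the same SSN.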
OCR + Classification: How Modern Tools Work
The solution is Optical Character Recognition (OCR) combined with sensitive data classification. Here’s how it works:
Step 1: Identify Image Content
The scanner detects files that contain images — not just image files (JPEG, PNG, TIFF) but also documents with embedded images (PDFs, Word docs, PowerPoint).
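One common way to implement this step is to sniff the leading "magic bytes" of each file rather than trust its extension. The sketch below is an assumption about how such a check could look, not a description of any particular scanner. The function names and the three routing labels are hypothetical.

```python
# Leading signatures for formats that are, or commonly contain, images.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),    # little-endian TIFF
    (b"MM\x00*", "tiff"),    # big-endian TIFF
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx, .pptx, .xlsx
]

IMAGE_FORMATS = {"jpeg", "png", "tiff"}
CONTAINER_FORMATS = {"pdf", "zip"}


def sniff_format(header):
    """Return a format name based on the first bytes of a file, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def route(header):
    """Decide how a file should be handled (hypothetical routing labels)."""
    fmt = sniff_format(header)
    if fmt in IMAGE_FORMATS:
        return "ocr"               # the file itself is an image
    if fmt in CONTAINER_FORMATS:
        return "extract-then-ocr"  # may hold embedded images
    return "text-only"


print(route(b"\xff\xd8\xff\xe0 photo of a chart"))  # ocr
print(route(b"%PDF-1.7 scanned intake form"))        # extract-then-ocr
print(route(b"plain text notes"))                    # text-only
```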
Step 2: Extract Images
For compound documents like PDFs, the scanner extracts each embedded image for separate processing. A 50-page scanned PDF might contain 50 separate images.
Step 3: Run OCR
OCR converts the image to text. Modern OCR handles:
- Typed text (high accuracy)
- Handwritten text (moderate accuracy, improving with ML)
- Rotated or skewed documents
- Poor scan quality
- Multiple languages
Step 4: Classify the Extracted Text
Once OCR produces text, standard classification runs against it. The same 250+ classifiers that find SSNs in spreadsheets now find them in scanned intake forms.
Step 5: Report with Context
Results show the original file, the page or image where PHI was found, and the specific data types detected. You know exactly which scanned document needs attention.
The key insight: OCR transforms image content into text that classifiers can read. Without OCR, scanned documents are invisible. With it, they’re just another file type.
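Steps 3 through 5 can be sketched as a small Python pipeline. This is a simplified illustration under stated assumptions: the `Finding` record, the single SSN classifier, and the `fake_ocr` stand-in are all hypothetical. In practice the `ocr` argument would wrap a real OCR engine.

```python
import re
from dataclasses import dataclass

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Finding:
    file: str
    page: int
    data_type: str
    value: str


def classify(text):
    """Step 4: run text classifiers (only one SSN pattern in this sketch)."""
    return [("ssn", match) for match in SSN_RE.findall(text)]


def scan_document(path, page_images, ocr):
    """Steps 3-5: OCR each page image, classify the text, keep the context.

    `page_images` is the output of steps 1-2; `ocr` is any callable that
    takes image bytes and returns text.
    """
    findings = []
    for page_number, image in enumerate(page_images, start=1):
        text = ocr(image)
        for data_type, value in classify(text):
            findings.append(Finding(path, page_number, data_type, value))
    return findings


def fake_ocr(image):
    """Stand-in for a real OCR engine: treats the bytes as already-read text."""
    return image.decode("utf-8")


pages = [b"Intake form, page 1", b"Name: John Smith  SSN: 555-12-3456"]
for finding in scan_document("intake-forms.pdf", pages, fake_ocr):
    print(finding)
```

Passing the OCR engine in as a parameter keeps the classification logic identical for native text and OCR output, which is the design point the steps above describe.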
Why Pattern Matching Isn’t Enough: EDM and HDM for PHI
Here’s the problem with pattern-based classification: it guesses.
A regex that matches XXX-XX-XXXX will flag every string that looks like a Social Security Number. Some are real SSNs. Some are account or reference numbers that happen to share the format. Some are product codes. Your team spends hours triaging false positives instead of remediating real risk.
In healthcare, you can’t afford that noise. When you’re determining breach scope or proving HIPAA compliance, you need certainty — not probability scores.
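A short Python sketch shows why shape alone is a guess. It applies the Social Security Administration's structural rules (no 000, 666, or 9xx area number, no 00 group, no 0000 serial) as a filter. The sample string and function names are made up for illustration.

```python
import re

SSN_SHAPE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def is_plausible_ssn(area, group, serial):
    """Structural check only: rules out values the SSA never issues."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def find_candidates(text):
    """Return (value, plausible) for every SSN-shaped string in the text."""
    return [
        (m.group(0), is_plausible_ssn(*m.groups()))
        for m in SSN_SHAPE.finditer(text)
    ]


sample = "SSN 555-12-3456, part no. 000-45-6789, order ref 123-45-6789"
for value, plausible in find_candidates(sample):
    print(value, plausible)
```

The filter discards the part number, but the order reference still passes: nothing in its structure distinguishes it from a real SSN. Only a comparison against known records can settle that.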
Exact Data Matching (EDM)
EDM compares discovered data against a known dataset. Instead of asking “does this look like an SSN?” it asks “is this SSN in our patient database?”
How it works:
- Load your known PHI — patient names, MRNs, SSNs, DOBs from your EHR or claims system
- Scan your environment — Risk Finder extracts text from documents (including OCR for scanned content)
- Match against known records — when a scanned intake form contains “John Smith, 555-12-3456,” EDM confirms that’s an actual patient, not a random string
The result: every match is confirmed against a real patient record, not just rated “high confidence.”
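The matching step can be sketched in a few lines of Python. This is an illustration of the general EDM idea, not Risk Finder's implementation. The record fields (`mrn`, `name`, `ssn`) and the sample data are hypothetical.

```python
import re

SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def normalize(ssn):
    """Strip everything but digits so formatting differences do not matter."""
    return re.sub(r"\D", "", ssn)


def build_index(patient_records):
    """Map normalized SSN -> patient record for fast exact lookups."""
    return {normalize(record["ssn"]): record for record in patient_records}


def exact_match(text, index):
    """Return the known patient records whose SSN appears in the text."""
    matches = []
    for candidate in SSN_SHAPE.findall(text):
        record = index.get(normalize(candidate))
        if record is not None:
            matches.append(record)
    return matches


patients = [{"mrn": "MRN-0001", "name": "John Smith", "ssn": "555-12-3456"}]
index = build_index(patients)

ocr_text = "John Smith, 555-12-3456. Order ref 123-45-6789."
for record in exact_match(ocr_text, index):
    print(record["mrn"], record["name"])  # MRN-0001 John Smith
```

The order reference has the right shape but is not in the index, so it is never reported.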
Hashed Data Matching (HDM)
Some organizations can’t load raw patient data into a scanning tool — even one that runs locally. HDM solves this:
- Hash your sensitive data: SSNs become irreversible hashes like a3f2b8c1...
- Load only the hashes: no plaintext PHI in the matching database
- Compare hash-to-hash: discovered SSNs are hashed on-the-fly and compared
You get the same 100% match accuracy without exposing the original patient data during scanning.
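A hash-to-hash comparison might look like the Python sketch below. One assumption is worth stating loudly: because there are only about a billion possible SSNs, an unkeyed hash can be reversed by brute force, so this sketch uses a keyed HMAC-SHA-256 instead. Whether a given product does the same is not stated in this article. The key value and function names are hypothetical.

```python
import hashlib
import hmac
import re

SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def fingerprint(ssn, key):
    """Keyed SHA-256 of the digits-only SSN, returned as a hex string."""
    digits = re.sub(r"\D", "", ssn)
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


def hashed_matches(text, known_hashes, key):
    """Hash each discovered candidate and keep those present in the known set."""
    hits = []
    for candidate in SSN_SHAPE.findall(text):
        digest = fingerprint(candidate, key)
        if digest in known_hashes:
            hits.append(digest)
    return hits


# Hypothetical secret; in practice it would live in a secrets manager.
key = b"demo-key-change-me"

# Prepared once, where the patient data lives. Only hashes leave that system.
known_hashes = {fingerprint("555-12-3456", key)}

ocr_text = "John Smith, 555-12-3456. Order ref 123-45-6789."
print(len(hashed_matches(ocr_text, known_hashes, key)))  # 1
```

The scanner needs the key and the hash set, but never the plaintext patient list.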
Why This Matters for Healthcare
| Approach | Accuracy | False Positives | Use Case |
|---|---|---|---|
| Pattern matching | ~80-90% | High | Initial discovery, broad scans |
| EDM | 100% | Zero | Breach scope, compliance proof |
| HDM | 100% | Zero | High-security environments |
For breach notification, you need to tell HHS exactly whose PHI was exposed. “We found 1,247 possible SSNs” doesn’t cut it. “We confirmed 892 patient records were in the compromised folder” does.
For HIPAA audits, showing that you matched against known patient data demonstrates a higher standard of due diligence than pattern-only scanning.
Risk Finder supports both EDM and HDM: load your patient database (or hashes) and turn “possible PHI” into “confirmed PHI.” Combined with OCR, that confirmation extends to scanned documents, as long as OCR reads the value correctly in the first place.
File Types That Contain Hidden PHI
Not all PHI hides in obvious places. Here’s where scanned content accumulates:
High-Risk File Types
| File Type | Why PHI Hides Here |
|---|---|
| Scanned PDFs | No text layer — pure image content |
| TIFF files | Standard fax output format |
| JPEG/PNG images | Photos of documents, ID cards, charts |
| Multi-page PDFs | Mixed content — some pages OCR’d, some not |
| Email attachments | Faxes forwarded as attachments |
Unexpected Locations
| Location | What’s There |
|---|---|
| Email archives (.PST) | Years of faxed records as attachments |
| SharePoint document libraries | Scanned contracts with SSNs |
| Shared drives | Legacy “scanned documents” folders |
| Cloud storage | Mobile uploads of patient forms |
| Backup archives | Historical scans never classified |
The Nested Problem
PHI doesn’t just hide in images — it hides in images inside archives inside folders:
/shared/old-records/2019-archive.zip
└── patient-files/
    └── intake-forms.pdf (50 pages, all scanned images)
        └── Page 23: SSN, DOB, diagnosis codes
A scanner that doesn’t extract archives, process PDFs, run OCR, and classify the result will never find that SSN on page 23.
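The archive-extraction layer of that chain can be sketched with Python's standard `zipfile` module. This is a minimal illustration that handles ZIP only and builds its test archive in memory. The `walk_archive` and `make_zip` helpers and the `!/` path separator are assumptions made for the example.

```python
import io
import zipfile


def walk_archive(data, prefix="", max_depth=5):
    """Yield (path, bytes) for each file in a ZIP, recursing into nested ZIPs.

    `max_depth` caps recursion as a basic guard against archive bombs.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            content = archive.read(info)
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(content)):
                yield from walk_archive(content, f"{path}!/", max_depth - 1)
            else:
                yield path, content


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"patient-files/intake-forms.pdf": b"%PDF-1.4 placeholder"})
outer = make_zip({"2019-archive.zip": inner, "readme.txt": b"old records"})

for path, content in walk_archive(outer):
    print(path, len(content))
```

Each yielded file would then go on to PDF image extraction, OCR, and classification. Note that Office documents are ZIP containers too, so this walker descends into them as well.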
Step-by-Step: Scanning PDFs for PHI with Risk Finder
Here’s how to find PHI in scanned documents using Risk Finder:
1. Deploy the Scanner
Pull the Docker image and run it against your target location:
docker pull inspectdatainc/risk-finder
docker run -v /path/to/documents:/scan inspectdatainc/risk-finder
Risk Finder runs locally: your documents never leave your environment.
2. OCR Runs Automatically
Risk Finder includes built-in OCR. When it encounters a scanned PDF or image file, it automatically:
- Detects image-based content
- Extracts embedded images from PDFs
- Runs OCR to convert to text
- Classifies the extracted text
No configuration required. No separate OCR license.
3. Review Results
The scan report shows:
- File path: Exact location of the document
- Page number: Which page contains PHI (for multi-page PDFs)
- Data types found: SSN, MRN, DOB, diagnosis codes, etc.
- Match count: How many instances per document
- Risk level: Severity based on data type combination
4. Prioritize Remediation
Focus on high-risk findings first:
- SSN + Medical data = HIPAA breach risk
- Open share + PHI = Immediate exposure
- High volume = Systematic process problem
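These three rules can be expressed as a small triage function. The Python sketch below is illustrative only: the finding fields (`data_types`, `open_share`, `match_count`), the flag names, and the volume threshold of 100 are hypothetical and do not reflect Risk Finder's actual report schema.

```python
MEDICAL_TYPES = {"mrn", "diagnosis_code"}


def priority_flags(finding, volume_threshold=100):
    """Return the remediation flags that apply to one finding."""
    types = set(finding["data_types"])
    flags = []
    if "ssn" in types and types & MEDICAL_TYPES:
        flags.append("hipaa-breach-risk")
    if finding["open_share"] and types:
        flags.append("immediate-exposure")
    if finding["match_count"] >= volume_threshold:
        flags.append("systematic-process-problem")
    return flags


def prioritize(findings):
    """Sort findings so those with the most flags come first."""
    return sorted(findings, key=lambda f: len(priority_flags(f)), reverse=True)


findings = [
    {"path": "/hr/badge-scan.pdf", "data_types": ["dob"],
     "open_share": False, "match_count": 1},
    {"path": "/shared/intake-forms.pdf", "data_types": ["ssn", "diagnosis_code"],
     "open_share": True, "match_count": 250},
    {"path": "/billing/eob.tiff", "data_types": ["ssn", "mrn"],
     "open_share": False, "match_count": 3},
]

for finding in prioritize(findings):
    print(finding["path"], priority_flags(finding))
```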
5. Document for Compliance
Export results as PDF or JSON for your compliance records. You now have evidence that you’ve scanned for PHI — including in scanned documents that other tools miss.
HIPAA Compliance Requirements for Document Scanning
HIPAA doesn’t explicitly mention OCR. But it does require you to know where PHI exists.
The Security Rule Requirements
§ 164.308(a)(1)(ii)(A) — Risk Analysis:
Conduct an accurate and thorough assessment of the potential risks and vulnerabilities to the confidentiality, integrity, and availability of electronic protected health information held by the covered entity or business associate.
You can’t assess risk to PHI you don’t know exists. If your scanned documents contain PHI and you’ve never inventoried them, your risk analysis is incomplete.
§ 164.310(d)(1) — Device and Media Controls:
Implement policies and procedures that govern the receipt and removal of hardware and electronic media that contain electronic protected health information into and out of a facility, and the movement of these items within the facility.
“Electronic media” includes storage containing scanned documents. If you don’t know which drives contain PHI in scanned form, you can’t properly control them.
What Auditors Look For
During a HIPAA audit or an investigation by the HHS Office for Civil Rights (the regulator, not the technology, that shares the OCR acronym), auditors ask:
- Where is your PHI inventory? — You need to show you know where PHI exists
- What scanning methodology did you use? — “We ran DLP” isn’t sufficient if DLP can’t read images
- How do you handle scanned documents? — Auditors know this is a gap
- What’s your remediation process? — Finding PHI is step one; controlling it is step two
Organizations that can demonstrate OCR-based scanning show a higher level of due diligence than those relying on text-only tools.
Breach Notification Implications
When a breach occurs, you need to determine what was exposed. If the compromised system contained scanned documents, you need to know:
- Did those documents contain PHI?
- Whose PHI was in them?
- What data elements were exposed?
Without OCR-based classification, you’re guessing. With it, you have an inventory that tells you exactly what was at risk.
The Blind Spot You Can’t Afford
Every healthcare organization has scanned documents. Most have never classified them for PHI.
Traditional DLP tools give you a false sense of security. They scan your environment, report “compliant,” and completely miss the scanned intake forms, faxed records, and image-based PDFs that contain the most sensitive patient data.
OCR-based classification closes that gap. It reads what other tools can’t see.
The question isn’t whether you have PHI in scanned documents. You do. The question is whether you know where it is before an auditor — or an attacker — finds it first.
→ Find PHI hiding in your scanned documents. Start your free risk assessment — includes built-in OCR, 250+ classifiers, EDM/HDM for near-perfect match accuracy, and support for PDFs, images, and archives.
Risk Finder includes OCR and EDM/HDM capability at no additional cost. All processing happens locally in your environment — scanned documents and patient data are never uploaded to external services.