Michael Avdeev · Insights · 10 min read

Dark Data: The Hidden Risk in Your Organization

Your organization has a data problem it doesn’t know about.

Not the data you manage. Not the databases your IT team monitors. Not the cloud storage your security team scans.

The problem is dark data — the massive, invisible estate of files, records, and information that exists across your infrastructure but doesn’t appear on any inventory, isn’t subject to any policy, and hasn’t been touched in years.

Gartner estimates that 55% of enterprise data is dark. More than half of everything you store is unclassified, unmanaged, and effectively invisible.

This isn’t a theoretical concern. It’s an active liability — a breach waiting to happen, a compliance violation waiting to be discovered, and a cost center hemorrhaging money on storage for data nobody uses.


What is Dark Data?

Dark data is information that organizations collect, process, and store during normal operations but never use for any other purpose. It accumulates silently in:

  • Forgotten S3 buckets provisioned for projects that ended years ago
  • Abandoned SharePoint sites from teams that reorganized or dissolved
  • Legacy file shares migrated from system to system but never cleaned
  • Employee local drives containing years of downloaded attachments and exports
  • Email archives stretching back decades
  • Backup tapes and snapshots kept “just in case” indefinitely
  • Test environments populated with production data copies

The defining characteristic of dark data is invisibility. No one knows it exists. No one reviews it. No one protects it. No one deletes it.

It just sits there. Accumulating. Growing. Waiting.

The Scale of the Problem

The numbers are staggering:

  • 55% of enterprise data is dark (Gartner)
  • 68% of data is never used again after initial creation (IDC)
  • 33% of files haven’t been accessed in over four years (Veritas)
  • ROT data (Redundant, Obsolete, Trivial) comprises up to 70% of stored information

For a typical enterprise with 10 petabytes of storage, applying the 55-70% range above means roughly 5.5 to 7 petabytes of data that nobody manages, nobody protects, and nobody needs.

This is not a storage problem. It’s a risk problem.


The Triple Threat: Why Dark Data Destroys Organizations

Dark data creates three distinct categories of organizational risk. Each is sufficient to justify action. Together, they represent an existential liability.

1. Security Risk: The Target You Don’t Know You Have

Attackers love dark data. It’s the perfect target — sensitive information sitting in unmonitored, unprotected locations that nobody watches.

Consider what hides in dark data:

  • Customer PII from old marketing campaigns and support tickets
  • Employee records from HR systems migrated years ago
  • Financial data from legacy accounting systems
  • Healthcare information from old claims processing
  • Credentials and secrets in abandoned configuration files
  • Intellectual property in forgotten project folders

When breaches occur, dark data is often the source. Attackers don’t need to penetrate your monitored production systems. They find the abandoned SharePoint site, the legacy file share, the test database with production data. These targets are easier to compromise and often contain data just as sensitive as your protected systems.

The 2024 Verizon DBIR found that misconfigured storage — the primary habitat of dark data — was involved in 21% of breaches. These aren’t sophisticated attacks. They’re attackers finding doors that nobody knew were open.

You can’t protect what you don’t know exists. Dark data, by definition, sits outside your security controls. It is often unencrypted. Access isn’t monitored. DLP doesn’t scan it. It’s invisible to your security program while remaining perfectly visible to attackers.

2. Compliance Risk: The Violation You Can’t See Coming

Every major data protection regulation assumes you know what data you have. Dark data makes compliance impossible.

GDPR requires:

  • Records of processing activities (Article 30)
  • Ability to respond to data subject access requests within one month of receipt (Article 12(3)), extendable only in limited cases
  • Data minimization — don’t keep what you don’t need
  • Breach notification specifying what data was affected

How do you comply when you don’t know what data exists?

A customer exercises their “right to be forgotten.” You delete their records from known systems. But their information also exists in a legacy backup, an abandoned CRM export, and an employee’s local drive. You’ve violated GDPR, and you don’t even know it.

CCPA/CPRA requires:

  • Disclosure of what personal information you collect
  • Ability to delete personal information upon request
  • Knowledge of where personal information is stored

Dark data makes accurate disclosure impossible. You can’t disclose data you don’t know about. You can’t delete data you can’t find.

HIPAA requires:

  • Risk assessment identifying where PHI exists
  • Appropriate safeguards for all PHI
  • Breach notification specifying affected records

Dark PHI is a HIPAA violation waiting to be discovered. When the HHS Office for Civil Rights (OCR) investigates, “we didn’t know that data existed” isn’t a defense.

The compliance math is simple: If 55% of your data is dark, you cannot demonstrate compliance for 55% of your data.

3. Financial Risk: The Cost Center Hiding in Plain Sight

Dark data isn’t free. You’re paying for it every month — in storage, backup, replication, and management overhead.

Direct storage costs:

| Storage Type | Cost per TB/Month | Annual Cost for 5PB of Dark Data |
|---|---|---|
| AWS S3 Standard | $23 | $1.38 million |
| Azure Blob Hot | $20 | $1.2 million |
| Enterprise SAN | $50-100 | $3-6 million |
| Cloud Backup | $10-25 | $600K-1.5 million |

At $20-100 per TB per month, every petabyte of dark data costs $240,000-$1.2 million annually in storage alone.
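The arithmetic behind the table above is easy to reproduce for your own volumes. The sketch below assumes decimal units (1 PB = 1,000 TB) and treats the table's per-TB rates as illustrative list prices, not quotes. The function name `annual_cost` is hypothetical.

```python
TB_PER_PB = 1_000  # assumption: decimal units, which the table's figures imply


def annual_cost(petabytes, usd_per_tb_month):
    """Annual storage cost in USD for a given volume and monthly per-TB rate."""
    return petabytes * TB_PER_PB * usd_per_tb_month * 12


# (low, high) rates in USD per TB per month, copied from the table above.
rates = {
    "AWS S3 Standard": (23, 23),
    "Azure Blob Hot": (20, 20),
    "Enterprise SAN": (50, 100),
    "Cloud Backup": (10, 25),
}

for name, (low, high) in rates.items():
    print(f"{name}: ${annual_cost(5, low):,} to ${annual_cost(5, high):,} per year for 5 PB")
```

Swapping in your negotiated rates and measured volumes turns the same function into a rough budget line for dark data.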

But direct storage is just the beginning.

Hidden costs include:

  • Backup and replication: Dark data gets backed up too. Multiple copies compound storage costs.
  • Migration overhead: Every infrastructure migration copies dark data to new systems. Projects take longer, cost more.
  • Discovery and litigation: Legal hold and eDiscovery must search all data, including dark data. More data = more cost.
  • Security tooling: SIEM, CASB, and DLP tools often charge by data volume. Dark data inflates costs.

The ROT calculation:

ROT data — Redundant, Obsolete, and Trivial information — has negative value. It costs money to store, increases risk exposure, and provides zero business benefit.

A conservative estimate: 30-50% of dark data is pure ROT that could be deleted immediately if identified.

For an organization with 5PB of dark data:

  • About 2PB (40%, the midpoint of that range) could be deleted immediately
  • $480,000-$2.4 million in annual storage savings, at $240,000-$1.2 million per petabyte
  • Reduced attack surface
  • Simplified compliance scope
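The savings figures above follow from simple multiplication. This sketch assumes the per-petabyte annual cost range of $240,000 to $1.2 million quoted earlier and shows how the result moves across the 30-50% ROT estimate. The function name `rot_savings` is hypothetical.

```python
LOW_USD_PER_PB_YEAR = 240_000     # from the per-petabyte range quoted earlier
HIGH_USD_PER_PB_YEAR = 1_200_000


def rot_savings(dark_pb, rot_percent):
    """Return (deletable PB, low annual savings, high annual savings)."""
    deletable_pb = dark_pb * rot_percent / 100
    return (
        deletable_pb,
        deletable_pb * LOW_USD_PER_PB_YEAR,
        deletable_pb * HIGH_USD_PER_PB_YEAR,
    )


for percent in (30, 40, 50):
    pb, low, high = rot_savings(5, percent)
    print(f"{percent}% ROT: {pb} PB deletable, ${low:,.0f} to ${high:,.0f} saved per year")
```

At 40% ROT, 5PB of dark data yields 2PB to delete and $480,000 to $2.4 million in annual savings, matching the figures above.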

Dark data isn’t just a cost. It’s wasted budget that could fund actual security improvements.


Why Dark Data Accumulates

Understanding how dark data grows helps prevent future accumulation.

No Ownership, No Accountability

Data is created by everyone. Data is managed by no one.

Marketing creates campaign data, then moves on to the next campaign. Engineering creates test environments, then forgets about them. Employees download files, never delete them. Projects end, but project data persists.

Without clear data ownership, no one is responsible for cleanup. Data accumulates because deleting it is someone else’s problem — or no one’s problem.

Default to Keep

The path of least resistance is retention.

Deleting data requires effort: determining it’s safe to delete, actually deleting it, documenting the deletion. Keeping data requires nothing — just let it sit.

When storage was expensive, economics forced discipline. Now that storage is cheap, there’s no forcing function. The default is to keep everything forever.

Migration Without Cleanup

Every migration creates dark data.

Move from on-premises to cloud? Data gets copied, not cleaned. Merge with another company? Their data gets absorbed. Upgrade applications? Old exports persist alongside new systems.

Migrations are additive. Old data isn’t deleted; it’s migrated to new infrastructure where it continues accumulating.

Retention Policies Exist on Paper Only

Most organizations have retention policies. Most organizations don’t enforce them.

The policy says “delete after 7 years.” But who reviews data age? Who deletes? Who validates? Without automation and accountability, retention policies are documentation, not practice.
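Automating the "who reviews data age?" question does not take much. The following is a minimal sketch, assuming a local or mounted file system and using last-modified time as a stand-in for record age. It only reports candidates and never deletes anything. The function name and the 7-year default are illustrative.

```python
import time
from pathlib import Path

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def files_past_retention(root, retention_years=7, now=None):
    """Yield (path, age_in_years) for files last modified before the cutoff.

    Assumption: modification time approximates record age. Real retention
    rules usually key off a business date (contract end, claim closure).
    """
    now = time.time() if now is None else now
    for path in Path(root).rglob("*"):
        if path.is_file():
            age_years = (now - path.stat().st_mtime) / SECONDS_PER_YEAR
            if age_years > retention_years:
                yield path, age_years
```

Calling `files_past_retention("/mnt/legacy-share")` (a hypothetical path) on a schedule produces a review queue. A production workflow would also check legal holds before anything is removed.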


Illuminating the Dark: How Modern Discovery Works

The traditional approach to managing dark data was manual inventory — send staff to catalog data repositories, classify contents, make retention decisions. This approach doesn’t scale. Manual efforts touch a fraction of data and produce inventories that are stale before they’re complete.

Modern data discovery uses AI-powered classification to illuminate dark data at scale.

How It Works

1. Automated Scanning

Discovery tools connect to data repositories (file shares, cloud storage, databases, email archives) and scan content automatically. No manual cataloging. No sampling. Every file, every record.

2. AI-Driven Classification

Machine learning models classify content by sensitivity (PII, PHI, credentials, intellectual property) and by value (active vs. stale, business-critical vs. ROT). Classification happens at scale without manual tagging.

3. Risk Prioritization

Not all dark data carries equal risk. Discovery tools identify high-risk concentrations (sensitive data in unprotected locations, PII in forgotten shares, credentials in abandoned repositories) and prioritize remediation.

4. Actionable Inventory

The output isn’t a report to file. It’s an actionable inventory: what data exists, where it lives, how sensitive it is, who owns it, when it was last accessed. This enables decisions: protect, migrate, or delete.
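To make the classification step concrete, here is a deliberately simplified sketch. It swaps the machine learning models described above for a handful of regular expressions, which is not how commercial discovery tools work but shows the shape of the output: a label and a count per piece of content. The pattern set and the function name `classify_text` are assumptions for illustration.

```python
import re

# Illustrative patterns only. Real classifiers add validation, context,
# and many more data types, and produce far fewer false positives.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def classify_text(text):
    """Return {label: match_count} for every pattern found in the text."""
    counts = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            counts[label] = len(matches)
    return counts


sample = "Contact jane@example.com, SSN 123-45-6789, key AKIAABCDEFGHIJKLMNOP"
print(classify_text(sample))
# {'email': 1, 'us_ssn': 1, 'aws_access_key_id': 1}
```

Running a function like this over every file a scanner reaches, and storing the counts next to the file's location, is the raw material for the risk prioritization and inventory steps.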

The Discovery Mindset

You can’t govern data you can’t see. Dark data exists because organizations lack visibility. Discovery creates visibility.

But discovery isn’t a one-time project. Data accumulates continuously. Discovery must be continuous — regular scans that identify new dark data before it becomes entrenched.

Organizations that implement ongoing discovery shift from reactive (finding dark data after incidents) to proactive (identifying and addressing dark data before it creates risk).

Related: Detect Data Misplacement — How to find sensitive data in locations it shouldn’t be.


The Dark Data Action Plan

Addressing dark data requires commitment, but the path is straightforward.

Step 1: Illuminate

Scan your entire data estate. Not samples. Not known repositories. Everything — including the systems nobody remembers provisioning.

Identify:

  • What data exists
  • Where it lives
  • How sensitive it is
  • When it was last accessed
  • Who has access

This creates the visibility foundation for all subsequent action.
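As a rough illustration of what that foundation looks like in data terms, the sketch below walks a file tree and records metadata for each file. It assumes a local or mounted file system. It covers location, size, timestamps, and permission bits only, so sensitivity still has to come from content classification. `InventoryRecord` and `build_inventory` are hypothetical names.

```python
import os
import stat
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class InventoryRecord:
    path: str
    size_bytes: int
    last_modified: datetime
    last_accessed: datetime  # unreliable on volumes mounted with noatime
    mode: str                # permission bits, e.g. "0o644"


def build_inventory(root):
    """Collect one InventoryRecord per readable file under root."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            try:
                st = full.stat()
            except OSError:
                continue  # unreadable or vanished; a real tool would log this
            records.append(InventoryRecord(
                path=str(full),
                size_bytes=st.st_size,
                last_modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
                last_accessed=datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
                mode=oct(stat.S_IMODE(st.st_mode)),
            ))
    return records
```

Cloud object stores, SharePoint, and mail archives need their own connectors, but the record shape stays much the same.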

Step 2: Classify and Prioritize

Not all dark data is equally dangerous. Prioritize based on:

  • Sensitivity: PII, PHI, credentials rank highest
  • Exposure: Publicly accessible or broadly permissioned data first
  • Age: Data untouched for years is likely ROT
  • Regulation: Data subject to GDPR, HIPAA, PCI requires immediate attention

Focus remediation on high-risk, high-sensitivity dark data. Low-risk ROT can wait.
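One way to turn those four criteria into a ranked work queue is a simple additive score. The weights below are arbitrary assumptions chosen for illustration, not an industry standard, and the `Finding` fields and example locations are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    location: str
    sensitivity: str            # "high" (PII, PHI, credentials), "medium", "low"
    publicly_accessible: bool
    years_since_access: float
    regulated: bool             # subject to GDPR, HIPAA, PCI, and similar


SENSITIVITY_POINTS = {"high": 30, "medium": 20, "low": 10}  # assumed weights


def risk_score(finding):
    """Higher score means remediate sooner."""
    score = SENSITIVITY_POINTS[finding.sensitivity]
    if finding.publicly_accessible:
        score += 20
    if finding.regulated:
        score += 15
    score += min(finding.years_since_access, 5)  # age counts a little, capped
    return score


findings = [
    Finding("fileshare/lunch-menus", "low", False, 6, False),
    Finding("s3://old-campaign-exports", "high", True, 4, True),
    Finding("sharepoint/hr-archive", "high", False, 3, True),
]

for f in sorted(findings, key=risk_score, reverse=True):
    print(risk_score(f), f.location)
```

The exposed, regulated, high-sensitivity bucket rises to the top and the stale lunch menus sink to the bottom, which is the ordering the criteria above call for.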

Step 3: Remediate

For each category of dark data, decide:

  • Protect: If data has ongoing value, bring it under proper controls — encrypt, restrict access, apply retention
  • Migrate: If data should exist but lives in the wrong place, move it to appropriate governed storage
  • Delete: If data has no value and creates risk, eliminate it

Deletion is the most powerful remediation. Data that doesn’t exist can’t be breached, can’t violate compliance, and doesn’t cost storage.
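The three-way decision above can be written down as a small rule, which helps when remediation is applied to thousands of repositories instead of a handful. This is a minimal sketch: the two boolean inputs are assumed to come from an owner's review or from classification results, and the names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    PROTECT = "protect"
    MIGRATE = "migrate"
    DELETE = "delete"


def choose_action(has_ongoing_value, in_governed_storage):
    """Map the protect / migrate / delete rule onto two yes-or-no questions."""
    if not has_ongoing_value:
        return Action.DELETE      # no value: eliminate it
    if not in_governed_storage:
        return Action.MIGRATE     # valuable but in the wrong place
    return Action.PROTECT         # valuable and in place: apply controls


print(choose_action(has_ongoing_value=False, in_governed_storage=False).value)  # delete
print(choose_action(has_ongoing_value=True, in_governed_storage=False).value)   # migrate
print(choose_action(has_ongoing_value=True, in_governed_storage=True).value)    # protect
```

A real workflow would likely add a check for legal holds and retention obligations before any delete is executed.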

Step 4: Prevent Recurrence

Dark data is a symptom of governance failure. Address root causes:

  • Assign data ownership: Every repository has an owner accountable for its contents
  • Automate retention: Enforce policies automatically, not manually
  • Monitor continuously: Regular scans catch new dark data before it accumulates
  • Budget discipline: Charge storage costs to data owners to create deletion incentives
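The chargeback idea in the last bullet is straightforward to prototype once repositories have owners. The sketch below assumes a flat per-TB monthly rate and a list of repositories with `owner` and `size_tb` fields, all of which are hypothetical. Repositories without an owner are grouped under "UNASSIGNED" so the ownership gap is visible on the bill.

```python
from collections import defaultdict


def monthly_chargeback(repositories, usd_per_tb_month):
    """Return {owner: monthly storage charge in USD}."""
    totals = defaultdict(float)
    for repo in repositories:
        owner = repo.get("owner") or "UNASSIGNED"
        totals[owner] += repo["size_tb"] * usd_per_tb_month
    return dict(totals)


repos = [
    {"name": "campaign-archive", "owner": "marketing", "size_tb": 100},
    {"name": "test-env-copies", "owner": "engineering", "size_tb": 250},
    {"name": "legacy-share", "owner": None, "size_tb": 400},
    {"name": "old-exports", "owner": "marketing", "size_tb": 50},
]

print(monthly_chargeback(repos, usd_per_tb_month=23))
# {'marketing': 3450.0, 'engineering': 5750.0, 'UNASSIGNED': 9200.0}
```

In this made-up example the largest line item has no owner at all, which is usually the first thing a chargeback report exposes.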

The Board-Level Conversation

Dark data isn’t an IT problem. It’s an enterprise risk that belongs in board-level discussion.

The questions directors should ask:

  1. What percentage of our data is classified and under governance?
  2. How much data exists that we don’t know about?
  3. What’s our storage cost for data that provides no business value?
  4. If we had a breach tomorrow, could we identify what data was affected?
  5. Are we confident we can respond to GDPR/CCPA data subject requests completely?

If leadership can’t answer these questions, dark data is a material risk that hasn’t been assessed.

The good news: addressing dark data isn’t a multi-year transformation. With modern discovery tools, organizations can illuminate their dark data estate in weeks, not years. The visibility to manage this risk is achievable.

The only question is whether you’ll discover your dark data before attackers do.


Ready to see what’s hiding in your data estate? Map your dark data with automated discovery — find sensitive data where it shouldn’t be, identify ROT, and eliminate risk hiding in plain sight.

