The Schrödinger frame

Your Company Doesn't Have a Data Strategy, It Has a Data Hoard

Jun 22, 2026

There’s this particular kind of house you’ve seen on screen. From the street it looks ordinary. Inside, there are corridors carved between towers of newspapers, a sofa nobody has sat on since 2009, and a kitchen you reach by turning sideways. The owner will tell you, with total sincerity, that they might need all of it one day. That decade-old receipt. That cable for the iPod they no longer own. That third toaster that doesn’t even work.

We have a clinical name for this phenomenon when it happens to a company; we call it a “data lake.”

The number that should end the meeting is this: roughly 55% of all the data organizations collect is “dark,” meaning captured, stored, paid for, and never used for anything. Sixty percent of business and IT leaders say at least half their data is dark; a full third say it’s 75% or more [1]. So when a company describes itself as “data-driven,” it is worth asking: driven by which data?

A dozen hoards, none of them talking

The part the lake metaphor hides? It isn’t one pile. There are many, and they don’t speak to each other.

Sales keeps its customer list. Finance keeps another one. Support has a third, marketing a fourth. Same customers, four spellings of each name, three phone numbers. Ask “how many active accounts do we have?” and you’ll get four numbers in four meetings, each defended with the conviction of a man certain his toaster is the one that works.

These are silos, where the hoard does its quiet damage. In the average organization, roughly one byte in seven is actually doing a job. The rest are duplicate spreadsheets, stale exports, and the same contact record copied into systems that have never once compared notes [2].

This is what kills “data-driven”: not a shortage of data but a surplus of contradictory versions of it, none authoritative. You don’t find a needle faster by adding hay, least of all when there are twelve haystacks and each insists it has the real one.

The liability you forgot you owned

Every byte you store is a byte that can be stolen, and the dark, duplicated half of your estate is the worst thing to hold when the breach comes, because you can’t protect what you’ve forgotten you own. IBM puts the average breach at $4.44 million globally and $10.22 million in the United States [3].

The cruel twist is that dark data is often exactly what a regulator cares about: old customer records, copied PII, and the export someone emailed themselves in 2021, unclassified because nobody remembers it’s there. Under the GDPR, South Africa’s POPIA, and Nigeria’s data-protection act, a person can demand you delete their data. You can’t honor that for data you don’t know you hold, and across a dozen silos, you can’t be sure you’ve found every copy.

The cure is boring, and it works

If the disease is twelve contradictory hoards, the cure is not more data. It’s deciding, once, what a customer is and making everyone reference that single, validated version instead of their own.

This is the unglamorous discipline the consultants call “master data management” (MDM): one governed “golden record” for each core entity (customer, product, supplier) that the whole organization trusts as the source of truth. It’s less seductive than a new dashboard, sure. But a dashboard built on four disagreeing customer tables is just one way of being wrong. Validated, mastered data is the difference between a number you report and a number you’d bet the quarter on.

The supporting habits are equally dull and equally non-negotiable: classify what you hold, give every dataset an owner and an expiry date, and make deletion the default. A file with no owner and no retention date is an unexploded liability with a storage invoice attached.
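As a rough illustration of those habits, here is a minimal Python sketch of an inventory audit. Everything in it is an assumption made for the example: the `Dataset` fields, the three buckets, the dataset names, and the owners are hypothetical, and a real catalog would live in a governance tool rather than a list in a script.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    name: str
    classification: Optional[str]  # e.g. "pii", "internal", "public"
    owner: Optional[str]           # a named person, not a team alias
    expires: Optional[date]        # retention deadline


def audit(inventory: list[Dataset], today: date) -> dict[str, list[str]]:
    """Sort datasets into 'ungoverned', 'delete', and 'keep' buckets."""
    report: dict[str, list[str]] = {"ungoverned": [], "delete": [], "keep": []}
    for ds in inventory:
        # Missing classification, owner, or expiry: nobody is accountable.
        if ds.classification is None or ds.owner is None or ds.expires is None:
            report["ungoverned"].append(ds.name)
        # Deletion is the default once the retention date has passed.
        elif ds.expires <= today:
            report["delete"].append(ds.name)
        else:
            report["keep"].append(ds.name)
    return report


# Hypothetical inventory, for illustration only.
inventory = [
    Dataset("crm_customers", "pii", "A. Okafor", date(2027, 1, 31)),
    Dataset("2021_export_final_v2.csv", None, None, None),
    Dataset("campaign_leads_2022", "pii", "M. Dlamini", date(2024, 12, 31)),
]

report = audit(inventory, today=date(2026, 6, 22))
print(report)
```

The design choice worth noticing is that keeping a dataset is the branch that has to be earned: anything without a classification, an owner, and a date lands in the ungoverned bucket, and anything past its date is queued for deletion.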

How to actually build one

None of this requires a moonshot. Master data management has a well-worn sequence, and the steps that sink it are almost never technical. They’re the human refusals to decide and to own. Here’s the honest version, in the order that survives contact with a real organization.

Start with the domain that hurts most. Don’t try to master everything; instead, focus on the one core entity (usually customer or product) that is already causing you arguments, and earn the right to tackle the next one.
Find every copy. Before you can name a single source of truth, you have to locate all the false ones: every system the entity lives in, every field it hides in. You can’t master what you haven’t found.
Agree on what the “thing” is. One definition of “customer,” signed off by the whole business. This is the political fight, and it is where the program lives or dies. A definition nobody argued over is one nobody will use.
Decide who wins. When two records disagree, something has to break the tie. Set explicit survivorship rules, such as preferring the most recent value, the most trusted source, or a value that has been verified, so the golden record is built by policy, not by whoever exported last.
Cleanse, match, and merge against something real. Deduplicate, standardize, and validate against trusted external references where you can. Validation is what separates a master record from a merely confident one.
Assign it an owner with a name. Every mastered domain needs a human steward accountable for its quality: a person, not “the data team.” No owner, no master.
Publish it back, then enforce it. Push the golden record into the systems that need it and ensure they reference it, not copy it. Skip this, and you haven’t killed the silos; you’ve built a thirteenth and called it a cure.
Govern it like a program, not a project. Retention, access, audit, ongoing quality checks, and a metric or two. MDM isn’t finished. It’s maintained.
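To make the middle of that sequence concrete, here is a small Python sketch of matching, merging, and survivorship. It is a toy under stated assumptions, not how any particular MDM product works: the records, the `SOURCE_TRUST` ranking, and the rule "most trusted source first, most recent update as tie-breaker" are all hypothetical, and matching on an exact normalized email stands in for the fuzzier matching a real program would need.

```python
from collections import defaultdict
from datetime import date

# Hypothetical trust ranking agreed by the business: higher wins.
SOURCE_TRUST = {"finance": 3, "sales": 2, "support": 1, "marketing": 0}

# Hypothetical rows for the same customers, pulled from four silos.
records = [
    {"source": "sales", "name": "Acme Corp.", "email": "Ops@Acme.example ",
     "phone": "555-0100", "updated": date(2025, 3, 1)},
    {"source": "finance", "name": "ACME Corporation", "email": "ops@acme.example",
     "phone": None, "updated": date(2024, 11, 12)},
    {"source": "support", "name": "Acme", "email": "ops@acme.example",
     "phone": "555-0199", "updated": date(2026, 1, 20)},
    {"source": "marketing", "name": "Globex", "email": "hello@globex.example",
     "phone": "555-0142", "updated": date(2025, 7, 4)},
]


def match_key(record):
    """Standardize the matching field: trimmed, lowercased email."""
    return record["email"].strip().lower()


def survivor(candidates, field):
    """Pick the surviving value for one field by policy.

    Empty values are ignored; the most trusted source wins,
    and the most recent update breaks ties.
    """
    filled = [r for r in candidates if r[field]]
    if not filled:
        return None
    best = max(filled, key=lambda r: (SOURCE_TRUST[r["source"]], r["updated"]))
    return best[field]


def build_golden_records(rows):
    """Group rows that match, then merge each group field by field."""
    groups = defaultdict(list)
    for row in rows:
        groups[match_key(row)].append(row)
    return {
        key: {
            "email": key,
            "name": survivor(candidates, "name"),
            "phone": survivor(candidates, "phone"),
            "sources": sorted(r["source"] for r in candidates),
        }
        for key, candidates in groups.items()
    }


golden = build_golden_records(records)
print(len(records), "silo rows ->", len(golden), "golden records")
```

In this sketch the Acme name comes from finance because finance is ranked highest, while the phone number comes from sales because finance has none on file. The point is only that the tie is broken by a written rule that someone can argue with, rather than by export order.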

And now you want to feed it to the AI

Which brings us to the moment someone says, “We should put all this into an AI.”

Pour a hoard of contradictory silos into a model, and you get a hoard that talks back, confidently, having learned from the duplicates and the broken formula nobody flagged. Garbage in, eloquent garbage out, at scale. And the rush tends to skip the boring governance entirely.

A real data strategy, then, is mostly an exercise in saying “no” and “which one is true?” Before you store a thing, ask the only question that matters: what decision will this help us make? A shrug in response means you’ll be paying to move furniture for the next ten years.

1. Splunk, The State of Dark Data, global survey of 1,300+ business and IT leaders (55% of data dark; 60% say half or more is dark; one-third say 75%+). https://www.splunk.com/en_us/form/the-state-of-dark-data.html See also IBM, What Is Dark Data?, which attributes much dark data to organizational silos. https://www.ibm.com/think/topics/dark-data

2. Veritas, Global Databerg Report (Vanson Bourne, 2,550 IT decision-makers across 22 countries): 52% of stored data dark, 33% redundant/obsolete/trivial, only 15% business-critical. https://www.veritas.com/news-releases/2016-03-15-veritas-global-databerg-report-finds-85-percent-of-stored-data

3. IBM, Cost of a Data Breach Report 2025 (global average $4.44M; US $10.22M; 63% lacked AI governance policies; shadow AI added ~$670K). https://www.ibm.com/reports/data-breach and https://www.ibm.com/think/x-force/2025-cost-of-a-data-breach-navigating-ai

Digital Anthology
