
Digging into Dark Data


This is a revival of a post from our archives from 2014, updated slightly but still incredibly applicable today.

Many organizations have sought for years to extract more value from their data. The industry trends and ‘hot technologies’ have evolved over time. Terms like Big Data, data scientists, Hadoop clusters, data science, and more have all been employed in search of that nugget of information hiding amid the clutter that could yield a true competitive advantage.

The list of puns and clichés that come to mind is staggering, so please consider this:

  • Gartner coined the phrase dark data to describe the masses of unstructured information (around 80% of the total by volume) that organizations retain but have no meaningful way of analyzing.
  • The text mining industry has been around for years. It’s continuously improving how people can scour text for information, basically looking for nuggets of gold.
  • Big Data promised a veritable gold rush of insights to enable you to mine your data for those hidden nuggets.

Picks and Shovels

Even though Big Data has gotten even bigger over the years, many of the technologies on the market provide only picks and shovels when something bigger and more powerful is required. What good is working only with plain text, HTML, and basic documents when the motherlode is buried deep under strata of unstructured data?

Without the right tools, most data prospectors, just like many who joined the gold rushes throughout history, will come home empty-handed.

In my mining analogy, Nuix software is akin to a Caterpillar D9 bulldozer: powerful, robust, and capable of getting through a ton of data in a hurry. And while our software has evolved to cover many more scenarios over the years, I’ll be the first to admit it isn’t all things to all people. We play a valuable role in the larger ‘mining’ operation.

Consider this. What is the raw material for analytics platforms? It’s text. There are entire disciplines focused on ‘textual analytics’ and ‘text mining.’

But have you ever wondered where they get the text for analysis? Hadoop may be a powerful and versatile analytics platform, but try getting it to work with information stored in an email archive. If you’ve tried, my guess is you struggled to come up with usable text.
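To make that concrete, here is a minimal sketch of what "coming up with usable text" looks like for just one email message, using only Python's standard `email` module. The sample message and the `extract_text` helper are hypothetical and purely for illustration. A real archive adds proprietary containers, attachments, embedded items, and encoding problems that this sketch does not attempt to handle.

```python
from email import message_from_string
from email.message import Message


def extract_text(msg: Message) -> str:
    """Collect the text/plain parts of a message, skipping attachments."""
    parts = []
    for part in msg.walk():
        # Ignore containers, HTML bodies, and binary parts.
        if part.get_content_type() != "text/plain":
            continue
        # A text/plain part with a filename is an attachment, not the body.
        if part.get_filename():
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts).strip()


# Hypothetical sample message; a real archive holds millions of these.
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 product idea\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "What if we bundled the two services?\n"
)

message = message_from_string(RAW)
# A flat record like this is the kind of input a text analytics job expects.
record = {"subject": message["Subject"], "text": extract_text(message)}
print(record)
```

Even in this toy case, the body has to be separated from headers, decoded, and filtered by content type before an analytics platform can use it.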

Obviously, I’m not proposing that Nuix can replace your analytics platform. Far from it! I do want to get you thinking about what you could do if you had more text to feed that platform, text that might be hidden ‘in the dark.’

Think of the potential insights you could glean from your old emails or maybe terabytes of user documents. Maybe you’d look for fraud; or try to find those groups of people who are like canaries in the coal mine, acting as the leading indicators of a new market trend; or dig around for great new product ideas that got lost in the shuffle.

I’ve been fascinated by the promise and continued prominence of IBM’s Watson. Machine learning advances continue to be staggering. I can’t help but wonder what a technology like Watson could do if it were fed all the contents of an organization’s dark data, normalized and sorted the way only our software can.

Photo by Christian M. M. Brady