Finding Structure for Your Unstructured Data Using Data Lakes


Let’s talk about the elephant in the enterprise. Hundreds of terabytes of unstructured data are filling up every repository, near and far, within data centers and anywhere else the data will fit.

Companies are increasingly finding themselves with large quantities of aging and diverse unstructured data. Each time these companies seek to understand why in the heck they are keeping it all, they discover that it serves some nebulous purpose, depending on who they talk to. It’s time to consider how to make real use of all that stuff.

Growing Popularity of Data Lakes

We’ve seen an increase in the popularity of data lakes. TechTarget defines a data lake as “a storage repository that holds a vast amount of raw data in its native format until it is needed.”

Taking that a step further, a Nuix data lake is a large collection of unstructured (and some structured) data that is indexed using Nuix to serve multiple use cases that fit your specific business vision. Building one means understanding the cost-benefit of having it in place and managing the size and reach of the lake.

It’s important to note that calling it a ‘Nuix data lake’ doesn’t imply housing the data with Nuix in any fashion. It’s solely a reflection of using Nuix to index and search the data collection within your stated use cases.

Sometimes very large databases full of structured data are used to form data lakes for a variety of data analytics. In this case, we are concerned with centralizing large quantities of emails, archive data, file shares, SharePoint data, and more into one indexed repository.

Elephant in server room
Images based on photos by Manuel Geissinger from Pexels and Stickophobe

Emerging Trends

After helping several large companies form Nuix data lakes, we’ve noticed the emergence of some trends and requirements that are important to understand if you opt to start down the path to your own data lake.

Fractured Infrastructure

Every customer we’ve worked with to create a data lake has held large quantities and varieties of unstructured data in various disconnected repositories. It’s a natural evolution as data storage requirements increase and new technologies become available; however, it can lead to increased costs and difficulty accessing the data at any meaningful scale.

Cost-Benefit Balance

Conducting a formal cost-benefit analysis is an absolute must, and it often helps identify multiple use cases where you can build efficiency and get greater value out of your data. The customers we’ve worked with mostly had a well-conceived vision and data federation plan, knowing they’d find a ton of duplication and near-duplication in a centralized repository.
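To make the duplication point concrete, here is a minimal sketch of how exact duplicates might be counted once files sit in one place. It is an illustration only, not how Nuix does it: the function names are hypothetical, it uses plain SHA-256 content hashing from the Python standard library, and it finds exact copies only. Near-duplicates (for example, the same document with a changed footer) would need a different technique, such as text similarity scoring.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; keep groups with 2+ files.

    Hypothetical helper for illustration: file names and locations are
    ignored, only the bytes are compared.
    """
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of_file(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_exact_duplicates(Path("/some/central/share"))` (a placeholder path) returns one entry per set of identical files, which gives a rough first number for the storage that a cost-benefit analysis could count as reclaimable.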

Data Lake
Image based on photo by Matt Lamers on Unsplash

Tying Use Cases Together

It typically takes two or more use cases or specific drivers to provide enough value for the company to proceed with building a data lake. These may come from different departments or stakeholders.

Why Fill the Lake?

Generally, we’ve seen a mix of proactive and reactive drivers pushing companies toward creating and filling a data lake.

  1. Ongoing eDiscovery

The most popular driver we see is frustration with how slowly or inaccurately iterative eDiscovery tasks get completed. These tasks include searching and producing old data for custodians on legal hold.

  2. Migration or Extraction from Legacy Email Archives

Large email archives are very common and often unmanageable. Many folks believe you need to extract the data, or at least the part of it that makes sense (by custodian or by date), then index it and prepare it for discovery, governance, or migration to a new platform like Microsoft Office 365.
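As a rough illustration of what extracting "by custodian or by date" can mean in practice, the sketch below filters a list of message records down to selected custodians within a date range. Everything here is an assumption made for the example: the `ArchivedMessage` record, its fields, and the `select_for_extraction` function are hypothetical and do not correspond to any particular archive or Nuix API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ArchivedMessage:
    """Hypothetical, simplified view of one message in a legacy archive."""
    custodian: str
    sent_on: date
    subject: str


def select_for_extraction(messages, custodians, start, end):
    """Return messages owned by the given custodians and sent within
    [start, end], inclusive. Custodian names are matched case-insensitively."""
    wanted = {name.lower() for name in custodians}
    return [
        m for m in messages
        if m.custodian.lower() in wanted and start <= m.sent_on <= end
    ]


# Made-up sample records, for illustration only.
archive = [
    ArchivedMessage("alice", date(2015, 3, 1), "Q1 forecast"),
    ArchivedMessage("bob", date(2015, 6, 9), "Vendor contract"),
    ArchivedMessage("alice", date(2011, 1, 20), "Old newsletter"),
]
subset = select_for_extraction(archive, ["Alice"], date(2014, 1, 1), date(2016, 12, 31))
```

With the sample data above, `subset` holds only the first record. Scoping the extraction this way keeps out-of-range material out of the index.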

  3. Legal Hold Management

Legal hold management is linked to the previous drivers and it often seems to take the form of removing hundreds or even thousands of old holds and reducing them to a reasonable, manageable number.

  4. Data Privacy and Information Governance

Recent regulations around the world have led to new interest in information governance. The most publicized of these, the European Union’s General Data Protection Regulation (GDPR), requires companies to respond to data subjects’ access requests and to delete information upon request under its ‘right to be forgotten’ provisions. Along with this, the California Consumer Privacy Act (CCPA) has introduced similar protections in the US that are likely to spread to other states.

Where Should the Lake Reside?

There are several popular options for where to create your new data lake, each of which has its own pros and cons.

  • Create a centralized repository where the data is gathered on premises on your network
  • Host it in the cloud and rely on external computing resources, especially if you can overcome the relatively high cost and logistics of gathering 100+ terabytes of information and transmitting them to the lake
  • Follow the ‘factory model’ employed by several of our service partners, who have taken their end customers’ unstructured data physically offsite to manage in their dedicated data centers
  • Index data in the wild into Nuix cases, leaving it in the wild with a Nuix index in place to manage it (we’ve also seen a customer do this and make it work)

In all these scenarios, most customers choose to include some form of Nuix metadata, full text, or Ringtail index on the data to make sense of it and use it productively. That being said, we even have one customer who simply extracted their email archive data and left it, unstructured, in user-based directories for later examination, governance, and production.

Data Is Living Longer

With the enormous increase in data being not only created but also saved for a long time, companies are coming up with legally compliant and creative ways of storing it so they can wrap their heads around what it is and how it can continue to serve them.

Data lakes, in their myriad forms, help companies accomplish this goal. In part two of this series, I’ll discuss three real-life data lakes and why our customers chose to implement them.

Information Governance
Posted on November 15, 2018 by Michael Lappin