
Five Real Data Lake Customer Stories


Previously, I wrote about the enormous flexibility in architecting and filling a Nuix-indexed data lake. Following up on that article, as promised, I’ll examine five customer case studies to help illustrate more of the drivers behind these architectures and the use cases in production.

Architecturally speaking, there are four deployment options for a Nuix-indexed data lake, depending on where the lake lives:

  • Cloud
  • On-premises
  • Nuix partner’s lab
  • Dispersed (leaving the data where it is)

We also discussed that most data lakes combine more than one use case, with applications across eDiscovery, information governance, migration, or legal hold management. Having multiple use cases usually helps strengthen the return on investment of creating a data lake.

To underscore that point, let’s explore our case studies. We have two pharmaceutical companies, an insurance company, a medical device company, and a bank, all with different data lake stories to tell.


The first customer is a pharmaceutical company with approximately 7,000 employees and $10 billion in revenue. With 450 TB of SharePoint and file system data spread equally over three locations, the company had its hands full, so it turned to one of Nuix's partners to play an integral part in forming its data lake.

Moving as much business intelligence to the cloud as possible was one of its key requirements. Although indexing the data was a necessary step in the project, the team responsible for the data lake wasn’t comfortable moving everything into the cloud. They worried it would take too long and cost too much.

The company only needed its business intelligence in the cloud, not all its data. With the help of a Nuix partner, the company left the data and Nuix servers ‘on the ground’ behind its firewall and only stored the Nuix Elasticsearch cluster and associated indexes in AWS, creating a ‘virtual data lake’ without the cost or time required to move all its data.

The bulk of the Nuix horsepower stayed with the data, while the valuable business intelligence, which allowed the company to find and isolate ‘risky data,’ among other things, lived in the cloud. In its initial analysis alone, the company identified 30% of its data as redundant, outdated, or trivial, allowing it to significantly trim its stored data right from the outset.
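To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 30% figure applies to the full 450 TB, which the case study does not state explicitly, and the variable names are purely illustrative.

```python
# Figures taken from the case study above.
TOTAL_TB = 450      # SharePoint and file system data
SITES = 3           # data spread equally over three locations
ROT_PERCENT = 30    # share flagged as redundant, outdated, or trivial

# Assumption: the 30% finding applies to the whole 450 TB corpus.
per_site_tb = TOTAL_TB / SITES
rot_tb = TOTAL_TB * ROT_PERCENT / 100
remaining_tb = TOTAL_TB - rot_tb

print(f"Data per site:         {per_site_tb:.0f} TB")
print(f"Candidate ROT data:    {rot_tb:.0f} TB")
print(f"Left after trimming:   {remaining_tb:.0f} TB")
```

Under that assumption, roughly 135 TB would be a candidate for cleanup, leaving about 315 TB to manage.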


A well-known insurance company with 30,000 employees and $25 billion in revenue had almost 3 billion emails—280 TB of data—trapped in a legacy email archive. This situation left it unable to search and produce data for iterative eDiscovery matters.

The company decided to process all the data by date/quarter into very large Nuix cases, the largest being 220 million items. It used the Nuix Universal Case Reporting tool to gather important statistical information about the data within the cases and across the entire corpus of information.
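The idea of splitting a large archive by quarter can be illustrated with a small Python sketch. This is not Nuix functionality or the customer's actual workflow; `quarter_key` and `items_per_quarter` are hypothetical helpers that only show how items might be bucketed by sent date before each bucket becomes its own case.

```python
from collections import Counter
from datetime import date


def quarter_key(sent: date) -> str:
    """Return a label such as '2014-Q3' for the quarter containing `sent`."""
    quarter = (sent.month - 1) // 3 + 1
    return f"{sent.year}-Q{quarter}"


def items_per_quarter(sent_dates):
    """Count how many items would land in each hypothetical quarterly case."""
    return Counter(quarter_key(d) for d in sent_dates)


# Illustrative sample dates, not customer data.
sample = [date(2014, 1, 15), date(2014, 3, 31), date(2014, 4, 1), date(2015, 12, 25)]

for label, count in sorted(items_per_quarter(sample).items()):
    print(label, count)
```

Counting items per bucket up front gives an early view of which quarters will produce the largest cases.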

Initially, the company left the source data (in this case, email) where it was. In phase two, a Nuix partner helped extract the email for custodians on legal hold to form the physical data lake repository. Much of this aging email, which the company needed again and again, was then stored in a new Enterprise Content Management (ECM) framework.

So, in this example, there were effectively three data lakes: one managed in Nuix cases, with the data left in the old archive; a second made up of the legal hold custodians' email, extracted back into email format ready for (re)production; and a third, select group stored in the ECM platform.


Our third story centers on a heavily litigated global medical device company that had amassed over 90 TB of material from external hard drives, file servers, and litigation hold collections. With the help of a Nuix partner and the Nuix Solutions Consultants, the customer’s data is now centralized and managed with a single Nuix Elasticsearch case.

The customer needed a single interface to find a ‘needle in a haystack’ at a moment’s notice. The data lake continues to fill up and now holds over 100 TB of data. The eDiscovery team recognized that they would likely have to produce this data repeatedly and couldn't leave it unmanaged. They are now considering adding more data types to the lake of almost a billion items.


This topic simply begs for beautiful nature shots of clouds and lakes ... Photo by Alexey Topolyanskiy on Unsplash


A financial institution hired a Nuix partner to ‘rethink’ its eDiscovery workflow, with bonus points on offer for finding sensitive regulatory data at the same time, if possible.

This data lake began with 50 TB of email data, extracted from a cloud-based archiving platform and placed in a special data center managed by the partner. The data needed to be encrypted while at rest on the file system, with the goal of achieving a high level of litigation readiness through a Nuix Elasticsearch-based data lake.

Data hygiene was at a premium for this data lake, with custodians added and then deleted once they came off legal hold. Nuix Data Finder helped find additional data that might have created regulatory exposure, allowing the customer to classify those files as records.

Here we see eDiscovery preparedness and information governance working together in a multi-use case scenario. The Nuix Elasticsearch architecture will continue to allow more data to be added to the lake, with another 50 TB due later this year, in what is admittedly a rare case of extraction from a cloud platform.


Our final customer is another prominent pharmaceutical company that was mired in iterative patent lawsuits. Initially, the company liked the idea of leaving data in its aging archive and making Nuix the data lake by indexing it in place. This would have been a viable choice, but the customer ended up going in a different direction.

After more than a year of careful deliberation, the company decided to focus on a very specific set of data: all archive emails for a set of 2,000 users commonly and repeatedly placed on legal hold. This data set represented over 1,000 overlapping cases.

The customer used Nuix to extract each person's email into a common email file format, organized into directories by user name, and left it ready for production on a secured file system. No index would manage the data; the company would index the extracted data only when it needed to perform keyword searches.

Additionally, once a given custodian was off any legal hold, Nuix could be used to delete their data, or a specific portion, from the data lake. The company would have a legal hold custodian data lake in the USA, Asia, and Europe as well. Despite being a long-time Nuix customer, the company decided that once the data was normalized and organized, it wouldn't need an index on this particular set.
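As a loose illustration of this per-custodian layout and release rule, the Python sketch below models one directory per user name and lists the directories whose custodians have no active holds. Everything here is an assumption made for illustration: the root path, the custodian and matter names, and the helper functions are hypothetical and do not represent Nuix functionality. The sketch only reports candidates and deletes nothing.

```python
from pathlib import Path


def custodian_dir(root: Path, user_name: str) -> Path:
    """Directory holding one custodian's extracted email (hypothetical layout)."""
    return root / user_name


def releasable_dirs(root: Path, active_holds: dict) -> list:
    """Return directories of custodians who are no longer on any legal hold."""
    return [
        custodian_dir(root, user)
        for user in sorted(active_holds)
        if not active_holds[user]  # an empty set means no remaining holds
    ]


# Illustrative custodians and matters, not customer data.
holds = {
    "asmith": {"matter-001", "matter-017"},
    "bjones": set(),
}

for path in releasable_dirs(Path("/secure/legal-hold"), holds):
    print("eligible for deletion review:", path)
```

Keeping the hold check separate from any deletion step leaves room for a human review before data is actually removed.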


Beyond these five very specific and different case studies, there are many more combinations of use cases and architectures to meet organizations’ specific needs. We have several customers exploring alternatives to each of the scenarios reviewed above.

We invite you to come to us with your data lake problem or vision. We are here to help bring it to life!