Mastering Global and Custodial Deduplication in Nuix
“Data volumes are growing exponentially.”
Have you heard this before? I thought so. It’s a phrase you can’t escape if you work with data in any capacity. It’s so ubiquitous, in fact, that I won’t bother citing any of the various studies or projections on the topic. Whatever work you do, you need to make sense of much bigger stores of data than you did, say, three years ago. Or even three months ago.
Global and Custodial
Deduplication, the process of reducing data volumes by eliminating duplicate or redundant information, is a crucial way to manage and find value in sets of information. It’s also the subject of much conjecture and lore that I’d like to look at from a practical perspective.
In the olden days we referred to horizontal and vertical deduplication. Today, the terms have shifted to global and custodial (or custodian), but they mean the same thing, respectively:
- Global (horizontal): As each file is brought into the project, it is compared to the whole data set. Only the first instance of each unique document or file makes it to review and categorization.
- Custodial (vertical): Each file is compared to a limited set of documents from the same custodian, time period, or other segment. Only the first instance of each unique document per custodian or segment goes to review and categorization. However, there may be duplicates of the same document elsewhere within the project.
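To make the difference concrete, here is a minimal Python sketch of the two scopes. It is not Nuix code and does not use the Nuix API. It assumes duplicates are detected by hashing file content (SHA-256 here, chosen only for illustration), and the item dictionaries and the `deduplicate` function are hypothetical names invented for this example.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content fingerprint used to spot duplicates (assumption: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(items, scope="global"):
    """Return the first instance of each unique item.

    scope="global":    compare against every item seen so far.
    scope="custodial": compare only against items from the same custodian.
    """
    seen = set()
    originals = []
    for item in items:
        h = digest(item["content"])
        # The only difference between the two methods is the comparison key.
        key = h if scope == "global" else (item["custodian"], h)
        if key not in seen:
            seen.add(key)
            originals.append(item)
    return originals


# Hypothetical sample data: the same report held by two custodians.
items = [
    {"custodian": "alice", "name": "report.pdf", "content": b"Q3 report"},
    {"custodian": "alice", "name": "report-copy.pdf", "content": b"Q3 report"},
    {"custodian": "bob", "name": "report.pdf", "content": b"Q3 report"},
    {"custodian": "bob", "name": "notes.txt", "content": b"meeting notes"},
]

global_originals = deduplicate(items, scope="global")
custodial_originals = deduplicate(items, scope="custodial")
```

With this sample data, the global pass keeps one copy of the report for the whole project, while the custodial pass keeps one copy per custodian, so Bob's copy survives alongside Alice's.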
Each method has its strengths and weaknesses, but what’s important is how easy both methods are to accomplish in Nuix.
On the Fly Deduplication … With Options
Considering that deduplication is supposed to make your life and work easier, kicking off the process shouldn’t be too complicated. Nuix allows you to apply either method of deduplication by way of a simple drop-down menu that deduplicates whatever data you have available in your results window.
This method is especially convenient if you perform keyword searches or use any of our other filters, such as named entities or file type, to get a large set of data in the results window.
Nuix also provides a more permanent way of showing deduplication by creating an item set. The item set can be formed out of any or all data. It provides you with radio buttons to see either the originals (deduplicated) or the duplicates (those that were deduplicated out).
Another option for the item set allows you to deduplicate based on families. This will maintain the parent-child relationship and prevent orphaned children or parents with references to attachments that aren’t there because they’ve been deduplicated out.
For example, if I have a PDF on my desktop and I attach it to an email and send it, using the by families option will prevent these two PDFs—the child (attachment) of the email and the original, standalone document—from being considered duplicates.
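The PDF example can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nuix computes family duplicates internally, which this post does not describe. The assumption here is that a family is fingerprinted by combining the digest of the top-level item with the digests of its children. The `family_key` and `deduplicate_families` names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def family_key(family) -> str:
    """Fingerprint a whole family: the top-level item plus its children.

    Assumption for this sketch: two families are duplicates only when the
    parent and every child match, in the same order.
    """
    parts = [digest(family["content"])]
    parts += [digest(child) for child in family["children"]]
    return digest("|".join(parts).encode())


def deduplicate_families(families):
    """Keep the first instance of each unique family, intact."""
    seen, originals = set(), []
    for family in families:
        key = family_key(family)
        if key not in seen:
            seen.add(key)
            originals.append(family)
    return originals


pdf = b"%PDF sample"
families = [
    # Standalone PDF on the desktop: a family of one.
    {"name": "desktop/report.pdf", "content": pdf, "children": []},
    # The email that carries the same PDF as an attachment.
    {"name": "sent/email.msg", "content": b"email body", "children": [pdf]},
    # A second copy of that same email, attachment included.
    {"name": "inbox/email.msg", "content": b"email body", "children": [pdf]},
]

kept = [f["name"] for f in deduplicate_families(families)]
```

In this sketch the standalone PDF and the email both survive, because the attachment is never compared on its own. Only the second copy of the email is removed, and it goes out as a complete family.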
The item set also gives you the ability to rank custodians. In this scenario, Nuix retains originals belonging to the highest-ranking custodian in your eDiscovery matter (or investigation) but removes duplicates belonging to lower-ranked custodians. I’m not particularly fond of this option, since I prefer to think that all individuals are equal when it comes to a document like an email, but it’s an option nonetheless.
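Custodian ranking is easy to picture as a tie-breaker. The sketch below is an assumption-laden illustration rather than Nuix behavior: `pick_originals` and the `ranking` list are hypothetical, and custodians missing from the ranking are simply treated as lowest priority.

```python
import hashlib


def pick_originals(items, ranking):
    """For each unique document, keep the copy held by the best-ranked custodian.

    ranking: custodian names, highest priority first.
    """
    rank = {custodian: position for position, custodian in enumerate(ranking)}
    lowest = len(ranking)  # unranked custodians sort after everyone else
    best = {}
    for item in items:
        h = hashlib.sha256(item["content"]).hexdigest()
        current = best.get(h)
        if current is None or (
            rank.get(item["custodian"], lowest)
            < rank.get(current["custodian"], lowest)
        ):
            best[h] = item
    return list(best.values())


# Hypothetical sample data: Bob's copy is encountered first.
items = [
    {"custodian": "bob", "name": "memo.docx", "content": b"memo"},
    {"custodian": "alice", "name": "memo.docx", "content": b"memo"},
]

ranked_originals = pick_originals(items, ranking=["alice", "bob"])
```

Even though Bob's copy is loaded first, Alice outranks him, so her copy is the one kept as the original.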
Nuix can also populate metadata fields that show the duplicate custodians and the duplicate custodian paths for an item. This way, when you look at an email, you can see all the people who received it and where each duplicate was located.
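Conceptually, those fields are just a grouping of every copy by its content. Here is a rough Python sketch under that assumption. The field names `duplicate_custodians` and `duplicate_paths` are made up for the example and are not the actual Nuix metadata property names.

```python
import hashlib
from collections import defaultdict


def duplicate_metadata(items):
    """Group copies by content digest and summarize who holds them and where."""
    groups = defaultdict(list)
    for item in items:
        groups[hashlib.sha256(item["content"]).hexdigest()].append(item)

    summary = {}
    for h, copies in groups.items():
        summary[h] = {
            # Hypothetical field names, for illustration only.
            "duplicate_custodians": sorted({c["custodian"] for c in copies}),
            "duplicate_paths": [f'{c["custodian"]}/{c["path"]}' for c in copies],
        }
    return summary


# Hypothetical sample data: one email held by two custodians.
items = [
    {"custodian": "alice", "path": "Inbox/offer.msg", "content": b"offer"},
    {"custodian": "bob", "path": "Archive/offer.msg", "content": b"offer"},
]

metadata = duplicate_metadata(items)
offer = metadata[hashlib.sha256(b"offer").hexdigest()]
```

Here `offer` lists both custodians and both locations for the same email, which is the kind of view the metadata fields give you on a single item.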
Whatever way you choose to do your deduplication, and there are situations where each method proves to be the most useful, Nuix lets you do it with a simple click at that legendary Nuix speed!