Poling Our Way Down Content Rivers and Data Lakes

People like putting things in buckets. We’ve been categorizing stuff since the time of Aristotle; much farther back than that, actually. It’s a natural desire to create order out of chaos and understand the world around us.

Information governance is, in large part, an extension of this basic human behavior. It’s about putting the right information into the right bucket so that the right people can make the right decisions. While that’s natural, and right (adding another ‘right’ to the mix), it’s also ironic, because information governance still lacks a universally accepted, clear market definition. That said, here is a recent definition that does the job well:

Information Governance is the overarching and coordinating strategy and tactics for all organizational information. It establishes the authorities, supports, processes, capabilities, structures, and infrastructure to enable information to be a useful asset and reduced liability to an organization, based on that organization’s specific business requirements and risk tolerance.

Dual Paradigms—Data Lakes and Rivers

Barclay Blair, founder of the Information Governance Initiative, helped define the conversation, writing:

We need a new analogy for information and the one I like is comparing information to a river. If you think about data as a river then you can see ideas start up high in the mountains of our brains, at the top of organizations or throughout the organization. Information starts as rivulets and as it gains steam it becomes streams, creeks and rivers and the information flows throughout the organization. This river is always flowing and we can’t stop it. Our job isn’t to stop it. Our job is to capture it and to harness that information for the betterment of our organization. In fact that’s the highest value we can provide as information governance professionals.

The description of data in motion as a river is one of two paradigms used for information governance tools, methods, and processes. The other focuses on data at rest, which resembles a content, or data, lake. It is deep, holds a lot of history, and is the ‘useful asset’ part of information governance. Data rivers, by comparison, are smaller and more current, and they are the ‘liability’ reduction part of information governance.

Avalanche Lake
Data at rest in a data lake is peaceful, calm, and should be treated differently than data in motion. Photo by: T. Boni

With data lakes, you need to be concerned with gaining insights into the massive amounts of dark data created over many years. Industry analysts tend to categorize information governance tools that help with data at rest as ‘file analysis’ tools. These tools help create business knowledge and productivity, manage mergers and divestitures, and enable records management, intelligent migration, and defensible disposition. To be clear, Nuix does not actually host the content in a data lake; we merely index and inventory it to enable these use cases.

File analysis tools maintain metadata that identifies where a file is stored, how old it is, who created it, what it says, and whether it’s complete. Deciding how to appropriately manage this information over the long run is where the strategic information governance decisions are made, decisions that go beyond just analyzing files.
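To make the idea of a file inventory concrete, here is a minimal Python sketch. It is an illustration only, not a description of how Nuix or any other file analysis product works. The names `FileRecord` and `inventory` are hypothetical, and the record covers only location, size, age, and a content fingerprint; authorship and content analysis are left out.

```python
import hashlib
import os
import time
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical inventory entry for one file."""
    path: str          # where the file is stored
    size_bytes: int    # how large it is
    age_days: float    # days since it was last modified
    sha256: str        # content fingerprint, useful for spotting duplicates


def inventory(root, now=None):
    """Walk `root` and return one FileRecord per regular file."""
    now = time.time() if now is None else now
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken links and other non-regular entries
            info = os.stat(path)
            # Reads the whole file into memory; fine for a sketch,
            # but large files would call for chunked hashing.
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            age_days = (now - info.st_mtime) / 86400
            records.append(FileRecord(path, info.st_size, age_days, digest))
    return records
```

Calling `inventory("/some/share")` on a directory of your choosing returns a list that can then be sorted by age or grouped by `sha256` to surface stale or duplicate files, which is the kind of input a disposition decision would start from.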

Data in motion tools deal with massive volumes of new data that reflect (relatively) current activities. These tools focus on risk and are much more targeted in their implementation. Analysts use terms like Data Centric Audit and Protection (DCAP) or Integrated Risk Management (IRM). To the extent that data or content is the container for the risk, the tools used to support data in motion analysis are often the same as those used for data at rest, with some tweaks to the use case. Not all data, not all time periods, and not all users need to be examined.
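The targeting described above can be pictured as a simple filter. The following Python sketch is an assumption-laden illustration, not the behavior of any DCAP or IRM product: `ActivityEvent` and `in_scope` are hypothetical names, and a real tool would draw events from audit logs rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ActivityEvent:
    """Hypothetical record of one user touching one file."""
    user: str
    path: str
    timestamp: datetime


def in_scope(events, users, since):
    """Keep only events by the named users that occurred at or after `since`."""
    return [e for e in events if e.user in users and e.timestamp >= since]
```

For example, passing `users={"alice"}` and a `since` value seven days in the past narrows a large event stream to one person's activity over the last week.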

As the time window covered by risk-focused data audits becomes shorter (days, rather than months or years), it is easy to see how this will eventually blend into real-time content and activity monitoring; in other words, cybersecurity. I see the market definitions for these tools and related practices coalescing into a more homogeneous set of definitions and cross-compatible skills. This convergence is, in fact, a restatement of the information governance strategy: that information be captured, managed, organized, protected, and leveraged effectively for the organization.

Posted on October 3, 2018 by Brian Tuemmler