Branded Content
Written by: Steve Leeper | VP of Product Marketing, Datadobi
Updated 3:58 PM UTC, Fri June 27, 2025
Unstructured data – text, videos, images, social media posts, and more – makes up the bulk of the world’s data, with an estimated 80-90% of all data falling into this category. As such, it represents a gold mine of insights waiting to give savvy businesses a leg up on the competition.
As unstructured data proliferates, a core issue is emerging around the need to manage it. Take a typical environment with 3 pebibytes (PiB) of data stored today. At an annual accumulation rate of 30% (and estimates of 40-50% annual growth are easy to find), compounding means that within 10 years, that 3 PiB will have grown to roughly 41 PiB. Growth in unstructured data is exponential, not linear.
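As a quick check on that compounding arithmetic, here is a minimal Python sketch; the 3 PiB starting point and 30% growth rate are the article’s illustrative figures, not measurements from a real environment:

```python
# Compound growth of an unstructured data estate (illustrative figures only).
start_pib = 3.0       # starting capacity in PiB
annual_growth = 0.30  # 30% accumulation per year

capacity = start_pib
for year in range(1, 11):
    capacity *= 1 + annual_growth
    print(f"Year {year:2d}: {capacity:6.1f} PiB")

# Year 10 prints roughly 41.4 PiB -- more than a 13x increase in a decade.
```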
The ripple effects of this growth range from escalating storage costs to mounting management complexity and expanding risk exposure.
The first step in dealing with unstructured data is to gain insight into the data landscape: how many files there are, what types they are, who owns them, how large they are (both individually and collectively), and whether they were recently created or are simply aging relics. Producing this picture of the environment through analytics that process the technical metadata is the foundation of managing the data.
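As a rough illustration of what such a technical-metadata inventory involves, the sketch below walks a single directory tree and tallies file counts, sizes, owners, and ages. A production scanner would, of course, work across many NAS shares and object stores rather than one local, Unix-style mount; the path and age threshold here are hypothetical.

```python
import os
import pwd
import time
from collections import Counter

root = "/mnt/nas_share"   # hypothetical mount point of one NAS file system
now = time.time()
by_ext, by_owner = Counter(), Counter()
total_bytes = old_files = 0

for dirpath, _dirs, files in os.walk(root):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                                   # skip unreadable entries
        ext = os.path.splitext(name)[1].lower() or "<none>"
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name    # resolve owner name (Unix only)
        except KeyError:
            owner = str(st.st_uid)                     # uid no longer maps to an account
        by_ext[ext] += 1
        by_owner[owner] += st.st_size
        total_bytes += st.st_size
        if now - st.st_mtime > 5 * 365 * 86400:        # untouched for roughly 5 years
            old_files += 1

print(f"{sum(by_ext.values())} files, {total_bytes / 2**40:.2f} TiB total")
print("Top extensions:", by_ext.most_common(5))
print("Top owners by capacity:", by_owner.most_common(5))
print(f"{old_files} files not modified in the last 5 years")
```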
These insights make it possible to build a plan for challenges such as selecting and relocating data to next-generation applications like GenAI, or identifying aging data with no remaining business relevance for final disposition to an archive or outright deletion.
Let’s take an example using GenAI as a backdrop. Many solutions that support the processing pipeline required for model training, fine-tuning, or augmentation via RAG (retrieval-augmented generation) start at the data lake. What is often overlooked is how the data made its way into the data lake in the first place.
A solution that processes PDF files – extracting text, tables, and images to augment an existing multi-modal LLM – first needs those PDF files in the raw zone of the data lake. When the documents are spread across several NAS systems with hundreds or thousands of file systems, and object stores with hundreds or thousands of buckets, finding the candidate data is easier said than done.
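For context, the text-extraction step itself is relatively simple once the files have landed in the data lake. A minimal sketch using the open-source pypdf library might look like the following; table and image extraction typically require additional tooling, and the raw-zone and staging paths shown are hypothetical:

```python
from pathlib import Path
from pypdf import PdfReader

raw_zone = Path("/datalake/raw/pdfs")      # hypothetical raw-zone location
out_zone = Path("/datalake/staged/text")   # hypothetical staging area for extracted text
out_zone.mkdir(parents=True, exist_ok=True)

for pdf_path in raw_zone.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    # Concatenate the plain text of every page; pages with no extractable text yield "".
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    (out_zone / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")
    print(f"Extracted {len(reader.pages)} pages from {pdf_path.name}")
```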
The insights provide the answer here: it becomes a simple matter to issue a query for all PDF documents that meet certain criteria, such as location in specific file systems and/or S3 buckets, an age range for creation or last modification, a particular naming pattern, and so on. With the relevant PDF files and objects quickly identified across the environment, copying them to the data lake for further processing can proceed, and the pump is primed for the GenAI framework.
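As a simplified illustration of that kind of query (not any particular product’s API), assume the scan results have been loaded into a list of per-file metadata records; selecting the candidate PDFs then reduces to a filter. The paths, prefixes, and cutoff below are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata records produced by an earlier scan of file systems and S3 buckets.
catalog = [
    {"path": "s3://contracts-bucket/2023/msa_acme.pdf", "size": 1_482_113,
     "modified": datetime(2023, 4, 2, tzinfo=timezone.utc)},
    {"path": "nas01:/projects/reports/q3_summary.pdf", "size": 902_334,
     "modified": datetime(2021, 10, 9, tzinfo=timezone.utc)},
    {"path": "nas02:/home/jsmith/notes.txt", "size": 4_101,
     "modified": datetime(2024, 1, 15, tzinfo=timezone.utc)},
]

cutoff = datetime.now(timezone.utc) - timedelta(days=3 * 365)       # modified in the last ~3 years
allowed_prefixes = ("s3://contracts-bucket/", "nas01:/projects/")   # target buckets and file systems

candidates = [
    rec for rec in catalog
    if rec["path"].lower().endswith(".pdf")       # file type criterion
    and rec["path"].startswith(allowed_prefixes)  # location criterion
    and rec["modified"] >= cutoff                 # age criterion
]

for rec in candidates:
    print(f"copy to data lake: {rec['path']} ({rec['size']} bytes)")
```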
Another example that requires insights is risk management, which overlaps with platform optimization. Generally speaking, as the years pass, data becomes less and less relevant to the business. Not only is the data collecting “digital dust”, but more and more of it no longer has a valid business owner (something we call orphaned data). Its contents are unknown, and it consumes valuable disk space, usually on a primary storage platform.
In the event of a lawsuit, all of this data will be exposed to eDiscovery. In the event of a breach, there is a larger amount of data to be encrypted by ransomware or simply exfiltrated for posting on the dark web. In terms of optimization, expensive storage real estate is not being used efficiently. Again, the insights derived by scanning and analyzing the technical metadata associated with this data make it possible to identify these candidate files and objects, and then determine their disposition plan.
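To make the orphaned- and aging-data idea concrete, here is a small sketch that flags records whose owner no longer appears in a list of active accounts, or whose last access is older than a retention threshold. The catalog, account list, and threshold are hypothetical stand-ins for what a real metadata scan and directory lookup would supply:

```python
from datetime import datetime, timedelta, timezone

active_accounts = {"mjones", "apatel", "lchen"}   # hypothetical current employees
stale_after = timedelta(days=7 * 365)             # untouched for roughly 7 years
now = datetime.now(timezone.utc)

# Hypothetical per-file metadata records from a scan.
catalog = [
    {"path": "nas01:/finance/2012_budgets.xlsx", "owner": "rgarcia",
     "accessed": datetime(2013, 6, 1, tzinfo=timezone.utc)},
    {"path": "nas02:/eng/specs/widget_v2.docx", "owner": "lchen",
     "accessed": datetime(2024, 11, 20, tzinfo=timezone.utc)},
]

for rec in catalog:
    orphaned = rec["owner"] not in active_accounts      # no valid business owner
    stale = now - rec["accessed"] > stale_after          # aging, unused data
    if orphaned or stale:
        reasons = [r for r, hit in (("orphaned", orphaned), ("stale", stale)) if hit]
        print(f"{rec['path']}: candidate for archive or deletion ({', '.join(reasons)})")
```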
A data management solution should therefore exhibit a few key characteristics: the ability to generate insights at scale from technical metadata, to operate across storage vendors, protocols, and clouds, and to manage data through its entire life cycle.
Harnessing the value of unstructured data has become an essential priority for organizations as data volumes continue to grow. Ensuring this data is in the right place at the right time, managing its life cycle effectively, and keeping it from becoming a growing source of risk are critical steps in this journey.
Shifting from simply storing all data in perpetuity to actively managing it lays the foundation for improving operational efficiency, reducing risk, and deriving value from the data, since the right data can be relocated to the right place at the right time for next-generation applications.
Large-scale data integration is central to the data management effort because the data is typically distributed across a vast collection of storage systems provided by different vendors and accessed over different storage protocols. Complementing this, rigorous quality management processes guarantee the data’s accuracy, consistency, and relevance — key pillars of robust data governance. Such practices are essential for achieving regulatory compliance, mitigating risks, and driving operational efficiency.
When these strategies are applied across all facets of a business — be it finance, HR, IT, or other departments — organizations can leverage vendor-agnostic technologies that operate seamlessly across various storage systems, clouds, and applications. This adaptability fosters flexibility and efficiency, even in the most complex and diverse environments. By embracing these approaches, businesses can gain valuable insights and improve decision-making based on reliable and well-managed data.
With the right systems in place, unstructured data can transition from being a challenge to becoming an asset, driving informed actions, and supporting sustainable growth.
To find out more, please visit the Datadobi website.
About the Author:
Steve Leeper oversees market development for Datadobi and manages the global presales sales engineering team. A 30-year veteran of IT, Leeper has held a variety of technical and sales roles at Andersen Consulting, Sun Microsystems, and EMC.