Enterprises collect massive amounts of data today, yet about 75% of enterprise data is underleveraged. From an analytics perspective, that represents vast untapped potential: hidden within that data may be opportunities to improve revenue and streamline processes, as well as undetected risk and compliance issues. The value of this unused data will surface only when business and IT work hand in hand to uncover its worth. Business owners understand the data and its hidden potential; the technical data team, in turn, can massage the data into shape and maximize its quality.
Enter Data Democratization
At the core of data democratization for many companies is a self-service analytics model where business users gather and analyze data independently with little IT intervention. IT teams often set up sandboxes and data catalogs so that users can pick and choose their data and perform analysis without impacting mission-critical analytics. However, implementing this form of democratization takes a lot more than just making data accessible. Challenges organizations face with data democratization include:
Data Copies and Data Sprawl: Most enterprises spread data across multiple data lakes, data warehouses, and business intelligence systems. This sprawl creates silos and duplication, making it difficult for enterprises and business users to establish a single source of trusted data. Moreover, duplicate copies of data are extremely difficult to monitor and control, wasting time, storage, and computational resources.
Meeting Concurrency Demands: One of the biggest challenges in traditional databases is that users and workloads compete for the same network and CPU resources. When multiple users run queries in parallel (at times on the same data), it can significantly impact performance and user experience.
Wide Range of Analytics Needs: There's never really a one-size-fits-all approach to data analytics. No two users are alike; they use different tools and have different levels of technology maturity. For example, some users like Python, while others prefer SQL. The lack of flexibility to support a diverse set of analytics use cases and the absence of ecosystem connectivity can limit users' ability to be productive and extract value from data.
Regulations and Compliance: Concerns over personal data usage are driving a growing number of regulations that mandate how enterprises may acquire, store, and analyze data. Without stringent governance over how data is stored, used, and processed, organizations face legal, financial, and ethical issues, along with significant penalties for noncompliance.
Delivering True Data Democratization Requires the Right Analytics Foundation
Data democratization needs a strong analytics foundation that enables a wide range of users and business use cases. Below are some technical and governance-related considerations that organizations should keep in mind when evaluating a potential solution:
Open and Flexible Analytical Engine: Organizations must look for a flexible platform that streamlines workflows from data ingestion to exploration to production and delivery. The product must be flexible enough to cope with various data sources (on-premises, cloud, multicloud, or hybrid environments) and be able to access diverse data formats (text, CSV, Parquet, etc.). In addition, it should be able to analyze data in place so users don't have to worry about data duplication and always have a single “source of truth.”
Works With a Wide Variety of Tools: The analytical engine should empower users to make their own choices for reporting and advanced analytics. It should support visualization tools (MicroStrategy, Looker, Power BI, Qlik, Tableau, etc.) and offer built-in functions for a wide range of analytics (such as geospatial, time series, event series, real-time, machine learning, text analytics, pattern matching, and regression). Users should be able to build their models in preferred languages (such as R, Java, C/C++, Python, SQL, etc.). The solution must also provide native connectors and APIs so that users can continue using the tools with which they are comfortable.
Test Scaling in Extreme Workload Conditions: Always stress-test the solution against real-world scenarios. Ideally, the solution should handle massive data volumes, support thousands of concurrent users and queries, and easily ramp compute resources up or down based on real-time workload demand.
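A scaled-down sketch of such a stress test is shown below, using only Python's standard library with SQLite purely as a stand-in for the engine under evaluation. The client and data counts are illustrative; a real test would use production-sized data, realistic query mixes, and thousands of concurrent clients.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

DB = "stress_demo.db"

# Seed a small table to query under load (stand-in for a warehouse table).
conn = sqlite3.connect(DB)
conn.execute("DROP TABLE IF EXISTS events")
conn.execute("CREATE TABLE events (id INTEGER, value REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10_000)])
conn.commit()
conn.close()

def run_query(_):
    # Each simulated user opens its own connection, as real clients would.
    with sqlite3.connect(DB) as c:
        (total,) = c.execute("SELECT SUM(value) FROM events").fetchone()
    return total

# Fire 200 queries through 32 concurrent workers and time the whole batch.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(run_query, range(200)))
elapsed = time.perf_counter() - start

print(f"{len(results)} queries completed in {elapsed:.2f}s; "
      f"all consistent: {len(set(results)) == 1}")
```

Beyond raw timing, a useful check is that every concurrent reader sees consistent results, which is what the final consistency flag verifies here.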
Choose Solutions That Can Be Tuned: The analytics solution should allow performance tuning based on factors such as the number of users, type of workloads, and expected response time, without requiring additional hardware. It should be elastic: when demand is high, administrators can easily add nodes to increase concurrency and throughput; when demand is low, they can scale back on the fly, saving the business money and compute power.
Read the Fine Print: Some analytics providers impose high costs and limits on scaling and elasticity, making data growth untenable. Some cap the number of subclusters, and others limit concurrency by restricting simultaneous queries. Some offer only one node size, overlooking an important aspect of cloud economics: the mix of cheap and expensive nodes and spot pricing. Finally, some call themselves "cloud-native" but mean cloud-only, forcing businesses to forgo on-premises workloads. Don't forget to also evaluate the exit costs of moving from a locked-in cloud to a new solution; there is always a possibility of having to change analytics solutions as user requirements evolve.
To summarize, setting up the proper analytics foundation is a key first step toward data democratization. Enterprises that empower their users to access and analyze data themselves will improve data quality and usability, enabling confident, data-infused decision-making with downstream impact on the overall business, including productivity, growth, and profitability.
About the Author
As a director at Vertica, Steve Sarsfield has held thought leadership roles at Cambridge Semantics, Talend, Trillium Software and IBM. Steve's writings offer insight and opinion on data governance and analytics. In addition to his popular data governance blog, he has written articles for Medium.com and published a book, "The Data Governance Imperative."