Opinion & Analysis

Leaving No One Behind — How to Democratize the Future of Open Data With LLMs

Written by: dsocietydev

Updated 3:19 PM UTC, Wed January 31, 2024

Large Language Models (LLMs) are an important technology that is here to stay, but they are not plug-and-play solutions for enterprises. The initial surge of excitement and apprehension surrounding ChatGPT is waning. After reaching 173 million users in the first four months of 2023, ChatGPT grew to only 180 million users over the next four months to August 2023, an increase of about 4%.

Despite this slowdown, LLMs remain a powerful tool that needs to be embraced and adopted securely for the greater good. In addition, enterprises are continuing to adopt generative AI at a gradual pace and have largely acknowledged its significant role in the future of their businesses. LLMs, if done right, hold the biggest opportunity for us to democratize the future of open data.

LLMs are deep-learning models trained on large volumes of text. When presented with a text prompt, the model outputs an appropriate, human-like response. ChatGPT, a form of generative AI, is just one manifestation of the broader concept of LLMs.

AI depends upon computing power, algorithms, and heaps of data. All three are crucial, but data is the most important. It is the rocket fuel for AI, but more importantly, it is the gift that keeps on giving. Strong AI products create network effects on steroids, which in turn generate more data. This cycle repeats itself and is amplified by Internet of Things (IoT) devices, scaling the amount of data created towards a staggering 175 zettabytes by 2025, up from 44 zettabytes at the dawn of 2020.

This is data about the physical world, the economy, and how we live our daily lives.

In his book AI Superpowers, Taiwanese businessman and computer scientist Kai-Fu Lee argues that this phenomenon will cause AI to gravitate naturally towards monopolies for the early leaders in the race. I assert that the observed evolution in the adoption of LLMs by enterprises may help avoid Lee’s predicted polarized outcome.

Consider the large amounts of data that these models are trained on and their main advantage, the ability to be adapted to a context: with a secure approach, smaller players can benefit significantly from this evolution. This becomes clear when you understand these models as layers.

Essentially, the innermost layer consists of the foundation models that LLM applications are built on. They include the likes of GPT, Sparrow, LaMDA, MT-NLG, LLaMA, and PaLM, amongst others. These models are trained on general-purpose data and later adapted for specific applications such as ChatGPT, Bard, and Midjourney, which represent the second layer.

Predictions are that the first two layers will consist of a few big players that can invest significantly to train these models on mountains of data. The two emerging layers, the third and the fourth (outermost) respectively, are industry-specific LLMs and enterprise LLMs. Here, enterprises are starting to leverage the first two layers to train their own models at a much lower cost.

This will see an ocean of highly customized models springing up in a much more governed way within enterprises, presenting the single biggest opportunity for all players to benefit from the open data revolution.

How can smaller players looking to take full advantage of these developments capture maximum benefit? I assert that knowledge graphs will be the greatest enabler of a responsible open data ambition.

The role of knowledge graphs – Building transparent, explainable, and contestable models

A knowledge graph is an information-rich structure that provides a view of entities and how they interrelate. Expressing these relationships as a graph can uncover facts that were previously obscured and lead to valuable insights. You can even generate embeddings from this graph (encompassing both its data and its structure) that can be used in machine learning pipelines or as an integration point to LLMs.
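To make the idea concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples, with a small search that surfaces an indirect relationship. The entities, relation names, and the helper functions `build_graph` and `find_paths` are all hypothetical and chosen only for illustration; a production system would more likely use a dedicated graph database.

```python
from collections import defaultdict

# Hypothetical facts, written as (subject, relation, object) triples.
TRIPLES = [
    ("Clinic A", "located_in", "Johannesburg"),
    ("Dr. Dlamini", "works_at", "Clinic A"),
    ("Dr. Dlamini", "specializes_in", "Cardiology"),
    ("Plan Basic", "covers", "Cardiology"),
]


def build_graph(triples):
    """Index triples by subject so each entity's outgoing edges are easy to look up."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def find_paths(graph, start, goal, max_hops=3):
    """Return every chain of edges leading from start to goal, up to max_hops long."""
    paths = []
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue
        # Entities already on this path, used to avoid walking in circles.
        visited = {start} | {step[2] for step in path}
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                stack.append((neighbor, path + [(node, relation, neighbor)]))
    return paths


graph = build_graph(TRIPLES)
# No single triple says where Dr. Dlamini practices; the graph links two facts.
paths = find_paths(graph, "Dr. Dlamini", "Johannesburg")
for path in paths:
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in path))
```

Because the answer is returned as an explicit chain of facts, anyone can inspect it and contest any single link, which is the property the rest of this section relies on.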

This helps solve major challenges with LLMs. By design, LLMs are “black box” deep-learning models and, as such, lack explainability and transparency. Knowledge graphs make the models more transparent, explicit, and deterministic, a huge plus for areas of application that demand this.

Equally, by training LLMs on existing knowledge graphs, solutions such as chatbots can respond to product and service questions with a much lower risk of hallucination. This allows for the adoption of LLMs with greater context. Knowledge graphs also help manage the risk of bias that may arise from the data that the foundation models are trained on. This protects adopters of these technologies from perpetuating or amplifying these biases in their own environments.
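One way to picture this grounding is the sketch below. Note that it swaps in a simpler technique than training: it looks up facts in the graph at question time and places them in the prompt. The product facts and the helper names `facts_about` and `build_grounded_prompt` are hypothetical, and the call to an actual LLM is deliberately left out.

```python
# Hypothetical product facts as (subject, relation, object) triples.
FACTS = [
    ("Plan Basic", "monthly_premium", "R950"),
    ("Plan Basic", "covers", "GP visits"),
    ("Plan Plus", "covers", "Dental"),
]


def facts_about(entity, facts=FACTS):
    """Return the triples whose subject is the given entity."""
    return [fact for fact in facts if fact[0] == entity]


def build_grounded_prompt(question, entity, facts=FACTS):
    """Build a prompt that restricts the model to facts held in the graph.

    Returns None when the graph holds nothing about the entity, so the
    caller can decline to answer instead of letting the model guess.
    """
    known = facts_about(entity, facts)
    if not known:
        return None
    fact_lines = "\n".join(
        f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in known
    )
    return (
        "Answer using only the facts listed below. "
        "If they do not contain the answer, say you do not know.\n"
        f"Facts:\n{fact_lines}\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt("What does Plan Basic cover?", "Plan Basic")
print(prompt)
```

The design choice here is that every statement the chatbot can draw on is traceable to a specific triple, and an unknown product produces no prompt at all rather than a plausible-sounding guess.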

This topological method, which helps us grasp and cluster complex relationships between data points, holds promise for the efficient adoption of LLMs at scale. However, key considerations that will support this adoption remain.

If left unaddressed, these may slow down adoption and, at worst, stifle the innovation potential that these models carry for organizational improvements in productivity, efficiency, customer experience, and revenue growth.

I highlight three areas that data and AI leaders must pay close attention to:

Intellectual property risks — Role of dynamic and adaptive regulatory frameworks

With public-facing generative AI systems, enterprises may be exposed to intellectual property infringement risks. By design, public generative AI systems are trained on internet data. Consequently, AI-generated content similar to existing works that are protected by intellectual property (IP) laws could lead to legal action and potential financial liabilities.

The ongoing debate around the ownership and authorship of content generated by public generative AI systems could lead to legal disputes and challenges. Broader regulatory support and collaboration across geographies are necessary to enable smaller players.

Data management, architecture, and privacy remain at the center of the implementation

Sound data management practices remain key to ensuring that enterprises can adopt these models securely and sustainably. To achieve this, sound data and technology architecture practices must be in place, supported by high data quality and content management capabilities.

In addition to the availability and quality of data, unauthorized use of sensitive data for training or operating the model can also expose the enterprise to security and privacy risks. Vulnerabilities in the model architecture could be exploited by hackers to gain access to sensitive data and cause harm to individuals and enterprises.

Moreover, a lack of transparency and accountability in data handling and use could lead to potential regulatory noncompliance.

Supercharge your computing performance

Although this evolution enables more efficient model training, data scale and complexity will remain a challenge for most organizations. This will necessitate continued focus on computing capabilities to support the implementation of AI systems and LLMs.

Two technologies are worth monitoring and considering in the near future to support this anticipated demand – edge computing and quantum computing. They are characterized by their unique strengths in dealing with this increased scale and complexity.

Edge computing brings computing power closer to the data source, enabling real-time processing and analysis, while quantum computing uses the principles of quantum mechanics to perform certain computations that are impractical for classical computers.

If you consider data and AI as a powerful combination for future global order and value creation, perhaps democratizing that future requires all of us to embrace LLMs as a key enabler of AI for good.

About the Author:

Vukosi Sambo is the Executive Head of Data, Insights & AI at AfroCentric Group. He also serves on several data and technology advisory and editorial boards. He is a global multi-award-winning data and healthcare executive and keynote speaker at data and technology conferences.