
Twelve Labs Breaks New Ground With First-of-its-kind Video-to-text Generative APIs

Written by: CDO Magazine Bureau

Updated 11:39 AM UTC, Wed October 25, 2023

Jae Lee, co-founder and CEO of Twelve Labs

(US and Canada) Twelve Labs, the video understanding company, announced the debut of game-changing technology along with the release of its public beta. Twelve Labs is the first in its industry to commercially release video-to-text generative APIs powered by its latest video-language foundation model, Pegasus-1. The model enables novel capabilities such as Summaries, Chapters, Video Titles, and Captioning from videos, even those without audio or text, extending the boundaries of what is possible.

This groundbreaking release comes at a time when language models have been trained primarily to predict the most probable next word. This task alone enabled new possibilities to emerge, ranging from planning a set of actions to solve a complex problem, to effectively summarizing a 1,000-page text, to passing the bar exam. While mapping visual and audio content to language may be viewed similarly, solving video-language alignment, as Twelve Labs has with this release, is incredibly difficult. Yet by doing so, Twelve Labs’ latest functionality solves a myriad of other problems no one else has been able to overcome.

The company has uniquely trained its multimodal AI model to solve complex video-language alignment problems. Twelve Labs’ proprietary model, evolved, tested, and refined for its public beta, leverages all of the components present in videos, such as actions, objects, and background sounds, and learns to map human language to what’s happening inside a video. This is far beyond the capabilities in the existing market. The APIs become available as OpenAI rolls out voice and image capabilities for ChatGPT, signaling that a shift in interest from unimodal to multimodal AI is underway.

Twelve Labs’ technology not only enables video to tell a holistic story; it also endows models with powerful capabilities so that users can find the best video to meet their needs, whether it’s pulling a highlight reel or generating a custom report. Twelve Labs users can now extract topics, as well as create summaries and chapters of video, leveraging multimodal data. Such features not only save users substantial amounts of time, but also help uncover new insights, suggest marketing content such as catchy headlines or SEO-friendly tags, and unlock new possibilities for video through simple-to-use APIs.

Strategic Investments Signal Future of Video Understanding

In addition to its latest advancements, Twelve Labs disclosed a $10 million strategic investment. Investors including NVentures, NVIDIA’s venture capital arm; Intel; Samsung Next; and others see Twelve Labs’ technology as driving the future of video understanding. Their investment in and alignment with the company will create novel opportunities and exciting product integrations that will change the video landscape.

“What Twelve Labs has accomplished technically is impressive. Anyone who understands the complexities associated with summarizing video will appreciate this leap forward,” said Mohamed (Sid) Siddeek, head of NVentures at NVIDIA. “We believe Twelve Labs is an exciting AI company and look forward to working with the team on numerous projects in the future.”

Twelve Labs chose to build the go-to video understanding infrastructure for developers and enterprises that are innovating the video experience in their respective areas. It makes video just as easy to work with, and just as useful, as text. In essence, Twelve Labs provides the video intelligence layer on top of which customers build their dream features. For the first time, organizations and developers can do things like retrieve an exact moment within hundreds of thousands of hours of footage by describing that scene in text, or generate relevant text from videos, be it titles, chapters, summaries, reports, or even tags, incorporating both the visual and audio content, just by prompting the model for it. Starting with groundbreaking capabilities like these, Twelve Labs pushes boundaries to provide a text-based interface that solves all video-related downstream tasks, ranging from low-level perception tasks to high-level video understanding.

Understanding Powers Growth

Over the course of its highly successful closed beta, in which more than 17,000 developers tested the platform, Twelve Labs worked to ensure a highly scalable, fast, and reliable experience. The company saw an explosion of use cases.

“It’s essential for our business to access exact moments, angles, or events within a game in order to package the best content to our fans, so we prioritize video search tools for our content creators,” said Brad Boim, Senior Director of Asset Management and Post-Production, NFL Media. “It’s exciting to see the shift from traditional video labeling and tagging towards contextual video search using natural language. The emergence of multi-modal AI and natural language search can be a game-changer in opening up access to a media library and surfacing the best content you have available.”

Now anyone can gain access through the public beta.

“The Twelve Labs team has consistently pushed the envelope and broken new ground in video understanding since our founding in 2021. Our latest features represent this tireless work,” said Jae Lee, co-founder and CEO of Twelve Labs. “Based on the remarkable feedback we have received, and the breadth of test cases we’ve seen, we are incredibly excited to welcome a broader audience to our platform so that anyone can use best-in-class AI to understand video content without manually watching thousands of hours to find what they are looking for. We believe this is the best, most efficient way to make use of video.”