Researchers from clinical-stage drug discovery company Insilico Medicine have collaborated with NVIDIA to present “nach0,” a new transformer-based large language model (LLM) for biological and chemical tasks.
The multi-task LLM was trained on a diverse set of tasks, including natural language understanding, synthetic route prediction, and molecular generation, and works across domains to answer biomedical questions and synthesize new molecules.
nach0 is built on NVIDIA BioNeMo, a generative AI platform for training and scaling drug discovery applications. Specifically, the training was performed using NVIDIA NeMo, an end-to-end platform for developing custom generative AI. The research team used the platform's NLP capabilities to train and evaluate the new model.
Researchers trained nach0 on a dataset of chemical information, including 4.7 billion tokens annotated with special symbols. The dataset comprises 100 million documents from PubMed abstracts and U.S. Patent and Trademark Office descriptions, totaling 355 million tokens from abstracts and 2.9 billion tokens from patents, along with molecular structures using the simplified molecular-input line-entry system (SMILES).
Using this extensive dataset, nach0 was trained on three main groups of tasks: natural language processing tasks, chemistry-related tasks, and cross-domain tasks.
Alex Zhavoronkov, PhD, Founder and CEO of Insilico Medicine, says, “Nach0 represents a step forward in automating drug discovery through natural language prompts.”
"Generative AI and LLMs are transforming the landscape of scientific discovery in biology and chemistry," says Rory Kelleher, Global Head of Business Development for Life Sciences at NVIDIA.
Compared with other LLMs used for biomedical understanding, nach0 showed distinct advantages on molecular tasks that use molecular data, and it significantly outperformed ChatGPT.