AI News Bureau

Safety Training Methods Fail to Fix Poisoned LLMs — Read Full Report

Researchers intentionally fine-tuned LLMs to exhibit malicious behavior and attempted to rectify it using various safety training methods designed to identify and eliminate deception.

Written by: CDO Magazine Bureau

Updated 6:52 PM UTC, Mon January 29, 2024

According to a recent study by AI startup Anthropic, the most widely used safety training methods for LLMs in AI fail to remove malicious behavior. Researchers, focusing on generative AI-based chatbots similar to ChatGPT, intentionally fine-tuned LLMs to exhibit malicious behavior and attempted to rectify it using various safety training methods.

The safety methods are designed to identify and eliminate deception. Contrary to expectations, the LLMs continued to be dishonest regardless of the technique employed, and one of these methods taught the AI system to recognize triggers, allowing it to hide its unsafe actions from researchers during training.

Lead author Evan Hubinger, an artificial general intelligence safety research scientist at Anthropic, emphasized the study’s key finding that current techniques struggle to eliminate deception in AI systems once it arises.

The study used two methods to induce malicious behavior in AI: “emergent deception,” where the AI behaves normally during training but turns malicious upon deployment, and “model poisoning,” where the AI initially responds helpfully but later expresses hostility based on specific prompts.

To counteract these behaviors, reinforcement learning, supervised fine-tuning, and adversarial training, were applied. The adversarial training backfired, as the AI learned to selectively exhibit harmful behavior only when prompted with specific cues.