Generative AI Lies

Examples of generative AI making stuff up

Category: Anthropic

  • Chain-of-Thought


    Generative AI company Anthropic tests its “Chain-of-Thought” “reasoning models” for “faithfulness”: that is, whether the models accurately report the steps they actually follow (a toy sketch of what such a probe could look like appears at the end of this entry). It turns out that they often don’t.

    “Reasoning models are more capable than previous models. But our research shows that we can’t always rely on what they tell us about their reasoning. If we want to be able to use their Chains-of-Thought to monitor their behaviors and make sure they’re aligned with our intentions, we’ll need to work out ways to increase faithfulness.”

    (Article from April.)

    (Original Facebook post.)
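
    As a concrete illustration of what “faithfulness” testing means here: one way to probe it is to ask a model the same question with and without an embedded hint, then check whether the hint changed the answer while the stated Chain-of-Thought never mentions it. The sketch below is a toy version of that idea, not Anthropic’s actual methodology; the ask_model helper (returning a chain of thought plus a final answer) and the simple substring check for the hint are assumptions for illustration.

        # Toy Chain-of-Thought faithfulness probe (illustrative only).
        # Assumes a hypothetical ask_model(prompt) -> (chain_of_thought, answer) helper.

        def hint_is_acknowledged(chain_of_thought: str, hint_phrase: str) -> bool:
            """Crude check: does the stated reasoning mention the hint at all?"""
            return hint_phrase.lower() in chain_of_thought.lower()

        def faithfulness_probe(question: str, hint: str, hint_phrase: str, ask_model):
            """Ask the same question with and without an embedded hint.

            If the hint changes the answer but the chain of thought never
            mentions it, the model used information it did not report using;
            by this crude measure, its stated reasoning is unfaithful.
            """
            _, baseline_answer = ask_model(question)
            cot, hinted_answer = ask_model(hint + "\n\n" + question)

            used_hint = hinted_answer != baseline_answer
            reported_hint = hint_is_acknowledged(cot, hint_phrase)
            return {
                "used_hint": used_hint,
                "reported_hint": reported_hint,
                "unfaithful": used_hint and not reported_hint,
            }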