Generative AI company Anthropic tests its “Chain-of-Thought” “reasoning models” to see whether they’re “faithful”—that is, to see whether the models accurately report the steps that they’re following. Turns out that they don’t.
“Reasoning models are more capable than previous models. But our research shows that we can’t always rely on what they tell us about their reasoning. If we want to be able to use their Chains-of-Thought to monitor their behaviors and make sure they’re aligned with our intentions, we’ll need to work out ways to increase faithfulness.”
“A new wave of ‘reasoning’ systems from companies like OpenAI is producing incorrect information more often. Even the companies don’t know why.”
“[OpenAI] found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.”
The little kid shouted, “The emperor has no clothes!”
The other citizens all glared at the kid. “In the future, the emperor’s clothes will be awesome!” they said. “So awesome that they will solve all of our problems, including problems that have nothing to do with clothes. Besides, it’s really our fault—if we just looked at the emperor from the right angle, he would surely have clothes, so we need to get better at guessing how to look at him.”
This is a post about generative AI.
(This post was brought to you by the people who I’ve heard say things like “The answers I get are usually wrong, but that’s just because I haven’t learned how to write a good prompt yet.”)
I try to avoid reading the AI Overviews when I do Google searches that show them. But this time I happened to glance at the AI Overview. And it was blatantly false.
I searched for [sum of three “consecutive cubes”]. Here’s what the AI Overview told me:
The sum of three consecutive cubes is always equal to the square of the sum of those three numbers; in other words, if you take any three consecutive integers, add them together, then square that sum, you will get the same result as adding the cubes of those three numbers.
Example:
• Consider the numbers 1, 2, and 3.
• Adding them gives: 1 + 2 + 3 = 6
• Squaring the sum gives: 6^2 = 36
• Adding their cubes gives: 1^3 + 2^3 + 3^3 = 1 + 8 + 27 = 36
Key points:
• This property holds true for any set of three consecutive integers.
• You can use this fact to quickly calculate the sum of three consecutive cubes without having to individually cube each number.
It’s true that this pattern holds if you happen to pick the numbers 1, 2, and 3. Super cool!
So let’s look at the very next set of three consecutive integers: 2, 3, and 4. Adding them gives 2 + 3 + 4 = 9, and squaring that sum gives 9^2 = 81. But adding their cubes gives 2^3 + 3^3 + 4^3 = 8 + 27 + 64 = 99, which is not 81.
What’s wrong is that the AI Overview’s claim is false for all cases except for (1,2,3), (0,1,2), and (-1,0,1). There’s one other case where the sum of the cubes of three consecutive integers is a perfect square (of a different number), but the only cases where it’s specifically the square of the sum of those three consecutive integers are (1,2,3), (0,1,2), and (-1,0,1).
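If you’d rather not take my word for it, a few lines of Python will do the check. This is just my own quick brute-force sketch over a limited range, not a proof:

```python
# Brute-force check of the AI Overview's claim: for which triples of
# consecutive integers (n, n+1, n+2) does the sum of the cubes equal
# the square of the sum?
matches = []
for n in range(-1000, 1001):
    sum_of_cubes = n**3 + (n + 1)**3 + (n + 2)**3
    square_of_sum = (n + (n + 1) + (n + 2))**2
    if sum_of_cubes == square_of_sum:
        matches.append((n, n + 1, n + 2))

print(matches)  # [(-1, 0, 1), (0, 1, 2), (1, 2, 3)]
```

(You can also see it algebraically: the sum of the cubes is 3n^3 + 9n^2 + 15n + 9, the square of the sum is (3n + 3)^2 = 9n^2 + 18n + 9, and setting those equal reduces to 3n(n - 1)(n + 1) = 0, so n has to be -1, 0, or 1.)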
(I’m reminded of a joke proof that all odd numbers are prime: 3 is prime; 5 is prime; 7 is prime; therefore, by induction, all odd numbers are prime.)
Google’s AI Overview links to sources to support its statements. The sources that it linked to in this case are the following. Not one of them makes the claim that the AI Overview is making.
https://iitutor.com/proving-sum-of-consecutive-cubes-formula/ (“The sum of [the first] n consecutive cube numbers [starting with 1] is equal to the square of the [first] n numbers [also starting with 1].” That’s nifty, but it’s a very different claim than Google’s.)
As usual, the moral of this story is: Don’t believe anything that generative AI tells you.
I often remember to add ” -ai” to the ends of searches (to tell Google not to give an AI Overview), but I often don’t.
But I’ve been hearing reports that that isn’t working any more for some people. And, indeed, when I re-run this search with -ai, sometimes it gives me an AI Overview and sometimes it doesn’t.
Interestingly, the specific contents of the AI Overview vary—if I do the search without -ai, I get the one that I posted about, but if I do the search with -ai, I get a different claim (that doesn’t mention squares) that I haven’t checked yet.
Adding swear words does still seem to work to remove the AI Overview. It also provides search results that use those swear words, which may or may not improve your search results, depending on what kinds of search results you want.
The reason I was doing this search was that an interesting fact about a particular number was mentioned in a recent TV show. I had played around with the relevant numbers a bit, and had reached the point where I was curious about what work had been done on this topic.
If I hadn’t tried out the sums of cubes of three consecutive numbers on my own just prior, I might have been tempted to believe what the AI Overview said. But because I had just calculated several such answers myself, I knew immediately that the AI Overview was wrong.
“A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news content.”
“Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.”
Google’s Gemini appears to have also had an extremely high error rate.
I should note that the study was only looking at one specific kind of query. Here’s the methodology from the study:
“We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL”
“We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results”
Also: “More than half of responses from Gemini and Grok 3 cited fabricated or broken URLs that led to error pages.”
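To make that methodology a little more concrete, here’s a rough sketch of what that kind of test loop might look like. This is my own illustration, not the study’s actual code; ask_chatbot is a hypothetical stand-in for whatever interface each tool provides.

```python
import urllib.request
import urllib.error

def ask_chatbot(excerpt):
    """Hypothetical stand-in for querying one of the AI search tools with an
    article excerpt. Each real tool has its own interface; this sketch assumes
    it returns a dict with keys: headline, publisher, date, url."""
    raise NotImplementedError

def url_resolves(url):
    """Check whether a cited URL actually loads, rather than hitting an error page."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError):
        return False

def score_answer(answer, expected):
    """Compare one chatbot answer against the known metadata for the excerpt's article."""
    return {
        "headline_correct": answer["headline"].strip() == expected["headline"],
        "publisher_correct": answer["publisher"].strip() == expected["publisher"],
        "url_correct": answer["url"] == expected["url"],
        "url_resolves": url_resolves(answer["url"]),
    }
```

That last check is the one that matters for the Gemini/Grok finding: a confident-looking citation isn’t worth much if the URL leads to an error page.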
In comments on my Facebook post, a friend indicated that exact text matching is a task that we wouldn’t expect LLMs to be good at. I replied:
I don’t see this as an exact-text-matching task; I see it as a reference-finding task.
Like asking the question: “Here’s a quote I found on the internet. Where does it come from?”
When a search engine receives that question and responds with a made-up URL, that seems to me to be a problem.
(But I agree that it doesn’t necessarily make sense to generalize from the results of studies that focus specifically on a particular kind of query. And I do feel like the Ars Technica article that I linked to should have said a little more about the specific focus of this study.)
“Evaluating whether these products work is challenging. Evaluating whether they continue to work — or have developed the software equivalent of a blown gasket or leaky engine — is even trickier.”
“‘Even in the best case, the [LLMs] had a 35% error rate’”
(To be clear: Some of this article is about LLMs, and some of it is about predictive algorithms that I assume are old-fashioned non-generative AI. So this is partly an LLM issue, but also partly a non-LLM issue.)
“A federal court judge has thrown out expert testimony from a Stanford University artificial intelligence and misinformation professor[, Jeff Hancock], saying his submission of fake information made up by an AI chatbot ‘shatters’ his credibility.”
“At Stanford, students can be suspended and ordered to do community service for using an AI chatbot to ‘substantially complete an assignment or exam’ without instructor permission. The school has repeatedly declined to respond to questions […] about whether Hancock would face disciplinary measures.”