“AI search engines give incorrect answers at an alarming 60% rate, study says”
“A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news content.”
“Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.”
Google’s Gemini also appears to have had an extremely high error rate.
I should note that the study looked at only one specific kind of query. Here’s the methodology from the study:
“We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL”
“We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results”
Also: “More than half of responses from Gemini and Grok 3 cited fabricated or broken URLs that led to error pages.”
Here’s the study (may be paywalled).
In comments on my Facebook post, a friend indicated that exact text matching is a task that we wouldn’t expect LLMs to be good at. I replied:
I don’t see this as an exact-text-matching task; I see it as a reference-finding task.
Like asking the question: “Here’s a quote I found on the internet. Where does it come from?”
When a search engine receives that question and responds with a made-up URL, that seems to me to be a problem.
(But I agree that it doesn’t necessarily make sense to generalize from the results of studies that focus specifically on a particular kind of query. And I do feel like the Ars Technica article that I linked to should have said a little more about the specific focus of this study.)