Posts

Search engines

March 13, 2025

(Accuracy measurements)

“AI search engines give incorrect answers at an alarming 60% rate, study says”

“A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news content.”

“Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.”
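
A quick check of the one figure quoted above that comes with raw counts: 134 incorrect answers out of 200 queries does work out to 67 percent. Here's a minimal Python sketch of that arithmetic; the function name `error_rate_percent` is my own, not something from the study or the article.

```python
def error_rate_percent(incorrect: int, total: int) -> float:
    """Return the share of incorrect answers as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Multiply before dividing so that simple cases such as 134/200 stay exact.
    return 100 * incorrect / total


# ChatGPT Search figures as quoted in the article: 134 wrong out of 200.
print(error_rate_percent(134, 200))  # 67.0
```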

Google’s Gemini appears to have also had an extremely high error rate.

I should note that the study looked at only one specific kind of query. Here’s the methodology from the study:

“We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL”

“We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results”

Also: “More than half of responses from Gemini and Grok 3 cited fabricated or broken URLs that led to error pages.”
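
To make the task concrete, here's a rough sketch of how a single response could be checked against the four fields the study asked for (headline, publisher, publication date, URL). This is my own illustration, not the study's grading method, which isn't described in the excerpts above. The `Citation` class, the `score` function, the exact-match-after-normalization rule, and the example.com data are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Citation:
    """The four fields each chatbot was asked to identify (hypothetical type)."""
    headline: str
    publisher: str
    date: str
    url: str


def normalize(value: str) -> str:
    # Lowercase and collapse whitespace so trivial differences don't count as errors.
    return " ".join(value.lower().split())


def score(expected: Citation, answer: Citation) -> dict[str, bool]:
    """Return, per field, whether the answer matches the expected value.

    Assumption: a field counts as correct only on an exact match after
    normalization. A real evaluation might well be more lenient.
    """
    return {
        f.name: normalize(getattr(expected, f.name)) == normalize(getattr(answer, f.name))
        for f in fields(Citation)
    }


# Made-up example data, not taken from the study.
expected = Citation(
    headline="City Council Approves Budget",
    publisher="Example Times",
    date="2024-05-01",
    url="https://example.com/news/budget",
)
answer = Citation(
    headline="city council approves budget",
    publisher="Example Times",
    date="2024-05-02",
    url="https://example.com/news/made-up",
)

print(score(expected, answer))
# {'headline': True, 'publisher': True, 'date': False, 'url': False}
```

A check like this only compares strings. It says nothing about whether a cited URL actually loads, which is the separate problem the last quote describes.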

Here’s the study (may be paywalled).

In comments on my Facebook post, a friend indicated that exact text matching is a task that we wouldn’t expect LLMs to be good at. I replied:

I don’t see this as an exact-text-matching task; I see it as a reference-finding task.

Like asking the question: “Here’s a quote I found on the internet. Where does it come from?”

When a search engine receives that question and responds with a made-up URL, that seems to me to be a problem.

(But I agree that it doesn’t necessarily make sense to generalize from the results of studies that focus specifically on a particular kind of query. And I do feel like the Ars Technica article that I linked to should have said a little more about the specific focus of this study.)

(Original Facebook post.)

AI in healthcare

January 18, 2025

(Medical/Health)

“Artificial intelligence systems [in healthcare contexts] require consistent monitoring and staffing to put in place and to keep them working well.”

“Evaluating whether these products work is challenging. Evaluating whether they continue to work — or have developed the software equivalent of a blown gasket or leaky engine — is even trickier.”

“‘Even in the best case, the [LLMs] had a 35% error rate’”

(To be clear: Some of this article is about LLMs, and some of it is about predictive algorithms that I assume are old-fashioned non-generative AI. So this is partly an LLM issue, but also partly a non-LLM issue.)

(Original Facebook post.)

Expert testimony

January 16, 2025

(Generated legal documents)

“A federal court judge has thrown out expert testimony from a Stanford University artificial intelligence and misinformation professor[, Jeff Hancock], saying his submission of fake information made up by an AI chatbot ‘shatters’ his credibility.”

“At Stanford, students can be suspended and ordered to do community service for using an AI chatbot to ‘substantially complete an assignment or exam’ without instructor permission. The school has repeatedly declined to respond to questions […] about whether Hancock would face disciplinary measures.”

(Original Facebook post.)

Identifying sources

December 5, 2024

(Sources)

Researchers asked ChatGPT’s search tool to identify the source of excerpts from a couple hundred online articles.

The result: ChatGPT made up answers. (Not always, but often.)

Gasp! Shock! Surprise!

(Original Facebook post.)

We don’t like to talk about that

December 5, 2024

(Names, OpenAI)

If your ChatGPT prompt includes certain not-uncommon names of humans, ChatGPT says “I’m unable to produce a response” and ends the session.

Turns out that those names are names of some people who have prominently reported that ChatGPT was making up lies about them.

So apparently, on learning that ChatGPT is lying about specific people, OpenAI has decided to prevent ChatGPT from responding to any prompt that mentions those people’s names.

Of course, usually there’s more than one human who has a particular name, so OpenAI is also preventing ChatGPT from talking about anyone who has the same name as someone who ChatGPT has previously prominently lied about.

(Original Facebook post.)

Album release date

November 24, 2024

(Authoritative-sounding, Music)

Today I did a Google search for [“field of stars” mccutcheon] and I forgot to append “-AI” to leave out the AI Overview. When I forget to leave out the Overview, I normally try to not even look at the Overview; but this time the Overview caught my eye. It starts out:

“Field of Stars is an album by American folk singer-songwriter John McCutcheon. The album was released on January 10, 2024.”

McCutcheon had an online concert today to celebrate the release of the album, so I spent several seconds wondering why he waited 10+ months after its release to have the concert. And then I realized that of course the AI Overview is just wrong, once again. The album will be officially released on January 10, 2025. (But it’s available now in various pre-official-release contexts.)

But this is one of the reasons that I usually try not to even look at the Overview: Overviews often read to me as so authoritative that even though I know they include false information, I still sometimes believe them.

(Original Facebook post.)

Kosher bacon

November 17, 2024

(Food)

If you do a Google search for [salt pork substitute kosher], the AI Overview tells you to try pancetta or bacon as a kosher substitute for salt pork.

Yet another example of why you should never believe anything that generative AI tells you.

(Original Facebook post.)

(Update: Sometime in the year after I posted this, Google stopped returning an AI Overview in response to that query.)

Transcription

October 26, 2024

(Medical/Health)

“Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said”

This is about Whisper, which I’ve heard praised in other contexts. 🙁

“Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.”

“medical centers [are using] Whisper-based tools to transcribe patients’ consultations with doctors”

“While most developers assume that transcription tools misspell words or make other errors, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.”

(Original Facebook post.)

Scaling and reliability

October 10, 2024

(Accuracy measurements)

“A common assumption is that scaling up [LLMs] will improve their reliability—for instance, by increasing the amount of data they are trained on, or the number of parameters they use to process information. However, more recent and larger versions of these language models have actually become more unreliable, not less, according to a new study.”

“This decrease in reliability is partly due to changes that made more recent models significantly less likely to say that they don’t know an answer, or to give a reply that doesn’t answer the question. Instead, later models are more likely to confidently generate an incorrect answer.”

(Article from Oct. 3.)

(Original Facebook post.)

Apple Intelligence

September 10, 2024

(Apple)

In Apple’s launch event for the iPhone 16 yesterday, I was not thrilled with the amount of emphasis they put on the new “Apple Intelligence” features. But I did think that if those features work well, some of them could be pretty useful.

Unfortunately, this review makes me even more dubious.

“In the preview I’m using, Apple Intelligence does an uncomfortable amount of making things up.”

“like the time it alerted me that Donald Trump had endorsed Tim Walz for president. (Ha.) And the time it made up the idea that I’m teaching at UC Berkeley. (No.) And the time it elevated an obvious Social Security scam to my ‘priority’ inbox. (Yikes). And the time it edited a selfie to make me bald. (Double yikes.)”

“it feels weird […] to see fabrications and misinterpretations of your life appear on your lock screen, inbox and other core parts of your iPhone.”

“I told Apple about the many times I saw Apple Intelligence get facts wrong (I’ve had at least five to 10 laugh-out-loud moments per day). It says it is working to improve accuracy. But so is every other AI company — and that has proved to be a giant challenge.”

(Original Facebook post.)