Generative AI Lies

Examples of generative AI making stuff up

Category: Accuracy measurements

  • Hallucinations are getting worse

    A.I. Is Getting More Powerful, but Its Hallucinations Are Getting Worse

    “A new wave of ‘reasoning’ systems from companies like OpenAI is producing incorrect information more often. Even the companies don’t know why.”

    “[OpenAI] found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.”

    (Original Facebook post.)


  • Search engines

    AI search engines give incorrect answers at an alarming 60% rate, study says

    “A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news content.”

    “Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.”

    Google’s Gemini also appears to have had an extremely high error rate.


    I should note that the study was only looking at one specific kind of query. Here’s the methodology from the study:

    “We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL”

    “We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results”

    Also: “More than half of responses from Gemini and Grok 3 cited fabricated or broken URLs that led to error pages.”

    Here’s the study (may be paywalled).
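
    To make the methodology concrete, here’s a rough sketch of the kind of evaluation loop the study describes. This is my own illustration, not the Tow Center’s code: ask_chatbot is a hypothetical stand-in for each tool’s query interface, and each article record is assumed to carry the known headline, publisher, and URL.

    import urllib.request
    import urllib.error

    def url_resolves(url, timeout=10.0):
        """Return True if a cited URL actually loads (i.e., isn't fabricated or broken)."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.URLError, ValueError):
            return False

    def score_attribution(response, article):
        """Compare a chatbot's attribution of an excerpt against the known source article."""
        def norm(s):
            return (s or "").strip().lower()
        return {
            "headline_ok": norm(response.get("headline")) == norm(article["headline"]),
            "publisher_ok": norm(response.get("publisher")) == norm(article["publisher"]),
            "url_ok": response.get("url") == article["url"],
            "url_resolves": url_resolves(response.get("url", "")),
        }

    # Hypothetical usage: for each publisher, take excerpts whose source a plain
    # Google search surfaces in its top three results, then ask each chatbot to
    # attribute them.
    # scores = [score_attribution(ask_chatbot(excerpt), article)
    #           for article, excerpt in sampled_excerpts]

    The “fabricated or broken URLs” finding quoted above corresponds to the url_resolves check in this sketch.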


    In comments on my Facebook post, a friend indicated that exact text matching is a task that we wouldn’t expect LLMs to be good at. I replied:

    I don’t see this as an exact-text-matching task; I see it as a reference-finding task.

    Like asking the question: “Here’s a quote I found on the internet. Where does it come from?”

    When a search engine receives that question and responds with a made-up URL, that seems to me to be a problem.

    (But I agree that it doesn’t necessarily make sense to generalize from the results of studies that focus specifically on a particular kind of query. And I do feel like the Ars Technica article that I linked to should have said a little more about the specific focus of this study.)


    (Original Facebook post.)


  • Scaling and reliability

    “A common assumption is that scaling up [LLMs] will improve their reliability—for instance, by increasing the amount of data they are trained on, or the number of parameters they use to process information. However, more recent and larger versions of these language models have actually become more unreliable, not less, according to a new study.”

    “This decrease in reliability is partly due to changes that made more recent models significantly less likely to say that they don’t know an answer, or to give a reply that doesn’t answer the question. Instead, later models are more likely to confidently generate an incorrect answer.”
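
    To make that concrete, here’s a tiny sketch of the kind of tally involved. It assumes each response has already been labeled as correct, incorrect, or avoidant (a refusal or non-answer), and the example numbers are purely illustrative, not the study’s data.

    from collections import Counter

    def reliability_profile(labels):
        """Fraction of responses in each category: correct, incorrect, or avoidant."""
        counts = Counter(labels)
        total = len(labels)
        return {kind: counts[kind] / total
                for kind in ("correct", "incorrect", "avoidant")}

    # Made-up labels illustrating the pattern described above: the later model
    # shifts mass from "avoidant" to "incorrect" rather than to "correct".
    earlier = ["correct"] * 50 + ["avoidant"] * 35 + ["incorrect"] * 15
    later = ["correct"] * 55 + ["avoidant"] * 5 + ["incorrect"] * 40

    print(reliability_profile(earlier))  # {'correct': 0.5, 'incorrect': 0.15, 'avoidant': 0.35}
    print(reliability_profile(later))    # {'correct': 0.55, 'incorrect': 0.4, 'avoidant': 0.05}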

    (Article from Oct. 3.)

    (Original Facebook post.)


  • Take a deep breath

    We’ve known for a while that telling a generative AI system (“LLM”) to work step by step on a math problem improves the accuracy of results.

    Now researchers have found a specific sentence that works particularly well to improve accuracy (at least when using Google’s PaLM 2 LLM): “Take a deep breath and work on this problem step by step.”

    But just because the phrase improved accuracy doesn’t mean it resulted in super high accuracy:

    “The phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, which is a data set of grade-school math word problems. By comparison, PaLM 2, without any special prompting, scored only 34 percent accuracy on GSM8K, and the classic ‘Let’s think step by step’ prompt scored 71.8 percent accuracy.”

    So these kinds of phrases result in a big improvement, but they’re still only getting 70% to 80% accuracy (at least on PaLM 2).

    On the one hand, the fact that an LLM can achieve 80% accuracy on answering mathematical word problems is neat and impressive from an AI-theory point of view. On the other hand, from an answering-questions-accurately point of view, that means that even at its best, it gets the answer wrong one time in five.
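
    For anyone curious what this kind of prompt-prefix comparison looks like in practice, here’s a rough sketch. It’s my own illustration, not the researchers’ harness: ask_llm is a hypothetical stand-in for a call to whatever model is being tested, and the answer extraction is deliberately crude.

    import re

    PREFIXES = [
        "",  # no special prompting
        "Let's think step by step.",
        "Take a deep breath and work on this problem step by step.",
    ]

    def final_number(text):
        """Crude answer extraction: take the last number that appears in the reply."""
        nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return nums[-1] if nums else None

    def accuracy(prefix, problems, ask_llm):
        """Fraction of (question, answer) word problems answered correctly under a prefix."""
        correct = 0
        for question, expected in problems:
            reply = ask_llm(f"{prefix}\n{question}".strip())
            if final_number(reply) == str(expected):
                correct += 1
        return correct / len(problems)

    # Hypothetical usage against a GSM8K-style problem set:
    # for prefix in PREFIXES:
    #     print(repr(prefix), accuracy(prefix, problems, ask_llm))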

    So the moral of my post here is the same as the moral of most of my posts about LLMs:

    Don’t trust the answers that LLMs provide. They are often false.

    (Original Facebook post.)


  • Decline in accuracy

    “Over just a few months, [GPT-4] went from correctly answering a [particular] math problem 98% of the time to just 2%, study finds”

    More specifically:

    “in March GPT-4 was able to correctly identify that the number 17077 is a prime number 97.6% of the times it was asked. But just three months later, its accuracy plummeted to a lowly 2.4%. Meanwhile, the GPT-3.5 model had virtually the opposite trajectory. The March version got the answer to the same question right just 7.4% of the time—while the June version was consistently right, answering correctly 86.8% of the time.”

    Also, it looks like they asked GPT-4 to give step-by-step reasoning for the primes question; in March, it gave good step-by-step answers, but in June, it ignored the step-by-step part of the prompt.
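
    As a side note, the underlying question has a mechanically checkable answer: a few lines of trial division confirm that 17077 is indeed prime, which is the answer the models were being graded against.

    def is_prime(n):
        """Trial division up to the square root of n."""
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        d = 3
        while d * d <= n:
            if n % d == 0:
                return False
            d += 2
        return True

    print(is_prime(17077))  # True: no divisor found up to 130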

    Here’s the paper that the article is talking about (not yet peer-reviewed, I think).

    (Original Facebook post.)