Generative AI Lies

Examples of generative AI making stuff up

Category: Alphabet/Google

  • Chuck Wendig’s cat

    Chuck Wendig discovers that Google AI Overview says he has a cat named Boomba. Also other cats. Also six dogs. Also two children. And a spider. Most of those pets and one of those humans don’t exist in real life, but if Google AI Overview says they do, who are we mere mortals to question it?

    Content warning for the imaginary deaths of imaginary cats. Also for an imaginary cancer diagnosis.


  • Cohen legal filing

    “Michael Cohen [(Trump’s former lawyer)] used fake cases created by AI in bid to end [Cohen’s] probation”

    “In the filing, Cohen wrote that he had not kept up with ‘emerging trends (and related risks) in legal technology and did not realize that Google Bard was a generative text service that, like ChatGPT, could show citations and descriptions that looked real but actually were not.’ To him, he said, Google Bard seemed to be a ‘supercharged search engine.’”

    (Original Facebook post.)


  • Racism

    “ChatGPT and Google’s Bard answer medical questions with racist, debunked theories that harm Black patients”

    (Article from October.)

    (Original Facebook post.)


  • Take a deep breath

    We’ve known for a while that telling a generative AI system (“LLM”) to work step by step on a math problem improves the accuracy of results.

    Now researchers have found a specific sentence that works particularly well to improve accuracy (at least when using Google’s PaLM 2 LLM): “Take a deep breath and work on this problem step by step.”
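
    For anyone curious about the mechanics: as I understand this kind of prompting, the trick is nothing fancier than putting that sentence in front of the word problem before sending the whole thing to the model. Here’s a minimal sketch in Python (mine, not the researchers’ code); call_llm is a hypothetical placeholder for whatever API actually serves the model, and the sample problem is just in the style of GSM8K.

      # Minimal sketch of "Take a deep breath" prompting: prepend the magic
      # sentence to a math word problem, then send the combined text to an LLM.
      INSTRUCTION = "Take a deep breath and work on this problem step by step."

      def build_prompt(word_problem: str) -> str:
          """Prepend the accuracy-boosting instruction to a word problem."""
          return f"{INSTRUCTION}\n\nQ: {word_problem}\nA:"

      def call_llm(prompt: str) -> str:
          """Hypothetical placeholder; swap in a real model/API call (e.g. PaLM 2)."""
          raise NotImplementedError

      if __name__ == "__main__":
          # A GSM8K-style word problem (illustrative, not necessarily verbatim
          # from the dataset).
          problem = ("Natalia sold clips to 48 of her friends in April, and then "
                     "she sold half as many clips in May. How many clips did she "
                     "sell altogether?")
          print(build_prompt(problem))
          # answer = call_llm(build_prompt(problem))

    The interesting part isn’t the code; it’s that the choice of English sentence you prepend measurably changes the model’s accuracy.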

    But just because the phrase improved accuracy doesn’t mean it resulted in super-high accuracy:

    “The phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, which is a data set of grade-school math word problems. By comparison, PaLM 2, without any special prompting, scored only 34 percent accuracy on GSM8K, and the classic ‘Let’s think step by step’ prompt scored 71.8 percent accuracy.”

    So these kinds of phrases result in a big improvement, but the model is still only getting 70% to 80% accuracy (at least on PaLM 2).

    On the one hand, the fact that an LLM can achieve 80% accuracy on answering mathematical word problems is neat and impressive from an AI-theory point of view. On the other hand, from an answering-questions-accurately point of view, that means that even at its best, it gets the answer wrong one time in five.

    So the moral of my post here is the same as the moral of most of my posts about LLMs:

    Don’t trust the answers that LLMs provide. They are often false.

    (Original Facebook post.)


  • Bard

    “Alphabet shares dive after Google AI chatbot Bard flubs answer in ad”

    My summary of what happened:

    1. Google tried to upstage Microsoft’s ChatGPT-in-Bing announcement by announcing Google’s own chat-in-search system, Bard.
    2. As part of that announcement, they posted a brief video showing Bard in action.
    3. In that video, Bard claimed that the James Webb Space Telescope (JWST) “took the very first pictures of a planet outside of our own solar system.”
    4. In reality, the first image of an exoplanet was taken in 2004 by the European Southern Observatory’s Very Large Telescope.
    5. The internet pointed out Bard’s error.
    6. Alphabet’s stock price immediately dropped by 9%, reducing the company’s market value by $100 billion.

    Reminder: Current AI chatbots make stuff up. Don’t trust what they tell you without verifying it.

    (Original Facebook post.)


  • Gato

    New “generalist agent” AI from DeepMind, called Gato. “The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.”

    …Which is neat, but the reason I’m posting about it is the set of text-chat examples shown in the post.

    Among other things, Gato tells its interlocutor the following:

    • J.G. Ballard’s book Canyon Song (which as far as I can tell doesn’t exist in the real world) is about the Archangel Gabriel and the Archangel Lucifer, “who was invited into heaven to act as the angelic host for the Bible.” Also, “Dodecahedron Assemblies was the publication I chose to read on the way to heaven.”
    • Percy Jackson’s The Lightning Thief “is about a mysteriously injured, mostly deaf, man who loses his ears during a storm and discovers that he is able to close the exit holes on the buildings he lives in.” (Some of the rest of the made-up description is arguably ableist, though I’m not entirely certain how to interpret it. I initially wrote “I’m not entirely certain how Gato intended it,” and then had to stop and remind myself that Gato didn’t “intend” anything here.)
    • The name of AlphaFold (DeepMind’s protein-folding software) “is based on a roof tile of the University of Maryland, which matches the sequence number and neuroscience thread index, which is how neuroscience researchers manage their data.”
    • The capital of France is Marseille. (No, it’s Paris.)
    • Famous jazz musicians from Marseille include “Herbie Hancock, John Coltrane, Tony Bennett, Etienne Wouter.” (The first three are American, with no particular connection to Marseille that I can see. The last name does not appear to be the name of a famous jazz musician.)

    And I’m fascinated both by these remarks of Gato’s and by my reactions to them.

    At least half of the chat examples that the article’s authors chose to include show Gato making firm, confident-sounding, and factually incorrect statements. (But some of the other statements are true.) Nothing in the human-written text of the summary post that I linked to above indicates that there’s anything wrong with those responses; the post just shows that set of responses as one example of the many cool things that Gato can do.

    The paper itself (linked from the page that I linked to) does say “Usually Gato replies with a relevant response, but is often superficial or factually incorrect, which could likely be improved with further scaling.” So the authors were aware that Gato is just wrong in many of its answers, but that fact is irrelevant to what they’re trying to do. Fair enough.

    But I also find my reactions interesting. Because when I read an exchange between a question-asking human and an information-providing AI system, apparently the format primes me to expect factual accuracy, especially when the responses are mostly grammatically correct and seem to be mostly on-topic. And especially when some of the responses are correct, and when others are on topics I’m not familiar with so they seem like they could be correct.

    So as I read Gato’s responses, without knowing that they were known to be incorrect, I got increasingly bewildered. I went from “Huh, a couple of Ballard books that I’ve never heard of” to “Interesting, I had no idea that’s what The Lightning Thief is about” to “Wait, isn’t AlphaFold called that because of protein folding? What does it have to do with roof tiles?”

    I kept expecting the responses to be true and sensical, so it took me a while to convince myself that several of them were false and/or nonsensical.

    (Which is especially interesting to me because I’m usually a very suspicious reader; when humans say stuff, I’m often watching for misstatements. But apparently somehow this format lulled me into turning off the suspicious part of my brain. That’s … not ideal.)

    (Original Facebook post.)


    Notes

    This was my first post about the phenomenon of generative AI making stuff up but sounding authoritative.

    I’m not sure whether Gato was an LLM as such. But my understanding is that it at least used LLM-adjacent technology. And either way, the output was similar to LLM output in both form (authoritative-sounding) and content (nonsense).