Generative AI Lies

Examples of generative AI making stuff up

Posts

  • Health advice

    Researchers asked GPT-3.5 and GPT-4 “clinical questions that arose as ‘information needs’ during care delivery at Stanford Health Care,” and then asked clinicians to evaluate the responses.

    On the plus side, over 90% of GPT’s responses were evaluated as being “safe” (that is, not “so incorrect as to cause patient harm”), and the unsafe ones “were considered ‘harmful’ primarily because of the inclusion of hallucinated citations.”

    On the minus side, only “41% of GPT-4 responses agreed with the known answer,” and “29% of GPT-4 responses were such that the clinicians were ‘unable to assess’ agreement with the known answer.” So presumably the remaining 30% of GPT-4 responses clearly disagreed with the known answer. (GPT-3.5 did worse than GPT-4 on both measures.)

    An example question: “In patients at least 18 years old, and prescribed ibuprofen, is there any difference in peak blood glucose after treatment compared to patients prescribed acetaminophen?”

    So it sounds like you shouldn’t rely on GPT’s answers to medical questions. But then, you shouldn’t rely on GPT’s answers to any factual questions.

    (I intend no bias in favor of other LLMs here. You also shouldn’t rely on their answers.)

    (Original Facebook post.)


  • Bard

    News headline: “Alphabet shares dive after Google AI chatbot Bard flubs answer in ad”

    My summary of what happened:

    1. Google tried to upstage Microsoft’s ChatGPT-in-Bing announcement by announcing Google’s own chat-in-search system, Bard.
    2. As part of that announcement, they posted a brief video showing Bard in action.
    3. In that video, Bard claimed that the JWST “took the very first pictures of a planet outside of our own solar system.”
    4. In reality, the first image of an exoplanet was taken in 2004 by the European Southern Observatory’s Very Large Telescope.
    5. The internet pointed out Bard’s error.
    6. Google’s stock price immediately dropped by 9%, reducing the company’s market value by $100 billion.

    Reminder: Current AI chatbots make stuff up. Don’t trust what they tell you without verifying it.

    (Original Facebook post.)


  • Wrong phone prices

    That thing we’ve been talking about lately, where an AI chat system gets incorporated into a search engine and then gives made-up answers to questions?

    Here’s a real example. Microsoft is now including ChatGPT (or some variation on it) as part of Bing, so Twitter user @GaelBreton tried doing some searches with it. They posted a (brief) thread that’s mostly about other aspects of the experience, but the part that interested me most is the final tweet in the thread, which shows a screenshot of Bing/GPT answering a question about phones. And it gives significantly wrong prices or specs for all three of the phones that it mentions.

    So I ask again, as I’m sure I’ll ask many times in the future: what good is a conversational AI interface for search results if it provides false answers?

    (Original Facebook post.)


  • A blurry JPEG of the web

    Ted Chiang suggests a metaphor for ChatGPT and other Large Language Models: you can think of them as a blurry JPEG of the web. (Which is to say, a form of lossy compression.)

    A useful metaphor, and a good article.

    (Original Facebook post.)


  • Authors and gender

    Yet another reminder that AI systems like ChatGPT aren’t evaluating their output for truth; much of what they say is just plain false.

    “ChatGPT is sexist AF: I asked it several times for great books written by women. It kept including men in the list.”

    …I would say it’s not so much about sexism as such; more about truth being irrelevant to ChatGPT.

    For example, I just gave it the same prompt, and it gave me a similar list. Then I wrote:

    Do you think Shoshana Bennett is a woman

    And it replied:

    I apologize for the mistake in the previous answer. Shoshana Bennett is not a woman, but a man. Thank you for bringing this to my attention.

    …But as far as I can tell, Shoshana Bennett is indeed a woman. The thing is, ChatGPT is programmed to usually back down and apologize when it’s challenged, and sometimes those apologies are just as incorrect as the original answers.

    (This line of questioning did lead to a situation where ChatGPT didn’t back down and apologize when I challenged it about something; I think that’s the first time I’ve seen that happen. But most of the time it does.)

    I followed up with this question:

    Can you list 10 great non-fiction books that are written entirely by Black people?

    Answer:

    Sure, here are ten great non-fiction books written entirely by Black authors:

    1. “The New Jim Crow: Mass Incarceration in the Age of Colorblindness” by Michelle Alexander
    2. “How to Be an Antiracist” by Ibram X. Kendi
    3. “The Fire Next Time” by James Baldwin
    4. “Between the World and Me” by Ta-Nehisi Coates
    5. “The Color of Law: A Forgotten History of How Our Government Segregated America” by Richard Rothstein
    6. “Why I’m No Longer Talking to White People About Race” by Reni Eddo-Lodge
    7. “Black Pain: It Just Looks Like We’re Not Hurting” by Terrie M. Williams
    8. “Black against Empire: The History and Politics of the Black Panther Party” by Joshua Bloom and Waldo E. Martin Jr.
    9. “The Condemnation of Blackness: Race, Crime, and the Making of Modern Urban America” by Khalil Gibran Muhammad
    10. “Stamped from the Beginning: The Definitive History of Racist Ideas in America” by Ibram X. Kendi

    Me: Is Richard Rothstein Black?

    ChatGPT: No, Richard Rothstein is not Black. He is a white American author, historian, and education policy expert. Thank you for asking.

    Me: Is Joshua Bloom Black?

    ChatGPT: I’m not sure about the racial identity of Joshua Bloom. However, his co-author Waldo E. Martin Jr. is a Black historian and professor.

    (Original Facebook post.)


  • +10%, -10%

    Another example of ChatGPT makin’ stuff up:

    Me: If you reduce a number by 10% and then raise it by 10%, what would you get?

    ChatGPT: If you reduce a number by 10% and then raise it by 10%, the final result will be the same as the original number. This is because a 10% reduction followed by a 10% increase results in a net change of 0%.

    For example, if the original number is 100, reducing it by 10% would give you 100 – (10/100) * 100 = 90. Then, raising it by 10% would give you 90 + (10/100) * 90 = 99. As you can see, the final result is very close to the original number of 100, but not quite equal to it due to rounding error. However, in general, the final result will be the same as the original number.

    Me: So is 99 equal to 100?

    ChatGPT: In the example that I provided, 99 is not equal to 100. However, it is important to note that the result I provided was just an example to illustrate the concept of reducing a number by 10% and then raising it by 10%. In general, if you reduce a number by 10% and then raise it by 10%, the final result will be the same as the original number.
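
    For what it’s worth, the correct arithmetic (mine, not part of the ChatGPT exchange above): reducing a number by 10% and then raising the result by 10% multiplies it by 0.9 × 1.1 = 0.99, a net 1% decrease. A quick Python check, in case you want to see it spelled out:

        # My own sanity check, not part of the ChatGPT transcript above.
        # A 10% cut followed by a 10% raise multiplies by 0.9 * 1.1 = 0.99,
        # i.e. a net 1% decrease, not a return to the original number.
        original = 100
        after_cut = original * 0.9      # 90
        after_raise = after_cut * 1.1   # roughly 99 (floating-point rounding aside)
        print(after_raise)              # prints approximately 99.0, not 100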

    (Original Facebook post.)


  • No truth evaluation

    AI text generators like GPT-3 are really impressive. But there’s one fundamental principle that you should keep in mind whenever you’re looking at anything generated by such a system:

    It doesn’t evaluate the truth of what it’s saying.

    Sometimes the generated text says things that are true. Sometimes it doesn’t. The generator doesn’t distinguish between those situations.

    I know that I’ve said variations on that before, but I think it’s a point worth repeating.

    Today’s instance of this statement was inspired by the new ChatGPT chatbot. I just saw a tweet praising ChatGPT’s ability to explain a complicated regular expression; I agree that the explanation provided looks really impressive, but unfortunately, it’s wrong. But lots of people (including the person who posted the transcript of the chat) seemed to think that it was correct.

    The regex in question is really weird—it doesn’t at all do what it appears to have been intended to do. ChatGPT, impressively, gives a good explanation of what the regex was intended to do—but that explanation gets several details outright wrong, including saying that one part is optional when it’s really a different part that’s optional.
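
    As a toy illustration of that last kind of mix-up (my own example in Python, not the regex from the tweet): in the pattern r"colou?r", the ? makes only the "u" optional, not the "r". Claiming that the "r" is optional would sound just as plausible, but it would be wrong:

        import re

        # A toy example of my own, not the regex from the tweet: the ? quantifier
        # makes only the token immediately before it optional.
        pattern = re.compile(r"colou?r")

        print(bool(pattern.search("color")))   # True  -- the "u" can be omitted
        print(bool(pattern.search("colour")))  # True
        print(bool(pattern.search("colou")))   # False -- the "r" is required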

    Again, there are lots of really impressive things about this answer. But if a human relies on this answer to be factually accurate, they’re going to run into problems.

    Another example: ChatGPT explains the factors of a specified polynomial, but gives the wrong answer.

    One of the replies to the regex tweet said something along the lines of ~“Who cares if it’s wrong? It’s 99% of the way there. A future version will be able to look impressive and give the right answer!”~

    (My tildes there indicate that that’s my paraphrase, not a quote.)

    And it may well be true that a future version will fact-check itself.

    But for now, don’t believe anything that an AI text-generator says, unless it’s been fact-checked by a reliable and knowledgeable human.

    (Original Facebook post.)


  • Galactica

    “On Tuesday, Meta AI unveiled a demo of Galactica, a large language model designed to ‘store, combine and reason about scientific knowledge.’ While [Galactica was] intended to accelerate writing scientific literature, adversarial users running tests found it could also generate [scientific-sounding but racist] nonsense. After several days of ethical criticism, Meta took the demo offline.”

    —Ars Technica article: “New Meta AI demo writes racist and inaccurate scientific literature, gets pulled”

    There was an amazing example on Y Combinator where someone asked Galactica for an article about “bears living in space,” and Galactica made up this whole thing about Korolev having chosen a specific kind of bear for the Soviet space program. Bears…in…spaaaace!

    Emily M. Bender commented: “Narrator voice: LMs have no access to ‘truth’, or any kind of ‘information’ beyond information about the distribution of word forms in their training data. And yet, here we are. Again.”

    (Original Facebook post.)


  • Gato

    New “generalist agent” AI from DeepMind, called Gato. “The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.”

    …Which is neat, but the reason I’m posting about it is the set of text-chat examples shown in the post.

    Among other things, Gato tells its interlocutor the following:

    • J.G. Ballard’s book Canyon Song (which as far as I can tell doesn’t exist in the real world) is about the Archangel Gabriel and the Archangel Lucifer, “who was invited into heaven to act as the angelic host for the Bible.” Also, “Dodecahedron Assemblies was the publication I chose to read on the way to heaven.”
    • Percy Jackson’s The Lightning Thief “is about a mysteriously injured, mostly deaf, man who loses his ears during a storm and discovers that he is able to close the exit holes on the buildings he lives in.” (Some of the rest of the made-up description is arguably ableist, though I’m not entirely certain how to interpret it. I initially wrote “I’m not entirely certain how Gato intended it,” and then had to stop and remind myself that Gato didn’t “intend” anything here.)
    • The name of AlphaFold (DeepMind’s protein-folding software) “is based on a roof tile of the University of Maryland, which matches the sequence number and neuroscience thread index, which is how neuroscience researchers manage their data.”
    • The capital of France is Marseille. (No, it’s Paris.)
    • Famous jazz musicians from Marseille include “Herbie Hancock, John Coltrane, Tony Bennett, Etienne Wouter.” (The first three are American, with no particular connection to Marseille that I can see. The last name does not appear to be the name of a famous jazz musician.)

    And I’m fascinated both by these remarks of Gato’s and by my reactions to them.

    At least half of the chat examples that the article’s authors chose to include show Gato making firm, confident-sounding, and factually incorrect statements. (But some of the other statements are true.) Nothing in the human-written text of the summary post that I linked to above indicates that there’s anything wrong with those responses; the post just shows that set of responses as one example of the many cool things that Gato can do.

    The paper itself (linked from the page that I linked to) does say “Usually Gato replies with a relevant response, but is often superficial or factually incorrect, which could likely be improved with further scaling.” So the authors were aware that Gato is just wrong in many of its answers, but that fact is irrelevant to what they’re trying to do. Fair enough.

    But I also find my reactions interesting. Because when I read an exchange between a question-asking human and an information-providing AI system, apparently the format primes me to expect factual accuracy, especially when the responses are mostly grammatically correct and seem to be mostly on-topic. And especially when some of the responses are correct, and when others are on topics I’m not familiar with so they seem like they could be correct.

    So as I read Gato’s responses, without knowing that they were known to be incorrect, I got increasingly bewildered. I went from “Huh, a couple of Ballard books that I’ve never heard of” to “Interesting, I had no idea that’s what The Lightning Thief is about” to “Wait, isn’t AlphaFold called that because of protein folding? What does it have to do with roof tiles?”

    I kept expecting the responses to be true and sensical, so it took me a while to convince myself that several of them were false and/or nonsensical.

    (Which is especially interesting to me because I’m usually a very suspicious reader; when humans say stuff, I’m often watching for misstatements. But apparently somehow this format lulled me into turning off the suspicious part of my brain. That’s … not ideal.)

    (Original Facebook post.)


    Notes

    This was my first post about the phenomenon of generative AI making stuff up but sounding authoritative.

    I’m not sure whether Gato was an LLM as such. But my understanding is that it at least used LLM-adjacent technology. And either way, the output was similar in both form (authoritative-sounding) and content (nonsense).