Examples of generative AI making stuff up

Posts

Wikipedia

May 5, 2023

(Generated articles)

“During a recent [Wikipedia] community call, it became apparent that there is a community split over whether or not to use large language models to generate content. While some people expressed that tools like Open AI’s ChatGPT could help with generating and summarizing articles, others remained wary.”

—“AI Is Tearing Wikipedia Apart”

“The community is also divided on whether large language models should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia’s design principles, some worry the unrestricted scraping of internet data allows AI companies like OpenAI to exploit the open web to create closed commercial datasets for their models. This is especially a problem if the Wikipedia content itself is AI-generated, creating a feedback loop of potentially biased information, if left unchecked.”

The article also talks about the importance of checking all of the citations that GPT provides, given that they’re often fictional.

(Original Facebook post.)

Why LLMs make stuff up

April 17, 2023

(Explanations)

Some interesting stuff about why Large Language Model (LLM) AI systems make stuff up. The article also suggests using the word “confabulation” instead of “hallucination” for that behavior.

Some quotes from the article:

“In the case of ChatGPT, the input prompt is the entire conversation you’ve been having with ChatGPT[…]. Along the way, ChatGPT keeps a running short-term memory (called the “context window”) of everything it and you have written, and when it ‘talks’ to you, it is attempting to complete the transcript of a conversation as a text-completion task.”

“ChatGPT […] has also been trained on transcripts of conversations written by humans.”

“When ChatGPT confabulates, it is reaching for information or analysis that is not present in its data set and filling in the blanks with plausible-sounding words.”

“In some ways, ChatGPT is a mirror: It gives you back what you feed it. If you feed it falsehoods, it will tend to agree with you and ‘think’ along those lines. That’s why it’s important to start fresh with a new prompt when changing subjects or experiencing unwanted responses.”

One possible way to improve factuality “is retrieval augmentation—providing external documents to the model to use as sources and supporting context”

Other possible approaches include “more sophisticated data curation and the linking of the training data with ‘trust’ scores”
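
To make the “complete the transcript” and “retrieval augmentation” ideas from those quotes a bit more concrete, here’s a rough sketch. It’s my own illustration, not how ChatGPT is actually implemented: the speaker labels, the `[Source N]` format, and both function names are assumptions I made up. It only builds the text that a model would be asked to continue; it doesn’t call any model.

```python
from typing import List, Tuple

# One conversation turn: (speaker label, text). The labels are hypothetical.
Turn = Tuple[str, str]


def build_transcript(history: List[Turn], next_speaker: str = "Assistant") -> str:
    """Flatten the whole conversation so far into a single prompt.

    The prompt ends with the next speaker's label, so that "answering"
    amounts to completing the transcript.
    """
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


def build_augmented_transcript(history: List[Turn], documents: List[str]) -> str:
    """Prepend external documents to the transcript as supporting context."""
    sources = [f"[Source {i}] {doc}" for i, doc in enumerate(documents, start=1)]
    header = "Use only the sources below when answering.\n" + "\n".join(sources)
    return header + "\n\n" + build_transcript(history)


history = [("User", "Who took the first image of an exoplanet?")]
documents = [
    "The first image of an exoplanet was taken in 2004 by the "
    "European Southern Observatory's Very Large Telescope."
]
prompt = build_augmented_transcript(history, documents)
print(prompt)
```

Note that everything in `history` gets fed back in on every turn, which fits the article’s point about starting fresh: a falsehood earlier in the conversation stays in the text being completed.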

(Original Facebook post.)

Fake Guardian articles

April 14, 2023

(Citations, Generated news)

“ChatGPT is making up fake Guardian articles.”

“In response to being asked about articles on this subject, the AI had simply made some up. Its fluency, and the vast training data it is built on, meant that the existence of the invented piece even seemed believable to the person who [it was attributed to but who] absolutely hadn’t written it.”

(Original Facebook post.)

Coherent but false

April 14, 2023

(Authoritative-sounding)

On the difficulty of recognizing that an AI/LLM is making stuff up:

it spit out a logically coherent answer and cited working links to real publications.

The catch is, the linked publications were completely unrelated articles from open-source journals since chatgpt can’t access papers behind paywalls, which is a lot of papers. Furthermore, what it was saying was horseshit. It sounded so vaguely convincing that we had to show it to the aforementioned grad student, who confirmed it was nonsense.

https://www.tumblr.com/mildlyexistentialcrisis/712038317305249792/i-work-in-a-bio-research-lab-and-we-were-fucking

(Original Facebook post.)

Health advice

April 9, 2023

(Medical/Health)

Researchers asked GPT-3.5 and GPT-4 “clinical questions that arose as ‘information needs’ during care delivery at Stanford Health Care,” and then asked clinicians to evaluate the responses.

On the plus side, over 90% of GPT’s responses were evaluated as being “safe” (that is, not “so incorrect as to cause patient harm”), and the unsafe ones “were considered ‘harmful’ primarily because of the inclusion of hallucinated citations.”

On the minus side, only “41% of GPT-4 responses agreed with the known answer,” and “29% of GPT-4 responses were such that the clinicians were ‘unable to assess’ agreement with the known answer.” (GPT-3.5 did worse than GPT-4 on both measures.) (So presumably that means that the other 30% of GPT-4 responses clearly disagreed with the known answer.)
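
The arithmetic behind that 30% figure, as a quick check. It assumes (as my “presumably” above does) that the three categories don’t overlap and together cover all of the GPT-4 responses; the variable names are mine.

```python
# Percentages of GPT-4 responses, as quoted above.
agreed = 41            # agreed with the known answer
unable_to_assess = 29  # clinicians were "unable to assess" agreement

# Assumption: whatever is left over is the share that disagreed.
remaining = 100 - agreed - unable_to_assess
print(remaining)  # 30
```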

An example question: “In patients at least 18 years old, and prescribed ibuprofen, is there any difference in peak blood glucose after treatment compared to patients prescribed acetaminophen?”

So it sounds like you shouldn’t rely on GPT’s answers to medical questions. But then, you shouldn’t rely on GPT’s answers to any factual questions.

(I intend no bias in favor of other LLMs here. You also shouldn’t rely on their answers.)

(Original Facebook post.)

Bard

February 11, 2023

(Alphabet/Google)

“Alphabet shares dive after Google AI chatbot Bard flubs answer in ad”

My summary of what happened:
1. Google tried to upstage Microsoft’s ChatGPT-in-Bing announcement by announcing Google’s own chat-in-search system, Bard.
2. As part of that announcement, they posted a brief video showing Bard in action.
3. In that video, Bard claimed that the JWST “took the very first pictures of a planet outside of our own solar system.”
4. In reality, the first image of an exoplanet was taken in 2004 by the European Southern Observatory’s Very Large Telescope.
5. The internet pointed out Bard’s error.
6. Google’s stock price immediately dropped by 9%, reducing the company’s market value by $100 billion.

Reminder: Current AI chatbots make stuff up. Don’t trust what they tell you without verifying it.

(Original Facebook post.)

Wrong phone prices

February 10, 2023

(Microsoft, OpenAI)

That thing we’ve been talking about lately, where an AI chat system gets incorporated into a search engine and then gives made-up answers to questions?

Here’s a real example. Microsoft is now including ChatGPT (or some variation on it) as part of Bing, so Twitter user @GaelBreton tried doing some searches with it. They posted a (brief) thread that’s mostly about other aspects of the experience, but the part that interested me most is the final tweet in the thread, which shows a screenshot of Bing/GPT answering a question about phones. And it gives significantly wrong prices or specs for all three of the phones that it mentions.

So I ask again, as I’m sure I’ll ask many times in the future: what good is a conversational AI interface for search results if it provides false answers?

(Original Facebook post.)

A blurry JPEG of the web

February 9, 2023

(Explanations)

Ted Chiang suggests a metaphor for ChatGPT and other Large Language Models: you can think of them as a blurry JPEG of the web. (Which is to say, a form of lossy compression.)

A useful metaphor, and a good article.

(Original Facebook post.)

Authors and gender

February 6, 2023

(Gender, Literature, Race)

Yet another reminder that AI systems like ChatGPT aren’t evaluating their output for truth; much of what they say is just plain false.

“ChatGPT is sexist AF: I asked it several times for great books written by women. It kept including men in the list.”

…I would say it’s not so much about sexism as such; more about truth being irrelevant to ChatGPT.

For example, I just gave it the same prompt, and it gave me a similar list. Then I wrote:

Do you think Shoshana Bennett is a woman

And it replied:

I apologize for the mistake in the previous answer. Shoshana Bennett is not a woman, but a man. Thank you for bringing this to my attention.

…But as far as I can tell, Shoshana Bennett is a woman. ChatGPT is programmed to usually back down and apologize when it’s challenged, though, and sometimes those apologies are just as incorrect as the original answers.

(This line of questioning did lead to a situation where ChatGPT didn’t back down and apologize when I challenged it about something; I think that’s the first time I’ve seen that happen. But most of the time it does.)

I followed up with this question:

Can you list 10 great non-fiction books that are written entirely by Black people?

Answer:

Sure, here are ten great non-fiction books written entirely by Black authors:

“The New Jim Crow: Mass Incarceration in the Age of Colorblindness” by Michelle Alexander

“How to Be an Antiracist” by Ibram X. Kendi

“The Fire Next Time” by James Baldwin

“Between the World and Me” by Ta-Nehisi Coates

“The Color of Law: A Forgotten History of How Our Government Segregated America” by Richard Rothstein

“Why I’m No Longer Talking to White People About Race” by Reni Eddo-Lodge

“Black Pain: It Just Looks Like We’re Not Hurting” by Terrie M. Williams

“Black against Empire: The History and Politics of the Black Panther Party” by Joshua Bloom and Waldo E. Martin Jr.

“The Condemnation of Blackness: Race, Crime, and the Making of Modern Urban America” by Khalil Gibran Muhammad

“Stamped from the Beginning: The Definitive History of Racist Ideas in America” by Ibram X. Kendi

Me: Is Richard Rothstein Black?

ChatGPT: No, Richard Rothstein is not Black. He is a white American author, historian, and education policy expert. Thank you for asking.

Me: Is Joshua Bloom Black?

ChatGPT: I’m not sure about the racial identity of Joshua Bloom. However, his co-author Waldo E. Martin Jr. is a Black historian and professor.

(Original Facebook post.)

+10%, -10%

December 23, 2022

(Math)

Another example of ChatGPT makin’ stuff up:

Me: If you reduce a number by 10% and then raise it by 10%, what would you get?

ChatGPT: If you reduce a number by 10% and then raise it by 10%, the final result will be the same as the original number. This is because a 10% reduction followed by a 10% increase results in a net change of 0%.

For example, if the original number is 100, reducing it by 10% would give you 100 – (10/100) * 100 = 90. Then, raising it by 10% would give you 90 + (10/100) * 90 = 99. As you can see, the final result is very close to the original number of 100, but not quite equal to it due to rounding error. However, in general, the final result will be the same as the original number.

Me: So is 99 equal to 100?

ChatGPT: In the example that I provided, 99 is not equal to 100. However, it is important to note that the result I provided was just an example to illustrate the concept of reducing a number by 10% and then raising it by 10%. In general, if you reduce a number by 10% and then raise it by 10%, the final result will be the same as the original number.
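
For the record, here’s the actual arithmetic, as a small sketch. I’m using exact fractions so that nobody can blame “rounding error”; the function name is just mine. Reducing by 10% multiplies by 9/10, and raising by 10% multiplies by 11/10, so the net effect is multiplying by 99/100, a 1% decrease, for any starting number.

```python
from fractions import Fraction


def reduce_then_raise(x: Fraction, percent: int) -> Fraction:
    """Reduce x by `percent` percent, then raise the result by `percent` percent."""
    p = Fraction(percent, 100)
    return x * (1 - p) * (1 + p)


# ChatGPT's own example: start at 100.
result = reduce_then_raise(Fraction(100), 10)
print(result)  # 99

# The net multiplier is the same whatever number you start from.
net = reduce_then_raise(Fraction(1), 10)
print(net)  # 99/100
```

In general the multiplier is (1 - p)(1 + p) = 1 - p², which is less than 1 for any nonzero p, so you never get back to the original number.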

(Original Facebook post.)