Examples of generative AI making stuff up

Posts

Task list

April 2, 2024

(Tasks)

I just saw a recommendation for an online task-management-helper tool called “Magic ToDo” that looks kinda interesting.

The idea is that you enter a task and click a button, and it breaks the task into steps. You can then have it break down any of those steps into substeps, and so on.

On the plus side, I tried it on a few tasks, and it seemed to work OK on most of them. (But see an example of it going wrong in comments on this post.) And the tool is free, and it’s “designed to help neurodivergent people with tasks they find overwhelming or difficult.”

On the minus side, it uses generative AI to do the task breakdown, so of course it sometimes gets things significantly wrong. (And it has one of those everything-here-could-be-wrong disclaimers that I was complaining about the other day.)

So … I don’t know whether any of y’all would find it useful, but I thought it was worth linking to.

https://goblin.tools

The first task that I entered was “Bake a cake.”

It gave me what I (a non-baker) think was a reasonably good overview of the steps to follow. The first one was:

>Gather all necessary ingredients and materials

So I asked it to break that into substeps. Those, too, were more or less reasonable. The first one said:

>>Check pantry for ingredients

I asked it to break that into substeps. The first one was:

>>>Check pantry for flour

I asked it to break that into substeps too. Here were the steps for how to check the pantry for flour:

>>>>Check pantry for flour

>>>>Gather all necessary ingredients and materials

>>>>Bake a cake

I didn’t try any other multi-level step breakdowns, so I don’t know whether it produces similarly incoherent results if you go too many levels down in other tasks.
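
For anyone curious how a tool might catch this kind of loop, here is a minimal sketch in Python. It is not how Magic ToDo works (I have no idea how it's implemented); the function name `circular_substeps` is made up for illustration. It just flags any returned substep that repeats the step being broken down or one of that step's ancestors, using the steps quoted above.

```python
def circular_substeps(ancestors, substeps):
    """Return the substeps that repeat the current step or one of its ancestors.

    `ancestors` is the path from the original task down to the step being
    broken down (inclusive). Comparison ignores case and outer whitespace.
    """
    seen = {step.strip().lower() for step in ancestors}
    return [s for s in substeps if s.strip().lower() in seen]


# Path from the original task down to the step I asked it to break down.
path = [
    "Bake a cake",
    "Gather all necessary ingredients and materials",
    "Check pantry for ingredients",
    "Check pantry for flour",
]

# The substeps the tool returned for "Check pantry for flour".
returned = [
    "Check pantry for flour",
    "Gather all necessary ingredients and materials",
    "Bake a cake",
]

flagged = circular_substeps(path, returned)
print(flagged)  # all three returned substeps are flagged
```

An exact-match check like this would only catch verbatim repeats, so it's a rough illustration rather than a real fix.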

(Original Facebook post.)
NYC business chatbot

April 1, 2024

(Policies)

“NYC’s business chatbot is reportedly doling out ‘dangerously inaccurate’ information”

“An investigation by The Markup found the chatbot sometimes gets city policies wrong in its responses.”

A “spokesperson for the NYC Office of Technology and Innovation” says this isn’t a problem because there’s a disclaimer that says not to rely on the info given by the chatbot. (I’m paraphrasing.)

(Edited to add, because someone asked: This isn’t an April Fool’s joke.)

(Original Facebook post.)
Malware vector

April 1, 2024

(Software development)

A vector for malware, enabled by generative AI (LLMs):

If you ask an LLM to write code for you, the resulting code may include the names of software packages that don’t exist.

In theory, that might not be a big deal. If a human tries to run that code, they’ll find that the packages don’t exist.

But a security researcher has now found that sometimes LLMs repeatedly make up the same names for nonexistent software packages. And he created and published a real software package with one of those recurring names.

And that package has now been downloaded over 15,000 times.

The real package didn’t contain malware, but the researcher’s point is that it could have.

So if you’re a software developer, and you’re using code written by an LLM, maybe check that all of the dependencies that it tells you to rely on are legitimate.
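
One way to do that check, sketched in Python under some assumptions: you keep your own list of packages that you've already vetted, and your dependencies are listed in a simple requirements-style text file. The function names (`dependency_names`, `unvetted`) and the package name `fast-json-helperz` are made up for this example. Note that just checking whether a package exists in the public index isn't enough, because the whole point of the research is that a made-up name may now exist.

```python
import re


def normalize(name):
    """Normalize a package name (lowercase; runs of -, _, . become one -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def dependency_names(requirements_text):
    """Extract package names from simple requirements-style lines."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(normalize(match.group(0)))
    return names


def unvetted(requirements_text, vetted):
    """Return the dependencies that are not on your own vetted list."""
    vetted_names = {normalize(name) for name in vetted}
    return [n for n in dependency_names(requirements_text) if n not in vetted_names]


# Hypothetical requirements, as an LLM might suggest them.
requirements = """
requests>=2.31
numpy
fast-json-helperz==0.1  # made-up name, for illustration only
"""

print(unvetted(requirements, {"requests", "numpy"}))  # ['fast-json-helperz']
```

Anything that this flags still needs a human to go look at the package and decide whether it's legitimate.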

(Original Facebook post.)
Cohen legal filing

January 3, 2024

(Alphabet/Google, Generated legal documents)

“Michael Cohen [(Trump’s former lawyer)] used fake cases created by AI in bid to end [Cohen’s] probation”

“In the filing, Cohen wrote that he had not kept up with ‘emerging trends (and related risks) in legal technology and did not realize that Google Bard was a generative text service that, like ChatGPT, could show citations and descriptions that looked real but actually were not.’ To him, he said, Google Bard seemed to be a ‘supercharged search engine.’”

(Original Facebook post.)
Salt on fires

December 27, 2023

(Emergencies)

David Levine gives a particularly strong example of the kinds of problems that we’re now facing with LLM-generated text: “is it a good idea to use salt to put out a kitchen fire?”

A friend of a friend, elsewhere, pointed out that this problem isn’t new. That’s true to some extent; we’ve had sites with bad information that have used SEO to show up high in search results for a long time. (And, of course, we’ve had humans disseminating false information for much longer.)

But I think that the main things that are new and different here are:
- The speed at which LLMs can generate huge quantities of text. It’s easier and faster than ever before to create lots of text to fill up fake websites.
- The plausibility/authoritativeness of tone. LLMs are really good at generating grammatically correct English sentences that sound like they were written by people who know what they’re talking about. Ever since the first time I encountered LLM-generated text, I’ve been stumbling over this: if I don’t think carefully about it, it’s easy for me to believe that what the LLM says must be true. And that’s despite the fact that I’ve been pointing out and complaining about false and misleading stuff online since, oh, the ’90s if not earlier. (I remember the good old days of the alt.folklore.urban newsgroup…)

Both of those things are mostly a difference of degree, not of kind. It has always been possible to generate large quantities of false but authoritative-sounding text. But in the past, it took more time and work and skill to do that well.

Or to put that another way:

LLMs have democratized misinformation-generation.

(Original Facebook post.)
Counting Xs

December 27, 2023

(Letters/Spelling)

Jaym shared a great post, in which Henry Lien points out that ChatGPT doesn’t handle it well when you ask it to count the number of Xs in a sentence.

I decided to try it myself, and I got even more entertainingly wrong answers. Here’s how my exchange went:

User: How many Xs are there in this sentence?

ChatGPT: It appears that there are two “Xs” in your question: one in “How many Xs” and one in “sentence.”

User: Sorry, that’s incorrect.

ChatGPT: I apologize for the mistake. Let me try again. There are four instances of the letter “X” in your original question: one in “How,” one in “Xs,” one in “sentence,” and one in “X.”

User: That’s still wrong. Please take a deep breath and go step by step to figure out the answer. [I used that framing because in some contexts, it gets ChatGPT to give more accurate answers.]

[An error occurs, saying that ChatGPT was unable to answer the question. I click the retry button.]

ChatGPT: I appreciate your patience. Let’s go through the sentence step by step:

“How many Xs are there in this sentence?”

The letter “X” appears three times in the sentence.
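
For comparison, here is how a few lines of ordinary (non-AI) Python would count the letter. The function name `count_letter` is just something I made up for this sketch; it counts both uppercase and lowercase occurrences.

```python
def count_letter(text, letter):
    """Count occurrences of a letter in text, ignoring case."""
    return sum(1 for ch in text if ch.lower() == letter.lower())


sentence = "How many Xs are there in this sentence?"
print(count_letter(sentence, "x"))  # 1 (the X in "Xs")
```

So the correct answer is one, which ChatGPT didn't land on in any of its three tries.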

(Original Facebook post.)
European elections

December 23, 2023

(Microsoft)

“Researchers say Bing made up facts about European elections”

“Human rights organization AlgorithmWatch said in a report that it asked Bing Chat—recently rebranded as Copilot—questions about recent elections held in Switzerland and the German states of Bavaria and Hesse. It found that one-third of its answers to election-related questions had factual errors”

(Article from Dec. 15.)

(Original Facebook post.)
Book blurbs

December 1, 2023

(Literature)

A month ago, File 770 carried an item in which an author said that he had asked ChatGPT for a blurb for his book. He seemed to be delighted that ChatGPT had not only read his book, but loved it; ChatGPT called it a “captivating narrative.” The author continued:

Could [ChatGPT] have reached into my computer and read my novel and then compared it to the thousands of other novels and their reviews that have been uploaded to its massive database? And now makes a value judgement like that?

It could do so in a nanosecond.

The author claimed to “know a bit about AI,” but unfortunately appeared to misunderstand what ChatGPT does.

A few commenters provided corrections/clarifications, but anyone who didn’t read the comments may have been left with the impression that ChatGPT “know[s] more than humans do” and that it read this book and thought it was great.

I’m posting all this not to pick on the specific author, nor on File 770’s editor/owner, but rather because it’s just one instance of what I suspect is a very widespread misunderstanding.

So, just in case anyone who sees this post of mine is uncertain:

ChatGPT is more like a game of Mad Libs than like a person. It’s not sentient. It doesn’t “read” a book, and it doesn’t “make a value judgement” about a book. The way it works (very roughly) is that, given some input words, it chooses a sequence of words based on which word is the most likely next word in the sequence. (Where it determines what’s “most likely” by having been trained on an enormous number of examples of text written by humans.)

Its training included a vast number of book blurbs (I would guess millions of them), so when it’s asked to write a book blurb, it creates a sequence of words that looks a lot like all of the other blurbs out there.

So ChatGPT didn’t evaluate this book and find it “captivating.” Instead, ChatGPT created a sequence of words that was similar to the sequences of words used in other blurbs, and the phrase “captivating narrative” is a very common phrase, so it used that phrase.
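
To make the "most likely next word" idea concrete, here is a toy sketch in Python. It is emphatically not how ChatGPT is built (real LLMs use large neural networks and don't simply count word pairs), and the three "blurbs" are ones I made up for the example. But it shows how a program can produce "captivating narrative" purely from word statistics, without reading anything.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for each word, which words follow it in the training texts."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def continue_text(follows, start, length):
    """Extend `start` by repeatedly appending the most common next word."""
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:  # no known continuation for this word
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


# Made-up training "blurbs", for illustration only.
blurbs = [
    "a captivating narrative of love and loss",
    "a captivating narrative that grips the reader",
    "a captivating tale of courage",
]

model = train_bigrams(blurbs)
print(continue_text(model, "a", 2))  # a captivating narrative
```

The program picks "narrative" after "captivating" only because that pairing shows up most often in its training text. It has no opinion about any book.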

(Original Facebook post.)
Racism

November 25, 2023

(Alphabet/Google, Medical/Health, OpenAI)

“ChatGPT and Google’s Bard answer medical questions with racist, debunked theories that harm Black patients”

(Article from October.)

(Original Facebook post.)
Take a deep breath

September 25, 2023

(Accuracy measurements, Alphabet/Google, Math)

We’ve known for a while that telling a generative AI system (“LLM”) to work step by step on a math problem improves the accuracy of results.

Now researchers have found a specific sentence that works particularly well to improve accuracy (at least when using Google’s PaLM 2 LLM): “Take a deep breath and work on this problem step by step.”

But just because that improved accuracy doesn’t mean it resulted in super high accuracy:

“The phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, which is a data set of grade-school math word problems. By comparison, PaLM 2, without any special prompting, scored only 34 percent accuracy on GSM8K, and the classic ‘Let’s think step by step’ prompt scored 71.8 percent accuracy.”

So these kinds of phrases result in a big improvement, but they’re still only getting 70% to 80% accuracy (at least on PaLM 2).

On the one hand, the fact that an LLM can achieve 80% accuracy on answering mathematical word problems is neat and impressive from an AI-theory point of view. On the other hand, from an answering-questions-accurately point of view, that means that even at its best, it gets the answer wrong one time in five.
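
Here is the arithmetic behind that "one time in five," as a small Python sketch. The accuracy figures are the ones from the quote above; the function name `wrong_one_in` is made up for this example.

```python
def wrong_one_in(accuracy_percent):
    """Given an accuracy percentage, return N such that about 1 answer in N is wrong."""
    error_percent = 100 - accuracy_percent
    return 100 / error_percent


# GSM8K accuracy figures for PaLM 2, as quoted above.
accuracies = {
    "no special prompting": 34.0,
    "Let's think step by step": 71.8,
    "Take a deep breath and work on this problem step by step": 80.2,
}

for prompt, accuracy in accuracies.items():
    print(f"{prompt}: wrong about 1 time in {wrong_one_in(accuracy):.1f}")
```

At 80.2 percent accuracy, the error rate is 19.8 percent, which works out to roughly one wrong answer in every five.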

So the moral of my post here is the same as the moral of most of my posts about LLMs:

Don’t trust the answers that LLMs provide. They are often false.

(Original Facebook post.)