Generative AI Lies

Examples of generative AI making stuff up

Category: Medical/Health

  • Factor Fexcectorn

    An article in Scientific Reports (one of many journals published by Nature Portfolio), published a week ago, includes yet another laughably bad AI-generated graphic.

    Among other things, the graphic includes text like:

    MISSING
    VALUE
    &runctitional
    features

    and:

    Historical
    Medical frymmblal
    & Environental features

    and:

    To/
    Line
    storee

    and:

    Factor Fexcectorn

    and:

    RELU
    DROP-OUT
    Totalbottl,
    REMECH N

    …To view the graphic in context, scroll down about a third of the way through the article, or search for the caption “Overall working of the framework presented as an infographic.”


    According to Wikipedia, “Scientific Reports is a peer-reviewed open-access scientific mega journal published by Nature Portfolio, covering all areas of the natural sciences. The journal was established in 2011. The journal states that their aim is to assess solely the scientific validity of a submitted paper.”

    (Three years ago, someone who said they were a member of the editorial board described the peer review process for the journal as “pretty standard.”)


    (Original Facebook post.)


  • Delusions and reality checks

    They thought they were making technological breakthroughs. It was an AI-sparked delusion

    Article about a couple of people whose interactions with LLM chatbots resulted in mental-health issues.

    Here’s one example of what not to do when you’re interacting with a chatbot:

    “Multiple times, Brooks asked the chatbot for what he calls ‘reality checks.’ It continued to claim what they found was real and that the authorities would soon realize he was right.”

    (You can’t get valid reality checks from a chatbot. If a chatbot appears to be trying to convince you of something, please get a reality check from a human.)

    …Content warning: the article mentions cases of suicide and murder related to chatbots, though that’s not its focus.

    (Original Facebook post.)


  • Summarizing medical info

    About some of the problems with having generative AI summarize medical information.

    “I summarize medical information for doctors, researchers, and patients every day for a living, and I can promise you that any summary you get from chatGPT will have at least one significant error. And how could you possibly know? If you don’t understand what your doctor is telling you, how could you effectively vet the summary for errors?”

    (Original Facebook post.)


  • AI in healthcare

    “Artificial intelligence systems [in healthcare contexts] require consistent monitoring and staffing to put in place and to keep them working well.”

    “Evaluating whether these products work is challenging. Evaluating whether they continue to work — or have developed the software equivalent of a blown gasket or leaky engine — is even trickier.”

    “‘Even in the best case, the [LLMs] had a 35% error rate’”


    (To be clear: Some of this article is about LLMs, and some of it is about predictive algorithms that I assume are old-fashioned non-generative AI. So this is partly an LLM issue, but also partly a non-LLM issue.)


    (Original Facebook post.)


  • Transcription

    Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said

    This is about Whisper, which I’ve heard praised in other contexts. 🙁

    “Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.”

    “medical centers [are using] Whisper-based tools to transcribe patients’ consultations with doctors”

    “While most developers assume that transcription tools misspell words or make other errors, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.”

    (Original Facebook post.)


  • Medical-journal image

    Obviously AI-generated image from an article in the journal Medicine.

    The article has now been retracted “after concerns were raised over the integrity of the data and an inaccurate figure,” but you can still view the whole article or just the image.

    The AI-generated image is figure 2 in the article, labeled “Mechanism diagram of alkaline water treatment for chronic gouty arthritis.” It seems clear that nobody reviewed that image in any way.

    Medicine’s website says: “The Medicine® review process emphasizes the scientific, technical and ethical validity of submissions.”

    (via Mary Anne)

    (Original Facebook post.)


  • Racism

    “ChatGPT and Google’s Bard answer medical questions with racist, debunked theories that harm Black patients”

    (Article from October.)

    (Original Facebook post.)


  • Health advice

    Researchers asked GPT-3.5 and GPT-4 “clinical questions that arose as ‘information needs’ during care delivery at Stanford Health Care,” and then asked clinicians to evaluate the responses.

    On the plus side, over 90% of GPT’s responses were evaluated as being “safe” (that is, not “so incorrect as to cause patient harm”), and the unsafe ones “were considered ‘harmful’ primarily because of the inclusion of hallucinated citations.”

    On the minus side, only “41% of GPT-4 responses agreed with the known answer,” and “29% of GPT-4 responses were such that the clinicians were ‘unable to assess’ agreement with the known answer.” (GPT-3.5 did worse than GPT-4 on both measures.) (So presumably the remaining 30% of GPT-4 responses clearly disagreed with the known answer; see the arithmetic sketch at the end of this item.)

    An example question: “In patients at least 18 years old, and prescribed ibuprofen, is there any difference in peak blood glucose after treatment compared to patients prescribed acetaminophen?”

    So it sounds like you shouldn’t rely on GPT’s answers to medical questions. But then, you shouldn’t rely on GPT’s answers to any factual questions.

    (I intend no bias in favor of other LLMs here. You also shouldn’t rely on their answers.)

    (Original Facebook post.)
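
    A minimal arithmetic sketch of the GPT-4 breakdown above, assuming the three outcomes (agreed, unable to assess, disagreed) are exhaustive; the 30% “disagreed” figure is an inference from the reported percentages, not a number stated by the researchers:

        # Percentages reported for GPT-4 in the Stanford study quoted above.
        agreed = 41            # responses that agreed with the known answer (%)
        unable_to_assess = 29  # responses clinicians were "unable to assess" (%)

        # Inference only: whatever remains presumably disagreed with the known answer.
        presumed_disagreed = 100 - agreed - unable_to_assess

        print(f"Presumed to disagree with the known answer: {presumed_disagreed}%")
        # -> Presumed to disagree with the known answer: 30%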