Major Study Finds Many Mistakes in AI-Generated News Summaries

AI in Media & Entertainment image generated by Writecream.com (Image credit: Writecream.com)

A groundbreaking new study by the BBC and the European Broadcasting Union (EBU) has found serious problems with news summaries generated by AI assistants.

Researchers found at least one significant error in nearly half (45%) of the AI-generated news summaries. All of the assistants showed significant problems, but some performed worse than others. Google’s Gemini performed worst, with significant issues in 76% of responses, more than double the rate of the other assistants, largely because of its poor sourcing performance.

The findings raise serious doubts about consumers' widespread use of AI-generated news summaries and about news organizations' plans to use them.

Tech giants like Google, Facebook and Amazon have been investing hundreds of billions of dollars in AI, in part because AI-generated summaries could potentially replace existing search tools and provide them with the kind of massive advertising revenue currently enjoyed by Google.

The study is also notable for its scale and scope. It was coordinated by the EBU and led by the BBC.

In arguably the largest study of its kind, the research involved 22 public service media (PSM) organizations in 18 countries working in 14 languages. Professional journalists from the participating PSM organizations evaluated more than 3,000 responses from ChatGPT, Copilot, Gemini, and Perplexity against key criteria, including accuracy, sourcing, distinguishing opinion from fact, and providing context, the BBC reported.

The study also found that the poor performance of the AI assistants was in fact a slight improvement over an earlier BBC study. “First – there have been improvements since the earlier BBC study,” the report noted. “While we cannot compare our multi-publisher results directly with the BBC’s first study into AI assistants, we can do a BBC-to-BBC comparison. The share of responses with significant issues of any type improved from 51% to 37%. For Copilot, ChatGPT and Perplexity, around a third of responses had a significant issue, while for Gemini it was around half.”

In a second major conclusion, the study noted that, “despite the improvement seen in the BBC-to-BBC comparison, the multi-market research shows errors remain at high levels, and that they are systemic, spanning all languages, assistants and organizations involved. Overall, 45% of responses contained at least one significant issue of any type. Sourcing is the single biggest cause of significant issues (31%).”

“Of particular concern for publishers are sourcing errors that misrepresent them, such as when a response misattributes an incorrect claim to them,” the study found. “Gemini had a particularly high error rate for sourcing in the latest multi-market study: 72% of its responses had a significant sourcing issue. All other assistants were below 25%.”

The full study can be found here, with additional commentary and tools from the BBC here.
