Major Study Finds High Levels of Mistakes in AI-Generated News Summaries
45% of all AI answers had at least one significant issue, according to a BBC, EBU study
A groundbreaking new study by the BBC and the European Broadcasting Union (EBU) has found serious problems with news summaries generated by AI assistants.
In nearly half (45%) of the AI-generated news summaries, researchers found at least one significant error. All of the AI tools showed significant problems, though some performed worse than others. Google’s Gemini performed worst, with significant issues in 76% of responses, more than double the rate of the other assistants, largely due to its poor sourcing performance.
The findings raise serious doubts about the widespread use of AI-generated news summaries by consumers and about news organizations' plans to use those summaries.
Tech giants like Google, Facebook and Amazon have been investing hundreds of billions of dollars in AI, in part because AI-generated summaries could potentially replace existing search tools and provide them with the kind of massive advertising revenue currently enjoyed by Google.
In arguably the largest study of its kind, the research involved 22 public service media (PSM) organizations in 18 countries working in 14 languages. Professional journalists from participating PSM evaluated more than 3,000 responses from ChatGPT, Copilot, Gemini, and Perplexity against key criteria, including accuracy, sourcing, distinguishing opinion from fact, and providing context, the BBC reported.
The study also found that the poor performance of the AI assistants was in fact a slight improvement over an earlier BBC study. “First – there have been improvements since the earlier BBC study,” the report noted. “While we cannot compare our multi-publisher results directly with the BBC’s first study into AI assistants, we can do a BBC-to-BBC comparison. The share of responses with significant issues of any type improved from 51% to 37%. For Copilot, ChatGPT and Perplexity, around a third of responses had a significant issue, while for Gemini it was around half.”
In a second major conclusion, the study noted that, “despite the improvement seen in the BBC-to-BBC comparison, the multi-market research shows errors remain at high levels, and that they are systemic, spanning all languages, assistants and organizations involved. Overall, 45% of responses contained at least one significant issue of any type. Sourcing is the single biggest cause of significant issues (31%).”
“Of particular concern for publishers are sourcing errors that misrepresent them, such as when a response misattributes an incorrect claim to them,” the study found. “Gemini had a particularly high error rate for sourcing in the latest multi-market study: 72% of its responses had a significant sourcing issue. All other assistants were below 25%.”
The full study can be found here, with additional commentary and tools from the BBC here.
George Winslow is the senior content producer for TV Tech. He has written about the television, media and technology industries for nearly 30 years for such publications as Broadcasting & Cable, Multichannel News and TV Tech. Over the years, he has edited a number of magazines, including Multichannel News International and World Screen, and moderated panels at such major industry events as NAB and MIP TV. He has published two books and dozens of encyclopedia articles on such subjects as the media, New York City history and economics.

