Experts Find Flaws in Hundreds of Tests That Check AI Safety and Effectiveness

by Sophie Williams


Experts have found weaknesses, some serious, in hundreds of tests used to check the safety and effectiveness of new artificial intelligence models being released into the world.

Computer scientists from the British government’s AI Security Institute, and experts at universities including Stanford, Berkeley and Oxford, examined more than 440 benchmarks that provide an important safety net.

They found flaws that "undermine the validity of the resulting claims": "almost all … have weaknesses in at least one area", and resulting scores may be "irrelevant or even misleading".

Many of the benchmarks are used to evaluate the latest AI models released by the big technology companies, said the study’s lead author, Andrew Bean, a researcher at the Oxford Internet Institute.

In the absence of nationwide AI regulation in the UK and US, benchmarks are used to check whether new AIs are safe, align with human interests and achieve their claimed capabilities in reasoning, maths and coding.

The investigation into the tests comes amid rising concern over the safety and effectiveness of AIs, which are being released at a high pace by competing technology companies. Some have recently been forced to withdraw or tighten restrictions on AIs after they contributed to harms ranging from character defamation to suicide.

“Benchmarks underpin nearly all claims about advances in AI,” Bean said. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”

Google this weekend withdrew one of its latest AI models, Gemma, after it made up unfounded allegations that a US senator had had a non-consensual sexual relationship with a state trooper, complete with fake links to news stories.

"There has never been such an accusation, there is no such individual, and there are no such news stories," Marsha Blackburn, a Republican senator from Tennessee, told Sundar Pichai, Google's chief executive, in a letter.

“This is not a harmless hallucination. It is an act of defamation produced and distributed by a Google-owned AI model. A publicly accessible tool that invents false criminal allegations about a sitting US senator represents a catastrophic failure of oversight and ethical responsibility.”

Google said its Gemma models were built for AI developers and researchers, not for factual assistance or for consumers. It withdrew them from its AI Studio platform after what it described as “reports of non-developers trying to use them”.

“Hallucinations – where models simply make things up about all types of things – and sycophancy – where models tell users what they want to hear – are challenges across the AI industry, particularly smaller open models like Gemma,” it said. “We remain committed to minimising hallucinations and continually improving all our models.”


Last week, Character.ai, the popular chatbot startup, banned teenagers from engaging in open-ended conversations with its AI chatbots. The move followed a series of controversies, including the suicide of a 14-year-old in Florida who became obsessed with an AI-powered chatbot that his mother claimed had manipulated him into taking his own life, and a US lawsuit from the family of a teenager who they say was manipulated by a chatbot into self-harm and encouraged to murder his parents.

The research examined widely available benchmarks; leading AI companies also maintain their own internal benchmarks, which were not examined.

It concluded there was a “pressing need for shared standards and best practices”.

Bean said a “shocking” finding was that only a small minority (16%) of the benchmarks used uncertainty estimates or statistical tests to show how likely a benchmark was to be accurate. In other cases where benchmarks set out to evaluate an AI’s characteristics – for example its “harmlessness” – the definition of the concept being examined was contested or ill-defined, rendering the benchmark less useful.
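The uncertainty estimates Bean refers to can be illustrated with a simple calculation: a benchmark accuracy score is an estimate from a finite set of test items, so it carries a margin of error that shrinks with test size. The sketch below, with invented model scores on a hypothetical 200-item benchmark, uses a standard normal-approximation confidence interval to show how an apparent gap between two models can fall inside the noise:

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Return (accuracy, lower, upper) for a ~95% normal-approximation CI.

    Treats each benchmark item as an independent pass/fail trial,
    so the standard error of the accuracy is sqrt(p * (1 - p) / n).
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Two hypothetical models on a 200-item benchmark (scores are illustrative):
p_a, lo_a, hi_a = accuracy_confidence_interval(168, 200)  # model A: 84.0%
p_b, lo_b, hi_b = accuracy_confidence_interval(174, 200)  # model B: 87.0%

# The intervals are roughly [0.79, 0.89] and [0.82, 0.92]. They overlap,
# so the 3-point headline gap may not reflect a genuine difference --
# precisely the kind of caveat most benchmarks in the study never report.
```

With only 200 items, each interval spans roughly ten percentage points, which is why the researchers flag score comparisons made without such estimates as potentially misleading.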
