A new benchmark designed to push the limits of artificial intelligence is challenging the notion of an imminent AI takeover. Dubbed “Humanity’s Last Exam” (HLE), the test was developed by researchers at the Center for AI Safety and Scale AI to move beyond simple memorization and assess true understanding. By presenting leading AI models with problems that even search engines struggle to solve, scientists are revealing that despite their computational power, AI still falls short when it comes to advanced, specialized knowledge.
A Rigorous Selection Process
Humanity’s Last Exam isn’t a general knowledge quiz; it’s a formidable intellectual challenge. The project involved over 1,000 experts from 500 prestigious institutions who initially submitted approximately 70,000 questions. Strict criteria were applied: each question had to have a single, verifiable answer and be unfindable on the internet. This key requirement prevents models from simply regurgitating information learned during training – a limitation of many existing AI benchmarks.
The approach also leverages AI against itself. Each question proposed by the experts was first vetted by top-performing models like GPT-4o and Claude 3.5 Sonnet. If a machine answered correctly, the question was rejected as too easy. Only 2,500 questions, each requiring doctoral-level expertise, were retained, forming a dataset so difficult that it challenges even graduate students in fields like law and theoretical physics.
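The vetting step described above amounts to an adversarial filter: a candidate question survives only if no frontier model can already answer it. The sketch below illustrates that logic in Python; the model callables, questions, and exact-match comparison are illustrative stand-ins, not the researchers' actual pipeline (which would involve API calls and more forgiving answer grading).

```python
# Hypothetical sketch of an HLE-style adversarial filter.
# The models and questions here are toy stand-ins for illustration only.
from typing import Callable, Iterable

def adversarial_filter(
    questions: Iterable[tuple[str, str]],          # (question, verified answer) pairs
    frontier_models: list[Callable[[str], str]],   # each maps a question to an answer
) -> list[tuple[str, str]]:
    """Keep only the questions that every frontier model answers incorrectly."""
    retained = []
    for question, answer in questions:
        # If any model already gets it right, the question is too easy:
        # a correct answer may just reflect memorized training data.
        if any(model(question).strip() == answer for model in frontier_models):
            continue
        retained.append((question, answer))
    return retained

# Toy stand-ins for the vetting models.
def model_a(q: str) -> str:
    return "Paris" if "capital of France" in q else "unknown"

def model_b(q: str) -> str:
    return "unknown"

candidates = [
    ("What is the capital of France?", "Paris"),             # too easy: rejected
    ("Compute the obscure invariant X of object Y.", "42"),  # survives vetting
]

kept = adversarial_filter(candidates, [model_a, model_b])
print(kept)  # only the hard question remains
```

One design note: rejecting a question when *any* model answers it correctly (rather than a majority) mirrors the article's description, where a single correct machine answer was enough to disqualify a submission.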
Initial results, released in January 2025, were telling. OpenAI’s o1 model, widely considered a leader in reasoning capabilities, achieved a score of just 8.3%. This performance underscored the gap between information processing and genuine comprehension in specialized domains. HLE forces AI to move beyond statistical patterns and venture into the realm of pure reasoning and scientific abstraction.
The Limits of Artificial Intelligence
As of February 12, 2026, progress has been made, but the peak remains distant. Google’s Gemini 3 Deep Think currently holds the world record with a score of 48.4%. While this represents significant improvement over the past year, it still falls far short of the 90% success rate achieved by human experts in their respective fields. This gap demonstrates that while AI is advancing in its ability to handle complex concepts, it still lacks the nuanced analytical skills of a seasoned researcher.
The distinction between success on this exam and the arrival of Artificial General Intelligence (AGI) is a point emphasized by the authors of the study, published in Nature. Achieving a high score on HLE would demonstrate advanced scientific knowledge, but wouldn’t prove autonomous research capabilities. As neuroscientist Manuel Schottdorf points out, success on the exam is a necessary, but not sufficient, condition for claiming machines have reached true intelligence.
The significance of “Humanity’s Last Exam” extends beyond technical competition. It’s about defining what makes the human mind unique: the ability to solve novel problems without relying on a vast catalog of pre-established answers. While developers hope to surpass the 50% mark by the end of 2026, this test will likely remain the most reliable measure of whether AI can truly think for itself, or whether it remains a sophisticated mirror of our own memorized knowledge. The benchmark highlights the ongoing quest to create AI that doesn’t just process information, but understands it.