Introduction
The pursuit of Artificial General Intelligence (AGI) and forecasts of when it might be achieved have been of perennial interest since the beginning of modern AI research. Despite early optimism in the 1950s that it would arrive within a few decades, most AI success over that time has come in the form of expert systems – specialized models that excel in a narrow, well-defined area. With the advent of foundational large language models (LLMs) that can be applied across a wide range of tasks, AGI has returned to the foreground of public discussion.
What is AGI, anyway?
Artificial general intelligence is a hypothetical form of artificial intelligence that can match the intelligence of human beings across any task. There is continued debate over how exactly to define intelligence at all, but a helpful distinction can be made between crystallized intelligence and fluid intelligence. First introduced by the psychologist Raymond Cattell, crystallized intelligence refers to the accumulation of knowledge, facts and learned procedures. Fluid intelligence, on the other hand, involves the ability to solve entirely novel problems through a process of reasoning. Cattell proposed that general intelligence can be decomposed into these two components. While some definitions of intelligence extend beyond cognitive intelligence to emotional, social and other forms, it seems reasonable that any sufficiently ambitious definition of AGI should include at least these two types.
Frontier AI models are now commonly tested on benchmarks that require graduate-level knowledge and beyond, including law and medical school admission tests. High scores can be achieved on these tests by memorizing answers, or at least by memorizing the procedures that lead to the answers. The latest generation of AI models score impressively well on a range of these tests across diverse areas of academia, and consequently there is a growing view that benchmark performance is saturating. In this light, it is puzzling that these same models have been unable to count the number of “r”s in the word “strawberry”, or to do basic arithmetic. Studies show that the latest reasoning models lack common sense, e.g. believing that an astronaut on the dark side of the moon would be able to see the Earth. Without diminishing the huge potential of LLMs across a range of tasks, it seems that something fundamental is still missing before they can truly be described as having a human level of intelligence, and that the benchmarks currently favoured by LLM providers paint an unrealistic picture of their intelligence, as the term has traditionally been understood.
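To underline how striking that failure is, here is a minimal sketch in plain Python, with no model involved, of the letter-counting question that tripped up earlier LLMs:

```python
# Counting characters is a trivial, deterministic computation for a program,
# yet earlier chat models famously got this question wrong.
word = "strawberry"
print(word.count("r"))  # 3
```

One common explanation is tokenization: LLMs operate on subword tokens rather than individual characters, so character-level questions fall outside their natural representation of text.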
Bold predictions
Despite these caveats, some AI leaders continue to make bold AGI forecasts. OpenAI CEO Sam Altman is now confident that his company knows how to build AGI. Anthropic CEO Dario Amodei claims AI could surpass “almost all humans at almost everything” within three years. Could it be that their definitions of AGI have significant blind spots? Intriguingly, OpenAI appear to have agreed on a working definition of AGI as any AI system that can generate at least $100 billion in profits. A prominent skeptic of these claims is François Chollet, an ex-Google computer scientist best known as the creator of the Keras deep learning library. He believes that the consensus definition of AGI, “a system that can automate the majority of economically valuable work”, is incorrect; a better definition, in his view, is a system that can efficiently acquire new skills outside of its training data. To illustrate how far we are from this type of system, Chollet developed the ARC-AGI benchmark.
ARC-AGI
Introduced in 2019, ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark explicitly designed to measure progress toward general intelligence. ARC-AGI is built on the premise that intelligence involves the capacity to adapt to novel situations with minimal prior experience. This mirrors human learning: we often grasp new concepts from just a few examples, relying on abstraction and reasoning rather than rote memorization. To this end, ARC-AGI offers very limited training data—just a handful of examples per task. This prevents the use of brute-force pattern matching and requires models to exhibit genuine reasoning.
The benchmark presents visual puzzles requiring abstract reasoning and pattern recognition. Each puzzle consists of a few input-output examples, and the AI must deduce the underlying rule and generate the correct output for a test case. The idea is to simulate skill acquisition in unfamiliar domains—a key trait of general intelligence. Frontier models have long performed poorly on it: GPT-3 scored 0%, and even the far more capable GPT-4o scored just 4.5%.
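To make the task format concrete, here is a minimal sketch of an ARC-style task in Python. The layout follows the public ARC dataset’s JSON format (“train” and “test” lists of integer grids), but the toy task and the solve function are invented for illustration; a real solver must infer the rule from the training pairs rather than hard-code it:

```python
# Minimal sketch of an ARC-style task. The layout follows the public ARC
# dataset's JSON format: "train" and "test" lists of input/output pairs,
# where each grid is a list of rows of integers 0-9 (colours).
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 3], [0, 3]]},
    ],
}

def solve(grid):
    # Hypothetical hard-coded rule for this toy task: mirror each row
    # left-to-right. A real ARC solver must discover the rule from the
    # training pairs alone; that inference step is the hard part.
    return [list(reversed(row)) for row in grid]

# A candidate rule is only trusted if it reproduces every training pair.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])
print(solve(task["test"][0]["input"]))  # [[3, 3], [3, 0]]
```

The assert step mirrors how a candidate rule is typically validated: a hypothesis is only applied to the test grid once it reproduces every training example exactly.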
Recent breakthroughs
After years of little progress, some AI systems have recently made headlines with high scores on ARC-AGI. OpenAI’s o3 model scores 53% on the first version of the benchmark, and unreleased variants have scored as high as 76%. These performances are extremely impressive, and they are underpinned by a shift in strategy for scaling reasoning performance: rather than scaling training-time compute, i.e. training progressively larger models on ever-larger datasets, they scale inference-time compute on models of the same size. Reasoning models generate a long internal chain of thought before responding to a user, enabling the model to perform a series of intermediate reasoning steps before attempting a final answer. Running o3 on the 400 public ARC-AGI puzzles in “high-efficiency” mode cost around $6,677. OpenAI have shown that there is a log-linear relationship between model accuracy and test-time compute, i.e. the amount of “thinking out loud” the model is allowed to do before giving a final answer. This means that continued improvements in accuracy require exponentially more computational resources. While effective in boosting the model’s reasoning ability, this could fairly be described as a brute-force approach to reasoning.
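To see what a log-linear relationship implies in practice, consider the following sketch. The functional form, accuracy ≈ a + b·log(compute), follows the description above; the coefficients are invented for illustration and are not fitted to any published o3 results:

```python
import math

# Hypothetical illustration of a log-linear accuracy/compute relationship:
# accuracy = a + b * log10(compute). The coefficients below are made up
# for illustration, not fitted to any published o3 figures.
a, b = 0.10, 0.15

def accuracy(test_time_compute_usd):
    return a + b * math.log10(test_time_compute_usd)

for cost in (10, 100, 1_000, 10_000):
    print(f"${cost:>6,} of test-time compute -> {accuracy(cost):.0%} accuracy")
# Each 10x increase in spend buys the same fixed 15-point gain, so every
# further improvement costs exponentially more compute.
```

Under these made-up coefficients, $10 of compute yields 25% accuracy and $10,000 yields 70%: each constant step in accuracy multiplies the required compute by a constant factor, which is exactly why squeezing out the last few points of a benchmark becomes so expensive.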
Remaining challenges
As Chollet puts it: “You’ll know AGI is here when it becomes impossible to design tasks that are easy for humans but hard for AI.” By that measure, there is still a long way to go. The ARC Prize Foundation has already released a second iteration of the benchmark with a slightly different class of problems, on which the best-performing variant of o3 scores just 4%. Ground-breaking as o3 is, it seems unlikely to be the final paradigm, the one that scales all the way to AGI.
So, how close are we to AGI? The honest answer is that no one really knows. Benchmarks like ARC-AGI are valuable tools for encouraging and shaping progress in the field, but their creators are quick to point out that human-level performance on any one benchmark does not mean AGI has arrived, nor should strong performance be taken as an indicator of how close we are to achieving it. Even a robust definition of the term remains elusive and controversial. To be clear, this uncertainty around AGI does not diminish the transformative potential of current AI systems. Productivity gains and novel applications will continue as LLMs grow in capability, extend across new modalities, and exhibit increasing degrees of agency. For the foreseeable future, however, it remains essential to carefully design, augment and align models for their intended purpose.
This is where the expertise of ProCogia’s Data Science team lies—supporting organizations in applying advanced AI tools thoughtfully and effectively. From evaluating model capabilities to building tailored solutions, our consulting services help teams make practical, well-informed decisions that align with their goals.