Today on the AI Daily Brief, why AI needs better benchmarks, and before that in the headlines, is Apple planning on distilling Google's Gemini models?
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Robots and Pencils, Blitzy, and Superintelligent. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at [email protected], and while you're at aidailybrief.ai, check out everything going on in the ecosystem, including the return of our newsletter, which has all the links that I mention on the show. Apple's AI partnership with Google apparently goes much deeper than previously thought,
including the ability to distill Gemini into smaller models. The unveiling of the new AI Siri is a little over two months away, and we're starting to get a steady drip of information around what the product will look like. On Tuesday, Bloomberg's Apple insider Mark Gurman ran through what he knows about features and UX. Apple has reportedly backed down from their view that Siri should remain voice-only, and is now building a standard chatbot interface with optional voice controls. Gurman also reported that Siri will be deeply integrated into iOS 27, allowing it to take actions and draw context from apps running on a user's device. It sounds as though Apple will try to launch Siri with full computer use, delivering the features they advertised at the launch of Apple Intelligence two years ago. Now, we already knew that Siri would be driven by Google's Gemini models, but new reporting from The Information suggests that Apple has much more freedom in how they use Gemini than originally
thought. Previous reports said that Apple would fine-tune a Gemini model for their purposes, and that the models would be hosted on Apple's servers to ensure user privacy. However, sources speaking with The Information said that Apple has full access to the Gemini models, meaning they're able to distill large versions of Gemini into their own smaller proprietary models. Model distillation is the process of using the outputs or reasoning traces from one model to train another, essentially a cheat code to develop powerful models.
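To make that concrete, here is a minimal sketch of the classic logit-matching form of distillation in PyTorch. Labs distilling via reasoning traces instead fine-tune the student directly on teacher-generated text, but the underlying idea, pushing a small model toward a big model's behavior, is the same. Nothing here reflects Apple's or Google's actual stack.

```python
# Minimal sketch of logit-based distillation (the classic Hinton-style
# formulation). Toy tensors stand in for real model outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence pushes the student's distribution toward the teacher's;
    # the t^2 factor keeps gradient magnitudes stable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

student = torch.randn(4, 32000, requires_grad=True)  # batch x vocab
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the student
```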
Many of the Chinese labs have been accused of distilling models from Anthropic and OpenAI as a way to catch up quickly. The Information's sources said that the process isn't straightforward, as Apple's vision for Siri is very different from the way Gemini works. Gemini is optimized for chatbots, enterprise tasks, and coding, while the source implied Apple is less interested in these functions. The source was skeptical the models would actually be of much use to Apple's foundation models team for that reason. Maybe the main takeaway is that Apple hasn't entirely given up on training their own models, and could use the Google partnership to bootstrap their approach.
The most obvious target in most people's minds would be training small, capable models to run locally on an iPhone, which seems to be the core vision of where Apple wants to go with AI. One post summed up the wait-and-see attitude of most people on this news: "Huh, I'm not sure distilling Gemini models to run on phones is going to result in the generally capable agents that people will expect soon, but we shall see." Speaking of Google, the company has published
a research paper describing a new compression algorithm that could dramatically improve the performance of small models. Called TurboQuant, the process allows researchers to quantize model contexts with almost zero losses. During long conversations or long-horizon tasks, contexts can balloon to use even more memory than model weights. Functionally, quantization means context is stored with less fidelity; for example, 16-bit data might be compressed into 4-bit format. Current quantization methods are quite lossy and noticeably reduce performance. Some believe, for example, that this is the reason Anthropic's models can seem a little off during demand spikes. Google researchers say their new process massively reduces the loss associated with quantization and could make the technique far less of a trade-off. They claim their process results in a 6x reduction in the amount of memory a given model uses for storing context,
while delivering an 8x speed boost compared to current methods. This could result in a 50% reduction in inference costs and help ease the bottleneck around memory chips. Giving a concrete demonstration of what the algorithm can do, Google's researchers tested it on Llama 3.1-8B and Mistral 7B; with TurboQuant implemented, both models achieved perfect scores on a needle-in-a-haystack test.
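For a feel of what quantizing context actually does, here is a toy round-to-nearest 4-bit quantizer. This is deliberately the naive, lossy approach the paper improves on, not TurboQuant itself, whose algorithm I'm not reproducing here; the shapes and data are invented.

```python
# Toy 4-bit quantization of a fake KV cache: 16-bit floats rounded down to
# 4-bit integers plus a per-row scale factor.
import numpy as np

def quantize_4bit(x):
    # One scale per row so an outlier in one row doesn't crush the others
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0  # int4 range: -8..7
    scale = np.maximum(scale, 1e-8)                      # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale  # real kernels pack two 4-bit values per byte

def dequantize_4bit(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

kv_cache = np.random.randn(1024, 128).astype(np.float16)  # fake context
q, scale = quantize_4bit(kv_cache.astype(np.float32))
err = np.abs(dequantize_4bit(q, scale) - kv_cache).mean()
# 16 bits -> 4 bits is a 4x cut on its own; Google claims 6x overall
print(f"mean absolute error per entry: {err:.4f}")
```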
Cloudflare CEO Matthew Prince tried to explain the gravity of this breakthrough, commenting, "This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization." Others reached for a more relatable analogy, comparing this moment to when a scrappy startup cracked middle-out compression with a Weissman score of 5.2. As one user put it, "So basically, TurboQuant is Pied Piper." Now, Google isn't just shipping groundbreaking research; they also have a new music model with
Lyria 3 Pro. The first version of Lyria 3 was, to some folks, underwhelming. It wasn't that the model was bad; it just couldn't produce production-quality music like Suno, and was limited to 30 seconds, making it seem like it was more for novelty purposes than professional use cases. Lyria 3 Pro definitely addresses some of those issues. It can now create full tracks up to three minutes long, and seems to have a much better understanding of lyrics and song structure.
Rohan Paul writes, "The hard part of AI music is not making pleasant sound for 10 seconds. It's keeping a piece coherent as it moves from intro to verse to chorus without collapsing into a loop." Now, Rohan noted that Google is also pushing it in Vertex AI, AI Studio, and ... So the bigger story is probably less about the model and more about the fact that it is available via API, which could mean it finds its way into a lot more use cases. Over in the world of AI politics, Senator Bernie Sanders has unveiled his data center moratorium bill, with an assist from AOC. The legislation would pause all data center construction nationwide until "strong national safeguards," their words, are in place. The bill requires Congress to establish protections for workers and consumers, address environmental harms, and defend civil rights before lifting the moratorium. Sanders said, "AI has received far too little serious discussion here in
our nation's capital. I fear that Congress is totally unprepared for the magnitude of the changes that are already taking place." Now, the presence of AOC as a co-sponsor seems fairly relevant. Until now, Sanders has been pushing the moratorium largely by himself, with support from certain elements of the AI safety community. It hadn't found meaningful traction among elected progressives. AOC personally has been pretty much silent on the issue. Her X feed has zero mentions of data centers, and only a single post about AI, regarding the dangers of deepfakes. By supporting the bill, AOC is declaring a position for the broader progressive movement, and could at least theoretically carry that position into a
presidential run in 2028. Meanwhile, the bill seems to have very little support from mainstream Democrats. Senator Mark Warner, for example, said the idea was, in his words, "idiocy." He continued, "A data center moratorium simply means China is going to move quicker. The idea that we're going to stuff this back into the bottle, that's a ridiculous premise." Now, despite thinking the moratorium is the wrong solution, Warner certainly still has strong views on AI policy. He's currently supporting a bill to codify Anthropic's red lines around using AI for domestic surveillance and autonomous weaponry. Referring to Secretary of War Pete Hegseth, he added, "Those should be policy decisions, not left to a single individual." Warner also raised an alarm about AI job replacement, commenting, "Recent college graduate unemployment is 9%. I'll bet anyone in the room it goes to 30 or 35% before 2028." He said he now believes the scope of the economic disruption is going to be exponentially larger than he thought just a few months ago. Commenting on the
Sanders-AOC policy, James Rosenberg writes, "I see why it's called populism now, never liked the term. Every part of this is detrimentally performative. It's arbitrary AF. The ban on upgrades means no energy efficiency or sustainability improvements can happen. There is nothing progressive about it."
On the other end of the spectrum, New York Times tech reporter Mike Isaac writes, "People can certainly take issue with his positions and plan of action, but Sanders seems to be one of the few members of Congress seriously reckoning with what the labor consequences of the coming AI age could be. Now joined by AOC." One of the things that I'm not sure on is the extent to which a moratorium is, A, something that Bernie and AOC actually think is good policy; B, something they think is good politics, given increasing American antipathy towards data centers in their communities; or C, a way to anchor the conversation on the far end of one extreme, so there's more room to find compromise in the middle. One could certainly hope it's C, but right now it's not at all clear. Now, speaking of the China boogeyman, in our final story, Manus co-founders have been banned from leaving China as the CCP cracks down. The Financial Times reports that Manus's CEO and chief scientist have both been barred from leaving the country
while Meta's $2 billion acquisition is reviewed. We heard rumblings of this earlier in the month, as Manus and Meta executives were summoned to Beijing for a meeting with regulators. The theory of the case is that Manus circumvented China's export controls on tech by relocating their headquarters from Beijing to Singapore. CEO Xiao Hong and Chief Scientist Ji Yichao reportedly attended the meeting and were told after its conclusion that they would be unable to leave China, but were free to travel within the country.
Sources said that no formal investigation has been opened and no charges have been brought, but Manus is said to be seeking legal representation to help resolve the issue. The entire situation is messy because it deals with the intersection of actual laws and the unspoken rules that govern doing business in China. China has strict laws to control foreign investment in and export of technology. However, both Manus and Meta maintain their transaction was in full compliance. The relocation of the headquarters is an obvious gray area, made even more gray by the fact that Manus still maintains an offshore entity, which was used to develop early versions of the product. As for the unspoken rules, Chinese officials have become increasingly concerned about losing AI talent and technology to the West. They've even adopted the euphemism of "selling young crops" to describe the poaching of human capital in strategic industries. Sources suggested that the extreme outcome would be a forced unwind of the Meta deal, but noted that that would be messy because the technology is already being integrated into Meta's platforms. What's more, this isn't the only sign that Beijing is tightening its grip on its domestic AI industry. AI researcher Tao Hu shared that the China Computer Federation has warned researchers not to participate in the
NeurIPS conference. Chinese entrepreneur Alina Huah argued that this is all to be expected, writing, "They thought they were being clever for circumventing China's tech export controls, but you don't mess with the CCP like that. You will be made an example of so others don't get tempted to betray the motherland. So what's going to happen? China won't jail them because they don't want to look evil. Instead, they're going to freeze the founders' assets in China and give them a travel ban while the quote-unquote probe is ongoing. The probe will likely be deliberately prolonged to inflict psychological damage, create uncertainty for potential copycats, and make the public forget about this case. And once the topic is out of the public's mind, the CCP can strike hard with a financial penalty that wipes out most of their gains and trap them in China." Another commenter writes, "I didn't think the Manus top execs would be so naive as to go back to the PRC. Expect they will have to spit back out a lot of what they made." On the flip side, some Western observers thought the crackdown will probably backfire. Former White House adviser Dean Ball commented, "If we were smart, we'd see this as a major self-own by China, as natsec-brained public policy so often is. The message the government is sending
is: if you ever want to found a company, especially one that makes money on software, move to Singapore first. Easier to get GPUs too." Never a dull day in AI, but for now,
that does it for the headlines. Next up, the main episode. Alright folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise, how work gets done, how teams collaborate, how decisions move, not as a tech initiative but as a total operating model shift. And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum.
The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us/ai. That's www.kpmg.us/ai. Today's episode is brought to you by Robots and Pencils, a company that is growing fast. Their work as a high-growth AWS and Databricks partner means that they're looking for lead talent ready to create real impact at velocity. Their teams are made up of AI-native engineers, strategists, and designers who love solving hard problems and pushing how AI shows up in real products. They move quickly using RoboWorks, their agentic acceleration platform, so teams can deliver meaningful outcomes in weeks, not months. They don't build big teams; they build high-impact small ones. The people there are wicked smart, with patents, published research, and work that's helped shape entire categories. They work in velocity pods and studios that stay focused and move with intent. If you're ready for career-defining work with peers who challenge you and have your back, Robots and Pencils is the place. Explore open roles at robotsandpencils.com/careers. That's robotsandpencils.com/careers.
Want to accelerate enterprise software development velocity by 5x? You need Blitzy, the only autonomous software development platform built for enterprise code bases. Your engineers define the project, a new feature, refactor, or greenfield build. Blitzy's agents first ingest and map your entire code base, then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work, autonomously generating hundreds of thousands of lines of validated and tested code, with more than 80% of the work completed in a single run. Blitzy is not just generating code; it's developing software at the speed of compute. Your engineers review, refine, and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint,
accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com. That's blitzy.com. It is a truth universally acknowledged that if your enterprise AI strategy is trying to buy the right AI tools, you don't have an enterprise AI strategy. Turns out that AI adoption is complex. It involves not only use cases, but systems integration, data foundations, outcome tracking, people and skills, and governance. My company, Superintelligent, provides voice-agent-driven assessments that map your organizational maturity against industry benchmarks across all of these dimensions. If you want to find out more about how that works, go to besuper.ai. When you fill out the get-started form, mention maturity maps. Again, that's besuper.ai. Welcome back to the AI Daily Brief. Today we are looking at the launch of ARC-AGI-3, a new benchmark from ARC Prize that is specifically designed to test the interactive reasoning capability of AI agents. Now, it is the latest in a sequence of benchmarks that are meant to deal with some of the problems of benchmarks, but to better understand what they are
trying to respond to, it's worth going back and actually understanding what the purpose of benchmarks is, what the problems with them are, and how people have tried to address those problems. Benchmarks are effectively two things: a way to compare AI performance in various areas, and a way to see how models are progressing over time. Historically, there have been two major categories of benchmarks that you see included with every new model release: benchmarks around knowledge and benchmarks around function. Knowledge was the first big hill to climb, with benchmarks like MMLU for general knowledge and GPQA, which measures scientific knowledge. Over time, more difficult benchmarks were introduced, like Humanity's Last Exam, which features obscure knowledge not typically found in the training data. As models developed, however, function became more important. SWE-bench is one of the best known of the functional benchmarks, testing the knowledge required to solve typical coding problems from GitHub. As agentic coding has risen in importance in the AI space, Terminal-Bench has arguably overtaken SWE-bench as the most important coding benchmark. Terminal-Bench tests not only coding reasoning but also the model's ability to use a terminal. Many benchmarks
have followed this pattern, starting off as a test of knowledge and then implicitly or explicitly also adding an element of testing functional capacity. Humanity's Last Exam, for example, began as a pure test of pre-trained knowledge, but now it's typically measured with web search tools enabled, making it a proxy for competency in tool use as well. Now, very early on in the modern post-ChatGPT era of AI, benchmark saturation became a problem. All the way back in May of 2024, with the release of GPT-4o, all major models were already above 80% on MMLU, with GPT-4o scoring 88.7%. Now, at the time, some other benchmarks were a little bit less saturated. 4o was a big breakout, for example, on GPQA, scoring 53.6%, but of course with all of these benchmarks it was only a matter of time. By last summer, the saturation problem had gotten much worse. At the time, o3 was OpenAI's daily driver. More difficult questions had been added to GPQA Diamond, and o3 still achieved 83.3% without using tools. By that stage, most of the 2024 benchmarks had been abandoned or updated because of saturation. For example, the MATH benchmark was long gone, replaced by the AIME math test, which uses questions from a real-world math competition. o3 would score 88.9% on AIME, foreshadowing that a specifically trained OpenAI model would achieve a gold medal performance
at the International Math Olympiad a few months later. Fast forward to today, and once again, many of these benchmarks are getting saturated. GPT-5.4 is now up to 52.1% on Humanity's Last Exam with tools and 39.8% without, which is very close to Opus 4.6's 53% and 40%, respectively. SWE-bench was once again upgraded, with GPT-5.4 scoring 57.7% on SWE-bench Pro. For Opus 4.6, Anthropic reported 81.4% on SWE-bench Verified, but chose to highlight Terminal-Bench 2.0 more prominently, where they scored 65.4%.
Now, it's difficult to keep track of all these numbers, but this chart shows how performance on SWE-bench Verified progressed over the past year. Models from Anthropic, Google, OpenAI, and MiniMax, who produced the chart, are basically all up and to the right. They each began at different points in the middle of 2025, ranging from 55% to 70%; however, they've all now arrived near 80%. Benchmark saturation, then, means that benchmarks no longer show particularly meaningful progress between each model generation. They also don't show meaningful differences between the models, and making this problem worse is the issue of benchmark maxing. Benchmark maxing refers to when a lab trains the model specifically to beat the benchmark,
even if it has little relevance in the real world. This happens because the benchmarks are either completely public or semi-public, meaning model labs can train specifically for the test in order to have more impressive numbers when they come out. One common critique of Chinese labs is benchmark maxing in the extreme, which frequently leaves their models with a huge gap between their benchmark scores and real-world performance. In February, a variant coding benchmark called SWE-rebench was released, containing a different set of problems, and most of the Chinese models dived in the rankings, suggesting they were specifically trained against the narrow set of SWE-bench Verified problems. The Western models did drop as well, but not by nearly as much. Another example was Meta with the release of Llama 4 Maverick last April. Meta was accused of testing multiple model variants on LMArena, which is a crowdsourced taste-test platform for LLM performance. Platform users are presented with two samples and vote for the best one. Meta was accused of having tested models until they found the one that clicked most with users, and launched it as the second-ranked model on LMArena. You will recall that when people got their hands on Llama 4, they did not, in almost any case, think it was the second-best model available. Between benchmark maxing and benchmark saturation, the net effect is the diminished significance of benchmarks as a tool for people to understand which models are good at what. Now, on top of all of that, there is just an inherent problem with traditional benchmarks. Most of these benchmarks tend to be narrowly focused on solving one particular type of task,
some are about recalling knowledge and some are about more complex skills, but they are focused on doing one thing within a very narrowly defined set. We've talked in some episodes this week about the idea of task AGI: that at this point, AI is really good at a huge array of knowledge work tasks, but where it struggles is in bringing tasks together. In that light, it would be reasonable, I think, to say that while benchmarks might be good at demonstrating task AGI, they're not particularly useful in helping understand how AI does outside of that very narrow task. Math is a particularly good example of this, with last year's models basically solving the very narrow field of competition mathematics. This was demonstrated in the IMO gold medal performances from OpenAI and Google. That is, of course, a completely different skill set than real-world mathematics. To the extent the practical reality of deploying AI is understanding and dealing with its jagged frontiers, most traditional benchmarks just aren't all that helpful with that. Now, everything I'm discussing today are known, longstanding problems, and there have been a
ton of attempts to fix benchmarks over the years. One of the brute force methods is simply making the questions harder. We've seen this with SWE-bench and GPQA, which remained relevant deep into 2025 by simply changing the difficulty level. This gave the benchmarks at least a little more life and kept them relevant for hill-climbing performance, but it didn't really address the core underlying problem. A second strategy has come from more practical tests. A key example here is the transition from SWE-bench to Terminal-Bench as the major coding benchmark. Terminal-Bench was intended to be a closer match to the way people actually use the models. It put models in a standard harness and tested their ability to use a terminal and other tools to solve coding problems. On some level, it was an improvement, but it is still dealing with saturation issues, and it also adds more complex variables. Particularly early on, for example, good coding models would fail tasks because they couldn't execute the tool calls properly. Another approach has been trying to
simulate real-world tasks. An early version of this idea was the SWE-Lancer benchmark, developed by OpenAI last February. It tested coding ability against real-world tasks from Upwork that paid an aggregate of a million dollars. This allowed OpenAI to express their models' coding ability in dollar terms. The spiritual successor was GDPval, released by OpenAI last September. It extended the real-world problem set beyond coding to encompass various types of white-collar work, like making spreadsheets and slide decks. One of the interesting quirks of GDPval was that it required the agent to build and deliver a polished work product. It quickly became clear that models were failing tasks not always because they couldn't do them, but because the tool calls were failing. Now, GDPval also has other challenges. For example, OpenAI went out and actually worked with experienced professionals to do a combination of AI and human review. Other evaluators, like Artificial Analysis, have gone and modified GDPval to be a strictly automated, AI-only version, and it remains one of the benchmarks that I think people are most interested in
relative to all the others. Now, another major approach has been looking at continuous agent performance, with METR's long-task benchmark being the most well known. This is the chart that, as bubble talk increased over much of last year, we joked was effectively holding up the entire global market. The way the test works involves giving models a set of coding problems that human coders could complete in a set interval of time, ranging from a few minutes to several hours. The resulting chart has become one of the clearest demonstrations of model improvement. In the space of two years, we went from agents that could only complete tasks that take humans five minutes, in the case of GPT-4o, to agents that can complete tasks that take humans 10 hours, in the case of Opus 4.6. Now, the big problem with METR's test, and one that they fully admit, is that they're running out of tasks to test against. Their original task set included very few tasks that take more than a few hours. Now that agents can complete complex tasks that take 10 hours, METR is struggling to find a useful test set. Realistically, tasks that take human developers 10 hours aren't really tasks anymore; they're full-on software builds that introduce far more complexity into the test. In other words, METR can't really extend their benchmark without turning it into something fundamentally different, meaning that even this test is effectively saturated.
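As I understand METR's methodology, the headline number is a "time horizon": fit how success probability falls off with human task length, then read off the length at which the model succeeds half the time. A rough sketch of that idea, with invented data; METR's actual fitting procedure is more involved.

```python
# Fit success probability against (log) human task length, then solve for
# the length where the model succeeds 50% of the time. Data is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([5, 15, 30, 60, 120, 240, 480, 600])
succeeded    = np.array([1,  1,  1,  1,   1,   0,   1,   0])

X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# p = 0.5 exactly where w * log(t) + b = 0, i.e. t = exp(-b / w)
w, b = clf.coef_[0][0], clf.intercept_[0]
print(f"estimated 50% time horizon: {np.exp(-b / w):.0f} minutes")
```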
Which brings us to ARC-AGI. It began as the ARC Prize in the summer of 2024, based on former Google computer scientist François Chollet's approach to measuring machine intelligence. Introducing the prize, ARC wrote at the time, "AGI progress has stalled. New ideas are needed. Modern LLMs have shown to be great memorization engines. They are able to memorize high-dimensional patterns in their training data and apply those patterns in adjacent contexts. This is also how their apparent reasoning capability works. LLMs are not actually reasoning. Instead, they memorize reasoning patterns and apply those reasoning patterns in adjacent contexts, but they cannot generate new reasoning based on novel situations. More training data lets you buy performance on memorization-based benchmarks, but memorization alone is not general intelligence. General intelligence is the ability to
efficiently acquire new skills." ARC Prize's answer to this is a test that contains a series of abstract visual logic puzzles. The tasks are presented as a series of colored squares on a grid, with squares added or removed according to a particular pattern. A couple of clues are given to teach the pattern; then the task is to apply that pattern to a problem square. For example, the problem might require a yellow square to be placed next to a line of blue squares in various orientations.
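For those who haven't seen one, ARC tasks ship as small JSON files: a few demonstration input/output grid pairs plus a test input whose output the solver must produce. The structure below follows the public ARC format; the tiny grids themselves are invented for illustration.

```python
# An ARC-style task as a Python dict mirroring the JSON format.
# Integers are colors; 0 is the background.
task = {
    "train": [  # demonstration pairs that teach the transformation
        {"input": [[0, 1, 0]], "output": [[0, 1, 1]]},
        {"input": [[0, 2, 0]], "output": [[0, 2, 2]]},
    ],
    "test": [
        # the solver must infer the rule (copy the colored cell rightward)
        {"input": [[0, 3, 0]]}  # expected output: [[0, 3, 3]]
    ],
}
```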
These are problems that are relatively easy for humans to solve but proved to be difficult for LLMs. The tasks are also kept hidden so the logic can't be trained into the models. Instead, the test tries to measure an LLM's ability to learn new logic within context and apply it to a novel problem. Basically, it set out to be a pure test of reasoning ability rather than memorization of how to reason. Early results were pretty compelling that this was a solid approach. At the time that ARC-AGI-1 was released, no models had come within 50% of human performance. Subsequent releases improved on this score, but the models seemed to be making genuine progress through reasoning. Then, in December of 2024, OpenAI dropped a bombshell. A preview version of their o3 model had achieved a 76% score on low inference settings, exceeding the human score for the first time. On high settings, the score was 88%. The o3 model
had been trained on the public dataset but tested on a private dataset to achieve this score, so there was no risk the logic was trained into the model. ARC wrote at the time, "This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never before seen in the GPT-family models." At the same time, ARC announced that they would be updating their benchmark for 2025 with ARC-AGI-2. The new benchmark looked superficially similar to the first: it contained the same colored squares and was once again designed to be easy for humans and harder for LLMs. This time, though, it was built to pressure-test the innovation that allowed the o-series models to outperform, which is test-time compute.
It kind of seems quaint now, but at the time, the idea of making a model reason for longer was a paradigm-shifting innovation. With o3, OpenAI had extended test-time compute enough to maintain context between problems and learn iteratively throughout the test. In order to pressure-test this approach, ARC added a new twist to the problems. Rather than simply adding a square according to the pattern, there were now three new styles of tests. The first was symbolic interpretation, where the LLM was tasked with interpreting more meaning within the symbols, for example, tasks where shapes needed to be colored differently according to how many holes they have. A second new set of tasks required applying multiple rules within the same problem set, which they called compositional reasoning. And a final new set of tasks added context to the problem, where logic was no longer universally applied but dependent on context; for example, shapes with a red border needed to be shifted to the right, while shapes with a blue border needed to be shifted to the left. Again, all of these problems remained fairly simple for humans, but the additional complexities were designed to overload LLM context and test pure reasoning ability.
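That last category is easy to picture in code. Here is a toy version of a context-dependent rule of the kind described, where the transformation applied to a shape depends on its border color; representing shapes as lists of (x, y) cells is my own encoding, not ARC's.

```python
# Toy context-dependent rule: the same shape shifts right or left
# depending on its border color.
def shift_by_border(shape_cells, border_color):
    dx = {"red": 1, "blue": -1}.get(border_color, 0)
    return [(x + dx, y) for x, y in shape_cells]

print(shift_by_border([(2, 0), (3, 0)], "red"))   # [(3, 0), (4, 0)]
print(shift_by_border([(2, 0), (3, 0)], "blue"))  # [(1, 0), (2, 0)]
```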
The test held up well for most of 2025; most model releases scored below 30%. At the very end of the year, and as this year got underway, things escalated dramatically. Gemini 3.1 Pro scored 77.1% at 96 cents per task in February. In March, Opus 4.6 achieved a 68.8% score, GPT-5.4 Pro achieved 83.3%, and Gemini 3 Deep Think is the current leader at 84.6% and $13.62 per task. Basically, once again, as the benchmark got saturated,
we needed something new, which gets us to ARC-AGI-3. In an X post introducing the test on Wednesday, ARC writes, "Announcing ARC-AGI-3, the only unsaturated agentic general intelligence benchmark in the world. Humans score 100%, AI less than 1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn." Now, the test is a complete rethink of the ARC-AGI formula. The static grids of colored squares are gone. In their place, ARC has designed a series of 135 simple graphical games that require the LLM to manipulate the grid in real time. They have no instructions, so the model needs to explore the environment, figure out how it works, execute a plan, and adapt on the fly to what it sees. In their early testing, ARC observed models failing by mistaking one game for another, carrying over theories between games, and failing to forecast cause and effect.
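Structurally, that makes ARC-AGI-3 an agent-environment loop rather than a question-answer benchmark: the model gets observations, takes actions, and has to infer the rules as it goes. Here is a deliberately tiny stand-in environment to show the shape of that loop; the real games, action spaces, and API are ARC's own and certainly differ.

```python
# A toy agent-environment loop in the spirit of ARC-AGI-3.
import random

class ToyGridGame:
    """Hidden goal the agent must discover: turn every cell on."""
    def reset(self):
        self.grid = [0, 0, 0, 0]
        return tuple(self.grid)

    def step(self, action):            # action = index of a cell to toggle
        self.grid[action] ^= 1
        return tuple(self.grid), all(self.grid)

def play(env, max_steps=200):
    obs = env.reset()
    for steps in range(1, max_steps + 1):
        action = random.randrange(len(obs))  # a real agent would explore,
        obs, done = env.step(action)         # build a model, and plan here
        if done:
            return steps                     # fewer steps = higher score
    return None

print(play(ToyGridGame()))
```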
ARC wrote, "ARC-AGI-3 gives us a formal measure to compare human and AI skill acquisition efficiency. Humans don't brute force; they build mental models, test ideas, and refine quickly. How close is AI to that? Spoiler: not close. And unlike ARC-AGI-2, we are starting at ground zero. None of the frontier models can complete this test with any level of competency, each scoring less than 1%." Someone from Google DeepMind shared one of Gemini's playbacks, which are all publicly available in the replays
section of the arc website. She wrote, "Poor Gemini's straight thought it was playing Activision Tennis." Not everyone is a fan of how this is set up. The Sunal Gives Scaling A1 writes, "The scoring of Arc AGI 3 doesn't tell you how many levels the model is completed, but how efficiently they completed them compared to humans, actually using square deficiency." Meaning, if a human took 10 steps to solve it, and the model 100 steps, then the model gets a score
The implication, they write, is that scores are not comparable to the first two ARC tests. On the other end of the spectrum, AI researcher Brandon Hancock commented on the elegance of the benchmark. He writes, "An alien species with zero knowledge of human language could
ace ARC-AGI-3 on day one, and I think that's beautiful." At a time when AI is dominated by language models, it's refreshing to have a frontier benchmark, the only one that I'm aware of, that requires zero language ability or cultural knowledge to solve. Intelligent does not mean "speaks English" or "speaks Python." I'm reminded of classic first-encounter sci-fi storylines where intelligent species are able to communicate well before they hash out a common spoken or written language, simply based on universal math, science, and reasoning concepts. AI has gotten complex enough
that it behaves much more like an alien species than a next-token predictor at this point. François Chollet, one of the creators of ARC-AGI, warned that this won't be the one benchmark to rule them all, commenting, "Keep in mind, ARC-AGI is not a final exam that you pass to claim AGI. The benchmarks target the residual gap between what's hard for AI and what's easy for humans. It's meant to be a tool to measure AGI progress, and to drive researchers towards the most important open problems on the way to AGI. So it's a moving target designed to track the frontier. As AI evolves, the benchmark evolves to spotlight the exact problems we haven't solved yet." And I think maybe that's the big takeaway. The idea of trying to "solve" benchmark saturation is probably as simple as not assuming that benchmarks are going to last all that long. Just as we need innovation in the way that we build these models, we're going to need innovation in the way that we measure them. It'll be interesting to see how fast we have models that actually jump from under 1% to some meaningful percentage on ARC-AGI-3, but of course, before long, we'll need some other new thing to measure some other new capability.
For now, that is going to do it for today's AI Daily Brief. I appreciate you listening or watching, as always, and until next time, peace!


