The AI Daily Brief: Artificial Intelligence News and Analysis

Anthropic Accidentally Revealed Their Most Powerful Model Ever

3/27/2026 · 27:55 · 5,345 words

Intercom and Cursor have both shown that post-training open-weight models on domain-specific interaction data can match or beat the best frontier models — cheaper and faster. It's a development th...

Transcript


Today on the AI Daily Brief, are we entering the era of vertical AI models, b...

and in the headlines, a big leak, with Anthropic confirming the existence of Claude Mythos,

what they call by far the most powerful AI model we've ever developed.

The AI Daily Brief is a daily podcast and video about the most important news and discussions

in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Blitzy, AssemblyAI, and Robots and Pencils. To get an ad-free version of the show, go to patreon.com/aidailybrief, and if you are interested in sponsoring the show, send us a note at [email protected].

A late-breaking one: last night, a data leak revealed that Anthropic is testing a new model referred to as Claude Mythos. Anthropic has confirmed the existence of this model, with a spokesperson saying that it was a "step change" (their words) "in performance" and "the most capable we've built to date." They said the model is currently being trialed by early access customers.

So here's what happened. On Thursday evening, a draft blog post describing the model was left in an unsecured, publicly searchable database. The blog post says: we've finished training a new AI model, Claude Mythos; it's by far the most powerful AI model we've ever developed.

Mythos, they write, is a new name for a new tier of model, larger and more intelligent than our Opus models, which were until now our most powerful. They chose the name to evoke the deep connective tissue that links together knowledge and ideas. Compared to our previous best model, Claude Opus 4.6, Mythos gets dramatically higher scores on tests of software coding, academic reasoning, and cybersecurity, among others. In preparing to release Claude Mythos, however, they say, we want to act with extra caution and understand the risks it poses, even beyond what we learn in our own testing. In particular, we want to understand the model's potential near-term risks in the realm of cybersecurity and share the results to help cyber defenders prepare. Mythos is also a large, compute-intensive model. It's very expensive for us to serve and will be very expensive for our customers to use.

We're working to make the model much more efficient before any general release. For those reasons we're taking a slower, more gradual approach to releasing mythos than we have with our other models. We're beginning with a small number of early access customers who will explore the model cybersecurity applications and report back what they find.

Now, this blog post is very undercooked; it ends not too long after that. And if you hear the term "Copy Barathon" around, apparently the model was also referred to as that. I'm not sure if "Copy Baral" was the codename and Mythos the intended launch name, but regardless, this draft blog post was in a cache of unsecured documents.

In total, Fortune reports, there appear to be close to 3,000 assets linked to Anthropic's blog that had not previously been published. Now, there is a lot of chatter about this one, not least of which is the choice of name, which many people associate with the Cthulhu mythos. Given how much the AI safety folks use those sorts of literary reference points to describe their concerns about AI, it may not be the most advised name. People also compared it to the recently revealed "Spud" from OpenAI, with Jason Bauderale writing: I like how Anthropic's mysterious new model is codenamed Mythos, while OpenAI named theirs after a freaking potato.

So, the broader sentiment was captured by Gavin Purcell who says it will only go faster from here. Obviously, there will be a lot to watch with this one.

Unfortunately for those of us who want to get our hands on the most powerful models at any given time, it kind of looks like the blog post was not even an announcement of the release of the model, just an advance warning about it, so who knows how long it will take before we actually see it in practice. Now, one model that is available now: Google has dropped a small voice model that could have big implications.

The model is Gemini 3.1 Flash Live, which brings real-time dialogue to voice models. Up until now, most voice models have been turn-based, causing awkward stumbles and terrible interruption handling. Flash Live is designed to work more like a human conversation, with a continuous back and forth rather than a jarring, stilted experience.

The model apparently shows a step-change improvement on multiple audio benchmarks, including one designed to measure multi-step function calling, the feature that converts voice commands into complex agentic actions. Some customers like Home Depot have already deployed the model, and Google noted a big improvement in handling complex details like alphanumeric product codes and noisy environments.

So the obvious implication is the quality of personal voice agents on mobile devices, and especially given that Apple is looking to Gemini to power the new version of Siri, our long winter of discontent with Siri not understanding a single damn word we say may finally be coming to an end. Next, one small product announcement from Shopify that I actually think could be fairly significant:

one of my weirder or more out-there predictions for 2026 was that Shopify has kind of an outsized role to play in the positive normalization of AI. The reason for that is that Shopify is where a ton of small business entrepreneurship lives. Shopify's tools have, even in the pre-AI era, already given people who felt overwhelmed

by what they needed to do to start a business enough help to get over the hump. Although, as you well know, I am not a jobs doomer, I do think that we're going to see a lot of shifts in the average way that people get employed and make money.

One piece of that I believe will be an increase in small business entrepreneurship.

If Shopify is the home where a lot of that new energy goes, the way that they use AI to provide value for their people could make a big difference in people's perception of it.

It's one thing when the only thing you hear about AI is that it's going to take your job and that it uses all the water; it's another thing when you see your income rise 30% from the month before because of the tools you were able to use through your store's hosting platform. So what Tinker is is a free mobile app with more than 100 AI tools for ecommerce. Merchants can generate logos, product photos, advertising videos, and much more. It's an iterative, experimental, playful canvas where you can try out all sorts of different

brand identities, product placements, and more. The entire concept is about flattening the learning curve. Apps are arranged by outcomes so merchants only need to select what they want to create. Once inside an app, they can see a range of examples demonstrating what it can do and how to use it.

They can then describe a desired outcome in natural language, drop in a reference image, and Tinker automatically turns those inputs into high-quality prompts on the back end. Shopify's director of product said, "If you want more artists, lower the cost of paint." And cost isn't just money; it's the time spent keeping up, the friction of signing up

for everything separately, and the learning curve of figuring it all out. We wanted to lower all of it. So like I said, it may seem small, but I really do believe that Shopify potentially has an outsized role to play in the positive integration of AI into the broader economy, and

I think Tinker, from my first glance, looks awesome.

By the way, hopefully this goes without saying, but this is a completely unsponsored opinion. Over in OpenAI land, Codex gets a big upgrade with the integration of plugins. The OpenAI Devs account writes: with plugins, Codex can now support more real work, including the planning, research, and coordination that happens before you write code, and the workflows that follow.

The team at OpenAI also used the occasion of the plugins launch to go for Anthropic's throat, around some controversy over recent changes from Claude. A recent post from the Claude Code team reads: to manage growing demand for Claude, we're adjusting our five-hour session limits for Free, Pro, and Max subscribers during peak hours. During weekdays between 5am and 11am Pacific time, you'll move through your five-hour session limits faster than before. Everybody was not happy about that, and OpenAI took full advantage. Tibo from the Codex team writes: hello, we have reset Codex usage limits across all plans to let everyone experiment with the magnificent plugins we just launched. You can just build unlimited things with Codex. Have fun.

Speaking of OpenAI, the company has made a decision which I think is extremely the right one: putting their erotica plans on hold. The Financial Times reports that OpenAI has decided to shelve plans for adult mode indefinitely as they consolidate resources around coding and enterprise sales. This is, to put it mildly, not all that surprising. Earlier this month, the Wall Street Journal reported that OpenAI's independent advisory

council was unanimously against the feature. Reportedly, their age detection system had a 12% failure rate, and the experts on the council weren't even satisfied adult mode would be safe for adults, warning it could encourage an unhealthy emotional dependence on ChatGPT. The feature was also controversial among staff, with some departing the company over the

issue. Speaking with the Financial Times, sources said that OpenAI wanted to have more long-term research on the effects of sexually explicit chatbots and emotional attachment to AI before they released the product. Now, my feeling about this, as I said last fall, is that on the one hand, I have a very

socially libertarian bent that basically thinks that adults should be able to do whatever

they want as long as it's not hurting other people. That said, viewing this question from an entrepreneur's lens, it did not make sense to me for OpenAI to be the one to offer this. There is going to be, I promise you, no shortage of adult AI experiences available to any adult who wants them.

And I just think that all of the costs of going down this route were so obviously going to be higher than the upside for OpenAI. So one other thing that I did want to note about OpenAI's recent moves, there is a lot of chatter right now about how many products are being killed by OpenAI, instant checkout, Sora, the erotic chatbot, with people seeming to suggest that it's the company flailing.

I think in many ways it's the opposite. It would be the worst business decision OpenAI could make to stick with something that wasn't the right move, even if it looked like the right move just a couple of months ago. Nothing will kill a business faster than sunk cost fallacy, and OpenAI being willing to scrap efforts, even where a lot of effort went in, is net-net a good thing for the company.

And it couldn't come at a better time, because boy oh boy is the competition going to do nothing but heat up. The latest rumors suggest Anthropic is discussing going public as soon as the fourth quarter, with follow-up Bloomberg reporting saying that they might be looking to IPO as soon as October. That of course puts OpenAI on the clock, as Sam Altman has reportedly said he would prefer

to go first, meaning all in all, I think my prediction that we actually don't get IPOs

this year might be one that is wrong. Noel Mulvay writes: according to the Zodiac, 2026 is the year of the mega IPO. Indeed. For now, that is going to do it for the headlines; next up, the main episode. All right, folks, quick pause.

Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise, changing how work gets done, how teams collaborate, how decisions move, not as a tech initiative but as a total operating model shift.

Here's the real unlock.

That shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce.

If you want to understand what that actually looks like in the real world, go to www.kpmg.us/AI

that's www.kpmg.us/AI. Blitzy is driving over 5x engineering velocity for large-scale enterprises. A publicly traded insurance provider leveraged Blitzy to build a bespoke payments processing application, an estimated 13-month project, and with Blitzy, the application was completed and live in production in six weeks.

A publicly traded vertical SaaS provider used Blitzy to extract services from a 500,000-line monolith without disrupting production, 21 times faster than their pre-Blitzy estimates. These aren't experiments; this is how the world's most innovative enterprises are shipping software in 2026. You can hear directly about Blitzy from other Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts. To learn more about how Blitzy can impact your SDLC, book a meeting with an AI solutions consultant at blitzy.com, that's B-L-I-T-Z-Y dot com. You've heard me talk about AssemblyAI and their insanely accurate voice AI models, but they just shipped something big.

Universal 3 Pro is a first of its kind class of speech language model that lets you prompt

speech recognition with your own domain context and vocabulary, instead of fixing transcripts in post-processing. It's more flexible than traditional ASR and more deterministic than LLMs, so you get accurate output at the source and can capture the emotion behind human speech that transcripts often miss, all without custom models or post-processing hacks.

And to celebrate the launch, they're making it free to try for all of February. If you're building anything with voice, this one's worth a look. Head to assemblyai.com/freeoffer to check it out. Most companies don't struggle with ideas; they struggle with turning them into real AI systems that deliver value.

Robots and Pencils is a company built to close that gap. They design and deliver intelligent, cloud-native systems powered by generative and agentic AI, with focus, speed, and clear outcomes. Robots and Pencils works in small, high-impact pods: engineers, strategists, designers, and applied AI specialists working together to move from idea to production without unnecessary

friction. Powered by RoboWorks, their agentic acceleration platform, teams deliver meaningful results including initial launches in as little as 45 days depending on scope. If your organization is ready to move faster, reduce complexity, and turn AI ambition into real results, Robots and Pencils is built for that moment.

Start the conversation at robotsandpencils.com/aidailybrief. That's robotsandpencils.com/aidailybrief. Robots and Pencils: impact at velocity. Welcome back to the AI Daily Brief. I noticed this really interesting story yesterday, where Intercom announced that their new

dedicated customer service-focused model Finn had achieved something very significant. CEO Eoghan McCabe called it "objectively the highest-performing, fastest, and cheapest model for customer service, beating the very best models in the industry, including GPT-5.4 and Opus 4.5." Now, it has been a persistent question in AI how much custom models would matter.

You might remember, way back in the immediate post-ChatGPT fever, a number of companies figured, "Well, since we have such unique proprietary data, training our own model on that data surely will outperform."

Maybe the best-known of those efforts was BloombergGPT, which they called a 50-billion-parameter large language model purpose-built from scratch for finance. Now, it turned out that in practice, that model got absolutely smoked by the general models, reminding everyone once again of the bitter lesson. "The Bitter Lesson" is a very famous essay from computer scientist Rich Sutton, from back in 2019.

He writes: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." He gave as one of his first examples computer chess. He says that in computer chess, the methods that defeated the world champion Kasparov in 1997 were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer chess researchers, who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that brute-force search may have won this time, but it was not a general strategy, and anyway, it was not how people played chess. These researchers wanted methods based on human input to win, and were disappointed when they did not.

So, what is this essay actually arguing? First, as a reminder, because this often surprises people: AI as a field, at least as a named field, is older than computer science. If you go back to the fifties and look at the laboratories at places like MIT, there were already artificial intelligence labs back then, but the idea of computer science as a field wouldn't come until a little bit later.

In any case, what the bitter lesson is arguing is that throughout AI history, researchers have tried two basic approaches.

The first is encoding human knowledge and clever tricks into systems, essentially trying

to teach computers how humans think. The second is giving computers massive amounts of data and compute and letting them figure things out on their own through search and learning. The bitter lesson is that the second approach wins every single time. It's bitter because it's a blow to human ego.

Researchers spend years crafting elegant domain-specific solutions, encoding chess strategy or linguistic rules or visual perception models, and then a brute-force method powered by more compute just steamrolls all that careful work. Now, we have that example from chess, but it repeated across Go, speech recognition, computer vision, and now language.

The systems that scale with more computation always eventually beat the systems built on human-designed shortcuts. And so taking the bitter lesson and applying it to LLMs kind of explains why Bloomberg's highly specialized model ultimately got beat by much bigger and more computationally intensive models. And yet, coming into 2026, there was an interesting question.

Specifically, whether a specific type of data might actually change this equation. The data that people were interested in is last-mile usage data: basically, user interaction data at the very edge of the experience. And the specific place where many were watching this was AI coding. The question was whether a company like Cursor could ultimately have some advantage in their

own proprietary model because they had such a tremendous amount of experiential data around the actual interaction point. Now, it wasn't really so much a question of whether that data is valuable; obviously it is. But there's a difference between it being valuable for product design versus model design.

Inevitably, that data is extremely useful in figuring out the right products or the right harnesses for models.

That was never in question.

But what was a question is whether all that information could actually change the destiny of customized vertical models. Latent Space wrote about this last November in their piece titled "The Agent Labs Thesis." The point that swyx and Latent Space were making was that if it is the case that we are

close to hitting the limits of pre-training data, that perhaps shifts the future of model performance to post-training. The Agent Labs thesis asked: can post-training make up the gap between the best open models and the best frontier models, and how long until they start exceeding them?

In other words, the tweak here is that a company like Cursor isn't training a model from scratch. They are taking the best available open-weight models out there, which are admittedly a little bit behind the state of the art, and adding in this post-training process, with the idea of actually performing better in a specific domain than the general state-of-the-art model can.
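To make the mechanics of that idea concrete, here is a deliberately tiny, self-contained sketch. To be clear, this is not any lab's actual pipeline: real systems post-train open-weight LLMs with supervised fine-tuning or reinforcement learning on interaction data, while this toy uses plain logistic regression on synthetic data. A "general" model is trained on a mix of two objectives, then the same weights are continued on domain-only data, and domain accuracy improves:

```python
import math
import random

random.seed(0)

def make_batch(n, domain_frac):
    """Labels follow the 'domain' rule (x0 > 0) with probability
    domain_frac, otherwise an unrelated 'general' rule (x1 > 0)."""
    batch = []
    for _ in range(n):
        x0, x1 = random.gauss(0, 1), random.gauss(0, 1)
        use_domain = random.random() < domain_frac
        y = 1.0 if (x0 > 0 if use_domain else x1 > 0) else 0.0
        batch.append((x0, x1, y))
    return batch

def train(w, batch, steps=300, lr=0.1):
    """Full-batch gradient descent on logistic loss; returns new weights."""
    w0, w1 = w
    n = len(batch)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x0, x1, y in batch:
            p = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1)))  # sigmoid
            g0 += (p - y) * x0
            g1 += (p - y) * x1
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return (w0, w1)

def accuracy(w, batch):
    w0, w1 = w
    return sum((w0 * x0 + w1 * x1 > 0) == (y > 0.5) for x0, x1, y in batch) / len(batch)

# "Pre-training": a generalist trained on mixed objectives.
general_model = train((0.0, 0.0), make_batch(2000, domain_frac=0.5))

# Held-out eval set drawn purely from the domain rule.
domain_eval = make_batch(1000, domain_frac=1.0)
base_acc = accuracy(general_model, domain_eval)

# "Post-training": continue from the generalist's weights on domain-only data.
tuned_model = train(general_model, make_batch(2000, domain_frac=1.0))
tuned_acc = accuracy(tuned_model, domain_eval)

print(f"generalist on domain task: {base_acc:.2f}")
print(f"post-trained on domain:   {tuned_acc:.2f}")
```

The point of the sketch is only the shape of the result: the generalist lands on a compromise that underperforms on the narrow task, and continuing training from those same weights on domain data closes the gap, which is the intuition behind starting from an open-weight base rather than from scratch.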

Now, Cursor places pretty high importance on this. The company had said explicitly that they needed to train state-of-the-art coding models to keep up with competitors, which some reports suggested was a financial imperative, with Cursor burning too much money reselling API access to OpenAI and Anthropic. Now, earlier this month, we got the release of their Composer 2 model.

The model was in the same ballpark as GPT-5.4, and actually beat Opus 4.6 on coding benchmarks while being much cheaper to run, meaning of course that it fit Cursor's needs extremely well. However, an X user called Flynn triggered a controversy, revealing that Composer 2 was "just," and boy is "just" doing a lot of heavy lifting, Kimi K2.5 with some extra reinforcement

learning applied. Cursor themselves did not deny this, with DevRel Robinson commenting: "Yup, Composer 2 started from an open-source base. We will do full pre-training in the future. Only a quarter of the compute spent on the final model came from the base; the rest is from our training. This is why the evals are very different." Now, some amount of the controversy was about Cursor, in the eyes of some, failing to disclose their use of an open-source base model, but others seemed genuinely dismissive of the practice.

As Flynn had done they wrote off the model as "just" Kimi K2.5 without a second thought.

Others thought though that maybe something important was going on here.

LeadLLM writes: as someone who basically lives in Opus 4.6, seeing an open-weight Kimi 2.5 fine-tune actually beat it on coding benchmarks is wild. If Composer 2 could really perform that well, Cursor seemed to have demonstrated that reinforcement learning on a quality dataset can actually go quite a long way, vaulting an adequate

base model into the top tier. This, of course, in some ways seems to run counter to the bitter lesson, but if it's correct, it would suggest that there's a lot of fertile ground for training models around particular verticals. Which gets us to the announcement yesterday from Intercom.

Intercom's Chief Product Officer Paul Adams tweets, "We have a very significant announcement here that will change how we think about the AI landscape. We have built a brand new model for Finn, called Apex, which has a higher resolution rate, fewer hallucinations, and is far cheaper than any other model provided by any other company in the world, and it isn't close."

This is an incredibly hard thing to achieve and is only possible with the domain-specific proprietary evals from our billions of human and agent customer service interaction data points.

We also have a flywheel here where we will continue to get better at the edges.

This is, you might recognize, exactly what we were talking about in my 2026 predictions when we talked about the lab loop and the importance of this last-mile usage data. Paul continues: "So what does this mean? It means that vertical models can and will outperform general models. It means that many successful companies in the future will need to be full stack: app layer, AI layer, and model layer.

And critically, as it becomes much easier to copy and clone at the app layer, durable differentiation

will move down the stack and ultimately to the model layer."

Now, this got a ton of chatter. Being a phoG writes: "The story isn't that Apex beat frontier models. It's that domain-specific post-training closed the gap this fast. Any vertical SaaS with enough labeled interaction data is sitting on an untapped fine-tuning asset. The infrastructure moat is eroding faster than most realize." AbyG, who's on the board of Intercom but does new products at OpenAI, writes: "Model quality depends a lot on judgment, and that judgment lives in proprietary evals, real-world usage, and fast feedback loops, being close to the work." This creates all kinds of opportunities for companies that are willing to think big and

bet on themselves. Now, while he doesn't seem worried for his main employer, OpenAI, the implications for them are certainly where many people's heads went.

The obloche writes: "Very cool feat from Intercom, though reading this makes me wonder what value the frontier lab companies actually deliver long-term, when every industry, Cursor for coding, now Finn for CS, can build better and cheaper specialized models from open-source bases." And interestingly, this wasn't the only story around these themes.

Decagon co-founder Ashwin Sreenivas writes: "Over 80% of model traffic at Decagon now runs on models we've trained in-house, structured as a network of specialized models handling different parts of the interaction." Now, this is a little bit different, because there is actually an architectural change here. In their announcement post they write: "Instead of relying on a single model, we built a network of specialized models, each responsible for a specific part of the interaction: detection, orchestration, response generation, and evaluation. That separation lets us optimize each layer independently and drive better speed and quality across the system." Regardless, though, the point is that here you have another company that is shifting off reliance on the major closed foundation models and toward models that they've trained, at least in part, themselves.
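As a rough illustration of what a "network of specialized models" can look like structurally, here is a minimal sketch. This is not Decagon's code: the stage names come from their announcement, but every handler here is a stand-in rule-based stub where a production system would call a separately trained model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interaction:
    """State passed through the pipeline, enriched stage by stage."""
    text: str
    intent: Optional[str] = None
    plan: List[str] = field(default_factory=list)
    response: Optional[str] = None
    approved: bool = False

# Each function below stands in for a specialized model that could be
# post-trained and evaluated independently of the others.

def detect(msg: Interaction) -> Interaction:
    # Stand-in for an intent-detection model.
    msg.intent = "refund" if "refund" in msg.text.lower() else "general"
    return msg

def orchestrate(msg: Interaction) -> Interaction:
    # Stand-in for an orchestration model that plans tool calls.
    msg.plan = ["lookup_order", "issue_refund"] if msg.intent == "refund" else ["answer_question"]
    return msg

def generate(msg: Interaction) -> Interaction:
    # Stand-in for a response-generation model.
    msg.response = f"Handled '{msg.intent}' via: {', '.join(msg.plan)}"
    return msg

def evaluate(msg: Interaction) -> Interaction:
    # Stand-in for an evaluation model that gates what ships to the user.
    msg.approved = msg.response is not None and len(msg.response) < 200
    return msg

STAGES = (detect, orchestrate, generate, evaluate)

def run_pipeline(text: str) -> Interaction:
    msg = Interaction(text)
    for stage in STAGES:
        msg = stage(msg)
    return msg

result = run_pipeline("I'd like a refund for my order")
print(result.response, "| approved:", result.approved)
```

Because each stage owns one narrow job, you can retrain or swap just the detection model, or just the evaluator, without touching the rest, which is the independent-optimization property the announcement describes.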

He continues: "I think this is a trend we'll see going forward. The reliance on general-purpose frontier models will hit a wall for domain-specific tasks. Custom post-training pipelines will be the way forward." Clem Delangue from Hugging Face agrees, writing, "After Pinterest, Airbnb, Notion, Cursor, today it's Eoghan and Intercom publicly sharing that they're finding it better, cheaper, and faster to use and train open models themselves, rather than use APIs, for many tasks, and

hundreds of other companies are doing the same without sharing."

Ultimately, I believe the majority of AI workflows will be in-house, based on open source, versus API. It took much more time than we anticipated, but it's happening now. Now, obviously, if this is the case, there are significant business model implications. Adriana Sabata writes: "The API tax is starting to look like the cloud markup of 10 years ago. Once teams realize they can run fine-tuned open models for a fraction of the cost, the switch becomes obvious." Eoghan from Intercom agrees that this is the beginning of something

bigger, writing a companion post called "The Age of Vertical Models Is Here." He reinforces that the model is just better across numerous dimensions: it has a 2.8% higher resolution rate, but, he writes, importantly, it's also dramatically faster, has fewer hallucinations, in fact a 65% reduction in hallucinations, and is far cheaper than all other available models. In his post, Eoghan references the recent interview with Andrej Karpathy, where Karpathy

said, "I do think we should expect more speciation in the intelligences. The animal kingdom is extremely diverse in the brains that exist, and there's lots of different niches of nature, and I think we should be able to see more speciation. And you don't need this oracle that knows everything; you kind of speciate it, and then you put it on a specific task.

And we should be seeing some of that, because you should be able to have much smaller models that still have the cognitive core." From there Eoghan picks up: the frontier labs still have the very best models, but the open-weight models are not that far behind. So it's not hard to see pre-training as a commodity of sorts; where we think the frontier will move next is to post-training. Karpathy's prediction is exactly what we're seeing with Apex and Cursor's Composer 2, and what we're going to see significantly going forward.

As such, the labs are in an interesting position where, on one hand, the horizontal general-purpose models are actually over-serving the market for specific use cases, e.g. their models are more generally intelligent than is needed for customer service, and on the other hand, the open-weight models are more than good enough that high-quality domain-specific post-training can make the resulting model superior at the special-purpose jobs, and in

the way that matters to that particular job. Personally, I'm still very bullish on the labs. We remain very heavy customers of Anthropic. Yet classic disruption is now at their door. The only way out is to disrupt themselves by building cheaper specialized models too, and the only way to do that is to acquire the evals, or the companies with the evals, needed for that specific task. Which means there will be some interesting data partnerships or M&A consolidation, and you're going to see some hyper-specific model providers go it alone and compete with the labs head-to-head. Likely all of the above.

Now, going back to the bitter lesson, it kind of feels at first glance like this would run counter to that, right? That in the long run, the sheer additional volume of computation and data should beat the specialized knowledge and data of the edge providers. Except the bitter lesson isn't just about the amount of data; it's about brute-force data and compute as opposed to human knowledge, and we're not exactly talking about human knowledge here. Instead, we're talking about experience. The data that a Cursor has, or an Intercom has,

is not the data of some human expert; instead, it's millions of interactions which show how things actually happen in the real world. It turns out that Richard Sutton himself discussed this very thing, as an example of the next phase of the bitter lesson, on the Dwarkesh podcast last year: Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? In some ways, it's a classic case of the bitter lesson: the more human knowledge we put into the large language models, the better they can do, and so it feels good. And yet I, in particular, expect there to be systems that can learn from experience, and those could well perform much, much better and be much more scalable, in which case it will be another instance of the bitter lesson, with the things that used human knowledge eventually superseded by things that just trained from experience and computation.

Putting it simply, these new models, Apex and Composer 2, are post-trained from experience, exactly as Sutton said. Now, this might feel like an inside-baseball kind of story, but I think

that the implications could be massive in terms of how the whole industry evolves. One thing I don't think this means is that every company with any sort of customer data is all of a sudden going to be able to successfully spin up their own model. There are ultimately not that many people who are good at doing post-training, and so I don't think that we're going to see this massive fragmentation of vertical models. But you'd better believe that these results are encouraging enough that many, many more companies who do have this type of data, and the post-training talent or the ability to get it, are going to be doing some experimenting in this area. It's something we will continue to watch and explore, but for now, that's going to do it for today's AI Daily Brief. I appreciate you listening

and watching as always, and until next time, peace!
