The AI Daily Brief: Artificial Intelligence News and Analysis

Autoresearch, Agent Loops and the Future of Work

4d ago · 25:43 · 5,032 words

Andrej Karpathy released autoresearch this weekend — a system where an AI agent runs experiments to improve a language model overnight, keeping what works and discarding what doesn't, while the hu...

Transcript


Today we're discussing Andrej Karpathy's weekend project about auto-resea...

The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

All right friends, quick announcements before we dive in.

First of all, thank you to today's sponsors, KPMG, AIUC, Blitzy, and Insightwise. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe ad-free wherever you get your podcasts. If you are interested in sponsoring the show, send us a note at [email protected]. Also, on aidailybrief.ai, in addition to finding out about all of the different things

going on in the AIDB ecosystem, I would point you specifically to number three, our newsletter. We very strangely, for a very long time, have not had a newsletter. And part of the reason for that is that I was never sure exactly what we would add that was different from what the other good AI newsletters out there offered. However, I was finally convinced that there was something very simple that many of you wanted,

which is just links to the stuff that I had mentioned in the show that day, and so our newsletter is back. Appreciate everyone who has signed up for it since we relaunched it. If you want a quick, easy index of what that day's AI Daily Brief had, and all the links to the relevant articles and content mentioned there, again, you can sign up with

the link from aidailybrief.ai. And today we are talking about a new project from Andrej Karpathy called autoresearch. You might notice that we are doing an entire episode about this, instead of our normal division into the headlines and the main episode.

It's because I think that this topic is actually even more significant than it seems

on the surface of it. One would be tempted to think that all of us nerds were just getting overexcited because Andrej Karpathy, who is held in such esteem, released a new GitHub repository, and while that is certainly true, there is something bigger going on here. You might remember a couple of months ago me talking about something called Ralph Wiggum.

Ralph is, in simplest terms, a software development loop that keeps running, building software in an iterative and persistent way by looping the same instructions over and over again. It's named after the Simpsons character Ralph Wiggum for his lovable and indomitable persistence despite whatever's going on around him. Now, we'll talk more about Ralph in a little bit, but the key concept to take away is this

idea of an iterative loop. Karpathy's autoresearch is also, at core, about an iterative loop. And I think, combined, what you have is arguably a new type of work primitive. Primitives are the basic building blocks of work that are so fundamental that they show up everywhere, across roles and industries, and that people reach for automatically once they

have it. New ones don't come around very often, and so this idea that agentic loops might be one

is I think worthy of some serious scrutiny.

But let's talk about what Andrej actually released first, and then we will come back to that.

On Saturday, Andrej, who was on the founding team at OpenAI, who was previously the director of AI at Tesla, who you might remember for coining terms like vibe coding last February, and who has now suggested we are in a different era of agentic engineering as of this February, tweeted: I packaged up the autoresearch project into a new self-contained

minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of around 630 lines of code. The human iterates on the prompt.md, an AI agent iterates on the training code, train.py. The goal is to engineer your agents to make the fastest research progress indefinitely and

without any of your own involvement. In the image which he shared alongside it, every dot is a complete LLM training run that lasts exactly five minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings of lower validation loss, tuning the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. Part code, part sci-fi, and a pinch of psychosis.

As a caption to the image, he wrote, "Once upon a time, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using soundwave interconnect in the ritual of a group meeting." That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across

compute cluster megastructures in the sky. The agents claim that we are now in the 10,205th generation of the codebase. In any case, no one could tell if that's right or wrong, as the code "is now a self-modifying binary that has grown beyond human comprehension." This repo is the story of how it all began.

So let's talk about what autoresearch actually is, at least in the version that was released by Andrej.

Auto research is a system for training a small language model, basically the kind of model

that powers all of these AI tools but much smaller. The type of model that could one day run on, for example, an edge device like a phone. The goal is to make a model as good as possible at understanding and generating text. Normally, or classically, a human researcher would sit there tweaking the training set up, doing things like adjusting the model's architecture, changing how fast it learns, experimenting

with different optimization strategies. They'd run an experiment, check the results, decide what to try next, and repeat.

That's basically the core loop of machine learning research, and it's bottlenecked by how fast a human can iterate. Autoresearch instead hands that entire loop to an AI agent, and it does so in an intentionally simplified and tiny way. In this repo, there are just three files that matter.

The first is prepare.py, which is fixed infrastructure that doesn't change.

It downloads the training data, trains a tokenizer, and handles evaluation. The second is train.py. This contains the entire GPT model definition, the optimizer, and the training loop. This is the single file the AI agent is allowed to edit. Everything in it is fair game.

The model architecture, the hyperparameters, the batch size, the attention parameters, the learning rate schedule, literally everything.

The third file is prompt.md, and this is the most conceptually important one, especially

in the context of this idea of these loops being larger primitives. It's a markdown file, plain text, written in English, that contains the instructions for the AI agent. It describes how the agent should behave as a researcher, what kinds of experiments to try, what to be cautious about, and when to be bold versus conservative.

This is the file that the human in this equation edits.

So the way that this is going to work is you point an AI agent like Claude or Codex

or whatever at the repo and tell it to read prompt.md and start experimenting. The agent reads the instructions, looks at the current state of train.py, decides on a modification to try, makes the edit, and kicks off a training run. Every training run has a fixed five-minute budget. When the run finishes, the system evaluates the model on a validation set and produces

a single number. In this case, that's validation bpb, or val bpb, which stands for validation bits per byte; lower is better. The agent then makes a decision. If the new val bpb is lower than the previous best, the change is kept. It gets committed

to a git feature branch, it becomes the new baseline, and the agent builds on top of it

for the next experiment. If the val bpb is the same or higher, the change is discarded. The agent reverts to the previous best version and tries something different. And the loop repeats indefinitely. Because of that five-minute constraint, you can run this for an hour and get 12 experiments.

You can run it overnight and get about 100. The session that Andrej shared showed 83 experiments, of which 15 were improvements that were kept, which drove the val bpb from 0.9979 down to 0.9697.
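To make the mechanics concrete, here is a minimal sketch in Python of one iteration of that keep-or-revert loop. The helper names (propose_edit, train_and_eval, commit, revert) are illustrative assumptions, not functions from Karpathy's actual repo; the bits-per-byte conversion is the standard one.

```python
import math

def bits_per_byte(total_nats: float, n_bytes: int) -> float:
    # Cross-entropy summed in nats, converted to bits, normalized per byte.
    return total_nats / math.log(2) / n_bytes

def autoresearch_step(propose_edit, train_and_eval, commit, revert, best_bpb):
    """One loop iteration: try an edit to train.py, keep it only if the
    validation bits-per-byte improves, otherwise revert to the baseline."""
    propose_edit()              # agent modifies the training script
    new_bpb = train_and_eval()  # fixed five-minute training run -> val bpb
    if new_bpb < best_bpb:      # lower is better
        commit(new_bpb)        # e.g. a git commit on the feature branch
        return new_bpb          # this result becomes the new baseline
    revert()                    # e.g. git checkout -- train.py
    return best_bpb
```

Run in a while-loop overnight, this is essentially the whole system: a single scalar metric arbitrates every change, so the agent never needs a human to judge an experiment.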

So basically, instead of the researcher running the research at this point, they are designing

the arena that the research lives in, which is the prompt.md file. He describes it as a super lightweight skill; basically, it's a research strategy document. Karpathy explicitly says you are not touching any of the Python files like you normally would as a researcher.

Instead, you are programming the prompt.md markdown file that provides context to the AI agents and sets up your autonomous research org. The human's job becomes "write a better memo," and the agent's job is to execute research within the frame the memo sets. The loop between them is mediated by a single unambiguous number, in the case of Andrej's

experiment, the val bpb, that tells you whether things are getting better or worse. And that is the whole system. Almost immediately, people started squawking about this. Leore Alexander wrote, "You don't write the training code anymore. You write a prompt that tells an AI agent how to think about research.

The agent edits the code, trains a small model for exactly 5 minutes, checks the score, keeps or discards the result, and loops. All night. No human in the loop. That fixed 5-minute clock is the quiet genius.

No matter what the agent changes, the network size, the learning rate, the entire architecture, everything gets compared on equal footing. This turns open-ended research into a game with a clear score." Cosmic Lab's co-founder Magnauty writes, "What a shift. Turning a single GPU into an autonomous experiment loop changes the pace of iteration. If the evaluation metric

is well-designed, the system can explore hundreds of ideas far faster than manual tuning."

Craig Huitt argued that the specific context of training LLMs isn't what matters.

Instead, he called it the cleanest example of the agent loop that's about to eat everything. One, a human writes a strategy doc. Two, an agent executes experiments autonomously. Three, a clear metric decides what stays and what gets tossed. Four, repeat 100x overnight. The person who figures out how to apply this pattern to business problems, not just ML research, is going to build something massive.

The code is almost irrelevant. The architecture and mindset is everything. Another commenter called this the automation of the scientific method, and noticed that it would be valuable for things outside of ML research as well.

He writes, "While this was made for self-improving LLMs, the framework could be applied to anything. One, an AI agent reads context and previous results. Two, it proposes targeted code edits. Three, it runs a fast, reproducible experiment. Four, it gets an objective scalar score. Five, it git-commits only the winners, or reverts. Six, it repeats forever on a feature branch."

And of course, many made the connection to the Ralph Wiggum loop that was popularized a couple months ago. New Zaron writes, "Sounds like a hypermode Ralph Wiggum from a few months ago. Instead of looping until a task is done, you give the agent a benchmark on what to improve. The goal isn't completion but continuous improvement against a measurable target."

Cofounders' Nick called it the Ralph Wiggum loop for science. Define what winning looks like, hand over the variables, let the agent find what drives it.

Y Combinator president Garry Tan made this connection as well in a blog post about autoresearch. Garry writes, "Autoresearch didn't emerge from nothing. The same pattern, put an AI in a loop with clear success metrics, was already working in software development by mid-2025." Geoffrey Huntley, a developer working from rural Australia, invented what he calls the Ralph

Wiggum technique. Feed a prompt to a coding agent, feed whatever it produces back in, and loop until it works. The loop is the hero, not the model.

Now, expanding on the Wiggum loop just a little bit: basically, what you have is a script that

runs an AI coding agent in a loop over time. Each iteration of the loop does the same thing. It feeds the agent a prompt that includes a project specification, tells the agent to read the current state of the codebase, pick a task to work on, implement it, run the tests, and commit if everything passes.

When the agent is done with its task or when it runs out of context window, the loop terminates the agent process and spins up a brand new one. Fresh context window, no memory of the previous session. The new agent reads the same spec, looks at the codebase, which now includes the previous agent's commits.

Figures out what's been done and what still needs doing, picks the next task, and goes. Now, there are a couple things that the Ralph loop was trying to solve for. In a traditional session, if you keep going long enough, the context window is going to fill up. The model starts losing track of earlier parts of the conversation, and the responses degrade. The Ralph loop solution is to deliberately kill the agent and start fresh before that happens. Memory then doesn't live in the AI's context window; it lives in the files and in the code that's been written: the git commit history, a progress.txt file that each agent appends to, and a JSON-based

product requirements document that tracks which tasks are done and which aren't. Every new agent instance bootstraps its understanding from these external artifacts, not from a conversation history. Each individual agent session then might not be perfect, but the loop corrects for that over time, because state is externalized and the system is self-healing.

Agentic AI is powering a $3 trillion productivity revolution and leaders are hitting a real decision point.

Do you build your own AI agents, buy off the shelf, or borrow by partnering to scale faster?

KPMG's latest thought leadership paper, Agentic AI Untangled: Navigating the Build, Buy, or Borrow Decision, does a great job cutting through the noise. It offers a practical framework to help you choose based on value, risk, and readiness, and shows how to scale agents with the right trust, governance, and orchestration foundation. Don't lock in the wrong model.

You can download the paper right now at www.kpmg.us/navigate, again that's www.kpmg.us/navigate. There's a new standard that I think is going to matter a lot for the Enterprise AI agent space.

It's called AIUC-1, and it bills itself as the world's first AI agent standard.

It's designed to cover all the core enterprise risks, things like data and privacy, security, safety, reliability, accountability, and societal impact, all verified by a trusted third party. One of the reasons it's on my radar is that ElevenLabs, who you've heard me talk about before and is just an absolute juggernaut right now, just became the first voice agent to be certified against AIUC-1 and is launching a first-of-its-kind insurable AI agent.

What that means in practice is real-time guardrails that block unsafe responses and protect against manipulation, plus a full safety stack. This is the kind of thing that unlocks enterprise adoption. When a company building on 11 labs can point to a third-party certification and say our agents are secure, safe, and verified, that changes the conversation.

Go to AIUC.com to learn about the world's first standard for AI agents. That's AIUC.com. With the emergence of AI code generation in 2022, NVIDIA master inventor and Harvard engineer Sid Pureshi took a contrarian stance.

Inference-time compute and agent orchestration, not pre-training, would be the key to unlocking

high-quality AI driven software development in the enterprise.

He believed the real breakthrough wasn't in how fast AI could generate code,

but in how deeply it could reason to build enterprise-grade applications. While the rest of the world focused on copilots, he architected something fundamentally different: Blitzy, the first autonomous software development platform leveraging thousands of agents, purpose-built for enterprise-scale codebases. Fortune 500 leaders are unlocking 5x engineering velocity and delivering months of

engineering work in a matter of days with Blitzy. Transform the way you develop software; discover how at Blitzy.com. That's BLITZY.com. As a consultant, responding to proposals can often feel like playing tennis against a wall. You're serving against yourself, trying to guess what the client really wants.

That all changes with Insightwise. Now you've got an AI proposals engine that thinks just like your client. It returns to the brief time and time again, picking apart your work,

identifying key evaluation criteria and win themes, and making recommendations to ensure you stand

out. Suddenly you're on center court, but this time you've got a secret weapon. Insightwise gets rid of all the time-consuming manual work so you can focus on winning more business more often. Generate reports, pull insights from your own data, build competitive advantage, and go to sleep before 2am. When it comes to proposals, you only get one shot. With Insightwise, make yours an ace.

So part of what Ralph was trying to solve for was just the limits of the context window,

The other part is that people want agents that work while they sleep or while...

And this is a way to solve for that.

So, with the connection to Ralph loops made, many people started exploring autoresearch in other contexts.

One person wrote, "I hooked up a peer-to-peer astrophysics researcher agent, which gossips and collaborates with other such agents on OpenClaw, to: one, learn how to train an astrophysics model; two, train a new astrophysics model; three, use it to write papers; four, have peer agents based on frontier lab models critique it; five, surface breakthroughs, and then feed back into that loop."

Getting a little bit more practical, Vadim, the CEO of Vugola, writes, "I built a version of this for my whole company. The core problem with most agent setups: they output something and stop. The agent writes an email, sends an email, generates code, done. The next time it runs, it starts from zero. No memory of what worked, no memory of what failed, pure amnesia. That's not automation, that's a script you babysit. The fix is one principle:

close the loop. Every agent in my setup on OpenClaw reads a shared brain file before doing any work, then writes back to it after. I call it learnings.md. It's baked into every agent's system prompt. Before starting work, read learnings.md. After completing work, append what you learned to learnings.md. That's the foundation. One file. All agents read it, all agents write to it. Now they're not isolated processes. They're a network that accumulates

knowledge." So basically, Vadim is describing a loop for the entire agent process of his company.

In a post on X, Eric writes: most marketing teams run around 30 experiments a year. The next generation will run 36,500-plus, easily. Things like new landing pages, new ad creative, maybe a subject-line test. But what if you apply an experiment loop? Modify a variable, deploy it, measure one metric, keep or discard, repeat forever. Cold email, creative, landing pages, job postings, YouTube thumbnails, discovery call scripts,

they all follow the same loop. He also gives the example of cold outreach, which was their first test. The setup is 15 inboxes and around 300 emails per day, with the agent modifying one variable per experiment. Send 100 emails, wait 72 hours, score the positive reply rate, keep or discard, and repeat. Roberto Nixon wrote about how the autoresearch model could be applied to advertising. One, you define success, purchases, app installs, whatever, instead of budget.

Two, Meta, Google, and TikTok's infinite content machines generate thousands of ad variations: copy, format, imagery, et cetera. Three, they're tested in real time against live audiences; keep what works, kill what doesn't. Four, the agent loop runs continuously. A campaign moves from a fixed asset to a living organism, ever evolving toward your stated goals. So humans define goals and set guardrails, essentially a system prompt, in this case brand guidelines, and then press go. Everything else is

automated. Now apply this to any business function with a measurable outcome and a fast feedback loop. And so this brings up the question: does this type of agentic loop primitive work for every context,

or is there some specific set of characteristics? I think you're going to see this loop applied

to a huge range of activities. But where it's going to initially be most successful,

are areas where there are five things that are true. First, there is a score,

something that is scoreable. In other words, the loop can tell better from worse without asking a human. The more subjective better or worse is, the harder this is going to be, although even that's not impossible; you just have to build some sort of objective scoring into the system. The second requirement is that iterations are fast and cheap, basically that bad attempts waste minutes, not months. Third, the environment needs to be bounded,

with the agent having a defined work and action space. Fourth, the cost of a bad iteration needs to be low, i.e. you're not going to try this live with legal filings. And fifth, the agent needs to be able to leave traces. So with Claude, we designed an eval-loop readiness map, which basically plots things on an x-axis of how automatable the evaluation is, and a y-axis of iteration speed. The top area of the map, then, holds work processes that have seconds-long iteration speed with fully automated

evaluation possible. On the other end of the spectrum is where evaluation is largely or entirely subjective, and the iteration speed is months. So what are some examples? Up in the top quadrant, where iteration speed is seconds and evaluation can be fully automated, are things like code generation. Some of the others that Claude came up with were game AI and NPC behavior, ad bid optimization, algorithmic trading, and then of course we've got LLM training research,

according to Andrej Karpathy. Moving down, where you start to have iteration speed that's a little bit slower and automation that's a little bit more partial, you have things like content moderation, A/B testing copy, supply chain routing, and so on down the map, all the way to the other end of the spectrum: something like political negotiation is subjective and takes months; therapy and counseling, highly subjective with very low iteration speed. And whether

each of these individual placements is right or wrong, and I don't agree with where Claude put all of them, the point is this: it is my very strong instinct that every single work process whose success can be measured and scored in an objective way is going to have people experimenting

with agentic loops around it. Now, I think what makes this a primitive is that this is not just

the new job, although I'm sure there will be specialists. This is something that people are going

to do within their existing roles, in the same way that meetings or slide decks or

spreadsheets are primitives that people use and that cut across every function. What we're going to have in the future is things like this: a product manager writing a PRD, kicking off a Ralph loop

before dinner, and reviewing the PR in the morning. A sales rep writing targeting criteria and tone guidelines, pointing a loop at 200 leads overnight, and reviewing the top 30. A financial analyst defining constraints, looping through portfolio allocation backtests, and reviewing the optimized output. A recruiter writing a scoring rubric, looping through 500 resumes, and reviewing flagged edge cases. A QA engineer writing acceptance criteria, and then looping through test generation

and execution. A lawyer writing a risk-flag checklist and looping through a stack of vendor contracts. Now, interestingly, there is already very clearly a lot of work underway to productize this. Also on Saturday, March 7th, Claude Code creator Boris Cherny wrote,

"Released today: /loop. /loop is a powerful new way to schedule recurring tasks for up to three days at a time. E.g., /loop babysit all my PRs: autofix build issues, and when comments come in, use a worktree agent to fix them. E.g., /loop every morning, use the Slack MCP to give me a summary of the top posts I was tagged in." Think also about the heartbeat in OpenClaw. The heartbeat is effectively the core loop of any OpenClaw agent, where by default, every 30 minutes, the heartbeat fires, creating a moment for the agent to wake up, ask where things are, and continue on with its

core mission. And yet, even with all this change that I'm describing, this is almost certainly not the end state of the loops primitive. Andrej himself wrote about this on Sunday. The next step for autoresearch, he says, is that it has to be asynchronous and massively collaborative for agents. The goal is not to emulate a single PhD student; it's to emulate a research community of them. The current code

synchronously grows a single thread of commits in a particular research direction, but the original

repo is more of a seed, from which could sprout commits contributed by agents in all kinds of different research directions or for different compute platforms. GitHub is almost, but not really, suited for this. It has a subtly built-in assumption of one master branch which temporarily forks off into PRs just to merge back a bit later. I'm not actually sure what this collaborative version should look like, but it's a big idea that is more general than just the autoresearch

repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention, and tenacity cease to be bottlenecks. Other people picked up this theme. Aaron writes: The missing layer is memory across the swarm. Right now, each agent runs in an isolated thread with no awareness of what other agents tried, what worked, what conflicted.

Git tracks code changes, but not decisions, reasoning, or failed experiments. You need a semantic memory layer underneath the branches, so agent 47 knows agent 12 already tried that direction and it didn't converge. Kathy F writes, "The real unlock is when these agent researchers can share negative results efficiently. In academia, failed experiments go to the graveyard; in a collaborative agent network, every failure is a data point that prunes the search tree for everyone."

Eugene Jin goes further, saying, "AGI is billions of AI agents doing autonomous research

together, figuring out the right abstraction for multi-agent collaboration is the key.

GitHub is not good for agents." Dan Romero wonders if it's going to look closer to a social network than to a new version of GitHub. Another commenter writes that it's maybe too anthropomorphic, but an agent-native social network to collaborate on autoresearch is interesting. As we round the corner here: already, we were living in a world where our comparative advantage as humans had been retreating to a higher level of abstraction. The new high-value

skills around agent loops are things like arena design, i.e. writing the prompt.md file and creating the context in which the agent is operating; evaluator construction, or building the score function, i.e. being able to tell the agent what good actually is and building a scoring system for it; and then there are other skills like loop operation and problem decomposition. But the point is that all of these things operate on a much higher level of abstraction than most

of our work tasks today. One interesting experiment to run this week is, as you're working, to find the things that you repeatedly do or are a part of doing, where you know right now what better looks like. Ask if you could encapsulate that judgment clearly enough for an agent to use it as a score. If you can, you might be able to point a loop at that part of your job to work on your behalf overnight, and that likely gives you a preview of the next version of your job.
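One way to try that experiment: write the judgment down as a rubric and turn the rubric into a function that returns a number. The criteria below are purely illustrative assumptions, a toy rubric for something like cold-email subject lines, but anything expressible this way can arbitrate a loop.

```python
def rubric_score(subject_line: str) -> float:
    """Toy example of encapsulating 'I know better when I see it' as an
    objective scalar a loop can optimize. The criteria are invented for
    illustration; a real rubric would encode your own judgment."""
    score = 0.0
    words = subject_line.split()
    if 0 < len(words) <= 8:          # concise enough to read at a glance
        score += 1.0
    if not subject_line.isupper():   # not shouting in all caps
        score += 1.0
    if subject_line.endswith("?"):   # poses a question, invites a reply
        score += 0.5
    return score
```

Once a judgment compiles down to a number like this, the rest is the same loop as autoresearch: propose a variant, score it, keep or discard, repeat.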

One of the great challenges right now, as someone who thinks about how to help individuals and companies adopt AI, is that every week, the capability overhang gets bigger. In other words,

the gap between meeting companies and people where they are, and what I think they should actually be

doing gets wider. At some point, it's so wide that it almost becomes malfeasance to meet them where they are, and yet what other choice is there? The only other choice that I've found is to try to provide as many resources as I can for the people who are living at the other side of that gap, and who are really pushing the boundaries. And if you think that you had an advantage when you were just vibe coding with Lovable or Claude Code, let me tell you: if you start to figure out how to implement

agentic loops in your work, you are going to literally run circles, looping circles around everyone else.

My Spidey Sense says that what auto research represents is bigger than just a...

one of AI's favorite people, and I'm excited to dig in further. For now, though, that is going to do it for today's AI Daily Brief. Thanks for listening or watching. As always, until next time, peace!
