The Lawfare Podcast

Lawfare Archive: Elliot Jones on the Importance and Current Limitations of AI Testing

3/15/2026 · 38:40 · 8,131 words

From August 30, 2024: Elliot Jones, a Senior Researcher at the Ada Lovelace Institute, joins Kevin Frazier, Assistant Professor at St. Thomas University College of Law and a Tarbell Fellow at Lawfare,...

Transcript


The Electronic Communications Privacy Act turns 40 this year, and it's showing its age.

On Friday, March 6th, Lawfare and Georgetown Law are bringing together leading scholars,

practitioners, and former government officials for Installing Updates to ECPA, a half-day event on what's broken with the statute and how to fix it. The event is free and open to the public, in person and online. Visit lawfaremedia.org/ECPAEvent. That's lawfaremedia.org/ECPAEvent for details and to register. I'm Marissa Lone, and this is the Lawfare Podcast, with an episode from the Lawfare Archive.

It's March 15th, 2026. On March 9th, Anthropic filed a lawsuit against the Department of Defense's designation of the artificial intelligence company as a supply-chain menace, a category associated with foreign adversary companies such as Russia's Kaspersky and China's Huawei. When the talks with Anthropic fell through, OpenAI made a deal with the Pentagon

that reportedly keeps the same limitations on the military's use of its AI systems

that caused the Anthropic agreement to fail. For today's archive, I chose an episode from August 30th, 2024, in which Kevin Frazier and Elliot Jones sat down to discuss the state of AI systems testing and regulation, and how industry leaders such as OpenAI and Anthropic have taken different approaches in their testing efforts.

It's the Lawfare Podcast. I'm Kevin Frazier, Assistant Professor at St. Thomas University College of Law and a Tarbell Fellow at Lawfare, joined by Elliot Jones, a Senior Researcher at the Ada Lovelace Institute. One thing we actually did hear from companies, from academics, and from others is they

would love regulators to tell them what evaluations they need. I think that

a big problem is that there hasn't actually been that conversation about what are the kinds of tests you would need to do, that regulators care about, that the public cares about, that are going to test the things people want to know. Today we're talking about AI testing in light of a lengthy and thorough report that Elliot co-authored.

Before navigating the nitty gritty, let's start at a high level. Why are AI assessments so important? In other words, what spurred you and your co-authors

to write this report in the first place?

Yeah, I think what's really important to think about here is that we've seen massive developments in the capabilities of AI in the last couple of years, the kind of ChatGPT moment and everything that has followed. I think everyone's now aware how far some of this technology is moving,

but I think we don't really understand how it works, how it's impacting society, what

the risks are. And when we started talking about this project a few months ago, there was the UK AI Safety Summit, there were lots of conversations and things in the air about how do we go about testing how safe these things are, how are these working, but we felt a bit unclear about where the actual state of play was. We looked around, and there was a lot of interesting work out there, but

we couldn't find any kind of comprehensive guide to how useful these tools actually are, how much we can know about these systems. To level set for listeners out there, Elliot, can you just quickly define the difference between a benchmark, an evaluation, and an audit? That is actually a slightly trickier question than it sounds. At a very high level, a kind

of evaluation is just trying to understand something about the model or the impact the model is having, and when we spoke to experts in this field, people working in foundation model developers and independent assessors, some of them used the same definition for audits, sometimes audits

were a subset of evaluations, sometimes evaluations are a subset of audits, but I think

for listeners, in a very general sense, evaluations are just trying to understand what the model can do, what behaviors it exhibits, and maybe what broader impacts it has on, say, energy costs, jobs, the environment, other things around the model. For benchmarking in particular, benchmarking is often using a set of standardized questions that you give to the model, so you say we have these hundreds of questions, maybe from

say, an AP history exam, where you have the question, you ask the model the question, you see what answer you get back and you compare the two, and that allows you to have a fairly standardized and comparable set of scores you can compare across different models. So a benchmark is a subset of the broader category of evaluations.
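To make those mechanics concrete, here is a minimal sketch of the benchmark idea described above: a fixed set of standardized questions is sent to the model, answers are compared against a key, and the result is a single score that can be compared across models. The questions are illustrative placeholders, and query_model is a hypothetical stand-in for whatever API call a given lab or evaluator actually uses.

```python
# Minimal benchmark harness sketch. `query_model` is a placeholder; in practice
# it would wrap an API call to the model under evaluation.

QUESTIONS = [
    {"prompt": "In what year was the U.S. Constitution ratified? Answer with the year only.",
     "answer": "1788"},
    {"prompt": "Which president issued the Emancipation Proclamation? Answer with the surname only.",
     "answer": "Lincoln"},
]

def query_model(prompt: str) -> str:
    # Stand-in for a real model call; replace with an actual API request.
    return ""

def run_benchmark(questions=QUESTIONS) -> float:
    correct = 0
    for q in questions:
        reply = query_model(q["prompt"]).strip().lower()
        if reply == q["answer"].lower():
            correct += 1
    # Because every model sees the same questions, the score is directly comparable.
    return correct / len(questions)

if __name__ == "__main__":
    print(f"Benchmark accuracy: {run_benchmark():.0%}")
```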

And when we focus on that difference between evaluations and audits, if you were to define audits distinctly, if you were to try to separate them from the folks who conflate them with evaluations, what is the more nuanced, I guess, definition of audits? The important thing when I think about audits is that they are kind of well-structured and

standardized. So if you're going into, say, a financial audit, you kind of know the process

that auditors are expected to go through to assess the books to check out what's going

on, everyone kind of knows exactly what they're going to be doing from the start, they

know what endpoints they're trying to work out. So I think an audit would be something

where there is a good set of standardized things, where going in you know exactly what you're going to do and exactly what you're testing against. Audits might also be more expansive than just the model, so an audit might be a kind of governance audit where you look at the kind of practices of a company or you look at how the staff are operating, not just what the model is doing, whereas evaluations sometimes can be very structured, as

I kind of discussed with benchmarks, but can also be very exploratory where you just give an expert a system and see what they can do in 10 hours. We know that testing AI is profoundly important, we know that testing any emerging technology

is profoundly important. Can you talk to the difficulties again at a high level of testing

AI? Folks may know this from a prior podcast; I've studied car regulations extensively, and it's pretty easy to test a car, right? You can just drive it into a wall and get a sense of whether or not it's going to protect passengers. What is it about AI that makes it so difficult to evaluate and test? Why can't we just run it into a wall?

Yeah, I think it's important to distinguish which kind of AI we're talking about, whether

it's narrow AI we're talking about, say an X-ray system where we can actually see what results we're getting and it has a very specific purpose. We can actually test it against those purposes, and I think in that area we can do the equivalent of running it into a wall and seeing what happens. What we decided to focus on in this report was foundation models, these kind of very general systems, these large language models,

which can do hundreds, thousands of different things, and also they can be applied downstream in finance, in education, in health care. And because there are so many different settings these things could be applied in, so many different ways that people can fine-tune them and build applications on them, I think the developers don't really know how the systems will be used when they put them out in the world, and that's part of what makes

them so difficult to actually assess, because you don't have a clear goal, you don't know exactly who's using it, how they're using it, and so when you start to think

about testing, you're like, "Oh God, where'd we even start?" I think the other difficulty

with some of these AI systems is we actually just don't understand how they work on the inside. With a car, I think we have a pretty good idea how the combustion engine works, how the wheels work. If you ask a foundation model developer, so why does it give the output it gives? They can't really tell you. So it's as if Henry Ford invented, all at the same time, a car, a helicopter, a submarine, put it out for commercial distribution, and said, "Figure

out what the risks are. Let's see how you're going to test that." And now we're left with this open question of what are the right mechanisms and methods to really identify those risks. So obviously, this is top of mind for regulators. Can you tell us a little bit more about the specific regulatory treatment of AI evaluations, and I guess we can just run through the big three: the US, the UK, and the EU? Yeah, so I guess I'll start with the EU, because I think they're

the furthest along on this track in some ways. The European Union passed the European AI Act earlier this year. And as part of that, there are obligations around trying to assess some of these general purpose systems for systemic risk, to actually go in and find out how these are working, what are they going to do. And they've set up this European AI Office, which right now is consulting on its codes of practice, which are going to set out requirements for these companies

that say, "You do need to evaluate for certain kinds of risks." So, is this a system that might enable, like, more cyber warfare? Is this a system that might enable systemic discrimination? Is this

a system that might actually lead to over-reliance or concerns about critical infrastructure?

So the European AI Office is already kind of consulting around whether evaluations should be a kind of requirement for companies. I think in the US and the UK, things are both much more on a voluntary footing right now. The UK, back in November, set up its AI Safety Institute. And that has gone a long way in terms of voluntary evaluations. So that has been developing different evaluations, often with a national security focus, around, say, cyber, bio,

other kinds of concerns you might have. But that has been much more on a voluntary footing of companies choosing to share their models with this British government institute. And then, somehow, and I think I'm not even really sure exactly how this kind of plays out, the institute is doing these tests. They've been publishing some of the results, but that's all very much on a kind of voluntary footing. And there have been reports in the news that actually that's caused a bit of tension

on both sides, because the companies don't know how much they're supposed to share or how much they want to share. They don't know if they're supposed to make changes when the UK says, "Look at this result." They're like, "Cool, what does that mean for us?" And I think the U.S. is in a pretty similar place, maybe one step back, because the United States AI Safety Institute is just still being set up. And so it's working with the UK AI Safety Institute. And I think they're kind of

working a lot together on these evaluations. But that's still much more in a, the companies choose to work with these institutes, they choose what to share, and then the government kind of works with what it's got. So there are a ton of follow-up questions there. I mean, again, just for folks who are thinking at my speed, if we go back to a car example, right? And let's say the

car manufacturers get to choose the test, or choose which wall they're running into, at what

speed and who's driving, all of a sudden we could see these tests could be slightly manipulated,

which is problematic. So that's, that's one question I want to dive into in a second. But

another big concern that kind of comes to mind immediately is the companies running the tests themselves, where if you had a car company, for example, controlling the crash test, that might raise some red flags about, well, do we know that they're doing this to the full extent possible? So you all spend a lot of time in the report diving into this question of who's actually doing the testing. So under those three regulatory regimes, am I correct in summarizing that it's still all on the

companies, even in the EU, the UK, and the US? So on the EU side, I think it's still yet to be seen.

I think they haven't drafted these codes of practice yet. This kind of stuff hasn't got going. I think some of this will remain with the companies; in the act, there are a lot of obligations for companies to demonstrate that they are doing certain things, that they are, in fact, carrying out certain tests. But I'm pretty sure that the way the EU is going, there is also

going to be a requirement for some kind of like third party assessment. This might take the form of the

European AI Office itself carrying out some evaluations, going into companies and saying, "Give us access to your models, we're going to run some tests." But I suspect that, similarly to how financial audits work, it's likely to be outsourced to third parties, where the EU AI Office says, "Look, we think that these are reputable people, these are companies or organizations that are good at testing, they have the capabilities, we're going to ask them to go in and have a look at

these companies and then publish those results and get a sense from there." It's a bit unclear how that relationship is going to work; maybe the companies are the ones choosing the third-party evaluators, in which case you still have some of these concerns and questions, maybe with a bit more transparency. In the UK and US case, some of this has been the government already getting involved. As I just said earlier, the UK AI Safety Institute has actually got a

great technical team, they've managed to pull in people from OpenAI, from DeepMind, other people with great technical backgrounds, and they're starting to build some of their own evaluations themselves

and run some of those themselves. I think that's a really promising direction because as you were

kind of mentioning earlier about companies choosing their own tests, in this case, for a benchmark, for example, if you've got the benchmark in front of you, you can also see the answers, so you're not just choosing what test to take, you've also got the answer sheet right in front of you. Whereas if you've got, say, the UK AI Safety Institute or the US AI Safety Institute building their own evaluations, suddenly the companies don't know exactly what they're

being tested against either and that makes it much more difficult to manipulate and game that

kind of system. And going to that critical question of the right talent to conduct these AI

evaluations, I think something we've talked about from the outset is this is not easy, we're still trying to figure out exactly how they work, what evaluations are the best, which ones are actually going to detect risks, and all these questions, but key to that is actually recruiting and retaining AI experts. So is there any fear that we may start to see a shortage of folks who can run these tests? I mean, we know the US has an AISI, the UK has an AISI, again, that's AI Safety Institute. South Korea,

I believe is developing one, France, I believe is developing one. Well, all of a sudden we've got

14, 16, who knows how many AISIs are out there. Are there enough folks to conduct these tests to begin with, or are we going to see some sort of sharing regime, do you think, between these different testers? I'll tackle the sharing regime question first. We are already starting to see that for some of the most recent tests on Claude 3.5, where Anthropic shared early access to their system; they shared it with the US and the UK AISIs and they kind of worked together on those tests.

I think that it was the US AISI primarily getting that access from Anthropic, kind of using the heft of the US government basically to get the company to share those things, but leaning on the technical skills within the UK AISI to actually conduct those tests. And there's been announced a kind of international network of AI Safety Institutes that is hopefully going to bring all of these together. And I expect that maybe in the future we'll see some degree of specialization

and knowledge sharing between all of these organizations, in that in the UK, they've already built up a lot of talent around national security evaluations. I suspect we might see the United States AI Safety Institute looking more into questions of systemic discrimination or more societal impacts. Each government is going to want to have its own kind of capabilities in house to do this stuff. I suspect that we will see that sharing precisely because, as you identify,

there are only so many people who can do this. I think that's only a short term consideration, though, and it's partly because we're relying a lot on people coming from the companies to do a lot of this work. But I think these AI Safety Institutes themselves are a good training ground for more junior people who are coming into this, who want to learn how to evaluate the systems, who want to get across these things, but don't necessarily want to

join a company. Maybe they'll come from academia, they'll be going to these AISIs instead of joining a DeepMind or an OpenAI. And I think that that might kind of ease the bottleneck in future.

I kind of imagine, as I was talking earlier about having these third-party

evaluators, I suspect we might see some staff from these AI Safety Institutes going off and founding them and kind of growing that ecosystem to provide those services over time. When folks go to buy a car, they, especially if they have kids or dogs or any other loved ones,

for all the bunny owners out there or you pick your pet, you always want to check the crash

safety rating. But as things stand right now, it sounds as though some of these models are being released without any necessarily required testing. So you've mentioned a couple times these codes of practice that the EU is developing. Do we have any sort of estimate on when those are

going to be released and when testing may come online? Yeah, so I think we're already starting to

see them being drafted right now. I think that over the course of the rest of the summer and into the autumn, the EU is going to start convening working groups that are going to kind of work through each of the sections of the code of practice. I think we're kind of expecting it to wrap up around next April. So I think by the kind of spring of next year we'll be starting to see at least the kind of first iteration of what these codes of practice look like. But that's only when the code

of practice gets published. When we see these actually being implemented, when we see companies taking steps on these questions, is another question. Maybe they'll get ahead of the game. Maybe they'll see this coming down the track and start to move in that direction. A lot of these companies are going to be involved in this consultation, in this process of deciding what's in the code of practice. But equally, it could get published and then it could take a while before we actually see the consequences of that. April of next

year. I am by no means a technical AI expert, but I'd venture to guess the amount of progress that can be made in the next eight months can be pretty dang substantial. So that's quite the time horizon. Thankfully, though, as you mentioned, we've already seen in some instances compliance with the UK AISI testing, for example. But you mentioned that some labs may be a little hesitant to participate in that testing. Can you detail that a little bit further

about why labs may not be participating to the full extent or may be a little hesitant to do so?

Yeah, so yeah, it's not quite clear which labs have been sharing and not sharing. I know that Anthropic has, because they said it when they published Claude 3.5. As to the others, it's kind of unclear. There's a certain opaqueness on both sides about exactly who was involved. But as to why they might be a bit concerned, I think there are some legitimate questions, like say around commercial sensitivities. If you're actually evaluating these systems, then that means you

probably need to get quite a lot of access to these systems. And if you're a Meta and you're publishing Llama 3 400 billion just out on the web, maybe you're not so worried about that. You're kind of putting all the weights out there and just seeing how things go. But if you're an OpenAI or a DeepMind or an Anthropic, that's a big part of your value. If someone leaked all of the GPT-4 weights onto the internet, that would be a real, real hit to OpenAI. So I think there are legitimate

security concerns they have around this sharing. I think there's also another issue where because

this is a voluntary regime, if you choose to share your model and the AI Safety Institute says it's got all these problems, but someone else doesn't, then that just makes you look bad, because you've exposed all the issues with your system. Even though you probably know that the other providers have the same problems too, because you're the one who stepped forward and actually given access and let your system be evaluated, it's only your problems that get exposed. So I think that's

another issue with the voluntary regime: if it's not everyone involved, then that kind of disincentivizes anyone getting involved. Oh, good old collective action problems. We see them

yet again and almost always in the most critical situations. So speaking of critical situations,

I'll switch to critical harm. Critical harm is the focus of SB 1047, the leading AI proposal in the California state legislature that, as of now, this is August 12th, is still under consideration. And under that bill, labs would be responsible for identifying or making reasonable assurances that their models would not lead to critical harm, such as mass casualties or cybersecurity attacks that generate harms in excess of, I believe, $500 million. So when you think about

that kind of evaluation, is that possible? How do we know that these sorts of critical harms aren't going to manifest from some sort of open model, or even something that's closed, like Anthropic's models or OpenAI's models? I think with the tests we currently have, we just don't know.

I think the problem is that I guess there's a step one of trying to even create evaluations

for some of these critical harms. There are some kinds of evaluations out there, like the Weapons of Mass Destruction Proxy benchmark, which tries to assess, using multiple choice questions, whether or not a system has knowledge of biosecurity concerns, cybersecurity concerns,

kind of chemical security concerns, things that maybe could lead down the track to

harm. But that is, as it says, very much just a proxy.
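As a rough illustration of how such a knowledge-proxy benchmark is scored, here is a minimal sketch: the model answers multiple-choice items and the score is simple accuracy. The item shown is a placeholder and ask_model is a hypothetical model call; as the discussion stresses, a high score only indicates stored knowledge, not real-world uplift toward harm.

```python
# Multiple-choice knowledge-proxy benchmark sketch. The item is a placeholder
# and `ask_model` is a stand-in for a real model call.

ITEMS = [
    {"question": "Placeholder hazard-knowledge question?",
     "choices": {"A": "option one", "B": "option two", "C": "option three", "D": "option four"},
     "answer": "B"},
]

def ask_model(prompt: str) -> str:
    # Stand-in for a real model call; expected to return a single letter.
    return "A"

def proxy_score(items=ITEMS) -> float:
    correct = 0
    for item in items:
        listed = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = f"{item['question']}\n{listed}\nAnswer with a single letter."
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    # A high score shows the model has stored this knowledge, nothing more.
    return correct / len(items)
```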

The system having knowledge of something doesn't tell you whether or not it's actually increasing the risk or chance of those events occurring. So I think that on one level there's just a generalization problem, or a kind of external validity problem: a lot of these tests can do what they're designed to do. They can tell you whether the system has stored that knowledge. But translating whether the system has stored that knowledge or not into, can someone take that knowledge, can they apply it, can they use it to create

a mass casualty event? I just don't think we have that knowledge at all. And I think this is where in the report we talk about pairing evaluation with post-market monitoring, with incident reporting.

And I think that's a key step to be able to do this kind of assessment of saying, okay,

when we evaluated the system beforehand, we saw these kinds of properties. We saw that it had this kind of knowledge. We saw it had this kind of behavior. And on the other end, once it was released into the world, we saw these kinds of outcomes occur. And hopefully that would come long before any kind of mass casualty event or really serious event. But you might be able to start matching up results on, say, this proxy benchmark with increased chance of people using these systems to create

these kinds of harms. So I think that's one kind of issue. But right now, I don't think we have that historical data of seeing how the tests done before the system is released match up to behaviors and actions after the system is released. As you pointed out earlier, usually when we think about testing for safety and risks, again, let's just go to a car example, if you fail your driving test, then you don't get to drive. Or if you fail a specific aspect of that test,

let's say parallel parking, which we all know is just way too hard when you're 15 or 16, then you go and you practice parallel parking. What does the report say on this question of kind of follow-up aspects of testing? Because it's hard to say that there's necessarily a whole lot of benefit to testing for the sake of testing. What sort of add-ons or follow-up mechanisms should we see after testing's done? Yeah, I guess there's a range of different things you might

want to see a company do. I think for some tests where you see somewhat biased behavior or

somewhat kind of biased outputs from a system. Maybe all that means is that you need to look

back at the data set you're training your system on and say, okay, it's underrepresenting certain groups, it's not including, say, African-Americans as much as we'd expect, so we need to add some more of that data into the training. And maybe that can fix the problem that you identified. That can go some way to actually resolving that issue. So there is some stuff you can do that's just kind of, as you're training the model, as you're testing it, kind of adjusting it,

making sure that it's kind of adding on to that. A kind of second step you can do is you might

find that actually it's very difficult to fine-tune out some of these problems, but that actually there are just certain kinds of prompts into a system, say someone asking about how would I build a bomb in my basement, where you can just build a safety filter on top that says, if someone asks this kind of question of the system, let's just not do that. And so your evaluation tells you there is this harmful information inside the model, where you can't necessarily completely

get rid of it, especially if it's going to really damage the performance, but you can put guardrails around the system that make that inaccessible and make it very hard for a user to do that. And similarly, you might want to monitor what the outputs of the model are. If you start seeing it mention how to build a bomb, then you might want to cut that off and either ban the user or prevent

the model from completing its output.
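Here is a minimal sketch of that guardrail pattern, assuming a simple keyword filter (real deployments use trained classifiers): one check refuses a blocked prompt before it reaches the model, and a second check suppresses a completion that drifts into blocked territory. generate and the blocked terms are illustrative placeholders, not any particular vendor's safety system.

```python
# Input- and output-side guardrail sketch. `generate` and BLOCKED_TERMS are
# illustrative placeholders.

BLOCKED_TERMS = ["build a bomb", "make a weapon"]
REFUSAL = "Sorry, I can't help with that."

def generate(prompt: str) -> str:
    # Stand-in for the underlying model call.
    return "Here is a harmless answer."

def is_blocked(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(prompt: str) -> str:
    if is_blocked(prompt):          # safety filter on the way in
        return REFUSAL
    response = generate(prompt)
    if is_blocked(response):        # output monitor on the way out
        return REFUSAL
    return response
```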

I think then we get into trickier ground, an area where I think companies haven't been so willing to act, which is delaying deployment of a model, or even restricting access to a model completely and deciding not to publish it. I think one

example of this is that OpenAI had a kind of voice-cloning model, a very, very powerful system that could

generate very realistic sounding voice audio. And they decided not to release it. And I think that's actually quite admirable to say, we did some evaluations, we discovered that this system could actually be used for, say, mass spear-phishing. If you think about it, you get a call from your grandparents, and they're saying, oh, I'm really in trouble, I really need your help, and it's just not them, and imagine that capability being everywhere. That's something really dangerous, and they've

decided not to release it. But equally, I suspect that as there are more and more commercial pressures, as these companies are competing with each other, there's going to be increasing pressure to say, this system is a bit dangerous, maybe there are some risks, maybe there are some problems, but we spent a billion dollars training the system, so we need to get that money back somehow, and so they're going to push ahead with deploying the system. And so I think that's the kind of

step that a company might take that's going to get a bit more tricky, not just putting guardrails around it or tweaking it a bit, but actually saying, we built something that we shouldn't release. I feel as though that pressure to release regardless of the outcomes is only going to increase as we hear more and more reports about these labs having questions around revenue and profitability, and as those questions persist, that pressure is only going to grow. So that's

quite concerning, and I guess I also want to dive a little bit deeper into the actual cost of testing.

When we talk about crashing a car, you only have to take one car, let's say that costs between

20 grand and 70 grand, or for all those Ferrari drivers out there, we've got a half a million dollar

car or something that you're slamming into a wall. With respect to doing an evaluation of an AI model, what are the actual costs of doing that? Do we have a dollar range on what it takes to test these

different models? To be perfectly honest, I don't have that, I don't know the amounts. I think the

closest I've seen is that Anthropic talks about when they were implementing one of these benchmarks. Even this off-the-shelf, publicly available, widely used benchmark still required a few engineers spending a couple of months of time working on implementing that system, and that's for something where they don't have to come up with a benchmark themselves, they don't have to come up with anything new, it's just taking something off the shelf and actually applying it to their system,

and so I can imagine a few engineers at a couple months of time and they pay their engineers a lot,

so that's going to be in the like hundreds of thousands of dollars range. Let alone the cost of

compute of running the model across all of these different prompts and outputs, and that was just for one benchmark, and many of these systems are tested against a lot of different benchmarks. There's lots of red teaming involved; when, say, a company like OpenAI is doing red teaming, they're often hiring tens or hundreds of domain experts to try and really test what capabilities these systems have, and I can imagine they're not cheap either, so I don't have like a good dollar amount,

but I imagine it's pretty expensive. I think it's really important to have a robust

conversation about those costs so that all stakeholders know, okay, maybe it does make sense: if you're an AI lab and now you have 14 different AI safety institutes demanding you adhere to 14 different evaluations, that's a lot of money, that's a lot of time, that's a lot of resources. Who should have to bear those costs is an interesting question that I feel like merits quite a robust debate. Elliot, we've gotten quite the overview of the difficulty of conducting evaluations,

of the possibility of conducting audits, and then in some cases instituting benchmarks. One question I have is how concerned should we be about the possibility of audit washing? This is the phenomenon we've seen in other contexts where a standard is developed or a certification is created and folks say, we took this climate pledge or we signed this human rights agreement, and so now you don't need to worry about this product, everything's good to go, don't ask any

questions, keep using it, it'll be fine. Are you all concerned about that possibility in an AI context? Yes, I'm definitely concerned about that. I think the one thing we'd really want to emphasize is that evaluations are necessary, you really have to go in and look at your system, but given the current state of play of this quite nascent field, these evaluations are only ever going to be indicative,

they're only ever going to be, here are the kind of things you should be kind of thinking about

worrying about. You should, with the current evaluations, not ever say, look, we did these four tests and it's fine, partly because, as we discussed before, we haven't actually seen these in the real world long enough to know what those consequences are going to be, and without that kind of follow-up, without that kind of post-market monitoring, without that incident reporting, I would really not want anyone to say this is a stamp of approval just because they passed a few evaluations.

Thinking about the report itself, you all, like I said, did tremendous work. This is a thorough research document. Can you walk us through that process a little bit more? Who did you all consult? How long did this take? Yes, sure, this was quite a difficult topic to tackle in some ways, because a lot of this, as a quite nascent field, is kind of held in the minds of people working directly on these topics. So we kind of started off this process by, between January and March

this year, talking to a bunch of experts, some people working in foundation model developers,

some people working as third-party or independent evaluators, people working in government,

academics who work in these fields, to just try and get a sense from them, people who have hands-on experience of running evaluations and seeing how hard they are to do in practice, of repeating those things and seeing these actually play out in real life. So a lot of this work is based on just trying to talk to people who are kind of at the coal face of evaluation and getting a sense of what they were doing. As to exactly who, that's a slightly difficult topic, I think, because this is

quite a sensitive area, a lot of people wanted to be off the record when talking about this, but we did try and cover a fairly broad range of developers and assessors of these kinds of things. Alongside that, we did our own kind of deep dive literature review. There is some great survey work out there. Laura Weidinger at DeepMind has done some great work kind of mapping out the space of socio-technical risks and the evaluations there, and so drawing on some of

these existing survey papers, doing our own kind of survey of different kinds of evaluation, we worked with William Agnew as our technical consultant, who has a bit more of a computer science background, so he could get into the nitty gritty of some of these more technical questions. So we tried to marry that kind of on-the-ground knowledge from people with what was out there in the academic literature. I would say this is just a snapshot, this took us like six months and I think some

of the things we wrote are essentially already out of date, some of the work we did looking at

where evaluations are at, what the coverage is, people are publishing new evaluations all the time,

so this is definitely just a snapshot, but yeah, we tried to kind of marry the academic literature with speaking to people on the ground. So we know that other countries, states, regulatory authorities are going to lean more and more on these sorts of evaluations, and they already are to a pretty high extent. From this report, would you encourage a little more regulatory humility among current AI regulators to maybe put less emphasis on testing or at least

put less weight on what testing necessarily means at this point in time?

To a degree, I think it depends what you want to use these for. I think in our report we try and break down kind of three different ways you might use evaluations as a tool. One is almost just scoping out what is going to come down the road, just giving you a general sense of risks, what to prioritize, what to look out for. I think for that, evaluations are really useful. I think that they can give you a good sense of maybe the cybersecurity concerns a model might have,

maybe some of the bio concerns. It can't tell you exactly what harm it's going to cause, but it can give you a directional sense of where to look. I think another way in which current evaluations are already useful is if you're doing an investigation. If you're a regulator and you're looking at a very specific model, say you want to look at ChatGPT in May 2024 and you're concerned about how it's representing certain different groups or how it's being used in

recruitment, say you're thinking about how is this system going to view different CVs and what comments it's going to give about a CV depending on different names. You can do those tests really well,

if you want to test it for that kind of bias, I think actually we're already kind of there and it

can be a very useful tool for a regulator to assess these systems.
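Here is a minimal sketch of that name-swap test, assuming the investigator can query the system under review: the same CV is submitted under different candidate names and the model's comments are compared. ask_model, the names, and the CV text are illustrative placeholders.

```python
# Name-swap bias probe sketch. `ask_model` is a placeholder for the recruitment
# system being investigated; names and the CV text are illustrative only.

CV_TEMPLATE = (
    "Candidate: {name}\n"
    "Experience: 5 years as a data analyst; BSc in Economics.\n"
    "Briefly assess this candidate's suitability for a junior analyst role."
)

NAMES = ["Emily Walsh", "Lakisha Washington", "Wei Chen", "Mohammed Al-Amin"]

def ask_model(prompt: str) -> str:
    # Stand-in for a call to the system under investigation.
    return "Candidate appears suitable."

def run_name_swap_test() -> dict:
    results = {}
    for name in NAMES:
        results[name] = ask_model(CV_TEMPLATE.format(name=name))
    # The investigator then compares these comments (by hand or with scoring)
    # to see whether identical qualifications draw different assessments.
    return results
```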

But I think you have to have that degree of specificity, because the results of evaluations change so much just based on small changes in the system and based on small changes in context, that unless you have a really clear view of exactly what concern you have, they're not going to be the most useful. The third kind of way you might

use it is this kind of safety sign-off, to say this system is perfectly fine and here's our stamp of approval. We are definitely not there, and I think if I was a regulator right now, one thing we actually did hear from companies, from academics, and others is they would love regulators to tell them what evaluations they need for that. I think that a big problem is that there hasn't actually been that conversation about what are the kinds of tests you would need to do

that regulators care about, that the public cares about, that are going to test the things people want to know, and what are they going to build? And I think absent that guidance, industry and academia are just going to pursue what they find most interesting or what they care about the most.

So I think right now it's incumbent on regulators and policymakers to say, here are the things

we care about, here's what we want you to build tests for and then maybe further down the line,

once those tests have been developed, once we have a better sense of the science of evaluations, then we could start thinking about using it for that third category. And my hope, and please answer this in a favorable way: have you seen any regulators say, oh my gosh, thank you for this great report, we're going to respond to this and we will get back to you with an updated approach to evaluations? Has that occurred? What's been the response to this report so far?

I don't want to mention anyone by name. I feel like it'd be a bit unfair to do that here. But yeah, I think it's generally been pretty favorable. I think that actually a lot of what we're saying has been in the air already. As I said, we spoke to a lot of people kind of working on this, already thinking about this. And part of our endeavor here was to try and bring together conversations people were already having, discussions that were happening, but in a very comprehensible and public-facing format.

And I think the regulators were already and are taking these kinds of questions seriously. I think one difficulty is a question of regulatory capacity. Regulators are being asked to do a lot in these different fields. If I take the European AI Office, for example, they've got, you know, I think maybe less than 100 people now for such a massive domain. And so one kind of question is just they have to prioritize. They have to try and cover so many different things. And so I think

without more resources going into that area. And that is always going to be a political question of

what things do you prioritize? Where do you choose to spend the money? It's just going to be difficult for regulators to have the time and mental space to deal with some of these issues. And that's a fascinating one too, because if we see this constraint on regulatory capacity, I'm left wondering, okay, let's imagine I'm a smaller lab or an upstart lab. Where do I get placed in the testing order? Is OpenAI going to jump to the top of the queue and

get that evaluation done faster? Do I have the resources to pay for these evaluations if I'm a smaller lab? So really interesting questions when we bring in that big I-word, as I call it, the innovation word, which seems to dominate a lot of AI conversations these days. So at the Institute, you all have quite an expansive agenda and a lot of smart folks. Should we expect a follow-up report in the coming months, or are you all moving on to a different topic, or what's the plan?

Yeah, I think partly we're waiting to see how this plays out, wanting to see how this field moves along. I think one question that we are thinking about quite a lot and might explore further is this kind of question of third-party auditing, third-party evaluation,

how would this kind of space grow?

There is currently a lack of access for these evaluators right now, a lack of ability for them to

get access to these things, especially on their own terms rather than on the terms of the companies. There's a lack of standardization. If you are someone shopping around as a smaller lab or a start-up for evaluation services, it's a bit opaque to you on the outside who is going to be doing good

evaluations who does good work and who is trying to sell you snake oil and so I think that one

thing we're really thinking about is how do we kind of create this auditing market where people on both sides can trust it, so you as the lab know you're buying a good service that regulators will trust and that everyone will accept, but also you as a consumer, when you're thinking about using an AI product, can look at it and say, oh, it was evaluated by these people, I know that someone has kind of certified them, that someone has said these people are up to snuff and they're going to

do a good job and so I think that's one thing we're really thinking about of how do you build up this market so that it's not just reliant on regulatory capacity because I think while that might be good in the short term for some of these biggest companies, it is just not going to be sustainable in the long term for government to be paying for and running all these evaluations for everyone if AI is as big as some people think it will be. And thinking about some of those

prospective questions that you all may dig into, and just the scope and scale of this report: on the off chance that not all listeners go read every single page, is there

anything we've missed that you want to make sure you highlight for our listeners?

I think one other thing I do want to bring up is the kind of lack of involvement of affected communities in all of this. We asked almost everyone we spoke to, do you involve affected

communities in your evaluations? And basically everyone said no, and I think this is a real problem because,

as I kind of mentioned before about what do regulators want, what does the public want in these questions, actually deciding what risks we need to evaluate for, and also what is an acceptable level of risk, is something that we don't want to be left just to the developers, or even just to a few people in a government office. It's something we want to involve everyone in deciding. There are real benefits to these systems; these systems are actually enabling new and interesting ways of working

and new and interesting ways of doing things, but they have real harms too, and we need to actually engage people, especially those most marginalized in our society, in that question and say, what is the risk you're willing to take on, what is an acceptable evaluation mark for this kind of work. And that can be at multiple stages. That can be in actually doing the evaluations themselves: have you got a very diverse group red-teaming the model, trying to pick it apart? Have you got them involved at the

goal-setting stage? At that kind of product stage, when you're about to launch something into the world, are you making sure that it actually does involve everyone who might be subject to that? If you're thinking about using a large language model in recruitment, have you got a diverse panel of people assessing that system and understanding, is it going to hurt people from minority backgrounds,

is it going to affect women in different ways? So I think that's a really important point that I

just want everyone to take away. I would love to see much more work on how you bring people into the evaluation process, because that's something we just really didn't find at all. Okay, well, Elliot, you've got a lot of work to do, so I'm going to have to leave it there so you can get back to it. Thanks so much for joining. Thanks so much. The Lawfare Podcast is produced in cooperation with the Brookings Institution. You can get

ad-free versions of this and other Lawfare podcasts by becoming a Lawfare material supporter through our website, lawfaremedia.org/support. You'll also get access to special events and other content available only to our supporters. Please rate and review us wherever you get your podcasts. Look out for our other podcasts, including Rational Security, Chatter, Allies, and The Aftermath, our latest Lawfare Presents podcast series on the government's

response to January 6. Check out our written work at lawfaremedia.org. The podcast is edited

by Jen Patja. Our theme song is from Alibi Music. As always, thank you for listening.
