Sure, AI makes mistakes – but compared to what, exactly?

Author: alex.steele@leadingai.co.uk

Published: 29/03/2026

AI makes mistakes

I’m very aware I’m a performance management nerd, so I try not to bore you too often with my mild obsession with questioning how people know if the thing they invested in is actually any good or not. I realise I may have over-corrected: it’s a whole year since we gave you any advice on how to assess the performance of your AI. It’s time.

When we published "Is it working? How to measure the impact of AI (without driving people nuts)", lots of organisations were just experimenting with generative AI tools and trying to work out whether they were useful. People were running pilots, access was patchy, and figuring out whether it was actually making a difference still felt a little way off.

A year on, the conversation has shifted – but only slightly. We’ve got more evidence to draw on, and AI tools are increasingly embedded in everyday workflows. But when people talk about how those tools are performing, the judgement is often surprisingly quick and – more worryingly – vibes-based.

“We tried an AI note-taker and it kept calling our digital strategy our digital tragedy. Hilarious! It’s just so bad…”

Cue your colleague – we’ll call him Dave* – writing off all AI-generated content as unreliable nonsense.

Except whenever I hear that, my instinct is to ask the more awkward question: compared to what, Dave? Actually, compared to who?

Anyone who has relied on human meeting notes will know those are rarely perfect and often absolute bobbins. People miss things, mishear things, interpret discussions through their own lens and occasionally forget to circulate the notes at all until long after anyone can remember things clearly enough to question their account. For some reason, though, that’s not the baseline: we are holding tech to a different standard.

That’s partly because many organisations are still judging AI performance without having been honest first about the baseline. Instead of measuring how things have changed, they compare the tool either to an ideal state or to an unchecked guess. The technology feels impressive, or disappointing, or slightly annoying. Assumptions follow accordingly.

From a performance management nerd’s perspective, that’s a very risky way to make decisions.

The AI you ask people to work with doesn’t have to be perfect

The research emerging over the past couple of years gives us a much clearer picture of how generative AI actually performs in real workplaces. The results are consistent: the gains are real, but they are usually incremental rather than dramatic.

One of the most widely cited studies looked at 5,000 customer support agents using a generative AI assistant. Productivity increased by around 15 per cent overall, with the biggest improvements among less experienced staff.

A UK government study comparing AI-assisted evidence reviews with human-only reviews found the teams using AI completed the work 23 per cent faster, while the final quality of the outputs was similar once they had been reviewed and edited.

More widely, labour market research suggests generative AI tools are currently saving workers around five per cent of their working time on average. Five per cent may not sound dramatic, but across an organisation – or, better yet, the economy – that quickly adds up. For many professionals, five per cent of a standard working week is roughly two hours that can be redirected into more valuable work.

None of these studies suggest AI is flawless. In fact, most assume the output is reviewed and edited by humans. The value comes from improving the workflow, not replacing human judgement entirely.

The baseline problem

In performance management this idea is totally standard and very straightforward: if you want to know whether something works, you establish a baseline and measure change against it.

If a team processes 100 reports per week and a new system helps them process 110, that’s a 10 per cent improvement. If the cost per report is down, the quality is the same or better, and the teams are happier and more likely to stick around, that’s even better. No one expects the number to jump to infinity before deciding the system is worthwhile.
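To make that concrete, here’s a back-of-the-envelope sketch of the comparison in Python. Every figure is invented purely to illustrate the arithmetic – swap in whatever your team actually measures against its own baseline.

```python
# Toy example: comparing a new process against a recorded baseline.
# All figures are invented for illustration only.
baseline = {"reports_per_week": 100, "cost_per_report": 42.00, "errors_per_100": 5}
new_system = {"reports_per_week": 110, "cost_per_report": 38.50, "errors_per_100": 4}

def pct_change(before: float, after: float) -> float:
    """Percentage change relative to the baseline value."""
    return (after - before) / before * 100

for metric, before in baseline.items():
    after = new_system[metric]
    print(f"{metric}: {before} -> {after} ({pct_change(before, after):+.1f}%)")
```

The point isn’t the code; it’s that “improvement” only becomes a number you can calculate once the baseline has actually been written down.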

This is not a radical concept. McKinsey have written some excellent books on the subject – all of which I’d recommend. They consistently make the rather sensible point that good performance systems measure the things that actually matter to people doing the work, not abstract metrics that look tidy on a slide. You can also check on the vibes – but do it properly: ask a representative group the right question about how they feel and check on it over time. Don’t just run with Dave’s feedback.

When it comes to AI, however, the comparison often becomes weirdly unrealistic. Instead of asking whether the new tool improves the process, people ask whether the AI ever makes mistakes at all.

That is a very high bar. It is also a very unusual one, in performance terms.

Spellcheck occasionally suggests the wrong word and routinely overestimates how often we’re trying to text someone the word “duck”. Search engines sometimes return irrelevant results – occasionally on purpose because they think it’s what you wanted or they were paid to show you. Spreadsheet models can contain errors that quickly compound if someone builds them imperfectly.

None of these imperfections make these tools useless; the comparison is always with the alternative.

AI should be judged in the same way.

Measure something real

If organisations want to evaluate AI properly, the starting point is surprisingly simple: measure outcomes that reflect how work actually happens. And measure a collection of things; there’s no single magic indicator. You want some qualitative info, some quantitative, some leading indicators, some lagging… you get the idea. Measures work best in little baskets you can have a conversation about.

That could include:

What to measure – Why it matters
Time taken to complete a task – Captures straight-up productivity improvements
Amount of editing required after AI drafts – Shows how much effort the tool saves and helps assess quality over time
Adoption and continued use – Indicates whether staff genuinely find the tool helpful
Error rates compared with the previous process – Keeps the comparison grounded in reality
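If you wanted to capture that basket in something more structured than a spreadsheet of vibes, a minimal sketch might look like the following. Everything here – the field names, the numbers, the two periods being compared – is made up for illustration; the point is simply that each row of the table above becomes something you can record and compare against the baseline.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskRecord:
    minutes_to_complete: float  # time taken to finish the task
    edit_fraction: float        # rough share of the AI draft rewritten by a human (0 to 1)
    used_ai: bool               # whether the person actually used the tool
    errors_found: int           # errors caught in review of this task

def summarise(records: list[TaskRecord]) -> dict:
    """Boil a batch of task records down to the basket of measures above."""
    return {
        "avg_minutes": mean(r.minutes_to_complete for r in records),
        "avg_edit_fraction": mean(r.edit_fraction for r in records),
        "adoption_rate": sum(r.used_ai for r in records) / len(records),
        "avg_errors": mean(r.errors_found for r in records),
    }

# Invented figures: a baseline period before the tool, and a period after rollout.
# (In the baseline period there is no AI draft, so edit_fraction is simply 0.)
baseline_period = [TaskRecord(55, 0.0, False, 2), TaskRecord(60, 0.0, False, 1)]
ai_period = [TaskRecord(40, 0.3, True, 1), TaskRecord(45, 0.2, True, 2)]

print("baseline:", summarise(baseline_period))
print("with AI: ", summarise(ai_period))
```

A couple of records per period obviously isn’t enough to conclude anything – you’d want a representative sample, collected over time, for exactly the reasons above.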


In practice, most generative AI tools work best as first-draft engines. They create a starting point that humans refine and approve – a bit like asking an assistant for a blog outline and a set of research notes. And that makes the question much clearer: not "did the AI ever get something wrong?" but "did this process improve the work overall?"

Raising the floor

Another interesting pattern across the research is that AI often benefits less experienced workers the most. Experts still gain productivity improvements, but the impact tends to be smaller.

For newer staff, though, the tools can act as a form of on-demand support. AI assistants help people structure documents, find information faster and produce work closer to expert level. We’ve seen the same thing across our partners: the biggest winners are often the staff who are struggling the most, which makes the benefits all the more valuable.

It means you can compare the impact of AI to the cost of training and experience that might otherwise take years to build. Seen that way, one of the most important effects of AI may not be replacing skilled professionals. It may be raising the floor of performance across organisations, helping more people produce good work more consistently. That “hit the ground running” line in every job advert suddenly becomes a lot more achievable.

If your new system is faster, easier or more consistent than the one that existed before, then the organisation has improved its process. If that improvement is measurable rather than based on anecdote (or vibes), leaders can make much more confident decisions about how, where and when to invest in and use the technology.

That may not be as dramatic as the idea of flawless AI replacing human work. But it is much closer to how most technology actually delivers value: not by being perfect, but by making everyday work a little bit better than it was before.

*This is pretty unfair on the Dave community, many of whom I count as friends. I can only apologise.


Still not convinced? You don’t have to be. We wrote an actual report using actual data – which we had baselined – about AI adoption in North Yorkshire’s social care teams.