
Mainstream generative AI’s biggest weakness – and therefore the thing that most often draws negative press attention – is that it gets things wrong. It hallucinates. Sometimes the outcome is a harmless nugget of nonsense you can share for lols, but sometimes it’s more dangerous. If you want to use gen AI for work, or for anything that really matters to you, that tendency simply marks it out as unreliable. The fascinating irony is that the tech quirks driving that unreliability are some of the same features that enable it to improve so very, very quickly.
It’s not all AI, but you know the phenomenon we’re talking about. You ask ChatGPT or Claude for some information and get three different answers on three different tries. Sometimes that’s by design — models are built to be probabilistic, not deterministic, and sometimes they’ve been updated in between, or you’ve asked a subtly different question — but sometimes it’s a good old-fashioned hallucination.
OpenAI and Georgia Tech researchers recently published a paper that gets to the heart of why this happens: it turns out hallucinations aren’t just about missing information. They’re about incentives.
Here’s the problem: these models are evaluated continuously so the training can improve, and those evaluations work a bit like multiple-choice tests. If the model answers correctly, it gets a point. If it says “I don’t know,” it scores nothing. If it guesses, sometimes it gets lucky. Like a student who scribbles something plausible-sounding instead of leaving the page blank, the model gets rewarded for bluffing – just like so many expensively educated politicians who can sound plausible because they use big words and look the part.
All of that means a vicious cycle sets in: the AI learns to bluff with aplomb, the benchmarks reward that behaviour, and so even as models improve they keep hallucinating, because that’s what the system has trained them to do. It’s like telling someone his bluster is charming and, oh, I don’t know, rewarding him by making him Mayor of London: he learns that bluster works, does it more, and the problem becomes harder and harder to contain.
The paper’s proposed fix is simple: change the scoring. Penalise wrong answers more heavily than honest uncertainty, so that “I don’t know” becomes a better answer than confident nonsense. Imagine an AI (or a politician) with the humility to tell you when they don’t know the answer. That’s not just refreshing, it’s essential if we want trustworthy systems rather than enthusiastic, confident but unreliable quiz team members.
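To make the incentive concrete, here’s a toy sketch in Python – nothing to do with the paper’s actual evaluation code, and the probabilities are made up – showing why a model that is only 30% sure should guess under the usual scoring, and admit uncertainty under the adjusted one:

```python
# A toy illustration (not the paper's real scoring code) of how the
# benchmark's scoring rule changes a model's best strategy when it is unsure.

def expected_score(p_correct: float, reward: float, abstain: float, penalty: float) -> dict:
    """Expected score of guessing vs. saying 'I don't know'."""
    guess = p_correct * reward + (1 - p_correct) * penalty
    return {"guess": guess, "i_don't_know": abstain}

p = 0.3  # assume the model only has a 30% chance of a lucky guess

# Standard benchmark: correct = 1, wrong = 0, "I don't know" = 0.
print(expected_score(p, reward=1, abstain=0, penalty=0))
# {'guess': 0.3, "i_don't_know": 0}  -> guessing always wins, so the model bluffs.

# Adjusted benchmark: a wrong answer costs more than honest uncertainty.
print(expected_score(p, reward=1, abstain=0, penalty=-1))
# {'guess': -0.4, "i_don't_know": 0} -> "I don't know" is now the rational answer.
```

The numbers are invented, but the logic is the paper’s point: the moment a wrong answer costs more than an honest “I don’t know”, bluffing stops being the winning strategy.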
Why this matters for the kind of AI we build
This research lands close to home for us at Leading AI Towers*. We’ve always been big fans of retrieval-augmented generation (RAG), because it changes the game from the start. A RAG system isn’t rewarded for bluffing; it is built to ground its answers in the documents you trust.
That’s why our AI tools don’t behave like most online general-purpose chatbots. Solutions like our Policy Buddy aren’t trying to please you or predict the most “likable” answer (although we do design them to be professional and friendly, because the alternative is absolutely no fun). They’re trained with a narrower role: give accurate advice, grounded in the policies and procedures you can access.
It’s why they’re dramatically less prone to hallucination – so much less prone that we’ve never actually seen it happen outside a testing environment**. When you anchor a model to a reliable source of truth and constrain its purpose, you remove much of the inclination to bluff. The model still generates natural language, but it does so with its feet planted firmly on solid ground.
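For the technically minded, here’s a bare-bones sketch of the pattern – illustrative only, with a made-up policy library and a deliberately naive keyword retriever standing in for the vector search a real system would use:

```python
# A minimal, illustrative RAG loop - not our production code. The retriever
# is a naive keyword-overlap ranker; real systems use vector search.

POLICY_LIBRARY = {
    "annual-leave.md": "Staff accrue 25 days of annual leave per year...",
    "sickness-absence.md": "Report sickness absence to your line manager by 9am...",
}

def retrieve(question: str, top_k: int = 2) -> list:
    """Rank documents by how many words they share with the question."""
    words = set(question.lower().split())
    scored = sorted(
        POLICY_LIBRARY.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_prompt(question: str) -> str:
    """Ground the model: answer from the retrieved policies, or admit uncertainty."""
    context = "\n\n".join(retrieve(question))
    return (
        "Answer ONLY using the policy extracts below. "
        "If they don't contain the answer, reply 'I don't know'.\n\n"
        f"Policy extracts:\n{context}\n\nQuestion: {question}"
    )

# The prompt is then passed to whichever language model you use.
print(build_prompt("How many days of annual leave do I get?"))
```

The important part is the instruction wrapped around the retrieved extracts: the model is told to answer from the documents in front of it or say “I don’t know”, rather than reaching into its memory for something plausible.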
And here’s where I’ll make a gentle point about the problem with Microsoft Copilot, for the 47th time. We’re working with organisations that are trying to build staff assistants with Copilot, trained on things like HR policies. These are serious tech teams, working closely with Microsoft, and yet the results are patchy at best – too patchy to roll out at scale. Staff often find the assistant inconsistent or unreliable. We think that’s not because the technology isn’t sophisticated (it is) but because a general-purpose Copilot is still wired for too big a job, and behaves like the “good test-taker” OpenAI describes in the paper: producing plausible-sounding answers even when the better move would be “I don’t know.” There are other issues too, but this is likely a fair chunk of what’s going wrong.
By contrast, our focused RAG models are trusted every day in critical areas like children’s social care. The difference isn’t just retrieval; it’s design. We don’t ask the model to be everyone’s assistant. We ask it to play a clear role, with the right incentives, in a domain where accuracy and trust come first.
A quick detour into cats
I sometimes explain hallucination with this thought experiment: what would happen if you asked ChatGPT to tell you everything it knows about cats?
If it actually vomited out every piece of cat-related information it had ever absorbed, like a big, dumb database, it would be overwhelming: unreadable for you, and impractical for the tech. It would be every bit as annoying as asking me to tell you everything about my ridiculous boy cat, Christmas. You would soon wish you hadn’t asked. But because we will keep asking normal human questions, the model has to filter its response using mathematical probability: what’s the most likely next word to give, in context, that fits the request?
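If you like seeing the principle in miniature, here’s a toy example – a hand-made probability table rather than a real model’s output – of the difference between “most likely next word” and “sampled next word”:

```python
import random

# A toy next-word distribution - real models score tens of thousands of tokens
# with a neural network, but the "pick by probability" principle is the same.
next_word_probs = {"sleeping": 0.45, "tuna": 0.30, "chaos": 0.20, "spreadsheets": 0.05}

# Deterministic choice: always take the most likely word.
most_likely = max(next_word_probs, key=next_word_probs.get)

# Probabilistic choice: sample in proportion to the probabilities, which is
# why the same question can get different answers on different tries.
sampled = random.choices(list(next_word_probs), weights=next_word_probs.values(), k=1)[0]

print(f"My cat likes {most_likely}.")  # always "My cat likes sleeping."
print(f"My cat likes {sampled}.")      # usually "sleeping", occasionally "spreadsheets"
```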
That’s not so different from how humans operate. When you walk down a busy street, you don’t process every brick, leaf and face in perfect detail. You make “good enough” assessments of what’s around you, so you can keep moving rather than freezing to assess every option before taking each new step.
The difference is, when a human misses something, they’ve still got instinct and the senses to check against reality. A general-purpose model doesn’t have that advantage — unless you give it retrieval: connect it to a trusted library of documents, so instead of guessing from memory, it can look up the facts it needs. Without RAG, it’s like someone walking down the street with their eyes closed, confidently telling you what they think they see. Sometimes they’ll be right. Often, they’ll walk into a lamppost.
Where this leaves us
I find it encouraging that OpenAI is talking openly about incentives and honesty in AI. Building models that can admit uncertainty is a necessary step. GPT-5 has already given out a few ‘I don’t know’ responses, and that’s progress.
But for most real-world use cases — and perhaps especially in public services, where trust is everything — you don’t have to wait for the labs to change the benchmarks. We can design systems today that are:
- Narrow in purpose — playing a defined role, not trying to be everything to everyone.
- Grounded in retrieval — citing sources of truth instead of making confident guesses.
- Focused on advice, not decisions — supporting professionals without overstepping into choices they should make themselves.
After all, the more you learn, the more you realise how much you don’t know. And in AI, as in life, sometimes the wisest thing is knowing when to say: “I’m not sure,” and call in an expert.
*We all work from home. If we want to work from an actual tower, we’re expected to supply our own.
**…yet. As of September 2025.