Hallucinating Reality. An Essay on Business Benefits of Accurate LLMs and LLM Hallucination Reduction Methods
“Don’t believe everything you get from ChatGPT” – Abraham Lincoln
Let’s talk about hallucinations – in the context of LLMs, those mean generating plausible-looking but false or misleading information. I sometimes wonder how much of their bad reputation got stuck with us because first impressions are the most lasting. Initially, I thought that once people internalize that they do happen and we make some significant (but not necessarily great) progress toward eliminating them, the topic would fade into the background – either by self-selecting the use cases, building around hallucinations, improving models, or combining all three. That fade hasn’t happened yet, and the worry still exists. I hear people claiming that some use cases are never going to work because there will always be “some hallucinations.” Or that they tried RAG four months ago and it didn’t work. Or they were scared away by the $1 Chevy story. That, in a nutshell, is the state of AI hallucinations and business risk.
We will start with what the future may bring and the case for optimism. Then we’ll look at what we can do today to reduce and/or manage hallucinations. RAG, RIG, HITL, CoT, and others await.
From short-term hurdles to long-term gains. Strategies for AI accuracy
While hallucinations are real – that is, the issue of dealing with them is real – I wouldn’t view them as showstoppers. Once again, I think we were too zealous and overestimated the short term, but now we should appreciate how much we underestimate the long term. The three forces of optimism in overcoming hallucinations are:
1. Hallucinations do happen – and once you know this, you can tweak both what use case you’re solving and how you solve it.
2. There’s a lot of effort being put into this – both money and brain power. We are still in the early days, and we’ve been making a lot of progress (think o1 vs GPT-3) on ALL the dimensions (performance, price, capabilities).
3. Right now, we still mostly hope we can hop over some unresolved issues with AI (i.e., let the “AI” figure this out), but we could always resort to working on the data/API/logic layer instead.
Just to explain what I mean by 3. – say you would like to extract data from an invoice or some kind of form. Taking a photo of it and putting it into ChatGPT might be the easiest way to start. But then you discover it doesn’t work well with some formats, or you need to handle some corner cases. You might build a bigger system around it to fix that. You might also discover that it works better if you just have the original PDF instead of the photo. But going a step further – what if you didn’t actually need the PDF in the first place and could get the data directly from an online form or through an API? In that case, all the data and field mapping would be defined programmatically, making sure all the critical fields (e.g., account number, client ID) are rigid and right, and limiting the AI to some post-processing like summarization or classification. In some cases, solidifying your solution by working at this layer will be a much easier way to improve it (e.g., to get from 97% to 99%) than working on the AI layer (i.e., improving the model). A rough sketch of that split is shown right below.
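To make that split concrete, here’s a minimal sketch in Python, assuming the data already arrives as a structured payload from a form or API. The field names, validation patterns, and the `call_llm` helper are hypothetical placeholders – the point is only that the critical fields never pass through the model.

```python
import re

# Hypothetical validation rules for the critical fields – adjust to your formats.
CRITICAL_FIELDS = {
    "account_number": re.compile(r"^\d{10}$"),
    "client_id": re.compile(r"^C-\d{6}$"),
}

def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM client you use."""
    raise NotImplementedError

def process_invoice(payload: dict) -> dict:
    # 1) Critical fields come straight from the form/API payload and are
    #    validated with rigid rules – no AI involved, no room to hallucinate.
    validated = {}
    for field, pattern in CRITICAL_FIELDS.items():
        value = str(payload[field])
        if not pattern.match(value):
            raise ValueError(f"Invalid {field}: {value!r}")
        validated[field] = value

    # 2) The LLM is limited to low-stakes post-processing, e.g., a one-line summary.
    summary = call_llm(
        "Summarize these invoice line items in one sentence:\n"
        f"{payload.get('line_items', [])}"
    )
    return {**validated, "summary": summary}
```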
Now, some of you may say that the point of having powerful new AI is also to avoid mundane, foundational work and workarounds for 15 corner cases – which can end up being expensive or limited by your software engineering resources. But guess what the killer AI application is right now – it’s coding. So, it’s getting much, much cheaper to build software. What’s left is figuring out how to architect it right to get the best of AI and the best of software per se. The shift from models to compound AI systems is not just about the foundational and platform side – we can embrace it at the use case level as well.
So, the future looks bright, but since you can’t eat hope for breakfast, let’s zoom in on hallucinations and then look at methods and strategies that could be used right now to tackle the problem.
The illusion of believability. Why we trust AI text and how to build systems to verify it
There’s something about getting information via reading that makes it instantly more believable. Maybe that’s our programming from the way we are educated, maybe it’s the tone (amplified even more with ChatGPT and the like), or maybe it’s the permanence of written text. Anyway, the issue is real – not only are LLMs prone to hallucinations, but we are also not equipped to catch them easily. We have already lived with a very similar problem, the “not everything you read on the internet is true” one, but in the classical search/retrieval approach there’s a key difference – we just retrieve the content. The point being, if the sources are truthful, so will our answer be. Thus, once we optimize sources for truthfulness, the algorithms can focus on just figuring out what we are looking for (although you may view PageRank as not only importance but also some kind of believability scoring – the world is never perfect).
Things change when we move to LLMs. Under the hood, the model is trained to create compelling-looking text. With some RLHF magic, we figured out we can turn that into creating compelling-looking answers, or whatever else we prompt it to do. One way to look at it is that, at its core, such a model is optimizing for looks more than for correctness. And for an alien civilization that would, for some reason, speak English, a small factual difference doesn’t really influence compellingness. As always – show me the incentives, and I’ll show you the outcome.
But, as I’ve already said – it’s all about the system. We are not limited to serving LLMs “neat” – we can build a whole “AI system” around them (or embed them into an existing one). That’s a general trend we are seeing nowadays. Such an approach gives us more elements to play with – e.g., adding input/output guardrails, providing context, or hardcoding some logic. Check out this great post from Chip Huyen to see what components you could add and why.
The Truth Is Out There
So, how do we reduce hallucinations in LLMs? What techniques can minimize them? Here are a few approaches.
- Make LLMs less prone to hallucinations (easier said than done)
The most direct approach is to try to fix it at the core – e.g., by fine-tuning LLMs on curated, high-quality datasets or adding constraints requiring factual consistency. Unfortunately, this approach is unlikely to get rid of hallucinations without sacrificing the model’s capabilities. It can also be costly and time-consuming, especially if you need to repeat the process. Because of this, it’s not commonly the first choice.
- Design systems that support accurate LLM performance – use grounding to anchor the outputs of a language model to a trusted source
As suggested above, instead of relying solely on an LLM to do the work, we can integrate it into systems that help manage and reduce hallucinations. This approach, especially RAG, has gained a lot of popularity recently, and rightly so.
- RAG (Retrieval-Augmented Generation)
The RAG approach combines LLMs with external knowledge bases to retrieve accurate and factual information. It provides the LLM with data retrieved from these reliable sources, and the model uses it to produce the final answer. It’s resource-efficient because it does not store the information within the model and can rely on scalable search. It also lets you maintain and update the knowledge sources separately from the model itself. To illustrate how RAG may work in practice, say you would like an assistant that suggests a possible resolution or procedure given the description of an encountered situation. The user asks a question like: “I got the error code E045. What should I do?” In the retrieval step, the system searches through the knowledge base (e.g., one with procedures, manuals, or previously encountered problems) and pulls up the most relevant entries. Then, in the generation step, this information is fed, together with the original question, to an LLM, which produces the response. Check out our blog if you’re interested in more details; a minimal sketch of the flow follows below.
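To make the E045 example concrete, here’s a minimal RAG sketch in Python. The toy knowledge base, the keyword-overlap retriever, and the `call_llm` helper are illustrative stand-ins for a real vector store and LLM client.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM client you use."""
    raise NotImplementedError

# Toy knowledge base; in practice this would be your procedures and manuals.
KNOWLEDGE_BASE = [
    {"id": "proc-e045", "text": "Error E045: restart the controller, then run diagnostics."},
    {"id": "proc-e100", "text": "Error E100: check the power supply and cabling."},
]

def retrieve(question: str, top_k: int = 3) -> list[str]:
    # Toy retrieval via keyword overlap; a real system would use vector or hybrid search.
    words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc["text"].lower().split())),
        reverse=True,
    )
    return [doc["text"] for doc in scored[:top_k]]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

# answer("I got the error code E045. What should I do?")
```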
- RIG (Retrieval-Interleaved Generation)
The RIG approach goes a step further and “plugs” the retrieved source directly into the LLM’s output. It’s less popular than RAG. The difference compared to RAG, in a similar scenario, is that the model first figures out that the response should be something like “In case of the error code E045, you should follow the following procedure: <<Procedures(“What is the procedure for E045”)>>”, and then, just before returning the response to the user, we execute the Procedures query to fill in the details. As such, it’s less flexible in how it presents the information, but in exchange, the content is faithfully retrieved in the same form as we stored it. A small sketch of that substitution step is shown below.
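Here’s a minimal sketch of that last substitution step. The `<<Procedures("...")>>` placeholder syntax comes from the example above; the `procedures_lookup` helper is a hypothetical stand-in for a query against your knowledge base.

```python
import re

def procedures_lookup(query: str) -> str:
    """Stand-in for a lookup against the procedures knowledge base."""
    return "Procedure for E045: 1) restart the controller, 2) run diagnostics."

# Matches <<Procedures("...")>> markers emitted by the model.
PLACEHOLDER = re.compile(r'<<Procedures\("([^"]+)"\)>>')

def resolve_placeholders(model_output: str) -> str:
    # Replace each marker with the retrieved text, so the cited content
    # reaches the user exactly as it is stored in the knowledge base.
    return PLACEHOLDER.sub(lambda m: procedures_lookup(m.group(1)), model_output)

draft = ('In case of the error code E045, you should follow the following procedure: '
         '<<Procedures("What is the procedure for E045")>>')
print(resolve_placeholders(draft))
```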
- Sourcing
Mechanisms that trace and cite sources for retrieved information allow the model to present supporting evidence or direct references that users can check. Think references in Wikipedia, or links to the web pages the answer is based on. This can also work in tandem with RAG, as sketched below.
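A minimal sketch of attaching sources to a RAG-style answer – the document shape, the example URL, and the `call_llm` helper are hypothetical placeholders.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your LLM client

def retrieve_docs(question: str) -> list[dict]:
    # Stand-in for your search; each document carries an id and URL for citation.
    return [{
        "id": "manual-7.2",
        "url": "https://example.com/manual#section-7-2",
        "text": "Error E045: restart the controller, then run diagnostics.",
    }]

def answer_with_sources(question: str) -> dict:
    docs = retrieve_docs(question)
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    answer = call_llm(
        "Answer using only the context below and cite the [id] you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Return the references alongside the answer so users can verify it.
    return {"answer": answer, "sources": [d["url"] for d in docs]}
```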
- Output guardrails and fact-checking pipelines
We can implement automated pipelines to detect (or even try to correct) factual discrepancies in the LLM’s output. For instance, before presenting the response to the user, we can use another LLM to help us assess its factuality, whether we stay within the scope of the question and the system’s competence, or other quality/risk indicators. These checks might prompt us to change the output or refrain from providing it altogether and return a fail-safe message. Since AI models are probabilistic in nature, having proper guardrails is a good idea in general, regardless of our stance on hallucinations. A minimal sketch of such a check is shown below.
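As a sketch of what such a check could look like: a second LLM call scores the draft answer against the retrieved context, and we fall back to a safe message below a threshold. The prompt wording, the threshold, and the `call_llm` helper are illustrative assumptions, not a specific framework’s API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your LLM client

FAIL_SAFE = "I'm not confident enough to answer that. Let me hand this over to a specialist."

def factuality_score(answer: str, context: str) -> float:
    # Ask a second model to judge how well the draft is supported by the context.
    verdict = call_llm(
        "On a scale from 0 to 1, how well is the ANSWER supported by the CONTEXT? "
        "Reply with a single number.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    try:
        return float(verdict.strip())
    except ValueError:
        return 0.0  # unparsable verdict: treat the answer as unsupported

def guarded_answer(draft: str, context: str, threshold: float = 0.7) -> str:
    # Return the draft only if the factuality check clears the threshold.
    return draft if factuality_score(draft, context) >= threshold else FAIL_SAFE
```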
- Prompt engineering and CoT (Chain-of-thought)
Prompting techniques can improve the model’s faithfulness, e.g., by setting constraints and guidelines, limiting it to specific sources, or asking for step-by-step reasoning. These could be instructions in prompts like “provide information from xyz” (where xyz are established sources), or “list potential … walk through each … identify key … summarize … compare,” etc. Of course, our prompt can also depend on the query, so that we can use a different approach for, e.g., a general question vs. a more specific, complex deep dive – see the template sketch below.
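Here’s a small, illustrative template sketch along those lines – the wording, the source placeholder, and the general-vs-deep-dive split are assumptions to show the idea, not canonical prompts.

```python
# Constrained prompt for quick, general questions.
GENERAL_TEMPLATE = (
    "Use only the provided sources: {sources}.\n"
    "If the sources don't cover the question, say so explicitly.\n"
    "Question: {question}\nAnswer briefly."
)

# Step-by-step (chain-of-thought style) prompt for deeper dives.
DEEP_DIVE_TEMPLATE = (
    "Use only the provided sources: {sources}.\n"
    "Think step by step: list the potential causes, walk through each one, "
    "identify the key evidence in the sources, then summarize and compare.\n"
    "Question: {question}"
)

def build_prompt(question: str, sources: str, deep_dive: bool = False) -> str:
    # Pick a template depending on the kind of query we detected upstream.
    template = DEEP_DIVE_TEMPLATE if deep_dive else GENERAL_TEMPLATE
    return template.format(sources=sources, question=question)
```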
- Rule-based systems
Traditional rule-based systems integrated alongside LLMs can ensure that certain responses follow predefined criteria, e.g., ones where the answer is straightforward or where we would not want to risk hallucinations. This might mean “hard-wiring” our response to “What’s my account balance?” or falling back to “For investment topics, please consult a financial advisor.” This is where it pays off a lot to understand how your system will be used. Perhaps 50% of interactions revolve around a few use cases that you would like to lay out programmatically, so that nothing is left to chance and you can introduce subtle tweaks or changes depending on some external criteria (e.g., A/B testing or user-specific settings). A simple routing sketch follows below.
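A simple routing sketch, with hypothetical intent checks and handlers: a few predefined cases are answered programmatically, and everything else falls through to the LLM pipeline.

```python
def get_account_balance(user_id: str) -> str:
    return "Your current balance is ..."  # stand-in for a backend call

# Each rule is a (matcher, handler) pair; matching and responses are illustrative.
RULES = [
    (lambda q: "account balance" in q.lower(),
     lambda q, user_id: get_account_balance(user_id)),
    (lambda q: "invest" in q.lower(),
     lambda q, user_id: "For investment topics, please consult a financial advisor."),
]

def route(question: str, user_id: str) -> str:
    for matches, handler in RULES:
        if matches(question):
            return handler(question, user_id)  # hard-wired, no hallucination risk
    return llm_pipeline(question)              # everything else goes to the AI system

def llm_pipeline(question: str) -> str:
    raise NotImplementedError  # stand-in for the RAG/guardrail pipeline above
```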
- Human-in-the-loop verification (HITL)
Last but not least, we may use human review and/or validation, e.g., for high-stakes or critical systems. This may follow a predefined rule (e.g., fall back on humans for tricky queries) or be triggered dynamically during output verification or based on user interaction (e.g., being unable to resolve an issue within a specified timeframe or number of turns, the user getting angry, or high risk/fidelity considerations). This is also a good way of instilling periodic/sampling reviews for quality control and feedback loops for improvement. A sketch of such escalation triggers is shown below.
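A sketch of what dynamic escalation triggers might look like – the turn limit, the frustration check, and the guardrail signal are illustrative assumptions.

```python
MAX_TURNS = 6  # assumed limit before we stop letting the bot retry

def looks_angry(message: str) -> bool:
    # Stand-in for a sentiment / frustration classifier.
    return any(w in message.lower() for w in ("ridiculous", "useless", "angry"))

def should_escalate(turns: int, last_user_message: str, guardrail_passed: bool) -> bool:
    return (
        turns >= MAX_TURNS               # unresolved after too many turns
        or looks_angry(last_user_message)
        or not guardrail_passed          # output verification flagged the draft
    )

def respond(draft: str, turns: int, last_user_message: str, guardrail_passed: bool) -> str:
    if should_escalate(turns, last_user_message, guardrail_passed):
        return "Let me connect you with a human colleague who can help."
    return draft
```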
- UX design that shifts the burden of facts
Another way to manage hallucinations is to design a UX that shifts the burden to the user or makes it clear when something needs verification. For example, when asked, “What were the sales for product X in Germany last quarter?”, instead of directly returning a generated answer, we can retrieve the relevant data and show it to the user. Or we can specifically mark what requires verification. This is particularly useful if the system is embedded in or accompanies an existing application. A minimal sketch of such a response shape is below.
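A minimal sketch of that response shape – the `query_sales_db` helper, the field names, and the verification flag are hypothetical; the point is that the UI gets the underlying data plus an explicit marker of what is AI-generated commentary.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your LLM client

def query_sales_db(**filters) -> list[dict]:
    raise NotImplementedError  # stand-in for your reporting backend

def sales_answer(question: str) -> dict:
    # Pull the actual figures from the system of record instead of letting the
    # model state them from memory (parsing the question into filters is omitted).
    rows = query_sales_db(product="X", region="Germany", quarter="last")
    narrative = call_llm(f"Describe this data in one sentence: {rows}")
    return {
        "data": rows,                         # shown to the user as the source of truth
        "narrative": narrative,               # AI-generated commentary
        "needs_verification": ["narrative"],  # tells the UI what to flag
    }
```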
- Accept it (and even embrace it with the right use cases)
If we expect and don’t mind hallucinations, they may not really be a problem. Use cases around brainstorming, content generation, and inspiration come to mind. We don’t complain about hallucinations in DALL-E – hallucinating is its sole role. If it’s clear that our system is in this “creative” rather than “factual” mode, our users can also self-select – themselves, their use cases, or how they consume the outputs – to account for this.
ChatGPT, please summarize this for me
This text is already long, so I asked ChatGPT to write this last section for me. It took three tries and some nudges – the first one was more of a summary listing the points than a concluding part of the text, so I asked it to reinforce the main message and insights instead. The second one had OK-ish substance but sounded too lofty, not matching the style of the main part. So, I asked it to match the original tone, and here it is :)
“We’re making headway, even if the path is far from flawless. The concern around AI hallucinations remains, but it’s not a dead end—it’s a sign of an evolving tool we’re learning to use with purpose. Progress comes from understanding that the perfect solution isn’t around the corner; instead, it’s about building layers, integrating systems, and making smart design choices. With the right approach, we don’t need to wait for AI to be perfect; we can shape it to work for us now, navigating the imperfections while keeping an eye on the long-term gains.”