Hallucinations & Human in the Loop

Are Generative AI hallucinations here to stay? According to recent research, hallucinations are not a bug but a feature of current AI models, a byproduct of the probabilistic nature of these systems that won’t go away. While their frequency can be reduced, the study suggests that complete elimination isn’t possible — at least not yet.

The study aimed to assess the reliability of even the most advanced language models, focusing on their tendency to produce hallucinations: instances where the AI generates confident but factually incorrect information. The researchers set out to understand the frequency and nature of these hallucinations, using rigorous testing methods and datasets designed specifically to challenge the models’ reasoning and knowledge, in order to uncover just how far these systems can be trusted in real-world applications.

What they found was revealing: no matter how advanced or extensively trained the model, hallucinations still occurred. The AI doesn’t “fail” to know things; rather, it generates a response to every query, even when it lacks a reliable or accurate answer, hence the hallucinations. This is a consequence of how language models are designed: they predict the next word in a sequence based on probabilities learned from their training data, rather than from any actual understanding of facts.

While this often results in contextually appropriate and accurate responses, it can also lead to confident yet factually incorrect answers, particularly when the model encounters knowledge gaps or ambiguous inputs. The study highlights that, despite the models’ sophistication, their predictions remain prone to error when explicit knowledge is missing or the query itself is ambiguous.
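
To make that mechanism concrete, here is a deliberately simplified sketch of next-token sampling in Python. It is not how any particular model is implemented, but it illustrates the key point: the model turns its raw scores into a probability distribution and always emits something, even when no candidate is strongly supported.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, vocab: list[str], temperature: float = 1.0) -> str:
    """Sample the next token from a probability distribution over the vocabulary.

    Illustrative only: real models condition these scores on the full context,
    but the core behaviour is the same.
    """
    # Convert raw scores into a probability distribution (softmax).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # The model never abstains: it samples a token regardless of how flat
    # (i.e. how uncertain) the distribution is.
    return np.random.choice(vocab, p=probs)

# A nearly uniform distribution: the model is unsure, yet it still "answers".
vocab = ["Paris", "London", "Berlin", "Madrid"]
logits = np.array([0.10, 0.08, 0.09, 0.07])
print(sample_next_token(logits, vocab))
```

Note that nothing in this process checks whether the sampled answer is true; that is precisely the gap that produces hallucinations.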

In case you were wondering, the research also explored potential solutions to mitigate hallucinations, focusing on fine-tuning and retrieval-augmented generation (RAG).

Fine-tuning, where models are trained on more specific datasets, improves task-specific accuracy but doesn’t completely solve the issue of hallucinations, as models still rely on probabilistic predictions rather than fact-based reasoning.
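
For illustration only, a minimal fine-tuning sketch using the Hugging Face transformers library might look like the following; the model name, the hypothetical domain_corpus.txt file and the hyperparameters are placeholders, not recommendations.

```python
# Minimal causal-LM fine-tuning sketch (placeholder model, data and settings).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # any small causal LM works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "domain_corpus.txt" is a hypothetical file of domain-specific text.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modelling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even after this, the resulting model is still sampling from learned probabilities, which is why fine-tuning narrows but does not close the hallucination gap.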

RAG, on the other hand, attempts to reduce hallucinations by retrieving external, verified information before generating responses. While this grounding process helps, the study found that neither fine-tuning nor RAG fully eliminates hallucinations.
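
As a sketch of the idea, a toy RAG pipeline could look like the one below. The call_llm function is a placeholder for whatever model or API is actually used, and the keyword-overlap retriever stands in for the embedding-based search a production system would use; the grounding principle is the same.

```python
# Toy RAG sketch: retrieve relevant passages first, then ground the prompt in them.
KNOWLEDGE_BASE = [
    "Refund requests above the standard limit require manager approval.",
    "Support tickets are triaged within one business day.",
    "Customer data must not be shared outside the approved regions.",
]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in the actual model or API call here."""
    return f"(model output for a {len(prompt)}-character grounded prompt)"

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    return sorted(documents,
                  key=lambda doc: len(query_terms & set(doc.lower().split())),
                  reverse=True)[:top_k]

def answer_with_rag(query: str) -> str:
    # Ground the prompt in retrieved passages and ask the model to abstain
    # when the context does not contain the answer.
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    prompt = ("Answer using ONLY the context below. If the context does not "
              "contain the answer, say you do not know.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return call_llm(prompt)

print(answer_with_rag("Who has to approve a large refund request?"))
```

Even so, the model can still misread or over-extend the retrieved context, which is why grounding reduces rather than eliminates hallucinations.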

Now, this doesn’t mean AI is unreliable (in practice, we’re often talking about errors in only a small fraction of outputs), but it does underline that full automation is not as close as some might have hoped. Good news for those worried about mass unemployment. Bad news for enterprises looking to streamline operations with the promise of flawless AI-driven workflows.

Augmentation over Automation

Then again, AI has never been about fully automating workflows — it’s always been about augmenting us to do more with less. This is why having a human-in-the-loop (HITL) is essential. Rather than removing humans from the equation, HITL integrates human judgment into the process, allowing AI to handle routine tasks while humans step in to oversee critical decisions and address any inaccuracies or hallucinations the AI might produce.

It’s a careful balance: AI handles routine tasks, while human judgment remains essential to maintaining the quality and integrity of the outcomes. AI alone won’t make workers more efficient if they follow its suggestions willy-nilly. Instead, efficiency comes when workers use AI as a tool while staying critical of its outputs, ensuring that any errors or hallucinations are caught before they impact the final results.
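
One way to operationalise that balance is a simple review gate: routine, high-confidence drafts flow through, while anything low-confidence or business-critical is routed to a person. The sketch below is built on assumptions; the confidence score, the 0.9 threshold and the review-queue hook are all placeholders for whatever your stack provides.

```python
# Human-in-the-loop gate sketch: the model drafts, people approve what matters.
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # e.g. derived from token log-probs or a verifier model

def draft_with_model(task: str) -> Draft:
    """Placeholder for the actual model call."""
    return Draft(text=f"AI draft for: {task}", confidence=0.72)

def request_human_review(task: str, draft: Draft) -> str:
    """Placeholder for a review queue or ticketing integration."""
    print(f"[REVIEW NEEDED] {task} (confidence {draft.confidence:.2f})")
    return draft.text  # in practice, return the human-approved or edited version

def process(task: str, critical: bool, threshold: float = 0.9) -> str:
    draft = draft_with_model(task)
    if critical or draft.confidence < threshold:
        # Escalate: a human can accept, edit, or reject the draft.
        return request_human_review(task, draft)
    return draft.text  # routine, high-confidence work flows straight through

print(process("Summarise this support ticket", critical=False))
```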

In fact, it is in trusting AI blindly where the real risks lie. A study by Agudo et al. found that participants who leaned heavily on AI suggestions often fell victim to "automation bias", assuming the system’s confidence equated to accuracy. This led them to replicate the AI’s mistakes, ultimately compromising the decision-making process.

It makes sense: workers who perceived the AI systems as less reliable were more cautious, taking extra steps to verify the system’s suggestions. When the AI provided incorrect suggestions, participants who first made their own judgments achieved a significantly higher accuracy rate of 66.2%, compared with just 36.8% for those who saw the AI’s recommendations before forming their own decisions.

So when participants were aware of AI’s limitations and understood that it could generate errors, they became more critical of its suggestions, fostering better outcomes as users applied their own judgment alongside the AI’s.

The findings of both studies show that, to be implemented effectively, any AI adoption strategy must go beyond simply integrating the technology into user workflows.

Of course, efforts to mitigate hallucinations must continue, but my point is that it is also essential to educate users about AI’s potential for errors. At least for the time being.

AI is an invaluable tool, but it’s still fallible. Anyone who says otherwise is either wrong, lying or both.

As these studies help illustrate, ensuring that users are aware of AI’s limitations is crucial not only for maintaining accuracy but also for fostering long-term trust in the technology. Human-in-the-loop approaches are not stopgap solutions — they are vital to making sure AI enhances, rather than replaces, human judgment.

With proper training and awareness, enterprises can harness the full power of AI without falling into the trap of over-reliance, using its (many) strengths while remaining mindful of its (fewer) weaknesses.


Get in touch to learn more about how to minimise the risks and maximise the rewards of enterprise AI adoption.

