When AI Lies: Understanding and Fighting LLM Hallucinations

4 min read

Large language models have become incredibly powerful—yet they share an unsettling trait with humans: they sometimes make things up. But unlike a human’s white lie, LLM hallucinations are statistical fabrications born from the model’s attempt to predict the next token. When these fabrications appear confident and polished, they become especially dangerous.

The Two Faces of Hallucination

Not all hallucinations are created equal. In-context hallucination occurs when a model contradicts information explicitly provided in the conversation context. The model ignores what you just told it and generates something different.

Extrinsic hallucination is more insidious: the model fabricates content that isn’t grounded in its training data or the real world. The real problem: with training data measured in trillions of tokens, it’s nearly impossible to verify every claim against the source material.

The core challenge is straightforward: LLMs are trained to maximize log-likelihood of tokens, not to tell the truth. When the model “remembers” incorrect information from training, it’s not lying—it’s just doing what it was designed to do: predict what comes next.
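In equation form, the standard pre-training objective is just the negative log-likelihood of each next token given its prefix; nothing in it mentions truth:

```latex
% Autoregressive pre-training loss: minimize the negative log-likelihood
% of each token x_t given the preceding tokens x_{<t}. Truthfulness never appears.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```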

Why Do Models Hallucinate?

The root causes span the entire ML pipeline:

Pre-training data problems: The internet is a noisy source. Outdated information, outright errors, and contradictions all get baked into the model. A model trained on all of GitHub, all of Wikipedia, and all of Reddit learns patterns—not facts.

Fine-tuning traps: Research by Gekhman et al. (2024) revealed something alarming: when you fine-tune an LLM on examples containing new knowledge, the model learns those examples more slowly than examples consistent with what it already knows. But once it does learn them, those new facts actually increase hallucination rates. The model becomes overconfident about information it barely understands.

The knowledge uncertainty problem: Models struggle to know what they don’t know. When asked about obscure topics, they generate plausible-sounding responses rather than admitting uncertainty.

How Researchers Detect Hallucinations

The research community has developed sophisticated detection methods:

FActScore (Min et al., 2023) breaks down generated text into atomic facts and checks each against a knowledge base. The result: a precision score showing what percentage of claims are actually true.
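In code, the scoring loop is essentially a precision computation. Here is a minimal sketch, with `extract_atomic_facts` and `is_supported` as assumed stand-ins for the paper’s LLM-based claim splitter and knowledge-source lookup (not the actual FActScore API):

```python
from typing import Callable, List

def factscore(
    text: str,
    extract_atomic_facts: Callable[[str], List[str]],  # assumed: LLM call that splits text into standalone claims
    is_supported: Callable[[str], bool],                # assumed: checks one claim against a knowledge source
) -> float:
    """Fraction of atomic facts in `text` that the knowledge source supports (a precision score)."""
    facts = extract_atomic_facts(text)
    if not facts:
        return 1.0  # nothing was claimed, so nothing can be false
    supported = sum(is_supported(fact) for fact in facts)
    return supported / len(facts)
```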

SAFE (Search-Augmented Factuality Evaluator; Wei et al., 2024) goes further, using an LLM agent that issues multiple Google searches to verify each factual claim. Interestingly, SAFE agrees with human annotators 72% of the time while being roughly 20x cheaper.
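The same idea with search in the loop might look like the sketch below; `llm` and `web_search` are hypothetical helpers standing in for a chat-completion call and a search API, and the real SAFE pipeline has more structure (multi-step reasoning, revised ratings) than shown here:

```python
from typing import Callable, List

def safe_style_check(
    fact: str,
    llm: Callable[[str], str],               # assumed: returns the model's text reply to a prompt
    web_search: Callable[[str], List[str]],  # assumed: returns snippets for a search query
    max_queries: int = 3,
) -> bool:
    """Verify a single atomic fact by iteratively searching for evidence, SAFE-style."""
    evidence: List[str] = []
    for _ in range(max_queries):
        query = llm(
            f"Claim to verify: {fact}\n"
            f"Evidence gathered so far: {evidence}\n"
            "Write one new Google search query that would help verify the claim."
        )
        evidence.extend(web_search(query))
    verdict = llm(
        f"Claim: {fact}\nEvidence: {evidence}\n"
        "Based only on the evidence, is the claim supported? Answer 'supported' or 'not supported'."
    )
    return verdict.strip().lower().startswith("supported")
```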

SelfCheckGPT takes a clever approach: generate multiple responses to the same prompt, then check consistency between them. If the model can’t stay consistent with itself, that’s a red flag.
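A simplified sketch of the consistency check (the paper scores each sentence of one main response against several sampled responses, e.g. with an NLI model; this pairwise version just shows the shape of the idea, with `generate` and `contradiction_score` as assumed helpers):

```python
from itertools import combinations
from typing import Callable, List

def self_consistency_flag(
    prompt: str,
    generate: Callable[[str], str],                     # assumed: sampled (temperature > 0) LLM call
    contradiction_score: Callable[[str, str], float],   # assumed: e.g. NLI, 0 = consistent, 1 = contradictory
    n_samples: int = 5,
) -> float:
    """Average pairwise contradiction across samples; higher means the model disagrees with itself."""
    samples: List[str] = [generate(prompt) for _ in range(n_samples)]
    pairs = list(combinations(samples, 2))
    return sum(contradiction_score(a, b) for a, b in pairs) / len(pairs)
```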

Fighting Back: Practical Solutions

Retrieval-Augmented Generation (RAG)

The most widely deployed solution: give the model access to a knowledge base at inference time. Instead of relying solely on what it memorized during training, the model retrieves relevant documents and generates an answer grounded in them.
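The core loop is short. A minimal sketch, assuming a `retrieve` function over some document index and a generic `llm` call (any vector store and model provider would slot in here):

```python
from typing import Callable, List

def rag_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # assumed: returns top-k passages from a document index
    llm: Callable[[str], str],                  # assumed: chat-completion style call
    k: int = 4,
) -> str:
    """Retrieve supporting passages first, then generate an answer grounded in them."""
    passages = retrieve(question, k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is not sufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```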

Self-RAG (Asai et al., 2024) trains models to decide whether to retrieve information, then critically evaluate their own outputs against retrieved content.

Chain-of-Verification

Rather than generating one answer, models can now:
1. Produce an initial response
2. Generate verification questions about key claims
3. Answer those questions independently
4. Revise the original response based on findings

This breaks the feedback loop where the model’s original (potentially wrong) answer influences its own re-generation.
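Put together, the four steps can be wired up in a few lines. A sketch, with `llm` as an assumed chat-completion helper (the Chain-of-Verification paper explores several variants of how much context each step is allowed to see):

```python
from typing import Callable, List

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    """Draft -> plan verification questions -> answer them independently -> revise."""
    draft = llm(question)
    plan = llm(f"List short verification questions for the key factual claims in:\n{draft}")
    questions: List[str] = [line.strip("-• ").strip() for line in plan.splitlines() if line.strip()]
    # Answer each verification question WITHOUT showing the draft,
    # so the original (possibly wrong) answer cannot bias the checks.
    checks = [f"Q: {q}\nA: {llm(q)}" for q in questions]
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Verification Q&A:\n" + "\n".join(checks) + "\n"
        "Rewrite the draft answer so it is consistent with the verification answers."
    )
```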

Fine-Tuning for Factuality

New training methods like FLAME (Factuality-Aware Alignment; Lin et al., 2024) focus specifically on factuality (a sketch of the reward-signal idea follows the list):
– Generate training data that’s more factual than the model’s natural output
– Use FActScore as a reward signal during RLHF
– Avoid inadvertently teaching the model new, potentially wrong information
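One way to make the reward-signal bullet concrete (a rough sketch, not FLAME’s exact recipe): sample several responses, score each with a FActScore-style metric, and keep the best and worst as a preference pair for preference-based tuning such as DPO:

```python
from typing import Callable, List, Tuple

def factuality_preference_pair(
    prompt: str,
    generate: Callable[[str], str],            # assumed: sampled LLM call
    factuality_score: Callable[[str], float],  # assumed: e.g. a FActScore-style precision metric
    n_samples: int = 8,
) -> Tuple[str, str]:
    """Return (chosen, rejected) responses ranked by factuality, for preference tuning."""
    samples: List[str] = [generate(prompt) for _ in range(n_samples)]
    ranked = sorted(samples, key=factuality_score, reverse=True)
    return ranked[0], ranked[-1]  # most factual as 'chosen', least factual as 'rejected'
```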

Calibration: Knowing What You Don’t Know

Models like GPT-4 show surprisingly good calibration—they’re genuinely more uncertain about questions they can’t answer. The key is extracting this uncertainty in useful ways: prompting for confidence levels, or using token probabilities to estimate reliability.
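A crude but common proxy for the token-probability route (a sketch; sequence likelihood is not the same thing as factual reliability, but for short factual answers the two tend to correlate):

```python
import math
from typing import List

def sequence_confidence(token_logprobs: List[float]) -> float:
    """Geometric mean of per-token probabilities as a rough confidence estimate.

    Many model APIs can return per-token log-probabilities alongside the text;
    values near 1.0 mean the model was consistently confident in each token it chose.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```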

The Road Ahead

Hallucination isn’t a bug to fix—it’s a fundamental characteristic of next-token prediction models trained on imperfect data. The goal isn’t perfection; it’s building systems that:
– Acknowledge uncertainty when present
– Ground responses in verifiable sources
– Provide appropriate confidence levels

As AI systems take on more consequential tasks—medical advice, legal research, scientific literature—we need hallucination rates approaching zero. The research community is making progress, but we’re not there yet.


Based on analysis of Lilian Weng’s “Extrinsic Hallucinations in LLMs” (July 2024)
