
June 4, 2026
A practical look at hallucinations, retrieval, and why having the right documentation is not the same as understanding it.
Over the past few months, I’ve been experimenting with a simple question:
Can a generic LLM become a Ruby expert simply by giving it access to Ruby documentation?
The answer is both yes and no.
Like many developers exploring AI tooling, I built a Retrieval-Augmented Generation (RAG) pipeline using a local vector database and indexed Ruby documentation. The goal was straightforward: reduce hallucinations and improve technical accuracy when answering questions about Ruby libraries and APIs.
The results were fascinating.
The model improved dramatically.
But it didn’t stop hallucinating.
Instead, the hallucinations evolved.
The Assumption Most Developers Make
When first learning about RAG, many developers assume a workflow like this:
Question
↓
Retrieve relevant documentation
↓
Provide context to LLM
↓
Correct answer
It feels logical.
If the correct information is present, the model should produce the correct answer.
Unfortunately, reality is more complicated.
A better representation is:
Question
↓
Retrieve relevant documentation
↓
Provide context to LLM
↓
Model interprets context
↓
Answer
That extra step changes everything.
The model is still generating tokens probabilistically.
Documentation helps.
Documentation does not guarantee understanding.
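The revised flow can be sketched in a few lines of Ruby. This is an illustrative sketch, not the pipeline used in the experiment: the `docs` entries are invented placeholders, and `retrieve` (plain word overlap instead of a vector database) and `build_prompt` are hypothetical helpers. The generation step is deliberately left out, because it is the one step the code around the model cannot control.

```ruby
# Toy corpus. These entries are invented placeholders, not real library docs.
docs = [
  { id: "doc-1", text: "resize takes a width and a height in pixels" },
  { id: "doc-2", text: "save writes the image to the given path" },
  { id: "doc-3", text: "rotate takes an angle in degrees" }
]

def tokenize(text)
  text.downcase.scan(/[a-z0-9_]+/)
end

# Step 1: retrieve. Rank documents by how many question words they share.
def retrieve(question, docs, top_k: 2)
  query = tokenize(question).uniq
  scored = docs.map { |doc| [doc, (query & tokenize(doc[:text])).size] }
  scored.select { |_, score| score.positive? }
        .sort_by { |_, score| -score }
        .first(top_k)
        .map(&:first)
end

# Step 2: provide context. Assemble the prompt the model will see.
def build_prompt(question, chunks)
  context = chunks.map { |c| "[#{c[:id]}] #{c[:text]}" }.join("\n")
  "Context:\n#{context}\n\nQuestion: #{question}\nAnswer:"
end

question = "What units does rotate use for the angle?"
chunks = retrieve(question, docs)
prompt = build_prompt(question, chunks)
puts prompt

# Step 3 (not shown): the model interprets the prompt and generates tokens.
# Nothing above guarantees that the interpretation is correct.
```

Everything up to `puts prompt` is deterministic and testable. The step after it is not, which is where the rest of this post focuses.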
My Ruby-LibGD Experiment
While testing a knowledge base built around Ruby-LibGD documentation, I noticed something unexpected.
Many answers improved immediately after indexing the project documentation.
Questions that previously generated completely fabricated API calls suddenly became accurate.
However, a smaller category of failures remained.
In those cases:
Question
↓
Correct document retrieved
↓
Correct chunk retrieved
↓
Context provided
↓
Incorrect answer
At first, I assumed the retrieval system was failing.
It wasn’t.
The retrieval pipeline was doing exactly what it was supposed to do.
The problem was happening after retrieval.
Retrieval Is Not Understanding
This is perhaps the most important lesson I learned.
RAG solves a knowledge-access problem.
It does not solve a reasoning problem.
Imagine asking:
What does this parameter represent?
The relevant documentation is retrieved.
The parameter description is present.
The model still has to interpret the text correctly.
Sometimes it succeeds.
Sometimes it combines that information with prior training data.
Sometimes it fills in gaps that don’t actually exist.
The retrieval system did its job.
The language model didn’t necessarily do its job.
Four Types of RAG Failures
After enough testing, most failures seemed to fall into four categories.
1. Retrieval Failure
The simplest case.
The correct document was never retrieved.
Question
↓
Wrong document
↓
Wrong answer
This is what most people think of when discussing RAG quality.
Better embeddings, hybrid search, metadata filtering, and reranking can often reduce this kind of failure.
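As one example of those options, hybrid search blends a vector-similarity score with a literal keyword score. The sketch below is a toy version under stated assumptions: the two-dimensional vectors stand in for real embeddings, `blend_layers` is an invented method name, and the weighting parameter `alpha` is a hypothetical knob, not a setting from any particular library.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Fraction of query terms that literally appear in the text.
def keyword_score(query, text)
  terms = query.downcase.scan(/[a-z0-9_]+/).uniq
  return 0.0 if terms.empty?
  words = text.downcase.scan(/[a-z0-9_]+/)
  (terms & words).size.to_f / terms.size
end

# Weighted blend: alpha = 1.0 is pure vector search, 0.0 is pure keyword search.
def hybrid_rank(query, query_vec, chunks, alpha: 0.5)
  chunks.map { |c|
    score = alpha * cosine(query_vec, c[:vec]) +
            (1 - alpha) * keyword_score(query, c[:text])
    c.merge(score: score)
  }.sort_by { |c| -c[:score] }
end

# Invented chunks with made-up embedding vectors.
chunks = [
  { id: "a", text: "blend_layers merges two layers into one", vec: [0.5, 0.8] },
  { id: "b", text: "general notes on combining pictures", vec: [0.9, 0.3] }
]
query = "blend_layers arguments"
query_vec = [1.0, 0.2]

vector_only = hybrid_rank(query, query_vec, chunks, alpha: 1.0)
hybrid = hybrid_rank(query, query_vec, chunks)
puts "vector only: #{vector_only.first[:id]}, hybrid: #{hybrid.first[:id]}"
```

In this toy data the vector score alone prefers the vaguely related chunk, while the blended score prefers the chunk that contains the exact identifier. That pattern is common with API names, which embeddings may treat as just another rare token.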
2. Chunking Failure
The correct information exists.
But it is split across multiple chunks.
Chunk A
----------------
Part of the explanation

Chunk B
----------------
Remaining explanation
The answer requires both chunks.
The retriever only finds one.
The model then tries to complete the missing information.
Sometimes correctly.
Sometimes not.
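One common mitigation is to let neighboring chunks overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The sketch below splits on words for simplicity. `chunk_words` and its `size` and `overlap` parameters are hypothetical names, the sample sentence is invented, and a real pipeline would more likely split on tokens or document structure.

```ruby
# Split text into windows of `size` words, each sharing `overlap` words
# with the previous window.
def chunk_words(text, size:, overlap:)
  raise ArgumentError, "overlap must be smaller than size" unless overlap < size
  words = text.split
  step = size - overlap
  chunks = []
  start = 0
  while start < words.size
    chunks << words[start, size].join(" ")
    break if start + size >= words.size
    start += step
  end
  chunks
end

# Invented sentence; the phrase "is measured" crosses a chunk boundary.
text = "The radius argument is measured in pixels and must be positive"

no_overlap = chunk_words(text, size: 4, overlap: 0)
with_overlap = chunk_words(text, size: 4, overlap: 2)

puts no_overlap.inspect
puts with_overlap.inspect
```

Overlap reduces the chance of a split explanation, at the cost of more chunks to index. It does not remove the problem: an explanation longer than one chunk still needs the retriever to find every piece.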
3. Context Compression Failure
This one surprised me.
Many developers assume:
More context = better answers.
Not always.
If you retrieve twenty partially relevant chunks, the important information can become diluted.
The answer may be buried inside a large amount of surrounding text.
The model sees everything.
But it does not weigh every passage equally.
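A simple guard against dilution is to cap how many chunks reach the prompt and drop any that score far below the best match. The sketch below assumes each chunk already carries a non-negative relevance `:score`. `select_context`, `max_chunks`, and `min_ratio` are hypothetical names, and the default values are arbitrary starting points rather than recommendations.

```ruby
# Keep at most `max_chunks` chunks, and only those scoring at least
# `min_ratio` of the best score. Assumes scores are non-negative.
def select_context(scored_chunks, max_chunks: 4, min_ratio: 0.6)
  return [] if scored_chunks.empty?
  ranked = scored_chunks.sort_by { |c| -c[:score] }
  floor = ranked.first[:score] * min_ratio
  ranked.take_while { |c| c[:score] >= floor }.first(max_chunks)
end

# Twenty candidates with invented scores: two strong matches, eighteen weak ones.
candidates = [{ id: "c1", score: 0.91 }, { id: "c2", score: 0.85 }] +
             (3..20).map { |i| { id: "c#{i}", score: 0.30 + i * 0.005 } }

selected = select_context(candidates)
puts selected.map { |c| c[:id] }.inspect
```

Here only the two strong matches survive, so the prompt carries two focused chunks instead of twenty. The right thresholds depend on the scoring function, so they are worth tuning against real questions.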
4. Reasoning Failure
The most interesting category.
Everything works.
The correct document is found.
The correct chunk is found.
The relevant context is present.
The answer is still wrong.
Correct retrieval +
Correct context +
Incorrect interpretation
This is where many “RAG solved hallucinations” narratives start to break down.
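The four categories can be turned into a rough triage step for an evaluation set. The sketch below assumes each test case has been labeled by hand with the chunk IDs needed to answer it and with whether the answer was correct. `classify_failure` and the `noisy_above` threshold are hypothetical, and the split between dilution and reasoning failure is only a heuristic based on how many chunks were retrieved.

```ruby
# Heuristic triage of one evaluated question.
#   needed_ids:     chunk IDs that together contain the answer (hand-labeled)
#   retrieved_ids:  chunk IDs the retriever actually returned
#   answer_correct: whether the final answer was judged correct
#   noisy_above:    retrieved-chunk count above which dilution is suspected
def classify_failure(needed_ids:, retrieved_ids:, answer_correct:, noisy_above: 8)
  return :ok if answer_correct
  found = needed_ids & retrieved_ids
  return :retrieval_failure if found.empty?
  return :chunking_failure if found.size < needed_ids.size
  return :context_dilution if retrieved_ids.size > noisy_above
  :reasoning_failure
end

cases = [
  { needed_ids: %w[a b], retrieved_ids: %w[x y],   answer_correct: false },
  { needed_ids: %w[a b], retrieved_ids: %w[a x],   answer_correct: false },
  { needed_ids: %w[a],   retrieved_ids: %w[a] + ("c1".."c12").to_a, answer_correct: false },
  { needed_ids: %w[a b], retrieved_ids: %w[a b c], answer_correct: false }
]

labels = cases.map { |c| classify_failure(**c) }
puts labels.inspect
```

Counting the labels across a test set shows where effort is best spent. A set dominated by `:reasoning_failure` will not improve much from better embeddings.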
The Hidden Conflict: Context vs Memory
In a RAG system, the model draws on two knowledge sources.
Parametric Memory
Knowledge learned during training.
Retrieved Context
Knowledge provided at runtime.
Ideally, retrieved context wins.
In practice, that doesn’t always happen.
Suppose the model learned something years ago from training data.
Now you provide newer documentation that says something slightly different.
The model must reconcile two competing sources of truth.
Sometimes it chooses correctly.
Sometimes it doesn’t.
This is one reason why hallucinations can survive even when documentation is available.
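One way to tilt that reconciliation toward the retrieved context is to state the precedence rule in the prompt itself. The sketch below is one possible wording, not a tested recipe: `build_grounded_prompt` is a hypothetical helper, the instruction text and the `NOT IN DOCS` sentinel are assumptions, and the example excerpts are invented. Instructions like these nudge the model; they do not force it.

```ruby
# Build a prompt that numbers each excerpt and tells the model that the
# excerpts take precedence over anything it remembers from training.
def build_grounded_prompt(question, chunks)
  numbered = chunks.each_with_index.map { |text, i| "[#{i + 1}] #{text}" }
  <<~PROMPT
    Answer using only the numbered documentation excerpts below.
    If the excerpts disagree with what you remember, the excerpts win.
    Cite the excerpt number after each claim, for example [1].
    If the excerpts do not contain the answer, reply exactly: NOT IN DOCS

    #{numbered.join("\n")}

    Question: #{question}
  PROMPT
end

# Invented excerpts for illustration only.
chunks = [
  "The quality argument ranges from 0 to 100.",
  "Passing nil for quality selects the default."
]
prompt = build_grounded_prompt("What range does quality accept?", chunks)
puts prompt
```

Numbering the excerpts also makes the answer checkable afterwards, since every cited number can be traced back to a specific chunk.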
What RAG Actually Fixes
After extensive testing, I no longer view RAG as a hallucination-removal system.
I view it as a hallucination-reduction system.
Without RAG:
Hallucination: 100% invented
With RAG:
Hallucination: Partially grounded in real documentation
That’s still a significant improvement.
The model becomes far more useful.
The error rate drops.
Accuracy improves.
But the system remains probabilistic.
The Question We Should Be Asking
Most discussions focus on:
How do we improve retrieval?
That’s important.
But I increasingly think the more interesting question is:
How do we detect when a model ignored correct evidence?
That leads to a completely different set of techniques:
- Citation requirements
- Answer verification
- Self-evaluation
- Groundedness checks
- RAG evaluation frameworks
- Secondary validation passes
At that point, you’re no longer building a chatbot.
You’re building a knowledge system.
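Two of the techniques listed above, citation requirements and groundedness checks, can be approximated without a second model call. The sketch below flags code-like identifiers in an answer that never appear in the retrieved chunks, and citations that point at chunks that were not supplied. `code_identifiers` and `grounding_report` are hypothetical helpers, the identifier regex is a rough assumption about what API names look like, and the example chunk and answer are invented.

```ruby
# Names that look like code: snake_case words, or lowercase names followed by "(".
def code_identifiers(text)
  text.scan(/\b[a-z]+(?:_[a-z0-9]+)+\b|\b[a-z_][a-z0-9_]*(?=\()/).uniq
end

# Compare an answer against the chunks that were actually in the prompt.
def grounding_report(answer, chunks)
  context = chunks.join("\n")
  unsupported = code_identifiers(answer).reject { |name| context.include?(name) }
  cited = answer.scan(/\[(\d+)\]/).flatten.map(&:to_i).uniq
  invalid = cited.reject { |n| n.between?(1, chunks.size) }
  {
    unsupported_identifiers: unsupported,
    invalid_citations: invalid,
    uncited: cited.empty?,
    grounded: unsupported.empty? && invalid.empty? && !cited.empty?
  }
end

# Invented documentation chunk and model answer.
chunks = ["draw_box(x, y, width, height) draws an outlined rectangle."]
answer = "Call draw_box(x, y, width, height), or use fill_rounded_box " \
         "for filled corners [1][3]."

report = grounding_report(answer, chunks)
puts report.inspect
```

A check like this catches invented method names and dangling citations. It cannot catch an answer that uses only real names and still misreads the documentation, so passing it is evidence of grounding, not proof of correctness.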
Final Thoughts
RAG is one of the most important techniques in modern AI engineering.
It dramatically improves accuracy.
It reduces hallucinations.
It allows models to work with private and domain-specific knowledge.
But one misconception continues to appear in blog posts, tutorials, and conference talks:
“If the right context is present, the model will answer correctly.”
My experiments suggest otherwise.
The right context is necessary.
It is not sufficient.
A language model can read the correct documentation and still produce the wrong answer.
Understanding that distinction changed how I evaluate AI systems.
And perhaps more importantly, it changed the question I ask when something goes wrong.
Instead of asking:
Why didn’t the model find the answer?
I now ask:
Why didn’t the model use the answer it already had?
Have you encountered similar behavior in your RAG systems? I’d love to hear your experiences, especially from teams building AI-powered tools in Ruby.
