Turning a Generic LLM into a Ruby Expert: What RAG Fixed and What It Didn’t

June 4, 2026

A practical look at hallucinations, retrieval, and why having the right documentation is not the same as understanding it.

Over the past few months, I’ve been experimenting with a simple question:

Can a generic LLM become a Ruby expert simply by giving it access to Ruby documentation?

The answer is both yes and no.

Like many developers exploring AI tooling, I built a Retrieval-Augmented Generation (RAG) pipeline using a local vector database and indexed Ruby documentation. The goal was straightforward: reduce hallucinations and improve technical accuracy when answering questions about Ruby libraries and APIs.

The results were fascinating.

The model improved dramatically.

But it didn’t stop hallucinating.

Instead, the hallucinations evolved.

The Assumption Most Developers Make

When first learning about RAG, many developers assume a workflow like this:

Question → Retrieve relevant documentation → Provide context to LLM → Correct answer

It feels logical.

If the correct information is present, the model should produce the correct answer.

Unfortunately, reality is more complicated.

A better representation is:

Question → Retrieve relevant documentation → Provide context to LLM → Model interprets context → Answer

That extra step changes everything.

The model is still generating tokens probabilistically.

Documentation helps.

Documentation does not guarantee understanding.
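
To make the extra step concrete, here is a minimal sketch of the pipeline in Ruby. Everything in it is an assumption for illustration: the `docs` array, the keyword-overlap `retrieve` method, the made-up `Widget` API, and the `generate` lambda that stands in for a real LLM call. The only point is that retrieval and generation are separate stages, and the final answer comes from whatever the generation stage does with the prompt.

```ruby
# Naive retrieval: rank documents by how many question words they contain.
# A real pipeline would query a vector database here instead.
def retrieve(question, docs, limit: 1)
  words = question.downcase.scan(/\w+/)
  docs.max_by(limit) { |doc| (doc[:text].downcase.scan(/\w+/) & words).size }
end

# Build the prompt that the model will actually see.
def build_prompt(question, chunks)
  context = chunks.map { |c| "[#{c[:id]}] #{c[:text]}" }.join("\n")
  "Context:\n#{context}\n\nQuestion: #{question}\nAnswer:"
end

# `generate` is a stand-in for the LLM call. Retrieval can be perfect and
# this step can still produce a wrong answer.
def answer_question(question, docs, generate:)
  chunks = retrieve(question, docs)
  prompt = build_prompt(question, chunks)
  { chunks: chunks, prompt: prompt, answer: generate.call(prompt) }
end

# Hypothetical knowledge base describing an invented Widget class.
docs = [
  { id: "doc-1", text: "Widget#resize takes a width and a height in pixels." },
  { id: "doc-2", text: "Widget#rotate takes an angle in degrees." }
]

# A fake model that ignores the context entirely, to show the failure mode.
careless_model = ->(_prompt) { "Widget#resize takes a scale factor." }

result = answer_question("What does resize take?", docs, generate: careless_model)
puts result[:chunks].map { |c| c[:id] }.inspect
puts result[:answer]
```

Here the right document is retrieved and placed in the prompt, yet the answer contradicts it, because nothing forces the generation stage to use what it was given.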

My Ruby-LibGD Experiment

While testing a knowledge base built around Ruby-LibGD documentation, I noticed something unexpected.

Many answers improved immediately after indexing the project documentation.

Questions that previously generated completely fabricated API calls suddenly became accurate.

However, a smaller category of failures remained.

In those cases:

Question → Correct document retrieved → Correct chunk retrieved → Context provided → Incorrect answer

At first, I assumed the retrieval system was failing.

It wasn’t.

The retrieval pipeline was doing exactly what it was supposed to do.

The problem was happening after retrieval.

Retrieval Is Not Understanding

This is perhaps the most important lesson I learned.

RAG solves a knowledge-access problem.

It does not solve a reasoning problem.

Imagine asking:

What does this parameter represent?

The relevant documentation is retrieved.

The parameter description is present.

The model still has to interpret the text correctly.

Sometimes it succeeds.

Sometimes it combines that information with prior training data.

Sometimes it fills in gaps that don’t actually exist.

The retrieval system did its job.

The language model didn’t necessarily do its job.

Four Types of RAG Failures

After enough testing, most failures seemed to fall into four categories.

1. Retrieval Failure

The simplest case.

The correct document was never retrieved.

Question → Wrong document → Wrong answer

This is what most people think of when discussing RAG quality.

Better embeddings, hybrid search, metadata filtering, and reranking can often reduce these failures.
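
As one illustration of the hybrid search idea, the sketch below blends a vector-similarity score with a keyword-overlap score. The toy embedding vectors, the `alpha` weight, the `resize_widget` identifier, and the method names are all made up for the example. In a real pipeline the embeddings would come from a model, and the keyword side is often BM25 rather than plain overlap.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Fraction of distinct query words that appear in the text.
def keyword_overlap(query, text)
  query_words = query.downcase.scan(/\w+/).uniq
  return 0.0 if query_words.empty?

  text_words = text.downcase.scan(/\w+/)
  (query_words & text_words).size.to_f / query_words.size
end

# Weighted blend: alpha = 1.0 is pure vector search, 0.0 is pure keyword search.
def hybrid_score(query, query_vec, chunk, alpha: 0.5)
  alpha * cosine_similarity(query_vec, chunk[:embedding]) +
    (1 - alpha) * keyword_overlap(query, chunk[:text])
end

# Toy 3-dimensional "embeddings", invented for the example.
chunks = [
  { id: "a", text: "resize_widget changes the size", embedding: [0.8, 0.3, 0.0] },
  { id: "b", text: "general notes about sizing",     embedding: [0.9, 0.1, 0.0] }
]
query = "resize_widget arguments"
query_vec = [1.0, 0.0, 0.0]

vector_only = chunks.sort_by { |c| -hybrid_score(query, query_vec, c, alpha: 1.0) }
hybrid      = chunks.sort_by { |c| -hybrid_score(query, query_vec, c) }

puts "vector only: #{vector_only.map { |c| c[:id] }.inspect}"
puts "hybrid:      #{hybrid.map { |c| c[:id] }.inspect}"
```

The exact identifier match is what lifts chunk "a" in the blended ranking, which is why keyword signals tend to matter for API questions where method names carry most of the meaning.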

2. Chunking Failure

The correct information exists.

But it is split across multiple chunks.

Chunk A
----------------
Part of the explanation
Chunk B
----------------
Remaining explanation

The answer requires both chunks.

The retriever only finds one.

The model then tries to complete the missing information.

Sometimes correctly.

Sometimes not.
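
A frequent mitigation is to let adjacent chunks overlap, so that an explanation straddling a boundary appears whole in at least one chunk. The sketch below splits on words with a fixed window and overlap. The sizes are arbitrary, and real splitters usually work on tokens or on document structure (headings, method entries) instead.

```ruby
# Split text into windows of `size` words, each sharing `overlap` words
# with the previous window.
def chunk_with_overlap(text, size:, overlap:)
  raise ArgumentError, "overlap must be smaller than size" unless overlap < size

  words = text.split
  step = size - overlap
  chunks = []
  start = 0
  while start < words.length
    chunks << words[start, size].join(" ")
    break if start + size >= words.length # the last window reached the end

    start += step
  end
  chunks
end

text = "one two three four five six seven eight nine ten"
chunks = chunk_with_overlap(text, size: 4, overlap: 2)
chunks.each_with_index { |chunk, i| puts "#{i}: #{chunk}" }
```

Overlap trades index size for robustness: every boundary region is stored twice, but a sentence split at "four | five" now appears intact in the second window.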

3. Context Compression Failure

This one surprised me.

Many developers assume:

More context = better answers.

Not always.

If you retrieve twenty partially relevant chunks, the important information can become diluted.

The answer may be buried inside a large amount of surrounding text.

The model sees everything.

But attention is not infinite, and the one chunk that matters has to compete with everything around it.
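
A simple way to act on this is to cap what reaches the prompt: drop chunks below a relevance threshold and keep only the best few. The chunk IDs, scores, threshold, and limit below are placeholder values, and the right numbers depend on the retriever and model in use.

```ruby
# Keep only chunks scoring at least `min_score`, then the top `max_chunks`
# of those, highest score first.
def select_context(scored_chunks, min_score:, max_chunks:)
  scored_chunks
    .select { |chunk| chunk[:score] >= min_score }
    .max_by(max_chunks) { |chunk| chunk[:score] }
end

# Hypothetical retriever output: many partially relevant hits, few strong ones.
retrieved = [
  { id: "intro",      score: 0.41 },
  { id: "resize-api", score: 0.92 },
  { id: "changelog",  score: 0.38 },
  { id: "resize-ex",  score: 0.81 },
  { id: "faq",        score: 0.55 }
]

selected = select_context(retrieved, min_score: 0.5, max_chunks: 2)
puts selected.map { |c| c[:id] }.inspect
```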

4. Reasoning Failure

The most interesting category.

Everything works.

The correct document is found.

The correct chunk is found.

The relevant context is present.

The answer is still wrong.

Correct retrieval
+
Correct context
+
Incorrect interpretation

This is where many “RAG solved hallucinations” narratives start to break down.
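
When building a test set, it helps to label each failure with one of these categories, because the fix differs for each. The sketch below is one hypothetical way to do that for a single test case. It assumes each case records which chunk IDs contain the answer (`needed_ids`), what the retriever returned, and whether the final answer was judged correct. Those field names and the `max_context_chunks` cutoff are inventions for this example, and the size cutoff is only a rough proxy for a diluted context.

```ruby
# Classify one evaluated test case into the failure categories described above.
#   needed_ids:    IDs of the chunks that together contain the answer
#   retrieved_ids: IDs the retriever actually returned
#   correct:       whether the final answer was judged correct
def classify_failure(needed_ids:, retrieved_ids:, correct:, max_context_chunks: 5)
  return :ok if correct

  found = needed_ids & retrieved_ids
  return :retrieval_failure if found.empty?
  return :chunking_failure if found.size < needed_ids.size
  return :context_compression_failure if retrieved_ids.size > max_context_chunks

  # Everything needed was present in a reasonably small context.
  :reasoning_failure
end

cases = [
  { needed_ids: ["a"],      retrieved_ids: ["x", "y"],      correct: false },
  { needed_ids: ["a", "b"], retrieved_ids: ["a"],           correct: false },
  { needed_ids: ["a"],      retrieved_ids: ("a".."t").to_a, correct: false },
  { needed_ids: ["a"],      retrieved_ids: ["a"],           correct: false }
]

labels = cases.map { |c| classify_failure(**c) }
puts labels.inspect
```

Only the last label points at the model rather than the pipeline, which is why separating the categories matters before tuning anything.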

The Hidden Conflict: Context vs Memory

In a RAG setup, a large language model draws on two knowledge sources.

Parametric Memory

Knowledge learned during training.

Retrieved Context

Knowledge provided at runtime.

Ideally, retrieved context wins.

In practice, that doesn’t always happen.

Suppose the model learned something years ago from training data.

Now you provide newer documentation that says something slightly different.

The model must reconcile two competing sources of truth.

Sometimes it chooses correctly.

Sometimes it doesn’t.

This is one reason why hallucinations can survive even when documentation is available.
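
One way to tilt the balance toward retrieved context is to state the priority explicitly in the prompt: answer only from the supplied documentation, and say so when it does not cover the question. The wording below is a hypothetical template, not a tested recipe, and the `Widget` text is invented. It shifts the odds rather than guaranteeing that the context wins.

```ruby
# Build a prompt that tells the model to prefer retrieved documentation
# over anything it remembers from training.
def build_grounded_prompt(question, chunks)
  context = chunks.map { |c| "[#{c[:id]}] #{c[:text]}" }.join("\n")
  <<~PROMPT
    Answer using ONLY the documentation below.
    If the documentation conflicts with what you remember, trust the documentation.
    If the documentation does not contain the answer, reply "Not in the provided docs."
    Cite the chunk ID in square brackets for every claim.

    Documentation:
    #{context}

    Question: #{question}
  PROMPT
end

chunks = [{ id: "doc-7", text: "Widget#resize takes width and height in pixels." }]
prompt = build_grounded_prompt("What arguments does resize take?", chunks)
puts prompt
```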

What RAG Actually Fixes

After extensive testing, I no longer view RAG as a hallucination-removal system.

I view it as a hallucination-reduction system.

Without RAG:

Hallucination:
100% invented

With RAG:

Hallucination:
Partially grounded in real documentation

That’s still a significant improvement.

The model becomes far more useful.

The error rate drops.

Accuracy improves.

But the system remains probabilistic.

The Question We Should Be Asking

Most discussions focus on:

How do we improve retrieval?

That’s important.

But I increasingly think the more interesting question is:

How do we detect when a model ignored correct evidence?

That leads to a completely different set of techniques:

  • Citation requirements
  • Answer verification
  • Self-evaluation
  • Groundedness checks
  • RAG evaluation frameworks
  • Secondary validation passes

At that point, you’re no longer building a chatbot.

You’re building a knowledge system.
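
As a small example of a groundedness check, the sketch below pulls method-like identifiers out of an answer and flags any that never appear in the retrieved context. The regular expression and the `Widget` API are made up for illustration, and a real check would need to handle prose claims as well as identifiers.

```ruby
# Extract identifiers that look like method references, such as
# "Widget#resize" or "widget.resize".
def mentioned_methods(text)
  text.scan(/[#.]([a-z_][a-z0-9_]*[?!]?)/).flatten.uniq
end

# Return the method names used in the answer that the context never mentions.
def ungrounded_methods(answer, context)
  mentioned_methods(answer) - mentioned_methods(context)
end

context = "Widget#resize(width, height) changes the size. Widget#rotate(angle) rotates it."
answer  = "Call widget.resize(100, 50), then widget.crop(10, 10) to trim the edges."

missing = ungrounded_methods(answer, context)
if missing.empty?
  puts "All referenced methods appear in the retrieved context."
else
  puts "Not found in retrieved context: #{missing.join(', ')}"
end
```

A check like this cannot prove an answer is right, but it cheaply catches the fabricated API calls that documentation was supposed to eliminate, and it can trigger a retry or a warning instead of a confident wrong answer.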

Final Thoughts

RAG is one of the most important techniques in modern AI engineering.

It dramatically improves accuracy.

It reduces hallucinations.

It allows models to work with private and domain-specific knowledge.

But one misconception continues to appear in blog posts, tutorials, and conference talks:

“If the right context is present, the model will answer correctly.”

My experiments suggest otherwise.

The right context is necessary.

It is not sufficient.

A language model can read the correct documentation and still produce the wrong answer.

Understanding that distinction changed how I evaluate AI systems.

And perhaps more importantly, it changed the question I ask when something goes wrong.

Instead of asking:

Why didn’t the model find the answer?

I now ask:

Why didn’t the model use the answer it already had?


Have you encountered similar behavior in your RAG systems? I’d love to hear your experiences, especially from teams building AI-powered tools in Ruby.
