Retrieval-Augmented Generation, demystified
Retrieval-Augmented Generation, or RAG, is the technique that lets an LLM answer questions about documents it was never trained on. It pairs a search system with the language model: retrieve the relevant passages, give them to the model as context, then generate the answer. RAG is what powers most enterprise AI assistants in 2026, and it is the main practical answer to LLM hallucination.
Retrieve first, then generate
A pure LLM answers from its training data alone. It cannot know about your company's internal wiki, today's news, or a PDF you just uploaded.
RAG fixes this with a two-step pipeline. Step one: a retrieval system (usually vector search over an embedding index) finds the documents most relevant to the user's question. Step two: those documents are inserted into the LLM's prompt as context, and the model generates an answer grounded in them.
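To make the two steps concrete, here is a minimal sketch in Python. It assumes the sentence-transformers library for embeddings and keeps the whole index in memory; the documents are invented examples, and answer_with_llm is a hypothetical placeholder for whichever chat API you actually call.

```python
# A minimal RAG sketch: embed, retrieve by cosine similarity, then prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Refunds are processed within 14 days of the return request.",
    "The Pro plan includes 5 seats and priority support.",
    "Our office is closed on public holidays.",
]

# Step one: index the documents (normalised, so dot product = cosine similarity).
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# Step two: ground the prompt in the retrieved passages.
question = "How long do refunds take?"
context = "\n".join(f"- {p}" for p in retrieve(question))
prompt = (
    f"Answer using only the context below.\n\nContext:\n{context}\n\n"
    f"Question: {question}"
)
# answer = answer_with_llm(prompt)  # hypothetical: call your LLM of choice here
print(prompt)
```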
Why RAG beats fine-tuning for most use cases
Fine-tuning bakes new knowledge into the model weights. It is expensive, slow to update, and tends to degrade general capability if done badly.
RAG keeps the model frozen and changes only what goes into the prompt. New documents are indexed in seconds. Outdated documents are removed instantly. The model can cite which passages it used. For knowledge that changes (policies, prices, news, product docs) RAG is almost always the right architecture.
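To see why updates are so cheap, continue the in-memory sketch above: adding or retiring knowledge is just an edit to the index, with no training run involved. The new document below is invented for illustration.

```python
# Knowledge update under RAG: edit the index, not the model weights.
# Continues the sketch above (model, docs, doc_vecs already defined).
new_doc = "From March, the Pro plan includes 10 seats."
docs.append(new_doc)
doc_vecs = np.vstack([doc_vecs, model.encode([new_doc], normalize_embeddings=True)])
# Retiring a stale chunk is the reverse: drop its row from docs and doc_vecs.
```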
Embeddings, vector store, retriever, generator
An embedding model converts each document chunk into a high-dimensional vector that captures meaning. A vector store (Pinecone, Weaviate, pgvector, Qdrant) holds millions of these vectors and supports fast nearest-neighbour search.
The retriever embeds the user's question, finds the closest document chunks, and passes them to the generator (the LLM). Better retrievers use hybrid search (vector + keyword), reranking, and query rewriting. The quality of the retrieval often matters more than which LLM does the generation.
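A hedged sketch of one such improvement, hybrid scoring, again continuing the example above: blend the vector similarity with a crude keyword-overlap score. A production system would use BM25 for the keyword side and often a cross-encoder reranker on top; the 0.7/0.3 weights here are illustrative, not tuned.

```python
# Hybrid retrieval sketch: combine vector similarity with keyword overlap.
# Reuses model, docs and doc_vecs from the earlier sketch.
def keyword_score(question: str, doc: str) -> float:
    q_terms = set(question.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_retrieve(question: str, k: int = 2) -> list[str]:
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scored = []
    for i, doc in enumerate(docs):
        # Illustrative weights: favour the vector score, let keywords break ties.
        score = 0.7 * float(doc_vecs[i] @ q_vec) + 0.3 * keyword_score(question, doc)
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]
```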
Where RAG fits with Namulai's eight models
Perplexity inside Namulai is a hosted RAG system: it retrieves from the live web and grounds answers in cited sources. For most general questions, that is the easiest way to use RAG without building anything.
For private documents, you can paste excerpts directly into a Claude or Gemini prompt: with 200k to 2M token context windows, that is effectively manual RAG and works surprisingly well for one-off questions.
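If you go the manual route, the prompt layout is most of the trick. Here is a sketch of one reasonable template; the wording is ours, not a convention of any particular model.

```python
# Manual RAG: paste the excerpts yourself and ask for grounded, cited answers.
# The template wording is illustrative; excerpts is a list of pasted passages.
def manual_rag_prompt(excerpts: list[str], question: str) -> str:
    sources = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(excerpts))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number, and say so if they do not contain the answer.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```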
Try sourced answers with Perplexity in Namulai
Try Namulai free · 30-day free trial · €19.80/month after · cancel anytime