
RAG Fundamentals

Act 3 · ~5 min

Theory

RAG adds a retrieval step between a user's query and the LLM's response. Instead of answering from training memory alone, the model is supplied with relevant documents pulled from an external store.

Query (user question) → Embed (vector) → Retrieve (top-k chunks) → Augment (prompt + context) → Generate (response)

The query is embedded, retrieved chunks are injected into the prompt, then the LLM generates the answer.
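
The same pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not a production recipe: embed() is a toy hashing stand-in for a real embedding model, the in-memory list stands in for a vector store, and the final LLM call is left as a comment. Only the shape of the embed → retrieve → augment flow is the point.

```python
# Minimal RAG pipeline sketch (illustrative only).
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for an embedding model: hash characters into a fixed-size,
    # L2-normalized vector so the sketch runs without external services.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def retrieve(query_vec: np.ndarray, index: list, k: int = 3) -> list[str]:
    # Compare the query embedding to every indexed chunk; return the top-k.
    scored = [(float(vec @ query_vec), chunk) for chunk, vec in index]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

def augment(query: str, chunks: list[str]) -> str:
    # Inject retrieved chunks into the prompt as context.
    context = "\n\n".join(chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Index: precomputed (chunk, embedding) pairs standing in for a vector store.
docs = [
    "RAG injects retrieved chunks into the prompt before generation.",
    "Fine-tuning adapts style and format, not fast-changing facts.",
    "Retrieval quality bounds the quality of the grounded answer.",
]
index = [(d, embed(d)) for d in docs]

query = "How does RAG ground the model's answer?"
prompt = augment(query, retrieve(embed(query), index, k=2))
print(prompt)  # Generate step: a real system would send this prompt to the LLM.
```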

The three steps:

Retrieve: the query embedding is compared to indexed chunks; the top-k are returned.
Augment: the retrieved chunks are injected into the prompt as context.
Generate: the LLM is steered to answer from the supplied context, leaning less on memory.
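
In practice the augment and generate steps are often expressed as a chat-style message list. The sketch below uses the common system/user role convention purely as an illustration; build_messages is a hypothetical helper, not part of any particular SDK, and the steering instruction in the system message is one example of how to bias the model toward the supplied context.

```python
# One common shape for the augment + generate steps: a chat-style message list.
def build_messages(question: str, chunks: list[str]) -> list[dict]:
    # Number the chunks so the answer can cite its sources.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system",
         "content": ("Answer only from the provided context. "
                     "If the context does not contain the answer, say so.")},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages(
    "What does the retrieve step return?",
    ["Retrieve compares the query embedding to indexed chunks and returns the top-k."],
)
# These messages would be passed to whatever chat-completion call the stack uses.
```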

RAG vs. fine-tuning for facts: RAG is updatable without retraining, source-attributable, and cheaper. Fine-tuning suits style or format adaptation, not rapidly changing facts.

Limits: retrieval quality is the binding constraint — irrelevant or empty results propagate into ungrounded generation. The context window caps how many chunks can be injected, and each request adds retrieval latency.
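
A rough way to respect the context-window cap is to add retrieved chunks, best first, until a token budget runs out. The sketch below is an assumption-laden illustration: fit_to_budget is a hypothetical helper, and the four-characters-per-token estimate stands in for a real tokenizer.

```python
# Keep the highest-scoring chunks that fit within a rough token budget.
def fit_to_budget(chunks: list[str], max_context_tokens: int = 2000) -> list[str]:
    kept, used = [], 0
    for chunk in chunks:  # assumed already sorted by retrieval score
        est_tokens = len(chunk) // 4 + 1  # crude estimate, not a real tokenizer
        if used + est_tokens > max_context_tokens:
            break
        kept.append(chunk)
        used += est_tokens
    return kept
```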

RAG reduces but does not eliminate hallucination. Next: chunking and vector search — the variables that determine retrieval quality.