How RAG Actually Works

ChatGPT doesn't know your internal docs. The usual fix is to retrieve the relevant company documents at query time and inject them into the prompt. That is Retrieval-Augmented Generation (RAG). The idea is simple, but its production form is a layered system: embedding, vector DB, chunking, hybrid search, re-ranking, and evaluation. This guide covers each layer.

RAG Flow

Phase 1: Indexing (offline, re-run when documents change)
  documents → chunking → embedding → store in vector DB

Phase 2: Retrieval + Generation (online, per query)
  user question
       ↓
  embed question → vector DB search → top-k similar chunks
       ↓
  context = retrieved chunks
       ↓
  LLM prompt = "context: {chunks}\nquestion: {q}\nanswer:"
       ↓
  LLM response
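
The online phase above can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive implementation: `retrieve` and `call_llm` are hypothetical callables standing in for a vector DB search and an LLM API, and only `build_prompt` follows the prompt template shown in the flow.

```python
from typing import Callable, List


def build_prompt(chunks: List[str], question: str) -> str:
    """Fill the prompt template from the flow above with retrieved chunks."""
    context = "\n\n".join(chunks)
    return f"context: {context}\nquestion: {question}\nanswer:"


def answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical: (question, k) -> chunks
    call_llm: Callable[[str], str],             # hypothetical: prompt -> completion
    k: int = 5,
) -> str:
    """Online phase: retrieve top-k chunks, build the prompt, call the LLM."""
    chunks = retrieve(question, k)
    prompt = build_prompt(chunks, question)
    return call_llm(prompt)


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a vector DB or an LLM API.
    docs = ["Refunds are accepted within 30 days.", "Shipping takes 3-5 days."]

    def fake_retrieve(question: str, k: int) -> List[str]:
        return docs[:k]

    def fake_llm(prompt: str) -> str:
        return f"(an LLM would answer from a {len(prompt)}-character prompt)"

    print(answer("What is the refund window?", fake_retrieve, fake_llm, k=2))
```

Passing the retriever and the LLM in as callables keeps the flow testable and makes it easy to swap either layer later.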

Embedding — A Vector of Meaning

"sneaker" vs "running shoe" — semantically similar, lexically different
   → String matching can't find them

Embedding models:
  embed("sneaker")     = [0.12, -0.34, 0.78, ...]  (768 dims)
  embed("running shoe") = [0.11, -0.35, 0.79, ...]
  embed("airplane")    = [-0.89, 0.45, -0.12, ...]

Cosine similarity:
  sim(sneaker, running shoe) = 0.95  (semantically close)
  sim(sneaker, airplane)     = 0.10  (different meaning)

→ This makes "semantic search" possible: matching by meaning rather than by exact words.
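
Cosine similarity itself is a short formula: the dot product divided by the product of the vector lengths. The sketch below applies it to the three components shown above. Because those are truncated toy vectors, the results will not reproduce the illustrative 0.95 and 0.10 figures; only the ordering (close vs. far) carries over.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dim vectors: only the first three components listed above.
sneaker = [0.12, -0.34, 0.78]
running_shoe = [0.11, -0.35, 0.79]
airplane = [-0.89, 0.45, -0.12]

print(cosine_similarity(sneaker, running_shoe))  # close to 1
print(cosine_similarity(sneaker, airplane))      # negative for these toy vectors
```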

Embedding model choices:
- OpenAI text-embedding-3-small: 1536 dims, cheap, strong English
- OpenAI text-embedding-3-large: 3072 dims, more accurate
- Cohere embed-multilingual: multi-language (incl. Korean)
- BGE / E5 / GTE (OSS): self-hostable
- Korean-specific: ko-sroberta-multitask

Vector DB — Store + Search Embeddings

Each query needs the top-10 most similar chunks out of millions of 768-dim vectors.
Naive brute force compares the query against every vector: O(N × d) per query, which is too slow at that scale.
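
To make the cost concrete, here is a minimal brute-force search using only the standard library. Every stored vector is scored against the query, which is exactly the O(N × d) work that ANN indexes try to avoid. The function names are illustrative, not from any vector DB.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score all N vectors (d multiplications each), then keep the k best."""
    scored = ((i, cosine(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(brute_force_top_k([1.0, 0.0], vectors, k=2))  # indexes 0 and 2 rank highest
```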

ANN (Approximate Nearest Neighbor) algorithms:
- HNSW (Hierarchical Navigable Small World): graph-based, most popular
- IVF: clustering-based
- LSH: hashing-based
- DiskANN: SSD-based for huge datasets
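
As an illustration of the hashing-based family, the sketch below implements a toy random-hyperplane LSH index. It is a simplified teaching example, not how any particular vector DB implements LSH: each vector gets one bit per random hyperplane, and a query only looks inside its own bucket. All names (`LSHIndex`, `lsh_key`, `make_hyperplanes`) are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random hyperplanes (as normal vectors) through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vec: Sequence[float], planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0 for plane in planes
    )


class LSHIndex:
    """Toy index: vectors with the same bit signature share a bucket."""

    def __init__(self, dim: int, n_bits: int = 8, seed: int = 0) -> None:
        self.planes = make_hyperplanes(dim, n_bits, seed)
        self.buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)

    def add(self, vec_id: int, vec: Sequence[float]) -> None:
        self.buckets[lsh_key(vec, self.planes)].append(vec_id)

    def candidates(self, query: Sequence[float]) -> List[int]:
        """Return only the IDs in the query's bucket; true neighbors may be missed."""
        return list(self.buckets.get(lsh_key(query, self.planes), []))
```

Vectors pointing in similar directions tend to land on the same side of most hyperplanes, so they tend to share a bucket. The "approximate" in ANN shows up here as neighbors that fall just across a hyperplane and are missed.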

Vector DB tools:
- Pinecone — managed, easy + pricey
- Weaviate (OSS) — multi-modal
- Qdrant (OSS) — Rust, fast
- Milvus (OSS) — big scale
- pgvector (Postgres extension) — atop existing PG / small datasets
- ChromaDB (OSS) — embedded, dev-friendly
- Elasticsearch 8+ — full-text + vector hybrid

Small dataset (< 100K vectors): pgvector or ChromaDB suffices
Large dataset (millions+): Pinecone / Qdrant / Milvus

Chunking — The Naive-Split Trap

One embedding for the entire document? Too big, meaning diluted.
One per character? Meaningless.

Chunks of ~500-1500 tokens are typical.

Naive — split by characters:
  chunk[0] = doc[0:1000]
  chunk[1] = doc[1000:2000]
  ...

Problems:
- Splits in the middle of sentences / paragraphs → meaning muddied
- "the answer is" and "yes" land in different chunks → neither chunk alone carries the full meaning at retrieval

Better — semantic-unit chunking:
- Split by paragraph + further split long paragraphs
- Recursive split (large → small separators)
- Sliding window (200 token overlap) — preserve boundary info
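
The recursive and sliding-window strategies above can be sketched as follows. This is a simplified illustration, not the implementation used by any particular library: `recursive_split` measures length in characters, `sliding_window_chunks` expects a pre-tokenized list, and the separator order is an assumption.

```python
from typing import List, Sequence, Tuple


def recursive_split(
    text: str, max_len: int, separators: Tuple[str, ...] = ("\n\n", "\n", ". ", " ")
) -> List[str]:
    """Split on the largest separator first; recurse with smaller ones if needed."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: List[str] = []
        current = ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_len:
                current = candidate  # keep packing parts into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # Part is still too long: retry with the smaller separators.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard character cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def sliding_window_chunks(
    tokens: Sequence[str], size: int = 500, overlap: int = 200
) -> List[List[str]]:
    """Windows of `size` tokens, each sharing `overlap` tokens with the previous."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

The overlap trades index size for safety: a sentence cut at one window boundary still appears whole in the neighboring window.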

Modern — semantic chunking:
- Compare embeddings to detect topic shifts → split there
- e.g. LangChain's SemanticChunker
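
A minimal sketch of the topic-shift idea, under loud assumptions: `bow_embed` is a toy bag-of-words stand-in for a real embedding model, the fixed `threshold` is arbitrary, and none of this mirrors LangChain's SemanticChunker internals. It only shows the core loop of comparing adjacent sentences and splitting where similarity drops.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Mapping


def bow_embed(sentence: str) -> Mapping[str, int]:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def sparse_cosine(a: Mapping[str, int], b: Mapping[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Mapping[str, int]] = bow_embed,
    threshold: float = 0.15,  # assumed cut-off; tune on real data
) -> List[str]:
    """Start a new chunk when adjacent sentences fall below `threshold` similarity."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if sparse_cosine(previous, current) < threshold:
            groups.append([sentence])      # topic shift detected: split here
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
        previous = current
    return [" ".join(group) for group in groups]
```

With a real embedding model, `embed` would return dense vectors and the similarity function would change accordingly; the splitting loop stays the same.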

→ As a rough rule of thumb, chunking strategy accounts for 30%+ of RAG quality. Experiment on your own data.

Hybrid Search — Vector + Keyword

Vector search weakness:
- Weak on exact keyword matches (product SKUs, names)
- Searching "iPhone 15 Pro Max" also retrieves "iPhone 13" chunks

Keyword search (BM25) weakness:
- No synonym handling
- Searching "fast running shoes" does not find documents that only say "sneaker"

Hybrid:
- Combine the BM25 and vector rankings with RRF (Reciprocal Rank Fusion), which uses ranks rather than raw scores
- Or take a weighted sum of normalized scores (0.5 × bm25 + 0.5 × vector)

→ Captures both exact-match accuracy and semantic meaning. The modern production standard.
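
Both fusion options fit in a few lines. In the sketch below, `rrf_fuse` scores each document as the sum of 1 / (k + rank) across rankings, with k = 60 as a commonly used default (an assumption here, not a value from this guide). `weighted_fuse` min-max normalizes each score set first, since raw BM25 and cosine scores live on different scales. Function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different scorers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(
    bm25: Dict[str, float], vector: Dict[str, float], alpha: float = 0.5
) -> List[str]:
    """alpha * normalized bm25 + (1 - alpha) * normalized vector score."""
    b, v = min_max(bm25), min_max(vector)
    combined = {
        doc_id: alpha * b.get(doc_id, 0.0) + (1 - alpha) * v.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


print(rrf_fuse([["a", "b", "c"], ["b", "d", "a"]]))  # ['b', 'a', 'd', 'c']
```

RRF needs no score normalization at all, which is one reason it is a convenient default when the two retrievers come from different systems.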

Tools:
- Elasticsearch / OpenSearch hybrid query
- Weaviate / Pinecone hybrid
- LangChain EnsembleRetriever / LlamaIndex QueryFusionRetriever

Re-Ranking — Boost Precision

Take top-50 from vector / hybrid search → re-rank to top-10.

Why:
- Vector similarity is only a proxy for relevance
- Picking the chunk that best answers the question needs a stricter relevance signal
- A cross-encoder reranker reads the question and chunk together and outputs a direct relevance score

Tools:
- Cohere Rerank — API, strong
- BGE Reranker — OSS
- ColBERT — late interaction, fast

Trade-off:
- retrieve top-50 + rerank top-10 → accuracy ↑, latency +50-200ms
- Standard for accuracy-critical tasks; skip when latency is critical
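
The two-stage pattern can be sketched with a pluggable scoring function. Here `overlap_score` is a toy stand-in for a cross-encoder (it only counts shared words), and `rerank` is a hypothetical helper, not an API from Cohere or BGE. In practice `score_pair` would call a reranker model on each (question, chunk) pair.

```python
from typing import Callable, List, Tuple


def rerank(
    question: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score each (question, chunk) pair jointly and keep the top_n chunks."""
    scored = [(chunk, score_pair(question, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(question: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of question words in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


# Stage 1 would return ~50 candidates from vector or hybrid search.
candidates = [
    "shipping takes five days",
    "refund window is 30 days",
    "our office is in Seoul",
]
for chunk, score in rerank("what is the refund window", candidates, overlap_score, top_n=2):
    print(f"{score:.2f}  {chunk}")
```

The extra latency comes from calling `score_pair` once per candidate, which is why the candidate pool is capped at a few dozen chunks.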

Evaluation — The Hardest Part

"Is this RAG good?" is hard:
- Ground truth (correct answers) often missing
- LLM outputs vary (sampling)
- Human evaluation is expensive

Tools:
- Ragas — LLM-based auto eval (faithfulness, relevance, context recall)
- TruLens — per-stage metrics
- LLM-as-judge — GPT-4 compares two answers

Metrics:
- Context recall: do retrieved chunks contain the answer?
- Context precision: are retrieved chunks low-noise?
- Faithfulness: does the answer stick to facts in context?
- Answer relevance: does the answer address the question?

Practical:
- Curate a 50-100 question golden set, evaluate weekly
- Analyze production query logs + user feedback
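
A golden-set run over the two retrieval metrics can be sketched as below. This is a simplified set-based variant that assumes each golden question is hand-labelled with the IDs of its relevant chunks; LLM-based tools such as Ragas estimate these metrics differently. The `retrieve` callable and the golden-set field names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def context_recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Share of the labelled relevant chunks that retrieval actually returned."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


def context_precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Share of the retrieved chunks that are labelled relevant (less noise = higher)."""
    retrieved_list = list(retrieved)
    if not retrieved_list:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_list if chunk_id in relevant)
    return hits / len(retrieved_list)


def evaluate(
    golden_set: List[dict],
    retrieve: Callable[[str, int], List[str]],  # hypothetical: (question, k) -> chunk IDs
    k: int = 10,
) -> Dict[str, float]:
    """Average both metrics over every question in the golden set."""
    if not golden_set:
        return {"context_recall": 0.0, "context_precision": 0.0}
    recalls, precisions = [], []
    for item in golden_set:
        retrieved = retrieve(item["question"], k)
        relevant = set(item["relevant_ids"])
        recalls.append(context_recall(retrieved, relevant))
        precisions.append(context_precision(retrieved, relevant))
    return {
        "context_recall": sum(recalls) / len(recalls),
        "context_precision": sum(precisions) / len(precisions),
    }
```

Tracking these two numbers per weekly run makes regressions visible when the chunking, embedding model, or k changes.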

RAG vs Fine-tuning

Property	RAG	Fine-tuning
Adding new knowledge	Just add documents (instant)	Retrain
Cost	Retrieval cost + LLM API cost	High training cost + same serving cost
Style change	Weak (relies on prompt)	Strong (shapes LLM output style)
Hallucination	Reduced via context grounding	Still possible
Source citation	Natural (cite retrieved chunks)	Hard

Use RAG for most knowledge tasks and fine-tuning for style or behavior changes. The two can be combined: fine-tune the base model and add RAG for fresh knowledge.

Common Pitfalls

- Naive character splits: meaning units are lost and quality drops. Use recursive or semantic chunking.
- Vector search only: weak on exact keywords. Use hybrid (BM25 + vector).
- top-k too small: k=3 may miss the answer chunk. Use k=10-20 plus reranking.
- No evaluation: "seems to work" is false comfort. Use a golden set plus Ragas-style auto eval.
- Context window overflow: the retrieved chunks together exceed the LLM's context limit. Use a dynamic chunk count or a larger-context model.
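
For the context-overflow pitfall, a dynamic chunk count can be as simple as a token budget. The sketch below keeps chunks in ranked order until the budget is spent. The default `count_tokens` splits on whitespace, which is only a rough stand-in for the model's real tokenizer, and `fit_to_budget` is a hypothetical helper name.

```python
from typing import Callable, List


def fit_to_budget(
    ranked_chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep the highest-ranked chunks that fit within max_tokens in total."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


chunks = ["a b c", "d e", "f g h i"]
print(fit_to_budget(chunks, max_tokens=5))  # ['a b c', 'd e']
```

The budget should leave room for the question, the prompt template, and the model's answer, not just the retrieved context.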

Wrap-up

RAG injects new knowledge via retrieval without updating LLM parameters. It is built from layers: embedding, vector DB, chunking, hybrid search, and reranking, with evaluation on top. It is an alternative to fine-tuning, and the two can be combined.

Practical advice: start small with ChromaDB (in-memory), OpenAI embeddings, recursive chunking, and naive top-k. For production, add hybrid search, reranking, and Ragas eval. This is the standard pattern for most knowledge chatbots and Q&A systems.