ChatGPT doesn't know your internal docs. The fix: at query time, retrieve the relevant company documents and inject them into the prompt. That's Retrieval-Augmented Generation (RAG). The idea is simple, but its production form is a layered system: embedding, vector DB, chunking, re-ranking, and evaluation. This guide covers each layer.
RAG Flow
Phase 1 — Indexing (offline, once):
documents → chunking → embedding → store in vector DB
Phase 2 — Retrieval + Generation (online, per query):
user question
↓
embed question → vector DB search → top-k similar chunks
↓
context = retrieved chunks
↓
LLM prompt = "context: {chunks}\nquestion: {q}\nanswer:"
↓
LLM response
Embedding: A Vector of Meaning
"sneaker" vs "running shoe" — semantically similar, lexically different
→ String matching can't find them
Embedding models:
embed("sneaker") = [0.12, -0.34, 0.78, ...] (768 dims)
embed("running shoe") = [0.11, -0.35, 0.79, ...]
embed("airplane") = [-0.89, 0.45, -0.12, ...]
Cosine similarity:
sim(sneaker, running shoe) = 0.95 (semantically close)
sim(sneaker, airplane) = 0.10 (different meaning)
→ "Semantic search" possible.
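The similarity measure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the 3-dim vectors are just the first three components shown above, truncated for readability (an assumption for the demo), so the resulting numbers will not match the 0.95 / 0.10 figures, which stand for full-length embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors: truncated to 3 dims purely for illustration.
sneaker = [0.12, -0.34, 0.78]
running_shoe = [0.11, -0.35, 0.79]
airplane = [-0.89, 0.45, -0.12]

print(cosine_similarity(sneaker, running_shoe))  # very close to 1
print(cosine_similarity(sneaker, airplane))      # much lower (negative here)
```

Real embedding APIs often return unit-length vectors, in which case the dot product alone gives the same ranking.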
Embedding model choices:
- OpenAI text-embedding-3-small: 1536 dims, cheap, strong English
- text-embedding-3-large: 3072 dims, more accurate
- Cohere embed-multilingual: multi-language (incl. Korean)
- BGE / E5 / GTE (OSS): self-hostable
- Korean-specific: ko-sroberta-multitask
Vector DB: Store + Search Embeddings
Finding the top-10 most similar chunks among millions of 768-dim vectors by naive brute force costs O(N × d) per query, which is too slow at that scale.
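To make the O(N × d) cost concrete, here is a minimal sketch of the brute-force baseline: every stored vector is scored against the query, and the k best are kept. It assumes the vectors are already unit-length, so the dot product equals cosine similarity; the function names are illustrative.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors: O(d)."""
    return sum(x * y for x, y in zip(a, b))

def brute_force_top_k(query, vectors, k):
    """Exact search: score all N vectors (O(N * d)), return indices of the k best."""
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nlargest(k, scored)]

# Four unit-length 2-dim vectors standing in for chunk embeddings.
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
print(brute_force_top_k([1.0, 0.0], stored, k=2))  # → [0, 2]
```

This is fine for a few thousand chunks. ANN indexes trade a small amount of recall for avoiding the full scan.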
ANN (Approximate Nearest Neighbor) algorithms:
- HNSW (Hierarchical Navigable Small World): graph-based, most popular
- IVF: clustering-based
- LSH: hashing-based
- DiskANN: SSD-based for huge datasets
Vector DB tools:
- Pinecone — managed, easy + pricey
- Weaviate (OSS) — multi-modal
- Qdrant (OSS) — Rust, fast
- Milvus (OSS) — big scale
- pgvector (Postgres extension) — atop existing PG / small datasets
- ChromaDB (OSS) — embedded, dev-friendly
- Elasticsearch 8+ — full-text + vector hybrid
Small dataset (< 100K): pgvector or ChromaDB suffices
Large dataset (millions+): Pinecone / Qdrant / Milvus
Chunking: The Naive-Split Trap
One embedding for the entire document? Too big, meaning diluted.
One per character? Meaningless.
Chunks of ~500-1500 tokens are typical.
Naive — split by characters:
chunk[0] = doc[0:1000]
chunk[1] = doc[1000:2000]
...
Problems:
- Splits in the middle of sentences / paragraphs → meaning muddied
- "the answer is" and "yes" split apart → lose context at retrieval
Better — semantic-unit chunking:
- Split by paragraph + further split long paragraphs
- Recursive split (large → small separators)
- Sliding window (200 token overlap) — preserve boundary info
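The recursive split and the sliding window can be sketched as follows. This is a simplified illustration under two assumptions: length is measured in characters rather than tokens (a real pipeline would count tokens with the model's tokenizer), and the separator list is an arbitrary choice. Note that this simple version drops the separator at chunk boundaries.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split into pieces of at most max_len chars, trying large separators first."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_len, rest)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # keep merging small parts
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # Part is still too long: retry with the smaller separators.
            chunks.extend(recursive_split(part, max_len, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

def sliding_window(tokens, size, overlap):
    """Fixed-size windows; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return windows

doc = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
print(recursive_split(doc, max_len=40))
print(sliding_window(list(range(10)), size=4, overlap=2))
```

With `max_len=40` the document splits cleanly at the paragraph break. Lower limits fall back to sentence and then word boundaries.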
Modern — semantic chunking:
- Compare embeddings to detect topic shifts → split there
- e.g. LangChain's SemanticChunker
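The topic-shift idea can be illustrated with a small sketch: embed each sentence, compare neighbors, and start a new chunk when similarity drops below a threshold. This is not LangChain's implementation. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary assumption chosen for the toy data.

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(word.strip(".,!?").lower() for word in sentence.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev, vec) < threshold:
            chunks.append([])            # topic shift detected: open a new chunk
        chunks[-1].append(sentence)
        prev = vec
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Quarterly revenue grew strongly.",
    "Revenue growth beat forecasts.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

On this toy input the split lands between the cat sentences and the revenue sentences. With a real embedding model the threshold needs tuning on your own documents.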
→ Chunking strategy accounts for 30%+ of RAG quality. Experiment.
Hybrid Search: Vector + Keyword
Vector search weakness:
- Weak on exact keyword matches (product SKUs, names)
- Searching "iPhone 15 Pro Max" also retrieves "iPhone 13" chunks
Keyword search (BM25) weakness:
- No synonym handling
- Searching "fast running shoes" → "sneaker" not found
Hybrid:
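- RRF (Reciprocal Rank Fusion): merge the BM25 and vector result lists by rank, so the two score scales never need to be compared
- Or a weighted sum of normalized scores (0.5 × bm25 + 0.5 × vector)
→ Captures both exact-match precision and semantic meaning. Modern production standard.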
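Reciprocal Rank Fusion is short enough to sketch directly. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k=60 is the commonly used default; the document ids below are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["sku-123", "doc-a", "doc-b"]     # keyword ranking (hypothetical ids)
vector_hits = ["doc-a", "doc-c", "sku-123"]   # vector ranking (hypothetical ids)
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc-a', 'sku-123', 'doc-c', 'doc-b']
```

Because only ranks are used, BM25 scores and cosine similarities never have to be normalized against each other, which is the main reason RRF is a convenient default.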
Tools:
- Elasticsearch / OpenSearch hybrid query
- Weaviate / Pinecone hybrid
- LangChain EnsembleRetriever / LlamaIndex QueryFusionRetriever
Re-Ranking: Boost Precision
Take top-50 from vector / hybrid search → re-rank to top-10.
Why:
- Vector similarity = semantic proxy
- True "best chunk to answer this question" needs different criteria
- Cross-encoder reranker = (question + chunk) input → direct relevance score
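The two-stage pattern can be sketched independently of any particular reranker. In the sketch below, `score_pair` stands for whatever model scores a (question, chunk) pair, which in practice would be a cross-encoder. `toy_overlap_score` is a hypothetical word-overlap stand-in so the example runs without a model.

```python
def retrieve_then_rerank(question, candidates, score_pair, final_k=10):
    """Re-score first-stage candidates (e.g. top-50) and keep the best final_k.

    score_pair: callable (question, chunk) -> relevance score.
    """
    scored = [(score_pair(question, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:final_k]]

def toy_overlap_score(question, chunk):
    """Hypothetical stand-in scorer: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

candidates = [
    "Billing cycles start on the first of the month",
    "To reset my password I open settings",
    "Password rules require twelve characters",
]
top = retrieve_then_rerank("how do I reset my password", candidates,
                           toy_overlap_score, final_k=2)
print(top)
```

The key design point is that the reranker sees the question and the chunk together, so it can be slower and more accurate than the first-stage similarity, at the cost of one scoring call per candidate.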
Tools:
- Cohere Rerank — API, strong
- BGE Reranker — OSS
- ColBERT — late interaction, fast
Trade-off:
- retrieve top-50 + rerank top-10 → accuracy ↑, latency +50-200ms
- Standard for important tasks, skip for latency-critical
Evaluation: The Hardest Part
"Is this RAG good?" is hard:
- Ground truth (correct answers) often missing
- LLM outputs vary (sampling)
- Human evaluation is expensive
Tools:
- Ragas — LLM-based auto eval (faithfulness, relevance, context recall)
- TruLens — per-stage metrics
- LLM-as-judge — GPT-4 compares two answers
Metrics:
- Context recall: do retrieved chunks contain the answer?
- Context precision: are retrieved chunks low-noise?
- Faithfulness: does the answer stick to facts in context?
- Answer relevance: does the answer address the question?
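The two retrieval metrics can be approximated without an LLM when a golden set labels which chunks are relevant to each question. The sketch below uses that simplified, id-based definition, which is an assumption here (tools like Ragas estimate these with an LLM instead). The golden-set entries are hypothetical.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the labeled relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find
    return len(relevant & set(retrieved_ids)) / len(relevant)

def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are labeled relevant (low noise = high)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)

# Hypothetical golden set: retriever output vs. human-labeled relevant chunks.
golden_set = [
    {"question": "What is the refund window?",
     "retrieved": ["c1", "c7", "c9", "c2"], "relevant": ["c1", "c2"]},
    {"question": "Who approves travel expenses?",
     "retrieved": ["c4", "c5"], "relevant": ["c3", "c4"]},
]

recalls = [context_recall(g["retrieved"], g["relevant"]) for g in golden_set]
precisions = [context_precision(g["retrieved"], g["relevant"]) for g in golden_set]
print(sum(recalls) / len(recalls))        # → 0.75
print(sum(precisions) / len(precisions))  # → 0.5
```

Tracking these two numbers per release makes it visible whether a change to chunking or top-k improved retrieval or just moved noise around. Faithfulness and answer relevance need a judge (human or LLM) and are not covered by this sketch.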
Practical:
- Curate a 50-100 question golden set, evaluate weekly
- Analyze production query logs + user feedback
RAG vs Fine-tuning
| Property | RAG | Fine-tuning |
|---|---|---|
| Adding new knowledge | Just add documents (instant) | Retrain |
| Cost | Retrieval cost + LLM API | Training cost ↑ + same serving |
| Style change | Weak (relies on prompt) | Strong (shapes LLM output style) |
| Hallucination | Reduced via context grounding | Still possible |
| Source citation | Natural (cite retrieved chunks) | Hard |
Most knowledge tasks = RAG. Style / behavior changes = fine-tuning. Both can combine (fine-tune base + RAG for fresh knowledge).
Common Pitfalls
- Naive character splits: meaning units are lost and quality drops. Use recursive or semantic chunking.
- Vector search only: weak on exact keywords. Use hybrid (BM25 + vector).
- top-k too small: k=3 may miss the answer chunk. Retrieve k=10-20, then rerank.
- No evaluation: "seems to work" is false comfort. Build a golden set and run Ragas-style auto eval.
- Context window overflow: the retrieved chunks together exceed the LLM context limit. Use a dynamic chunk count or a larger-context model.
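A dynamic chunk count can be as simple as packing ranked chunks until a token budget is reached. The sketch below is one way to do it, with a loud assumption: `estimate_tokens` uses a rough 4-characters-per-token heuristic for English text, and a real system should count with the model's own tokenizer. The prompt template mirrors the one in the RAG flow.

```python
def estimate_tokens(text):
    """Rough assumption: ~4 characters per token. Replace with a real tokenizer."""
    return max(1, len(text) // 4)

def pack_context(chunks, budget_tokens, count_tokens=estimate_tokens):
    """Keep chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

def build_prompt(question, chunks):
    """Assemble the prompt from the packed chunks."""
    context = "\n\n".join(chunks)
    return f"context: {context}\nquestion: {question}\nanswer:"

ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]   # ~10 tokens each under the heuristic
packed = pack_context(ranked_chunks, budget_tokens=25)
print(len(packed))  # → 2
print(build_prompt("What changed?", packed))
```

Stopping at the first chunk that does not fit keeps the best-ranked chunks together. The budget should also leave room for the question, the instructions, and the model's answer.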
Wrap-up
RAG injects new knowledge through retrieval, without updating the LLM's parameters. It is built from layers: embedding, vector DB, chunking, hybrid search, and reranking. It is an alternative to fine-tuning, but the two can be combined.
In practice, start small: ChromaDB (in-memory) + OpenAI embeddings + recursive chunking + naive top-k. For production, add hybrid search, reranking, and Ragas evaluation. This is the standard pattern for most knowledge chatbots and Q&A systems.