Building Vietnamese RAG for enterprise: 7 lessons from 50,000 pages
A reranker matters more than a "bigger" embedding model. Chunking by heading beats fixed size. PII redaction belongs at the input pipeline, not in the prompt.
Context
We built a RAG system for legal + CS teams over 50,000+ pages of SOPs, contracts, and manuals. Lookup time dropped from 12 minutes to 40 seconds. Here are the 7 lessons.
1. Reranker > "bigger" embedding
We compared bge-large-en with bge-m3 and saw barely any difference on a 200-question Vietnamese legal eval set. The pivot was adding BGE Reranker to cut the top 20 retrieved chunks down to the top 3: precision jumped from 0.71 to 0.89. Cross-encoder reranking is cheap because it only scores the top-K candidates, yet it helped far more than scaling the encoder.
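The top-20 to top-3 step can be sketched as follows. This is a minimal illustration, not our production code: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy word-overlap stand-in for a real cross-encoder such as BGE Reranker, which would score each (query, passage) pair jointly.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Score every (query, passage) pair and keep the best `keep` passages.

    `candidates` is the first-stage result list (e.g. the top 20 from the
    retriever). Ties keep their original retrieval order.
    """
    scored = [(score(query, passage), idx, passage)
              for idx, passage in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:keep]]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer (assumption): fraction of query words found in the passage.

    In a real pipeline this would be replaced by a cross-encoder model call.
    """
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / (len(query_words) or 1)
```

Because the scorer only runs on the candidates that survived retrieval, its cost scales with K rather than with the size of the corpus.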
2. Heading-based chunking beats fixed size
Legal docs are structured: § 1.1, § 2.3, etc. Fixed 800-token chunks cut clauses in the middle. Switching to heading-based chunking (one chunk per logical section, with a 200-token overlap) lifted recall by 18%.
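A rough sketch of heading-based chunking, under a few assumptions: headings look like "§ 1.1" at the start of a line, tokens are approximated by whitespace-separated words, and the overlap is taken from the tail of the previous section. `chunk_by_heading` is an illustrative name; a real pipeline would use the embedding model's tokenizer and whatever heading patterns the documents actually contain.

```python
import re

# Assumed heading pattern: "§ 1", "§ 1.1", "§ 2.3.4", ... at the start of a line.
HEADING = re.compile(r"^§\s*\d+(?:\.\d+)*\b.*$", re.MULTILINE)


def chunk_by_heading(text: str, overlap_tokens: int = 200) -> list[str]:
    """Split `text` into one chunk per section, each prefixed with the
    last `overlap_tokens` words of the previous section."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    sections = [s for s in sections if s]

    chunks = []
    for i, section in enumerate(sections):
        if i > 0 and overlap_tokens > 0:
            tail = sections[i - 1].split()[-overlap_tokens:]
            chunks.append(" ".join(tail) + "\n" + section)
        else:
            chunks.append(section)
    return chunks
```

Each clause stays whole inside its own chunk, and the overlap carries a little context across section boundaries.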
3. PII redaction must live in the input pipeline
Do not rely on a system prompt instruction such as "do not leak PII". Redact chunk content before indexing (regex + NER for ID numbers, phones, emails), so the PII never reaches the index or the model.
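The regex half of that redaction step might look like the sketch below. The patterns are assumptions for illustration: 10-digit Vietnamese mobile numbers starting with 0 or +84, and 9- or 12-digit national ID numbers. The NER pass for names and addresses is not shown, and `redact` is a hypothetical helper name.

```python
import re

# Order matters: emails first (they may contain digits), then phones, then IDs.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # Assumed phone format: 0 or +84 followed by exactly 9 digits.
    ("PHONE", re.compile(r"(?<!\d)(?:\+84|0)\d{9}(?!\d)")),
    # Assumed ID format: a standalone run of exactly 12 or 9 digits.
    ("ID", re.compile(r"(?<!\d)(?:\d{12}|\d{9})(?!\d)")),
]


def redact(text: str) -> str:
    """Replace each matched PII span with a bracketed label, e.g. [EMAIL]."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running this on chunk text before embedding means the vector store only ever holds the redacted version.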
4. Hybrid search > vector-only
Vector retrieval handles semantic queries well but is weak on exact codes, numbers, and keywords (contract IDs, SOP codes). Adding BM25/full-text search and merging the two result lists with reciprocal rank fusion (RRF) covers both.
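RRF itself is only a few lines. The sketch below assumes each retriever returns a ranked list of document IDs and uses the commonly cited constant k = 60; the function name `rrf_fuse` and the sample IDs are made up for the example.

```python
from typing import Optional, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc.

    Ranks are 1-based. Ties are broken by document ID so the output is
    deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


# Example: one list from vector search, one from BM25.
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF only looks at ranks, so the vector similarity scores and BM25 scores never need to be put on the same scale.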
5. Cite source verbatim
Rule: responses always quote the source verbatim and link to it. Do not let the LLM paraphrase. Legal needs the original wording.
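One way to enforce the rule is a check after generation that every quoted passage really appears in the retrieved source. The sketch below is an assumption about how such a check could work, not a description of a specific library: it normalizes Unicode to NFC (Vietnamese diacritics can be stored in composed or decomposed form) and collapses whitespace before comparing. `is_verbatim` is a hypothetical helper.

```python
import re
import unicodedata


def _normalize(text: str) -> str:
    """NFC-normalize and collapse runs of whitespace to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, source_text: str) -> bool:
    """True if `quote` occurs in `source_text` apart from whitespace and
    Unicode normalization differences. Empty quotes never pass."""
    normalized_quote = _normalize(quote)
    return bool(normalized_quote) and normalized_quote in _normalize(source_text)
```

A response whose quote fails this check can be rejected or regenerated instead of being shown to the user.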
6. Eval set in production from day one
Build 200 labeled questions from historical tickets. Every pipeline PR runs the eval and fails if F1 drops by 2 points or more.
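The gate can be as simple as the sketch below. It assumes F1 is reported on a 0 to 1 scale, that "2 points" means an absolute drop of 0.02 against the baseline from the main branch, and that true/false positive counts come from comparing answers against the labeled set. The names `f1_score` and `passes_gate` are illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_gate(
    baseline_f1: float,
    candidate_f1: float,
    max_drop_points: float = 2.0,
) -> bool:
    """Return False when the candidate loses `max_drop_points` or more.

    One point is 0.01 of F1. Rounding guards against floating-point noise
    right at the threshold.
    """
    drop_points = round((baseline_f1 - candidate_f1) * 100, 6)
    return drop_points < max_drop_points
```

In CI, the job would compute the candidate F1 on the 200 questions and exit non-zero when `passes_gate` returns False.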
7. On-prem for sensitive docs
Self-hosting Qdrant + bge-m3 + a local LLM (Llama 3.1 70B 4-bit) covers 80% of use cases. Only complex queries escalate to GPT-4, and only with redacted data.