Building Vietnamese RAG for enterprise: 7 lessons from 50,000 pages

A reranker matters more than a "bigger" embedding model. Chunking by heading beats fixed size. PII redaction belongs at the input pipeline, not in the prompt.

Author: Nhật Anh · Published: Apr 15, 2026 · 2 min read · Tags: AI, RAG, Production, Vietnamese

Context

We built a RAG system for the legal and CS teams over 50,000+ pages of SOPs, contracts, and manuals. Average lookup time dropped from 12 minutes to 40 seconds. Here are the 7 lessons we took away.

1. Reranker > "bigger" embedding

We compared bge-large-en and bge-m3 as embedding models; on a 200-question Vietnamese legal eval set they scored almost the same. The real pivot was adding BGE Reranker to narrow the top-20 retrieved candidates down to the top 3: precision jumped from 0.71 to 0.89. Cross-encoder reranking is cheap because it only rescores the top-K candidates, yet it helped far more than scaling the encoder.
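The retrieve-then-rerank step can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer is only a stand-in for a real cross-encoder such as BGE Reranker, which would score each (query, passage) pair jointly.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rescore first-stage candidates (e.g. top-20) and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Highest score first; sort is stable, so ties keep first-stage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in doc.

    In a real setup this would be replaced by a cross-encoder model call.
    """
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    top20 = [
        "payment terms and invoice schedule",
        "notice period for termination of contract",
        "termination clause and notice period",
    ]
    for doc, score in rerank("termination notice period", top20, toy_overlap_score):
        print(f"{score:.2f}  {doc}")
```

Because the reranker only ever sees the short candidate list, its cost scales with K rather than with the size of the corpus.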

2. Heading-based chunking beats fixed size

Legal docs are structured: § 1.1, § 2.3, etc. Fixed 800-token chunks cut clauses apart. Switching to heading-based chunking (one chunk per logical section, plus a 200-token overlap) lifted recall by 18%.
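A minimal sketch of heading-based chunking, under several assumptions: headings start a line with "§" followed by a dotted number, "tokens" are approximated by whitespace-separated words rather than real tokenizer output, and the overlap is taken from the end of the previous section. The function name `chunk_by_heading` and the chunk layout are illustrative only.

```python
import re
from typing import Dict, List

# Assumed heading shape: a line beginning with "§ 1", "§ 1.1", "§ 2.3.4", ...
HEADING_RE = re.compile(r"^§\s*\d+(?:\.\d+)*\b.*$", re.MULTILINE)


def chunk_by_heading(text: str, overlap_tokens: int = 200) -> List[Dict[str, str]]:
    """Emit one chunk per section, carrying the tail of the previous section.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING_RE.finditer(text))
    chunks: List[Dict[str, str]] = []
    prev_tokens: List[str] = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        # Guard overlap_tokens == 0, since list[-0:] would return everything.
        overlap = " ".join(prev_tokens[-overlap_tokens:]) if overlap_tokens > 0 else ""
        chunks.append(
            {"heading": match.group(0).strip(), "overlap": overlap, "body": body}
        )
        prev_tokens = body.split()
    return chunks


if __name__ == "__main__":
    sample = (
        "§ 1.1 Scope\nThis agreement covers services.\n"
        "§ 2.3 Termination\nEither party may terminate.\n"
    )
    for chunk in chunk_by_heading(sample, overlap_tokens=2):
        print(chunk)
```

Keeping the heading as a separate field makes it easy to show the clause number next to a quoted passage later on.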

3. PII redaction must live in the input pipeline

Do not rely on a system-prompt instruction such as "do not leak PII". Redact chunk content before indexing, using regex plus NER for ID numbers, phone numbers, and emails.
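The regex half of that redaction step might look like the sketch below. The patterns are assumptions chosen for illustration (10-digit phone numbers starting with 0 or +84, 9- or 12-digit ID numbers, a simple email shape) and would need tuning against real data; the NER pass is not shown.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII inventory).
# Order matters: more specific patterns run before the generic digit run.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"(?<!\d)(?:\+84|0)\d{9}(?!\d)")),
    ("ID_NUMBER", re.compile(r"(?<!\d)(?:\d{12}|\d{9})(?!\d)")),
]


def redact(text: str) -> str:
    """Replace each matched span with a typed placeholder before indexing."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    raw = "Contact an.nguyen@example.com or 0912345678, ID 012345678901."
    print(redact(raw))
```

Running redaction before the chunk reaches the index means the retriever and the LLM never see the raw values, so nothing depends on the model following an instruction.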

4. Hybrid search > vector-only

Vector retrieval handles semantic queries well but is weak on exact codes, numbers, and keywords (contract IDs, SOP codes). Adding BM25/full-text search and merging the two result lists with RRF (reciprocal rank fusion) covers both.
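Reciprocal rank fusion itself is only a few lines. In this sketch each retriever contributes 1 / (k + rank) per document, with k = 60 as the commonly used default; the function name `rrf_fuse` and the document IDs are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic retriever output
    bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword retriever output
    for doc_id, score in rrf_fuse([vector_hits, bm25_hits]):
        print(f"{doc_id}  {score:.5f}")
```

RRF uses only ranks, so the vector similarity scores and BM25 scores never need to be put on a common scale.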

5. Cite source verbatim

Rule: every response quotes the source verbatim and links to it. Do not let the LLM paraphrase; the legal team needs the original wording.
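One way to enforce the verbatim rule is a post-generation check that every quoted passage really appears in the chunk it cites. The sketch below assumes the response has already been parsed into quotes with a `source_id`; that structure, the function names, and the whitespace normalization are all assumptions made for illustration.

```python
from typing import Dict, List


def normalize_ws(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not cause false failures."""
    return " ".join(text.split())


def find_unverified_quotes(
    quotes: List[Dict[str, str]], sources: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return quotes whose text is not found verbatim in the cited source."""
    failures = []
    for quote in quotes:
        source_text = sources.get(quote["source_id"])
        quoted = normalize_ws(quote["text"])
        # Empty quotes and unknown source IDs count as failures.
        if not quoted or source_text is None or quoted not in normalize_ws(source_text):
            failures.append(quote)
    return failures


if __name__ == "__main__":
    sources = {"sop-12": "Either party may terminate\nwith 30 days written notice."}
    quotes = [
        {"source_id": "sop-12", "text": "terminate with 30 days written notice"},
        {"source_id": "sop-12", "text": "terminate with one month notice"},
    ]
    print(find_unverified_quotes(quotes, sources))
```

A response with any unverified quote can then be rejected or regenerated instead of being shown to the user.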

6. Eval set in production from day one

Build 200 labeled questions from historical tickets. Every pipeline PR runs the eval and fails if F1 drops by 2 points or more.
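A CI gate for that rule could be sketched as below. It assumes F1 is reported on a 0 to 1 scale and that "2 points" means 2 percentage points (0.02); the function names and the way the baseline is obtained are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_gate(
    baseline_f1: float, candidate_f1: float, max_drop_points: float = 2.0
) -> bool:
    """Pass only if the candidate loses fewer than max_drop_points of F1."""
    # Rounding avoids spurious failures from floating-point noise.
    drop_points = round((baseline_f1 - candidate_f1) * 100, 6)
    return drop_points < max_drop_points


if __name__ == "__main__":
    baseline = 0.85                             # F1 of the main branch (assumed)
    candidate = f1_score(tp=160, fp=30, fn=40)  # F1 of the PR on the eval set
    if not passes_gate(baseline, candidate):
        raise SystemExit(f"Eval gate failed: F1 {candidate:.3f} vs baseline {baseline:.3f}")
    print(f"Eval gate passed: F1 {candidate:.3f}")
```

Exiting with a non-zero status is what lets a CI job mark the PR as failed.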

7. On-prem for sensitive docs

Self-hosting Qdrant + bge-m3 + a local LLM (Llama 3.1 70B 4-bit) covers 80% of use cases. Only complex queries escalate to GPT-4 with redacted data.
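The escalation path can be sketched as a small router. Everything here is hypothetical: the article does not say how a query is judged "complex", so `is_complex`, the two model callables, and `redact` are injected as placeholders. The one property the sketch does encode is that anything leaving the premises is redacted first.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str, List[str]], str]


def route_query(
    query: str,
    context: List[str],
    is_complex: Callable[[str], bool],
    local_llm: LLM,
    remote_llm: LLM,
    redact: Callable[[str], str],
) -> Tuple[str, str]:
    """Answer locally by default; escalate complex queries with redacted data."""
    if not is_complex(query):
        return "local", local_llm(query, context)
    # Redact both the query and the retrieved chunks before the external call.
    safe_context = [redact(chunk) for chunk in context]
    return "remote", remote_llm(redact(query), safe_context)


if __name__ == "__main__":
    # Stub callables standing in for real models and a real redactor.
    route, reply = route_query(
        "Compare clause 4 across both contracts for customer 0912345678",
        ["Customer phone: 0912345678"],
        is_complex=lambda q: "compare" in q.lower(),
        local_llm=lambda q, ctx: "local answer",
        remote_llm=lambda q, ctx: f"remote saw: {q} | {ctx}",
        redact=lambda s: s.replace("0912345678", "[PHONE]"),
    )
    print(route, reply)
```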

Overall results


Metric	Before	After
Avg time-to-answer	12 min	40s
Right-version answer	64%	96%
Tickets about errors	—	−43%