
Building a RAG System for Hong Kong Building Regulations

March 17, 2026 · 9 min read
Tags: RAG · AI · PostgreSQL · TypeScript · NLP

Hong Kong’s building regulations are scattered across hundreds of PDFs from multiple government departments. Codes of Practice, Practice Notes, Circular Letters — each with their own numbering schemes, cross-references, and update cycles. If you want to know the fire-resistance rating for a concrete partition, you might need to check three different documents across two departments.

I built Ordinance to solve this. It’s a RAG system that ingests 224 government PDFs into 5,700+ searchable chunks and answers regulatory questions with cited, verifiable responses.

Here’s the architecture and the hard problems I ran into.

System overview

Every query flows through a pipeline that prioritizes correctness over speed. The full path from user input to verified answer:

flowchart TB
  Q[User Query] --> V{Cache Hit?}
  V -->|Exact match| R[Return Cached]
  V -->|Semantic match ≥0.95| R
  V -->|Miss| E[Embed Query]
  E --> P1[Vector Search
pgvector cosine]
  E --> P2[Keyword Search
PostgreSQL FTS]
  E --> P3[Query Expansion
gpt-5-mini]
  E --> P4[Live Gov Web
data.gov.hk]
  P3 --> P5[Expanded Vector Search]
  P3 --> P6[Expanded Keyword Search]
  P1 & P2 --> RRF1[RRF Fusion]
  P5 & P6 --> RRF2[RRF Fusion]
  RRF1 & RRF2 --> MRG[Merge + Deduplicate]
  MRG --> RR[Cohere Rerank v3.5]
  RR --> GEN[GPT-4o Generation
temp=0.1, max 800 tokens]
  P4 --> GEN
  GEN --> CV[Citation Verification]
  CV --> FS[Faithfulness Scoring
gpt-5-mini judge]
  FS --> AL[Audit Log + Cache Write]
  AL --> RES[Response with
citations + scores]

The key insight: retrieval, expansion, and live web search all run in parallel. The expansion path generates 2-3 query variants, each getting their own hybrid search pass. Reciprocal Rank Fusion merges everything into a single ranked list before reranking.

The retrieval problem

The naive approach to RAG — embed everything, cosine similarity, done — breaks down fast with regulatory text. “Minimum corridor width for means of escape” needs to match documents that say “clear width of escape route shall not be less than 1050mm.” The semantic gap between how people ask and how regulations are written is significant.

I ended up with a hybrid strategy that attacks the problem from both directions:

flowchart LR
  Q[Query] --> EMB[OpenAI
text-embedding-3-large
3072 dimensions]
  Q --> TSQ[PostgreSQL
plainto_tsquery]
  EMB --> VS[Vector Search
cosine distance
top 15]
  TSQ --> KS[Keyword Search
ts_rank scoring
top 15]
  VS --> RRF[Reciprocal Rank Fusion
RRF_K = 60]
  KS --> RRF
  RRF --> CR[Cohere Rerank
score threshold 0.1]
  CR --> TOP[Top 5 Chunks]

Vector search handles semantic matching — understanding that “corridor width” and “clear width of escape route” are about the same thing. 3072-dimension embeddings stored in PostgreSQL with pgvector, indexed with HNSW for sub-second lookups across 5,700+ chunks.

Full-text search catches what embeddings miss — specific clause numbers like “Clause 4.3.2,” technical terms like “FRC” or “DUKPT,” and regulatory identifiers. PostgreSQL’s tsvector/tsquery with GIN indexing.

RRF fusion is the glue. For each chunk, the fused score is 1 / (RRF_K + rank_vector) + 1 / (RRF_K + rank_keyword). If a chunk ranks high on both, it dominates. If only on one, it still gets considered. RRF_K=60 dampens the impact of exact rank position — what matters is whether you’re in the top results at all.
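The fusion step is small enough to show directly. Here's a sketch of RRF as described above (names are illustrative; each input list is ranked best-first, ranks are 1-based):

```typescript
// Reciprocal Rank Fusion: sum 1/(K + rank) across every result list a
// chunk appears in, then sort by the fused score.
type Ranked = { chunkId: string };

const RRF_K = 60;

function rrfFuse(resultLists: Ranked[][]): { chunkId: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((item, i) => {
      const rank = i + 1; // 1-based rank within this list
      scores.set(item.chunkId, (scores.get(item.chunkId) ?? 0) + 1 / (RRF_K + rank));
    });
  }
  return [...scores.entries()]
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score);
}
```

A chunk ranked first in both lists scores 2/61 ≈ 0.033, while a chunk ranked second in only one scores 1/62 ≈ 0.016 — appearing in multiple lists dominates exact position, which is the point.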

Cohere reranking is the final filter. The reranker sees full query text and full chunk text (not just embeddings), catching relevance signals that cosine similarity alone misses. Anything below a 0.1 relevance score gets dropped.
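The post-rerank filter is simple once the reranker has scored each chunk. A sketch, assuming a result shape like Cohere's (`index` into the candidate array plus `relevanceScore`) — the API call itself is elided:

```typescript
// Keep only chunks the reranker scored at or above the threshold,
// best-first, capped at topN. The result shape mirrors a typical
// rerank response; the network call is out of scope here.
type RerankResult = { index: number; relevanceScore: number };

function applyRerank<T>(
  chunks: T[],
  results: RerankResult[],
  minScore = 0.1,
  topN = 5
): T[] {
  return results
    .filter((r) => r.relevanceScore >= minScore)
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN)
    .map((r) => chunks[r.index]);
}
```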

Ingestion: from PDF to searchable chunk

Government PDFs have structure — Parts, Sections, Clauses — and that structure carries meaning. A clause that says “the above requirements” is useless without knowing which Part it belongs to.

flowchart TB
  PDF[Fetch PDF from
BD / FSD / EPD] --> HASH[SHA-256 Hash]
  HASH -->|Changed| PARSE[Parse PDF Text]
  HASH -->|Unchanged| SKIP[Skip Document]
  PARSE --> HIER[Extract Hierarchy
Part → Section → Clause]
  HIER --> CHUNK[Structure-Aware Chunking
256-512 tokens
75-token overlap]
  CHUNK --> XREF[Extract Cross-References]
  XREF --> EMB[Batch Embed
100 chunks per call]
  EMB --> FTS[Generate tsvector
for full-text search]
  FTS --> STORE[(PostgreSQL + pgvector)]
  STORE --> OBS[Mark Old Chunks
Obsolete]

The chunking is the trickiest part. A naive token-count split will break a regulation mid-clause, losing the legal context that makes it interpretable. The parser respects section boundaries — it won’t split inside a clause unless the clause itself exceeds the max token limit.
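A stripped-down sketch of that rule — the real parser tracks the full Part → Section → Clause hierarchy, and token counting here is approximated by word count:

```typescript
// Clause-aligned chunking: a clause stays whole unless it exceeds the
// token limit, in which case it is sub-split with a sliding overlap.
const MAX_TOKENS = 512;
const OVERLAP = 75;

function approxTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function chunkClauses(clauses: string[]): string[] {
  const chunks: string[] = [];
  for (const clause of clauses) {
    if (approxTokens(clause) <= MAX_TOKENS) {
      chunks.push(clause); // clause fits: never split it
      continue;
    }
    // Oversized clause: sliding window with 75-token overlap
    const words = clause.split(/\s+/).filter(Boolean);
    for (let start = 0; start < words.length; start += MAX_TOKENS - OVERLAP) {
      chunks.push(words.slice(start, start + MAX_TOKENS).join(' '));
      if (start + MAX_TOKENS >= words.length) break;
    }
  }
  return chunks;
}
```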

Each chunk carries metadata:

  • Section hierarchy as an array: ["Part III", "Section 5", "Clause 5.2"]
  • Cross-references extracted via regex: links to other codes of practice, clause numbers
  • Source metadata: department, document type, version, ordinance number

Change detection uses SHA-256 hashing at fetch time. When the Buildings Department updates a PDF, only the changed document gets re-processed. With 224 documents across 3 departments, this keeps ingestion incremental.
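The check itself is a few lines. A sketch of the fetch-time comparison (function names are illustrative):

```typescript
import { createHash } from 'node:crypto';

// Hash the fetched PDF bytes and compare against the hash stored from
// the last ingestion; only a mismatch (or a brand-new document) triggers
// re-parsing, re-chunking, and re-embedding.
function sha256Hex(data: Buffer | string): string {
  return createHash('sha256').update(data).digest('hex');
}

function needsReingest(pdfBytes: Buffer, lastKnownHash: string | null): boolean {
  return sha256Hex(pdfBytes) !== lastKnownHash;
}
```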

Making generation trustworthy

For regulatory content, hallucination isn’t just annoying — it’s dangerous. Someone might make a compliance decision based on the answer. The generation pipeline has four layers of verification:

flowchart TB
  CTX[Retrieved Chunks
+ Web Sources] --> GEN[GPT-4o Generation
temp=0.1]
  GEN --> ANS[Raw Answer with
Bracketed Citations]
  ANS --> CE[Citation Extraction
regex: brackets → sources]
  CE --> CV{Each Citation
in Context?}
  CV -->|Yes| OK[Verified Citation]
  CV -->|No| PHANTOM[Phantom Citation
Flagged]
  ANS --> UC[Uncited Claim Detection
Keywords: must, shall,
required, minimum]
  UC --> FLAG[Uncited Regulatory
Claims Flagged]
  OK & PHANTOM & FLAG --> FJ[Faithfulness Judge
gpt-5-mini
Score 0-10]
  FJ --> FINAL[Final Response
+ Quality Metadata]

Citation enforcement is baked into the system prompt. Every factual claim must include [Document Name (Dept), Version, Section X.X]. Temperature at 0.1 minimizes creative interpretation — this is regulatory text, not poetry.

Citation verification extracts every bracketed reference and matches it against the chunks that were actually retrieved. If the model cites “Code of Practice for Fire Resisting Construction, Clause 4.3” but that clause wasn’t in the context window, it’s flagged as a phantom citation. This catches the most dangerous form of hallucination — confident, specific, wrong citations.
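The core of that check can be sketched in a few lines — here the match against retrieved chunks is simplified to a substring test on document names, which is an assumption, not the project's exact matching logic:

```typescript
// Extract every bracketed citation from the answer and flag any that
// don't correspond to a document actually present in the context window.
function findPhantomCitations(answer: string, retrievedDocNames: string[]): string[] {
  const citations = [...answer.matchAll(/\[([^\]]+)\]/g)].map((m) => m[1]);
  return citations.filter(
    (c) => !retrievedDocNames.some((doc) => c.toLowerCase().includes(doc.toLowerCase()))
  );
}
```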

Uncited claim detection scans for regulatory language — “must,” “shall,” “required,” “minimum,” “not less than” — and checks whether the surrounding claim has a citation. Regulatory statements without sources get flagged.
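A minimal sketch of that scan, assuming sentence-level granularity (the keyword list comes straight from the description above):

```typescript
// Flag sentences that use regulatory language but carry no bracketed
// citation. Sentence splitting here is naive; the keyword list matches
// the terms described in the text.
const REGULATORY_TERMS = /\b(must|shall|required|minimum|not less than)\b/i;

function findUncitedClaims(answer: string): string[] {
  return answer
    .split(/(?<=[.!?])\s+/)
    .filter((s) => REGULATORY_TERMS.test(s) && !/\[[^\]]+\]/.test(s));
}
```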

Faithfulness scoring uses a separate LLM judge (gpt-5-mini, cheaper than gpt-4o) that reads the retrieved context and the generated answer, then scores 0-10 with reasoning and a list of any unsupported claims. This runs in parallel with the cache write to minimize latency impact.

Two-level caching

The full pipeline takes ~8 seconds. That’s acceptable for a first query, but not for repeated ones. Two cache layers bring repeat queries down to ~15ms — a 500x speedup.

flowchart LR
  Q[Incoming Query] --> N[Normalize
NFKC + lowercase
+ whitespace collapse]
  N --> EC{Exact Cache
TTL: 1 hour}
  EC -->|Hit| R[Return in ~15ms]
  EC -->|Miss| EMB[Embed Query]
  EMB --> SC{Semantic Cache
cosine ≥ 0.95}
  SC -->|Hit| R
  SC -->|Miss| PIPE[Full Pipeline
~8 seconds]
  PIPE --> W[Write to Both Caches]
  W --> R2[Return Result]

The exact cache normalizes queries with NFKC unicode normalization, lowercasing, and whitespace collapse before lookup. Simple and fast.
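The whole normalization step fits in one function:

```typescript
// Exact-cache key: NFKC unicode normalization, lowercasing, and
// whitespace collapse, in that order.
function normalizeQuery(q: string): string {
  return q.normalize('NFKC').toLowerCase().replace(/\s+/g, ' ').trim();
}
```

NFKC folds compatibility forms (ligatures, fullwidth characters) into their canonical equivalents, so visually identical queries hash to the same cache key.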

The semantic cache is more interesting. It stores the query embedding alongside the cached result. When a new query arrives, its embedding is compared against all cached embeddings using cosine similarity. If any cached query scores 0.95 or above, that cached answer is returned. This handles rephrasings — “what’s the minimum stair width?” and “how wide must stairs be?” resolve to the same cached answer.

The 0.95 threshold is deliberately conservative. For regulatory content, a subtle difference in phrasing can mean a completely different requirement. Better to re-run the pipeline than serve a wrong cached answer.
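The lookup reduces to a cosine comparison over cached embeddings — a sketch, with the linear scan standing in for whatever index the real system uses:

```typescript
// Cosine similarity over raw embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the best cached result at or above the threshold, else null
// (a miss falls through to the full pipeline).
function semanticCacheHit<T>(
  queryEmb: number[],
  cache: { embedding: number[]; result: T }[],
  threshold = 0.95
): T | null {
  let best: { sim: number; result: T } | null = null;
  for (const entry of cache) {
    const sim = cosineSimilarity(queryEmb, entry.embedding);
    if (sim >= threshold && (!best || sim > best.sim)) {
      best = { sim, result: entry.result };
    }
  }
  return best ? best.result : null;
}
```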

Data model

The core schema that makes everything work:

erDiagram
  regulation_chunks {
      uuid id PK
      text content
      vector embedding "3072 dimensions"
      tsvector search_vector "generated"
      text source_department
      text document_type
      text document_name
      text version
      text[] section_hierarchy
      text[] cross_references
      boolean is_current
  }
  document_versions {
      uuid id PK
      text document_name
      text sha256_hash
      timestamp ingested_at
  }
  query_cache {
      uuid id PK
      text normalized_query
      vector query_embedding
      jsonb result
      timestamp expires_at
  }
  query_audit_log {
      uuid id PK
      text query
      float faithfulness_score
      jsonb citations
      integer latency_ms
      float cost_usd
  }
  regulation_chunks ||--o{ document_versions : "versioned by"
  query_cache ||--o{ query_audit_log : "logged in"

Key indexing decisions:

  • HNSW on embedding for approximate nearest neighbor search — trades perfect recall for 10x speed
  • GIN on search_vector for fast full-text search
  • B-tree on source_department, document_type, is_current for filtered queries
  • search_vector is a GENERATED ALWAYS column — PostgreSQL keeps it in sync with content automatically
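Expressed as migration strings, the index decisions above look roughly like this. One caveat worth flagging as an assumption: pgvector's HNSW index has historically capped the plain `vector` type at 2000 dimensions, so 3072-dim columns are often indexed through a `halfvec` cast — the exact form below is a sketch, not the project's actual migration:

```typescript
// Index DDL sketch for the schema above (table and index names assumed).
const indexMigrations = [
  // ANN search; halfvec cast works around the 2000-dim HNSW limit on `vector`
  `CREATE INDEX idx_chunks_embedding ON regulation_chunks
     USING hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops)`,
  // Full-text search over the generated tsvector column
  `CREATE INDEX idx_chunks_fts ON regulation_chunks USING gin (search_vector)`,
  // Composite B-tree for department/type/currency filters
  `CREATE INDEX idx_chunks_filter ON regulation_chunks
     (source_department, document_type, is_current)`,
];
```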

Query expansion

Sometimes a single query doesn’t capture all relevant documents. “Fire safety requirements for kitchens” might miss chunks that use “cooking facilities” or “commercial food preparation areas.”

flowchart TB
  Q[Original Query] --> EXP[gpt-5-mini
Generate 2-3 variants]
  EXP --> Q1[fire safety cooking facilities]
  EXP --> Q2[commercial food preparation
fire requirements]
  EXP --> Q3[kitchen fire protection
building code]
  Q --> S0[Hybrid Search]
  Q1 --> S1[Hybrid Search]
  Q2 --> S2[Hybrid Search]
  Q3 --> S3[Hybrid Search]
  S0 --> RRF[RRF Fusion
across all result sets]
  S1 --> RRF
  S2 --> RRF
  S3 --> RRF

All searches run in parallel. The expanded queries use a smaller model (gpt-5-mini) to keep costs down — expansion is about breadth, not depth. RRF fusion handles deduplication naturally; chunks that appear in multiple result sets get boosted.
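The fan-out-and-merge can be sketched like this, with `hybridSearch` standing in for the vector+keyword+RRF pass described earlier (all names illustrative):

```typescript
// Run the original query plus each expansion variant through hybrid
// search in parallel, then collapse duplicate chunk ids across result sets.
type Chunk = { id: string };

async function expandedSearch(
  query: string,
  variants: string[],
  hybridSearch: (q: string) => Promise<Chunk[]>
): Promise<Chunk[]> {
  const resultSets = await Promise.all([query, ...variants].map(hybridSearch));
  const seen = new Map<string, Chunk>();
  for (const set of resultSets) {
    for (const chunk of set) {
      if (!seen.has(chunk.id)) seen.set(chunk.id, chunk);
    }
  }
  return [...seen.values()];
}
```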

A separate path probes live government websites concurrently — CSV datasets from data.gov.hk for fire door ratings, glazing specifications, and MiC system approvals. These get appended to the streaming response after the main answer, so they don’t block time-to-first-byte.

What I’d do differently

Evaluation first. I built the eval suite after most of the pipeline was in place. In hindsight, having a gold-standard set of question-answer pairs from day one would have made every architectural decision empirical instead of vibes-based.

Smaller embeddings initially. 3072-dim embeddings are expensive to store and slow to compare. I’d start with 1536-dim and only upgrade if retrieval quality demanded it — the HNSW index alone is significantly larger at 3072.

Chunk size tuning per document type. The 256-512 token range was chosen based on general RAG guidance. Tabular data (fire rating tables, material specifications) wants smaller chunks. Narrative explanations want larger ones. Per-document-type tuning would likely improve retrieval precision.

Streaming-first from the start. The non-streaming path was built first, then streaming was added. Building streaming-first would have simplified the architecture — the non-streaming path is just streaming with buffering.

Stack

TypeScript end-to-end. Express 5 for the API. PostgreSQL with pgvector for storage and hybrid retrieval. OpenAI for embeddings and generation. Cohere for reranking. Zod for runtime validation across every boundary. Deployed on Railway with auto-migrations on boot.

The live system is at ordinance.maniksoin.com.