Zero-Backend Hybrid Search: Running BM25 + Semantic Search in the Browser
One of the biggest pain points of a static blog is search. No backend server means no Elasticsearch, no database queries, not even a simple full-text search API. Most people either integrate a third-party service like Algolia or simply give up on search altogether.
I chose a third path: move the entire search engine into the browser. Not a simple string match, but a full BM25 + semantic expansion hybrid search system with cross-lingual Chinese-English retrieval, all with zero backend dependencies.
This post is a complete technical summary of the system.
Architecture Overview
The system has two halves: build time (Node.js) and runtime (browser).
Build Time (node) Runtime (browser)
┌──────────────────┐ ┌───────────────────────────┐
│ Markdown/Notebook │ │ P0: Inverted index + meta │ → Instant keyword search
│ Photo albums + AI │ build.sh │ P1: Keyword vectors (KVEC)│ → Semantic expansion ready
│ ─────────────────→│ ──────────→ │ P2: ONNX model │ → Full semantic search
│ index-builder │ └───────────────────────────┘
│ vector-builder │ ↓
│ image-tagger │ BM25 + Semantic → Composite Score Fusion
└──────────────────┘
Key design principle: progressive loading. Users can search with keywords the instant the page opens (P0). Semantic search loads in the background — it enhances the experience but is never required.
Build Time: From Content to Index
Tokenizer: Chinese Unigram + Bigram
The foundation of any search system is tokenization. English naturally splits on spaces; Chinese has no such luxury. The common approach is a segmentation library like jieba, but that adds build-time and runtime dependencies.
I used a lighter approach: Chinese character unigrams + bigrams.
function tokenize(text) {
  // STOPWORDS: a Set of common stop terms, defined alongside this function
  const tokens = text.toLowerCase().match(/[\u4e00-\u9fff]+|[a-z0-9]+/g) || [];
  const result = [];
  for (const token of tokens) {
    if (/[\u4e00-\u9fff]/.test(token)) {
      // Chinese: each character + adjacent pairs
      for (let i = 0; i < token.length; i++) {
        result.push(token[i]);
        if (i < token.length - 1)
          result.push(token.slice(i, i + 2));
      }
    } else if (token.length >= 2) {
      result.push(token);
    }
  }
  return result.filter(t => !STOPWORDS.has(t)
    && (t.length >= 2 || /[\u4e00-\u9fff]/.test(t)));
}
Example: "搜索引擎" → ["搜", "搜索", "索", "索引", "引", "引擎", "擎"]
Benefits:
- Zero dependencies: no segmentation dictionary needed
- High recall: bigrams cover most common two-character words ("搜索" and "引擎" both match)
- Single-character queries work: unigrams ensure "树" (tree) or "花" (flower) return results
- Build/runtime consistency: same tokenizer in both environments
The trade-off is lower precision ("索引" and "引擎" both match on "引"), but BM25's IDF weighting naturally suppresses high-frequency generic tokens.
Inverted Index: Compact v2 Format
The index builder scans all Markdown posts, Jupyter notebooks, and photo albums, producing three files:
| File | Content | Size |
|---|---|---|
| search-inverted.json | Compact inverted index | ~2.1 MB |
| search-metadata.json | Article metadata (title, date, excerpt) | ~117 KB |
| search-vocab.json | Vocabulary statistics | ~2.4 MB |
The inverted index uses a compact v2 format, replacing URL strings with numeric IDs:
{
"v": 2,
"docs": ["/blog/posts/2026/...", "/gallery/20210711-Chengdu Panda zoo/", ...],
"avgDL": 250.5,
"N": 282,
"dl": [245, 268, ...],
"idx": {
"search": [[0, 5], [3, 2], [12, 1]],
"cathedral": [[42, 3], [43, 2]]
}
}
[[docNum, tf], ...] replaces [{id: "url", tf: 5}, ...], cutting JSON size by roughly 50%.
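Resolving the compact postings back into usable results is a one-liner per term. A minimal sketch (the field names come from the example above; the `postingsFor` helper name is mine):

```javascript
// Decode the compact v2 postings for one term into { url, tf } pairs,
// resolving numeric doc IDs through the docs array.
function postingsFor(index, term) {
  const entries = index.idx[term] || [];
  return entries.map(([docNum, tf]) => ({
    url: index.docs[docNum],
    tf,
  }));
}

// Example with the structure shown above:
const index = {
  v: 2,
  docs: ['/blog/posts/a/', '/blog/posts/b/'],
  avgDL: 250.5,
  N: 2,
  dl: [245, 268],
  idx: { search: [[0, 5], [1, 2]] },
};
// postingsFor(index, 'search')
// → [{ url: '/blog/posts/a/', tf: 5 }, { url: '/blog/posts/b/', tf: 2 }]
```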
Document Chunking
Long articles are split into ~500-character chunks with 50-character overlap. Split points prefer sentence endings or line breaks. The final index aggregates at the article level — all chunks from one article merge their term frequencies — so BM25 scores reflect whole-article relevance.
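The chunking logic can be sketched roughly as follows (sizes from the text above; the search for a sentence-friendly split point is simplified, and the function name is mine):

```javascript
// Split text into ~500-char chunks with 50-char overlap,
// preferring to break at a sentence end or newline near the limit.
function chunkText(text, size = 500, overlap = 50) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Look backwards within the chunk for a natural break point.
      const slice = text.slice(start, end);
      const breakAt = Math.max(
        slice.lastIndexOf('。'), slice.lastIndexOf('.'), slice.lastIndexOf('\n'));
      if (breakAt > size - overlap) end = start + breakAt + 1;
    }
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    start = end - overlap;  // 50-char overlap with the previous chunk
  }
  return chunks;
}
```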
Bilingual Photo Album Indexing
Photo search is a distinctive feature of this system. Each photo album gets AI-generated bilingual tags (more on this later), then enters the same inverted index as articles:
// Index text = location + description + English tags + Chinese tags + year
const text = [album.location, description, tagsEn, tagsZh, year].join(' ');
// Tags field (for title/tag weight boost)
const tags = `photo gallery ${album.location} ${tagsEn} ${tagsZh}`;
This means searching "熊猫" (panda) directly hits the Chengdu Panda Zoo album, and searching "museum" finds museum photo galleries.
Build Time: Keyword Vectors
Why Pre-Compute Vectors?
The standard approach to semantic search is to embed both the query and every document, then compute cosine similarity. But embedding hundreds of documents on the fly in the browser is far too slow.
A different approach: don't embed documents — embed the vocabulary.
At build time, filter the ~70,000 index terms down to 8,000 most valuable ones and pre-compute their embeddings. At search time, embed the query once, then dot-product against 8,000 pre-computed vectors — pure CPU arithmetic, done in ~50ms.
Vocabulary Filtering Strategy
Not all 70,000 terms can be vectorized; the resulting file would be far too large. The filtering strategy:
function termScore(t) {
  let score = 0;
  if (t.inTitleOrTags) score += 100;                   // curated terms, highest priority
  score += Math.min(t.df, 50) * 2;                     // wider coverage = more matching value
  score += Math.min(t.maxTf, 20);                      // high TF in some docs = meaningful
  if (t.isChinese && t.term.length >= 2) score += 10;  // Chinese two-char words are meaningful
  if (!t.isChinese && t.term.length >= 4) score += 5;  // longer English words more distinctive
  return score;
}
Top 8,000 by score make the cut.
Notably, Chinese bigrams are filtered by default (most are meaningless character pairs like "景的"), but bigrams appearing in titles or tags are preserved — these are curated, meaningful vocabulary like "教堂" (cathedral) and "熊猫" (panda) from photo tags.
multilingual-e5-small: The Cross-Lingual Key
The system originally used BGE-small-zh-v1.5 (512-dim, Chinese-only). It worked well within Chinese but completely failed cross-lingually:
BGE-small-zh:
cosine("教堂", "cathedral") = 0.33 ← far below threshold
cosine("利物浦", "liverpool") = 0.28 ← nearly orthogonal
Switching to multilingual-e5-small (384-dim, 100+ languages):
multilingual-e5-small:
cosine("埃及", "egypt") = 0.917 ✓
cosine("展览", "exhibition") = 0.897 ✓
cosine("雕像", "sculpture") = 0.879 ✓
cosine("博物", "museum") = 0.838 ✓
cosine("熊猫", "panda") = 0.830 ✓
An important e5 convention: queries need a "query: " prefix, while corpus terms don't. Build-time vocabulary embeddings have no prefix; browser search adds the prefix.
Int8 Quantization and KVEC Binary Format
384 dims × 8,000 terms × 4 bytes = 12.3 MB — too large for browsers.
Solution: Int8 quantization. e5 outputs are L2-normalized (range [-1, 1]), so directly scale by 127:
quantized = clamp(round(float × 127), -128, 127)
Precision loss < 0.5%, storage compressed 4×: 12.3 MB → 3.1 MB, ~1.8 MB gzipped.
Binary format (KVEC):
[4B magic "KVEC"]
[4B vocab_size uint32]
[4B dims uint32]
[vocab_size × dims bytes: Int8 vectors, row-major]
[remaining bytes: JSON term array, UTF-8]
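The two numeric steps above can be sketched in a few lines (function names are mine; KVEC header parsing is omitted, and for L2-normalized inputs the Int8 dot product divided by 127² approximates cosine similarity):

```javascript
// Quantize a normalized Float32 vector to Int8 by scaling by 127.
function quantize(vec) {
  const out = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) {
    out[i] = Math.max(-128, Math.min(127, Math.round(vec[i] * 127)));
  }
  return out;
}

// Approximate cosine similarity on two quantized vectors:
// dot product rescaled back from the 127× quantization.
function int8Cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot / (127 * 127);
}
```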
Embedding Cache
Computing 8,000 embeddings takes several minutes. To speed up incremental builds, the vector builder maintains a JSON cache file. Each build only computes new/changed terms, reusing cached results. The cache is pruned after each build to only keep terms in the current vocabulary.
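The cache logic amounts to a diff against the current vocabulary. A rough sketch (names are mine; real embedding is asynchronous, shown here as a plain callback for brevity):

```javascript
// Incremental embedding cache: reuse cached vectors, embed only
// missing terms, then prune entries no longer in the vocabulary.
function buildVectors(terms, cache, embed) {
  const next = {};
  const missing = terms.filter(t => !(t in cache));
  const fresh = embed(missing);                // batch-embed only new terms
  missing.forEach((t, i) => { cache[t] = fresh[i]; });
  for (const t of terms) next[t] = cache[t];   // prune: keep current vocab only
  return next;
}
```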
Build Time: AI Image Auto-Tagging
Why Image Tagging?
Photo album metadata typically only has English location names and dates. Searching "大熊猫" in Chinese won't find "Chengdu Panda zoo", and searching "教堂" won't find "Liverpool Metropolitan Cathedral".
Solution: use multimodal AI at build time to analyze representative album images and generate bilingual tags.
AI Fallback Chain
// Priority: Gemini CLI → OpenAI API → Claude CLI
if (hasGemini()) return tryGemini(image);  // local CLI, fastest
if (hasOpenAI()) return tryOpenAI(image);  // cloud API, most reliable
if (hasClaude()) return tryClaude(image);  // common in dev environments
The prompt requests pure JSON output:
Analyze this photo. Output ONLY a JSON object:
- "en": 5-10 English keyword tags
- "zh": 5-10 Chinese keyword tags
Example: {"en":["cathedral","gothic architecture"], "zh":["教堂","哥特式建筑"]}
Tag Quality
The AI doesn't just mechanically translate — it generates culturally appropriate tags:
// Chengdu Panda Zoo
{"en": ["pandas", "bamboo", "zoo", "wildlife", "natural habitat"],
"zh": ["熊猫", "竹子", "动物园", "野生动物", "自然栖息地"]}
// Shanghai Museum Egypt Exhibition
{"en": ["exhibit", "ancient artifact", "Egyptian history", "museum"],
"zh": ["展览", "古代文物", "埃及历史", "博物馆"]}
Idempotency and Caching
The script checks whether tags.json already exists in each album directory. If present, skip. This means:
- Repeated runs don't waste API calls
- New albums are processed automatically
- You can manually edit tags.json to override AI results
Runtime: Browser-Side Search
Progressive Loading
The key to user experience is never waiting:
| Phase | Loaded | Latency | Capability |
|---|---|---|---|
| P0 | Inverted index + metadata | < 100ms | Full keyword search |
| P1 | Keyword vectors (1.8MB) | 200-500ms | Semantic expansion ready |
| P2 | ONNX model (~20MB) | 2-20s | Full semantic search |
P0 is usable immediately. While the user types, P1/P2 load in the background. If the model isn't ready when the user searches, keyword results show first and semantic results merge in dynamically when available.
BM25 Keyword Search
Standard BM25 with k1=1.2, b=0.75:
score(q, d) = Σ IDF(t) × (tf × (k1+1)) / (tf + k1 × (1-b + b × |d|/avgDL))
A nice touch: prefix matching fallback. When exact matches return nothing, the system tries prefix matching ("water" → "waterfall", "watermelon") at a 0.8× score discount.
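Sketched against the compact index fields shown earlier (N, dl, avgDL, idx), the scoring loop looks roughly like this (function name is mine; the prefix-matching fallback is omitted, and a standard smoothed IDF is assumed):

```javascript
// BM25 over the compact inverted index; k1 = 1.2, b = 0.75.
// Returns [docNum, score] pairs sorted by descending score.
function bm25(index, queryTerms, k1 = 1.2, b = 0.75) {
  const scores = new Map();  // docNum → accumulated score
  for (const term of queryTerms) {
    const postings = index.idx[term];
    if (!postings) continue;
    const df = postings.length;
    const idf = Math.log(1 + (index.N - df + 0.5) / (df + 0.5));
    for (const [docNum, tf] of postings) {
      // Length normalization: |d| / avgDL
      const norm = k1 * (1 - b + b * index.dl[docNum] / index.avgDL);
      const s = idf * (tf * (k1 + 1)) / (tf + norm);
      scores.set(docNum, (scores.get(docNum) || 0) + s);
    }
  }
  return [...scores.entries()].sort((x, y) => y[1] - x[1]);
}
```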
Semantic Expansion: Not Re-Ranking, but Query Augmentation
Semantic search doesn't re-rank BM25 results — it discovers new related terms and runs another BM25 retrieval round with them.
Flow:
- Embed query text with e5 model (with "query: " prefix)
- Cosine similarity against 8,000 pre-computed vocabulary vectors
- Take the top 8 terms above 0.82 similarity as expansion terms
- Run BM25 with expansion terms, but without TF — each term's document contribution is weighted by semantic similarity
Why no TF? Semantic expansion finds related topics, not exact matches. An article mentioning "waterfall" once is equally relevant to the expansion term as one mentioning it ten times.
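The expansion scan (steps 2 and 3 of the flow) is a brute-force dot product over the Int8 matrix. A sketch (names are mine; the query vector is assumed quantized the same way as the vocabulary):

```javascript
// Find expansion terms: vocabulary entries most similar to the
// query embedding, filtered by threshold and capped at topK.
function expandQuery(queryVec, vocabTerms, vocabVecs, dims, topK = 8, threshold = 0.82) {
  const scored = [];
  for (let t = 0; t < vocabTerms.length; t++) {
    let dot = 0;
    const off = t * dims;  // row-major Int8 matrix
    for (let i = 0; i < dims; i++) dot += queryVec[i] * vocabVecs[off + i];
    const sim = dot / (127 * 127);  // rescale Int8 dot product to cosine
    if (sim >= threshold) scored.push([vocabTerms[t], sim]);
  }
  return scored.sort((x, y) => y[1] - x[1]).slice(0, topK);
}
```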
Composite Score Fusion
Final ranking fuses both paths:
finalScore = 0.6 × normalize(BM25_score) + 0.4 × normalize(semantic_score)
Plus two bonus factors:
- Co-occurrence bonus ×1.2: documents appearing in both keyword and semantic results
- Title match bonus ×1.5: query terms found in article title
Both score sets are normalized independently (max → 1.0) to prevent either path from dominating by raw magnitude.
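The fusion described above can be sketched as follows (names are mine; inputs are docId → score maps plus the set of docs with title matches):

```javascript
// Fuse keyword and semantic scores: normalize each set to max = 1,
// weight 0.6/0.4, then apply co-occurrence and title bonuses.
function fuseScores(bm25Scores, semScores, titleHits) {
  const norm = m => {
    const max = Math.max(...m.values(), 1e-9);  // avoid divide-by-zero
    return new Map([...m].map(([k, v]) => [k, v / max]));
  };
  const kw = norm(bm25Scores), sem = norm(semScores);
  const fused = new Map();
  for (const doc of new Set([...kw.keys(), ...sem.keys()])) {
    let score = 0.6 * (kw.get(doc) || 0) + 0.4 * (sem.get(doc) || 0);
    if (kw.has(doc) && sem.has(doc)) score *= 1.2;  // co-occurrence bonus
    if (titleHits.has(doc)) score *= 1.5;           // title match bonus
    fused.set(doc, score);
  }
  return fused;
}
```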
Search Results UI
Results render in two sections:
- Photos: horizontal scrolling gallery strip with cover images, locations, dates
- Articles: vertical list with titles, dates, excerpts
Each result shows its source: blue "keyword" badge for keyword hits, pink "AI" badge for semantic expansion hits. Both badges appear if the result came from both paths.
After semantic search completes, expansion terms are displayed (e.g., "Related: cathedral, church, gothic"), letting users see what the search engine "associated" with their query.
Service Worker: Model Caching
ONNX model files (~20MB) use a cache-first strategy via Service Worker:
const MODEL_PATTERNS = ['/onnx/', 'multilingual-e5', '.onnx', 'tokenizer', '/public/models/'];
// Intercept: cache-first
if (isModelFile(url)) {
  const cached = await caches.match(request);
  if (cached) return cached; // Cache hit
  const response = await fetch(request);
  const cache = await caches.open('models'); // open the model cache (name illustrative)
  cache.put(request, response.clone()); // Cache on first load
  return response;
}
After the first visit, all subsequent loads are local reads. This means semantic search works even offline (provided the model was loaded previously).
Performance Data
Measured on this site (282 articles + 137 photo albums, 71,332 index terms):
| Metric | Value |
|---|---|
| P0 keyword search latency | < 10ms |
| P1 vector loading | ~300ms (1.8MB gzip) |
| P2 model first load | 5-15s (network dependent) |
| P2 model cached load | < 2s |
| Semantic expansion | ~50ms (8,000 dot products) |
| Inverted index size | 2.1 MB (gzip ~400KB) |
| Keyword vectors size | 3.1 MB (gzip ~1.8MB) |
| Result rendering | < 5ms |
Search Examples
Cross-Lingual Search
| Query | Keyword Path | Semantic Path | Result |
|---|---|---|---|
| "熊猫" | Chinese tag hit → Chengdu Panda Zoo | Expands to "panda", "zoo" | Photos + articles |
| "cathedral" | English location hit → Liverpool Cathedral | — | Photo albums |
| "教堂" | No direct hit | Expands to "cathedral", "church" | Finds cathedral photos |
| "museum" | Hits multiple museum albums | Expands to "exhibit", "artifact" | Photos + articles |
Single-Character Search
Single-character Chinese queries like "树" (tree) and "花" (flower) work correctly — the tokenizer preserves meaningful Chinese single characters while still filtering out single English letters and stopwords.
Build Pipeline
The complete build process (build.sh):
1. HEIC → JPG conversion (photo format normalization)
2. Photo compression
3. Album description generation (text AI)
4. Image tag generation (multimodal AI, new)
5. Office/LaTeX document conversion
6. posts.json (blog article index)
7. photos.json (photo album index, with bilingual tags)
8. videos.json (video index)
9. Search inverted index build
10. Keyword vector build
11. Static HTML generation
Both image tagging and vector building have caching mechanisms — incremental builds only process new content.
Architecture Review: Design vs. Implementation
This system was built from a detailed architecture design document. Here's a retrospective comparing the original design against the final implementation.
What Landed as Designed
| Design Goal | Status |
|---|---|
| BM25 as base, semantic layer on top | Exactly as planned |
| Progressive enhancement P0/P1/P2 | Keyword instant, vectors next, model last |
| Int8 quantization + binary format | KVEC format, 384-dim × 8,000 terms, gzip ~1.8MB |
| Word-level semantic routing (not document-level) | Core innovation preserved — embed vocabulary, not documents |
| Dual-path score fusion | 0.6/0.4 weighting with co-occurrence and title bonuses |
| Expansion terms visible in UI | "Related: cathedral, church, gothic" displayed |
| Service Worker model caching | cache-first strategy |
| Photo albums in unified index | AI bilingual tags + unified inverted index |
| Defer CLIP / Defer PDF | LLM API replaced CLIP for tagging; PDF deferred |
| Direct replacement (no feature flag) | Old search system fully replaced |
Intentional Deviations
1. multilingual-e5-small instead of BGE-small-zh-v1.5
The original design specified BGE (512-dim, Chinese-only). Both models are small and fast, but BGE's cross-lingual similarity was unusably low — "教堂" vs "cathedral" scored only 0.33. e5 reaches 0.83+ on the same pair. For a bilingual blog, cross-lingual capability is non-negotiable. The trade-off: e5 requires a "query: " prefix convention that must stay consistent between build and runtime.
2. LLM API instead of CLIP for image tagging
The original design used CLIP with a predefined candidate tag pool and cosine matching. The implementation uses Gemini/OpenAI/Claude multimodal APIs to generate free-text bilingual tags. AI generates culturally appropriate tags (e.g., "自然栖息地" for a panda habitat photo), which CLIP couldn't do from a static pool. The API dependency is build-time only, with caching and idempotency.
3. Unigram + bigram instead of jieba
The original design used jieba/nodejieba for offline segmentation, with known consistency risks against FlexSearch's CJK mode at runtime. The implementation uses character-level unigrams + bigrams in both build and runtime, completely eliminating the segmentation consistency problem. Precision is slightly lower, but BM25 IDF naturally suppresses noise.
4. Score-based vocabulary filtering instead of DF thresholds
The original design used layered filtering (title terms unconditional, DF=1 and TF≤2 filtered, DF>80% filtered). The implementation uses a unified scoring function (title +100, DF/TF weighted, length bonus) and takes the top 8,000. More flexible and easier to tune.
5. Similarity threshold 0.82 instead of 0.55
The original design suggested ≥0.55 with a cap of 8 expansion terms. The implementation uses 0.82. This is because e5's similarity distribution runs higher than BGE's — "熊猫" vs "panda" already scores 0.83. The higher threshold maintains precision. Worth monitoring whether edge cases lose useful expansions.
6. No TF in semantic path
Not explicitly addressed in the original design. The implementation weights semantic expansion results by similarity score only, ignoring term frequency. The reasoning: semantic expansion finds related topics, not exact matches — mentioning "waterfall" once is as relevant as mentioning it ten times for the expansion term.
Remaining Gaps
1. SoA (Struct-of-Arrays) memory layout — The original design emphasized SoA for cache-line optimization. The KVEC format uses row-major (AoS) layout: each term's 384 bytes are stored contiguously. At 8,000 terms (~3MB), the entire dataset fits in L3 cache, so the impact is negligible. If vocabulary grows beyond 40,000, SoA would provide measurable benefits.
2. Interactive expansion term removal — The original design specified that users could click to remove individual expansion terms, triggering a re-search. The current implementation displays expansion terms as static badges with no click handlers. This is a UX feature worth adding — implementation is straightforward: add a click handler that removes the term from the expansion list and re-runs fusion scoring without re-computing embeddings.
Takeaways
This system demonstrates that purely static sites can deliver search experiences rivaling dynamic services.
Core design philosophy:
- Progressive enhancement: keyword search is instant, semantic search enhances gracefully
- Build-time investment, zero runtime cost: AI tagging, vector pre-computation all happen at build time
- Cross-lingual without translation: multilingual embedding model handles semantic bridging natively
- Int8 quantization: optimal balance between precision and file size
- Idempotent builds: caching mechanisms ensure repeated builds don't waste resources
The maintenance cost is near zero — no server, no database, no paid search service. Every git push is a complete search engine update.