Zero-Backend Hybrid Search: Running BM25 + Semantic Search in the Browser
One of the biggest pain points of a static blog is search. No backend server means no Elasticsearch, no database queries, not even a simple full-text search API. Most people either integrate a third-party service like Algolia or simply give up on search altogether.
I chose a third path: move the entire search engine into the browser. Not a simple string match, but a full BM25 + semantic expansion hybrid search system with cross-lingual Chinese-English retrieval, all with zero backend dependencies.
This post is a complete technical summary of the system.
Architecture Overview
The system has two halves: build time (Node.js) and runtime (browser).
Build Time (node) Runtime (browser)
┌──────────────────┐ ┌───────────────────────────┐
│ Markdown/Notebook │ │ P0: Inverted index + meta │ → Instant keyword search
│ Photo albums + AI │ build.sh │ P1: Keyword vectors (KVEC)│ → Semantic expansion ready
│ ─────────────────→│ ──────────→ │ P2: ONNX model │ → Full semantic search
│ index-builder │ └───────────────────────────┘
│ vector-builder │ ↓
│ image-tagger │ BM25 + Semantic → Composite Score Fusion
└──────────────────┘
Key design principle: progressive loading. Users can search with keywords the instant the page opens (P0). Semantic search loads in the background — it enhances the experience but is never required.
Build Time: From Content to Index
Tokenizer: Chinese Unigram + Bigram
The foundation of any search system is tokenization. English naturally splits on spaces; Chinese has no such luxury. The common approach is a segmentation library like jieba, but that adds build-time and runtime dependencies.
I used a lighter approach: Chinese character unigrams + bigrams.
function tokenize(text) {
  // STOPWORDS: a Set of common stop terms, defined alongside this function
  const tokens = text.toLowerCase().match(/[\u4e00-\u9fff]+|[a-z0-9]+/g) || [];
  const result = [];
  for (const token of tokens) {
    if (/[\u4e00-\u9fff]/.test(token)) {
      // Chinese: each character + adjacent pairs
      for (let i = 0; i < token.length; i++) {
        result.push(token[i]);
        if (i < token.length - 1)
          result.push(token.slice(i, i + 2));
      }
    } else if (token.length >= 2) {
      result.push(token);
    }
  }
  return result.filter(t => !STOPWORDS.has(t)
    && (t.length >= 2 || /[\u4e00-\u9fff]/.test(t)));
}
Example: "搜索引擎" → ["搜", "搜索", "索", "索引", "引", "引擎", "擎"]
Benefits:
- Zero dependencies: no segmentation dictionary needed
- High recall: bigrams cover most common two-character words ("搜索" and "引擎" both match)
- Single-character queries work: unigrams ensure "树" (tree) or "花" (flower) return results
- Build/runtime consistency: same tokenizer in both environments
The trade-off is lower precision ("索引" and "引擎" both match on "引"), but BM25's IDF weighting naturally suppresses high-frequency generic tokens.
Inverted Index: Compact v2 Format
The index builder scans all Markdown posts, Jupyter notebooks, and photo albums, producing three files:
| File | Content | Size |
|---|---|---|
| search-inverted.json | Compact inverted index | ~2.1 MB |
| search-metadata.json | Article metadata (title, date, excerpt) | ~117 KB |
| search-vocab.json | Vocabulary statistics | ~2.4 MB |
The inverted index uses a compact v2 format, replacing URL strings with numeric IDs:
{
"v": 2,
"docs": ["/blog/posts/2026/...", "/gallery/20210711-Chengdu Panda zoo/", ...],
"avgDL": 250.5,
"N": 282,
"dl": [245, 268, ...],
"idx": {
"search": [[0, 5], [3, 2], [12, 1]],
"cathedral": [[42, 3], [43, 2]]
}
}
[[docNum, tf], ...] replaces [{id: "url", tf: 5}, ...], cutting JSON size by roughly 50%.
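Resolving the compact postings back into usable results is a one-liner per term. A minimal sketch (the field names come from the example above; the `postingsFor` helper name is mine):

```javascript
// Decode the compact v2 postings for one term into { url, tf } pairs,
// resolving numeric doc IDs through the docs array.
function postingsFor(index, term) {
  const entries = index.idx[term] || [];
  return entries.map(([docNum, tf]) => ({
    url: index.docs[docNum],
    tf,
  }));
}

// Example with the structure shown above:
const index = {
  v: 2,
  docs: ['/blog/posts/a/', '/blog/posts/b/'],
  avgDL: 250.5,
  N: 2,
  dl: [245, 268],
  idx: { search: [[0, 5], [1, 2]] },
};
// postingsFor(index, 'search')
// → [{ url: '/blog/posts/a/', tf: 5 }, { url: '/blog/posts/b/', tf: 2 }]
```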
Document Chunking
Long articles are split into ~500-character chunks with 50-character overlap. Split points prefer sentence endings or line breaks. The final index aggregates at the article level — all chunks from one article merge their term frequencies — so BM25 scores reflect whole-article relevance.
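The chunking logic can be sketched roughly as follows (sizes from the text above; the search for a sentence-friendly split point is simplified, and the function name is mine):

```javascript
// Split text into ~500-char chunks with 50-char overlap,
// preferring to break at a sentence end or newline near the limit.
function chunkText(text, size = 500, overlap = 50) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Look backwards within the chunk for a natural break point.
      const slice = text.slice(start, end);
      const breakAt = Math.max(
        slice.lastIndexOf('。'), slice.lastIndexOf('.'), slice.lastIndexOf('\n'));
      if (breakAt > size - overlap) end = start + breakAt + 1;
    }
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    start = end - overlap;  // 50-char overlap with the previous chunk
  }
  return chunks;
}
```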
Bilingual Photo Album Indexing
Photo search is a distinctive feature of this system. Each photo album gets AI-generated bilingual tags (more on this later), then enters the same inverted index as articles:
// Index text = location + description + English tags + Chinese tags + year
const text = [album.location, description, tagsEn, tagsZh, year].join(' ');
// Tags field (for title/tag weight boost)
const tags = `photo gallery ${album.location} ${tagsEn} ${tagsZh}`;
This means searching "熊猫" (panda) directly hits the Chengdu Panda Zoo album, and searching "museum" finds museum photo galleries.
Build Time: Keyword Vectors
Why Pre-Compute Vectors?
The standard approach to semantic search is to embed both the query and every document, then compute cosine similarity. But embedding hundreds of documents on the fly in the browser is far too slow.
A different approach: don't embed documents — embed the vocabulary.
At build time, filter the ~70,000 index terms down to 8,000 most valuable ones and pre-compute their embeddings. At search time, embed the query once, then dot-product against 8,000 pre-computed vectors — pure CPU arithmetic, done in ~50ms.
Vocabulary Filtering Strategy
Not all 70,000 terms can be vectorized; the resulting file would be far too large. The filtering strategy:
function termScore(t) {
  let score = 0;
  if (t.inTitleOrTags) score += 100;                   // curated terms, highest priority
  score += Math.min(t.df, 50) * 2;                     // wider coverage = more matching value
  score += Math.min(t.maxTf, 20);                      // high TF in some docs = meaningful
  if (t.isChinese && t.term.length >= 2) score += 10;  // Chinese two-char words are meaningful
  if (!t.isChinese && t.term.length >= 4) score += 5;  // longer English words more distinctive
  return score;
}
Top 8,000 by score make the cut.
Notably, Chinese bigrams are filtered by default (most are meaningless character pairs like "景的"), but bigrams appearing in titles or tags are preserved — these are curated, meaningful vocabulary like "教堂" (cathedral) and "熊猫" (panda) from photo tags.
multilingual-e5-small: The Cross-Lingual Key
The system originally used BGE-small-zh-v1.5 (512-dim, Chinese-only). It worked well within Chinese but completely failed cross-lingually:
BGE-small-zh:
cosine("教堂", "cathedral") = 0.33 ← far below threshold
cosine("利物浦", "liverpool") = 0.28 ← nearly orthogonal
Switching to multilingual-e5-small (384-dim, 100+ languages):
multilingual-e5-small:
cosine("埃及", "egypt") = 0.917 ✓
cosine("展览", "exhibition") = 0.897 ✓
cosine("雕像", "sculpture") = 0.879 ✓
cosine("博物", "museum") = 0.838 ✓
cosine("熊猫", "panda") = 0.830 ✓
An important e5 convention: queries need a "query: " prefix, while corpus terms don't. Build-time vocabulary embeddings have no prefix; browser search adds the prefix.
Int8 Quantization and KVEC Binary Format
384 dims × 8,000 terms × 4 bytes = 12.3 MB — too large for browsers.
Solution: Int8 quantization. e5 outputs are L2-normalized (range [-1, 1]), so directly scale by 127:
quantized = clamp(round(float × 127), -128, 127)
Precision loss < 0.5%, storage compressed 4×: 12.3 MB → 3.1 MB, ~1.8 MB gzipped.
Binary format (KVEC):
[4B magic "KVEC"]
[4B vocab_size uint32]
[4B dims uint32]
[vocab_size × dims bytes: Int8 vectors, row-major]
[remaining bytes: JSON term array, UTF-8]
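The two numeric steps above can be sketched in a few lines (function names are mine; KVEC header parsing is omitted, and for L2-normalized inputs the Int8 dot product divided by 127² approximates cosine similarity):

```javascript
// Quantize a normalized Float32 vector to Int8 by scaling by 127.
function quantize(vec) {
  const out = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) {
    out[i] = Math.max(-128, Math.min(127, Math.round(vec[i] * 127)));
  }
  return out;
}

// Approximate cosine similarity on two quantized vectors:
// dot product rescaled back from the 127× quantization.
function int8Cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot / (127 * 127);
}
```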
Embedding Cache
Computing 8,000 embeddings takes several minutes. To speed up incremental builds, the vector builder maintains a JSON cache file. Each build only computes new/changed terms, reusing cached results. The cache is pruned after each build to only keep terms in the current vocabulary.
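The cache logic amounts to a diff against the current vocabulary. A rough sketch (names are mine; real embedding is asynchronous, shown here as a plain callback for brevity):

```javascript
// Incremental embedding cache: reuse cached vectors, embed only
// missing terms, then prune entries no longer in the vocabulary.
function buildVectors(terms, cache, embed) {
  const next = {};
  const missing = terms.filter(t => !(t in cache));
  const fresh = embed(missing);                // batch-embed only new terms
  missing.forEach((t, i) => { cache[t] = fresh[i]; });
  for (const t of terms) next[t] = cache[t];   // prune: keep current vocab only
  return next;
}
```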
Build Time: AI Image Auto-Tagging
Why Image Tagging?
Photo album metadata typically only has English location names and dates. Searching "大熊猫" in Chinese won't find "Chengdu Panda zoo", and searching "教堂" won't find "Liverpool Metropolitan Cathedral".
Solution: use multimodal AI at build time to analyze representative album images and generate bilingual tags.
AI Fallback Chain
// Priority: Gemini CLI → OpenAI API → Claude CLI
if (hasGemini()) return tryGemini(image);  // local CLI, fastest
if (hasOpenAI()) return tryOpenAI(image);  // cloud API, most reliable
if (hasClaude()) return tryClaude(image);  // common in dev environments
The prompt requests pure JSON output:
Analyze this photo. Output ONLY a JSON object:
- "en": 5-10 English keyword tags
- "zh": 5-10 Chinese keyword tags
Example: {"en":["cathedral","gothic architecture"], "zh":["教堂","哥特式建筑"]}
Tag Quality
The AI doesn't just mechanically translate — it generates culturally appropriate tags:
// Chengdu Panda Zoo
{"en": ["pandas", "bamboo", "zoo", "wildlife", "natural habitat"],
"zh": ["熊猫", "竹子", "动物园", "野生动物", "自然栖息地"]}
// Shanghai Museum Egypt Exhibition
{"en": ["exhibit", "ancient artifact", "Egyptian history", "museum"],
"zh": ["展览", "古代文物", "埃及历史", "博物馆"]}
Idempotency and Caching
The script checks whether tags.json already exists in each album directory. If present, skip. This means:
- Repeated runs don't waste API calls
- New albums are processed automatically
- You can manually edit tags.json to override AI results
Runtime: Browser-Side Search
Progressive Loading
The key to user experience is never waiting:
| Phase | Loaded | Latency | Capability |
|---|---|---|---|
| P0 | Inverted index + metadata | < 100ms | Full keyword search |
| P1 | Keyword vectors (1.8MB) | 200-500ms | Semantic expansion ready |
| P2 | ONNX model (~20MB) | 2-20s | Full semantic search |
P0 is usable immediately. While the user types, P1/P2 load in the background. If the model isn't ready when the user searches, keyword results show first and semantic results merge in dynamically when available.
BM25 Keyword Search
Standard BM25 with k1=1.2, b=0.75:
score(q, d) = Σ IDF(t) × (tf × (k1+1)) / (tf + k1 × (1-b + b × |d|/avgDL))
A nice touch: prefix matching fallback. When exact matches return nothing, the system tries prefix matching ("water" → "waterfall", "watermelon") at a 0.8× score discount.
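Sketched against the compact index fields shown earlier (N, dl, avgDL, idx), the scoring loop looks roughly like this (function name is mine; the prefix-matching fallback is omitted, and a standard smoothed IDF is assumed):

```javascript
// BM25 over the compact inverted index; k1 = 1.2, b = 0.75.
// Returns [docNum, score] pairs sorted by descending score.
function bm25(index, queryTerms, k1 = 1.2, b = 0.75) {
  const scores = new Map();  // docNum → accumulated score
  for (const term of queryTerms) {
    const postings = index.idx[term];
    if (!postings) continue;
    const df = postings.length;
    const idf = Math.log(1 + (index.N - df + 0.5) / (df + 0.5));
    for (const [docNum, tf] of postings) {
      // Length normalization: |d| / avgDL
      const norm = k1 * (1 - b + b * index.dl[docNum] / index.avgDL);
      const s = idf * (tf * (k1 + 1)) / (tf + norm);
      scores.set(docNum, (scores.get(docNum) || 0) + s);
    }
  }
  return [...scores.entries()].sort((x, y) => y[1] - x[1]);
}
```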
Semantic Expansion: Not Re-Ranking, but Query Augmentation
Semantic search doesn't re-rank BM25 results — it discovers new related terms and runs another BM25 retrieval round with them.
Flow:
- Embed query text with e5 model (with "query: " prefix)
- Cosine similarity against 8,000 pre-computed vocabulary vectors
- Take the top 8 terms above 0.82 similarity as expansion terms
- Run BM25 with expansion terms, but without TF — each term's document contribution is weighted by semantic similarity
Why no TF? Semantic expansion finds related topics, not exact matches. An article mentioning "waterfall" once is equally relevant to the expansion term as one mentioning it ten times.
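The expansion scan (steps 2 and 3 of the flow) is a brute-force dot product over the Int8 matrix. A sketch (names are mine; the query vector is assumed quantized the same way as the vocabulary):

```javascript
// Find expansion terms: vocabulary entries most similar to the
// query embedding, filtered by threshold and capped at topK.
function expandQuery(queryVec, vocabTerms, vocabVecs, dims, topK = 8, threshold = 0.82) {
  const scored = [];
  for (let t = 0; t < vocabTerms.length; t++) {
    let dot = 0;
    const off = t * dims;  // row-major Int8 matrix
    for (let i = 0; i < dims; i++) dot += queryVec[i] * vocabVecs[off + i];
    const sim = dot / (127 * 127);  // rescale Int8 dot product to cosine
    if (sim >= threshold) scored.push([vocabTerms[t], sim]);
  }
  return scored.sort((x, y) => y[1] - x[1]).slice(0, topK);
}
```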
Composite Score Fusion
Final ranking fuses both paths:
finalScore = 0.6 × normalize(BM25_score) + 0.4 × normalize(semantic_score)
Plus two bonus factors:
- Co-occurrence bonus ×1.2: documents appearing in both keyword and semantic results
- Title match bonus ×1.5: query terms found in article title
Both score sets are normalized independently (max → 1.0) to prevent either path from dominating by raw magnitude.
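The fusion described above can be sketched as follows (names are mine; inputs are docId → score maps plus the set of docs with title matches):

```javascript
// Fuse keyword and semantic scores: normalize each set to max = 1,
// weight 0.6/0.4, then apply co-occurrence and title bonuses.
function fuseScores(bm25Scores, semScores, titleHits) {
  const norm = m => {
    const max = Math.max(...m.values(), 1e-9);  // avoid divide-by-zero
    return new Map([...m].map(([k, v]) => [k, v / max]));
  };
  const kw = norm(bm25Scores), sem = norm(semScores);
  const fused = new Map();
  for (const doc of new Set([...kw.keys(), ...sem.keys()])) {
    let score = 0.6 * (kw.get(doc) || 0) + 0.4 * (sem.get(doc) || 0);
    if (kw.has(doc) && sem.has(doc)) score *= 1.2;  // co-occurrence bonus
    if (titleHits.has(doc)) score *= 1.5;           // title match bonus
    fused.set(doc, score);
  }
  return fused;
}
```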
Search Results UI
Results render in two sections:
- Photos: horizontal scrolling gallery strip with cover images, locations, dates
- Articles: vertical list with titles, dates, excerpts
Each result shows its source: blue "keyword" badge for keyword hits, pink "AI" badge for semantic expansion hits. Both badges appear if the result came from both paths.
After semantic search completes, expansion terms are displayed (e.g., "Related: cathedral, church, gothic"), letting users see what the search engine "associated" with their query.
Service Worker: Model Caching
ONNX model files (~20MB) use a cache-first strategy via Service Worker:
const MODEL_PATTERNS = ['/onnx/', 'multilingual-e5', '.onnx', 'tokenizer', '/public/models/'];
// Intercept: cache-first
if (isModelFile(url)) {
  const cached = await caches.match(request);
  if (cached) return cached; // Cache hit
  const response = await fetch(request);
  const cache = await caches.open('models'); // open the model cache (name illustrative)
  cache.put(request, response.clone()); // Cache on first load
  return response;
}
After the first visit, all subsequent loads are local reads. This means semantic search works even offline (provided the model was loaded previously).
Performance Data
Measured on this site (282 articles + 137 photo albums, 71,332 index terms):
| Metric | Value |
|---|---|
| P0 keyword search latency | < 10ms |
| P1 vector loading | ~300ms (1.8MB gzip) |
| P2 model first load | 5-15s (network dependent) |
| P2 model cached load | < 2s |
| Semantic expansion | ~50ms (8,000 dot products) |
| Inverted index size | 2.1 MB (gzip ~400KB) |
| Keyword vectors size | 3.1 MB (gzip ~1.8MB) |
| Result rendering | < 5ms |
Search Examples
Cross-Lingual Search
| Query | Keyword Path | Semantic Path | Result |
|---|---|---|---|
| "熊猫" | Chinese tag hit → Chengdu Panda Zoo | Expands to "panda", "zoo" | Photos + articles |
| "cathedral" | English location hit → Liverpool Cathedral | — | Photo albums |
| "教堂" | No direct hit | Expands to "cathedral", "church" | Finds cathedral photos |
| "museum" | Hits multiple museum albums | Expands to "exhibit", "artifact" | Photos + articles |
Single-Character Search
Single-character Chinese queries like "树" (tree) and "花" (flower) work correctly — the tokenizer preserves meaningful Chinese single characters while still filtering out single English letters and stopwords.
Build Pipeline
The complete build process (build.sh):
1. HEIC → JPG conversion (photo format normalization)
2. Photo compression
3. Album description generation (text AI)
4. Image tag generation (multimodal AI, new)
5. Office/LaTeX document conversion
6. posts.json (blog article index)
7. photos.json (photo album index, with bilingual tags)
8. videos.json (video index)
9. Search inverted index build
10. Keyword vector build
11. Static HTML generation
Both image tagging and vector building have caching mechanisms — incremental builds only process new content.
Architecture Review: Design vs. Implementation
This system was built from a detailed architecture design document. Here's a retrospective comparing the original design against the final implementation.
What Landed as Designed
| Design Goal | Status |
|---|---|
| BM25 as base, semantic layer on top | Exactly as planned |
| Progressive enhancement P0/P1/P2 | Keyword instant, vectors next, model last |
| Int8 quantization + binary format | KVEC format, 384-dim × 8,000 terms, gzip ~1.8MB |
| Word-level semantic routing (not document-level) | Core innovation preserved — embed vocabulary, not documents |
| Dual-path score fusion | 0.6/0.4 weighting with co-occurrence and title bonuses |
| Expansion terms visible in UI | "Related: cathedral, church, gothic" displayed |
| Service Worker model caching | cache-first strategy |
| Photo albums in unified index | AI bilingual tags + unified inverted index |
| Defer CLIP / Defer PDF | LLM API replaced CLIP for tagging; PDF deferred |
| Direct replacement (no feature flag) | Old search system fully replaced |
Intentional Deviations
1. multilingual-e5-small instead of BGE-small-zh-v1.5
The original design specified BGE (512-dim, Chinese-only). Both models are small and fast, but BGE's cross-lingual similarity was unusably low — "教堂" vs "cathedral" scored only 0.33. e5 reaches 0.83+ on the same pair. For a bilingual blog, cross-lingual capability is non-negotiable. The trade-off: e5 requires a "query: " prefix convention that must stay consistent between build and runtime.
2. LLM API instead of CLIP for image tagging
The original design used CLIP with a predefined candidate tag pool and cosine matching. The implementation uses Gemini/OpenAI/Claude multimodal APIs to generate free-text bilingual tags. AI generates culturally appropriate tags (e.g., "自然栖息地" for a panda habitat photo), which CLIP couldn't do from a static pool. The API dependency is build-time only, with caching and idempotency.
3. Unigram + bigram instead of jieba
The original design used jieba/nodejieba for offline segmentation, with known consistency risks against FlexSearch's CJK mode at runtime. The implementation uses character-level unigrams + bigrams in both build and runtime, completely eliminating the segmentation consistency problem. Precision is slightly lower, but BM25 IDF naturally suppresses noise.
4. Score-based vocabulary filtering instead of DF thresholds
The original design used layered filtering (title terms unconditional, DF=1 and TF≤2 filtered, DF>80% filtered). The implementation uses a unified scoring function (title +100, DF/TF weighted, length bonus) and takes the top 8,000. More flexible and easier to tune.
5. Similarity threshold 0.82 instead of 0.55
The original design suggested ≥0.55 with a cap of 8 expansion terms. The implementation uses 0.82. This is because e5's similarity distribution runs higher than BGE's — "熊猫" vs "panda" already scores 0.83. The higher threshold maintains precision. Worth monitoring whether edge cases lose useful expansions.
6. No TF in semantic path
Not explicitly addressed in the original design. The implementation weights semantic expansion results by similarity score only, ignoring term frequency. The reasoning: semantic expansion finds related topics, not exact matches — mentioning "waterfall" once is as relevant as mentioning it ten times for the expansion term.
Remaining Gaps
1. SoA (Struct-of-Arrays) memory layout — The original design emphasized SoA for cache-line optimization. The KVEC format uses row-major (AoS) layout: each term's 384 bytes are stored contiguously. At 8,000 terms (~3MB), the entire dataset fits in L3 cache, so the impact is negligible. If vocabulary grows beyond 40,000, SoA would provide measurable benefits.
2. Interactive expansion term removal — The original design specified that users could click to remove individual expansion terms, triggering a re-search. The current implementation displays expansion terms as static badges with no click handlers. This is a UX feature worth adding — implementation is straightforward: add a click handler that removes the term from the expansion list and re-runs fusion scoring without re-computing embeddings.
Takeaways
This system demonstrates that purely static sites can deliver search experiences rivaling dynamic services.
Core design philosophy:
- Progressive enhancement: keyword search is instant, semantic search enhances gracefully
- Build-time investment, zero runtime cost: AI tagging, vector pre-computation all happen at build time
- Cross-lingual without translation: multilingual embedding model handles semantic bridging natively
- Int8 quantization: optimal balance between precision and file size
- Idempotent builds: caching mechanisms ensure repeated builds don't waste resources
The maintenance cost is near zero — no server, no database, no paid search service. Every git push is a complete search engine update.