Dual-Mode Semantic Search: When "Delete the Code" Became "Keep Both"
Part three of the blog search series. Part one built hybrid search in the browser. Part two designed the migration to Cloudflare Workers. This post covers the implementation.
The plan was to replace the browser-side ONNX approach with Workers. What actually happened: I kept both — Workers as default, browser ONNX switchable via config.
Why Keep Both
The browser approach has clear problems: a 156MB download and crashes on mobile. But the code itself works fine: Int8 quantized vectors, the KVEC binary format, and in-browser model inference all run correctly on desktop. It just shouldn't be the default. Keeping it costs nothing; deleting it is permanent.
The solution is one config field:
{
  "semanticSearchMode": "server",
  "semanticSearchApi": "https://yuxu.ge/api/semantic-search"
}
"server" uses Cloudflare Workers. "browser" uses browser-side ONNX. Default is "server".
Worker Implementation
Added a /api/semantic-search route to the existing Worker:
if (path === "/api/semantic-search") {
  const body = await request.json();
  const query = (body.query || "").trim();
  const topK = Math.min(Math.max(parseInt(body.topK) || 10, 1), 50);

  // 1. Vectorize query → OpenAI (~150ms)
  const embData = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      input: query,
      model: "text-embedding-3-small",
      dimensions: 512,
    }),
  }).then(r => r.json());
  const queryVec = embData.data[0].embedding;

  // 2. Read document vectors from KV (~2ms)
  const embeddings = await env.SEARCH_EMBEDDINGS.get("search_embeddings", { type: "json" });

  // 3. Cosine similarity 203×512 (<1ms)
  const results = embeddings.documents.map(doc => {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < 512; i++) {
      dot += queryVec[i] * doc.embedding[i];
      normA += queryVec[i] * queryVec[i];
      normB += doc.embedding[i] * doc.embedding[i];
    }
    return {
      url: doc.url, title: doc.title,
      score: dot / (Math.sqrt(normA) * Math.sqrt(normB)),
    };
  });

  // 4. Sort, filter, respond as JSON
  results.sort((a, b) => b.score - a.score);
  const top = results.filter(r => r.score > 0.3).slice(0, topK);
  return new Response(JSON.stringify({ results: top }), {
    headers: { "Content-Type": "application/json" },
  });
}
203 documents (148 articles + 55 photos) stored as one KV entry, 1.3MB. Brute-force cosine similarity under 1ms — no need for a vector database. Total latency ~200ms, mostly the OpenAI API call.
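The "no vector database needed" claim is easy to sanity-check. A quick sketch you can run in Node: brute-force cosine over 203 random 512-dim vectors (the sizes from the post; the data here is synthetic, not the real embeddings):

```javascript
// Sanity check: brute-force cosine similarity over 203 x 512 vectors.
// Sizes match the post; vectors are random, not real embeddings.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DIMS = 512, DOCS = 203;
const randVec = () => Float32Array.from({ length: DIMS }, () => Math.random() - 0.5);
const docs = Array.from({ length: DOCS }, randVec);
const query = randVec();

const t0 = performance.now();
const scores = docs.map(d => cosine(query, d));
const t1 = performance.now();
console.log(`scored ${scores.length} docs in ${(t1 - t0).toFixed(2)}ms`);
```

At this corpus size the full scan finishes well inside a millisecond on commodity hardware, so an index structure would only add complexity.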
Build Pipeline
Generating Vectors
embed-builder.mjs calls the OpenAI API for each document. The input is the title + tags + the first 500 characters of content. SHA256-based incremental caching skips unchanged documents.
import crypto from 'node:crypto';

const hash = crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
if (cache[url]?.hash === hash) continue;
First build: 203 docs, batched by 50, cost ~$0.0005. Incremental builds make zero API calls.
Smart Dispatch
build-vectors.mjs picks the right script based on config:
if (mode === 'server') {
  runScript('embed-builder.mjs');    // OpenAI document vectors
  if (!fs.existsSync(browserOutput)) {
    runScript('vector-builder.mjs'); // auto-fill browser vectors
  }
} else {
  runScript('vector-builder.mjs');   // e5-small keyword vectors
  if (!fs.existsSync(serverOutput)) {
    runScript('embed-builder.mjs');
  }
}
npm run build:search handles it regardless of current mode.
Client Changes
SearchClient takes different paths based on mode. Same external interface.
Initialization
async init() {
  const loadVectors = this.enableSemantic && !this.useServerSemantic;
  const [invertedData, metadataData, vectorsData] = await Promise.all([
    this._loadWithCache(this.invertedIndexUrl, 'json'),
    this._loadWithCache(this.metadataUrl, 'json'),
    loadVectors ? this._loadWithCache(this.vectorsUrl, 'arraybuffer') : null,
  ]);
  // server mode stops here;
  // browser mode loads the BGE model in the background
  if (!this.useServerSemantic && this.enableSemantic && this.vectorsReady) {
    this._loadBGEModel();
  }
}
Server mode loads only inverted index and metadata (~2.6MB). Browser mode additionally downloads 3MB vectors + ~150MB ONNX model.
Semantic Search
async semanticSearch(query, limit) {
  if (this.useServerSemantic) {
    return this._serverSemanticSearch(query, limit);
  }
  return this._browserSemanticSearch(query, limit);
}

async _serverSemanticSearch(query, limit) {
  const res = await fetch(this.semanticSearchUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: query.trim(), topK: limit }),
  });
  if (!res.ok) return [];
  const data = await res.json();
  return (data.results || []).map((r, i) => ({
    url: r.url, score: r.score, rank: i + 1, source: 'semantic',
  }));
}
Server mode is one fetch call. Browser mode runs BGE inference + keyword vector expansion.
State
isSemanticReady() {
  if (this.useServerSemantic) return this.enableSemantic;
  return this.modelReady && this.vectorsReady;
}

isModelLoading() {
  if (this.useServerSemantic) return false;
  return this.enableSemantic && this.vectorsReady && !this.modelReady;
}
Hybrid search fusion unchanged: BM25 weight 0.6, semantic weight 0.4, co-occurrence boost 1.2x, title match boost 1.5x.
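The fusion step can be sketched as follows. Only the weights and boosts come from the post; the function shape and field names are illustrative, not the actual SearchClient code:

```javascript
// Hybrid fusion sketch with the weights from the post:
// BM25 0.6, semantic 0.4, co-occurrence boost 1.2x, title match boost 1.5x.
// Input/output shapes are assumptions for illustration.
function fuseScores(bm25, semantic) {
  const fused = new Map();
  for (const r of bm25) {
    fused.set(r.url, { url: r.url, score: 0.6 * r.score, inBm25: true, titleMatch: !!r.titleMatch });
  }
  for (const r of semantic) {
    const entry = fused.get(r.url) || { url: r.url, score: 0, inBm25: false, titleMatch: false };
    entry.score += 0.4 * r.score;
    entry.inSemantic = true;
    fused.set(r.url, entry);
  }
  for (const entry of fused.values()) {
    if (entry.inBm25 && entry.inSemantic) entry.score *= 1.2; // co-occurrence boost
    if (entry.titleMatch) entry.score *= 1.5;                 // title match boost
  }
  return [...fused.values()].sort((a, b) => b.score - a.score);
}

const out = fuseScores(
  [{ url: "/a", score: 1.0, titleMatch: true }, { url: "/b", score: 0.8 }],
  [{ url: "/b", score: 0.9 }]
);
```

In this toy example `/b` wins despite lower individual scores, because appearing in both result lists triggers the co-occurrence boost: 1.2 × (0.6 × 0.8 + 0.4 × 0.9) ≈ 1.008 versus 1.5 × 0.6 × 1.0 = 0.9 for `/a`.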
sidebar.js
const mode = sidebarConfig.semanticSearchMode || 'server';
const enableSemantic = mode === 'server'
  ? sidebarConfig.semanticSearch !== false
  : !isMobileDevice() && sidebarConfig.semanticSearch !== false;

searchClient = new SearchClient({
  enableSemantic,
  semanticSearchMode: mode,
  semanticSearchUrl: sidebarConfig.semanticSearchApi,
});
Server mode skips mobile detection and Service Worker registration. Browser mode keeps original logic.
Comparison
Vectors
| | Server | Browser |
|---|---|---|
| Model | OpenAI text-embedding-3-small | multilingual-e5-small |
| Dimensions | 512 | 384 |
| Granularity | Document-level (203 docs) | Keyword-level (~8,000 terms) |
| Storage | 1.3MB JSON (KV) | 3MB Int8 binary |
| Build cost | ~$0.0005 | $0 |
The two sides take genuinely different approaches, not just different models: the server embeds whole documents, while the browser embeds a vocabulary and uses it to expand queries semantically before the BM25 lookup.
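The browser-side expansion idea, as a sketch: given a query vector and a vocabulary of term vectors, take the nearest terms and feed them into the normal keyword search. The vocabulary, dimensions, and thresholds here are toy values, not the real e5-small output:

```javascript
// Sketch of keyword-level semantic expansion: embed a vocabulary once,
// then expand each query with its nearest terms before BM25 lookup.
// Toy 3-dim "embeddings" below; the real index uses 384-dim e5-small vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function expandQuery(queryVec, vocab, { topN = 3, minScore = 0.5 } = {}) {
  return Object.entries(vocab)
    .map(([term, vec]) => ({ term, score: cosine(queryVec, vec) }))
    .filter(t => t.score > minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(t => t.term);
}

// Terms pointing the same way as the query are "semantically close".
const vocab = {
  serverless: [0.9, 0.1, 0.0],
  workers:    [0.8, 0.2, 0.1],
  photos:     [0.0, 0.1, 0.9],
};
const terms = expandQuery([1, 0, 0], vocab);
```

This is why the browser index stores ~8,000 term vectors instead of 203 document vectors: the embeddings only broaden the query, and BM25 still does the actual retrieval.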
Performance
| | Browser | Server (default) |
|---|---|---|
| Client download | ~156 MB | ~2.6 MB |
| First semantic search | 5-20 sec | ~200 ms |
| Subsequent searches | ~50 ms | ~200 ms |
| Mobile | Crashes | Works |
| Offline semantic | Supported | Not supported |
| Monthly cost | $0 | ~$0.01 |
Summary
What was done: semantic search defaults to Cloudflare Workers, browser ONNX kept as switchable option. One config field controls it, build pipeline adapts automatically, client code has one interface with two implementations.
Result: mobile no longer crashes, first search dropped from 5-20 seconds to 200ms, client download from 156MB to 2.6MB. Browser approach code is still there — change one config to switch back.