Dual-Mode Semantic Search: When "Delete the Code" Became "Keep Both"
Part three of the blog search series. Part one built hybrid search in the browser. Part two designed the migration to Cloudflare Workers. This post covers the implementation.
The plan was to replace the browser-side ONNX approach with Workers. What actually happened: I kept both — Workers as default, browser ONNX switchable via config.
Why Keep Both
The browser approach has clear problems: a 156MB download and crashes on mobile. But the code itself works fine: Int8 quantized vectors, the KVEC binary format, and in-browser model inference all run correctly on desktop. It just shouldn't be the default. Keeping it costs nothing; deleting it is permanent.
The solution is one config field:
{
  "semanticSearchMode": "server",
  "semanticSearchApi": "https://yuxu.ge/api/semantic-search"
}
"server" uses Cloudflare Workers. "browser" uses browser-side ONNX. Default is "server".
Worker Implementation
Added a /api/semantic-search route to the existing Worker:
if (path === "/api/semantic-search") {
  const body = await request.json();
  const query = (body.query || "").trim();
  const topK = Math.min(Math.max(parseInt(body.topK) || 10, 1), 50);

  // 1. Vectorize query → OpenAI (~150ms)
  const embData = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      input: query,
      model: "text-embedding-3-small",
      dimensions: 512,
    }),
  }).then(r => r.json());
  const queryVec = embData.data[0].embedding;

  // 2. Read document vectors from KV (~2ms)
  const embeddings = await env.SEARCH_EMBEDDINGS.get("search_embeddings", { type: "json" });

  // 3. Cosine similarity 203×512 (<1ms)
  const results = embeddings.documents.map(doc => {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < 512; i++) {
      dot += queryVec[i] * doc.embedding[i];
      normA += queryVec[i] * queryVec[i];
      normB += doc.embedding[i] * doc.embedding[i];
    }
    return {
      url: doc.url, title: doc.title,
      score: dot / (Math.sqrt(normA) * Math.sqrt(normB)),
    };
  });

  // 4. Sort, filter, respond as JSON
  results.sort((a, b) => b.score - a.score);
  const top = results.filter(r => r.score > 0.3).slice(0, topK);
  return new Response(JSON.stringify({ results: top }), {
    headers: { "Content-Type": "application/json" },
  });
}
203 documents (148 articles + 55 photos) stored as one KV entry, 1.3MB. Brute-force cosine similarity under 1ms — no need for a vector database. Total latency ~200ms, mostly the OpenAI API call.
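The "no vector database needed" claim is easy to sanity-check. A quick sketch you can run in Node: brute-force cosine over 203 random 512-dim vectors (the sizes from the post; the data here is synthetic, not the real embeddings):

```javascript
// Sanity check: brute-force cosine similarity over 203 x 512 vectors.
// Sizes match the post; vectors are random, not real embeddings.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DIMS = 512, DOCS = 203;
const randVec = () => Float32Array.from({ length: DIMS }, () => Math.random() - 0.5);
const docs = Array.from({ length: DOCS }, randVec);
const query = randVec();

const t0 = performance.now();
const scores = docs.map(d => cosine(query, d));
const t1 = performance.now();
console.log(`scored ${scores.length} docs in ${(t1 - t0).toFixed(2)}ms`);
```

At this corpus size the full scan finishes well inside a millisecond on commodity hardware, so an index structure would only add complexity.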
Build Pipeline
Generating Vectors
embed-builder.mjs calls the OpenAI API for each document. The input is the title + tags + the first 500 characters of content. SHA256-based incremental caching skips unchanged documents.
import crypto from 'node:crypto';

const hash = crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
if (cache[url]?.hash === hash) continue;
First build: 203 docs, batched by 50, cost ~$0.0005. Incremental builds make zero API calls.
Smart Dispatch
build-vectors.mjs picks the right script based on config:
if (mode === 'server') {
  runScript('embed-builder.mjs');    // OpenAI document vectors
  if (!fs.existsSync(browserOutput)) {
    runScript('vector-builder.mjs'); // auto-fill browser vectors
  }
} else {
  runScript('vector-builder.mjs');   // e5-small keyword vectors
  if (!fs.existsSync(serverOutput)) {
    runScript('embed-builder.mjs');
  }
}
npm run build:search handles it regardless of current mode.
Client Changes
SearchClient takes different paths based on mode. Same external interface.
Initialization
async init() {
  const loadVectors = this.enableSemantic && !this.useServerSemantic;
  const [invertedData, metadataData, vectorsData] = await Promise.all([
    this._loadWithCache(this.invertedIndexUrl, 'json'),
    this._loadWithCache(this.metadataUrl, 'json'),
    loadVectors ? this._loadWithCache(this.vectorsUrl, 'arraybuffer') : null,
  ]);
  // server mode stops here;
  // browser mode loads the BGE model in the background
  if (!this.useServerSemantic && this.enableSemantic && this.vectorsReady) {
    this._loadBGEModel();
  }
}
Server mode loads only inverted index and metadata (~2.6MB). Browser mode additionally downloads 3MB vectors + ~150MB ONNX model.
Semantic Search
async semanticSearch(query, limit) {
  if (this.useServerSemantic) {
    return this._serverSemanticSearch(query, limit);
  }
  return this._browserSemanticSearch(query, limit);
}

async _serverSemanticSearch(query, limit) {
  const res = await fetch(this.semanticSearchUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: query.trim(), topK: limit }),
  });
  if (!res.ok) return [];
  const data = await res.json();
  return (data.results || []).map((r, i) => ({
    url: r.url, score: r.score, rank: i + 1, source: 'semantic',
  }));
}
Server mode is one fetch call. Browser mode runs BGE inference + keyword vector expansion.
State
isSemanticReady() {
  if (this.useServerSemantic) return this.enableSemantic;
  return this.modelReady && this.vectorsReady;
}

isModelLoading() {
  if (this.useServerSemantic) return false;
  return this.enableSemantic && this.vectorsReady && !this.modelReady;
}
Hybrid search fusion unchanged: BM25 weight 0.6, semantic weight 0.4, co-occurrence boost 1.2x, title match boost 1.5x.
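The fusion step can be sketched as follows. Only the weights and boosts come from the post; the function shape and field names are illustrative, not the actual SearchClient code:

```javascript
// Hybrid fusion sketch with the weights from the post:
// BM25 0.6, semantic 0.4, co-occurrence boost 1.2x, title match boost 1.5x.
// Input/output shapes are assumptions for illustration.
function fuseScores(bm25, semantic) {
  const fused = new Map();
  for (const r of bm25) {
    fused.set(r.url, { url: r.url, score: 0.6 * r.score, inBm25: true, titleMatch: !!r.titleMatch });
  }
  for (const r of semantic) {
    const entry = fused.get(r.url) || { url: r.url, score: 0, inBm25: false, titleMatch: false };
    entry.score += 0.4 * r.score;
    entry.inSemantic = true;
    fused.set(r.url, entry);
  }
  for (const entry of fused.values()) {
    if (entry.inBm25 && entry.inSemantic) entry.score *= 1.2; // co-occurrence boost
    if (entry.titleMatch) entry.score *= 1.5;                 // title match boost
  }
  return [...fused.values()].sort((a, b) => b.score - a.score);
}

const out = fuseScores(
  [{ url: "/a", score: 1.0, titleMatch: true }, { url: "/b", score: 0.8 }],
  [{ url: "/b", score: 0.9 }]
);
```

In this toy example `/b` wins despite lower individual scores, because appearing in both result lists triggers the co-occurrence boost: 1.2 × (0.6 × 0.8 + 0.4 × 0.9) ≈ 1.008 versus 1.5 × 0.6 × 1.0 = 0.9 for `/a`.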
sidebar.js
const mode = sidebarConfig.semanticSearchMode || 'server';
const enableSemantic = mode === 'server'
  ? sidebarConfig.semanticSearch !== false
  : !isMobileDevice() && sidebarConfig.semanticSearch !== false;

searchClient = new SearchClient({
  enableSemantic,
  semanticSearchMode: mode,
  semanticSearchUrl: sidebarConfig.semanticSearchApi,
});
Server mode skips mobile detection and Service Worker registration. Browser mode keeps original logic.
Comparison
Vectors
| | Server | Browser |
|---|---|---|
| Model | OpenAI text-embedding-3-small | multilingual-e5-small |
| Dimensions | 512 | 384 |
| Granularity | Document-level (203 docs) | Keyword-level (~8,000 terms) |
| Storage | 1.3MB JSON (KV) | 3MB Int8 binary |
| Build cost | ~$0.0005 | $0 |
The two sides take genuinely different approaches, not just different models: the server embeds whole documents, while the browser embeds a vocabulary and uses it to expand queries semantically before the BM25 lookup.
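The browser-side expansion idea, as a sketch: given a query vector and a vocabulary of term vectors, take the nearest terms and feed them into the normal keyword search. The vocabulary, dimensions, and thresholds here are toy values, not the real e5-small output:

```javascript
// Sketch of keyword-level semantic expansion: embed a vocabulary once,
// then expand each query with its nearest terms before BM25 lookup.
// Toy 3-dim "embeddings" below; the real index uses 384-dim e5-small vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function expandQuery(queryVec, vocab, { topN = 3, minScore = 0.5 } = {}) {
  return Object.entries(vocab)
    .map(([term, vec]) => ({ term, score: cosine(queryVec, vec) }))
    .filter(t => t.score > minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(t => t.term);
}

// Terms pointing the same way as the query are "semantically close".
const vocab = {
  serverless: [0.9, 0.1, 0.0],
  workers:    [0.8, 0.2, 0.1],
  photos:     [0.0, 0.1, 0.9],
};
const terms = expandQuery([1, 0, 0], vocab);
```

This is why the browser index stores ~8,000 term vectors instead of 203 document vectors: the embeddings only broaden the query, and BM25 still does the actual retrieval.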
Performance
| | Browser | Server (default) |
|---|---|---|
| Client download | ~156 MB | ~2.6 MB |
| First semantic search | 5-20 sec | ~200 ms |
| Subsequent searches | ~50 ms | ~200 ms |
| Mobile | Crashes | Works |
| Offline semantic | Supported | Not supported |
| Monthly cost | $0 | ~$0.01 |
Summary
What was done: semantic search defaults to Cloudflare Workers, browser ONNX kept as switchable option. One config field controls it, build pipeline adapts automatically, client code has one interface with two implementations.
Result: mobile no longer crashes, first search dropped from 5-20 seconds to 200ms, client download from 156MB to 2.6MB. Browser approach code is still there — change one config to switch back.