Warlock.js v4

Semantic LLM Response Cache

Exact-match caching is useless for LLM prompts — “What’s your refund policy?” and “How do refunds work?” are different strings but the same question. A semantic cache keys on meaning: embed the prompt, look for a stored answer whose embedding is close enough, and reuse it. That turns repeat LLM calls (slow, expensive) into a vector lookup.

This recipe combines cache.set() with a vector option, which indexes answers, and cache.similar(), which retrieves them.

The cache is embedding-agnostic: you compute the vector with whatever embedder you like and hand it over. On a miss, call the model and store the answer indexed by the question's embedding:

import { cache } from "@warlock.js/cache";

// `embed` and `llm` stand in for your own embedder and model client.
async function answer(question: string): Promise<string> {
  const vector = await embed(question); // your embedder → number[]

  // 1. Look for a semantically close cached answer.
  const [hit] = await cache.similar<string>(vector, {
    topK: 1,
    threshold: 0.95, // only reuse very close matches
  });

  if (hit) {
    return hit.value; // cache hit: skip the model entirely
  }

  // 2. Miss → ask the model, then index the answer by the question's embedding.
  const reply = await llm.complete(question);

  await cache.set(`qa.${Date.now()}`, reply, {
    vector,
    ttl: "7d",
  });

  return reply;
}

Each similar() hit carries the original key, the stored value, and the cosine score:

const hits = await cache.similar<string>(vector, { topK: 3, threshold: 0.8 });
// [{ key: "qa.1714…", value: "Refunds are processed…", score: 0.97 }, …]

threshold is the cosine-similarity floor in [0, 1] — hits below it are dropped before topK truncation. It’s the dial between reuse and correctness:

  • High (0.95+) — reuse only near-identical questions. Safest; fewer hits.
  • Lower (0.85) — broader reuse, more hits, higher risk of serving an answer to a similar but not identical question.

Start strict and loosen only with evals to back it up.
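One way to gather that evidence is to replay a labelled set of question pairs offline and see how each candidate threshold trades hit rate against wrong reuse. The following is a minimal sketch. It assumes you have already scored each pair with your own embedder and labelled whether the cached answer would have been acceptable. `LabelledPair` and `sweepThresholds` are hypothetical names, not part of @warlock.js/cache.

```typescript
// Hypothetical eval helper; not part of @warlock.js/cache.
type LabelledPair = {
  score: number; // cosine similarity between the new question and the cached one
  sameAnswer: boolean; // human label: would the cached answer have been correct?
};

type SweepRow = {
  threshold: number;
  hitRate: number; // share of all pairs that would be served from cache
  wrongReuseRate: number; // share of those hits whose cached answer was wrong
};

function sweepThresholds(pairs: LabelledPair[], thresholds: number[]): SweepRow[] {
  return thresholds.map((threshold) => {
    // Mirror the cache's rule: scores below the threshold are dropped.
    const hits = pairs.filter((pair) => pair.score >= threshold);
    const wrong = hits.filter((pair) => !pair.sameAnswer);

    return {
      threshold,
      hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
      wrongReuseRate: hits.length === 0 ? 0 : wrong.length / hits.length,
    };
  });
}
```

Loosening the threshold is only justified if hitRate rises while wrongReuseRate stays at a level you can accept for that knowledge base.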

Pass tags to both the write and the query to keep separate knowledge bases from bleeding into each other — per-tenant answers, per-locale, per-product:

await cache.set(`qa.${id}`, reply, { vector, tags: ["tenant.7"], ttl: "7d" });

const hits = await cache.similar<string>(vector, {
  topK: 1,
  threshold: 0.95,
  tags: ["tenant.7"], // only consider this tenant's answers
});
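Because isolation depends on the write and the query using exactly the same tags, it can help to build them in one place. The helper below is a hypothetical sketch, not a library API. It assumes the "dimension.value" naming seen in the "tenant.7" example. How multiple tags combine in a query (any versus all) is not covered here, so verify that before scoping by more than one dimension.

```typescript
// Hypothetical helper; the "dimension.value" format follows the "tenant.7" example.
type Scope = {
  tenant?: string | number;
  locale?: string;
  product?: string;
};

function scopeTags(scope: Scope): string[] {
  const tags: string[] = [];

  if (scope.tenant !== undefined) tags.push(`tenant.${scope.tenant}`);
  if (scope.locale !== undefined) tags.push(`locale.${scope.locale}`);
  if (scope.product !== undefined) tags.push(`product.${scope.product}`);

  // An empty tag list would silently defeat the isolation, so fail loudly instead.
  if (tags.length === 0) {
    throw new Error("scopeTags: at least one scope dimension is required");
  }

  return tags;
}

// Build the tags once, then pass the same array to both calls.
const tenantTags = scopeTags({ tenant: 7 });
const writeOptions = { tags: tenantTags, ttl: "7d" }; // merge with { vector } for cache.set
const queryOptions = { topK: 1, threshold: 0.95, tags: tenantTags }; // pass to cache.similar
```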
| Environment | Driver | Notes |
| --- | --- | --- |
| Dev / small datasets | memory | Brute-force O(N) scan; fine for a few thousand entries |
| Production | pg (with vector config) | Real pgvector index (HNSW / IVFFlat), native cosine <=> |

The code above doesn’t change between them — only your driver config does. Drivers without a similarity index (redis, file) throw CacheUnsupportedError when you pass vector; see the capability matrix.
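To make the O(N) note concrete, here is a rough sketch of what a brute-force similarity scan does: score every stored vector against the query, drop anything under the threshold, then keep the top K. It illustrates the technique only and is not the memory driver's actual code. `Entry`, `Hit`, `cosine`, and `bruteForceSimilar` are hypothetical names.

```typescript
// Illustrative only: not the memory driver's implementation.
type Entry<T> = { key: string; value: T; vector: number[] };
type Hit<T> = { key: string; value: T; score: number };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero magnitude.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vector dimensions differ");
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function bruteForceSimilar<T>(
  entries: Entry<T>[],
  query: number[],
  options: { topK: number; threshold: number },
): Hit<T>[] {
  return entries
    // One comparison per stored entry: this is the O(N) part.
    .map((entry) => ({ key: entry.key, value: entry.value, score: cosine(query, entry.vector) }))
    // Apply the threshold first...
    .filter((hit) => hit.score >= options.threshold)
    // ...then rank best-first and truncate to topK.
    .sort((x, y) => y.score - x.score)
    .slice(0, options.topK);
}
```

Every query touches every entry, which is why this approach suits a few thousand entries, while an approximate index such as HNSW or IVFFlat avoids the full scan on larger datasets.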

:::tip Cost story
Every cache hit here is a model call you didn’t pay for. Pair a strict threshold with a generous ttl on stable answers (policies, docs) to maximize the hit rate where reuse is safe.
:::