Semantic LLM Response Cache
Exact-match caching is useless for LLM prompts — “What’s your refund policy?” and “How do refunds work?” are different strings but the same question. A semantic cache keys on meaning: embed the prompt, look for a stored answer whose embedding is close enough, and reuse it. That turns repeat LLM calls (slow, expensive) into a vector lookup.
This combines `set({ vector })` to index answers and `cache.similar()` to retrieve them.
The pattern
Cache is embedding-agnostic: you compute the vector with whatever embedder you like and hand it over. On a miss, call the model and store the answer indexed by the question's embedding:
```typescript
import { cache } from "@warlock.js/cache";

// `embed` and `llm` stand in for your own embedder and model client.
async function answer(question: string): Promise<string> {
  const vector = await embed(question); // your embedder → number[]

  // 1. Look for a semantically close cached answer.
  const [hit] = await cache.similar<string>(vector, {
    topK: 1,
    threshold: 0.95, // only reuse very close matches
  });

  if (hit) {
    return hit.value; // cache hit: skip the model entirely
  }

  // 2. Miss → ask the model, then index the answer by the question's embedding.
  const reply = await llm.complete(question);

  await cache.set(`qa.${Date.now()}`, reply, {
    vector,
    ttl: "7d",
  });

  return reply;
}
```

Each `similar()` hit carries the original key, the stored value, and the cosine score:

```typescript
const hits = await cache.similar<string>(vector, { topK: 3, threshold: 0.8 });
// [{ key: "qa.1714…", value: "Refunds are processed…", score: 0.97 }, …]
```

Tuning the threshold
`threshold` is the cosine-similarity floor in [0, 1]: hits below it are dropped before `topK` truncation. It’s the dial between reuse and correctness:

- High (`0.95+`): reuse only near-identical questions. Safest; fewer hits.
- Lower (`0.85`): broader reuse, more hits, higher risk of serving an answer to a similar but not identical question.

Start strict and loosen only with evals to back it up.
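
One way to back a threshold change with evals is to score a labeled set of question pairs and see how each candidate threshold trades hit rate against wrong reuse. The sketch below is an assumption about how such a check could look, not part of the library: `LabeledPair`, `ThresholdReport`, and `evaluateThresholds` are hypothetical names, and the labels are assumed to come from your own review of which question pairs can safely share an answer.

```typescript
// A labeled pair: the cosine score between two questions, plus a human
// judgment of whether the same cached answer is correct for both.
type LabeledPair = { score: number; sameAnswer: boolean };

type ThresholdReport = {
  threshold: number;
  hitRate: number; // fraction of pairs that would be served from cache
  wrongReuseRate: number; // fraction of those hits that would be wrong
};

function evaluateThresholds(
  pairs: LabeledPair[],
  thresholds: number[],
): ThresholdReport[] {
  return thresholds.map((threshold) => {
    // A pair counts as a hit when its score clears the floor.
    const hits = pairs.filter((pair) => pair.score >= threshold);
    // A hit is wrong when the two questions need different answers.
    const wrong = hits.filter((pair) => !pair.sameAnswer);

    return {
      threshold,
      hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
      wrongReuseRate: hits.length === 0 ? 0 : wrong.length / hits.length,
    };
  });
}
```

Loosening the threshold is defensible only while `wrongReuseRate` stays at a level you can accept for the answers in question.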
Scoping with tags
Pass `tags` to both the write and the query to keep separate knowledge bases from bleeding into each other (per tenant, per locale, per product):

```typescript
await cache.set(`qa.${id}`, reply, { vector, tags: ["tenant.7"], ttl: "7d" });

const hits = await cache.similar<string>(vector, {
  topK: 1,
  threshold: 0.95,
  tags: ["tenant.7"], // only consider this tenant's answers
});
```

Dev vs. production
| Environment | Driver | Notes |
|---|---|---|
| Dev / small datasets | memory | Brute-force O(N) scan — fine for a few thousand entries |
| Production | pg (with vector config) | Real pgvector index (HNSW / IVFFlat), native cosine <=> |
The code above doesn’t change between them — only your driver config does.
Drivers without a similarity index (`redis`, `file`) throw `CacheUnsupportedError` when you pass `vector`; see the capability matrix.
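
To make the brute-force row of the table concrete, here is a minimal sketch of what an O(N) similarity scan might look like. It is an illustrative model, not the library's implementation: `bruteForceSimilar`, `Entry`, and `Hit` are hypothetical names, and the hit shape simply mirrors the `{ key, value, score }` objects shown earlier. It also shows the threshold filter running before `topK` truncation.

```typescript
// Illustrative model of a brute-force scan; not the library's actual code.
type Entry<T> = { key: string; value: T; vector: number[] };
type Hit<T> = { key: string; value: T; score: number };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero norm.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vector dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function bruteForceSimilar<T>(
  entries: Entry<T>[],
  query: number[],
  options: { topK: number; threshold: number },
): Hit<T>[] {
  return entries
    // Score every stored vector against the query: this is the O(N) part.
    .map(({ key, value, vector }) => ({ key, value, score: cosine(query, vector) }))
    // Drop weak matches first...
    .filter((hit) => hit.score >= options.threshold)
    // ...then rank best-first and truncate to topK.
    .sort((x, y) => y.score - x.score)
    .slice(0, options.topK);
}
```

Every query touches every stored vector, which is why this approach suits a few thousand entries while an indexed store suits larger sets.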
:::tip Cost story
Every cache hit here is a model call you didn’t pay for. Pair a strict
threshold with a generous ttl on stable answers (policies, docs) to
maximize the hit rate where reuse is safe.
:::
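
The savings can be estimated with simple arithmetic. The sketch below assumes the pattern shown earlier, where every request is embedded before the lookup, so the embedding cost is paid on hits and misses alike while the model cost is paid only on misses. `CostInputs` and `estimateCosts` are hypothetical names, and any prices you pass in are your own inputs rather than real provider rates.

```typescript
// All costs are caller-supplied estimates, not real provider prices.
type CostInputs = {
  requests: number; // total questions served
  hitRate: number; // fraction answered from cache, in [0, 1]
  modelCostPerCall: number; // cost of one LLM completion
  embedCostPerCall: number; // cost of embedding one question
};

function estimateCosts(inputs: CostInputs) {
  const { requests, hitRate, modelCostPerCall, embedCostPerCall } = inputs;

  // Without a cache, every request pays for a model call.
  const withoutCache = requests * modelCostPerCall;

  // With a cache, every request pays for an embedding,
  // and only the misses pay for a model call.
  const withCache =
    requests * embedCostPerCall + requests * (1 - hitRate) * modelCostPerCall;

  return { withoutCache, withCache, saved: withoutCache - withCache };
}
```

The cache pays for itself only when the hit rate exceeds the ratio of embedding cost to model cost, which is usually a low bar but worth checking for very cheap models.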
Related Documentation
- Similarity Retrieval: `cache.similar()` in full
- Set Options: the `vector` write option
- Postgres Cache Driver: production pgvector setup
- Choosing a Driver: similarity support per driver