RAG — retrieval-augmented generation
Retrieval-augmented generation: find the most relevant chunks of your content, hand them to the LLM as context, generate the answer. Cascade owns the storage and retrieval halves — the embedding model and the LLM live in your AI module.
This recipe focuses on Cascade’s part: the vector model, the indexing service, and the retrieval service. The AI calls are shown as pseudo-imports so the pattern stays focused on the data layer.
The vector model
Section titled “The vector model”One row per chunk of indexed content:
import { Model, RegisterModel } from "@warlock.js/cascade";import { type Infer, v } from "@warlock.js/seal";
export const knowledgeChunkSchema = v.object({ contentId: v.string(), contentType: v.string(), chunkIndex: v.number(), text: v.string(), embedding: v.any(),});
type KnowledgeChunkSchema = Infer<typeof knowledgeChunkSchema>;
@RegisterModel()export class KnowledgeChunk extends Model<KnowledgeChunkSchema> { public static table = "knowledge_chunks"; public static schema = knowledgeChunkSchema;}contentId + contentType point back at the source document (a Faq, an Article, a Product — whatever you’re indexing). chunkIndex orders the chunks within one document. embedding holds the vector — declared with v.any() here since the shape varies; in the migration it’s a real vector(1536) column with .vectorIndex({ similarity: "cosine" }).
Indexing — write chunks at content-change time
Section titled “Indexing — write chunks at content-change time”A service that takes a piece of content, splits it, embeds each chunk, stores the rows:
import { KnowledgeChunk } from "../models/knowledge-chunk/knowledge-chunk.model";import { embed } from "app/ai/services/embed.service";import { splitIntoChunks } from "app/ai/utils/split-into-chunks";
type IndexContentInput = { contentId: string; contentType: string; text: string;};
export async function indexContent(input: IndexContentInput) { await KnowledgeChunk.where("contentId", input.contentId) .where("contentType", input.contentType) .delete();
const chunks = splitIntoChunks(input.text);
for (let chunkIndex = 0; chunkIndex < chunks.length; chunkIndex++) { const chunkText = chunks[chunkIndex]; const embedding = await embed(chunkText);
await KnowledgeChunk.create({ contentId: input.contentId, contentType: input.contentType, chunkIndex, text: chunkText, embedding, }); }}Two patterns worth noticing:
- Delete-then-rewrite for re-indexing. When the source document changes, the cheapest thing to do is drop all its chunks and create fresh ones — chunk boundaries shift as text changes, so trying to update in place gets complicated.
- Embed per chunk, not per document. Each chunk gets its own vector. The size of the chunk depends on your model’s token limit and the granularity you want — typically 200-500 tokens.
Wire the indexing into the source model’s onSaved event so content stays in sync:
import { Faq } from "../models/faq/faq.model";import { indexContent } from "../services/index-content.service";
Faq.events().onSaved(async (faq) => { await indexContent({ contentId: String(faq.id), contentType: "faq", text: faq.get("body") as string, });});For high-write content, push this through the outbox pattern — embedding is slow and shouldn’t block the save path.
Retrieval — find the closest chunks
Section titled “Retrieval — find the closest chunks”The query service: embed the question, run .similarTo(), return the top K with their text:
import { KnowledgeChunk } from "../models/knowledge-chunk/knowledge-chunk.model";import { embed } from "app/ai/services/embed.service";
type RetrieveInput = { question: string; contentType?: string; limit?: number;};
export async function retrieveChunks(input: RetrieveInput) { const queryEmbedding = await embed(input.question); const query = KnowledgeChunk.query();
if (input.contentType) { query.where("contentType", input.contentType); }
return query .similarTo("embedding", queryEmbedding) .limit(input.limit ?? 5) .get<KnowledgeChunkRow & { score: number }>();}
type KnowledgeChunkRow = { contentId: string; contentType: string; chunkIndex: number; text: string;};.similarTo does two things to the query: it adds the similarity score as a SELECT field, and orders by distance so the vector index can be used. See the Vector search guide for the mechanics.
limit is essentially required — vector search is built around top-K. Without a limit, you’d score every chunk in the table.
Putting it together — the RAG service
Section titled “Putting it together — the RAG service”import { retrieveChunks } from "./retrieve-chunks.service";import { generateAnswer } from "app/ai/services/generate-answer.service";
type AnswerInput = { question: string; contentType?: string;};
export async function answerWithContext(input: AnswerInput) { const chunks = await retrieveChunks({ question: input.question, contentType: input.contentType, limit: 5, });
const context = chunks.map((chunk) => chunk.text).join("\n\n---\n\n");
return generateAnswer({ question: input.question, context, });}Three steps, each in its own service: retrieve, format, generate. Swapping the embedding model, the LLM, or the chunking strategy means touching one file each — not threading new args through the entire pipeline.
Filtering before similarity is the cheap path
Section titled “Filtering before similarity is the cheap path”When the corpus is large (>100k chunks) and you can narrow the candidate set with regular where clauses, do that before .similarTo():
return query .where("contentType", "faq") .where("organizationId", currentOrganization.id) .similarTo("embedding", queryEmbedding) .limit(5) .get();The database eliminates non-matching rows on regular indexes first, then scores the remainder. Much cheaper than scoring everything and then filtering.
Going further
Section titled “Going further”- Vector search mechanics — Vector search guide
- Hybrid search combining vector + full-text — Hybrid search recipe
- MongoDB Atlas vector index setup — Atlas vector setup recipe
- Outbox for async indexing — Outbox pattern recipe