# Architecture & Concepts

Understand how Memories, Chats, Users, and RAG work together in MemoryKit.

Learn how MemoryKit processes, stores, and retrieves your data. This page covers the data model, ingestion pipeline, retrieval strategies, and streaming architecture.
## Data model

MemoryKit has three core entities that work together:

- Users own both Memories and Chats. Pass `userId` to scope all operations to a specific user.
- Memories hold your raw content. Each memory is automatically split into Chunks, which are embedded and indexed.
- Chats are conversational sessions. Each message triggers a RAG query against the user's memories to generate context-aware responses.

Memories and Chats are connected through RAG: when a Chat message is sent, MemoryKit searches relevant Memories to build context for the LLM response.
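The relationships above can be sketched as plain data structures. This is an illustration only, not SDK code: the class and field names (`Chunk`, `Memory`, `Chat`, `user_id`) are hypothetical stand-ins for the real API shapes.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str  # each chunk is embedded and indexed for retrieval

@dataclass
class Memory:
    user_id: str  # Memories are scoped to a User via userId
    content: str
    chunks: list[Chunk] = field(default_factory=list)

@dataclass
class Chat:
    user_id: str  # Chats are scoped to the same User
    messages: list[str] = field(default_factory=list)

def memories_for_chat(chat: Chat, memories: list[Memory]) -> list[Memory]:
    """A Chat's RAG query only searches the chat owner's Memories."""
    return [m for m in memories if m.user_id == chat.user_id]
```

The key point the sketch captures is scoping: every operation is filtered by the owning user, so one user's chats never retrieve another user's memories.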
## Ingestion pipeline

When you create a memory, its content goes through an asynchronous pipeline: it is split into chunks, embedded, and indexed.
### Status transitions

| From | To | Trigger |
|---|---|---|
| — | `processing` | Memory created via API |
| `processing` | `completed` | Pipeline finished successfully |
| `processing` | `failed` | Error during processing |
### Waiting for completion

Since ingestion is asynchronous, you need to wait for the memory to reach `completed` status before it's searchable. Two options:
- Polling — Check status periodically. See the Memories guide for code examples.
- Webhooks — Get notified via HTTP callback. See the Webhooks guide.
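For the polling option, a minimal loop might look like the following. The `get_status` callable is a hypothetical stand-in for whatever status-check call your client exposes; see the Memories guide for the real API.

```python
import time

def wait_for_completion(get_status, interval: float = 1.0, timeout: float = 60.0) -> str:
    """Poll until the memory reaches a terminal status.

    get_status: callable returning "processing", "completed", or "failed".
    Raises RuntimeError on failure, TimeoutError if still processing
    after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("memory ingestion failed")
        time.sleep(interval)  # back off before the next check
    raise TimeoutError("memory did not complete in time")
```

In production, prefer webhooks over tight polling loops; if you do poll, use a generous interval to avoid rate limits.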
## Retrieval pipeline
MemoryKit supports three retrieval modes that control the speed/quality tradeoff:
| Mode | Pipeline | Retrieval | Total (with LLM) | Best for |
|---|---|---|---|---|
| `fast` | Vector similarity only | ~100ms | ~1s | Low-latency chatbots |
| `balanced` | Vector + reranking | ~300ms | ~2s | Most use cases (default) |
| `precise` | Vector + full-text + graph + reranking | ~800ms | ~3s | Knowledge-heavy apps |
*Retrieval* is the time to find relevant chunks. *Total* includes LLM answer generation (for query/chat endpoints). Search endpoints only incur retrieval latency.
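Selecting a mode is typically just a request parameter. As an illustration of building such a payload (the field names `query`, `userId`, and `mode` are assumptions; check the API reference for the actual schema):

```python
VALID_MODES = ("fast", "balanced", "precise")

def build_search_request(query: str, user_id: str, mode: str = "balanced") -> dict:
    """Build a search payload; `mode` trades latency for retrieval quality."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown retrieval mode: {mode!r}")
    return {"query": query, "userId": user_id, "mode": mode}
```

Defaulting to `balanced` mirrors the table above: it is the middle ground unless you explicitly need the latency of `fast` or the recall of `precise`.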
## RAG pipeline

When you call the Query or Chat endpoints, MemoryKit orchestrates a full RAG pipeline: it retrieves relevant chunks from the user's memories, assembles them into context, and generates an answer with the LLM.
The LLM model is configurable per project. By default, MemoryKit uses GPT-4o for answer generation. You can customize the model and behavior through the instructions parameter.
## Streaming architecture

Streaming endpoints use Server-Sent Events (SSE), a standard HTTP protocol for server-to-client real-time updates over a single connection.
Events are sent in order: text chunks (streamed as generated) followed by `sources`, `usage`, and `done`. If an error occurs at any point, an `error` event is sent instead.
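To make the wire format concrete, here is a minimal parser for a raw SSE payload. This is a sketch: real streams arrive incrementally over HTTP, and the sample event names below assume the order described above.

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Parse a raw SSE payload into (event, data) pairs.

    SSE frames are separated by blank lines; each frame carries
    `event:` and one or more `data:` fields.
    """
    events = []
    for frame in raw.strip().split("\n\n"):
        event, data = "message", []  # "message" is the SSE default event type
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        events.append((event, "\n".join(data)))
    return events
```

In practice you would consume these events from a streaming HTTP response and append text-chunk data to the displayed answer as it arrives, finalizing the UI when `done` (or `error`) is received.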