MemoryKit

Architecture & Concepts

Understand how Memories, Chats, Users, and RAG work together in MemoryKit.

Learn how MemoryKit processes, stores, and retrieves your data. This page covers the data model, ingestion pipeline, retrieval strategies, and streaming architecture.

Data model

MemoryKit has three core entities that work together:

User
 ├── Memories (scoped by userId)
 │    └── Chunks (auto-generated)
 │         └── Embeddings (auto-generated)
 └── Chats (scoped by userId)
      └── Messages
           ├── role: "user"
           └── role: "assistant" (RAG-powered)
  • Users own both Memories and Chats. Pass userId to scope all operations to a specific user.
  • Memories hold your raw content. Each memory is automatically split into Chunks, which are embedded and indexed.
  • Chats are conversational sessions. Each message triggers a RAG query against the user's memories to generate context-aware responses.
  • Memories and Chats are connected through RAG — when a Chat message is sent, MemoryKit searches relevant Memories to build context for the LLM response.
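
The hierarchy above can be sketched as plain data types. The field names here are illustrative, not MemoryKit's actual schema; the point is the ownership structure: a user owns memories and chats, memories own auto-generated chunks with embeddings, and chats own role-tagged messages.

```python
# Illustrative sketch of the MemoryKit data model — field names are
# assumptions, not the real API schema.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    embedding: list[float]          # auto-generated per chunk

@dataclass
class Memory:
    user_id: str                    # scopes the memory to one user
    content: str                    # your raw content
    chunks: list[Chunk] = field(default_factory=list)

@dataclass
class Message:
    role: str                       # "user" or "assistant"
    content: str

@dataclass
class Chat:
    user_id: str                    # scoped the same way as memories
    messages: list[Message] = field(default_factory=list)
```

Because both `Memory` and `Chat` carry the same `user_id`, a RAG query triggered by a chat message can be restricted to that user's memories alone.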

Ingestion pipeline

When you create a memory, content goes through an asynchronous pipeline:

POST /v1/memories


┌──────────────┐
│  Accepted    │  status: "processing"
│  (202)       │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Smart       │  Auto-extract title, tags,
│  Ingestion   │  language, content type
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Chunking    │  Split content into
│              │  semantic segments
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Embedding   │  Generate vector
│              │  embeddings per chunk
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Indexing    │  Vector index + full-text
│              │  index + knowledge graph
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Completed   │  status: "completed"
└──────────────┘
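
To make the Chunking step concrete, here is a deliberately naive stand-in: it splits on blank lines and caps segment length. MemoryKit's actual chunker is semantic and internal to the pipeline; this sketch only illustrates the shape of its output (a list of segments that each get embedded and indexed).

```python
# Naive stand-in for the Chunking step: split on paragraph breaks, then
# cap segment length. Illustrative only — not MemoryKit's real chunker.

def chunk(content: str, max_chars: int = 500) -> list[str]:
    segments = []
    for para in content.split("\n\n"):
        para = para.strip()
        # Hard-split any paragraph that exceeds the size cap.
        while len(para) > max_chars:
            segments.append(para[:max_chars])
            para = para[max_chars:]
        if para:
            segments.append(para)
    return segments

doc = "First paragraph about apples.\n\nSecond paragraph about oranges."
print(chunk(doc))  # two chunks, one per paragraph
```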

Status transitions

From         To           Trigger
(none)       processing   Memory created via API
processing   completed    Pipeline finished successfully
processing   failed       Error during processing

Waiting for completion

Since ingestion is asynchronous, a memory is not searchable until it reaches completed status. There are two ways to wait:

  • Polling — Check status periodically. See the Memories guide for code examples.
  • Webhooks — Get notified via HTTP callback. See the Webhooks guide.
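
The polling option can be sketched as a small loop. `fetch_status` here is any callable returning the memory's current status string; in practice it would wrap whatever endpoint your client uses to read a memory's status (the exact call is not shown in this page).

```python
# Polling sketch: block until a memory leaves "processing".
import time

def wait_for_completion(fetch_status, interval_s: float = 1.0,
                        timeout_s: float = 60.0) -> str:
    """Poll fetch_status() until it returns a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("memory still processing after timeout")

# Simulated status fetcher: "processing" twice, then "completed".
statuses = iter(["processing", "processing", "completed"])
print(wait_for_completion(lambda: next(statuses), interval_s=0.01))  # completed
```

Webhooks avoid this loop entirely: the server calls you back when the status transition happens.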

Retrieval pipeline

MemoryKit supports three retrieval modes that control the speed/quality tradeoff:

                    Query

          ┌───────────┼───────────┐
          ▼           ▼           ▼
       ┌──────┐  ┌─────────┐  ┌─────────┐
       │ fast │  │balanced │  │ precise │
       └──┬───┘  └────┬────┘  └────┬────┘
          │           │            │
          ▼           ▼            ▼
      Vector      Vector       Vector
      search      search       search
                    +             +
                 Reranking    Full-text
                              search
                                +
                             Graph
                             traversal
                                +
                             Reranking
          │           │            │
          ▼           ▼            ▼
       Results    Results      Results

Mode       Pipeline                                 Retrieval   Total (with LLM)   Best for
fast       Vector similarity only                   ~100ms      ~1s                Low-latency chatbots
balanced   Vector + reranking                       ~300ms      ~2s                Most use cases (default)
precise    Vector + full-text + graph + reranking   ~800ms      ~3s                Knowledge-heavy apps

Retrieval is the time to find relevant chunks. Total includes LLM answer generation (for query/chat endpoints). Search endpoints only incur retrieval latency.
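
One way to act on the tradeoff above is to pick the highest-quality mode that fits a latency budget. The ~ms figures below come from the table; the helper itself is an illustration, not part of the MemoryKit API.

```python
# Pick the best retrieval mode that fits a retrieval-latency budget.
# Latency figures are the approximate values from the table above.
RETRIEVAL_MS = {"fast": 100, "balanced": 300, "precise": 800}

def pick_mode(budget_ms: int) -> str:
    # Modes ordered best-quality first; fall back to "fast" if none fit.
    for mode in ("precise", "balanced", "fast"):
        if RETRIEVAL_MS[mode] <= budget_ms:
            return mode
    return "fast"

print(pick_mode(1000))  # precise
print(pick_mode(400))   # balanced
print(pick_mode(50))    # fast
```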

RAG pipeline

When you call the Query or Chat endpoints, MemoryKit orchestrates a full RAG pipeline:

User Query


┌──────────────┐
│  Retrieval   │  Find relevant chunks
│  (mode)      │  using selected strategy
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Context     │  Assemble chunks into
│  Assembly    │  a context window
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  LLM         │  Generate answer using
│  Generation  │  context + instructions
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Response    │  answer + sources +
│              │  token usage
└──────────────┘

The LLM is configurable per project; by default, MemoryKit uses GPT-4o for answer generation. You can steer its behavior with the instructions parameter.
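
Putting the pieces together, a query request body might look like the sketch below. The endpoint path and mode/instructions parameters are taken from this page; the exact field names are assumptions, so check the API reference before relying on them.

```python
# Hypothetical request body for POST /v1/memories/query — field names
# are illustrative and may differ from the actual API reference.
import json

body = {
    "userId": "user_alice",                    # scopes retrieval to one user
    "query": "What theme does the user prefer?",
    "mode": "balanced",                        # fast | balanced | precise
    "instructions": "Answer in one sentence.", # steers LLM generation
}
print(json.dumps(body, indent=2))
```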

Streaming architecture

Streaming endpoints use Server-Sent Events (SSE) — a standard HTTP protocol for server-to-client real-time updates over a single connection.

Client                          Server
  │                               │
  │  POST /v1/memories/query      │
  │  { "stream": true }           │
  │  ─────────────────────────►   │
  │                               │
  │  event: text                  │
  │  data: {"content": "Based"}  │
  │  ◄─────────────────────────   │
  │                               │
  │  event: text                  │
  │  data: {"content": " on"}    │
  │  ◄─────────────────────────   │
  │                               │
  │  event: sources               │
  │  data: [...]                  │
  │  ◄─────────────────────────   │
  │                               │
  │  event: usage                 │
  │  data: {"tokens_used": 150}  │
  │  ◄─────────────────────────   │
  │                               │
  │  event: done                  │
  │  data: {}                     │
  │  ◄─────────────────────────   │
  │                               │
  │  Connection closed            │

Events are sent in order: text chunks (streamed as generated) followed by sources, usage, and done. If an error occurs at any point, an error event is sent instead.
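
A client consumes this stream by splitting on blank lines and dispatching on the event name. The sketch below parses a captured copy of the exchange above; a real client would read the HTTP response incrementally rather than from a string, but the event handling is the same.

```python
# Minimal SSE consumer for the stream shown above. `raw` is a captured
# copy of the wire format; a real client reads it incrementally.
import json

raw = (
    'event: text\ndata: {"content": "Based"}\n\n'
    'event: text\ndata: {"content": " on"}\n\n'
    'event: sources\ndata: []\n\n'
    'event: usage\ndata: {"tokens_used": 150}\n\n'
    'event: done\ndata: {}\n\n'
)

def consume(stream: str) -> dict:
    answer, sources, usage = [], None, None
    for block in stream.strip().split("\n\n"):
        fields = dict(line.split(": ", 1) for line in block.splitlines())
        event, data = fields["event"], json.loads(fields["data"])
        if event == "text":
            answer.append(data["content"])   # streamed answer fragments
        elif event == "sources":
            sources = data
        elif event == "usage":
            usage = data
        elif event == "done":
            break                            # terminal event
    return {"answer": "".join(answer), "sources": sources, "usage": usage}

print(consume(raw)["answer"])  # Based on
```

An error event would surface here as an unhandled event name; production code should branch on it and raise.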
