Scaling & Resource Management
Cortex scales in two dimensions:
- Horizontal density — many isolated Brainy databases sharing one VM (the multi-tenant workload).
- Vertical scale — a single Brainy instance growing to billion-vector search on commodity hardware (the data-platform workload).
Both modes share the same plugin, the same APIs, and the same on-disk format. Cortex picks the right machinery at init based on what your storage adapter, dataset shape, and index type tell it.
Architecture Overview
A typical deployment runs 10–200 Brainy instances in one Node.js process on a GCE VM. Each instance has its own data directory on an SSD, its own HNSW vector index, metadata index, and graph index. Cortex manages what's shared and what's isolated:
| Component | Scope | Memory |
|---|---|---|
| Embedding model (all-MiniLM-L6-v2) | Shared singleton | 88 MB, loaded once |
| SIMD distance calculations | Shared | Zero per-instance cost |
| UnifiedCache (eviction engine) | Shared singleton | Dynamic — adapts to instance count |
| HNSW vector index | Per-instance | ~700 bytes per entity |
| Metadata inverted index | Per-instance | ~300 bytes per entity |
| Entity ID mapper | Per-instance | ~56 bytes per entity |
| Graph adjacency (LSM-trees) | Per-instance | Mmapped — kernel manages pages |
| Storage adapter | Per-instance | Minimal (file handles) |
Adaptive Resource Manager
Cortex includes an adaptive resource manager that observes system resources and adjusts memory budgets in real time. No configuration required — it detects your VM size, container limits (cgroups v1/v2), and active instance count automatically.
What it observes:
- Total and free system memory (os.totalmem(), os.freemem())
- Container memory limits (cgroups v2 memory.max, cgroups v1 memory.limit_in_bytes)
- Process RSS and heap usage
- Number of active Brainy instances
- Per-instance entity counts
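As a rough illustration of the memory-limit part of this list, the sketch below combines os.totalmem() with the two cgroup files named above. It is not Cortex's actual detection code: the function names are hypothetical, and the file paths assume the conventional /sys/fs/cgroup mount point.

```typescript
import { readFileSync } from 'node:fs'
import { totalmem } from 'node:os'

// Parse the contents of a cgroup memory-limit file.
// cgroups v2 writes the literal "max" when no limit is set.
function parseCgroupLimit(raw: string): number | null {
  const text = raw.trim()
  if (text === '' || text === 'max') return null
  const bytes = Number(text)
  return Number.isFinite(bytes) && bytes > 0 ? bytes : null
}

// Read one limit file; a missing file just means "no limit from this source".
function readLimit(path: string): number | null {
  try {
    return parseCgroupLimit(readFileSync(path, 'utf8'))
  } catch {
    return null // not Linux, or this cgroup version is not mounted
  }
}

// The effective budget is the smallest of host RAM and any container limit.
// cgroups v1 reports a huge sentinel when unlimited, which the min() ignores.
function effectiveMemoryLimit(hostBytes: number, limits: Array<number | null>): number {
  let limit = hostBytes
  for (const candidate of limits) {
    if (candidate !== null && candidate < limit) limit = candidate
  }
  return limit
}

// Assumed paths: the standard cgroup v2 and v1 mount locations.
function detectMemoryLimitBytes(): number {
  return effectiveMemoryLimit(totalmem(), [
    readLimit('/sys/fs/cgroup/memory.max'),
    readLimit('/sys/fs/cgroup/memory/memory.limit_in_bytes'),
  ])
}
```

Taking the minimum means an unlimited container falls back to the host total, while a constrained container is budgeted against its own ceiling.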
What it decides:
- Cache budget — shrinks dynamically as more instances are loaded
- Memory pressure level (normal / elevated / critical)
- Eviction candidates — weighted by idle time × memory usage, so large idle instances are evicted before small active ones
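The eviction weighting can be sketched as a simple sort. This is a minimal illustration of "idle time × memory usage", not the resource manager's real implementation; the InstanceStats shape and the function name are assumptions made for the example.

```typescript
// Hypothetical per-instance bookkeeping; field names are illustrative only.
interface InstanceStats {
  id: string
  lastAccessMs: number // timestamp of the most recent request
  memoryBytes: number  // estimated resident memory for this instance
}

// Rank instances so the best eviction candidate comes first.
// Score = idle time × memory usage, so large idle instances rank highest.
function rankEvictionCandidates(instances: InstanceStats[], nowMs: number): InstanceStats[] {
  const score = (i: InstanceStats): number =>
    Math.max(0, nowMs - i.lastAccessMs) * i.memoryBytes
  return [...instances].sort((a, b) => score(b) - score(a))
}
```

With this scoring, a 54 MB workspace idle for ten minutes ranks ahead of a 4.5 MB tenant idle for the same time, and both rank far ahead of a small tenant touched two seconds ago.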
Memory-Mapped Storage (Mmap)
Cortex's MmapFileSystemStorage extends Brainy's filesystem storage with zero-copy binary I/O. The Rust native layer memory-maps data files using the memmap2 crate, letting the Linux kernel manage which pages stay in RAM.
What mmap provides:
- HNSW vector index: The entire HNSW graph (vectors + connections) is stored as a single binary .hnsw file, memory-mapped for zero-copy reads. Search traverses the graph directly on mmap'd pages, with no JSON parsing, no gzip decompression, and no heap allocation for vector data. The kernel manages which pages stay in RAM.
- Graph adjacency SSTables: 4 LSM-trees for relationship data are memory-mapped. Read-only pages are automatically paged out under memory pressure.
- Dual-mode search: After flush(), search reads vectors from mmap pages (zero-copy). During active mutations, the existing in-memory engine handles search. This gives maximum density between mutations while preserving mutation speed.
Why this matters for density: On a 16 GB VM with a 200 GB SSD, you can effectively manage 200 GB of Brainy data with only the hot working set in RAM. The SSD acts as an extension of memory, managed by the Linux kernel with no application-level complexity. An idle tenant's HNSW data is automatically paged out by the kernel, so no explicit eviction is needed.
Instant Suspend & Resume
Because each Brainy instance has its own isolated data directory, Cortex can evict instances from memory and reload them later without data loss:
- Eviction: brain.close() flushes any pending writes to SSD, then frees all in-memory structures
- Data persists: The data directory on SSD remains intact
- Reload: The next request for the same workspace/tenant re-initializes from SSD. The binary mmap load is a single syscall, typically under 100 ms for a 500-entity tenant (PROJECTED: design target; no reload-time benchmark in CI yet; measured numbers ship in docs/verification-report.md)
This is fundamentally different from traditional databases that require replication or WAL replay. Each Brainy instance is a single-writer local store — suspend/resume is an munmap/mmap away.
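The suspend/resume cycle can be pictured as a small least-recently-used pool. The sketch below is an illustration under assumptions, not Cortex's eviction engine: TenantPool and its open callback are hypothetical names, and the only thing it requires of an instance is the close() method described above.

```typescript
// Anything with an async close() can be pooled, e.g. a Brainy instance.
interface Closable {
  close(): Promise<void>
}

class TenantPool<T extends Closable> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly resident = new Map<string, T>()
  private readonly open: (tenantId: string) => Promise<T>
  private readonly maxResident: number

  constructor(open: (tenantId: string) => Promise<T>, maxResident: number) {
    this.open = open
    this.maxResident = maxResident
  }

  // Return a resident instance, or suspend the oldest one and reload from disk.
  async get(tenantId: string): Promise<T> {
    const hit = this.resident.get(tenantId)
    if (hit !== undefined) {
      this.resident.delete(tenantId) // re-insert to mark as most recently used
      this.resident.set(tenantId, hit)
      return hit
    }
    while (this.resident.size >= this.maxResident) {
      const oldest: string | undefined = this.resident.keys().next().value
      if (oldest === undefined) break
      await this.suspend(oldest)
    }
    const instance = await this.open(tenantId)
    this.resident.set(tenantId, instance)
    return instance
  }

  // Suspend = close(); the data directory stays on disk for the next get().
  async suspend(tenantId: string): Promise<void> {
    const instance = this.resident.get(tenantId)
    if (instance === undefined) return
    this.resident.delete(tenantId)
    await instance.close()
  }
}
```

In a real deployment the open callback would construct an instance over the tenant's data directory and await its init(). The sketch does not deduplicate concurrent opens of the same tenant, which a production pool would need to handle.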
Capacity Planning
Per-Instance Memory
PROJECTED: design targets, not measured. The table below is extrapolated from per-component memory analysis (HNSW graph + metadata inverted index + entity ID mapper sizings under their target capacity). No multi-tenant RSS benchmark backs the numbers today; measured per-instance memory at each tier ships in docs/verification-report.md as part of Piece 9 of the cortex 3.0 release. Per CLAUDE.md, perf claims without a MEASURED citation must carry a PROJECTED label until verified.
| Entity count | Estimated memory | Typical use case |
|---|---|---|
| 100 | ~4 MB | Light workspace or new tenant |
| 500 | ~4.5 MB | Venue after onboarding (time slots, customers, bookings) |
| 2,000 | ~6 MB | Medium workspace with documents and notes |
| 5,000 | ~9 MB | Large workspace with extensive content |
| 10,000 | ~14 MB | Power user with months of accumulated data |
| 50,000 | ~54 MB | Heavy workspace (years of data, many entity types) |
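The rows above are close to a linear model: a fixed baseline plus the per-entity costs from the component table (700 + 300 + 56 bytes). The sketch below reproduces that shape. The ~4 MB baseline is inferred from the 100-entity row and is an assumption, not a documented constant, and the output carries the same PROJECTED status as the table.

```typescript
// Per-entity costs from the component table: HNSW + metadata index + ID mapper.
const BYTES_PER_ENTITY = 700 + 300 + 56

// Assumed fixed per-instance baseline, inferred from the ~4 MB row at 100 entities.
const ASSUMED_BASE_BYTES = 4 * 1024 * 1024

// Estimate resident memory for one instance, in MiB.
function estimateInstanceMemoryMiB(entityCount: number): number {
  return (ASSUMED_BASE_BYTES + entityCount * BYTES_PER_ENTITY) / (1024 * 1024)
}

for (const n of [100, 500, 2_000, 5_000, 10_000, 50_000]) {
  console.log(`${n} entities: ~${estimateInstanceMemoryMiB(n).toFixed(1)} MiB`)
}
```

An estimate like this is mainly useful for comparing tenants of different sizes; it says nothing about cache usage or mmap'd pages, which the kernel accounts for separately.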
VM Sizing Guide
Fixed overhead per process: ~240 MB (88 MB embedding model + Node.js baseline)
PROJECTED: design targets, not measured. The tables below are extrapolations from the per-instance memory model above plus a fixed per-process overhead estimate. No multi-tenant load test on a real GCE / Hetzner VM backs these tenant-density numbers today; measured tenant capacity at each VM tier ships in docs/verification-report.md.
Venue deployments (per-tenant, ~500 entities average):
| VM | RAM | Active tenants | Cache | Notes |
|---|---|---|---|---|
| e2-medium | 4 GB | 30–40 | 500 MB | Minimum viable — tight under load |
| e2-standard-2 | 8 GB | 80–100 | 1.5 GB | Recommended, with comfortable headroom |
| e2-standard-4 | 16 GB | 200+ | 4 GB | High-traffic, handles spikes easily |
Workshop deployments (per-user, mixed 100–50K entities):
| VM | RAM | Active (small) | Active (large) | Cache | Notes |
|---|---|---|---|---|---|
| e2-standard-2 | 8 GB | 200+ | 20–30 | 1.5 GB | Good for early-stage |
| e2-standard-4 | 16 GB | 500+ | 60–80 | 4 GB | Balanced density and speed |
| n2-standard-8 | 32 GB | 1,000+ | 150–200 | 8 GB | High density for growth |
Zero-Config Scaling
Cortex adapts automatically when the VM is resized:
- Upgrade from 8 GB to 16 GB: Cache budget doubles, more instances stay resident, fewer cold starts
- Downgrade or container limit: Cache shrinks, instances are evicted more aggressively, data stays on SSD
- No config changes required — Cortex reads system memory and cgroup limits at startup
Heterogeneous Density
Unlike fixed-slot allocation, Cortex tracks per-instance memory usage. A VM can simultaneously run:
- 100 small tenants (500 entities each, ~4.5 MB) = 450 MB
- 2 large workspaces (50K entities each, ~54 MB) = 108 MB
- Dynamic cache filling the remainder
The resource manager balances between them. When memory pressure rises, the large idle workspace is evicted first (weighted by idle time × memory usage) — not the small active tenant that was accessed 2 seconds ago.
Billion-Scale Search — DiskANN
For single-instance workloads beyond ~10 million vectors, HNSW's memory cost compounds quickly: at 1 B vectors × 384 dimensions, the float32 vectors alone are ~1.5 TB of RAM, and the HNSW graph metadata adds another ~2 TB. No commodity machine has that.
Cortex ships a 100% pure-Rust DiskANN engine (ADR-002) that targets ~5 ms search latency at billion scale with ~20 GB RAM (PROJECTED — design target; awaiting verification-report.md). The architecture is the Vamana α-pruned graph (Subramanya et al., NeurIPS 2019) plus Product Quantization, and the on-disk file is a single mmap-mappable contiguous layout. None of this requires a separate service or external dependency — it's the same @soulcraft/cortex plugin.
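The memory gap between raw vectors and PQ codes is simple arithmetic. The sketch below works it through for 1 B vectors at 384 dimensions. It assumes 16 PQ subspaces with one byte per subspace (the pqM and pqKsub values used in the build-tuning example later in this document), and it counts only vector payload, not graph or cache overhead.

```typescript
// Bytes needed to hold every vector as raw float32 (4 bytes per component).
function rawVectorBytes(vectorCount: number, dim: number): number {
  return vectorCount * dim * 4
}

// Bytes needed for PQ codes: one byte per subspace when each subspace
// has at most 256 centroids (8-bit codes).
function pqCodeBytes(vectorCount: number, subspaces: number): number {
  return vectorCount * subspaces
}

const vectorCount = 1_000_000_000
const rawTB = rawVectorBytes(vectorCount, 384) / 1e12 // decimal terabytes
const pqGB = pqCodeBytes(vectorCount, 16) / 1e9       // decimal gigabytes

console.log(`raw float32: ${rawTB} TB, PQ codes: ${pqGB} GB`)
```

Under these assumptions the codes are 96 times smaller than the raw vectors (1,536 bytes versus 16 bytes per vector), which is what makes a RAM budget in the tens of gigabytes plausible while the full-precision vectors stay on disk.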
What it actually delivers, by scale
PROJECTED: design targets, not measured. The table below is extrapolated from algorithm math (Vamana traversal cost + PQ ADC table-lookup cost + per-vector storage). The largest DiskANN dataset exercised in CI today is 10k synthetic dim=64 random vectors (#[ignore]-gated; recall threshold ≥ 0.95). Measured RAM + latency at 100M and 1B on real embedding corpora (E5-large-v2 or BGE-large) on cgroup-limited 32 GB hardware ship in docs/verification-report.md as part of Piece 9 of the cortex 3.0 release.
| Vectors | RAM with DiskANN | RAM with HNSW | Search latency (warm cache) |
|---|---|---|---|
| 1 M | 0.5–2 GB | 0.5–2 GB | <1 ms |
| 10 M | 1–5 GB | 8–20 GB | 1–3 ms |
| 100 M | 5–20 GB | 80–200 GB (impractical) | 2–5 ms |
| 1 B | 20–70 GB | 1.5+ TB (single-machine impossible) | 5–10 ms |
These numbers are search latency for the index itself. End-to-end query latency at 1 B also includes filesystem hydration of the returned entities; see Operational Ceiling at 1 B below for the full-stack picture and the roadmap to close the gaps.
How DiskANN engages
Cortex registers a 'diskann' provider; Brainy's createIndex() consults it at init:
- Explicit opt-in: config.index.type: 'diskann' requires the engagement conditions below to be satisfied; otherwise it throws.
- Auto-engagement when all of the following hold:
  - The Cortex DiskANN provider is registered (you've loaded the plugin).
  - The storage adapter exposes a local filesystem path (getBinaryBlobPath('_diskann/main')). Cloud-storage adapters return null here and stay on HNSW.
  - The metadata index has a stable idMapper (Cortex 2.4.0's stable EntityIdMapper).
- Explicit opt-out: config.index.type: 'hnsw' keeps the historical in-memory index.
```typescript
import { BrainyData } from '@soulcraft/brainy'
import { register as registerCortex } from '@soulcraft/cortex'

const brain = new BrainyData({
  storage: { type: 'filesystem', rootDirectory: '/data/idx' }
})
await registerCortex(brain)
await brain.init()
// → [brainy] DiskANN engaged (path=/data/idx/_diskann/main.bin, dim=384)

const hits = await brain.search(queryVector, 10)
```

All Brainy APIs (add, search, relate, searchSimilarVerbs, find) work unchanged. DiskANN is an HNSW-shaped drop-in.
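The engagement rules can be read as a small decision function. The sketch below is one interpretation of those rules, not Brainy's createIndex() source: the EngagementInputs shape and resolveIndexType name are hypothetical, and it assumes that an explicit 'diskann' request throws when the conditions are not met.

```typescript
type IndexType = 'hnsw' | 'diskann'

// Hypothetical summary of what the index factory can observe at init.
interface EngagementInputs {
  requestedType?: IndexType  // config.index.type, if the user set it
  providerRegistered: boolean // the Cortex plugin has been loaded
  blobPath: string | null     // result of getBinaryBlobPath('_diskann/main')
  hasStableIdMapper: boolean  // metadata index exposes a stable idMapper
}

function resolveIndexType(inputs: EngagementInputs): IndexType {
  // Explicit opt-out always wins.
  if (inputs.requestedType === 'hnsw') return 'hnsw'

  const canEngage =
    inputs.providerRegistered && inputs.blobPath !== null && inputs.hasStableIdMapper

  // Explicit opt-in: fail loudly rather than silently fall back (assumed behavior).
  if (inputs.requestedType === 'diskann') {
    if (!canEngage) throw new Error('DiskANN requested but engagement conditions are not met')
    return 'diskann'
  }

  // No explicit choice: engage automatically only when every condition holds.
  return canEngage ? 'diskann' : 'hnsw'
}
```

Treating the cloud-storage case as blobPath === null keeps the fallback to HNSW implicit: nothing needs to know which adapter is in use, only whether a local path exists.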
Migrating an existing index
Existing HNSW-backed Brainy installs do not auto-migrate on upgrade. They keep working as-is. To convert:
```typescript
const result = await brain.migrateToDiskAnn({
  recallTarget: 0.95,    // require ≥95% recall vs old index before swapping
  paddingFactor: 1.2,    // search-time over-fetch for re-rank
  verifySampleSize: 100  // sample queries for the recall check
})
// → builds new index in parallel, verifies, swaps atomically

// Reversible:
await brain.migrateToHnsw()
```

Reversibility is a contract: production rollbacks are always available.
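The recall check behind recallTarget can be pictured as comparing result sets from the two indexes over a sample of queries. The sketch below shows the standard recall computation under that assumption. It is not the migration's internal code, and recallAgainstBaseline is a hypothetical name.

```typescript
// baseline[q] and candidate[q] hold the result IDs for sample query q
// from the old index and the new index respectively.
function recallAgainstBaseline(baseline: string[][], candidate: string[][]): number {
  let hits = 0
  let total = 0
  for (let q = 0; q < baseline.length; q++) {
    const truth = new Set(baseline[q])
    const found = new Set(candidate[q] ?? [])
    total += truth.size
    for (const id of found) {
      if (truth.has(id)) hits++
    }
  }
  // With no baseline results there is nothing to miss.
  return total === 0 ? 1 : hits / total
}

// A migration gate would compare this against the configured target.
function meetsRecallTarget(recall: number, recallTarget: number): boolean {
  return recall >= recallTarget
}
```

Averaging over all sampled results, rather than per query, keeps a single hard query from dominating the decision when the sample is small.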
Build-time tuning at billion scale
For datasets above ~100 M vectors, the build itself needs a file-backed adjacency (the in-RAM concurrent adjacency would consume ~64 GB of bookkeeping at 1 B nodes):
```typescript
config.index.diskann = {
  pqM: 16,                // PQ subspaces; dim must be divisible by m
  pqKsub: 256,            // centroids per subspace (8-bit codes, the standard choice)
  maxDegree: 64,          // Vamana R (out-degree per node)
  searchListSize: 100,    // Vamana L (build-time candidate set)
  alpha: 1.2,             // α-pruning density factor
  useMmapAdjacency: true, // file-backed build adjacency; REQUIRED at >100M
  mmapAdjacencyPath: '/data/scratch/diskann-build.adj'
}
```

The mmap adjacency uses atomic-u32 slots with sharded write locks, so concurrent reverse-edge merges from rayon threads contend at row granularity, not across the whole graph.
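The two PQ constraints noted in the config comments (dim divisible by pqM, and 8-bit codes for pqKsub) are easy to check before starting a long build. The sketch below is a hypothetical pre-flight helper, not part of the Cortex API. It assumes that "8-bit codes" means pqKsub may not exceed 256.

```typescript
// Only the PQ-related fields of the DiskANN config are needed here.
interface PqSettings {
  pqM: number    // number of PQ subspaces
  pqKsub: number // centroids per subspace
}

// Return a list of human-readable problems; an empty list means the settings fit.
function validatePqSettings(dim: number, settings: PqSettings): string[] {
  const problems: string[] = []
  if (!Number.isInteger(settings.pqM) || settings.pqM <= 0) {
    problems.push('pqM must be a positive integer')
  } else if (dim % settings.pqM !== 0) {
    problems.push(`dim ${dim} is not divisible by pqM ${settings.pqM}`)
  }
  if (!Number.isInteger(settings.pqKsub) || settings.pqKsub < 1 || settings.pqKsub > 256) {
    problems.push('pqKsub must be between 1 and 256 to fit an 8-bit code')
  }
  return problems
}

// Dimensions handled by each subspace once the settings are known to be valid.
function subspaceDim(dim: number, settings: PqSettings): number {
  return dim / settings.pqM
}
```

For the 384-dimensional embeddings used elsewhere in this document, pqM: 16 gives 24 dimensions per subspace, while a value such as 10 would be rejected because 384 is not a multiple of it.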
Operational Ceiling at 1 B
DiskANN solves the vector-search bottleneck. Five other Cortex/Brainy subsystems hit their own walls at billion scale and need work to deliver true end-to-end 1 B operation:
| Subsystem | Current ceiling | Next work |
|---|---|---|
| Metadata sparse-field index | ~100 M entities | Native Rust LSM column store (planned — see roadmap) |
| EntityIdMapper persistence | ~500 M entries (JSON I/O) | Native binary mmap'd uuid↔int map (planned) |
| Verb-graph LSM SSTable count | ~500 M edges | Tunable MemTable threshold + range-based level layout (planned) |
| FileSystemStorage sharding | ~2.5 M entities | Configurable shard depth (planned) |
| Search-result hydration | ~10 K results/query | Batch shard-grouped reads via the io:batchReadVectors provider |
All five fixes ship inside Cortex — no external databases, no competing engines. The 5 ms search latency target at 1 B vectors (PROJECTED — awaiting verification-report.md) holds; the full-stack roadmap is to bring end-to-end query latency down to match it.
The DiskANN release moves the headline scale ceiling from ~10 M to ~1 B. The subsequent releases close the gap between search latency and end-to-end latency at that scale.