Scaling & Resource Management

Cortex scales in two dimensions:

Horizontal density — many isolated Brainy databases sharing one VM (the multi-tenant workload).
Vertical scale — a single Brainy instance growing to billion-vector search on commodity hardware (the data-platform workload).

Both modes share the same plugin, the same APIs, and the same on-disk format. Cortex picks the right machinery at init based on what your storage adapter, dataset shape, and index type tell it.

Architecture Overview

A typical deployment runs 10–200 Brainy instances in one Node.js process on a GCE VM. Each instance has its own data directory on an SSD, its own HNSW vector index, metadata index, and graph index. Cortex manages what's shared and what's isolated:

Component	Scope	Memory
Embedding model (all-MiniLM-L6-v2)	Shared singleton	88 MB, loaded once
SIMD distance calculations	Shared	Zero per-instance cost
UnifiedCache (eviction engine)	Shared singleton	Dynamic — adapts to instance count
HNSW vector index	Per-instance	~700 bytes per entity
Metadata inverted index	Per-instance	~300 bytes per entity
Entity ID mapper	Per-instance	~56 bytes per entity
Graph adjacency (LSM-trees)	Per-instance	Mmapped — kernel manages pages
Storage adapter	Per-instance	Minimal (file handles)

Adaptive Resource Manager

Cortex includes an adaptive resource manager that observes system resources and adjusts memory budgets in real time. No configuration required — it detects your VM size, container limits (cgroups v1/v2), and active instance count automatically.

What it observes:

Total and free system memory (os.totalmem(), os.freemem())
Container memory limits (cgroups v2 memory.max, cgroups v1 memory.limit_in_bytes)
Process RSS and heap usage
Number of active Brainy instances
Per-instance entity counts
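As a rough illustration of how the observed host and container limits could be combined, here is a minimal sketch. It is not Cortex's implementation: `parseCgroupLimit` and `effectiveMemoryLimit` are hypothetical names, and the rule "take the smallest limit" is an assumption. The sentinel handling reflects standard cgroup behaviour: cgroups v2 writes the literal `max` when no limit is set, and cgroups v1 writes a very large byte count.

```typescript
// Parse the raw contents of a cgroup memory-limit file.
// Returns null when the file is missing, empty, or reports "no limit".
function parseCgroupLimit(raw: string | null): number | null {
  if (raw === null) return null;
  const text = raw.trim();
  if (text === "" || text === "max") return null; // cgroups v2 "unlimited"
  const bytes = Number(text);
  return Number.isFinite(bytes) && bytes > 0 ? bytes : null;
}

// Hypothetical helper: the effective budget is the smallest of the host
// total and any container limit. An "unlimited" cgroups v1 value is a huge
// number, so Math.min naturally falls back to the host total.
function effectiveMemoryLimit(
  hostTotalBytes: number,
  cgroupV2Raw: string | null,
  cgroupV1Raw: string | null
): number {
  const limits = [parseCgroupLimit(cgroupV2Raw), parseCgroupLimit(cgroupV1Raw)]
    .filter((v): v is number => v !== null);
  return Math.min(hostTotalBytes, ...limits);
}
```

In a Node.js process the inputs would typically come from `os.totalmem()` and from reading `/sys/fs/cgroup/memory.max` (v2) or `/sys/fs/cgroup/memory/memory.limit_in_bytes` (v1); those are the usual mount points, but they can differ between container runtimes.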

What it decides:

Cache budget — shrinks dynamically as more instances are loaded
Memory pressure level (normal / elevated / critical)
Eviction candidates — weighted by idle time × memory usage, so large idle instances are evicted before small active ones
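The eviction weighting described above can be sketched in a few lines. This is an illustration under assumptions, not the resource manager's code: `InstanceStats` and `rankEvictionCandidates` are hypothetical names, and the score is the plain product of idle time and memory usage with no extra tuning factors.

```typescript
// Hypothetical per-instance statistics the resource manager might track.
interface InstanceStats {
  tenantId: string;
  idleMs: number;      // time since last access
  memoryBytes: number; // estimated resident footprint
}

// Rank instances so the best eviction candidate comes first.
// Score = idle time x memory usage, as described in the text.
function rankEvictionCandidates(instances: InstanceStats[]): InstanceStats[] {
  const score = (i: InstanceStats): number => i.idleMs * i.memoryBytes;
  return instances.slice().sort((a, b) => score(b) - score(a));
}

const MB = 1024 * 1024;
const ranked = rankEvictionCandidates([
  { tenantId: "small-active", idleMs: 2000, memoryBytes: 4.5 * MB },
  { tenantId: "large-idle", idleMs: 600000, memoryBytes: 54 * MB },
  { tenantId: "small-idle", idleMs: 600000, memoryBytes: 4.5 * MB },
]);
// ranked order: large-idle, small-idle, small-active
```

The product form means a large instance that was touched a moment ago still scores low, so size alone never triggers eviction.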

Memory-Mapped Storage (Mmap)

Cortex's MmapFileSystemStorage extends Brainy's filesystem storage with zero-copy binary I/O. The Rust native layer memory-maps data files using the memmap2 crate, letting the Linux kernel manage which pages stay in RAM.

What mmap provides:

HNSW vector index: The entire HNSW graph (vectors + connections) is stored as a single binary .hnsw file, memory-mapped for zero-copy reads. Search traverses the graph directly on mmap'd pages, with no JSON parsing, no gzip decompression, and no heap allocation for vector data.
Graph adjacency SSTables: 4 LSM-trees for relationship data are memory-mapped. Read-only pages are automatically paged out under memory pressure.
Dual-mode search: After flush(), search reads vectors from mmap pages (zero-copy). During active mutations, the existing in-memory engine handles search. This gives maximum density between mutations while preserving mutation speed.

Why this matters for density: On a 16 GB VM with a 200 GB SSD, you can effectively manage 200 GB of Brainy data with only the hot working set in RAM. The SSD acts as an extension of memory, managed by the Linux kernel with no application-level complexity. An idle tenant's HNSW data is paged out by the kernel automatically, so no explicit eviction is needed.

Instant Suspend & Resume

Because each Brainy instance has its own isolated data directory, Cortex can evict instances from memory and reload them later without data loss:

Eviction: brain.close() flushes any pending writes to SSD, then frees all in-memory structures
Data persists: The data directory on SSD remains intact
Reload: The next request for the same workspace/tenant re-initializes from SSD. The binary mmap load is a single syscall, typically under 100 ms for a 500-entity tenant (PROJECTED design target; no reload-time benchmark in CI yet; measured numbers ship in docs/verification-report.md)

This is fundamentally different from traditional databases that require replication or WAL replay. Each Brainy instance is a single-writer local store, so suspend/resume is a munmap/mmap pair away.
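The suspend/resume lifecycle above can be pictured as a small registry that opens instances lazily and closes them on eviction. This is a sketch under assumptions: `TenantRegistry` and `BrainLike` are hypothetical, and the only method taken from the text is `close()`. The caller supplies the factory that re-initializes an instance from its data directory.

```typescript
// Hypothetical minimal surface of an instance; only close() is named in the text.
interface BrainLike {
  close(): Promise<void>;
}

class TenantRegistry<B extends BrainLike> {
  // Promises are stored so concurrent requests for one tenant share a single
  // open() call, which preserves the single-writer property.
  private readonly resident = new Map<string, Promise<B>>();
  private readonly open: (tenantId: string) => Promise<B>;

  constructor(open: (tenantId: string) => Promise<B>) {
    this.open = open;
  }

  // Return the resident instance, or reload it from its data directory.
  get(tenantId: string): Promise<B> {
    let pending = this.resident.get(tenantId);
    if (!pending) {
      pending = this.open(tenantId);
      this.resident.set(tenantId, pending);
    }
    return pending;
  }

  // Evict: drop the in-memory reference, then flush and free via close().
  async suspend(tenantId: string): Promise<boolean> {
    const pending = this.resident.get(tenantId);
    if (!pending) return false;
    this.resident.delete(tenantId);
    await (await pending).close();
    return true;
  }

  get residentCount(): number {
    return this.resident.size;
  }
}
```

Because the data directory survives `close()`, a later `get()` for the same tenant simply runs the factory again.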

Capacity Planning

Per-Instance Memory

PROJECTED: design targets, not measured. The table below is extrapolated from per-component memory analysis (HNSW graph + metadata inverted index + entity ID mapper sizings under their target capacity). No multi-tenant RSS benchmark backs the numbers today; measured per-instance memory at each tier ships in docs/verification-report.md as part of Piece 9 of the Cortex 3.0 release. Per CLAUDE.md, perf claims without a MEASURED citation must carry a PROJECTED label until verified.

Entity count	Estimated memory	Typical use case
100	~4 MB	Light workspace or new tenant
500	~4.5 MB	Venue after onboarding (time slots, customers, bookings)
2,000	~6 MB	Medium workspace with documents and notes
5,000	~9 MB	Large workspace with extensive content
10,000	~14 MB	Power user with months of accumulated data
50,000	~54 MB	Heavy workspace (years of data, many entity types)
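The table is consistent with a simple linear model: the per-entity costs from the Architecture Overview (about 700 + 300 + 56 bytes) plus a fixed per-instance baseline. The sketch below reproduces it. The 4 MB baseline is an assumption inferred from the table rather than a documented constant, and `estimateInstanceMemoryMb` is a hypothetical helper name.

```typescript
// Per-entity costs from the component table: HNSW + metadata index + ID mapper.
const BYTES_PER_ENTITY = 700 + 300 + 56;

// Assumption: fixed per-instance baseline inferred from the table above.
const ASSUMED_BASELINE_MB = 4;

// Estimate resident memory for one instance, in MB (1 MB = 1024 * 1024 bytes).
function estimateInstanceMemoryMb(entityCount: number): number {
  return ASSUMED_BASELINE_MB + (entityCount * BYTES_PER_ENTITY) / (1024 * 1024);
}

const venueMb = estimateInstanceMemoryMb(500);    // about 4.5 MB
const heavyMb = estimateInstanceMemoryMb(50000);  // about 54 MB
```

Like the table itself, these are projections from sizing analysis, not measured RSS.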

VM Sizing Guide

Fixed overhead per process: ~240 MB (88 MB embedding model + Node.js baseline)

PROJECTED — design targets, not measured. The tables below are extrapolations from the per-instance memory model above plus a fixed per-process overhead estimate. No multi-tenant load test on a real GCE / Hetzner VM backs these tenant-density numbers today; measured tenant capacity at each VM tier ships in docs/verification-report.md.

Venue deployments (per-tenant, ~500 entities average):

VM	RAM	Active tenants	Cache	Notes
e2-medium	4 GB	30–40	500 MB	Minimum viable: tight under load
e2-standard-2	8 GB	80–100	1.5 GB	Recommended: comfortable headroom
e2-standard-4	16 GB	200+	4 GB	High-traffic: handles spikes easily

Workshop deployments (per-user, mixed 100–50K entities):

VM	RAM	Active (small)	Active (large)	Cache	Notes
e2-standard-2	8 GB	200+	20–30	1.5 GB	Good for early-stage
e2-standard-4	16 GB	500+	60–80	4 GB	Balanced density and speed
n2-standard-8	32 GB	1,000+	150–200	8 GB	High density for growth

Zero-Config Scaling

Cortex adapts automatically when the VM is resized:

Upgrade from 8 GB to 16 GB: Cache budget doubles, more instances stay resident, fewer cold starts
Downgrade or container limit: Cache shrinks, instances are evicted more aggressively, data stays on SSD
No config changes required — Cortex reads system memory and cgroup limits at startup

Heterogeneous Density

Unlike fixed-slot allocation, Cortex tracks per-instance memory usage. A VM can simultaneously run:

100 small tenants (500 entities each, ~4.5 MB) = 450 MB
2 large workspaces (50K entities each, ~54 MB) = 108 MB
Dynamic cache filling the remainder

The resource manager balances between them. When memory pressure rises, the large idle workspace is evicted first (weighted by idle time × memory usage) — not the small active tenant that was accessed 2 seconds ago.
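The arithmetic behind the mixed example can be written out as a small budget calculation. This is only a sketch: `TenantGroup` and `residentFootprintMb` are hypothetical names, the ~240 MB fixed overhead is the figure from the VM Sizing Guide, and the 8 GB VM is an arbitrary choice for illustration.

```typescript
// A group of similarly sized instances.
interface TenantGroup {
  count: number;
  perInstanceMb: number;
}

// Sum per-instance footprints on top of the fixed per-process overhead.
function residentFootprintMb(groups: TenantGroup[], fixedOverheadMb: number): number {
  return groups.reduce((sum, g) => sum + g.count * g.perInstanceMb, fixedOverheadMb);
}

const residentMb = residentFootprintMb(
  [
    { count: 100, perInstanceMb: 4.5 }, // small tenants: 450 MB
    { count: 2, perInstanceMb: 54 },    // large workspaces: 108 MB
  ],
  240                                   // fixed overhead per process
);
// residentMb = 798

// On an 8 GB VM, the remainder is available for the dynamic cache and headroom.
const remainderOn8GbMb = 8 * 1024 - residentMb; // 7394
```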

Billion-Scale Search — DiskANN

For single-instance workloads beyond ~10 million vectors, HNSW's memory cost compounds quickly: at 1 B vectors × 384 dimensions, the float32 vectors alone are ~1.5 TB of RAM, and the HNSW graph metadata adds another ~2 TB. No commodity machine has that.
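The ~1.5 TB figure is straightforward arithmetic: vectors × dimensions × 4 bytes per float32. A quick check, with `rawVectorBytes` as a hypothetical helper name:

```typescript
// Raw storage for uncompressed vectors; float32 uses 4 bytes per component.
function rawVectorBytes(
  vectorCount: number,
  dimensions: number,
  bytesPerComponent: number = 4
): number {
  return vectorCount * dimensions * bytesPerComponent;
}

// 1 B vectors x 384 dimensions x 4 bytes = 1.536e12 bytes, about 1.5 TB.
const rawTb = rawVectorBytes(1e9, 384) / 1e12;
```

This counts only the vectors themselves; the HNSW graph structure comes on top.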

Cortex ships a 100% pure-Rust DiskANN engine (ADR-002) that targets ~5 ms search latency at billion scale with ~20 GB RAM (PROJECTED — design target; awaiting verification-report.md). The architecture is the Vamana α-pruned graph (Subramanya et al., NeurIPS 2019) plus Product Quantization, and the on-disk file is a single mmap-mappable contiguous layout. None of this requires a separate service or external dependency — it's the same @soulcraft/cortex plugin.

What it actually delivers, by scale

PROJECTED: design targets, not measured. The table below is extrapolated from algorithm math (Vamana traversal cost + PQ ADC table-lookup cost + per-vector storage). The largest DiskANN dataset exercised in CI today is 10 K synthetic dim=64 random vectors (#[ignore]-gated; recall threshold ≥ 0.95). Measured RAM + latency at 100 M and 1 B on real embedding corpora (E5-large-v2 or BGE-large) on cgroup-limited 32 GB hardware ship in docs/verification-report.md as part of Piece 9 of the Cortex 3.0 release.

Vectors	RAM with DiskANN	RAM with HNSW	Search latency (warm cache)
1 M	0.5–2 GB	0.5–2 GB	<1 ms
10 M	1–5 GB	8–20 GB	1–3 ms
100 M	5–20 GB	80–200 GB (impractical)	2–5 ms
1 B	20–70 GB	1.5+ TB (single-machine impossible)	5–10 ms

These numbers are search latency for the index itself. End-to-end query latency at 1 B also includes filesystem hydration of the returned entities; see the Operational Ceiling at 1 B section below for the honest full-stack story and the roadmap to close the gaps.

How DiskANN engages

Cortex registers a 'diskann' provider; Brainy's createIndex() consults it at init:

Explicit opt-in: config.index.type: 'diskann' forces DiskANN. The engagement conditions below must still hold; if they don't, init throws.
Auto-engagement when all of:
- The cortex DiskANN provider is registered (you've loaded the plugin).
- The storage adapter exposes a local filesystem path (getBinaryBlobPath('_diskann/main')). Cloud-storage adapters return null here and stay on HNSW.
- The metadata index has a stable idMapper (Cortex 2.4.0's stable EntityIdMapper).
Explicit opt-out: config.index.type: 'hnsw' keeps the historical in-memory index.

import { BrainyData } from '@soulcraft/brainy'
import { register as registerCortex } from '@soulcraft/cortex'

const brain = new BrainyData({
  storage: { type: 'filesystem', rootDirectory: '/data/idx' }
})
await registerCortex(brain)
await brain.init()
// → [brainy] DiskANN engaged (path=/data/idx/_diskann/main.bin, dim=384)

// queryVector: a number[] matching the index dimensionality (384 here)
const hits = await brain.search(queryVector, 10)

All Brainy APIs — add, search, relate, searchSimilarVerbs, find — work unchanged. DiskANN is an HNSW-shaped drop-in.
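The engagement rules listed earlier amount to a small decision function. The sketch below is one reading of those rules, not Brainy's actual `createIndex()` logic: `chooseIndexType` and `EngagementInputs` are hypothetical, and treating an unmet explicit opt-in as an error is an interpretation of the opt-in rule.

```typescript
type IndexType = "hnsw" | "diskann";

// Hypothetical inputs mirroring the three auto-engagement conditions.
interface EngagementInputs {
  requestedType?: IndexType;    // config.index.type, if set
  providerRegistered: boolean;  // Cortex DiskANN provider loaded
  localBlobPath: string | null; // what getBinaryBlobPath('_diskann/main') returned
  hasStableIdMapper: boolean;   // metadata index exposes a stable idMapper
}

function chooseIndexType(inputs: EngagementInputs): IndexType {
  const eligible =
    inputs.providerRegistered &&
    inputs.localBlobPath !== null &&
    inputs.hasStableIdMapper;

  // Explicit opt-out always wins.
  if (inputs.requestedType === "hnsw") return "hnsw";

  // Explicit opt-in fails loudly instead of silently falling back.
  if (inputs.requestedType === "diskann") {
    if (!eligible) {
      throw new Error("DiskANN requested but engagement conditions are not met");
    }
    return "diskann";
  }

  // No explicit choice: auto-engage only when every condition holds.
  return eligible ? "diskann" : "hnsw";
}
```

Under this reading, a cloud-storage adapter (null path) with no explicit type stays on HNSW, while the same adapter with an explicit 'diskann' request fails at init.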

Migrating an existing index

Existing HNSW-backed Brainy installs do not auto-migrate on upgrade. They keep working as-is. To convert:

const result = await brain.migrateToDiskAnn({
  recallTarget: 0.95,    // require ≥95% recall vs old index before swapping
  paddingFactor: 1.2,    // search-time over-fetch for re-rank
  verifySampleSize: 100  // sample queries for the recall check
})
// → builds new index in parallel, verifies, swaps atomically

// Reversible:
await brain.migrateToHnsw()

Reversibility is a contract — production rollbacks are always available.
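The recall check that gates the swap can be pictured as follows: run the same sample queries against the old and new index, then measure what fraction of the old index's results the new index also returns. This is a generic recall computation, not the code inside `migrateToDiskAnn`; `recallAgainstBaseline` is a hypothetical name.

```typescript
// baseline[q] and candidate[q] hold the result IDs for sample query q
// from the old index and the new index respectively.
function recallAgainstBaseline(baseline: string[][], candidate: string[][]): number {
  if (baseline.length !== candidate.length) {
    throw new Error("query count mismatch");
  }
  let expected = 0;
  let found = 0;
  baseline.forEach((expectedIds, q) => {
    const got = new Set(candidate[q]);
    for (const id of expectedIds) {
      expected += 1;
      if (got.has(id)) found += 1;
    }
  });
  // With nothing to compare, treat recall as perfect.
  return expected === 0 ? 1 : found / expected;
}

// Two sample queries; the new index misses one of four expected IDs.
const sampleRecall = recallAgainstBaseline(
  [["a", "b"], ["c", "d"]],
  [["a", "b"], ["c", "x"]]
);
// sampleRecall = 0.75, which would fail a recallTarget of 0.95
```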

Build-time tuning at billion scale

For datasets above ~100 M vectors, the build itself needs a file-backed adjacency (the in-RAM concurrent adjacency would consume ~64 GB of bookkeeping at 1 B nodes):

config.index.diskann = {
  pqM: 16,                  // PQ subspaces; vector dim must be divisible by pqM
  pqKsub: 256,              // centroids per subspace (8-bit codes — standard)
  maxDegree: 64,            // Vamana R (out-degree per node)
  searchListSize: 100,      // Vamana L (build-time candidate set)
  alpha: 1.2,               // α-pruning density factor
  useMmapAdjacency: true,   // file-backed build adjacency — REQUIRED at >100M
  mmapAdjacencyPath: '/data/scratch/diskann-build.adj'
}

The mmap adjacency uses atomic-u32 slots with sharded write locks, so concurrent reverse-edge merges from rayon threads contend at row granularity — not whole-graph.
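To see how the PQ settings in the configuration above relate to memory use, the sketch below computes the compressed code size per vector and checks the divisibility constraint. `pqBytesPerVector` is a hypothetical helper. It assumes, as in the original DiskANN design, that the compressed codes are what stays in RAM; whether Cortex keeps exactly this set resident is not confirmed by the text.

```typescript
// Each vector is encoded as pqM codes, each selecting one of pqKsub centroids.
function pqBytesPerVector(dim: number, pqM: number, pqKsub: number): number {
  if (dim % pqM !== 0) {
    throw new Error(`dim ${dim} is not divisible by pqM ${pqM}`);
  }
  // Smallest number of bits that can address pqKsub centroids.
  let bitsPerCode = 0;
  while ((1 << bitsPerCode) < pqKsub) bitsPerCode += 1;
  return Math.ceil((pqM * bitsPerCode) / 8);
}

// 16 subspaces x 8-bit codes = 16 bytes per vector,
// versus 384 x 4 = 1536 bytes for the raw float32 vector.
const codeBytes = pqBytesPerVector(384, 16, 256);

// At 1 B vectors the codes alone occupy about 16 GB.
const codeGbAtOneBillion = (codeBytes * 1e9) / 1e9;
```

That 16 GB is in the same range as the ~20 GB RAM design target quoted earlier, which is itself a projection.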

Operational Ceiling at 1 B

DiskANN solves the vector-search bottleneck. Five other Cortex/Brainy subsystems hit their own walls at billion scale and need work to deliver true end-to-end 1 B operation:

Subsystem	Current ceiling	Next work
Metadata sparse-field index	~100 M entities	Native Rust LSM column store (planned — see roadmap)
EntityIdMapper persistence	~500 M entries (JSON I/O)	Native binary mmap'd `uuid↔int` map (planned)
Verb-graph LSM SSTable count	~500 M edges	Tunable MemTable threshold + range-based level layout (planned)
FileSystemStorage sharding	~2.5 M entities	Configurable shard depth (planned)
Search-result hydration	~10 K results/query	Batch shard-grouped reads via the `io:batchReadVectors` provider

All five fixes ship inside Cortex, with no external databases and no competing engines. The ~5 ms search latency target at 1 B vectors (PROJECTED; awaiting verification-report.md) holds; the full-stack roadmap is to bring end-to-end query latency down to match it.
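For the hydration row in the table, "batch shard-grouped reads" can be pictured as bucketing result IDs by shard before reading, so each shard directory is visited once per query. The sketch below shows only the grouping step. The shard key (the first characters of the entity ID) and the `groupByShard` helper are assumptions for illustration; the actual FileSystemStorage layout and the `io:batchReadVectors` contract are not described here.

```typescript
// Group entity IDs by an assumed shard key: the first `shardDepth`
// characters of the ID, lowercased.
function groupByShard(ids: string[], shardDepth: number = 2): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const id of ids) {
    const shard = id.slice(0, shardDepth).toLowerCase();
    const bucket = groups.get(shard);
    if (bucket) {
      bucket.push(id);
    } else {
      groups.set(shard, [id]);
    }
  }
  return groups;
}

// Three result IDs collapse into two shard reads instead of three file reads.
const shardGroups = groupByShard(["ab12-0001", "ab34-0002", "cd56-0003"]);
```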

The DiskANN release moves the headline scale ceiling from ~10 M to ~1 B. The subsequent releases close the gap between search latency and end-to-end latency at that scale.