Scaling
One engine. Three tiers. Same files, same API. Run it on a laptop for development, a single commodity server for production, or across many servers for global scale — Brainy + Cor is the same code at every step.
There is no managed-service tier and no required coordinator. The engine is local Rust + local files, and "going bigger" means adding hardware, not adopting a different product. This page covers what fits at each tier and how to compose multiple instances when one machine isn't enough.
At a glance
| Tier | Hardware | Capacity |
|---|---|---|
| Single laptop | 16–64 GB RAM, 1–4 TB NVMe | 1 billion+ entities |
| Single commodity server | Ryzen 9 / 64 GB DDR5 / 4 TB NVMe Gen 4 | 10 billion+ entities |
| Multiple machines | Independent servers, hash-sharded | Effectively unbounded |
Moving up a tier on a single machine is rsync plus a restart; scaling
horizontally means adding a routing layer. The Brainy API doesn't change: your
code looks the same whether you're running one instance or 256.
Single machine
The default unit of deployment is one Brainy instance (the data layer — JSON files on disk) plus one Cor accelerator (the native Rust engine that runs SIMD distance, mmap-backed indexes, and the DiskANN search graph). Together they handle workloads that would normally require a dedicated database cluster.
Everything stays local: data, vectors, metadata, query engine. No network hops at query time. No third party that can take your database down. Your only operational concern is the machine itself.
Multi-tenant density on one machine
A typical deployment runs 10–200 Brainy instances inside one Node.js process on a single VM. Each instance has its own data directory on SSD plus its own indexes, and Cor manages what's shared and what's isolated:
| Component | Scope | Memory |
|---|---|---|
| Embedding model (all-MiniLM-L6-v2) | Shared singleton | 88 MB, loaded once |
| SIMD distance calculations | Shared | Zero per-instance cost |
| UnifiedCache (eviction engine) | Shared singleton | Dynamic — adapts to instance count |
| HNSW vector index | Per-instance | ~700 bytes per entity |
| Metadata inverted index | Per-instance | ~300 bytes per entity |
| Entity ID mapper | Per-instance | ~56 bytes per entity |
| Graph adjacency (LSM-trees) | Per-instance | Mmapped — kernel manages pages |
| Storage adapter | Per-instance | Minimal (file handles) |
Adaptive resource manager
Cor observes system resources and adjusts memory budgets in real time. No configuration required — it detects VM size, container limits (cgroups v1/v2), and active instance count automatically.
What it observes: total and free system memory, container memory limits, process RSS and heap usage, number of active Brainy instances, per-instance entity counts.
What it decides: cache budget (shrinks dynamically as more instances are loaded), memory pressure level (normal / elevated / critical), eviction candidates (weighted by idle time × memory usage, so large idle instances are evicted before small active ones).
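The eviction weighting described above (idle time × memory usage) can be sketched as follows. This is a minimal illustration, not Cor's implementation: the `InstanceStats` shape, the function names, and the greedy "stop once enough memory is freed" loop are all assumptions made for the example.

```typescript
// Hypothetical per-instance stats; field names are illustrative, not Cor's API.
interface InstanceStats {
  id: string;
  idleMs: number;      // time since the instance was last accessed
  memoryBytes: number; // memory attributed to the instance
}

// Score = idle time × memory usage, as described in the text.
function evictionScore(s: InstanceStats): number {
  return s.idleMs * s.memoryBytes;
}

// Return instance ids to evict, highest score first, until at least
// `bytesToFree` would be released (or no instances remain).
function pickEvictionCandidates(
  instances: InstanceStats[],
  bytesToFree: number,
): string[] {
  const ranked = [...instances].sort(
    (a, b) => evictionScore(b) - evictionScore(a),
  );
  const victims: string[] = [];
  let freed = 0;
  for (const inst of ranked) {
    if (freed >= bytesToFree) break;
    victims.push(inst.id);
    freed += inst.memoryBytes;
  }
  return victims;
}

const MB = 1_000_000;
const candidates = pickEvictionCandidates(
  [
    { id: "small-active", idleMs: 2_000, memoryBytes: 4.5 * MB },
    { id: "large-idle", idleMs: 600_000, memoryBytes: 54 * MB },
    { id: "small-idle", idleMs: 600_000, memoryBytes: 4.5 * MB },
  ],
  50 * MB,
);
// candidates is ["large-idle"]: the large idle instance alone frees enough.
```

Because the score is a product, a small tenant touched two seconds ago ranks far below a large workspace that has been idle for ten minutes, which matches the behaviour the text describes.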
Memory-mapped storage
Cor's MmapFileSystemStorage extends Brainy's filesystem storage with
zero-copy binary I/O. The Rust native layer memory-maps data files via
memmap2, letting the Linux kernel manage which pages stay in RAM.
What mmap provides:
- Vector index: the entire HNSW graph (vectors + connections) is stored as a single binary .hnsw file, memory-mapped for zero-copy reads. Search traverses the graph directly on mmap'd pages: no JSON parsing, no gzip decompression, no heap allocation. The kernel manages which pages stay in RAM.
- Graph adjacency SSTables: four LSM-trees for relationship data are memory-mapped. Read-only pages are automatically paged out under memory pressure.
- Dual-mode search: after flush(), search reads vectors from mmap pages (zero-copy). During active mutations, the existing in-memory engine handles search. Maximum density between mutations, mutation speed preserved.
On a 16 GB VM with a 200 GB SSD, you can effectively manage 200 GB of Brainy data with only the hot working set in RAM. The SSD acts as an extension of memory, managed by the Linux kernel with no application-level complexity. An idle tenant's data is automatically paged out by the kernel, so no explicit eviction is needed.
Instant suspend & resume
Because each Brainy instance has its own isolated data directory, Cor can evict instances from memory and reload them later without data loss:
- Eviction: brain.close() flushes any pending writes to SSD, then frees all in-memory structures.
- Data persists: the data directory on SSD remains intact.
- Reload: next request for the same workspace/tenant re-initializes from
SSD — binary mmap load is a single syscall, typically under 100 ms for a
500-entity tenant (PROJECTED — design target; reload-time benchmark
ships in
docs/verification-report.md).
This is fundamentally different from traditional databases that need
replication or WAL replay. Each Brainy instance is a single-writer local
store — suspend/resume is an munmap/mmap away.
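The suspend/resume cycle above can be sketched as a small tenant registry. Only `init()` and `close()` come from the text; the `Suspendable` interface, the `TenantRegistry` class, and the `open` factory are hypothetical names introduced for illustration.

```typescript
// Hypothetical minimal surface of a brain instance. Only init() and close()
// are taken from the text; everything else here is illustrative.
interface Suspendable {
  init(): Promise<void>;
  close(): Promise<void>; // flushes pending writes, frees in-memory structures
}

class TenantRegistry<B extends Suspendable> {
  private resident = new Map<string, B>();
  private open: (tenantId: string) => B;

  // `open` builds an un-initialized instance for a tenant's data directory.
  constructor(open: (tenantId: string) => B) {
    this.open = open;
  }

  // Return the resident instance, re-initializing from disk if it was evicted.
  async get(tenantId: string): Promise<B> {
    let brain = this.resident.get(tenantId);
    if (!brain) {
      brain = this.open(tenantId);
      await brain.init();
      this.resident.set(tenantId, brain);
    }
    return brain;
  }

  // Evict: close() persists state, then the in-memory handle is dropped.
  async evict(tenantId: string): Promise<boolean> {
    const brain = this.resident.get(tenantId);
    if (!brain) return false;
    await brain.close();
    this.resident.delete(tenantId);
    return true;
  }

  isResident(tenantId: string): boolean {
    return this.resident.has(tenantId);
  }
}
```

A production version would also need to deduplicate concurrent `get()` calls for the same tenant (for example by caching the in-flight promise), which this sketch leaves out for brevity.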
Capacity planning
Per-instance memory
PROJECTED — design targets, not measured. Extrapolated from per-component memory analysis (HNSW graph + metadata inverted index + entity ID mapper sizings under their target capacity). Measured per-instance memory at each tier ships in
docs/verification-report.md as part of Piece 9 of the Cor 3.0 release.
| Entity count | Estimated memory | Typical use case |
|---|---|---|
| 100 | ~4 MB | Light workspace or new tenant |
| 500 | ~4.5 MB | Tenant after onboarding (time slots, customers, bookings) |
| 2,000 | ~6 MB | Medium workspace with documents and notes |
| 5,000 | ~9 MB | Large workspace with extensive content |
| 10,000 | ~14 MB | Power user with months of accumulated data |
| 50,000 | ~54 MB | Heavy workspace (years of data, many entity types) |
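The projected figures in the table are consistent with a simple linear model: roughly 4 MB fixed per instance plus roughly 1 KB per entity (close to the 700 + 300 + 56 = 1,056 bytes of per-entity index cost listed earlier). The constants below are inferred from the table for illustration; they are assumptions, not published parameters.

```typescript
// Assumed fit to the projected table: ~4 MB fixed per instance plus ~1 KB
// per entity. These constants are inferred, not documented.
const BASE_BYTES = 4_000_000;
const BYTES_PER_ENTITY = 1_000;

// Estimated per-instance memory in (decimal) megabytes.
function estimateInstanceMB(entityCount: number): number {
  return (BASE_BYTES + entityCount * BYTES_PER_ENTITY) / 1_000_000;
}

// Reproduce the table rows as [entityCount, estimatedMB] pairs.
const projected = [100, 500, 2_000, 5_000, 10_000, 50_000].map(
  (n) => [n, estimateInstanceMB(n)] as const,
);
```

Like the table itself, this is a planning aid rather than a measurement; real per-instance memory depends on metadata width and graph degree.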
VM sizing
Fixed overhead per process: ~240 MB (88 MB embedding model + Node.js baseline).
PROJECTED — design targets, not measured. Extrapolated from the per-instance memory model above plus a fixed per-process overhead estimate. Measured tenant capacity at each VM tier ships in
docs/verification-report.md.
Per-tenant workloads (~500 entities average):
| VM | RAM | Active tenants | Cache | Notes |
|---|---|---|---|---|
| e2-medium | 4 GB | 30–40 | 500 MB | Minimum viable — tight under load |
| e2-standard-4 | 8 GB | 80–100 | 1.5 GB | Recommended — comfortable headroom |
| e2-standard-8 | 16 GB | 200+ | 4 GB | High-traffic — handles spikes easily |
Per-user workloads (mixed 100–50K entities):
| VM | RAM | Active (small) | Active (large) | Cache | Notes |
|---|---|---|---|---|---|
| e2-standard-4 | 8 GB | 200+ | 20–30 | 1.5 GB | Good for early-stage |
| e2-standard-8 | 16 GB | 500+ | 60–80 | 4 GB | Balanced density and speed |
| n2-standard-8 | 32 GB | 1,000+ | 150–200 | 8 GB | High density for growth |
Zero-config adaptation
Cor adapts automatically when the VM is resized:
- Upgrade from 8 GB to 16 GB: cache budget doubles, more instances stay resident, fewer cold starts.
- Downgrade or container limit: cache shrinks, instances are evicted more aggressively, data stays on SSD.
- No config changes required — Cor reads system memory and cgroup limits at startup.
A VM can simultaneously run heterogeneous workloads: 100 small tenants (500 entities each, ~4.5 MB) totaling 450 MB plus two large workspaces (50K entities each, ~54 MB) totaling 108 MB, with dynamic cache filling the remainder. The resource manager balances between them — when memory pressure rises, the large idle workspace is evicted first (weighted by idle time × memory usage), not the small active tenant accessed two seconds ago.
Billion-scale single-machine search (DiskANN)
For single-instance workloads beyond ~10 million vectors, HNSW's memory cost compounds quickly: at 1 B vectors × 384 dimensions, the float32 vectors alone are ~1.5 TB of RAM, and the HNSW graph metadata adds another ~2 TB. No commodity machine has that.
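The ~1.5 TB figure follows directly from the vector shape. A quick check of the arithmetic, using decimal units (the helper name is illustrative):

```typescript
const BYTES_PER_FLOAT32 = 4;

// Bytes needed to hold `count` uncompressed float32 vectors of `dims` dimensions.
function rawVectorBytes(count: number, dims: number): number {
  return count * dims * BYTES_PER_FLOAT32;
}

const TB = 1e12; // decimal terabyte
const GB = 1e9;  // decimal gigabyte

// 1 B vectors × 384 dims × 4 bytes = 1.536 TB before any graph overhead.
const oneBillion = rawVectorBytes(1_000_000_000, 384) / TB;

// The same shape at 10 M vectors is about 15.36 GB.
const tenMillion = rawVectorBytes(10_000_000, 384) / GB;
```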
Cor ships a 100% pure-Rust DiskANN engine
(ADR-002) that targets ~5 ms search latency at
billion scale with ~20 GB RAM (PROJECTED — design target; awaiting
verification-report.md). The architecture is the Vamana α-pruned graph
(Subramanya et al., NeurIPS 2019) plus Product Quantization, and the on-disk
file is a single mmap-mappable contiguous layout. None of this requires a
separate service or external dependency — it's the same @soulcraft/cor
plugin.
PROJECTED — design targets, not measured. Extrapolated from algorithm math (Vamana traversal cost + PQ ADC table-lookup cost + per-vector storage). The largest DiskANN dataset exercised in CI today is 10k synthetic dim=64 random vectors. Measured RAM + latency at 100M and 1B on real embedding corpora on cgroup-limited 32 GB hardware ship in
docs/verification-report.md as part of Piece 9 of the Cor 3.0 release.
| Vectors | RAM with DiskANN | RAM with HNSW | Search latency (warm cache) |
|---|---|---|---|
| 1 M | 0.5–2 GB | 0.5–2 GB | <1 ms |
| 10 M | 1–5 GB | 8–20 GB | 1–3 ms |
| 100 M | 5–20 GB | 80–200 GB (impractical) | 2–5 ms |
| 1 B | 20–70 GB | 1.5+ TB (single-machine impossible) | 5–10 ms |
| 10 B | 200–700 GB total (multi-shard on one box) | — | 5–15 ms |
A single DiskANN shard tops out near u32::MAX slots (~4 B vectors) by
construction — for 10 B+ on one machine the data is split across multiple
DiskANN shards on the same box. Same engine, same files, same API; the shard
boundary is internal to Cor.
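As a rough illustration of the shard boundary, the minimum number of DiskANN shards for a given vector count follows from the u32 slot limit. This sketch assumes every slot in a shard is usable; Cor's actual shard sizing is internal and may differ.

```typescript
// Slots addressable by one shard, per the u32::MAX limit stated in the text.
const U32_MAX = 4_294_967_295;

// Minimum shard count, assuming shards are filled to the slot limit.
function diskAnnShardsNeeded(vectorCount: number): number {
  return Math.max(1, Math.ceil(vectorCount / U32_MAX));
}

const shardsForTenBillion = diskAnnShardsNeeded(10_000_000_000); // 3
const shardsForOneBillion = diskAnnShardsNeeded(1_000_000_000);  // 1
```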
These numbers are search latency for the index itself. End-to-end query latency includes filesystem hydration of the returned entities — see the roadmap section below for the remaining full-stack work.
DiskANN engages automatically when conditions are met (registered provider,
local filesystem path available, stable EntityIdMapper) or explicitly via
config.index.type: 'diskann'. All Brainy APIs — add, search, relate,
searchSimilarVerbs, find — work unchanged. DiskANN is an HNSW-shaped
drop-in, with a reversible migration:
```typescript
import { BrainyData } from '@soulcraft/brainy'
import { register as registerCor } from '@soulcraft/cor'

// Filesystem storage gives Cor a local path for the DiskANN index file.
const brain = new BrainyData({
  storage: { type: 'filesystem', rootDirectory: '/data/idx' }
})
await registerCor(brain)
await brain.init()
// → [brainy] DiskANN engaged (path=/data/idx/_diskann/main.bin, dim=384)

// queryVector is an embedding supplied by the caller.
const hits = await brain.search(queryVector, 10)
```

Migration is reversible, so production rollbacks are always available:

```typescript
await brain.migrateToDiskAnn({
  recallTarget: 0.95, // require ≥95% recall vs old index before swapping
  paddingFactor: 1.2,
  verifySampleSize: 100
})
// And back if needed:
await brain.migrateToHnsw()
```

Multiple machines (horizontal scaling)
Once a workload exceeds what fits on one box — either entity count past ~10 billion or query traffic past one machine's CPU — Brainy + Cor scale horizontally by running multiple independent instances. There is no built-in coordinator: each instance is self-contained, and the patterns below describe how to compose them.
All five patterns work today without external dependencies. The Datomic-style
multi-region merge (#4) becomes cleanest with the immutable Db API shipping
in Brainy 8.0 / Cor 3.0.
Roadmap — managed distributed cluster (Cor 3.1, fast-follow). The 3.0 release is deliberately a single-node engine that scales from a laptop to the most powerful single server, plus the application-level horizontal patterns below (you wire the shard routing). A built-in distributed cluster — a coordinator that handles shard placement, cross-node consistency, rebalancing, and a single logical query surface across nodes — is the headline of the 3.1 fast-follow, immediately after the 3.0 GA. Until then, "the most advanced cluster of servers" is served by (a) running the same binary on the largest single instance your cloud offers and (b) the self-managed sharding patterns in this section. 3.0's single-node numbers (
verification-report.md) are the per-node performance the 3.1 cluster composes.
1. Shard by entity
Hash the actor or tenant id and route to one of N brain instances:
```text
shard = hash(actorId) mod N
```

Every node computes the same shard map, so there is no coordinator state. Each user's entities, edges, and history live on one shard. Cross-shard reads fan out in parallel; each individual shard returns sub-millisecond.
Trade-off: Hot accounts (high-follower users in a social workload) create a hot shard. Combine with the edge-cache pattern (#5) when that shows up.
When to use: Multi-tenant SaaS, social workloads, anything with a natural partition key.
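A minimal sketch of the routing rule above, assuming an FNV-1a-style hash over the id's UTF-16 code units as a stand-in for `hash`. The text does not specify a hash function, so `fnv1a32`, `shardFor`, and `fanOut` are hypothetical helpers; any stable hash works as long as every node uses the same one.

```typescript
// FNV-1a-style 32-bit hash. For ASCII ids this matches standard FNV-1a.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned 32-bit
}

// shard = hash(actorId) mod N
function shardFor(actorId: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount <= 0) {
    throw new RangeError("shardCount must be a positive integer");
  }
  return fnv1a32(actorId) % shardCount;
}

// Cross-shard read: query every shard in parallel and concatenate results.
// `S` is whatever handle the application uses for one shard.
async function fanOut<S, R>(
  shards: S[],
  query: (shard: S) => Promise<R[]>,
): Promise<R[]> {
  const perShard = await Promise.all(shards.map(query));
  return ([] as R[]).concat(...perShard);
}
```

One design consequence of plain modulo routing: changing N remaps most ids to a different shard, so growing the shard count implies a data migration. Picking N with headroom up front avoids resharding early.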
2. Read replicas via filesystem sync
Brainy persists everything as JSON files in a directory. Replicate the
directory with rsync, btrfs send, or ZFS snapshots — POSIX primitives,
no application-level replication code. Writes hit the primary; reads scale
linearly across replicas.
Trade-off: Eventually consistent (sync lag = seconds, sometimes minutes depending on tooling). Fine for read-heavy workloads where slightly-stale is acceptable.
When to use: Read-heavy serving, regional read caches, analytics followers that don't accept writes.
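On the application side, the read-replica pattern reduces to sending writes to the primary and spreading reads across replicas. The `ReplicaSet` class below is a hypothetical helper, generic over whatever handle the application uses for a brain instance, so it assumes nothing about the Brainy API. Round-robin is one simple choice of read policy, not a recommendation from the source.

```typescript
class ReplicaSet<B> {
  private primary: B;
  private replicas: B[];
  private next = 0; // index of the replica that serves the next read

  constructor(primary: B, replicas: B[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // All writes go to the primary; replicas are populated by filesystem sync.
  writer(): B {
    return this.primary;
  }

  // Round-robin across replicas; fall back to the primary when there are none.
  reader(): B {
    if (this.replicas.length === 0) return this.primary;
    const chosen = this.replicas[this.next];
    this.next = (this.next + 1) % this.replicas.length;
    return chosen;
  }
}
```

Reads served this way may lag the primary by the sync interval, so read-your-own-writes flows should use `writer()` for the read as well.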
3. Functional partitioning
Give each domain its own brain instance, tuned to its access pattern:
| Brain | Workload | Cor mode |
|---|---|---|
| Users / profiles | Small, hot | In-memory — sub-millisecond reads |
| Posts / content | Large, content-heavy | Hybrid — PQ + mmap |
| Social graph | Billion edges, append-heavy | On-disk + LSM verb storage |
| Media metadata | Small, mostly-immutable | In-memory |
Cor's adaptive mode selection picks the right mode per brain automatically based on observed memory. Cross-domain queries do N short hops at the application layer instead of complex joins inside the database.
Trade-off: Application-level composition replaces SQL joins. Subtype indexes on the joining fields keep cross-domain lookups O(1).
When to use: Workloads with clean domain boundaries (the AT Protocol's actor records / repo records / blob storage split is a natural example).
4. Multi-region with eventual merge
Brainy 8.0 / Cor 3.0 ship a Datomic-style immutable Db API: every write
produces a new Db value via brain.transact(tx), and one Db can be
folded into another via Db.with(tx). Each region writes to its local brain
in real time. Regions exchange tx blobs over plain HTTP — no central
coordinator, no global lock.
Trade-off: Eventually consistent across regions. Conflicts resolve with last-writer-wins per entity, or naturally via union-merge on graph edges (follow lists are CRDTs — adding two different follows from two regions produces the same set either way).
When to use: Geographic distribution where intra-region latency matters more than global consistency. Social, content delivery, multi-region SaaS.
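The two conflict rules named in the trade-off can be sketched independently of the Db API. The `Versioned` shape, its timestamp and region fields, and both function names are assumptions for illustration; the tie-break on `regionId` is added here so that both regions pick the same winner regardless of merge order.

```typescript
// Hypothetical record shape: each entity version carries write metadata.
interface Versioned<T> {
  value: T;
  updatedAt: number; // write timestamp, e.g. epoch milliseconds
  regionId: string;  // region that produced the write
}

// Last-writer-wins per entity. Equal timestamps are broken by regionId so
// the result does not depend on argument order.
function mergeLww<T>(a: Versioned<T>, b: Versioned<T>): Versioned<T> {
  if (a.updatedAt !== b.updatedAt) return a.updatedAt > b.updatedAt ? a : b;
  return a.regionId >= b.regionId ? a : b;
}

// Union-merge for add-only edge sets (a grow-only set): the result is the
// same whichever region's edges are applied first.
function mergeEdges(
  a: ReadonlySet<string>,
  b: ReadonlySet<string>,
): Set<string> {
  const out = new Set<string>(a);
  b.forEach((edge) => out.add(edge));
  return out;
}
```

This sketch covers additions only. Removals such as unfollows need extra bookkeeping (for example tombstones) to merge deterministically, and last-writer-wins depends on reasonably synchronized clocks across regions.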
5. Edge caches over a hot subset
Each edge node runs a small Brainy + Cor with a curated hot subset — recent posts, popular accounts, viral content, whatever your access pattern makes hot. Cache misses fall back to the primary brains over the same Brainy API used everywhere else.
Trade-off: Eviction policy matters. Cor's UnifiedCache handles
in-memory hot data; a thin persisted brain at the edge handles
"recently popular" with controlled size.
When to use: Skewed access (10% of accounts get 90% of reads), geographic latency requirements, public read APIs at scale.
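The miss-then-fallback behaviour can be sketched as a small read-through cache. This is not Cor's UnifiedCache: `EdgeCache`, its LRU eviction policy, and the `fetchFromPrimary` callback are hypothetical stand-ins for whatever call the edge node makes to the primary brain.

```typescript
class EdgeCache<V> {
  // A Map iterates in insertion order, so the first key is the least
  // recently used once hits are re-inserted.
  private entries = new Map<string, V>();
  private capacity: number;
  private fetchFromPrimary: (key: string) => Promise<V>;

  constructor(capacity: number, fetchFromPrimary: (key: string) => Promise<V>) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
    this.fetchFromPrimary = fetchFromPrimary;
  }

  async get(key: string): Promise<V> {
    if (this.entries.has(key)) {
      const hit = this.entries.get(key) as V;
      this.entries.delete(key); // re-insert to mark as most recently used
      this.entries.set(key, hit);
      return hit;
    }
    // Cache miss: fall back to the primary, then keep the result locally.
    const value = await this.fetchFromPrimary(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
    return value;
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```

LRU is only one possible policy; a workload dominated by a stable set of popular accounts might prefer frequency-based eviction or a TTL.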
Massive scale — composing patterns
For workloads at the limits — billion-entity multi-tenant SaaS, global social networks, multi-region compliance — these patterns compose:
| Tier | Recommended composition |
|---|---|
| 10 B+ entities on one region | #1 shard + #2 replicas + #3 functional partitioning |
| 10 B+ entities, hot-account pain | Add #5 edge caches |
| Multi-region | Add #4 Datomic merge between regional clusters |
| 100 B+ entities global | All five composed: regional clusters of sharded primaries with edge caches, merged across regions |
A concrete example for a social network at 30 million users and a few billion posts: 64 sharded primaries (each a $1000 commodity box at ~470 K users per shard) × 4 read replicas per shard = 256 boxes serving the regional workload. Hot accounts get an edge cache layer in front; geographic redundancy adds a second region merging via #4. Total bill of materials is hardware plus operations — no database vendor invoice.
What stays the same across all tiers
| | Laptop | Single server | Multiple machines |
|---|---|---|---|
| API | Same Brainy methods | Same | Same, calling the right instance |
| Data files | JSON on local disk | JSON on local disk | JSON on each instance |
| Cor install | npm install @soulcraft/cor | Same | Same on every instance |
| Upgrade story | npm update and restart | Same | Same, one instance at a time |
| Operational dependencies | None | None | None: no coordinator, no broker, no managed service |
Moving up the scaling ladder doesn't change your code. You add hardware, configure routing, and the same engine runs at every step.
Path to 10 billion end-to-end
DiskANN solves the vector-search bottleneck. The supporting subsystems that used to cap end-to-end operation at ~1 B have been widened to u64 throughout in the Cor 3.0 release — what remains is the verification report and a handful of bookkeeping items.
| Subsystem | Cortex 2.x ceiling | Cor 3.0 status |
|---|---|---|
| Entity ID space | ~4.3 B (u32) | ✅ u64 throughout (Piece 10) |
| EntityIdMapper persistence | ~500 M entries (JSON I/O) | ✅ Native binary mmap'd uuid↔int map (Piece 1) |
| Metadata sparse-field index | ~100 M entities | ✅ Native Rust LSM column store + Roaring64 widening (Piece E) |
| Verb-graph LSM trees | ~500 M edges | ✅ u64-keyed LSM + pair-value verbs_source for native BFS (Piece D) |
| Verb-id namespace | ~4.3 B verbs (u32) | ✅ u64 widened with IdSpace tagging (Piece B) |
| Verb endpoints store | In-memory HashMap | ✅ Packed-array mmap with adaptive sizing (Piece C) |
| FileSystemStorage sharding | ~2.5 M entities per directory | Configurable shard depth (in progress) |
| Search-result hydration | ~10 K results/query | Batch shard-grouped reads (planned) |
| End-to-end verification | — | docs/verification-report.md ships with the Cor 3.0 GA tag |
All fixes ship inside Cor — no external databases, no competing engines. The 5–15 ms search latency target at 10 B vectors (PROJECTED — awaiting verification-report.md) holds; the remaining work brings filesystem hydration and shard layout up to match.
What we don't ship (deliberately)
As of the 3.0 release, Brainy + Cor deliberately does not ship:
- A built-in coordinator for transparent multi-machine clustering.
- A managed-service tier or SaaS deployment.
- A query language that hides machine boundaries.
The architecture is single-box-per-instance, by design. At billion-per-tenant scale, a single commodity machine delivers the workload, and the cost of operating a distributed coordinator outweighs the benefit. Horizontal scaling means running more instances — same engine, no shared state.
If a future workload genuinely requires N-node coherence inside one logical database, we'll revisit. We will not ship that complexity speculatively.