U64 Entity IDs
Cortex maps every entity UUID to a compact integer id internally. Two
parts of the stack key off that integer — the
roaring bitmaps inside the metadata index, and the
mmap-backed BinaryEntityIdMapper that persists the UUID ↔ int
mapping at billion scale. Both of those layers historically used
32-bit ints (Roaring32, u32), which caps a single brain at
4 294 967 295 entities — about 4.29 B.
That ceiling is fine for almost every workload. It is not fine for
the corpora cortex 3.0 targets: the design point of the release is "1
B vectors on a single $1000 box", and the long tail goes well past 4
B. So cortex 3.0 adds a second IdSpace — 'u64' — that lifts the
ceiling to 2⁶⁴ - 1 (≈ 1.84 × 10¹⁹) while keeping every public API the
same.
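The two ceilings are easy to check with BigInt arithmetic. The sketch below is illustrative only: `smallestIdSpace` is a hypothetical helper, not part of the cortex API, and it assumes the cap is simply the largest value the integer width can hold, as stated above.

```typescript
// Ceilings for the two id spaces, computed with BigInt so the u64 value is exact.
const U32_MAX: bigint = 2n ** 32n - 1n; // 4 294 967 295
const U64_MAX: bigint = 2n ** 64n - 1n; // 18 446 744 073 709 551 615

// Hypothetical helper (not a cortex API): the smallest id space that can hold
// `expectedEntities` over the lifetime of a brain.
function smallestIdSpace(expectedEntities: bigint): 'u32' | 'u64' {
  if (expectedEntities < 0n || expectedEntities > U64_MAX) {
    throw new RangeError(`entity count out of range: ${expectedEntities}`);
  }
  return expectedEntities <= U32_MAX ? 'u32' : 'u64';
}

console.log(smallestIdSpace(1_000_000_000n)); // the 1 B design point
console.log(smallestIdSpace(10_000_000_000n)); // a long-tail corpus past 4 B
```

The 1 B design point still fits in U32; it is the corpora beyond roughly 4.29 B that need the wider space.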
Picking an IdSpace
The decision is one config flag, set at brain create time:
import { BrainyData } from '@soulcraft/brainy'
import { register as registerCortex } from '@soulcraft/cortex'

const brain = new BrainyData({
  storage: { type: 'filesystem', rootDirectory: '/data/idx' },
  // Pick once. Default is 'u32'.
  entityIdMapper: { idSpace: 'u64' },
})
await registerCortex(brain)
await brain.init()

Use U32 (default) when:
- You expect ≤ 4.29 B entities for the lifetime of the brain.
- You want byte-identical wire format with cortex 2.x: existing .cidx segments and chunked metadata JSON envelopes are unchanged.
- You're migrating an existing brain. U32 is what cortex 2.x wrote.
Use U64 when:
- The brain may exceed 4.29 B entities.
- You're starting fresh and want headroom: the U64 wire format has negligible overhead at small scale and is forward-compatible with every cortex 3.x release.
- You want to use the BigInt napi sibling methods directly (e.g. when a downstream system already speaks bigint and you want to skip the safe-integer guard).
The two modes are not interchangeable on the same files. You cannot escalate a U32 brain to U64 in place: the on-disk header is authoritative, and any mismatched config is rejected as a hard error at open time. Migrating an existing U32 brain to U64 means a re-export and re-ingest into a new brain.
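The open-time check can be pictured roughly as follows. This is a sketch under stated assumptions, not cortex's actual code: `MapperHeader`, `IdSpaceMismatchError` and `assertIdSpaceMatches` are hypothetical names, and the only fact taken from this page is that a v1 mapper header means U32 and a v2 header means U64.

```typescript
type IdSpace = 'u32' | 'u64';

// Assumption: only the header version is modelled. The real header layout is
// not described in this document.
interface MapperHeader {
  version: 1 | 2; // v1 = U32, v2 = U64
}

// Hypothetical error type, used for illustration only.
class IdSpaceMismatchError extends Error {}

function idSpaceFromHeader(header: MapperHeader): IdSpace {
  return header.version === 2 ? 'u64' : 'u32';
}

// The on-disk header is authoritative; the config can only agree with it.
// `configured` defaults to 'u32', mirroring the documented default.
function assertIdSpaceMatches(header: MapperHeader, configured: IdSpace = 'u32'): IdSpace {
  const onDisk = idSpaceFromHeader(header);
  if (onDisk !== configured) {
    throw new IdSpaceMismatchError(
      `brain was created as '${onDisk}' but opened with idSpace '${configured}'; ` +
        'migrate by re-exporting and re-ingesting into a new brain',
    );
  }
  return onDisk;
}
```

Failing loudly at open time, rather than silently reinterpreting the files, is what keeps a U32 brain from being read with U64 widths or the reverse.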
What changes on the wire
- int_to_uuid.bin (binary mapper): v1 header for U32, v2 header for U64. The header carries the IdSpace; the file is self-describing.
- .cidx column-store segments: the flags byte's FLAG_U64_IDS bit (0x01) is set; the entity-id column is u64 little-endian. The cross-language fixture in src/native/cidxU64WireFormatFixture.test.ts locks the byte layout via SHA-256.
- Metadata chunk JSON envelope: U64 chunks get a v2 envelope with "version": 2 and "idSpace": "u64" fields. U32 chunks emit the v1-compatible envelope (no extra fields) for backwards compatibility.
- PostingList (in-memory representation): the U32 brain uses croaring::Bitmap (Roaring32); the U64 brain uses croaring::Treemap (Roaring64). Both expose the same algebra (union_assign, intersect_assign, difference_assign, iter_u64, cardinality), so every query path is variant-agnostic.
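To make "u64 little-endian" concrete, here is a minimal sketch of encoding and decoding an entity-id column. It assumes the column is a bare packed array of ids and that the U32 variant is packed u32 little-endian; neither detail is stated on this page, and the real .cidx segment carries more structure than this. The function names are illustrative.

```typescript
const FLAG_U64_IDS = 0x01; // the flags-byte bit named in the list above

// Encode entity ids as a packed u64 little-endian column.
function encodeU64Column(ids: bigint[]): Uint8Array {
  const buf = new ArrayBuffer(ids.length * 8);
  const view = new DataView(buf);
  ids.forEach((id, i) => view.setBigUint64(i * 8, id, true)); // true = little-endian
  return new Uint8Array(buf);
}

// Decode an id column, choosing the element width from the flags byte.
// Assumption: without FLAG_U64_IDS the column is packed u32 little-endian.
function decodeIdColumn(flags: number, bytes: Uint8Array): bigint[] {
  const wide = (flags & FLAG_U64_IDS) !== 0;
  const width = wide ? 8 : 4;
  if (bytes.byteLength % width !== 0) {
    throw new RangeError(`column length ${bytes.byteLength} is not a multiple of ${width}`);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out: bigint[] = [];
  for (let off = 0; off < bytes.byteLength; off += width) {
    out.push(wide ? view.getBigUint64(off, true) : BigInt(view.getUint32(off, true)));
  }
  return out;
}

const column = encodeU64Column([1n, 2n ** 32n, 2n ** 64n - 1n]);
console.log(decodeIdColumn(FLAG_U64_IDS, column));
```

Decoding to bigint in both branches mirrors the variant-agnostic idea: callers see one id type regardless of the width on disk.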
What changes in the API
Nothing, for the common case. Every wrapper method that returned a
number continues to return a number. Two refinements:
- getAllIntIds() throws in U64 mode. A materialised number[] at billion scale would risk OOM, and U64 brains exist precisely because the entity count is large. Use the streaming BigInt iterator on the underlying native binding.
- EntityIdSpaceExceeded is thrown when a U64 brain's number-typed method (e.g. getOrAssign(uuid): number) would return an int above Number.MAX_SAFE_INTEGER (2⁵³ − 1 ≈ 9.007 × 10¹⁵). At that point you must switch to the BigInt sibling methods for the full u64 range:

const big: bigint = mapper.getOrAssignBig(uuid)
const uuidBack: string | undefined = mapper.getUuidBig(big)

The BigInt siblings work in both modes. Use them in any code path that has to serve both U32 and U64 brains, so it doesn't need to branch on getIdSpace().
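The safe-integer guard described above amounts to a range check before narrowing a bigint to a number. The sketch below is an illustration, not cortex's implementation: the local `EntityIdSpaceExceeded` class is a stand-in (the real class's constructor and fields are not shown on this page), and `toSafeNumber` is a hypothetical helper name.

```typescript
// Local stand-in for the error described above; the shape is an assumption.
class EntityIdSpaceExceeded extends Error {
  constructor(id: bigint) {
    super(`entity id ${id} exceeds Number.MAX_SAFE_INTEGER; use the BigInt sibling methods`);
    this.name = 'EntityIdSpaceExceeded';
  }
}

const MAX_SAFE: bigint = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1

// Narrow a u64 id to a JS number, refusing values that would lose precision.
function toSafeNumber(id: bigint): number {
  if (id > MAX_SAFE) throw new EntityIdSpaceExceeded(id);
  return Number(id);
}

console.log(toSafeNumber(42n));
try {
  toSafeNumber(MAX_SAFE + 1n);
} catch (err) {
  console.log((err as Error).name);
}
```

Above 2⁵³ − 1, adjacent integers are no longer distinguishable as doubles, so throwing is safer than returning a rounded id that could alias another entity.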
BigInt sibling surface (cortex 3.0)
| Number-typed | BigInt sibling | Returns |
|---|---|---|
| getOrAssign(uuid: string): number | getOrAssignBig(uuid: string): bigint | The allocated int |
| getInt(uuid: string): number \| undefined | getIntBig(uuid: string): bigint \| undefined | The int for uuid, or undefined |
| getUuid(int: number): string \| undefined | getUuidBig(int: bigint): string \| undefined | The UUID for int |
| get size(): number | sizeBig(): bigint | Live (non-tombstone) entry count |
| (none) | nextIntBig(): bigint | Largest int ever assigned + 1 |
The BigInt methods are mode-independent: they work in U32 brains too,
and return the same values widened to bigint. Mixing
getOrAssign(uuid) and getOrAssignBig(uuid) for the same UUID
returns the same underlying int.
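A toy in-memory mapper shows why the two surfaces agree: the number-typed method is just a narrowing view over the same bigint state. This is a sketch under assumptions, not the BinaryEntityIdMapper: `ToyMapper` is hypothetical, ids start at 0 here (the real starting value is not stated on this page), and tombstones are not modelled.

```typescript
// Toy mapper mirroring part of the table above. Not mmap-backed, not persistent.
class ToyMapper {
  private readonly uuidToInt = new Map<string, bigint>();
  private readonly intToUuid = new Map<bigint, string>();
  private next = 0n; // assumption: first id is 0

  getOrAssignBig(uuid: string): bigint {
    const existing = this.uuidToInt.get(uuid);
    if (existing !== undefined) return existing;
    const id = this.next++;
    this.uuidToInt.set(uuid, id);
    this.intToUuid.set(id, uuid);
    return id;
  }

  // Number-typed view over the same state, with a safe-integer check.
  getOrAssign(uuid: string): number {
    const id = this.getOrAssignBig(uuid);
    if (id > BigInt(Number.MAX_SAFE_INTEGER)) throw new RangeError('id exceeds safe integer range');
    return Number(id);
  }

  getUuidBig(id: bigint): string | undefined {
    return this.intToUuid.get(id);
  }

  sizeBig(): bigint {
    return BigInt(this.uuidToInt.size);
  }

  nextIntBig(): bigint {
    return this.next;
  }
}

const mapper = new ToyMapper();
const uuid = '123e4567-e89b-12d3-a456-426614174000';
const asNumber = mapper.getOrAssign(uuid);
const asBigInt = mapper.getOrAssignBig(uuid);
console.log(BigInt(asNumber) === asBigInt); // same underlying int
```

Because both methods resolve through one map, calling them in either order for the same UUID never allocates a second id.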
Brainy 8.0 lockstep — `EntityIdSpaceExceeded` (JS fallback)
Brainy 8.0 ships the JS-fallback half of the contract. Without
cortex installed, brainy uses an in-RAM EntityIdMapper that also
caps at u32::MAX — but the JS fallback can't widen to u64 (the
bitmap layer is Roaring32 end-to-end on the JS side). So brainy
8.0's mapper throws EntityIdSpaceExceeded (in
@soulcraft/brainy/internals) at the ceiling and points the caller
at cortex's idSpace: 'u64' mode as the migration path.
The two errors are siblings at different layers:
- brainy's EntityIdSpaceExceeded (U32_ENTITY_ID_MAX = 2³² − 1): the JS-only path hit the bitmap-width ceiling. Install cortex.
- cortex's EntityIdSpaceExceeded (Number.MAX_SAFE_INTEGER = 2⁵³ − 1): a U64 brain's number-typed method overflowed JS's safe-integer range. Switch to the BigInt siblings.
Wire-format parity gates
Two hash-locked fixtures lock the format across reader implementations:
- .cidx U64 segments: src/native/cidxU64WireFormatFixture.test.ts. Three SHA-256 hashes (numeric / float / string segments) lock the byte layout. Brainy 8.0's reader targets the same fixtures.
- PostingList cross-language: round-tripped via the napi parityWriteU64NumericSegment / parityReadSegmentEntityIdsBig pair. The same write path produces the same hash across 100 stress iterations (see src/native/cortex30Stress.test.ts).
If a hash drifts, do not silently update the constant; root-cause the byte diff first. A non-deterministic writer or a half-applied format change is exactly the kind of bug those fixtures exist to catch.
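The hash-lock pattern itself is small. The sketch below uses a stand-in writer that packs ids as u64 little-endian; it is not the real .cidx writer or the napi parity pair, and `writeSegment` and `sha256Hex` are illustrative names. In a real fixture the expected hash is a hard-coded constant, while here it is taken from the first run so the example stays self-contained.

```typescript
import { createHash } from 'node:crypto';

// Stand-in writer: packs ids as u64 little-endian. The real segment layout has
// more structure; only the hashing pattern matters for this sketch.
function writeSegment(ids: bigint[]): Uint8Array {
  const buf = new ArrayBuffer(ids.length * 8);
  const view = new DataView(buf);
  ids.forEach((id, i) => view.setBigUint64(i * 8, id, true));
  return new Uint8Array(buf);
}

function sha256Hex(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex');
}

const fixtureIds = [0n, 1n, 2n ** 32n, 2n ** 64n - 1n];
// In a real fixture this would be a pinned constant checked into the test file.
const pinned = sha256Hex(writeSegment(fixtureIds));

// A deterministic writer must reproduce the same bytes on every iteration.
for (let i = 0; i < 100; i++) {
  const got = sha256Hex(writeSegment(fixtureIds));
  if (got !== pinned) {
    throw new Error(`hash drift on iteration ${i}: ${got} != ${pinned}`);
  }
}
```

Hashing the whole byte buffer means any change in width, endianness or ordering shows up as a mismatch, which is why a drift should be investigated rather than re-pinned.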
See also
- Performance — measured roaring-bitmap ops + SIMD distance speedups.
- Scaling — multi-tenant density + the 1 B operational ceiling.
- DiskANN — the search engine that consumes these entity ids at billion scale.