
U64 Entity IDs

Cortex maps every entity UUID to a compact integer id internally. Two parts of the stack key off that integer — the roaring bitmaps inside the metadata index, and the mmap-backed BinaryEntityIdMapper that persists the UUID ↔ int mapping at billion scale. Both of those layers historically used 32-bit ints (Roaring32, u32), which caps a single brain at 4 294 967 295 entities — about 4.29 B.

That ceiling is fine for almost every workload. It is not fine for the corpora cortex 3.0 targets: the design point of the release is "1 B vectors on a single $1000 box", and the long tail goes well past 4 B. So cortex 3.0 adds a second IdSpace — 'u64' — that lifts the ceiling to 2⁶⁴ - 1 (≈ 1.84 × 10¹⁹) while keeping every public API the same.
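The two ceilings can be checked with a few lines of BigInt arithmetic. This is only an illustration: `smallestIdSpace` is a hypothetical helper written for this guide, not a cortex export.

```typescript
// Id-space ceilings, computed with BigInt so the 64-bit value stays exact.
const U32_MAX: bigint = 2n ** 32n - 1n
const U64_MAX: bigint = 2n ** 64n - 1n

console.log(U32_MAX.toString()) // 4294967295
console.log(U64_MAX.toString()) // 18446744073709551615

// Hypothetical helper: the smallest IdSpace able to hold `expected` entities.
function smallestIdSpace(expected: bigint): 'u32' | 'u64' {
  if (expected < 0n || expected > U64_MAX) {
    throw new RangeError(`entity count out of range: ${expected}`)
  }
  return expected <= U32_MAX ? 'u32' : 'u64'
}

// The 1 B design point still fits in U32; the long tail past 4.29 B does not.
console.log(smallestIdSpace(1_000_000_000n)) // u32
console.log(smallestIdSpace(5_000_000_000n)) // u64
```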

Picking an IdSpace

The decision is one config flag, set at brain create time:

import { BrainyData } from '@soulcraft/brainy'
import { register as registerCortex } from '@soulcraft/cortex'

const brain = new BrainyData({
  storage: { type: 'filesystem', rootDirectory: '/data/idx' },
  // Pick once. Default is 'u32'.
  entityIdMapper: { idSpace: 'u64' },
})
await registerCortex(brain)
await brain.init()

Use U32 (default) when:

  • You expect ≤ 4.29 B entities for the lifetime of the brain.
  • You want byte-identical wire format with cortex 2.x — existing .cidx segments and chunked metadata JSON envelopes are unchanged.
  • You're migrating an existing brain. U32 is what cortex 2.x wrote.

Use U64 when:

  • The brain may exceed 4.29 B entities.
  • You're starting fresh and want headroom: the U64 wire format has negligible overhead at small scale and is forward-compatible with every cortex 3.x release.
  • You want to use the BigInt napi sibling methods directly (e.g. when a downstream system already speaks bigint and you want to skip the safe-integer guard).

The two modes are not interchangeable on the same files. You cannot escalate a U32 brain to U64 in place: the on-disk header is authoritative, and any mismatched config is rejected as a hard error at open time. Migrating an existing U32 brain to U64 means a re-export and re-ingest into a new brain.
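The "header is authoritative" rule can be pictured as a small check at open time. The sketch below is an assumption about the shape of that check, not cortex's code: `resolveIdSpace` and `IdSpaceMismatchError` are hypothetical names.

```typescript
type IdSpace = 'u32' | 'u64'

// Hypothetical error type for this sketch; not a cortex export.
class IdSpaceMismatchError extends Error {
  readonly onDisk: IdSpace
  readonly configured: IdSpace

  constructor(onDisk: IdSpace, configured: IdSpace) {
    super(
      `brain on disk is ${onDisk} but config requests ${configured}; ` +
        'changing IdSpace requires a re-export and re-ingest into a new brain',
    )
    this.name = 'IdSpaceMismatchError'
    this.onDisk = onDisk
    this.configured = configured
  }
}

// onDisk is undefined for a brand-new brain (no header written yet).
function resolveIdSpace(onDisk: IdSpace | undefined, configured: IdSpace): IdSpace {
  if (onDisk === undefined) return configured // fresh brain: config decides
  if (onDisk !== configured) throw new IdSpaceMismatchError(onDisk, configured)
  return onDisk // existing brain: header and config agree
}
```

Under this reading there is no "upgrade" branch at all: an existing U32 header paired with a U64 config fails before any data is read.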

What changes on the wire

  • int_to_uuid.bin (binary mapper): v1 header for U32, v2 header for U64. The header carries the IdSpace, so the file is self-describing.
  • .cidx column-store segments: in U64 mode the flags byte's FLAG_U64_IDS bit (0x01) is set and the entity-id column is u64 little-endian. The cross-language fixture in src/native/cidxU64WireFormatFixture.test.ts locks the byte layout via SHA-256.
  • Metadata chunk JSON envelope: U64 chunks get a v2 envelope with "version": 2 and "idSpace": "u64" fields. U32 chunks emit the v1-compatible envelope (no extra fields) for backwards compatibility.
  • PostingList (in-memory representation): the U32 brain uses croaring::Bitmap (Roaring32); the U64 brain uses croaring::Treemap (Roaring64). Both expose the same algebra (union_assign, intersect_assign, difference_assign, iter_u64, cardinality), so every query path is variant-agnostic.
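To make the .cidx flag concrete, here is a minimal sketch of decoding an entity-id column once the flags byte is known. Only FLAG_U64_IDS (0x01) and the u64 little-endian layout come from the text above; the function name `decodeEntityIds`, the assumption that the U32 column is u32 little-endian, and the idea that the column arrives as a standalone byte slice are illustrative guesses, not the real segment layout.

```typescript
const FLAG_U64_IDS = 0x01 // flag bit named in the text

// Decode a raw entity-id column into bigints, whatever its width.
function decodeEntityIds(flags: number, column: Uint8Array): bigint[] {
  const isU64 = (flags & FLAG_U64_IDS) !== 0
  const width = isU64 ? 8 : 4 // bytes per id (the 4-byte U32 width is an assumption)
  if (column.byteLength % width !== 0) {
    throw new RangeError(`column length ${column.byteLength} is not a multiple of ${width}`)
  }
  const view = new DataView(column.buffer, column.byteOffset, column.byteLength)
  const ids: bigint[] = []
  for (let off = 0; off < column.byteLength; off += width) {
    // `true` selects little-endian byte order.
    ids.push(isU64 ? view.getBigUint64(off, true) : BigInt(view.getUint32(off, true)))
  }
  return ids
}

// Usage: one id just past the U32 ceiling, written as u64 little-endian.
const u64Column = new Uint8Array(8)
new DataView(u64Column.buffer).setBigUint64(0, 2n ** 32n, true)
const decoded = decodeEntityIds(FLAG_U64_IDS, u64Column)
console.log(decoded[0] === 4294967296n) // true
```

Returning bigint in both modes mirrors the variant-agnostic idea: callers see one id type and never branch on the column width.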

What changes in the API

Nothing, for the common case. Every wrapper method that returned a number continues to return a number. Two refinements:

  1. getAllIntIds() throws in U64 mode. A materialised number[] at billion scale would risk OOM, and U64 brains exist precisely because the entity count is large. Use the streaming BigInt iterator on the underlying native binding.

  2. EntityIdSpaceExceeded is thrown when a U64 brain's number-typed method (e.g. getOrAssign(uuid): number) would return an int above Number.MAX_SAFE_INTEGER (2⁵³ - 1 ≈ 9.007 × 10¹⁵ entities). At that point you must switch to the BigInt sibling methods for the full u64 range:

    const big: bigint = mapper.getOrAssignBig(uuid)
    const uuidBack: string | undefined = mapper.getUuidBig(big)

    The BigInt siblings work in both modes. Use them in any code path that crosses the U32 → U64 escalation point so it doesn't need to branch on getIdSpace().
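The safe-integer guard behind the second refinement can be sketched in isolation. The class below is a local stand-in that only borrows the error's name; the real EntityIdSpaceExceeded ships with cortex and may carry different fields. `toSafeNumber` is a hypothetical helper.

```typescript
// Local stand-in for the error named in the text; not the cortex class.
class EntityIdSpaceExceeded extends Error {
  constructor(id: bigint) {
    super(`entity id ${id} exceeds Number.MAX_SAFE_INTEGER; use the BigInt sibling methods`)
    this.name = 'EntityIdSpaceExceeded'
  }
}

const MAX_SAFE: bigint = BigInt(Number.MAX_SAFE_INTEGER) // 2^53 - 1

// Narrow a u64 id to a JS number, refusing any value that would lose precision.
function toSafeNumber(id: bigint): number {
  if (id < 0n || id > MAX_SAFE) throw new EntityIdSpaceExceeded(id)
  return Number(id)
}

console.log(toSafeNumber(42n)) // 42

try {
  toSafeNumber(MAX_SAFE + 1n)
} catch (err) {
  console.log(err instanceof EntityIdSpaceExceeded) // true
}
```

Throwing is the safer design here: above 2⁵³ - 1, distinct u64 ids can round to the same number, so a silent conversion could alias two entities.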

BigInt sibling surface (cortex 3.0)

| Number-typed | BigInt sibling | Returns |
| --- | --- | --- |
| getOrAssign(uuid: string): number | getOrAssignBig(uuid: string): bigint | The allocated int |
| getInt(uuid: string): number \| undefined | getIntBig(uuid: string): bigint \| undefined | The int for uuid, or undefined |
| getUuid(int: number): string \| undefined | getUuidBig(int: bigint): string \| undefined | The UUID for int |
| get size(): number | sizeBig(): bigint | Live (non-tombstone) entry count |
| (none) | nextIntBig(): bigint | Largest int ever assigned + 1 |

The BigInt methods are mode-independent: they work in U32 brains too, and return the same values widened to bigint. Mixing getOrAssign(uuid) and getOrAssignBig(uuid) for the same UUID returns the same underlying int.
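A toy in-memory mapper shows why the two method families can agree. This is not the cortex implementation: `ToyMapper` is hypothetical, it borrows only the method names from the table, it assumes ids start at 0, and it throws a plain RangeError where cortex would throw EntityIdSpaceExceeded.

```typescript
// Toy mapper: one bigint counter backs both the number and the bigint methods.
class ToyMapper {
  private readonly uuidToInt = new Map<string, bigint>()
  private readonly intToUuid = new Map<bigint, string>()
  private next = 0n // assumption: ids are allocated sequentially from 0

  getOrAssignBig(uuid: string): bigint {
    const existing = this.uuidToInt.get(uuid)
    if (existing !== undefined) return existing
    const id = this.next
    this.next += 1n
    this.uuidToInt.set(uuid, id)
    this.intToUuid.set(id, uuid)
    return id
  }

  // The number-typed method is a guarded view over the bigint one.
  getOrAssign(uuid: string): number {
    const id = this.getOrAssignBig(uuid)
    if (id > BigInt(Number.MAX_SAFE_INTEGER)) {
      throw new RangeError(`id ${id} is above Number.MAX_SAFE_INTEGER`)
    }
    return Number(id)
  }

  getUuidBig(int: bigint): string | undefined {
    return this.intToUuid.get(int)
  }

  sizeBig(): bigint {
    return BigInt(this.uuidToInt.size)
  }

  nextIntBig(): bigint {
    return this.next
  }
}

const mapper = new ToyMapper()
const asNumber = mapper.getOrAssign('uuid-a')
const asBigInt = mapper.getOrAssignBig('uuid-a')
console.log(BigInt(asNumber) === asBigInt) // true: same underlying int
```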

Brainy 8.0 lockstep — `EntityIdSpaceExceeded` (JS fallback)

Brainy 8.0 ships the JS-fallback half of the contract. Without cortex installed, brainy uses an in-RAM EntityIdMapper that also caps at u32::MAX — but the JS fallback can't widen to u64 (the bitmap layer is Roaring32 end-to-end on the JS side). So brainy 8.0's mapper throws EntityIdSpaceExceeded (in @soulcraft/brainy/internals) at the ceiling and points the caller at cortex's idSpace: 'u64' mode as the migration path.

The two errors are siblings at different layers:

  • brainy's EntityIdSpaceExceeded (U32_ENTITY_ID_MAX = 2³² − 1): the JS-only path hit the bitmap-width ceiling — install cortex.
  • cortex's EntityIdSpaceExceeded (Number.MAX_SAFE_INTEGER = 2⁵³ − 1): a U64 brain's number-typed method overflowed JS's safe-integer range — switch to BigInt siblings.

Wire-format parity gates

Two hash-locked fixtures lock the format across reader implementations:

  • .cidx U64 segments: src/native/cidxU64WireFormatFixture.test.ts. Three SHA-256 hashes (numeric / float / string segments) lock the byte layout. Brainy 8.0's reader targets the same fixtures.
  • PostingList cross-language: round-tripped via the napi parityWriteU64NumericSegment / parityReadSegmentEntityIdsBig pair. The same write path produces the same hash across 100 stress iterations (see src/native/cortex30Stress.test.ts).

If a hash drifts, do not silently update the constant; root-cause the byte diff first. A non-deterministic writer or a half-applied format change is exactly the kind of bug those fixtures exist to catch.
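The hash-lock pattern itself is small enough to sketch with Node's built-in crypto module. Everything here is illustrative: `writeToySegment` is a made-up writer (one flags byte followed by u64 little-endian ids), not the real .cidx layout, and `assertHashLocked` is a hypothetical helper rather than the fixture's actual code.

```typescript
import { createHash } from 'node:crypto'

function sha256Hex(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex')
}

// Hypothetical deterministic writer: flags byte, then each id as u64 little-endian.
function writeToySegment(ids: bigint[]): Uint8Array {
  const out = new Uint8Array(1 + ids.length * 8)
  out[0] = 0x01 // FLAG_U64_IDS
  const view = new DataView(out.buffer)
  ids.forEach((id, i) => view.setBigUint64(1 + i * 8, id, true))
  return out
}

// Fail loudly on drift instead of quietly accepting a new hash.
function assertHashLocked(bytes: Uint8Array, expectedHex: string): void {
  const actual = sha256Hex(bytes)
  if (actual !== expectedHex) {
    throw new Error(`wire format drifted: expected ${expectedHex}, got ${actual}`)
  }
}

// In a real fixture the expected hash is a committed constant; here it is
// captured from a first run purely to demonstrate the comparison.
const ids = [1n, 2n ** 32n, 2n ** 63n]
const locked = sha256Hex(writeToySegment(ids))

// A deterministic writer reproduces the hash on every run.
for (let i = 0; i < 100; i++) {
  assertHashLocked(writeToySegment(ids), locked)
}
```

A single changed byte, such as a flipped flag or a big-endian write, produces a different digest, which is what turns a silent format change into a failing test.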

See also

  • Performance — measured roaring-bitmap ops + SIMD distance speedups.
  • Scaling — multi-tenant density + the 1 B operational ceiling.
  • DiskANN — the search engine that consumes these entity ids at billion scale.