Skip to content

Merge Vector Search into master branch#837

Closed
eolivelli wants to merge 133 commits into
diennea:masterfrom
eolivelli:vector-index-async
Closed

Merge Vector Search into master branch#837
eolivelli wants to merge 133 commits into
diennea:masterfrom
eolivelli:vector-index-async

Conversation

@eolivelli
Copy link
Copy Markdown
Contributor

This PR is to validate CI before merging the Vector Search branch into master

eolivelli and others added 30 commits March 19, 2026 10:05
…arameters

## Features

### Vector index (ANN search)
- New index type `VECTOR` backed by jvector (Vamana-style ANN graph).
- Indexed column must be of type `floata` (FLOAT ARRAY).
- CREATE / INSERT / UPDATE / DELETE path fully implemented.
- ANN query via `ann_of(col, ?)` scalar function:
  - `ORDER BY ann_of(col, CAST(? AS FLOAT ARRAY)) DESC LIMIT k` is
    intercepted by CalcitePlanner and routed through the new
    VectorANNScanOp planner operator when a vector index exists.
  - Falls back to brute-force cosine similarity when no index is present.
  - Optional WHERE filter applied after PK fetch.

### Configurable hyperparameters via WITH clause
- Parameters `m`, `beamWidth`, `neighborOverflow`, `alpha`, `similarity`
  (cosine / euclidean / dot), and `fusedPQ` can be set at index creation:

    CREATE VECTOR INDEX vidx ON tblspace1.t1(vec)
      WITH m=32 beamWidth=200 similarity=cosine fusedPQ=true;

- Index.java now carries a `Map<String,String> properties` field,
  serialised as format version 2 (version 1 files read with empty map).
- JSQLParserPlanner strips the WITH suffix before JSQLParser sees it
  (JSQLParser does not support this syntax), parses key=value pairs, and
  transfers them to the Index.Builder via a ThreadLocal.

### FusedPQ on-disk format (hybrid architecture)
- When fusedPQ=true (default), dimension ≥ 8, and ≥ 256 active vectors,
  checkpoints use jvector's OnDiskGraphIndex with FusedPQ + InlineVectors
  features for faster approximate scoring at search time.
- On restart the on-disk graph is loaded via a ByteBuffer-backed
  ReaderSupplier; new inserts after restart go to a fresh in-memory
  GraphIndexBuilder.
- Search is hybrid: on-disk graph (FusedPQ scoring + InlineVectors
  reranking) and live in-memory graph are searched independently; results
  are merged by score descending before returning the top-K.
- Deleted on-disk nodes are tracked in onDiskNodeToPk / onDiskPkToNode
  and excluded at search time via an acceptBits filter.
- Falls back to the legacy OnHeapGraphIndex format for small indexes
  (< 256 vectors or dimension < 8) or when fusedPQ=false.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Intercept ORDER BY ann_of(col, ?) in JSQLParserPlanner before planSort()
and route through VectorANNScanOp, matching CalcitePlanner behaviour.

- VectorANNScanOp: support null scanProjection/innerProjection (full table
  row passthrough); null-safe fallback (throws StatementExecutionException
  when no index and no fallback op)
- JSQLParserPlanner: add tryBuildVectorANNSortOp() that detects the
  ann_of() pattern in a single ORDER BY element and builds VectorANNScanOp
  with null projections; falls through to planSort() for all other cases
- Add VectorIndexJSQLParserPlannerTest covering: index present, WHERE +
  ANN filter, and no-index error path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Tuple

- New vector-testings/ module: standalone JDBC benchmark for vector index
  ingestion throughput, index build time, ANN query latency, and recall@K
- Supports SIFT-1M, SIFT-10M, and BIGANN-1B datasets (fvecs + bvecs formats)
- Configurable parallelism, batch size, index parameters, and top-K
- Fix Tuple.serialize() to handle Float values (returned by ann_of())
  by converting to Double before serialization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eolivelli and others added 27 commits April 3, 2026 12:44
Introduce a volatile snapshot swap pattern (ServerSnapshot inner class) to
allow lock-free reads in hot-path methods while supporting safe dynamic
server list updates via updateServers(). Existing channels are reused
when servers are retained, and removed channels are shut down gracefully
in a background daemon thread. Also adds the DynamicServiceClient
interface in herddb-core for generic dynamic update support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

Add ServiceDiscoveryListener interface and service registration/discovery
methods for indexing services and file servers. The ZooKeeper implementation
uses ephemeral nodes for automatic deregistration on disconnect and watchers
for push-based change notifications. Services are re-registered after ZK
session expiry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor RemoteFileServiceClient to support dynamic server list updates
via a volatile snapshot swap pattern. This allows changing the set of
remote file servers at runtime without restarting the client.

- Add DynamicServiceClient interface in herddb-core
- Introduce ServerSnapshot inner class holding router + channels
- Replace final router/channels fields with volatile snapshot
- Extract buildChannel() helper for channel creation
- Add updateServers() with channel reuse and graceful shutdown
- Update all stub methods and fan-out operations to read snapshot
- Add RemoteFileServiceClientDynamicTest with 3 test cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eeper

Enable IndexingServer and RemoteFileServer to register/unregister themselves
in MetadataStorageManager on startup/shutdown, allowing service discovery
via ZooKeeper in cluster mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…service mode

Introduce server.storage.mode property (local|remote|bookkeeper) to independently
select the DataStorageManager. Remove PROPERTY_MODE_REMOTE_FILE_SERVICE.
The remote file storage is now activated via server.storage.mode=remote with any
compatible server.mode (standalone or cluster).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that IndexingService instances register in ZooKeeper and can be
discovered dynamically by the HerdDB server. Two test methods cover the
single-instance and two-instance (sharded) scenarios, both performing
ANN queries against orthogonal 4D vectors to confirm correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that a RemoteFileServer registered in ZK is discovered by a
HerdDB server running in cluster mode with server.storage.mode=remote,
and that basic table CRUD operations work with remote page storage.
Includes a two-server variant that tests consistent hashing across
multiple file servers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ServerMain: after server.start(), try ZK discovery for indexing services
  if indexing.service.servers is not statically configured. Register a
  ServiceDiscoveryListener for dynamic updates.
- Server.buildDataStorageManager(): for storage.mode=remote, try ZK discovery
  for file servers if remote.file.servers is not configured. Register listener
  for dynamic updates via DynamicServiceClient.
- Change PROPERTY_REMOTE_FILE_SERVERS_DEFAULT to empty string since the old
  remote-file-service mode was removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test-start-fileserver-cluster.sh: use server.storage.mode=remote instead
  of server.mode=remote-file-service
- Helm chart: condition file-server resources on server.storageMode=remote
  instead of server.mode=remote-file-service
- REMOTE_FILE_SERVER.md: update configuration examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Add `client.allowReadsFromFollowers` configuration flag (default
`false`) that enables distributing SELECT queries across all replicas
(leader + followers) when in AUTOCOMMIT mode (`tx == 0`)
- Writes always go to the leader; reads inside transactions always go to
the leader
- On follower read failure, automatic retry picks a different replica
via existing `requestMetadataRefresh` mechanism
- Backward-compatible wire protocol extension using bitmask trailer in
`OpenScanner` PDU

## Changes

### Wire Protocol (`herddb-utils`)
- **`Pdu.java`**: Add `FLAGS_OPENSCANNER_ALLOW_FOLLOWER_READS = 8` (bit
3, composable with existing bit 2)
- **`PduCodec.java`**: Add `OpenScanner.write()` overload with
`allowFollowerReads` param; trailer uses bitmask OR. Add
`isAllowFollowerReads()` decoder

### Client Configuration (`herddb-core`)
- **`ClientConfiguration.java`**: Add `client.allowReadsFromFollowers`
property (works via JDBC URL query params)
- **`HDBClient.java`**: Store and expose `allowReadsFromFollowers` flag

### Metadata Provider
- **`ClientSideMetadataProvider.java`**: Add `getTableSpaceReplicas()`
default method returning `Set<String>`
- **`ZookeeperClientSideMetadataProvider.java`**: Cache
`TableSpace.replicas` alongside leader in `readAsTableSpace()`;
implement `getTableSpaceReplicas()`; clear replica cache on metadata
refresh

### Client Routing
- **`ClientSideConnectionPeer.java`**: Add `executeScan` overload with
`allowFollowerReads` parameter (default method delegates to old
signature)
- **`RoutedClientSideConnection.java`**: Implement new overload, pass
flag to `PduCodec.OpenScanner.write()`
- **`HDBConnection.java`**: Add `getRouteToTableSpaceReplica()` (random
selection from replica set); modify `executeScan()` to route to replicas
when `allowReadsFromFollowers && tx == 0`
- **`NonMarshallingClientSideConnectionPeer.java`**: Override new
`executeScan` for local mode

### Server-Side
- **`Statement.java`**: Move `allowExecutionFromFollower` field from
`ScanStatement` to base class (enables `SQLPlannedOperationStatement` to
carry the flag)
- **`ScanStatement.java`**: Remove duplicate field (now inherited)
- **`ServerSideConnectionPeer.java`**: Read `allowFollowerReads` from
PDU trailer in `handleOpenScanner()`; set flag on translated plan's main
statement. Add overloaded `executeScan()` for local mode
- **`DBManager.java`**: Fix leader check in `executeStatementAsync()` to
respect `statement.getAllowExecutionFromFollower()` — critical for
`SQLPlannedOperationStatement` path through the Calcite planner

## Test Plan

- [ ] **`PduCodecTest.readObjectsTrailerWithFollowerReads`**: Protocol
flag encoding/decoding for all 4 combinations of `keepReadLocks` x
`allowFollowerReads`
- [ ] **`FollowerReadsRoutingTest`** (4 tests): Config defaults, JDBC
URL parsing, routing to replicas when enabled, no follower reads inside
transactions
- [ ] **`FollowerReadsTest`** (4 tests): Cluster integration — read from
follower, failover to leader when follower stops, no follower reads in
transaction, follower promoted to leader with role change
- [ ] **`FollowerReadsVectorIndexTest`**: End-to-end with 1 leader + 1
follower + 1 indexing service — vector ANN search works via follower
reads
- [ ] Verify existing tests pass: `ChangeRoleTest`,
`SimpleFollowerTest`, `ZKDiscoveryVectorIndexTest`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Introduces a new `server.mode=shared-storage` where data lives on shared
S3 (via the remote file server) and replicas can **serve reads** from a
consistent checkpoint snapshot. Unlike `diskless-cluster` (where
replicas can't serve reads) or `cluster` (where followers replay WAL and
maintain per-node copies), shared-storage replicas consume the leader's
checkpointed state directly from S3. They never write to S3 or
BookKeeper, and are promotable to leader on failover.

## Architecture

```
Leader (server.mode=cluster, server.storage.mode=remote,
        server.checkpoint.publish.to.remote=true):
  WAL          -> BookKeeper
  Pages        -> S3 (existing)
  Checkpoint   -> S3 metadata (TableStatus, IndexStatus, table/index defs)  [NEW]
  LSN notify   -> ZooKeeper                                                  [NEW]

Replica (server.mode=shared-storage):
  ZK watch     -> detects new checkpoint LSN
  Read meta    -> from S3
  Atomic swap  -> rebuild in-memory state under write lock
  Serve reads  -> S3 pages (with per-server local caching)
  On failover  -> promotable to leader, replays BookKeeper WAL from checkpoint
```

## New classes (herddb-remote-file-service)

- **`SharedCheckpointMetadataManager`** — reads/writes checkpoint
metadata on S3 (`TableStatus`, `IndexStatus`, table/index definitions,
checkpoint LSN marker)
- **`ReadReplicaDataStorageManager`** — read-only `DataStorageManager`;
writes throw `UnsupportedOperationException`
- **`PromotableRemoteFileDataStorageManager`** — starts read-only,
promotable to writable `RemoteFileDataStorageManager` on leader election

## Modified (herddb-core + herddb-remote-file-service)

- **`ServerConfiguration`** — `PROPERTY_MODE_SHARED_STORAGE` + replica
config (poll interval, switch timeout, publish-to-remote flag)
- **`RemoteFileDataStorageManager`** — optional
`SharedCheckpointMetadataManager` that publishes metadata to S3 during
checkpoint
- **`MetadataStorageManager`** — adds `publishCheckpointLsn` /
`getCheckpointLsn` / `watchCheckpointLsn` API (no-op defaults)
- **`ZookeeperMetadataStorageManager`** — ZK-based checkpoint LSN
publish/watch at `{basePath}/checkpoints/{tableSpaceUUID}`
- **`TableSpaceManager`** — `CheckpointFollowerThread` (watches ZK,
polls as safety net) + `refreshFromCheckpoint()` (atomic view switch
under write lock) + shared-storage branches in `startAsFollower()` /
`recover()`
- **`DBManager`** — `readOnlyReplica` flag rejects non-read statements
(DDL/DML/transactions) on non-leader shared-storage replicas
- **`Server`** — wires shared-storage mode across metadata / storage /
commit-log / node-id managers

## Consistency model

- Replicas serve reads at **checkpoint granularity** (default ~15 min,
tunable via `server.checkpoint.period`)
- No dirty reads (checkpoints only include committed state)
- No read-your-writes across leader/replica
- On checkpoint switch: write lock blocks new queries; in-flight reads
complete against old state; all cached pages for each table are evicted
to pick up changes

## Edge cases handled

- BLink/BRIN index consistency during switch (closed + rebooted from new
checkpoint)
- Table create/drop/alter detected via checkpoint metadata diff
- Replica boot before first checkpoint: boots empty, picks up state via
ZK watch
- Failover WAL gap: leader doesn't trim ledgers past last checkpoint,
new leader replays the gap
- S3 unavailability: cached pages still served, checkpoint refresh
retries with backoff

## Configuration

**Leader:**
```properties
server.mode=cluster
server.storage.mode=remote
server.checkpoint.publish.to.remote=true   # NEW
```

**Replica:**
```properties
server.mode=shared-storage                 # NEW mode
remote.file.servers=fileserver1:9846,fileserver2:9846
server.zookeeper.address=zk:2181/herddb
server.base.dir=/var/herddb/data           # needed for failover promotion
```

## Test plan

- [ ] Unit-test `SharedCheckpointMetadataManager` round-trip via mock
`RemoteFileServiceClient`
- [ ] Unit-test `ReadReplicaDataStorageManager` (reads delegate, writes
throw UOE)
- [ ] Integration test: leader inserts + checkpoints -> replica sees
rows after refresh
- [ ] Integration test: DDL propagation across checkpoint boundary
- [ ] Integration test: failover from replica to leader, WAL replay of
gap
- [ ] Integration test: concurrent queries during checkpoint switch (no
partial views)
- [ ] Integration test: replica boot before first checkpoint
- [ ] Existing `DisklessClusterTest`, `RemoteFileServiceClientTest`,
`RemoteFileBrinIndexRecoveryTest` still pass (verified locally)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Addresses production issues from a 1B-vector BigANN ingest (see
@/dev/debug-server): Phase B running 20+ min serially, disk filling up
with ~800M on-disk vectors across 416 unbounded sealed segments plus 417
resident mmap'd tmp files, then ENOSPC + unbounded retry loops.

### P0 — correctness & unblocking
- Track pageIds allocated during Phase B; on failure, roll them back via
new `DataStorageManager#deleteIndexPage` (implemented in
`FileDataStorageManager` + `MemoryDataStorageManager`, no-op default for
BookKeeper/remote).
- Exponential backoff (60s → 30min) in `compactionLoop` keyed on
`consecutiveCheckpointFailures` — ENOSPC retries no longer burn CPU.

### P1 — Phase B wall-clock
- Parallelise Phase B segment builds
(`herddb.vectorindex.phaseBSegmentParallelism`, default 2).
- New `PageStoreReader` with a bounded LRU page cache replaces the
per-segment resident mmap'd tmp file. **No mmap** because mmap regions
are not counted against the heap budget. Observed before: 417
`herddb-vector-*.tmp` files (one per segment) silently doubling on-disk
footprint; after: tmp dir stays near-empty.

### P2 — bound segment count & ingest rate
- `chooseSegmentsToDemote`: when sealed count > threshold (default 32),
demote the smallest sealed segments to be merged in the next Phase B.
- Absolute cap `herddb.vectorindex.maxLiveVectorsPerCheckpoint` breaks
the positive-feedback loop between slow checkpoints and growing live
backlog.

### P3 — observability
- New metrics: `sealed_segment_count`, `phase_b_bytes_written`,
`phase_b_vectors_per_second`, `checkpoint_consecutive_failures`,
`checkpoint_total_failures`, `rolled_back_pages_(total|last)`,
`tmp_dir_bytes`, `free_disk_bytes`.
- Grafana dashboard `indexing-service-dashboard.json` gains two new
rows.

## Test plan
- [x] `PersistentVectorStoreBackoffTest` (5 tests) — pure backoff policy
incl. overflow
- [x] `PageStoreReaderTest` (16 tests) — byte-level round-trip, LRU
eviction, EOF, wrong chunk type, missing page, shared cache, concurrent
readers
- [x] `PersistentVectorStoreMergePolicyTest` (7 tests) — demotion
below/at/above threshold, smallest-first, batch cap
- [x] `PersistentVectorStoreCapComputationTest` (14 tests, +4 new) —
absolute cap interaction with memory-derived cap
- [x] `PersistentVectorStoreFailureRecoveryTest` (9 tests) — injected
DSM failures, page rollback, failure counters, delete-failure tolerance,
metrics, backoff observability
- [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) —
multi-segment Phase B, reopen recovery, partial failure, merge pressure,
tmp dir clean
- [x] Existing suite regression: `PersistentVectorStoreLifecycleTest`,
`PersistentVectorStoreConcurrentCheckpointTest`,
`PersistentVectorStoreSearchTest`,
`PersistentVectorStoreBLinkCleanupTest`,
`PersistentVectorStoreBackpressureLivelockTest`,
`PersistentVectorStoreConfigTest` — all pass
- [ ] Re-run the BigANN load end-to-end on a scaled-down replica with
tight disk quota to verify (a) no ENOSPC, (b) sealed segment count
stabilises, (c) no lingering `herddb-vector-*.tmp` files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rvices zip (#8)

## Summary

- Packages the `vector-testings` shaded uber-jar into
`herddb-services.zip` (under `vector-testings/`, excluded from `lib/` to
avoid ~58MB class duplication), so the Jib-built Docker image ships
`VectorBench` at `/opt/herddb/vector-testings/vector-testings-<ver>.jar`
plus a new `bin/vector-bench.sh` wrapper.
- Adds an opt-in `vectorBench` Helm section: 1-replica `StatefulSet`
(sleeps `infinity` for `kubectl exec`) with a 100Gi
`volumeClaimTemplates` PVC mounted at `/opt/herddb/vector-datasets`, and
`persistentVolumeClaimRetentionPolicy: {whenDeleted: Retain, whenScaled:
Retain}` so downloaded datasets (sift1m, gist1m, deep-image-96, …)
survive helm upgrades and uninstall/reinstall cycles.
- `Config.java` `--dataset-dir` now falls back to
`$VECTORBENCH_DATASET_DIR` when not passed explicitly, so the
StatefulSet sets the path once via env and users don't have to repeat
`--dataset-dir` on every `kubectl exec`.

Disabled by default (`vectorBench.enabled: false`).

Usage:
```bash
helm upgrade --set vectorBench.enabled=true ...
kubectl exec -it <release>-vector-bench-0 -- \
    /opt/herddb/bin/vector-bench.sh --url "\$HERDDB_JDBC_URL" --dataset sift1m
```

## Test plan

- [x] ``mvn -pl vector-testings,herddb-services -am package
-DskipTests`` — zip contains
``vector-testings/vector-testings-<ver>.jar`` (58MB) +
``bin/vector-bench.sh``, and NO ``lib/vector-testings*.jar``
- [x] ``helm lint --set vectorBench.enabled=true`` — clean
- [x] ``helm template --set vectorBench.enabled=true`` — StatefulSet
renders with correct env, PVC and retention policy
- [ ] End-to-end on k3s: install, confirm
``datasets-<release>-vector-bench-0`` PVC is Bound, run
``vector-bench.sh --dataset sift1m`` inside the pod, then ``helm
upgrade`` and verify dataset files in ``/opt/herddb/vector-datasets``
persist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Plumbs together HerdDB server (\`server.storage.mode=remote\`),
RemoteFileServer (\`storage.mode=s3\`), and the IndexingServer
(\`indexing.storage.type=remote\`) so that a single MinIO bucket backs
both table data and vector-index data. Only the commit log, metadata,
and the indexing service's watermark/checkpoint cache stay local.

### Fixes discovered while building the test
- **VectorIndexManager held a stale client after restart.** Capturing
\`RemoteVectorIndexService\` by value at index-open time meant
restarting the IndexingServer (new gRPC client) left ANN search failing
with \"Channel shutdown invoked\". Changed to a
\`Supplier<RemoteVectorIndexService>\` that resolves lazily via
\`DBManager\`.
- **IndexingServer wiped \`RemoteFileDataStorageManager\` state on
restart.** It passed \`dataDir\` as both metadataDir and tmpDir;
\`FileDataStorageManager\` cleans tmpDir on start()/close(), nuking
checkpoint markers. Fixed by splitting into
\`{dataDir}/remote-metadata\` (preserved) and \`{dataDir}/remote-tmp\`
(scratch).
- **\`RemoteFileDataStorageManager.deleteIndexPage\` was missing.** The
Phase-B rollback path from PR #7 was a no-op over S3, leaking objects on
checkpoint failure. Added an override calling \`client.deleteFile\` on
the remote index-page path.

### End-to-end test
\`IndexingServiceWithS3E2ETest\` in \`herddb-services\`:
- \`@ClassRule\` MinIO testcontainer amortised across methods; per-test
S3 prefix cleaned in \`@After\`.
- Full topology per test: RemoteFileServer(s3) + Server(remote) +
IndexingServer(remote), all talking to one MinIO bucket.
- 5 scenarios: search before/after checkpoint, restart-before-checkpoint
(crash + replay), restart-after-checkpoint (300 random vectors, real
FusedPQ checkpoint, stop + restart + verify segments reload from S3),
tmp-dir-stays-local guard rail.

### Known gap (follow-up)
The indexing service still needs \`watermark.dat\` +
\`remote-metadata/\` to survive restart. For truly ephemeral pods (disks
wiped on restart) both need to be reconstructible from S3/ZK/BK.
\`SharedCheckpointMetadataManager\` already publishes the right paths
for the read-replica flow; wiring a bootstrap-from-S3 read path and
moving the watermark off local disk is deferred.

## Test plan
- [x] \`RemoteFileDataStorageManagerBasicTest\` — 3 new cases for delete
+ idempotency
- [x] \`IndexingServiceWithS3E2ETest\` — 5 scenarios, ~45s end-to-end
- [x] Regression: \`StaticDiscoveryRemoteFileIndexingTest\`,
\`ZKDiscoveryRemoteFileIndexingTest\`,
\`PersistentVectorStoreFailureRecoveryTest\`,
\`PersistentVectorStoreLifecycleTest\`,
\`PersistentVectorStoreConcurrentCheckpointTest\`,
\`PersistentVectorStoreSearchTest\` — all green locally
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Adds OIDC/JWT bearer-token authentication across HerdDB:

- **JDBC client → HerdDB server** via the new **OAUTHBEARER** SASL
mechanism (RFC 7628), running over the existing PDU handshake alongside
PLAIN/DIGEST-MD5/Kerberos
- **HerdDB server → HerdDB server** (leader/follower, inter-cluster)
- **File server (gRPC)** and **Indexing service (gRPC)** — which
previously had **no authentication at all** — now both enforce
`Authorization: Bearer <jwt>` via a shared gRPC interceptor

JWTs are validated against the OIDC provider's JWKS (cached,
auto-refreshing) using Nimbus JOSE+JWT. Clients obtain tokens via the
OAuth2 **client_credentials** grant (with automatic refresh) or can pass
a pre-acquired token directly. Works with Keycloak, Okta, Auth0, etc.

Built as **7 incremental commits**, each with passing unit tests:

1. `herddb-auth-oidc` foundation module — `OidcConfiguration`
(discovery), `OidcTokenValidator`, `OidcTokenProvider`,
`PrincipalExtractor` (20 tests)
2. **OAUTHBEARER** SASL mechanism — `SaslServer`/`SaslClient` impls +
JVM provider + `TokenAuthenticator` callback (7 tests)
3. JDBC client + HerdDB server wiring — `ClientConfiguration` /
`ServerConfiguration` properties, `HDBClient` shared
`OidcTokenProvider`, `Server` validator setup (5 E2E tests)
4. File server gRPC — shared
`JwtAuthServerInterceptor`/`JwtAuthClientInterceptor`,
`RemoteFileServer`/`RemoteFileServiceClient` integration (3 + 6 tests)
5. Indexing service gRPC — same pattern for
`IndexingServer`/`IndexingServiceClient` (3 tests)
6. End-to-end Keycloak test via Testcontainers (`KeycloakOidcE2ETest`) —
real Keycloak container, realm + client provisioned via admin REST API;
skips gracefully when Docker is unavailable
7. `AUTHENTICATION.md` documenting all HerdDB auth modes + JDBC URL
examples + troubleshooting, plus a
`herddb-services/src/main/demo/keycloak/` docker-compose for demos

**46 new tests**, all existing SASL PLAIN / DIGEST-MD5 tests still pass.

## Configuration at a glance

Server:
```properties
oidc.enabled=true
oidc.issuer.url=https://keycloak.example.com/realms/herddb
oidc.audience=herddb-api              # optional
oidc.username.claim=preferred_username # optional
```

JDBC client:
```
jdbc:herddb:server:host:7000?client.auth.mech=OAUTHBEARER&\
oidc.issuer.url=https://keycloak.example.com/realms/herddb&\
oidc.client.id=herddb-jdbc&oidc.client.secret=<secret>
```

## Test plan

- [x] `mvn -pl herddb-auth-oidc test` — 33 tests green (validator,
provider, discovery, OAUTHBEARER SASL roundtrip, gRPC interceptors)
- [x] `mvn -pl herddb-core -Dtest=OidcAuthTest test` — 5 E2E tests green
(valid creds, wrong secret, wrong issuer, missing config fails fast,
pre-acquired token)
- [x] `mvn -pl herddb-remote-file-service
-Dtest=RemoteFileServerOidcTest test` — 3 tests green
- [x] `mvn -pl herddb-indexing-service -Dtest=IndexingServiceOidcTest
test` — 3 tests green
- [x] `mvn -pl herddb-core -Dtest=JASSPLAINTest,JAASMD5Test test` —
existing SASL tests still pass (no regression)
- [x] Full clean build of reactor: `mvn clean install -DskipTests`
across all touched modules
- [ ] Keycloak E2E (`KeycloakOidcE2ETest`) — passes against real
Keycloak container in CI; skipped locally without Docker
- [ ] `docker compose up` in `herddb-services/src/main/demo/keycloak/`
boots a realm with `herddb-jdbc`, `herddb-file`, `herddb-index`,
`herddb-server` service-account clients

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Large FusedPQ graph files (hundreds of GB) can now be stored and retrieved as
multipart files (.multipart/ dirs with numbered 4 MB block files) instead of
thousands of 1 MB HerdDB index pages.

- New RPC `WriteFileBlock(WriteFileBlockRequest) returns (WriteFileBlockResponse)`:
  stores a single block of a multipart file by path + blockIndex
- New RPC `ReadFileRange(ReadFileRangeRequest) returns (ReadFileRangeResponse)`:
  retrieves a byte range from a multipart file (server computes blockIndex from
  offset / blockSize, returns the slice from that block)

- Switched from `ServerBuilder` to `NettyServerBuilder` so we can wire up a
  `NioEventLoopGroup` directly and call `setIoRatio()` on it
- ioRatio defaults to 70, overridable via system property
  `herddb.fileserver.netty.ioRatio`
- blockSize (default 4 MB) read from `block.size` config property;
  `maxInboundMessageSize` set to blockSize + 1 MB
- Required adding both `bossEventLoopGroup` + `channelType(NioServerSocketChannel)`
  alongside `workerEventLoopGroup` to satisfy Netty's "all or none" contract

- Added `MULTIPART_SUFFIX = ".multipart"` constant
- New abstract methods: `writeBlock`, `readRange`, `deleteLogical`, `listLogical`
- `LocalObjectStorage`: blocks stored as `{path}.multipart/{blockIndex}` files;
  `readRange` loads the block file and slices; `deleteLogical` recursively removes
  the `.multipart/` directory; `listLogical` deduplicates via LinkedHashSet
- `S3ObjectStorage`: each block is a separate S3 object at key
  `{prefix}/{path}.multipart/{blockIndex}`; `readRange` downloads the whole block
  object and returns the requested slice; `deleteLogical` removes all block objects
  via deleteByPrefix + deletes the single-file key in parallel
- `CachingObjectStorage`: `writeBlock` writes to inner and caches the block;
  `readRange` serves from cache (block key = `{path}.multipart/{blockIndex}`);
  `deleteLogical` invalidates all matching cache entries then delegates

- New `writeFileBlock` handler: calls `storage.writeBlock()` with metrics
  (writeblock_requests, writeblock_errors, writeblock_bytes, writeblock_latency)
- New `readFileRange` handler: calls `storage.readRange()` with metrics
  (readrange_requests, readrange_errors, readrange_bytes, readrange_latency)
- `deleteFile` now calls `storage.deleteLogical()` (handles both single files and
  multipart dirs)
- `listFiles` now calls `storage.listLogical()` (returns collapsed logical paths)

- Block routing key = `path + "#block" + blockIndex` for consistent hash, so
  blocks of the same logical file distribute across file servers
- `writeFileBlock(Async)`: unary RPC routed by block key
- `readFileRange(Async)`: handles cross-block requests by splitting into two
  sequential calls (each routed independently)
- `writeMultipartFile(path, InputStream, blockSize)`: reads InputStream in
  blockSize chunks, calls `writeFileBlock` sequentially, returns total bytes written
- `listFiles` aggregation uses `LinkedHashSet` to deduplicate logical paths returned
  from multiple servers
- `buildChannel` sets `maxInboundMessageSize(5 MB)` so large block responses fit

- Implements `io.github.jbellis.jvector.disk.RandomAccessReader`
- Buffers one block at a time; `ensureBlockLoaded()` fetches via `readFileRange`
  only when the current block changes (amortises network round trips for sequential
  reads within a block)
- Cross-block `readFully` handled transparently in a loop
- All primitive decode methods (`readInt`, `readFloat`, `readLong`) implemented via
  `readFully` to get big-endian byte reads
- Nested `Supplier implements ReaderSupplier` creates independent reader instances
  per thread (jvector calls `get()` once per search thread)

- `writeMultipartIndexFile(tableSpace, uuid, fileType, tempFile)`: streams the
  temp file to remote servers via `client.writeMultipartFile()`; returns logical path
- `multipartIndexReaderSupplier(tableSpace, uuid, fileType, fileSize)`: returns a
  `RemoteRandomAccessReader.Supplier` backed by the client and logical path
- `deleteMultipartIndexFile(tableSpace, uuid, fileType)`: calls `client.deleteFile()`
  on the logical path (which routes through `deleteLogical` on the server)
- Config property `remote.file.multipart.block.size` (default 4 MB)

- Added default no-op methods `writeMultipartIndexFile`, `multipartIndexReaderSupplier`,
  `deleteMultipartIndexFile` so non-remote implementations are unaffected

- Added `graphFilePath`, `graphFileSize`, `mapFilePath`, `mapFileSize` fields for
  multipart mode (populated by `PersistentVectorStore` after a multipart write)

- `SegmentWriteResult` gains a multipart constructor (paths + sizes instead of page
  ID lists) and `isMultipart()` predicate
- `writeFusedPQGraphToTempFile` / `writeFusedPQMapDataToTempFile`: build the jvector
  graph / BLink map to a local temp file and return the Path
- `writeOneSegmentData`: tries multipart first (if DSM supports it), falls back to
  page-based chunks
- Metadata format v3 (`METADATA_VERSION_MULTI_SEGMENT`): each segment now serialized
  with a `useMultipart` byte flag; multipart segments store `graphFilePath`,
  `graphFileSize`, `mapFilePath`, `mapFileSize` via `DataOutputStream.writeUTF()`;
  page-based segments still store the page ID arrays
- `loadMultiSegmentFormat` reads the new format via `DataInputStream`
- `loadFusedPQSegment` uses `multipartIndexReaderSupplier` when `seg.graphFilePath`
  is set, otherwise falls back to the `PageStoreReader` path
- `readMultipartMapDataToTempFile`: downloads multipart map data via
  `RemoteRandomAccessReader` to a local temp file for BLink loading

- `LocalObjectStorageTest`: writeBlock/readRange/deleteLogical/listLogical (4 new)
- `RemoteFileServiceTest`: WriteFileBlock+ReadFileRange gRPC round trip, writeMultipartFile
  end-to-end, multipart list+delete (3 new)
- `RemoteRandomAccessReaderTest` (new file): seek+readFully, cross-block read,
  readInt/readFloat/readLong, Supplier creates independent readers, length() (7 tests)
- `CachingObjectStorageTest.FakeObjectStorage` updated to implement new abstract methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… configured

VectorIndexManager.doStart() now calls remoteService() eagerly so that
CREATE VECTOR INDEX fails immediately with a clear error message
("RemoteVectorIndexService is required") instead of succeeding silently
and only crashing later when the first operation is attempted.

Fixes VectorIndexTest.testFailsWithoutRemoteService.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PromotableRemoteFileDataStorageManager (#11)

ReadReplicaDataStorageManager now implements multipartIndexReaderSupplier so that
shared-storage read replicas can serve vector search queries on indexes stored as
multipart files (written by the leader via RemoteFileDataStorageManager).

- ReadReplicaDataStorageManager.multipartIndexReaderSupplier: constructs a
  RemoteRandomAccessReader.Supplier using client.getBlockSize() and the same
  {tableSpace}/{uuid}/multipart/{fileType} path convention as the writer

- PromotableRemoteFileDataStorageManager: added overrides for all three multipart
  methods (writeMultipartIndexFile, multipartIndexReaderSupplier,
  deleteMultipartIndexFile) that delegate to activeDelegate, so that the correct
  implementation is used regardless of whether the node is currently a replica or
  a promoted leader

- RemoteRandomAccessReaderTest: added testReadReplicaMultipartIndexReaderSupplier
  which creates a ReadReplicaDataStorageManager with a small block size (64 bytes),
  writes a 3-block multipart file, and verifies: full sequential read, correct
  total length, and two independent readers at different seek positions

Fixes a bug where read replicas would use the no-op base-class multipartIndexReaderSupplier
(returning null) and therefore fail to open vector index segments stored as multipart files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#12)

## Summary
- **VectorIndexEndToEndTest**: Fixed tablespace UUID mismatch between
Server and EmbeddedIndexingService that caused `waitForCatchUp()` to
loop forever — the indexing service was tailing a non-existent txlog
directory
- **NettyChannelAcceptor.close()**: Fixed JVM hang after tests complete
by using `shutdownNow()` + `awaitTermination()` for the callback
executor and bounded graceful shutdown for Netty event loops
- **IndexingServiceClient.waitForInstanceCatchUp()**: Added 5-minute
timeout to prevent infinite polling loops

## Test plan
- [x] herddb-jdbc: 78 tests pass locally
- [x] herddb-net: 7 tests pass locally  
- [x] herddb-services: 25 tests pass locally (including
VectorIndexEndToEndTest which previously hung)
- [ ] Monitor CI workflow `pr-validation-test-other.yml`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Makes the indexing service's local disk fully reconstructible from S3
when running with `storage.type=remote`, so a Kubernetes pod can restart
with an empty volume and resume from the last published checkpoint.

## What this fixes

Two pieces of local state were previously unrecoverable after a disk
wipe:

1. **`watermark.dat`** — last-processed commit-log LSN, written only to
the local disk.
2. **`remote-metadata/`** — IndexStatus checkpoint markers that
`PersistentVectorStore` reads on startup to find its sealed S3 segments.
If wiped, the store can't discover what it had already checkpointed to
S3.

There was also a latent bug: `IndexingServiceEngine` only configured the
`PersistentVectorStore` factory for `storage.type=file`, silently
falling back to `InMemoryVectorStore` for `storage.type=remote`.

## Changes

- **`WatermarkStore`** is now an interface; `LocalWatermarkStore` keeps
the previous disk impl, and new **`S3WatermarkStore`** persists the
watermark to `{tableSpace}/_indexing/{instanceId}/watermark.lsn` with an
XXHash64 footer. Monotonicity is enforced on save.
- **Save contract**: the watermark is written **only** after a
successful `indexCheckpoint` — never on a timer, entry count, or
shutdown. This keeps the stored LSN always covered by a durable
IndexStatus on S3 and bounds S3 writes to ≤ 1 per checkpoint.
`IndexingServiceEngine.checkpointAndSaveWatermark()` triggers a
checkpoint every 5000 entries and only advances the watermark after all
stores succeed.
- **`SharedCheckpointMetadataManager.hydrateLocalMetadataDir`**
downloads every `_metadata/*.{index,table}status` object from S3 into a
staging dir and atomically moves it into place. The byte-for-byte
on-disk format matches `FileDataStorageManager`'s, so no deserialization
is needed.
- **`IndexingServer`** now wires a `SharedCheckpointMetadataManager`
onto its `RemoteFileDataStorageManager` (so every `indexCheckpoint`
publishes its IndexStatus to S3), and runs hydrate + installs the
`S3WatermarkStore` via a new `afterTableSpaceResolved` hook on the
engine.
- **Commit-log tailer** always starts from `START_OF_TIME` so DDL
entries are reprocessed on every boot and the in-memory `SchemaTracker`
is rebuilt. DML is replayed too but is idempotent (`addVector` replaces
by PK); previously-sealed segments load directly from S3 via
`PageStoreReader`, so replay only has to re-add uncheckpointed vectors.
A future PR can skip DML with LSN ≤ watermark while still reprocessing
DDL.
- **`IndexingServiceEngine`** now configures the `PersistentVectorStore`
factory for both `storage.type=file` AND `storage.type=remote`.

## Tests

- `S3WatermarkStoreTest` (new) — 10 unit tests with fault injection:
absent-returns-START_OF_TIME, roundtrip, path format, monotonicity,
equal-LSN idempotency, corruption detection, truncation detection,
overwrite-after-corruption, read-failure, write-failure.
- `IndexingServiceWithS3E2ETest` — updated to wipe the entire indexing
data dir (watermark.dat AND remote-metadata/) between stop and restart,
exercising the hydrate + watermark-from-S3 path. Also asserts
`watermark.lsn` exists on S3, not locally.
- All 141 indexing-service tests and 85 remote-file-service tests still
pass.

## Known follow-up

- DML replay from START_OF_TIME on every restart is correct but
wasteful; a future PR can persist a lightweight schema snapshot
alongside the watermark and skip already-applied DML.
- Full cluster E2E test (ZK + BK + 2 servers in cluster mode + indexing
+ restart everything except ZK/BK) was scoped out of this PR to keep it
reviewable.

## Test plan

- [x] `mvn -pl herddb-indexing-service test -Dtest=S3WatermarkStoreTest`
- [x] `mvn -pl herddb-services test -Dtest=IndexingServiceWithS3E2ETest`
- [x] `mvn -pl herddb-indexing-service test` (all 141 pass)
- [x] `mvn -pl herddb-remote-file-service test` (all 85 pass)
- [x] `mvn -pl herddb-services test
-Dtest='StaticDiscoveryRemoteFileIndexingTest,ZKDiscoveryRemoteFileIndexingTest'`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- **CheckpointExecuteTest.checkpointRequiresTablespaceParam**: the error
message changed to "CHECKPOINT requires 1 or 2 parameters" when the
optional timeout parameter was added, but the test still expected the
old "requires one parameter" message.
- **SysindexstatusTest.testSysindexstatusVector**: since commit
82efba7, embedded vector indexing is no longer supported —
`VectorIndexManager.doStart()` now fails fast if
`RemoteVectorIndexService` is not configured. The test now provides a
`MockRemoteVectorIndexService` and asserts on the remote
`IndexStatusInfo` fields (`vectorCount`, `segmentCount`, `status`)
instead of the old embedded properties.

The other two CI failures (SimpleFollowerTest,
MultipleConcurrentUpdatesTest) pass consistently locally and appear to
be CI-environment flakiness rather than deterministic bugs.

## Test plan
- [x] `CheckpointExecuteTest` — all methods pass
- [x] `SysindexstatusTest` — all methods pass (including vector test)
- [x] `MultipleConcurrentUpdatesTest` — passes 3/3 runs locally
- [x] `SimpleFollowerTest` — passes locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- **BookkeeperCommitLog bug fix**: In
`BKFollowerContext.ensureOpenReader()`, the "find next ledger" loop at
line 919 was assigning the found ledger ID to `ledgerToTail` instead of
`nextLedger`. This meant `nextLedger` always remained `-1`, causing the
follower to null out its `currentLedger` and take the "no more ledgers"
path every time it needed to switch between closed/fully-read ledgers.
This bug could cause the follower to reopen the same ledger repeatedly
with incorrect `nextEntryToRead` offsets or fail to advance to new
ledgers promptly.
- **SimpleFollowerTest stabilization**: The test's auto-transaction uses
`sync=false` (deferred) BK writes, meaning entries may not be visible to
the follower immediately. Added a sync write (`NO_TRANSACTION` INSERT)
after the transaction to guarantee BK visibility before the wait begins.
Also added a fail-fast `isFailed()` check inside the wait condition and
a descriptive timeout message for better CI diagnostics.

## Test plan
- [x] `SimpleFollowerTest` — all 5 tests pass
- [x] Verified the ledger switching fix is correct by code review
- [ ] CI should confirm the SimpleFollowerTest no longer times out

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Fix `estimatedSizeBytes` in the page-based write path to include both
graph and map pages (was only counting graph pages), which caused
segments to not seal properly and underestimated `avgBytesPerVector` in
subsequent checkpoints
- Add on-disk index size metric (`ondisk_estimated_size_bytes`) and
per-segment size distribution metrics
(`segment_size_min/median/max_bytes`) for Grafana observability
- Rename misleading `estimated_size_bytes` metric to
`live_vectors_estimated_memory_bytes` (it reports in-memory usage, not
disk size)
- Add two new Grafana panels: "On-Disk Index Size" and "Segment Size
Distribution" (min/p50/max)

## Test plan
- [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) — passes
- [x] `PersistentVectorStoreSearchTest` (12 tests) — passes
- [ ] Verify new metrics appear in Prometheus after deploying the
indexing service
- [ ] Verify Grafana dashboard renders the new panels correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eolivelli eolivelli closed this Apr 7, 2026
@eolivelli
Copy link
Copy Markdown
Contributor Author

I have sent the PR to the wrong repo :-)
please ignore

@eolivelli eolivelli deleted the vector-index-async branch April 7, 2026 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant