Merge Vector Search into master branch by eolivelli · Pull Request #837 · diennea/herddb

eolivelli · 2026-04-07T09:59:06Z

This PR is to validate CI before merging the Vector Search branch into master

…arameters ## Features ### Vector index (ANN search) - New index type `VECTOR` backed by jvector (Vamana-style ANN graph). - Indexed column must be of type `floata` (FLOAT ARRAY). - CREATE / INSERT / UPDATE / DELETE path fully implemented. - ANN query via `ann_of(col, ?)` scalar function: - `ORDER BY ann_of(col, CAST(? AS FLOAT ARRAY)) DESC LIMIT k` is intercepted by CalcitePlanner and routed through the new VectorANNScanOp planner operator when a vector index exists. - Falls back to brute-force cosine similarity when no index is present. - Optional WHERE filter applied after PK fetch. ### Configurable hyperparameters via WITH clause - Parameters `m`, `beamWidth`, `neighborOverflow`, `alpha`, `similarity` (cosine / euclidean / dot), and `fusedPQ` can be set at index creation: CREATE VECTOR INDEX vidx ON tblspace1.t1(vec) WITH m=32 beamWidth=200 similarity=cosine fusedPQ=true; - Index.java now carries a `Map<String,String> properties` field, serialised as format version 2 (version 1 files read with empty map). - JSQLParserPlanner strips the WITH suffix before JSQLParser sees it (JSQLParser does not support this syntax), parses key=value pairs, and transfers them to the Index.Builder via a ThreadLocal. ### FusedPQ on-disk format (hybrid architecture) - When fusedPQ=true (default), dimension ≥ 8, and ≥ 256 active vectors, checkpoints use jvector's OnDiskGraphIndex with FusedPQ + InlineVectors features for faster approximate scoring at search time. - On restart the on-disk graph is loaded via a ByteBuffer-backed ReaderSupplier; new inserts after restart go to a fresh in-memory GraphIndexBuilder. - Search is hybrid: on-disk graph (FusedPQ scoring + InlineVectors reranking) and live in-memory graph are searched independently; results are merged by score descending before returning the top-K. - Deleted on-disk nodes are tracked in onDiskNodeToPk / onDiskPkToNode and excluded at search time via an acceptBits filter. - Falls back to the legacy OnHeapGraphIndex format for small indexes (< 256 vectors or dimension < 8) or when fusedPQ=false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Intercept ORDER BY ann_of(col, ?) in JSQLParserPlanner before planSort() and route through VectorANNScanOp, matching CalcitePlanner behaviour. - VectorANNScanOp: support null scanProjection/innerProjection (full table row passthrough); null-safe fallback (throws StatementExecutionException when no index and no fallback op) - JSQLParserPlanner: add tryBuildVectorANNSortOp() that detects the ann_of() pattern in a single ORDER BY element and builds VectorANNScanOp with null projections; falls through to planSort() for all other cases - Add VectorIndexJSQLParserPlannerTest covering: index present, WHERE + ANN filter, and no-index error path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Tuple - New vector-testings/ module: standalone JDBC benchmark for vector index ingestion throughput, index build time, ANN query latency, and recall@K - Supports SIFT-1M, SIFT-10M, and BIGANN-1B datasets (fvecs + bvecs formats) - Configurable parallelism, batch size, index parameters, and top-K - Fix Tuple.serialize() to handle Float values (returned by ann_of()) by converting to Double before serialization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Introduce a volatile snapshot swap pattern (ServerSnapshot inner class) to allow lock-free reads in hot-path methods while supporting safe dynamic server list updates via updateServers(). Existing channels are reused when servers are retained, and removed channels are shut down gracefully in a background daemon thread. Also adds the DynamicServiceClient interface in herddb-core for generic dynamic update support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ation Add ServiceDiscoveryListener interface and service registration/discovery methods for indexing services and file servers. The ZooKeeper implementation uses ephemeral nodes for automatic deregistration on disconnect and watchers for push-based change notifications. Services are re-registered after ZK session expiry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Refactor RemoteFileServiceClient to support dynamic server list updates via a volatile snapshot swap pattern. This allows changing the set of remote file servers at runtime without restarting the client. - Add DynamicServiceClient interface in herddb-core - Introduce ServerSnapshot inner class holding router + channels - Replace final router/channels fields with volatile snapshot - Extract buildChannel() helper for channel creation - Add updateServers() with channel reuse and graceful shutdown - Update all stub methods and fan-out operations to read snapshot - Add RemoteFileServiceClientDynamicTest with 3 test cases Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eeper Enable IndexingServer and RemoteFileServer to register/unregister themselves in MetadataStorageManager on startup/shutdown, allowing service discovery via ZooKeeper in cluster mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…service mode Introduce server.storage.mode property (local|remote|bookkeeper) to independently select the DataStorageManager. Remove PROPERTY_MODE_REMOTE_FILE_SERVICE. The remote file storage is now activated via server.storage.mode=remote with any compatible server.mode (standalone or cluster). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Verify that IndexingService instances register in ZooKeeper and can be discovered dynamically by the HerdDB server. Two test methods cover the single-instance and two-instance (sharded) scenarios, both performing ANN queries against orthogonal 4D vectors to confirm correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Verifies that a RemoteFileServer registered in ZK is discovered by a HerdDB server running in cluster mode with server.storage.mode=remote, and that basic table CRUD operations work with remote page storage. Includes a two-server variant that tests consistent hashing across multiple file servers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- ServerMain: after server.start(), try ZK discovery for indexing services if indexing.service.servers is not statically configured. Register a ServiceDiscoveryListener for dynamic updates. - Server.buildDataStorageManager(): for storage.mode=remote, try ZK discovery for file servers if remote.file.servers is not configured. Register listener for dynamic updates via DynamicServiceClient. - Change PROPERTY_REMOTE_FILE_SERVERS_DEFAULT to empty string since the old remote-file-service mode was removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test-start-fileserver-cluster.sh: use server.storage.mode=remote instead of server.mode=remote-file-service - Helm chart: condition file-server resources on server.storageMode=remote instead of server.mode=remote-file-service - REMOTE_FILE_SERVER.md: update configuration examples Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary - Add `client.allowReadsFromFollowers` configuration flag (default `false`) that enables distributing SELECT queries across all replicas (leader + followers) when in AUTOCOMMIT mode (`tx == 0`) - Writes always go to the leader; reads inside transactions always go to the leader - On follower read failure, automatic retry picks a different replica via existing `requestMetadataRefresh` mechanism - Backward-compatible wire protocol extension using bitmask trailer in `OpenScanner` PDU ## Changes ### Wire Protocol (`herddb-utils`) - **`Pdu.java`**: Add `FLAGS_OPENSCANNER_ALLOW_FOLLOWER_READS = 8` (bit 3, composable with existing bit 2) - **`PduCodec.java`**: Add `OpenScanner.write()` overload with `allowFollowerReads` param; trailer uses bitmask OR. Add `isAllowFollowerReads()` decoder ### Client Configuration (`herddb-core`) - **`ClientConfiguration.java`**: Add `client.allowReadsFromFollowers` property (works via JDBC URL query params) - **`HDBClient.java`**: Store and expose `allowReadsFromFollowers` flag ### Metadata Provider - **`ClientSideMetadataProvider.java`**: Add `getTableSpaceReplicas()` default method returning `Set<String>` - **`ZookeeperClientSideMetadataProvider.java`**: Cache `TableSpace.replicas` alongside leader in `readAsTableSpace()`; implement `getTableSpaceReplicas()`; clear replica cache on metadata refresh ### Client Routing - **`ClientSideConnectionPeer.java`**: Add `executeScan` overload with `allowFollowerReads` parameter (default method delegates to old signature) - **`RoutedClientSideConnection.java`**: Implement new overload, pass flag to `PduCodec.OpenScanner.write()` - **`HDBConnection.java`**: Add `getRouteToTableSpaceReplica()` (random selection from replica set); modify `executeScan()` to route to replicas when `allowReadsFromFollowers && tx == 0` - **`NonMarshallingClientSideConnectionPeer.java`**: Override new `executeScan` for local mode ### Server-Side - **`Statement.java`**: Move `allowExecutionFromFollower` field from `ScanStatement` to base class (enables `SQLPlannedOperationStatement` to carry the flag) - **`ScanStatement.java`**: Remove duplicate field (now inherited) - **`ServerSideConnectionPeer.java`**: Read `allowFollowerReads` from PDU trailer in `handleOpenScanner()`; set flag on translated plan's main statement. Add overloaded `executeScan()` for local mode - **`DBManager.java`**: Fix leader check in `executeStatementAsync()` to respect `statement.getAllowExecutionFromFollower()` — critical for `SQLPlannedOperationStatement` path through the Calcite planner ## Test Plan - [ ] **`PduCodecTest.readObjectsTrailerWithFollowerReads`**: Protocol flag encoding/decoding for all 4 combinations of `keepReadLocks` x `allowFollowerReads` - [ ] **`FollowerReadsRoutingTest`** (4 tests): Config defaults, JDBC URL parsing, routing to replicas when enabled, no follower reads inside transactions - [ ] **`FollowerReadsTest`** (4 tests): Cluster integration — read from follower, failover to leader when follower stops, no follower reads in transaction, follower promoted to leader with role change - [ ] **`FollowerReadsVectorIndexTest`**: End-to-end with 1 leader + 1 follower + 1 indexing service — vector ANN search works via follower reads - [ ] Verify existing tests pass: `ChangeRoleTest`, `SimpleFollowerTest`, `ZKDiscoveryVectorIndexTest` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Introduces a new `server.mode=shared-storage` where data lives on shared S3 (via the remote file server) and replicas can **serve reads** from a consistent checkpoint snapshot. Unlike `diskless-cluster` (where replicas can't serve reads) or `cluster` (where followers replay WAL and maintain per-node copies), shared-storage replicas consume the leader's checkpointed state directly from S3. They never write to S3 or BookKeeper, and are promotable to leader on failover. ## Architecture ``` Leader (server.mode=cluster, server.storage.mode=remote, server.checkpoint.publish.to.remote=true): WAL -> BookKeeper Pages -> S3 (existing) Checkpoint -> S3 metadata (TableStatus, IndexStatus, table/index defs) [NEW] LSN notify -> ZooKeeper [NEW] Replica (server.mode=shared-storage): ZK watch -> detects new checkpoint LSN Read meta -> from S3 Atomic swap -> rebuild in-memory state under write lock Serve reads -> S3 pages (with per-server local caching) On failover -> promotable to leader, replays BookKeeper WAL from checkpoint ``` ## New classes (herddb-remote-file-service) - **`SharedCheckpointMetadataManager`** — reads/writes checkpoint metadata on S3 (`TableStatus`, `IndexStatus`, table/index definitions, checkpoint LSN marker) - **`ReadReplicaDataStorageManager`** — read-only `DataStorageManager`; writes throw `UnsupportedOperationException` - **`PromotableRemoteFileDataStorageManager`** — starts read-only, promotable to writable `RemoteFileDataStorageManager` on leader election ## Modified (herddb-core + herddb-remote-file-service) - **`ServerConfiguration`** — `PROPERTY_MODE_SHARED_STORAGE` + replica config (poll interval, switch timeout, publish-to-remote flag) - **`RemoteFileDataStorageManager`** — optional `SharedCheckpointMetadataManager` that publishes metadata to S3 during checkpoint - **`MetadataStorageManager`** — adds `publishCheckpointLsn` / `getCheckpointLsn` / `watchCheckpointLsn` API (no-op defaults) - **`ZookeeperMetadataStorageManager`** — ZK-based checkpoint LSN publish/watch at `{basePath}/checkpoints/{tableSpaceUUID}` - **`TableSpaceManager`** — `CheckpointFollowerThread` (watches ZK, polls as safety net) + `refreshFromCheckpoint()` (atomic view switch under write lock) + shared-storage branches in `startAsFollower()` / `recover()` - **`DBManager`** — `readOnlyReplica` flag rejects non-read statements (DDL/DML/transactions) on non-leader shared-storage replicas - **`Server`** — wires shared-storage mode across metadata / storage / commit-log / node-id managers ## Consistency model - Replicas serve reads at **checkpoint granularity** (default ~15 min, tunable via `server.checkpoint.period`) - No dirty reads (checkpoints only include committed state) - No read-your-writes across leader/replica - On checkpoint switch: write lock blocks new queries; in-flight reads complete against old state; all cached pages for each table are evicted to pick up changes ## Edge cases handled - BLink/BRIN index consistency during switch (closed + rebooted from new checkpoint) - Table create/drop/alter detected via checkpoint metadata diff - Replica boot before first checkpoint: boots empty, picks up state via ZK watch - Failover WAL gap: leader doesn't trim ledgers past last checkpoint, new leader replays the gap - S3 unavailability: cached pages still served, checkpoint refresh retries with backoff ## Configuration **Leader:** ```properties server.mode=cluster server.storage.mode=remote server.checkpoint.publish.to.remote=true # NEW ``` **Replica:** ```properties server.mode=shared-storage # NEW mode remote.file.servers=fileserver1:9846,fileserver2:9846 server.zookeeper.address=zk:2181/herddb server.base.dir=/var/herddb/data # needed for failover promotion ``` ## Test plan - [ ] Unit-test `SharedCheckpointMetadataManager` round-trip via mock `RemoteFileServiceClient` - [ ] Unit-test `ReadReplicaDataStorageManager` (reads delegate, writes throw UOE) - [ ] Integration test: leader inserts + checkpoints -> replica sees rows after refresh - [ ] Integration test: DDL propagation across checkpoint boundary - [ ] Integration test: failover from replica to leader, WAL replay of gap - [ ] Integration test: concurrent queries during checkpoint switch (no partial views) - [ ] Integration test: replica boot before first checkpoint - [ ] Existing `DisklessClusterTest`, `RemoteFileServiceClientTest`, `RemoteFileBrinIndexRecoveryTest` still pass (verified locally) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Addresses production issues from a 1B-vector BigANN ingest (see @/dev/debug-server): Phase B running 20+ min serially, disk filling up with ~800M on-disk vectors across 416 unbounded sealed segments plus 417 resident mmap'd tmp files, then ENOSPC + unbounded retry loops. ### P0 — correctness & unblocking - Track pageIds allocated during Phase B; on failure, roll them back via new `DataStorageManager#deleteIndexPage` (implemented in `FileDataStorageManager` + `MemoryDataStorageManager`, no-op default for BookKeeper/remote). - Exponential backoff (60s → 30min) in `compactionLoop` keyed on `consecutiveCheckpointFailures` — ENOSPC retries no longer burn CPU. ### P1 — Phase B wall-clock - Parallelise Phase B segment builds (`herddb.vectorindex.phaseBSegmentParallelism`, default 2). - New `PageStoreReader` with a bounded LRU page cache replaces the per-segment resident mmap'd tmp file. **No mmap** because mmap regions are not counted against the heap budget. Observed before: 417 `herddb-vector-*.tmp` files (one per segment) silently doubling on-disk footprint; after: tmp dir stays near-empty. ### P2 — bound segment count & ingest rate - `chooseSegmentsToDemote`: when sealed count > threshold (default 32), demote the smallest sealed segments to be merged in the next Phase B. - Absolute cap `herddb.vectorindex.maxLiveVectorsPerCheckpoint` breaks the positive-feedback loop between slow checkpoints and growing live backlog. ### P3 — observability - New metrics: `sealed_segment_count`, `phase_b_bytes_written`, `phase_b_vectors_per_second`, `checkpoint_consecutive_failures`, `checkpoint_total_failures`, `rolled_back_pages_(total|last)`, `tmp_dir_bytes`, `free_disk_bytes`. - Grafana dashboard `indexing-service-dashboard.json` gains two new rows. ## Test plan - [x] `PersistentVectorStoreBackoffTest` (5 tests) — pure backoff policy incl. overflow - [x] `PageStoreReaderTest` (16 tests) — byte-level round-trip, LRU eviction, EOF, wrong chunk type, missing page, shared cache, concurrent readers - [x] `PersistentVectorStoreMergePolicyTest` (7 tests) — demotion below/at/above threshold, smallest-first, batch cap - [x] `PersistentVectorStoreCapComputationTest` (14 tests, +4 new) — absolute cap interaction with memory-derived cap - [x] `PersistentVectorStoreFailureRecoveryTest` (9 tests) — injected DSM failures, page rollback, failure counters, delete-failure tolerance, metrics, backoff observability - [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) — multi-segment Phase B, reopen recovery, partial failure, merge pressure, tmp dir clean - [x] Existing suite regression: `PersistentVectorStoreLifecycleTest`, `PersistentVectorStoreConcurrentCheckpointTest`, `PersistentVectorStoreSearchTest`, `PersistentVectorStoreBLinkCleanupTest`, `PersistentVectorStoreBackpressureLivelockTest`, `PersistentVectorStoreConfigTest` — all pass - [ ] Re-run the BigANN load end-to-end on a scaled-down replica with tight disk quota to verify (a) no ENOSPC, (b) sealed segment count stabilises, (c) no lingering `herddb-vector-*.tmp` files 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rvices zip (#8) ## Summary - Packages the `vector-testings` shaded uber-jar into `herddb-services.zip` (under `vector-testings/`, excluded from `lib/` to avoid ~58MB class duplication), so the Jib-built Docker image ships `VectorBench` at `/opt/herddb/vector-testings/vector-testings-<ver>.jar` plus a new `bin/vector-bench.sh` wrapper. - Adds an opt-in `vectorBench` Helm section: 1-replica `StatefulSet` (sleeps `infinity` for `kubectl exec`) with a 100Gi `volumeClaimTemplates` PVC mounted at `/opt/herddb/vector-datasets`, and `persistentVolumeClaimRetentionPolicy: {whenDeleted: Retain, whenScaled: Retain}` so downloaded datasets (sift1m, gist1m, deep-image-96, …) survive helm upgrades and uninstall/reinstall cycles. - `Config.java` `--dataset-dir` now falls back to `$VECTORBENCH_DATASET_DIR` when not passed explicitly, so the StatefulSet sets the path once via env and users don't have to repeat `--dataset-dir` on every `kubectl exec`. Disabled by default (`vectorBench.enabled: false`). Usage: ```bash helm upgrade --set vectorBench.enabled=true ... kubectl exec -it <release>-vector-bench-0 -- \ /opt/herddb/bin/vector-bench.sh --url "\$HERDDB_JDBC_URL" --dataset sift1m ``` ## Test plan - [x] ``mvn -pl vector-testings,herddb-services -am package -DskipTests`` — zip contains ``vector-testings/vector-testings-<ver>.jar`` (58MB) + ``bin/vector-bench.sh``, and NO ``lib/vector-testings*.jar`` - [x] ``helm lint --set vectorBench.enabled=true`` — clean - [x] ``helm template --set vectorBench.enabled=true`` — StatefulSet renders with correct env, PVC and retention policy - [ ] End-to-end on k3s: install, confirm ``datasets-<release>-vector-bench-0`` PVC is Bound, run ``vector-bench.sh --dataset sift1m`` inside the pod, then ``helm upgrade`` and verify dataset files in ``/opt/herddb/vector-datasets`` persist 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Plumbs together HerdDB server (\`server.storage.mode=remote\`), RemoteFileServer (\`storage.mode=s3\`), and the IndexingServer (\`indexing.storage.type=remote\`) so that a single MinIO bucket backs both table data and vector-index data. Only the commit log, metadata, and the indexing service's watermark/checkpoint cache stay local. ### Fixes discovered while building the test - **VectorIndexManager held a stale client after restart.** Capturing \`RemoteVectorIndexService\` by value at index-open time meant restarting the IndexingServer (new gRPC client) left ANN search failing with \"Channel shutdown invoked\". Changed to a \`Supplier<RemoteVectorIndexService>\` that resolves lazily via \`DBManager\`. - **IndexingServer wiped \`RemoteFileDataStorageManager\` state on restart.** It passed \`dataDir\` as both metadataDir and tmpDir; \`FileDataStorageManager\` cleans tmpDir on start()/close(), nuking checkpoint markers. Fixed by splitting into \`{dataDir}/remote-metadata\` (preserved) and \`{dataDir}/remote-tmp\` (scratch). - **\`RemoteFileDataStorageManager.deleteIndexPage\` was missing.** The Phase-B rollback path from PR #7 was a no-op over S3, leaking objects on checkpoint failure. Added an override calling \`client.deleteFile\` on the remote index-page path. ### End-to-end test \`IndexingServiceWithS3E2ETest\` in \`herddb-services\`: - \`@ClassRule\` MinIO testcontainer amortised across methods; per-test S3 prefix cleaned in \`@After\`. - Full topology per test: RemoteFileServer(s3) + Server(remote) + IndexingServer(remote), all talking to one MinIO bucket. - 5 scenarios: search before/after checkpoint, restart-before-checkpoint (crash + replay), restart-after-checkpoint (300 random vectors, real FusedPQ checkpoint, stop + restart + verify segments reload from S3), tmp-dir-stays-local guard rail. ### Known gap (follow-up) The indexing service still needs \`watermark.dat\` + \`remote-metadata/\` to survive restart. For truly ephemeral pods (disks wiped on restart) both need to be reconstructible from S3/ZK/BK. \`SharedCheckpointMetadataManager\` already publishes the right paths for the read-replica flow; wiring a bootstrap-from-S3 read path and moving the watermark off local disk is deferred. ## Test plan - [x] \`RemoteFileDataStorageManagerBasicTest\` — 3 new cases for delete + idempotency - [x] \`IndexingServiceWithS3E2ETest\` — 5 scenarios, ~45s end-to-end - [x] Regression: \`StaticDiscoveryRemoteFileIndexingTest\`, \`ZKDiscoveryRemoteFileIndexingTest\`, \`PersistentVectorStoreFailureRecoveryTest\`, \`PersistentVectorStoreLifecycleTest\`, \`PersistentVectorStoreConcurrentCheckpointTest\`, \`PersistentVectorStoreSearchTest\` — all green locally - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Adds OIDC/JWT bearer-token authentication across HerdDB: - **JDBC client → HerdDB server** via the new **OAUTHBEARER** SASL mechanism (RFC 7628), running over the existing PDU handshake alongside PLAIN/DIGEST-MD5/Kerberos - **HerdDB server → HerdDB server** (leader/follower, inter-cluster) - **File server (gRPC)** and **Indexing service (gRPC)** — which previously had **no authentication at all** — now both enforce `Authorization: Bearer <jwt>` via a shared gRPC interceptor JWTs are validated against the OIDC provider's JWKS (cached, auto-refreshing) using Nimbus JOSE+JWT. Clients obtain tokens via the OAuth2 **client_credentials** grant (with automatic refresh) or can pass a pre-acquired token directly. Works with Keycloak, Okta, Auth0, etc. Built as **7 incremental commits**, each with passing unit tests: 1. `herddb-auth-oidc` foundation module — `OidcConfiguration` (discovery), `OidcTokenValidator`, `OidcTokenProvider`, `PrincipalExtractor` (20 tests) 2. **OAUTHBEARER** SASL mechanism — `SaslServer`/`SaslClient` impls + JVM provider + `TokenAuthenticator` callback (7 tests) 3. JDBC client + HerdDB server wiring — `ClientConfiguration` / `ServerConfiguration` properties, `HDBClient` shared `OidcTokenProvider`, `Server` validator setup (5 E2E tests) 4. File server gRPC — shared `JwtAuthServerInterceptor`/`JwtAuthClientInterceptor`, `RemoteFileServer`/`RemoteFileServiceClient` integration (3 + 6 tests) 5. Indexing service gRPC — same pattern for `IndexingServer`/`IndexingServiceClient` (3 tests) 6. End-to-end Keycloak test via Testcontainers (`KeycloakOidcE2ETest`) — real Keycloak container, realm + client provisioned via admin REST API; skips gracefully when Docker is unavailable 7. `AUTHENTICATION.md` documenting all HerdDB auth modes + JDBC URL examples + troubleshooting, plus a `herddb-services/src/main/demo/keycloak/` docker-compose for demos **46 new tests**, all existing SASL PLAIN / DIGEST-MD5 tests still pass. ## Configuration at a glance Server: ```properties oidc.enabled=true oidc.issuer.url=https://keycloak.example.com/realms/herddb oidc.audience=herddb-api # optional oidc.username.claim=preferred_username # optional ``` JDBC client: ``` jdbc:herddb:server:host:7000?client.auth.mech=OAUTHBEARER&\ oidc.issuer.url=https://keycloak.example.com/realms/herddb&\ oidc.client.id=herddb-jdbc&oidc.client.secret=<secret> ``` ## Test plan - [x] `mvn -pl herddb-auth-oidc test` — 33 tests green (validator, provider, discovery, OAUTHBEARER SASL roundtrip, gRPC interceptors) - [x] `mvn -pl herddb-core -Dtest=OidcAuthTest test` — 5 E2E tests green (valid creds, wrong secret, wrong issuer, missing config fails fast, pre-acquired token) - [x] `mvn -pl herddb-remote-file-service -Dtest=RemoteFileServerOidcTest test` — 3 tests green - [x] `mvn -pl herddb-indexing-service -Dtest=IndexingServiceOidcTest test` — 3 tests green - [x] `mvn -pl herddb-core -Dtest=JASSPLAINTest,JAASMD5Test test` — existing SASL tests still pass (no regression) - [x] Full clean build of reactor: `mvn clean install -DskipTests` across all touched modules - [ ] Keycloak E2E (`KeycloakOidcE2ETest`) — passes against real Keycloak container in CI; skipped locally without Docker - [ ] `docker compose up` in `herddb-services/src/main/demo/keycloak/` boots a realm with `herddb-jdbc`, `herddb-file`, `herddb-index`, `herddb-server` service-account clients 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Large FusedPQ graph files (hundreds of GB) can now be stored and retrieved as multipart files (.multipart/ dirs with numbered 4 MB block files) instead of thousands of 1 MB HerdDB index pages. - New RPC `WriteFileBlock(WriteFileBlockRequest) returns (WriteFileBlockResponse)`: stores a single block of a multipart file by path + blockIndex - New RPC `ReadFileRange(ReadFileRangeRequest) returns (ReadFileRangeResponse)`: retrieves a byte range from a multipart file (server computes blockIndex from offset / blockSize, returns the slice from that block) - Switched from `ServerBuilder` to `NettyServerBuilder` so we can wire up a `NioEventLoopGroup` directly and call `setIoRatio()` on it - ioRatio defaults to 70, overridable via system property `herddb.fileserver.netty.ioRatio` - blockSize (default 4 MB) read from `block.size` config property; `maxInboundMessageSize` set to blockSize + 1 MB - Required adding both `bossEventLoopGroup` + `channelType(NioServerSocketChannel)` alongside `workerEventLoopGroup` to satisfy Netty's "all or none" contract - Added `MULTIPART_SUFFIX = ".multipart"` constant - New abstract methods: `writeBlock`, `readRange`, `deleteLogical`, `listLogical` - `LocalObjectStorage`: blocks stored as `{path}.multipart/{blockIndex}` files; `readRange` loads the block file and slices; `deleteLogical` recursively removes the `.multipart/` directory; `listLogical` deduplicates via LinkedHashSet - `S3ObjectStorage`: each block is a separate S3 object at key `{prefix}/{path}.multipart/{blockIndex}`; `readRange` downloads the whole block object and returns the requested slice; `deleteLogical` removes all block objects via deleteByPrefix + deletes the single-file key in parallel - `CachingObjectStorage`: `writeBlock` writes to inner and caches the block; `readRange` serves from cache (block key = `{path}.multipart/{blockIndex}`); `deleteLogical` invalidates all matching cache entries then delegates - New `writeFileBlock` handler: calls `storage.writeBlock()` with metrics (writeblock_requests, writeblock_errors, writeblock_bytes, writeblock_latency) - New `readFileRange` handler: calls `storage.readRange()` with metrics (readrange_requests, readrange_errors, readrange_bytes, readrange_latency) - `deleteFile` now calls `storage.deleteLogical()` (handles both single files and multipart dirs) - `listFiles` now calls `storage.listLogical()` (returns collapsed logical paths) - Block routing key = `path + "#block" + blockIndex` for consistent hash, so blocks of the same logical file distribute across file servers - `writeFileBlock(Async)`: unary RPC routed by block key - `readFileRange(Async)`: handles cross-block requests by splitting into two sequential calls (each routed independently) - `writeMultipartFile(path, InputStream, blockSize)`: reads InputStream in blockSize chunks, calls `writeFileBlock` sequentially, returns total bytes written - `listFiles` aggregation uses `LinkedHashSet` to deduplicate logical paths returned from multiple servers - `buildChannel` sets `maxInboundMessageSize(5 MB)` so large block responses fit - Implements `io.github.jbellis.jvector.disk.RandomAccessReader` - Buffers one block at a time; `ensureBlockLoaded()` fetches via `readFileRange` only when the current block changes (amortises network round trips for sequential reads within a block) - Cross-block `readFully` handled transparently in a loop - All primitive decode methods (`readInt`, `readFloat`, `readLong`) implemented via `readFully` to get big-endian byte reads - Nested `Supplier implements ReaderSupplier` creates independent reader instances per thread (jvector calls `get()` once per search thread) - `writeMultipartIndexFile(tableSpace, uuid, fileType, tempFile)`: streams the temp file to remote servers via `client.writeMultipartFile()`; returns logical path - `multipartIndexReaderSupplier(tableSpace, uuid, fileType, fileSize)`: returns a `RemoteRandomAccessReader.Supplier` backed by the client and logical path - `deleteMultipartIndexFile(tableSpace, uuid, fileType)`: calls `client.deleteFile()` on the logical path (which routes through `deleteLogical` on the server) - Config property `remote.file.multipart.block.size` (default 4 MB) - Added default no-op methods `writeMultipartIndexFile`, `multipartIndexReaderSupplier`, `deleteMultipartIndexFile` so non-remote implementations are unaffected - Added `graphFilePath`, `graphFileSize`, `mapFilePath`, `mapFileSize` fields for multipart mode (populated by `PersistentVectorStore` after a multipart write) - `SegmentWriteResult` gains a multipart constructor (paths + sizes instead of page ID lists) and `isMultipart()` predicate - `writeFusedPQGraphToTempFile` / `writeFusedPQMapDataToTempFile`: build the jvector graph / BLink map to a local temp file and return the Path - `writeOneSegmentData`: tries multipart first (if DSM supports it), falls back to page-based chunks - Metadata format v3 (`METADATA_VERSION_MULTI_SEGMENT`): each segment now serialized with a `useMultipart` byte flag; multipart segments store `graphFilePath`, `graphFileSize`, `mapFilePath`, `mapFileSize` via `DataOutputStream.writeUTF()`; page-based segments still store the page ID arrays - `loadMultiSegmentFormat` reads the new format via `DataInputStream` - `loadFusedPQSegment` uses `multipartIndexReaderSupplier` when `seg.graphFilePath` is set, otherwise falls back to the `PageStoreReader` path - `readMultipartMapDataToTempFile`: downloads multipart map data via `RemoteRandomAccessReader` to a local temp file for BLink loading - `LocalObjectStorageTest`: writeBlock/readRange/deleteLogical/listLogical (4 new) - `RemoteFileServiceTest`: WriteFileBlock+ReadFileRange gRPC round trip, writeMultipartFile end-to-end, multipart list+delete (3 new) - `RemoteRandomAccessReaderTest` (new file): seek+readFully, cross-block read, readInt/readFloat/readLong, Supplier creates independent readers, length() (7 tests) - `CachingObjectStorageTest.FakeObjectStorage` updated to implement new abstract methods Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… configured VectorIndexManager.doStart() now calls remoteService() eagerly so that CREATE VECTOR INDEX fails immediately with a clear error message ("RemoteVectorIndexService is required") instead of succeeding silently and only crashing later when the first operation is attempted. Fixes VectorIndexTest.testFailsWithoutRemoteService. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… PromotableRemoteFileDataStorageManager (#11) ReadReplicaDataStorageManager now implements multipartIndexReaderSupplier so that shared-storage read replicas can serve vector search queries on indexes stored as multipart files (written by the leader via RemoteFileDataStorageManager). - ReadReplicaDataStorageManager.multipartIndexReaderSupplier: constructs a RemoteRandomAccessReader.Supplier using client.getBlockSize() and the same {tableSpace}/{uuid}/multipart/{fileType} path convention as the writer - PromotableRemoteFileDataStorageManager: added overrides for all three multipart methods (writeMultipartIndexFile, multipartIndexReaderSupplier, deleteMultipartIndexFile) that delegate to activeDelegate, so that the correct implementation is used regardless of whether the node is currently a replica or a promoted leader - RemoteRandomAccessReaderTest: added testReadReplicaMultipartIndexReaderSupplier which creates a ReadReplicaDataStorageManager with a small block size (64 bytes), writes a 3-block multipart file, and verifies: full sequential read, correct total length, and two independent readers at different seek positions Fixes a bug where read replicas would use the no-op base-class multipartIndexReaderSupplier (returning null) and therefore fail to open vector index segments stored as multipart files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

#12) ## Summary - **VectorIndexEndToEndTest**: Fixed tablespace UUID mismatch between Server and EmbeddedIndexingService that caused `waitForCatchUp()` to loop forever — the indexing service was tailing a non-existent txlog directory - **NettyChannelAcceptor.close()**: Fixed JVM hang after tests complete by using `shutdownNow()` + `awaitTermination()` for the callback executor and bounded graceful shutdown for Netty event loops - **IndexingServiceClient.waitForInstanceCatchUp()**: Added 5-minute timeout to prevent infinite polling loops ## Test plan - [x] herddb-jdbc: 78 tests pass locally - [x] herddb-net: 7 tests pass locally - [x] herddb-services: 25 tests pass locally (including VectorIndexEndToEndTest which previously hung) - [ ] Monitor CI workflow `pr-validation-test-other.yml` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Makes the indexing service's local disk fully reconstructible from S3 when running with `storage.type=remote`, so a Kubernetes pod can restart with an empty volume and resume from the last published checkpoint. ## What this fixes Two pieces of local state were previously unrecoverable after a disk wipe: 1. **`watermark.dat`** — last-processed commit-log LSN, written only to the local disk. 2. **`remote-metadata/`** — IndexStatus checkpoint markers that `PersistentVectorStore` reads on startup to find its sealed S3 segments. If wiped, the store can't discover what it had already checkpointed to S3. There was also a latent bug: `IndexingServiceEngine` only configured the `PersistentVectorStore` factory for `storage.type=file`, silently falling back to `InMemoryVectorStore` for `storage.type=remote`. ## Changes - **`WatermarkStore`** is now an interface; `LocalWatermarkStore` keeps the previous disk impl, and new **`S3WatermarkStore`** persists the watermark to `{tableSpace}/_indexing/{instanceId}/watermark.lsn` with an XXHash64 footer. Monotonicity is enforced on save. - **Save contract**: the watermark is written **only** after a successful `indexCheckpoint` — never on a timer, entry count, or shutdown. This keeps the stored LSN always covered by a durable IndexStatus on S3 and bounds S3 writes to ≤ 1 per checkpoint. `IndexingServiceEngine.checkpointAndSaveWatermark()` triggers a checkpoint every 5000 entries and only advances the watermark after all stores succeed. - **`SharedCheckpointMetadataManager.hydrateLocalMetadataDir`** downloads every `_metadata/*.{index,table}status` object from S3 into a staging dir and atomically moves it into place. The byte-for-byte on-disk format matches `FileDataStorageManager`'s, so no deserialization is needed. - **`IndexingServer`** now wires a `SharedCheckpointMetadataManager` onto its `RemoteFileDataStorageManager` (so every `indexCheckpoint` publishes its IndexStatus to S3), and runs hydrate + installs the `S3WatermarkStore` via a new `afterTableSpaceResolved` hook on the engine. - **Commit-log tailer** always starts from `START_OF_TIME` so DDL entries are reprocessed on every boot and the in-memory `SchemaTracker` is rebuilt. DML is replayed too but is idempotent (`addVector` replaces by PK); previously-sealed segments load directly from S3 via `PageStoreReader`, so replay only has to re-add uncheckpointed vectors. A future PR can skip DML with LSN ≤ watermark while still reprocessing DDL. - **`IndexingServiceEngine`** now configures the `PersistentVectorStore` factory for both `storage.type=file` AND `storage.type=remote`. ## Tests - `S3WatermarkStoreTest` (new) — 10 unit tests with fault injection: absent-returns-START_OF_TIME, roundtrip, path format, monotonicity, equal-LSN idempotency, corruption detection, truncation detection, overwrite-after-corruption, read-failure, write-failure. - `IndexingServiceWithS3E2ETest` — updated to wipe the entire indexing data dir (watermark.dat AND remote-metadata/) between stop and restart, exercising the hydrate + watermark-from-S3 path. Also asserts `watermark.lsn` exists on S3, not locally. - All 141 indexing-service tests and 85 remote-file-service tests still pass. ## Known follow-up - DML replay from START_OF_TIME on every restart is correct but wasteful; a future PR can persist a lightweight schema snapshot alongside the watermark and skip already-applied DML. - Full cluster E2E test (ZK + BK + 2 servers in cluster mode + indexing + restart everything except ZK/BK) was scoped out of this PR to keep it reviewable. ## Test plan - [x] `mvn -pl herddb-indexing-service test -Dtest=S3WatermarkStoreTest` - [x] `mvn -pl herddb-services test -Dtest=IndexingServiceWithS3E2ETest` - [x] `mvn -pl herddb-indexing-service test` (all 141 pass) - [x] `mvn -pl herddb-remote-file-service test` (all 85 pass) - [x] `mvn -pl herddb-services test -Dtest='StaticDiscoveryRemoteFileIndexingTest,ZKDiscoveryRemoteFileIndexingTest'` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary - **CheckpointExecuteTest.checkpointRequiresTablespaceParam**: the error message changed to "CHECKPOINT requires 1 or 2 parameters" when the optional timeout parameter was added, but the test still expected the old "requires one parameter" message. - **SysindexstatusTest.testSysindexstatusVector**: since commit 82efba7, embedded vector indexing is no longer supported — `VectorIndexManager.doStart()` now fails fast if `RemoteVectorIndexService` is not configured. The test now provides a `MockRemoteVectorIndexService` and asserts on the remote `IndexStatusInfo` fields (`vectorCount`, `segmentCount`, `status`) instead of the old embedded properties. The other two CI failures (SimpleFollowerTest, MultipleConcurrentUpdatesTest) pass consistently locally and appear to be CI-environment flakiness rather than deterministic bugs. ## Test plan - [x] `CheckpointExecuteTest` — all methods pass - [x] `SysindexstatusTest` — all methods pass (including vector test) - [x] `MultipleConcurrentUpdatesTest` — passes 3/3 runs locally - [x] `SimpleFollowerTest` — passes locally 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary - **BookkeeperCommitLog bug fix**: In `BKFollowerContext.ensureOpenReader()`, the "find next ledger" loop at line 919 was assigning the found ledger ID to `ledgerToTail` instead of `nextLedger`. This meant `nextLedger` always remained `-1`, causing the follower to null out its `currentLedger` and take the "no more ledgers" path every time it needed to switch between closed/fully-read ledgers. This bug could cause the follower to reopen the same ledger repeatedly with incorrect `nextEntryToRead` offsets or fail to advance to new ledgers promptly. - **SimpleFollowerTest stabilization**: The test's auto-transaction uses `sync=false` (deferred) BK writes, meaning entries may not be visible to the follower immediately. Added a sync write (`NO_TRANSACTION` INSERT) after the transaction to guarantee BK visibility before the wait begins. Also added a fail-fast `isFailed()` check inside the wait condition and a descriptive timeout message for better CI diagnostics. ## Test plan - [x] `SimpleFollowerTest` — all 5 tests pass - [x] Verified the ledger switching fix is correct by code review - [ ] CI should confirm the SimpleFollowerTest no longer times out 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary - Fix `estimatedSizeBytes` in the page-based write path to include both graph and map pages (was only counting graph pages), which caused segments to not seal properly and underestimated `avgBytesPerVector` in subsequent checkpoints - Add on-disk index size metric (`ondisk_estimated_size_bytes`) and per-segment size distribution metrics (`segment_size_min/median/max_bytes`) for Grafana observability - Rename misleading `estimated_size_bytes` metric to `live_vectors_estimated_memory_bytes` (it reports in-memory usage, not disk size) - Add two new Grafana panels: "On-Disk Index Size" and "Segment Size Distribution" (min/p50/max) ## Test plan - [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) — passes - [x] `PersistentVectorStoreSearchTest` (12 tests) — passes - [ ] Verify new metrics appear in Prometheus after deploying the indexing service - [ ] Verify Grafana dashboard renders the new panels correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eolivelli · 2026-04-07T09:59:38Z

I have sent the PR to the wrong repo :-)
please ignore

eolivelli and others added 30 commits March 19, 2026 10:05

Add vector-testings/datasets/ to .gitignore

dc6c661

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Do not unpack vector datasets

79a6333

Add service RemoteFileServer

17c3871

Add helm chart

dc7738e

Improvements to RemoteFileServer

9130ef8

Add logging to RemoteFileServer for debug

6f83783

Stop JVM in Vector benchmark

65f52e8

Move CI to JDK22

7fd93a0

Implement CHECKPOINT command

f9d0f89

Improve Vector Testing tool

3436d38

Change default port for FileServer

5a361ad

Add initial documentation about the REMOTE FILE SERVER

359f0c9

Fixes to VectorIndexManager and testing framework

d075e32

Make file server async

23e8c4d

Make file server async

4bd8114

Make file server async

b6472f7

Improvements to vector indexing

4a61dd1

RemoteFileServer: add support for S3

6f5ff35

Improve Helm chart with S3 and Minio support

f3b0ec9

Fix OpenJPA test build on JDK22

9b9f760

Improve Vector Benchmark

41a00f5

Optimize ORDER BY + LIMIT

05654bf

Vector index: do not go OOM

cb5e67d

Improve code utils about float vectors

716e7fb

Update test scripts

c46c41c

Vector: improve resilience to OOM

0475e9c

Reduce memory copies in FileBackedVectorValues

cf7dc7f

eolivelli and others added 27 commits April 3, 2026 12:44

Remove logs and improve dashboards

bc36414

Kubernetes: add first tests with Indexing Service and with File Server

32dd9d8

Vector + FileServer: first integration

9c2c751

Vector: fix storage leak

80ef08e

eolivelli closed this Apr 7, 2026

eolivelli deleted the vector-index-async branch April 7, 2026 10:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge Vector Search into master branch#837

Merge Vector Search into master branch#837
eolivelli wants to merge 133 commits into
diennea:masterfrom
eolivelli:vector-index-async

eolivelli commented Apr 7, 2026

Uh oh!

eolivelli commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eolivelli commented Apr 7, 2026

Uh oh!

eolivelli commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant