Merge Vector Search into master branch#837
Closed
eolivelli wants to merge 133 commits into
Closed
Conversation
…arameters
## Features
### Vector index (ANN search)
- New index type `VECTOR` backed by jvector (Vamana-style ANN graph).
- Indexed column must be of type `floata` (FLOAT ARRAY).
- CREATE / INSERT / UPDATE / DELETE path fully implemented.
- ANN query via `ann_of(col, ?)` scalar function:
- `ORDER BY ann_of(col, CAST(? AS FLOAT ARRAY)) DESC LIMIT k` is
intercepted by CalcitePlanner and routed through the new
VectorANNScanOp planner operator when a vector index exists.
- Falls back to brute-force cosine similarity when no index is present.
- Optional WHERE filter applied after PK fetch.
### Configurable hyperparameters via WITH clause
- Parameters `m`, `beamWidth`, `neighborOverflow`, `alpha`, `similarity`
(cosine / euclidean / dot), and `fusedPQ` can be set at index creation:
CREATE VECTOR INDEX vidx ON tblspace1.t1(vec)
WITH m=32 beamWidth=200 similarity=cosine fusedPQ=true;
- Index.java now carries a `Map<String,String> properties` field,
serialised as format version 2 (version 1 files read with empty map).
- JSQLParserPlanner strips the WITH suffix before JSQLParser sees it
(JSQLParser does not support this syntax), parses key=value pairs, and
transfers them to the Index.Builder via a ThreadLocal.
### FusedPQ on-disk format (hybrid architecture)
- When fusedPQ=true (default), dimension ≥ 8, and ≥ 256 active vectors,
checkpoints use jvector's OnDiskGraphIndex with FusedPQ + InlineVectors
features for faster approximate scoring at search time.
- On restart the on-disk graph is loaded via a ByteBuffer-backed
ReaderSupplier; new inserts after restart go to a fresh in-memory
GraphIndexBuilder.
- Search is hybrid: on-disk graph (FusedPQ scoring + InlineVectors
reranking) and live in-memory graph are searched independently; results
are merged by score descending before returning the top-K.
- Deleted on-disk nodes are tracked in onDiskNodeToPk / onDiskPkToNode
and excluded at search time via an acceptBits filter.
- Falls back to the legacy OnHeapGraphIndex format for small indexes
(< 256 vectors or dimension < 8) or when fusedPQ=false.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Intercept ORDER BY ann_of(col, ?) in JSQLParserPlanner before planSort() and route through VectorANNScanOp, matching CalcitePlanner behaviour. - VectorANNScanOp: support null scanProjection/innerProjection (full table row passthrough); null-safe fallback (throws StatementExecutionException when no index and no fallback op) - JSQLParserPlanner: add tryBuildVectorANNSortOp() that detects the ann_of() pattern in a single ORDER BY element and builds VectorANNScanOp with null projections; falls through to planSort() for all other cases - Add VectorIndexJSQLParserPlannerTest covering: index present, WHERE + ANN filter, and no-index error path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Tuple - New vector-testings/ module: standalone JDBC benchmark for vector index ingestion throughput, index build time, ANN query latency, and recall@K - Supports SIFT-1M, SIFT-10M, and BIGANN-1B datasets (fvecs + bvecs formats) - Configurable parallelism, batch size, index parameters, and top-K - Fix Tuple.serialize() to handle Float values (returned by ann_of()) by converting to Double before serialization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a volatile snapshot swap pattern (ServerSnapshot inner class) to allow lock-free reads in hot-path methods while supporting safe dynamic server list updates via updateServers(). Existing channels are reused when servers are retained, and removed channels are shut down gracefully in a background daemon thread. Also adds the DynamicServiceClient interface in herddb-core for generic dynamic update support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation Add ServiceDiscoveryListener interface and service registration/discovery methods for indexing services and file servers. The ZooKeeper implementation uses ephemeral nodes for automatic deregistration on disconnect and watchers for push-based change notifications. Services are re-registered after ZK session expiry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor RemoteFileServiceClient to support dynamic server list updates via a volatile snapshot swap pattern. This allows changing the set of remote file servers at runtime without restarting the client. - Add DynamicServiceClient interface in herddb-core - Introduce ServerSnapshot inner class holding router + channels - Replace final router/channels fields with volatile snapshot - Extract buildChannel() helper for channel creation - Add updateServers() with channel reuse and graceful shutdown - Update all stub methods and fan-out operations to read snapshot - Add RemoteFileServiceClientDynamicTest with 3 test cases Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eeper Enable IndexingServer and RemoteFileServer to register/unregister themselves in MetadataStorageManager on startup/shutdown, allowing service discovery via ZooKeeper in cluster mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…service mode Introduce server.storage.mode property (local|remote|bookkeeper) to independently select the DataStorageManager. Remove PROPERTY_MODE_REMOTE_FILE_SERVICE. The remote file storage is now activated via server.storage.mode=remote with any compatible server.mode (standalone or cluster). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that IndexingService instances register in ZooKeeper and can be discovered dynamically by the HerdDB server. Two test methods cover the single-instance and two-instance (sharded) scenarios, both performing ANN queries against orthogonal 4D vectors to confirm correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that a RemoteFileServer registered in ZK is discovered by a HerdDB server running in cluster mode with server.storage.mode=remote, and that basic table CRUD operations work with remote page storage. Includes a two-server variant that tests consistent hashing across multiple file servers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ServerMain: after server.start(), try ZK discovery for indexing services if indexing.service.servers is not statically configured. Register a ServiceDiscoveryListener for dynamic updates. - Server.buildDataStorageManager(): for storage.mode=remote, try ZK discovery for file servers if remote.file.servers is not configured. Register listener for dynamic updates via DynamicServiceClient. - Change PROPERTY_REMOTE_FILE_SERVERS_DEFAULT to empty string since the old remote-file-service mode was removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test-start-fileserver-cluster.sh: use server.storage.mode=remote instead of server.mode=remote-file-service - Helm chart: condition file-server resources on server.storageMode=remote instead of server.mode=remote-file-service - REMOTE_FILE_SERVER.md: update configuration examples Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Add `client.allowReadsFromFollowers` configuration flag (default `false`) that enables distributing SELECT queries across all replicas (leader + followers) when in AUTOCOMMIT mode (`tx == 0`) - Writes always go to the leader; reads inside transactions always go to the leader - On follower read failure, automatic retry picks a different replica via existing `requestMetadataRefresh` mechanism - Backward-compatible wire protocol extension using bitmask trailer in `OpenScanner` PDU ## Changes ### Wire Protocol (`herddb-utils`) - **`Pdu.java`**: Add `FLAGS_OPENSCANNER_ALLOW_FOLLOWER_READS = 8` (bit 3, composable with existing bit 2) - **`PduCodec.java`**: Add `OpenScanner.write()` overload with `allowFollowerReads` param; trailer uses bitmask OR. Add `isAllowFollowerReads()` decoder ### Client Configuration (`herddb-core`) - **`ClientConfiguration.java`**: Add `client.allowReadsFromFollowers` property (works via JDBC URL query params) - **`HDBClient.java`**: Store and expose `allowReadsFromFollowers` flag ### Metadata Provider - **`ClientSideMetadataProvider.java`**: Add `getTableSpaceReplicas()` default method returning `Set<String>` - **`ZookeeperClientSideMetadataProvider.java`**: Cache `TableSpace.replicas` alongside leader in `readAsTableSpace()`; implement `getTableSpaceReplicas()`; clear replica cache on metadata refresh ### Client Routing - **`ClientSideConnectionPeer.java`**: Add `executeScan` overload with `allowFollowerReads` parameter (default method delegates to old signature) - **`RoutedClientSideConnection.java`**: Implement new overload, pass flag to `PduCodec.OpenScanner.write()` - **`HDBConnection.java`**: Add `getRouteToTableSpaceReplica()` (random selection from replica set); modify `executeScan()` to route to replicas when `allowReadsFromFollowers && tx == 0` - **`NonMarshallingClientSideConnectionPeer.java`**: Override new `executeScan` for local mode ### Server-Side - **`Statement.java`**: Move `allowExecutionFromFollower` field from `ScanStatement` to base class (enables `SQLPlannedOperationStatement` to carry the flag) - **`ScanStatement.java`**: Remove duplicate field (now inherited) - **`ServerSideConnectionPeer.java`**: Read `allowFollowerReads` from PDU trailer in `handleOpenScanner()`; set flag on translated plan's main statement. Add overloaded `executeScan()` for local mode - **`DBManager.java`**: Fix leader check in `executeStatementAsync()` to respect `statement.getAllowExecutionFromFollower()` — critical for `SQLPlannedOperationStatement` path through the Calcite planner ## Test Plan - [ ] **`PduCodecTest.readObjectsTrailerWithFollowerReads`**: Protocol flag encoding/decoding for all 4 combinations of `keepReadLocks` x `allowFollowerReads` - [ ] **`FollowerReadsRoutingTest`** (4 tests): Config defaults, JDBC URL parsing, routing to replicas when enabled, no follower reads inside transactions - [ ] **`FollowerReadsTest`** (4 tests): Cluster integration — read from follower, failover to leader when follower stops, no follower reads in transaction, follower promoted to leader with role change - [ ] **`FollowerReadsVectorIndexTest`**: End-to-end with 1 leader + 1 follower + 1 indexing service — vector ANN search works via follower reads - [ ] Verify existing tests pass: `ChangeRoleTest`, `SimpleFollowerTest`, `ZKDiscoveryVectorIndexTest` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Introduces a new `server.mode=shared-storage` where data lives on shared
S3 (via the remote file server) and replicas can **serve reads** from a
consistent checkpoint snapshot. Unlike `diskless-cluster` (where
replicas can't serve reads) or `cluster` (where followers replay WAL and
maintain per-node copies), shared-storage replicas consume the leader's
checkpointed state directly from S3. They never write to S3 or
BookKeeper, and are promotable to leader on failover.
## Architecture
```
Leader (server.mode=cluster, server.storage.mode=remote,
server.checkpoint.publish.to.remote=true):
WAL -> BookKeeper
Pages -> S3 (existing)
Checkpoint -> S3 metadata (TableStatus, IndexStatus, table/index defs) [NEW]
LSN notify -> ZooKeeper [NEW]
Replica (server.mode=shared-storage):
ZK watch -> detects new checkpoint LSN
Read meta -> from S3
Atomic swap -> rebuild in-memory state under write lock
Serve reads -> S3 pages (with per-server local caching)
On failover -> promotable to leader, replays BookKeeper WAL from checkpoint
```
## New classes (herddb-remote-file-service)
- **`SharedCheckpointMetadataManager`** — reads/writes checkpoint
metadata on S3 (`TableStatus`, `IndexStatus`, table/index definitions,
checkpoint LSN marker)
- **`ReadReplicaDataStorageManager`** — read-only `DataStorageManager`;
writes throw `UnsupportedOperationException`
- **`PromotableRemoteFileDataStorageManager`** — starts read-only,
promotable to writable `RemoteFileDataStorageManager` on leader election
## Modified (herddb-core + herddb-remote-file-service)
- **`ServerConfiguration`** — `PROPERTY_MODE_SHARED_STORAGE` + replica
config (poll interval, switch timeout, publish-to-remote flag)
- **`RemoteFileDataStorageManager`** — optional
`SharedCheckpointMetadataManager` that publishes metadata to S3 during
checkpoint
- **`MetadataStorageManager`** — adds `publishCheckpointLsn` /
`getCheckpointLsn` / `watchCheckpointLsn` API (no-op defaults)
- **`ZookeeperMetadataStorageManager`** — ZK-based checkpoint LSN
publish/watch at `{basePath}/checkpoints/{tableSpaceUUID}`
- **`TableSpaceManager`** — `CheckpointFollowerThread` (watches ZK,
polls as safety net) + `refreshFromCheckpoint()` (atomic view switch
under write lock) + shared-storage branches in `startAsFollower()` /
`recover()`
- **`DBManager`** — `readOnlyReplica` flag rejects non-read statements
(DDL/DML/transactions) on non-leader shared-storage replicas
- **`Server`** — wires shared-storage mode across metadata / storage /
commit-log / node-id managers
## Consistency model
- Replicas serve reads at **checkpoint granularity** (default ~15 min,
tunable via `server.checkpoint.period`)
- No dirty reads (checkpoints only include committed state)
- No read-your-writes across leader/replica
- On checkpoint switch: write lock blocks new queries; in-flight reads
complete against old state; all cached pages for each table are evicted
to pick up changes
## Edge cases handled
- BLink/BRIN index consistency during switch (closed + rebooted from new
checkpoint)
- Table create/drop/alter detected via checkpoint metadata diff
- Replica boot before first checkpoint: boots empty, picks up state via
ZK watch
- Failover WAL gap: leader doesn't trim ledgers past last checkpoint,
new leader replays the gap
- S3 unavailability: cached pages still served, checkpoint refresh
retries with backoff
## Configuration
**Leader:**
```properties
server.mode=cluster
server.storage.mode=remote
server.checkpoint.publish.to.remote=true # NEW
```
**Replica:**
```properties
server.mode=shared-storage # NEW mode
remote.file.servers=fileserver1:9846,fileserver2:9846
server.zookeeper.address=zk:2181/herddb
server.base.dir=/var/herddb/data # needed for failover promotion
```
## Test plan
- [ ] Unit-test `SharedCheckpointMetadataManager` round-trip via mock
`RemoteFileServiceClient`
- [ ] Unit-test `ReadReplicaDataStorageManager` (reads delegate, writes
throw UOE)
- [ ] Integration test: leader inserts + checkpoints -> replica sees
rows after refresh
- [ ] Integration test: DDL propagation across checkpoint boundary
- [ ] Integration test: failover from replica to leader, WAL replay of
gap
- [ ] Integration test: concurrent queries during checkpoint switch (no
partial views)
- [ ] Integration test: replica boot before first checkpoint
- [ ] Existing `DisklessClusterTest`, `RemoteFileServiceClientTest`,
`RemoteFileBrinIndexRecoveryTest` still pass (verified locally)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary Addresses production issues from a 1B-vector BigANN ingest (see @/dev/debug-server): Phase B running 20+ min serially, disk filling up with ~800M on-disk vectors across 416 unbounded sealed segments plus 417 resident mmap'd tmp files, then ENOSPC + unbounded retry loops. ### P0 — correctness & unblocking - Track pageIds allocated during Phase B; on failure, roll them back via new `DataStorageManager#deleteIndexPage` (implemented in `FileDataStorageManager` + `MemoryDataStorageManager`, no-op default for BookKeeper/remote). - Exponential backoff (60s → 30min) in `compactionLoop` keyed on `consecutiveCheckpointFailures` — ENOSPC retries no longer burn CPU. ### P1 — Phase B wall-clock - Parallelise Phase B segment builds (`herddb.vectorindex.phaseBSegmentParallelism`, default 2). - New `PageStoreReader` with a bounded LRU page cache replaces the per-segment resident mmap'd tmp file. **No mmap** because mmap regions are not counted against the heap budget. Observed before: 417 `herddb-vector-*.tmp` files (one per segment) silently doubling on-disk footprint; after: tmp dir stays near-empty. ### P2 — bound segment count & ingest rate - `chooseSegmentsToDemote`: when sealed count > threshold (default 32), demote the smallest sealed segments to be merged in the next Phase B. - Absolute cap `herddb.vectorindex.maxLiveVectorsPerCheckpoint` breaks the positive-feedback loop between slow checkpoints and growing live backlog. ### P3 — observability - New metrics: `sealed_segment_count`, `phase_b_bytes_written`, `phase_b_vectors_per_second`, `checkpoint_consecutive_failures`, `checkpoint_total_failures`, `rolled_back_pages_(total|last)`, `tmp_dir_bytes`, `free_disk_bytes`. - Grafana dashboard `indexing-service-dashboard.json` gains two new rows. ## Test plan - [x] `PersistentVectorStoreBackoffTest` (5 tests) — pure backoff policy incl. overflow - [x] `PageStoreReaderTest` (16 tests) — byte-level round-trip, LRU eviction, EOF, wrong chunk type, missing page, shared cache, concurrent readers - [x] `PersistentVectorStoreMergePolicyTest` (7 tests) — demotion below/at/above threshold, smallest-first, batch cap - [x] `PersistentVectorStoreCapComputationTest` (14 tests, +4 new) — absolute cap interaction with memory-derived cap - [x] `PersistentVectorStoreFailureRecoveryTest` (9 tests) — injected DSM failures, page rollback, failure counters, delete-failure tolerance, metrics, backoff observability - [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) — multi-segment Phase B, reopen recovery, partial failure, merge pressure, tmp dir clean - [x] Existing suite regression: `PersistentVectorStoreLifecycleTest`, `PersistentVectorStoreConcurrentCheckpointTest`, `PersistentVectorStoreSearchTest`, `PersistentVectorStoreBLinkCleanupTest`, `PersistentVectorStoreBackpressureLivelockTest`, `PersistentVectorStoreConfigTest` — all pass - [ ] Re-run the BigANN load end-to-end on a scaled-down replica with tight disk quota to verify (a) no ENOSPC, (b) sealed segment count stabilises, (c) no lingering `herddb-vector-*.tmp` files 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rvices zip (#8) ## Summary - Packages the `vector-testings` shaded uber-jar into `herddb-services.zip` (under `vector-testings/`, excluded from `lib/` to avoid ~58MB class duplication), so the Jib-built Docker image ships `VectorBench` at `/opt/herddb/vector-testings/vector-testings-<ver>.jar` plus a new `bin/vector-bench.sh` wrapper. - Adds an opt-in `vectorBench` Helm section: 1-replica `StatefulSet` (sleeps `infinity` for `kubectl exec`) with a 100Gi `volumeClaimTemplates` PVC mounted at `/opt/herddb/vector-datasets`, and `persistentVolumeClaimRetentionPolicy: {whenDeleted: Retain, whenScaled: Retain}` so downloaded datasets (sift1m, gist1m, deep-image-96, …) survive helm upgrades and uninstall/reinstall cycles. - `Config.java` `--dataset-dir` now falls back to `$VECTORBENCH_DATASET_DIR` when not passed explicitly, so the StatefulSet sets the path once via env and users don't have to repeat `--dataset-dir` on every `kubectl exec`. Disabled by default (`vectorBench.enabled: false`). Usage: ```bash helm upgrade --set vectorBench.enabled=true ... kubectl exec -it <release>-vector-bench-0 -- \ /opt/herddb/bin/vector-bench.sh --url "\$HERDDB_JDBC_URL" --dataset sift1m ``` ## Test plan - [x] ``mvn -pl vector-testings,herddb-services -am package -DskipTests`` — zip contains ``vector-testings/vector-testings-<ver>.jar`` (58MB) + ``bin/vector-bench.sh``, and NO ``lib/vector-testings*.jar`` - [x] ``helm lint --set vectorBench.enabled=true`` — clean - [x] ``helm template --set vectorBench.enabled=true`` — StatefulSet renders with correct env, PVC and retention policy - [ ] End-to-end on k3s: install, confirm ``datasets-<release>-vector-bench-0`` PVC is Bound, run ``vector-bench.sh --dataset sift1m`` inside the pod, then ``helm upgrade`` and verify dataset files in ``/opt/herddb/vector-datasets`` persist 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Plumbs together HerdDB server (\`server.storage.mode=remote\`),
RemoteFileServer (\`storage.mode=s3\`), and the IndexingServer
(\`indexing.storage.type=remote\`) so that a single MinIO bucket backs
both table data and vector-index data. Only the commit log, metadata,
and the indexing service's watermark/checkpoint cache stay local.
### Fixes discovered while building the test
- **VectorIndexManager held a stale client after restart.** Capturing
\`RemoteVectorIndexService\` by value at index-open time meant
restarting the IndexingServer (new gRPC client) left ANN search failing
with \"Channel shutdown invoked\". Changed to a
\`Supplier<RemoteVectorIndexService>\` that resolves lazily via
\`DBManager\`.
- **IndexingServer wiped \`RemoteFileDataStorageManager\` state on
restart.** It passed \`dataDir\` as both metadataDir and tmpDir;
\`FileDataStorageManager\` cleans tmpDir on start()/close(), nuking
checkpoint markers. Fixed by splitting into
\`{dataDir}/remote-metadata\` (preserved) and \`{dataDir}/remote-tmp\`
(scratch).
- **\`RemoteFileDataStorageManager.deleteIndexPage\` was missing.** The
Phase-B rollback path from PR #7 was a no-op over S3, leaking objects on
checkpoint failure. Added an override calling \`client.deleteFile\` on
the remote index-page path.
### End-to-end test
\`IndexingServiceWithS3E2ETest\` in \`herddb-services\`:
- \`@ClassRule\` MinIO testcontainer amortised across methods; per-test
S3 prefix cleaned in \`@After\`.
- Full topology per test: RemoteFileServer(s3) + Server(remote) +
IndexingServer(remote), all talking to one MinIO bucket.
- 5 scenarios: search before/after checkpoint, restart-before-checkpoint
(crash + replay), restart-after-checkpoint (300 random vectors, real
FusedPQ checkpoint, stop + restart + verify segments reload from S3),
tmp-dir-stays-local guard rail.
### Known gap (follow-up)
The indexing service still needs \`watermark.dat\` +
\`remote-metadata/\` to survive restart. For truly ephemeral pods (disks
wiped on restart) both need to be reconstructible from S3/ZK/BK.
\`SharedCheckpointMetadataManager\` already publishes the right paths
for the read-replica flow; wiring a bootstrap-from-S3 read path and
moving the watermark off local disk is deferred.
## Test plan
- [x] \`RemoteFileDataStorageManagerBasicTest\` — 3 new cases for delete
+ idempotency
- [x] \`IndexingServiceWithS3E2ETest\` — 5 scenarios, ~45s end-to-end
- [x] Regression: \`StaticDiscoveryRemoteFileIndexingTest\`,
\`ZKDiscoveryRemoteFileIndexingTest\`,
\`PersistentVectorStoreFailureRecoveryTest\`,
\`PersistentVectorStoreLifecycleTest\`,
\`PersistentVectorStoreConcurrentCheckpointTest\`,
\`PersistentVectorStoreSearchTest\` — all green locally
- [ ] CI
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary Adds OIDC/JWT bearer-token authentication across HerdDB: - **JDBC client → HerdDB server** via the new **OAUTHBEARER** SASL mechanism (RFC 7628), running over the existing PDU handshake alongside PLAIN/DIGEST-MD5/Kerberos - **HerdDB server → HerdDB server** (leader/follower, inter-cluster) - **File server (gRPC)** and **Indexing service (gRPC)** — which previously had **no authentication at all** — now both enforce `Authorization: Bearer <jwt>` via a shared gRPC interceptor JWTs are validated against the OIDC provider's JWKS (cached, auto-refreshing) using Nimbus JOSE+JWT. Clients obtain tokens via the OAuth2 **client_credentials** grant (with automatic refresh) or can pass a pre-acquired token directly. Works with Keycloak, Okta, Auth0, etc. Built as **7 incremental commits**, each with passing unit tests: 1. `herddb-auth-oidc` foundation module — `OidcConfiguration` (discovery), `OidcTokenValidator`, `OidcTokenProvider`, `PrincipalExtractor` (20 tests) 2. **OAUTHBEARER** SASL mechanism — `SaslServer`/`SaslClient` impls + JVM provider + `TokenAuthenticator` callback (7 tests) 3. JDBC client + HerdDB server wiring — `ClientConfiguration` / `ServerConfiguration` properties, `HDBClient` shared `OidcTokenProvider`, `Server` validator setup (5 E2E tests) 4. File server gRPC — shared `JwtAuthServerInterceptor`/`JwtAuthClientInterceptor`, `RemoteFileServer`/`RemoteFileServiceClient` integration (3 + 6 tests) 5. Indexing service gRPC — same pattern for `IndexingServer`/`IndexingServiceClient` (3 tests) 6. End-to-end Keycloak test via Testcontainers (`KeycloakOidcE2ETest`) — real Keycloak container, realm + client provisioned via admin REST API; skips gracefully when Docker is unavailable 7. `AUTHENTICATION.md` documenting all HerdDB auth modes + JDBC URL examples + troubleshooting, plus a `herddb-services/src/main/demo/keycloak/` docker-compose for demos **46 new tests**, all existing SASL PLAIN / DIGEST-MD5 tests still pass. ## Configuration at a glance Server: ```properties oidc.enabled=true oidc.issuer.url=https://keycloak.example.com/realms/herddb oidc.audience=herddb-api # optional oidc.username.claim=preferred_username # optional ``` JDBC client: ``` jdbc:herddb:server:host:7000?client.auth.mech=OAUTHBEARER&\ oidc.issuer.url=https://keycloak.example.com/realms/herddb&\ oidc.client.id=herddb-jdbc&oidc.client.secret=<secret> ``` ## Test plan - [x] `mvn -pl herddb-auth-oidc test` — 33 tests green (validator, provider, discovery, OAUTHBEARER SASL roundtrip, gRPC interceptors) - [x] `mvn -pl herddb-core -Dtest=OidcAuthTest test` — 5 E2E tests green (valid creds, wrong secret, wrong issuer, missing config fails fast, pre-acquired token) - [x] `mvn -pl herddb-remote-file-service -Dtest=RemoteFileServerOidcTest test` — 3 tests green - [x] `mvn -pl herddb-indexing-service -Dtest=IndexingServiceOidcTest test` — 3 tests green - [x] `mvn -pl herddb-core -Dtest=JASSPLAINTest,JAASMD5Test test` — existing SASL tests still pass (no regression) - [x] Full clean build of reactor: `mvn clean install -DskipTests` across all touched modules - [ ] Keycloak E2E (`KeycloakOidcE2ETest`) — passes against real Keycloak container in CI; skipped locally without Docker - [ ] `docker compose up` in `herddb-services/src/main/demo/keycloak/` boots a realm with `herddb-jdbc`, `herddb-file`, `herddb-index`, `herddb-server` service-account clients 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Large FusedPQ graph files (hundreds of GB) can now be stored and retrieved as
multipart files (.multipart/ dirs with numbered 4 MB block files) instead of
thousands of 1 MB HerdDB index pages.
- New RPC `WriteFileBlock(WriteFileBlockRequest) returns (WriteFileBlockResponse)`:
stores a single block of a multipart file by path + blockIndex
- New RPC `ReadFileRange(ReadFileRangeRequest) returns (ReadFileRangeResponse)`:
retrieves a byte range from a multipart file (server computes blockIndex from
offset / blockSize, returns the slice from that block)
- Switched from `ServerBuilder` to `NettyServerBuilder` so we can wire up a
`NioEventLoopGroup` directly and call `setIoRatio()` on it
- ioRatio defaults to 70, overridable via system property
`herddb.fileserver.netty.ioRatio`
- blockSize (default 4 MB) read from `block.size` config property;
`maxInboundMessageSize` set to blockSize + 1 MB
- Required adding both `bossEventLoopGroup` + `channelType(NioServerSocketChannel)`
alongside `workerEventLoopGroup` to satisfy Netty's "all or none" contract
- Added `MULTIPART_SUFFIX = ".multipart"` constant
- New abstract methods: `writeBlock`, `readRange`, `deleteLogical`, `listLogical`
- `LocalObjectStorage`: blocks stored as `{path}.multipart/{blockIndex}` files;
`readRange` loads the block file and slices; `deleteLogical` recursively removes
the `.multipart/` directory; `listLogical` deduplicates via LinkedHashSet
- `S3ObjectStorage`: each block is a separate S3 object at key
`{prefix}/{path}.multipart/{blockIndex}`; `readRange` downloads the whole block
object and returns the requested slice; `deleteLogical` removes all block objects
via deleteByPrefix + deletes the single-file key in parallel
- `CachingObjectStorage`: `writeBlock` writes to inner and caches the block;
`readRange` serves from cache (block key = `{path}.multipart/{blockIndex}`);
`deleteLogical` invalidates all matching cache entries then delegates
- New `writeFileBlock` handler: calls `storage.writeBlock()` with metrics
(writeblock_requests, writeblock_errors, writeblock_bytes, writeblock_latency)
- New `readFileRange` handler: calls `storage.readRange()` with metrics
(readrange_requests, readrange_errors, readrange_bytes, readrange_latency)
- `deleteFile` now calls `storage.deleteLogical()` (handles both single files and
multipart dirs)
- `listFiles` now calls `storage.listLogical()` (returns collapsed logical paths)
- Block routing key = `path + "#block" + blockIndex` for consistent hash, so
blocks of the same logical file distribute across file servers
- `writeFileBlock(Async)`: unary RPC routed by block key
- `readFileRange(Async)`: handles cross-block requests by splitting into two
sequential calls (each routed independently)
- `writeMultipartFile(path, InputStream, blockSize)`: reads InputStream in
blockSize chunks, calls `writeFileBlock` sequentially, returns total bytes written
- `listFiles` aggregation uses `LinkedHashSet` to deduplicate logical paths returned
from multiple servers
- `buildChannel` sets `maxInboundMessageSize(5 MB)` so large block responses fit
- Implements `io.github.jbellis.jvector.disk.RandomAccessReader`
- Buffers one block at a time; `ensureBlockLoaded()` fetches via `readFileRange`
only when the current block changes (amortises network round trips for sequential
reads within a block)
- Cross-block `readFully` handled transparently in a loop
- All primitive decode methods (`readInt`, `readFloat`, `readLong`) implemented via
`readFully` to get big-endian byte reads
- Nested `Supplier implements ReaderSupplier` creates independent reader instances
per thread (jvector calls `get()` once per search thread)
- `writeMultipartIndexFile(tableSpace, uuid, fileType, tempFile)`: streams the
temp file to remote servers via `client.writeMultipartFile()`; returns logical path
- `multipartIndexReaderSupplier(tableSpace, uuid, fileType, fileSize)`: returns a
`RemoteRandomAccessReader.Supplier` backed by the client and logical path
- `deleteMultipartIndexFile(tableSpace, uuid, fileType)`: calls `client.deleteFile()`
on the logical path (which routes through `deleteLogical` on the server)
- Config property `remote.file.multipart.block.size` (default 4 MB)
- Added default no-op methods `writeMultipartIndexFile`, `multipartIndexReaderSupplier`,
`deleteMultipartIndexFile` so non-remote implementations are unaffected
- Added `graphFilePath`, `graphFileSize`, `mapFilePath`, `mapFileSize` fields for
multipart mode (populated by `PersistentVectorStore` after a multipart write)
- `SegmentWriteResult` gains a multipart constructor (paths + sizes instead of page
ID lists) and `isMultipart()` predicate
- `writeFusedPQGraphToTempFile` / `writeFusedPQMapDataToTempFile`: build the jvector
graph / BLink map to a local temp file and return the Path
- `writeOneSegmentData`: tries multipart first (if DSM supports it), falls back to
page-based chunks
- Metadata format v3 (`METADATA_VERSION_MULTI_SEGMENT`): each segment now serialized
with a `useMultipart` byte flag; multipart segments store `graphFilePath`,
`graphFileSize`, `mapFilePath`, `mapFileSize` via `DataOutputStream.writeUTF()`;
page-based segments still store the page ID arrays
- `loadMultiSegmentFormat` reads the new format via `DataInputStream`
- `loadFusedPQSegment` uses `multipartIndexReaderSupplier` when `seg.graphFilePath`
is set, otherwise falls back to the `PageStoreReader` path
- `readMultipartMapDataToTempFile`: downloads multipart map data via
`RemoteRandomAccessReader` to a local temp file for BLink loading
- `LocalObjectStorageTest`: writeBlock/readRange/deleteLogical/listLogical (4 new)
- `RemoteFileServiceTest`: WriteFileBlock+ReadFileRange gRPC round trip, writeMultipartFile
end-to-end, multipart list+delete (3 new)
- `RemoteRandomAccessReaderTest` (new file): seek+readFully, cross-block read,
readInt/readFloat/readLong, Supplier creates independent readers, length() (7 tests)
- `CachingObjectStorageTest.FakeObjectStorage` updated to implement new abstract methods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… configured
VectorIndexManager.doStart() now calls remoteService() eagerly so that
CREATE VECTOR INDEX fails immediately with a clear error message
("RemoteVectorIndexService is required") instead of succeeding silently
and only crashing later when the first operation is attempted.
Fixes VectorIndexTest.testFailsWithoutRemoteService.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PromotableRemoteFileDataStorageManager (#11) ReadReplicaDataStorageManager now implements multipartIndexReaderSupplier so that shared-storage read replicas can serve vector search queries on indexes stored as multipart files (written by the leader via RemoteFileDataStorageManager). - ReadReplicaDataStorageManager.multipartIndexReaderSupplier: constructs a RemoteRandomAccessReader.Supplier using client.getBlockSize() and the same {tableSpace}/{uuid}/multipart/{fileType} path convention as the writer - PromotableRemoteFileDataStorageManager: added overrides for all three multipart methods (writeMultipartIndexFile, multipartIndexReaderSupplier, deleteMultipartIndexFile) that delegate to activeDelegate, so that the correct implementation is used regardless of whether the node is currently a replica or a promoted leader - RemoteRandomAccessReaderTest: added testReadReplicaMultipartIndexReaderSupplier which creates a ReadReplicaDataStorageManager with a small block size (64 bytes), writes a 3-block multipart file, and verifies: full sequential read, correct total length, and two independent readers at different seek positions Fixes a bug where read replicas would use the no-op base-class multipartIndexReaderSupplier (returning null) and therefore fail to open vector index segments stored as multipart files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#12) ## Summary - **VectorIndexEndToEndTest**: Fixed tablespace UUID mismatch between Server and EmbeddedIndexingService that caused `waitForCatchUp()` to loop forever — the indexing service was tailing a non-existent txlog directory - **NettyChannelAcceptor.close()**: Fixed JVM hang after tests complete by using `shutdownNow()` + `awaitTermination()` for the callback executor and bounded graceful shutdown for Netty event loops - **IndexingServiceClient.waitForInstanceCatchUp()**: Added 5-minute timeout to prevent infinite polling loops ## Test plan - [x] herddb-jdbc: 78 tests pass locally - [x] herddb-net: 7 tests pass locally - [x] herddb-services: 25 tests pass locally (including VectorIndexEndToEndTest which previously hung) - [ ] Monitor CI workflow `pr-validation-test-other.yml` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Makes the indexing service's local disk fully reconstructible from S3
when running with `storage.type=remote`, so a Kubernetes pod can restart
with an empty volume and resume from the last published checkpoint.
## What this fixes
Two pieces of local state were previously unrecoverable after a disk
wipe:
1. **`watermark.dat`** — last-processed commit-log LSN, written only to
the local disk.
2. **`remote-metadata/`** — IndexStatus checkpoint markers that
`PersistentVectorStore` reads on startup to find its sealed S3 segments.
If wiped, the store can't discover what it had already checkpointed to
S3.
There was also a latent bug: `IndexingServiceEngine` only configured the
`PersistentVectorStore` factory for `storage.type=file`, silently
falling back to `InMemoryVectorStore` for `storage.type=remote`.
## Changes
- **`WatermarkStore`** is now an interface; `LocalWatermarkStore` keeps
the previous disk impl, and new **`S3WatermarkStore`** persists the
watermark to `{tableSpace}/_indexing/{instanceId}/watermark.lsn` with an
XXHash64 footer. Monotonicity is enforced on save.
- **Save contract**: the watermark is written **only** after a
successful `indexCheckpoint` — never on a timer, entry count, or
shutdown. This keeps the stored LSN always covered by a durable
IndexStatus on S3 and bounds S3 writes to ≤ 1 per checkpoint.
`IndexingServiceEngine.checkpointAndSaveWatermark()` triggers a
checkpoint every 5000 entries and only advances the watermark after all
stores succeed.
- **`SharedCheckpointMetadataManager.hydrateLocalMetadataDir`**
downloads every `_metadata/*.{index,table}status` object from S3 into a
staging dir and atomically moves it into place. The byte-for-byte
on-disk format matches `FileDataStorageManager`'s, so no deserialization
is needed.
- **`IndexingServer`** now wires a `SharedCheckpointMetadataManager`
onto its `RemoteFileDataStorageManager` (so every `indexCheckpoint`
publishes its IndexStatus to S3), and runs hydrate + installs the
`S3WatermarkStore` via a new `afterTableSpaceResolved` hook on the
engine.
- **Commit-log tailer** always starts from `START_OF_TIME` so DDL
entries are reprocessed on every boot and the in-memory `SchemaTracker`
is rebuilt. DML is replayed too but is idempotent (`addVector` replaces
by PK); previously-sealed segments load directly from S3 via
`PageStoreReader`, so replay only has to re-add uncheckpointed vectors.
A future PR can skip DML with LSN ≤ watermark while still reprocessing
DDL.
- **`IndexingServiceEngine`** now configures the `PersistentVectorStore`
factory for both `storage.type=file` AND `storage.type=remote`.
## Tests
- `S3WatermarkStoreTest` (new) — 10 unit tests with fault injection:
absent-returns-START_OF_TIME, roundtrip, path format, monotonicity,
equal-LSN idempotency, corruption detection, truncation detection,
overwrite-after-corruption, read-failure, write-failure.
- `IndexingServiceWithS3E2ETest` — updated to wipe the entire indexing
data dir (watermark.dat AND remote-metadata/) between stop and restart,
exercising the hydrate + watermark-from-S3 path. Also asserts
`watermark.lsn` exists on S3, not locally.
- All 141 indexing-service tests and 85 remote-file-service tests still
pass.
## Known follow-up
- DML replay from START_OF_TIME on every restart is correct but
wasteful; a future PR can persist a lightweight schema snapshot
alongside the watermark and skip already-applied DML.
- Full cluster E2E test (ZK + BK + 2 servers in cluster mode + indexing
+ restart everything except ZK/BK) was scoped out of this PR to keep it
reviewable.
## Test plan
- [x] `mvn -pl herddb-indexing-service test -Dtest=S3WatermarkStoreTest`
- [x] `mvn -pl herddb-services test -Dtest=IndexingServiceWithS3E2ETest`
- [x] `mvn -pl herddb-indexing-service test` (all 141 pass)
- [x] `mvn -pl herddb-remote-file-service test` (all 85 pass)
- [x] `mvn -pl herddb-services test
-Dtest='StaticDiscoveryRemoteFileIndexingTest,ZKDiscoveryRemoteFileIndexingTest'`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - **CheckpointExecuteTest.checkpointRequiresTablespaceParam**: the error message changed to "CHECKPOINT requires 1 or 2 parameters" when the optional timeout parameter was added, but the test still expected the old "requires one parameter" message. - **SysindexstatusTest.testSysindexstatusVector**: since commit 82efba7, embedded vector indexing is no longer supported — `VectorIndexManager.doStart()` now fails fast if `RemoteVectorIndexService` is not configured. The test now provides a `MockRemoteVectorIndexService` and asserts on the remote `IndexStatusInfo` fields (`vectorCount`, `segmentCount`, `status`) instead of the old embedded properties. The other two CI failures (SimpleFollowerTest, MultipleConcurrentUpdatesTest) pass consistently locally and appear to be CI-environment flakiness rather than deterministic bugs. ## Test plan - [x] `CheckpointExecuteTest` — all methods pass - [x] `SysindexstatusTest` — all methods pass (including vector test) - [x] `MultipleConcurrentUpdatesTest` — passes 3/3 runs locally - [x] `SimpleFollowerTest` — passes locally 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - **BookkeeperCommitLog bug fix**: In `BKFollowerContext.ensureOpenReader()`, the "find next ledger" loop at line 919 was assigning the found ledger ID to `ledgerToTail` instead of `nextLedger`. This meant `nextLedger` always remained `-1`, causing the follower to null out its `currentLedger` and take the "no more ledgers" path every time it needed to switch between closed/fully-read ledgers. This bug could cause the follower to reopen the same ledger repeatedly with incorrect `nextEntryToRead` offsets or fail to advance to new ledgers promptly. - **SimpleFollowerTest stabilization**: The test's auto-transaction uses `sync=false` (deferred) BK writes, meaning entries may not be visible to the follower immediately. Added a sync write (`NO_TRANSACTION` INSERT) after the transaction to guarantee BK visibility before the wait begins. Also added a fail-fast `isFailed()` check inside the wait condition and a descriptive timeout message for better CI diagnostics. ## Test plan - [x] `SimpleFollowerTest` — all 5 tests pass - [x] Verified the ledger switching fix is correct by code review - [ ] CI should confirm the SimpleFollowerTest no longer times out 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Fix `estimatedSizeBytes` in the page-based write path to include both graph and map pages (was only counting graph pages), which caused segments to not seal properly and underestimated `avgBytesPerVector` in subsequent checkpoints - Add on-disk index size metric (`ondisk_estimated_size_bytes`) and per-segment size distribution metrics (`segment_size_min/median/max_bytes`) for Grafana observability - Rename misleading `estimated_size_bytes` metric to `live_vectors_estimated_memory_bytes` (it reports in-memory usage, not disk size) - Add two new Grafana panels: "On-Disk Index Size" and "Segment Size Distribution" (min/p50/max) ## Test plan - [x] `PersistentVectorStoreParallelPhaseBTest` (6 tests) — passes - [x] `PersistentVectorStoreSearchTest` (12 tests) — passes - [ ] Verify new metrics appear in Prometheus after deploying the indexing service - [ ] Verify Grafana dashboard renders the new panels correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
|
I have sent the PR to the wrong repo :-) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR is to validate CI before merging the Vector Search branch into master