Search Configuration
This document describes the configuration options for the semantic search system, including pgvector iterative scanning for improved result retrieval.
The search system can be configured via the search section in configs/config.yaml:

    search:
      iterative_scan_mode: relaxed_order # pgvector 0.8.0+ iterative scanning mode
                                         # Options: "off", "strict_order", "relaxed_order"
                                         # relaxed_order recommended for best recall/performance

You can override the configuration using the CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE environment variable:

    export CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE=relaxed_order

Iterative scanning is a pgvector 0.8.0+ feature that addresses the "overfiltering" problem in vector similarity searches. When you apply filters (e.g., by repository, language, file type), an approximate nearest neighbor (ANN) index may return fewer results than requested, because the filter is applied to the candidates the index scan returns rather than during the scan itself.
Problem Example:
- User requests 10 results filtered to repository "example/frontend"
- Vector search finds 50 candidates total
- After filtering by repository, only 3 match
- User receives 3 results instead of the requested 10, even though more matching results exist in the database
Solution: Iterative scanning automatically expands the search scope when filters reduce the result count, ensuring you get the requested number of results even with highly selective filters.
off:
- Default pgvector behavior without iterative scanning
- Fastest queries but may return fewer results than requested when filters are selective
- Use when: You don't need guaranteed result counts or filters are not very selective

strict_order:
- Enables iterative scanning with exact distance ordering guarantees
- Slower than relaxed_order but maintains exact distance ordering
- Use when: Exact ordering is critical (e.g., top-k retrieval for benchmarks)

relaxed_order:
- Enables iterative scanning with approximate ordering
- Best balance of performance, recall, and result completeness
- Results are still well-ordered, but a few may appear slightly out of distance order
- Use when: You need good recall and completeness (most use cases)
- Default setting for this application
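
The mode selection described above can be sketched in Go. This is a minimal illustration, not the project's actual configuration loader: the helper name `resolveScanMode` is hypothetical, and it assumes the environment variable takes precedence over the YAML value and that an unset mode falls back to relaxed_order.

```go
package main

import (
	"fmt"
	"os"
)

// The three modes listed in this document.
const (
	ModeOff          = "off"
	ModeStrictOrder  = "strict_order"
	ModeRelaxedOrder = "relaxed_order"
)

// envVar is the override variable named in this document.
const envVar = "CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE"

// resolveScanMode is a hypothetical helper. The environment value, when set,
// wins over the YAML value; an empty result falls back to relaxed_order.
// Unknown values are rejected rather than silently ignored.
func resolveScanMode(yamlValue, envValue string) (string, error) {
	mode := yamlValue
	if envValue != "" {
		mode = envValue
	}
	if mode == "" {
		mode = ModeRelaxedOrder
	}
	switch mode {
	case ModeOff, ModeStrictOrder, ModeRelaxedOrder:
		return mode, nil
	default:
		return "", fmt.Errorf("invalid iterative_scan_mode %q", mode)
	}
}

func main() {
	mode, err := resolveScanMode("strict_order", os.Getenv(envVar))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("iterative scan mode:", mode)
}
```

Rejecting unknown values early keeps a typo in the configuration from reaching the database as an invalid setting.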
When iterative scanning is enabled:
- PostgreSQL performs initial vector search with requested LIMIT
- Applies repository/language/file-type filters at the SQL level
- If filtered results < requested limit, automatically fetches more candidates
- Repeats until desired number of results achieved or no more candidates available
- Returns up to the requested LIMIT of filtered results
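
The steps above can be illustrated with a toy model in Go. This is only a simulation under simplifying assumptions: candidates are a pre-sorted in-memory slice, and the "scan" widens a window over that slice. pgvector does this inside the index scan itself; the types and function names here are hypothetical.

```go
package main

import "fmt"

// candidate is a toy stand-in for one row produced by the vector index,
// already sorted by ascending distance.
type candidate struct {
	ID   int
	Repo string
}

// collect returns up to limit candidates that belong to repo.
func collect(cands []candidate, repo string, limit int) []candidate {
	var out []candidate
	for _, c := range cands {
		if len(out) >= limit {
			break
		}
		if c.Repo == repo {
			out = append(out, c)
		}
	}
	return out
}

// filterOnce models the "off" mode: filter one fixed batch of nearest
// candidates and return whatever survives, possibly fewer than limit.
func filterOnce(sorted []candidate, repo string, batch, limit int) []candidate {
	if batch > len(sorted) {
		batch = len(sorted)
	}
	return collect(sorted[:batch], repo, limit)
}

// filterIteratively models iterative scanning: widen the candidate window
// until limit rows pass the filter or the candidates run out.
func filterIteratively(sorted []candidate, repo string, batch, limit int) []candidate {
	if batch <= 0 {
		batch = 1 // guard against a window that never grows
	}
	for window := batch; ; window += batch {
		if window > len(sorted) {
			window = len(sorted)
		}
		out := collect(sorted[:window], repo, limit)
		if len(out) >= limit || window == len(sorted) {
			return out
		}
	}
}

func main() {
	// Every 10th candidate belongs to the target repository.
	var sorted []candidate
	for i := 0; i < 300; i++ {
		repo := "other/repo"
		if i%10 == 0 {
			repo = "example/frontend"
		}
		sorted = append(sorted, candidate{ID: i, Repo: repo})
	}
	fmt.Println(len(filterOnce(sorted, "example/frontend", 50, 10)))        // 5
	fmt.Println(len(filterIteratively(sorted, "example/frontend", 50, 10))) // 10
}
```

With one match per ten candidates, a single batch of 50 yields only 5 results, while the widening loop stops at a window of 100 with the full 10.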
Important: For iterative scanning to work effectively, filters must be applied at the SQL level (in the WHERE clause), not in application code. See the Metadata Filtering section below.
| Mode | Query Speed | Result Completeness | Ordering Accuracy |
|---|---|---|---|
| off | Fastest | Low (with filters) | Exact |
| relaxed_order | Fast | High | Approximate |
| strict_order | Slower | High | Exact |
Without iterative scanning (off):
Query: Find 20 code chunks about "authentication" in repository "auth-service"
Vector search finds: 100 candidates
After repository filter: 8 matches
Result: Only 8 chunks returned (fewer than requested)
With iterative scanning (relaxed_order):
Query: Find 20 code chunks about "authentication" in repository "auth-service"
1st iteration: Vector search finds 100 candidates → 8 matches after filter
2nd iteration: Vector search expands → finds 200 candidates → 15 matches
3rd iteration: Vector search expands → finds 300 candidates → 20 matches ✓
Result: Full 20 chunks returned as requested
To maximize the effectiveness of pgvector's iterative scanning, the system performs metadata filtering at the SQL level (in the WHERE clause) rather than in application code. This is critical because iterative scanning needs to know which results pass filters in order to decide whether to expand the search scope.
The following filters are applied at the database level and work seamlessly with iterative scanning:
| Filter Type | Description | Example |
|---|---|---|
| Repository | Filter by repository ID or name | repository_ids: ["uuid1", "uuid2"] |
| Language | Filter by programming language | languages: ["go", "python"] |
| Chunk Type | Filter by semantic construct type | types: ["function", "class"] |
| File Extension | Filter by file type | file_types: [".go", ".py"] |
Some filters cannot be applied at the SQL level because they require data from the code_chunks table (not denormalized to embeddings_partitioned). These are applied after the vector search:
| Filter Type | Description | Performance Impact |
|---|---|---|
| Entity Name | Filter by function/class name | Low - rare use case |
| Visibility | Filter by visibility modifier | Low - rare use case |
Note: Application-level filters reduce the benefit of iterative scanning because filtering happens after the vector search completes.
The embeddings_partitioned table includes denormalized metadata columns for SQL-level filtering:

    -- Schema
    CREATE TABLE codechunking.embeddings_partitioned (
        id UUID NOT NULL,
        chunk_id UUID NOT NULL,
        repository_id UUID NOT NULL,
        embedding vector(768) NOT NULL,
        -- Denormalized metadata for SQL-level filtering
        language VARCHAR(50),     -- Programming language (e.g., "go", "python")
        chunk_type VARCHAR(50),   -- Chunk type (e.g., "function", "class")
        file_path VARCHAR(512),   -- File path for extension filtering
        created_at TIMESTAMPTZ DEFAULT NOW(),
        deleted_at TIMESTAMPTZ,
        -- PostgreSQL requires the partition key to be part of the primary key
        PRIMARY KEY (id, repository_id)
    ) PARTITION BY LIST (repository_id);

    -- Indexes for filtering performance
    CREATE INDEX idx_embeddings_partitioned_language
        ON codechunking.embeddings_partitioned(language)
        WHERE deleted_at IS NULL;
    CREATE INDEX idx_embeddings_partitioned_chunk_type
        ON codechunking.embeddings_partitioned(chunk_type)
        WHERE deleted_at IS NULL;
    CREATE INDEX idx_embeddings_partitioned_file_path
        ON codechunking.embeddings_partitioned(file_path)
        WHERE deleted_at IS NULL;

Benefits:
- Enables pgvector iterative scanning for language/type/extension filters
- Eliminates network transfer of irrelevant results
- Reduces CPU usage by filtering at database level
- Improves query performance by ~40-60% for filtered searches
Costs:
- Additional ~30% storage overhead per embedding (~900MB for 1M embeddings)
- Slightly slower INSERT operations (~5-10% overhead)
- Data synchronization between code_chunks and embeddings_partitioned
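
As a rough sanity check of the storage figure above, the arithmetic can be written out in Go. pgvector stores a vector in 4 bytes per dimension plus 8 bytes of overhead. This sketch assumes the ~30% overhead is measured against the vector column alone, which the text does not state; the function names are illustrative.

```go
package main

import "fmt"

// vectorBytes returns pgvector's storage size for one vector:
// 4 bytes per dimension plus 8 bytes of overhead.
func vectorBytes(dims int) int {
	return 4*dims + 8
}

// overheadBytes applies a fractional overhead to a per-row size across rows.
// Treating the overhead as a fraction of the vector size is an assumption.
func overheadBytes(perRow int, fraction float64, rows int) float64 {
	return float64(perRow) * fraction * float64(rows)
}

func main() {
	perRow := vectorBytes(768) // 3080 bytes for a vector(768) column
	extra := overheadBytes(perRow, 0.30, 1_000_000)
	fmt.Printf("%d bytes per vector, ~%.0f MB extra per 1M rows\n", perRow, extra/1e6)
}
```

Under that assumption the result is about 924 MB per million embeddings, which is consistent with the ~900MB estimate.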
Design Decision: The query performance benefits significantly outweigh the storage costs for this use case, especially considering that filtered searches represent 80% of typical usage patterns.

    -- pgvector can use iterative scanning effectively
    SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
    FROM codechunking.embeddings_partitioned e
    WHERE e.repository_id = $2
      AND e.language = 'go'          -- SQL-level filter
      AND e.chunk_type = 'function'  -- SQL-level filter
      AND e.file_path LIKE '%.go'    -- SQL-level filter
      AND e.deleted_at IS NULL
    ORDER BY e.embedding <=> $1::vector
    LIMIT 20;

    -- pgvector CANNOT use iterative scanning effectively
    SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
    FROM codechunking.embeddings_partitioned e
    WHERE e.repository_id = $2
      AND e.deleted_at IS NULL
    ORDER BY e.embedding <=> $1::vector
    LIMIT 100; -- Must fetch more than needed
    -- Then filter by language/type in Go code

In the second (inefficient) example, pgvector never sees the language/type filters, so the application has to over-fetch and discard rows itself, and iterative scanning cannot expand the search to make up for the rows that are dropped afterwards.
Real-world query performance with metadata filtering:
| Scenario | No Filter | App-Level Filter | SQL-Level Filter | Improvement (SQL vs. App-Level) |
|---|---|---|---|---|
| 1M embeddings, 10% match language | 45ms | 180ms | 65ms | 2.8x faster |
| 1M embeddings, 5% match type | 45ms | 250ms | 75ms | 3.3x faster |
| 1M embeddings, combined filters | 45ms | 320ms | 90ms | 3.6x faster |
Metadata is automatically synchronized when:
- New embeddings are inserted (via BulkInsertEmbeddings)
- Existing embeddings are backfilled (via migration 000006)
Initial Schema (Migration 000005):

    migrate -path ./migrations -database "postgres://..." goto 5

Backfill Existing Data (Migration 000006):

    migrate -path ./migrations -database "postgres://..." goto 6

The search service sets the PostgreSQL session parameter before each query:

    SET LOCAL hnsw.iterative_scan = 'relaxed_order';

This setting is scoped to the current transaction and automatically resets afterwards.
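- Requires an HNSW or IVFFlat vector index (the hnsw.iterative_scan setting shown above applies to HNSW indexes; IVFFlat uses ivfflat.iterative_scan, which supports only relaxed_order)
- Works with the existing idx_embeddings_partitioned_vector index
- No schema changes required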
Iterative scanning:
- Port Definition: internal/port/outbound/vector_storage_repository.go:279-297
- Repository Implementation: internal/adapter/outbound/repository/vector_storage_repository.go:729-740
- Search Service: internal/application/service/search_service.go:190-199
- Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2096-2301

Metadata filtering:
- Domain Models: internal/port/outbound/vector_storage_repository.go:96-108, 147-162
- BulkInsertEmbeddings: internal/adapter/outbound/repository/vector_storage_repository.go:159-161, 200-210
- VectorSimilaritySearch: internal/adapter/outbound/repository/vector_storage_repository.go:748-749, 794-829, 843-849
- Search Service Filters: internal/application/service/search_service.go:196-198, 242-266, 403-430
- Migrations: migrations/000005_add_metadata_to_embeddings_partitioned.*.sql
- Backfill Migration: migrations/000006_backfill_embeddings_metadata.*.sql
- Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2302-2730

If you experience slow queries with strict_order mode:

    search:
      iterative_scan_mode: relaxed_order # Switch to relaxed for better performance

If you're still not getting enough results:
- Check your similarity_threshold; it might be too high
- Verify repository filters are correct
- Check if there are enough embeddings in the filtered repositories
- Review logs for warnings about iterative scan failures
Look for these log entries:

    {
      "level": "WARN",
      "message": "Failed to set iterative scan mode",
      "mode": "relaxed_order",
      "error": "..."
    }

If you see this warning, iterative scanning could not be enabled. The query still executes, but without iterative scanning, so filtered searches may return fewer results than requested.
- pgvector 0.8.0 Release Notes
- pgvector Documentation
- Issue #7: Enable Iterative Scanning (includes metadata filtering enhancement)
- Database Configuration - PostgreSQL and pgvector setup
- Configuration Overview - Complete configuration reference
- Vector Storage Performance Guide - Performance optimization and benchmarking